Cleaning Data with Pandas

A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.

The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.

Import Libraries:

Import pandas, NumPy, and Python’s regular expression operations library (re).

import pandas as pd
import numpy as np
import re

Import Files:

Use read_csv to load the files you wish to clean.

twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')

Create Copies:

Create copies of the original files using copy before cleaning just in case you need to restore some of the original contents.

twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()
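As a small illustration of why the copy matters, here is a sketch with a made-up two-row frame (the data and the names `original` and `working` are hypothetical, not from the assignment). Edits to the copy leave the source frame untouched.

```python
import pandas as pd

# Hypothetical stand-in for one of the loaded files.
original = pd.DataFrame({"tweet_id": [1, 2], "text": ["first", "second"]})

# copy() returns an independent frame (deep copy by default).
working = original.copy()
working.loc[0, "text"] = "changed"

print(original.loc[0, "text"])  # still the original value
print(working.loc[0, "text"])   # the edited value
```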

Merge Data:

Combine specific files using the merge function.

In this example, the main data is in the Twitter archive file. I perform a left merge to keep every row of this file and add the image prediction and tweet count data wherever the tweet IDs match.

df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left')
df2 = pd.merge(df1, twt_counts_clean, how='left')

# Working copy of the merged data used in the steps below
df2_clean = df2.copy()
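To make the behavior of a left merge concrete, here is a minimal sketch on tiny made-up frames. Only the `tweet_id` key comes from the post; the other column names and all values are assumptions for illustration. Naming the key with `on` and adding `validate` are optional safeguards, not something the assignment required.

```python
import pandas as pd

# Hypothetical stand-ins for the three files.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
predictions = pd.DataFrame({"tweet_id": [1, 3], "p1": ["pug", "corgi"]})
counts = pd.DataFrame({"tweet_id": [1, 2], "favorite_count": [10, 20]})

# A left merge keeps every archive row; rows without a match get NaN
# in the columns that come from the right-hand frame.
# validate="one_to_one" raises an error if tweet_id repeats on either side.
merged = archive.merge(predictions, on="tweet_id", how="left", validate="one_to_one")
merged = merged.merge(counts, on="tweet_id", how="left", validate="one_to_one")

print(merged)
```

Checking that the row count still equals the archive's row count after each merge is a quick way to catch duplicated keys.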

Drop Columns:

Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.

The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.

First, I keep only the tweets where ‘retweeted_status_id’ is null, using the isnull function. Then, I drop columns related to replies and retweets.

df2_clean = df2_clean[df2_clean['retweeted_status_id'].isnull()]

df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 
                          'retweeted_status_timestamp'], axis='columns')
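The same filter-then-drop pattern can be sketched on a made-up frame. The values below are invented, and `drop(columns=...)` is used here as an equivalent spelling of `axis='columns'`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row is a retweet.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan],
    "retweeted_status_timestamp": [None, "2017-01-01", None],
})

# Keep rows where the retweet ID is missing, i.e. original tweets.
originals = tweets[tweets["retweeted_status_id"].isnull()]

# The retweet columns are now entirely empty, so drop them.
originals = originals.drop(columns=["retweeted_status_id", "retweeted_status_timestamp"])

print(originals)
```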

Change Data Types:

Use astype and pass the preferred data type as the argument.

The tweet IDs were loaded as integers, so I convert them to objects.

df2_clean.tweet_id = df2_clean.tweet_id.astype(object)

Use to_datetime to convert a column to datetime by entering the selected column as the argument.

Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.

df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])

df2_clean = df2_clean.drop('timestamp', axis='columns')
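Here is a short sketch of both conversions on invented rows. It casts the IDs with `str` rather than `object` so that each ID becomes an actual string; that is a substitution on my part, not what the assignment code does. The ID and timestamp values are made up.

```python
import pandas as pd

# Hypothetical rows; real tweet IDs are much longer integers.
tweets = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# IDs are labels, not quantities, so store them as text.
tweets["tweet_id"] = tweets["tweet_id"].astype(str)

# Parse the text timestamps into a datetime column, then drop the old one.
tweets["time"] = pd.to_datetime(tweets["timestamp"])
tweets = tweets.drop(columns="timestamp")

print(tweets.dtypes)
```

Once the column is a datetime, accessors such as `tweets["time"].dt.year` become available.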

Replace Text:

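Use the str.replace method and list the old text to replace followed by the new text. (Calling replace directly on a column only swaps entries that match the old value in full, so it would miss text inside a longer tweet.)

Text entries for this data set had the HTML entity for an ampersand (&amp;) instead of the symbol itself.

df2_clean['text'] = df2_clean['text'].str.replace('&amp;', '&', regex=False)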
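The difference between whole-value and substring replacement is easy to miss, so here is a minimal sketch with two made-up strings. `Series.replace` matches entire entries, while `Series.str.replace` works inside each string.

```python
import pandas as pd

# Hypothetical tweet text containing the HTML entity for an ampersand.
text = pd.Series(["Meet Bo &amp; Ty", "&amp;"])

# Series.replace only swaps entries that equal the old value exactly.
whole_value = text.replace("&amp;", "&")

# Series.str.replace swaps the substring wherever it appears.
substring = text.str.replace("&amp;", "&", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```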

Combine and Map Columns:

First, create a new column: select the applicable columns to combine, choose a separator for the combined contents, and join each row's values as strings.

Next, use unique to verify all the possible combinations to re-map from the result.

Then, use map to replace row entries with preferred values.

In this case, I had four columns called ‘doggo’, ‘floofer’, ‘pupper’ and ‘puppo’ that record whether or not a tweet contains these words. I combine them into a single ‘dog_type’ column. Then, I map the combined entries to shorter values.

df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(
    lambda x: ','.join(x.dropna().astype(str)), axis=1)


df2_clean['dog_type'] ={'None,None,None,None': np.nan, 
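The three steps (join, inspect with unique, then map) can be sketched end to end on a made-up frame. Only the `'None,None,None,None'` key appears in the post; the other mapping entries, the sample rows, and the names `stage_cols` and `stage_map` are hypothetical.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical rows; missing stages are stored as the text 'None'.
dogs = pd.DataFrame({
    "doggo":   ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

# Step 1: join the four columns row by row into one comma-separated string.
dogs["dog_type"] = dogs[stage_cols].apply(
    lambda row: ",".join(row.astype(str)), axis=1)

# Step 2: list the combinations that actually occur.
print(dogs["dog_type"].unique())

# Step 3: map each combination to a shorter label (hypothetical mapping).
stage_map = {
    "None,None,None,None": np.nan,
    "doggo,None,None,None": "doggo",
    "None,None,pupper,None": "pupper",
}
dogs["dog_type"] = dogs["dog_type"].map(stage_map)

print(dogs["dog_type"])
```

Note that map returns NaN for any combination missing from the dictionary, which is why checking unique first is worthwhile.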

Remove HTML Tags:

Write a function to remove HTML tags using re. Compile the tag pattern by specifying ‘<.*?>’, and use sub to replace each matched tag with an empty string.

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)
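To see what the pattern does, here is a standalone sketch that compiles the expression once at module level and applies it to a made-up ‘source’ value. The sample string, the URL, and the name `strip_tags` are illustrative assumptions.

```python
import re

# '<.*?>' matches a '<', then as few characters as possible, then a '>'.
# The '?' makes the match non-greedy so each tag is matched separately.
TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags(text):
    """Return text with anything that looks like an HTML tag removed."""
    return TAG_PATTERN.sub("", text)

# Hypothetical value resembling an anchor tag around a client name.
source = '<a href="http://example.com/app" rel="nofollow">Twitter for iPhone</a>'

print(strip_tags(source))  # Twitter for iPhone
```

A regular expression like this is fine for simple, well-formed tags; for arbitrary HTML, a real parser is the safer choice.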


Reducing Plastic Use

Various pieces of plastic trash debris are strewn alongside seaweed and rocks on a beach.
Assorted plastic trash on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017.

In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report on 2014 municipal waste stream data for the United States. According to the report, plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.

A graph compares different types of plastic products and their fate in the municipal waste stream.
Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA.

The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.

Here are some ideas for cutting back on plastic use:

  • Bring reusable shopping bags to every store.
    • Utilize cloth bags for all purchases.
    • Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
  • Ditch party plasticware.
    • Buy an assortment of silverware from a thrift store for party use.
    • Snag a set of used glassware for drinks instead of buying single-use plastic cups.
  • Use Bee’s Wrap instead of plastic wrap.
    • Bee’s Wrap is beeswax-covered cloth for food storage. It works much like plastic wrap, but it can be used over and over.
  • Choose glassware instead of plastic zip-locked bags for storing food.
    • Glass containers like Pyrex can be used in place of single-use plastic storage bags.
  • Say ‘no’ to plastic straws.
    • Get in the habit of refusing a straw at restaurants when you go out.
    • Bring a reusable straw made out of bamboo, stainless steel, or glass to your favorite drink spot.


To check out the code for the figure I created, here’s the repository for it.