A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.
The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.
Import Libraries:
Import pandas, NumPy, and Python’s regular expression operations library (re).
import pandas as pd
import numpy as np
import re
Import Files:
Use read_csv to load the files you wish to clean.
twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')
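To see read_csv in action without the actual files, here's a minimal sketch using a hypothetical in-memory CSV (the column names are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file like twitter_archive.csv
csv_text = "tweet_id,text\n1,hello\n2,world\n"
twt_arc = pd.read_csv(io.StringIO(csv_text))

# The result is a DataFrame with one row per CSV record
print(twt_arc.shape)
```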
Create Copies:
Create copies of the original files using copy before cleaning just in case you need to restore some of the original contents.
twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()
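The point of copy is that edits to the clean version leave the original untouched. A quick sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical original data
twt_arc = pd.DataFrame({'tweet_id': [1, 2]})

# copy() returns an independent DataFrame
twt_arc_clean = twt_arc.copy()

# Modifying the copy does not change the original
twt_arc_clean['tweet_id'] = [9, 9]
```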
Merge Data:
Combine specific files using the merge function.
In this example, the main data is in the Twitter archive file. I perform a left merge to preserve every row of that file and attach the image prediction and tweet count files wherever the tweet IDs align.
df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left', on='tweet_id')
df2_clean = pd.merge(df1, twt_counts_clean, how='left', on='tweet_id')
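To illustrate how a left merge behaves, here's a minimal sketch with two hypothetical miniature frames; rows in the left frame with no match on the right are kept, with missing values filled in as NaN:

```python
import pandas as pd

# Hypothetical miniature frames standing in for the real files
left = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
right = pd.DataFrame({'tweet_id': [1, 3], 'favorites': [10, 30]})

# A left merge keeps every row of `left`; unmatched rows get NaN
merged = pd.merge(left, right, how='left', on='tweet_id')
```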
Drop Columns:
Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.
The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.
First, I keep only the rows where ‘retweeted_status_id’ is null using the isnull function, so the data contains only original tweets. Then, I drop the columns related to retweets and replies.
df2_clean = df2_clean[df2_clean['retweeted_status_id'].isnull()]
df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id',
                            'retweeted_status_id', 'retweeted_status_user_id',
                            'retweeted_status_timestamp'], axis='columns')
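The same filter-then-drop pattern can be sketched on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NaN means the row is an original tweet, not a retweet
df = pd.DataFrame({'retweeted_status_id': [np.nan, 123, np.nan],
                   'text': ['x', 'y', 'z']})

# Keep only rows where the retweet ID is null
originals = df[df['retweeted_status_id'].isnull()]

# Then drop the retweet-related column
originals = originals.drop(['retweeted_status_id'], axis='columns')
```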
Change Data Types:
Use astype by listing the preferred data type as the argument.
The Tweet IDs were uploaded as integers, so I convert them to objects.
df2_clean.tweet_id = df2_clean.tweet_id.astype(object)
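A quick sketch of the same conversion on a hypothetical Series of IDs:

```python
import pandas as pd

# Hypothetical tweet IDs, loaded as integers by default
ids = pd.Series([892420643555336193, 892177421306343426])

# Convert to object so the IDs are treated as labels, not numbers
ids_obj = ids.astype(object)
```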
Use to_datetime to convert a column to datetime by entering the selected column as the argument.
Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.
df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])
df2_clean = df2_clean.drop('timestamp', axis='columns')
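Here's a minimal sketch of the conversion, using hypothetical timestamp strings:

```python
import pandas as pd

# Hypothetical timestamps stored as plain strings (dtype: object)
df = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56',
                                 '2017-07-31 00:18:03']})

# Parse the strings into a proper datetime column, then drop the original
df['time'] = pd.to_datetime(df['timestamp'])
df = df.drop('timestamp', axis='columns')
```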
Replace Text:
Use the str.replace function on the column, listing the old value to replace followed by the new value.
Text entries for this data set contained the HTML entity for ampersand (‘&amp;’) instead of the symbol itself.
df2_clean['text'] = df2_clean['text'].str.replace('&amp;', '&', regex=False)
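A sketch of the substitution on a hypothetical Series of text entries:

```python
import pandas as pd

# Hypothetical text entries containing the HTML ampersand entity
s = pd.Series(['fish &amp; chips', 'no entity here'])

# str.replace substitutes within each string; entries without a match
# are left unchanged
cleaned = s.str.replace('&amp;', '&', regex=False)
```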
Combine and Map Columns:
First, create a new column: select the applicable columns to combine, choose a separator for the combined contents, and join each row’s values as strings.
Next, use unique to verify all the possible combinations to re-map from the result.
Then, use map to replace row entries with preferred values.
In this case, I had four columns (‘doggo’, ‘floofer’, ‘pupper’ and ‘puppo’) indicating whether or not a tweet contains each of these words. I combine them into a single ‘dog_type’ column, then map the combined entries to shorter values.
df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(
    lambda x: ','.join(x.dropna().astype(str)), axis=1)
df2_clean['dog_type'].unique()
df2_clean['dog_type'] = df2_clean.dog_type.map({
    'None,None,None,None': np.nan,
    'doggo,None,None,None': 'doggo',
    'None,None,None,puppo': 'puppo',
    'None,None,pupper,None': 'pupper',
    'None,floofer,None,None': 'floofer',
    'doggo,None,None,puppo': 'doggo/puppo',
    'doggo,floofer,None,None': 'doggo/floofer',
    'doggo,None,pupper,None': 'doggo/pupper'})
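The join-then-map pattern can be sketched on a hypothetical two-column version of the same idea:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column version of the dog-stage flags
df = pd.DataFrame({'doggo': ['doggo', 'None'],
                   'pupper': ['None', 'None']})

# Join each row's values into one comma-separated string
df['dog_type'] = df[['doggo', 'pupper']].apply(
    lambda x: ','.join(x.astype(str)), axis=1)

# Re-map each combined string to a shorter value
df['dog_type'] = df['dog_type'].map({'doggo,None': 'doggo',
                                     'None,None': np.nan})
```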
Remove HTML Tags:
Write a function to remove HTML tags using re. Compile the tag pattern ‘<.*?>’ (a non-greedy match of anything between angle brackets), and use sub to replace the matched tags with empty strings.
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)
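As a quick check, the function can be applied to a single sample string (the HTML below is a hypothetical source value for illustration):

```python
import re

def remove_html_tags(text):
    # Non-greedy match of anything between angle brackets
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

# Hypothetical source entry: an anchor tag wrapping the client name
source = '<a href="http://twitter.com" rel="nofollow">Twitter for iPhone</a>'
label = remove_html_tags(source)
```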