Cleaning Data with Pandas

A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.

The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.

Import Libraries:

Import pandasNumPy, and Python’s regular expression operations library (re).

import pandas as pd
import numpy as np
import re

Import Files:

Use read_csv to load the files you wish to clean.

twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')

Create Copies:

Create copies of the original files using copy before cleaning just in case you need to restore some of the original contents.

twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()

Merge Data:

Combine specific files using the merge function.

In this example, the main data is in the Twitter archive file. I perform a left merge to maintain the original contents of this file and add the image prediction and tweet count files as the original tweet IDs aligned.

df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left')
df2 = pd.merge(df1, twt_counts, how='left')

Drop Columns:

Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.

The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.

First, I make sure the data only includes tweets where the ‘retweeted_status_id’ was null using the isnull function. Then, I drop the columns related to retweets.

df2_clean = df2_clean[df2_clean['retweeted_status_id'].isnull()]

df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 
                          'retweeted_status_timestamp'], axis='columns')

Change Data Types:

Use astype by listing the preferred data type as the argument.

The Tweet IDs were uploaded as integers, so I convert them to objects.

df2_clean.tweet_id = df2_clean.tweet_id.astype(object)

Use to_datetime to convert a column to datetime by entering the selected column as the argument.

Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.

df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])

df2_clean = df2_clean.drop('timestamp', 1)

Replace Text:

Use the replace function and list the old value to replace followed by the new value.

Text entries for this data set had the shortened spelling of ampersand instead of the symbol itself.

df2_clean['text'] = df2_clean['text'].replace('&', '&')

Combine and Map Columns:

First, create a new column. Select the data frame, applicable columns to combine, determine the separator for the combined contents, and join the column rows as strings.

Next, use unique to verify all the possible combinations to re-map from the result.

Then, use map to replace row entries with preferred values.

In this case, I had 4 columns called ‘doggo’, ‘floofer’, ‘pupper’ and ‘puppo’ that determine whether or not a tweet contains these words. I change it to a single column of ‘dog type’. Then, I map the values to be shorter versions of the combined column entries.

df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(lambda x:                                                                    
                ','.join(x.dropna().astype(str)), axis=1)


df2_clean['dog_type'] ={'None,None,None,None': np.nan, 

Remove HTML Tags:

Write a function to remove HTML tags using re. Compile the tags by specifying ‘<.*?>’, and use sub to replace the compiled tags with empty spaces.

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)


How to Choose an Online Data Science Course

Multiple factors can play a role in your decision process when selecting an online data science course. It is important to remember that no two educational resources are exactly the same. I recommend carefully considering your needs and learning goals, and feel free to give multiple websites a try before making a decision.

Here’s a quick overview of the major components of three educational resources I have been using to learn data science. This is based on my experiences with Codecademy, DataCamp, and Udacity. There are plenty of other educational websites to chose from, including Coursera and Udemy.

Languages for Data SciencePython and SQLPython, R, and SQLPython, R, and SQL
FormatInteractive Lessons and ExercisesInteractive Lessons, Exercises, and VideosVideos and Exercises
Unique FeaturesNo VideosAvailable via Mobile AppVideos Feature Industry Professionals
Helpful ResourcesHints, 'Get Help' Live Chat for Pro Users, and Community ForumHints, Show Answer Options, and Community ForumCommunity Forum
Free ContentFree CoursesFree Courses and Access to First Section of All Premium LessonsFree Courses
Premium Program CostsCodecademy Pro:
$15.99 - $19.99
per month
DataCamp Membership:
$25 - $29 per month
Data Analyst Nanodegree: $200 per month
Features of Premium CoursesQuizzes, Advisers, and ProjectsProjects and Course Completion CertificatesProject Review and Career Services

Pick a Language

Two of the most popular languages for data science are Python and R. Another language called SQL (Structured Query Language) is also helpful to know because you can use it to work with specific data in a database. Python and R are both widely used, so I recommend trying out each language if you’re aiming to focus on just one. Depending on your preferences, the offerings of Codecademy, DataCamp, and Udacity may play a role in your decision. Codecademy offers Python and SQL. DataCamp has lessons in Python, SQL, and R, with career tracks for data scientists with Python and R. Udacity has a selection of courses that cater to all three languages. At the end of the day, choosing a language depends on how you seek to use your data science skills.

Learning Style

Test out different websites and make sure you enjoy the format of lessons before committing to one, and especially before paying for a subscription or program. If video lessons play to your strengths, I recommend using Udacity. Course videos are instructed by a wide range of data science industry professionals. This offers a unique perspective as to how people use data science in specific career areas.

Websites like Codecademy and DataCamp are designed for hands-on, visual learners. Both websites offer a console with instant feedback when you run lines of code. Codecademy, unlike DataCamp and Udacity, does not include video lessons in the curriculum. If you prefer reading at your own pace and executing lines of code without trying to absorb a video lecture, Codecademy might be right for you. DataCamp provides video introductions before coding lessons and tasks. Also, DataCamp offers an app for on-the-go coding lessons. However, the preferred format for learning with DataCamp is on the computer.

Helpful Resources 

There are tools in all three websites that help you if you get stuck on a problem. Codecademy and DataCamp offer hints specific to assigned tasks, as well as access to community forums where users can post questions for others to answer. Codecademy also offers live chat assistance for Pro members, where a tutor will review code in real time. DataCamp features an option to show the answer code for an assigned task, if you are still having trouble after reviewing a hint. The format of Udacity does not involve an interactive console, so when your code is incorrect, the best place to find help is on their community forums.

Free Content

Codecademy, DataCamp, and Udacity all offer free courses that can cater to your interests in data science. Free lessons on each website are self-paced and designed to adapt to your schedule and lifestyle.

Premium Programs

Each website offers the option to pay for access to additional content and benefits.

  • Codecademy Pro offers three levels of subscription: one month($19.99), six months($17.99 per month), and a year($15.99 per month). There’s also an option for Pro Intensive courses, such as Intro to Data Analysis, that cost $199 each. Membership benefits include quizzes and projects.
  • DataCamp membership is in the form of a monthly plan($29 per month) or a yearly plan($25 per month). Members gain unlimited access to all programs.
  • Udacity offers a Data Analyst Nanodegree program with 2 three-month terms. Term 1($499) and term 2($699) result in a cost of about $200 per month for six months. Benefits of this program include project feedback and exclusive career services.

DataCamp’s membership offers the most flexibility out of these three platforms because premium lessons are at your own pace. For Codecademy members, the Intro to Data Analysis Pro Intensive has an outlined course time frame of 4 months. You can work ahead as much as you would like depending on your schedule. Udacity’s Data Analyst Nanodegree program is made up of two 3-month terms, for a total of 6 months estimated to complete the program.