Cleaning Data with Pandas

A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.

The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.

Import Libraries:

Import pandas, NumPy, and Python’s regular expression operations library (re).

import pandas as pd
import numpy as np
import re

Import Files:

Use read_csv to load the files you wish to clean.

twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')

Create Copies:

Create copies of the original DataFrames using copy before cleaning, in case you need to restore some of the original contents.

twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()

Merge Data:

Combine specific files using the merge function.

In this example, the main data is in the Twitter archive file. I perform a left merge to preserve the original contents of this file and add the image prediction and tweet count data where the tweet IDs align.

df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left')
df2 = pd.merge(df1, twt_counts_clean, how='left')

Drop Columns:

Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.

The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.

First, I make sure the data only includes tweets where ‘retweeted_status_id’ is null, using the isnull function. Then, I drop the columns related to retweets.

df2_clean = df2[df2['retweeted_status_id'].isnull()]

df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id',
                            'retweeted_status_id', 'retweeted_status_user_id',
                            'retweeted_status_timestamp'], axis='columns')

Change Data Types:

Use astype by listing the preferred data type as the argument.

The Tweet IDs were uploaded as integers, so I convert them to objects.

df2_clean.tweet_id = df2_clean.tweet_id.astype(object)

Use to_datetime to convert a column to datetime by entering the selected column as the argument.

Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.

df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])

df2_clean = df2_clean.drop('timestamp', axis='columns')

Replace Text:

Use str.replace on the column, listing the old value to replace followed by the new value.

Text entries for this data set contained the HTML entity &amp; instead of the ampersand symbol itself.

df2_clean['text'] = df2_clean['text'].str.replace('&amp;', '&', regex=False)

Combine and Map Columns:

First, create a new column by selecting the applicable columns from the data frame, choosing a separator for the combined contents, and joining each row’s values as strings.

Next, use unique to verify all the possible combinations to re-map from the result.

Then, use map to replace row entries with preferred values.

In this case, I had four columns called ‘doggo’, ‘floofer’, ‘pupper’ and ‘puppo’ that determine whether or not a tweet contains these words. I combine them into a single ‘dog_type’ column. Then, I map the values to shorter versions of the combined column entries.

df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(
                lambda x: ','.join(x.dropna().astype(str)), axis=1)

df2_clean['dog_type'].unique()

df2_clean['dog_type'] = df2_clean.dog_type.map({'None,None,None,None': np.nan, 
                'doggo,None,None,None':'doggo',
                'None,None,None,puppo':'puppo', 
                'None,None,pupper,None':'pupper',
                'None,floofer,None,None':'floofer', 
                'doggo,None,None,puppo':'doggo/puppo',
                'doggo,floofer,None,None':'doggo/floofer', 
                'doggo,None,pupper,None':'doggo/pupper'})

Remove HTML Tags:

Write a function to remove HTML tags using re. Compile the pattern ‘<.*?>’ to match any tag, and use sub to replace each match with an empty string.

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)


Project Overview: FoodPact

[Image: Logo for the FoodPact program, a fork and knife surrounding a plate with the Earth on it]

A few months ago I began a project with my brother to create a calculator for the environmental footprint of food. It’s called FoodPact, a blend of ‘food’ and ‘ecological impact’. It’s a work in progress and I’m excited to share the code for it.

Data sources to inform the calculator include:

  • Water footprint data for crops from a 2011 study by M. M. Mekonnen and A. Y. Hoekstra.
  • Greenhouse gas emissions data from Business for Social Responsibility (BSR) and Environmental Protection Agency (EPA) documents on transport via boat, rail, and freight truck.
  • Food waste data from the United States Department of Agriculture (USDA) Economic Research Service (ERS).
  • Global food import data from the USDA Foreign Agricultural Service’s Global Agricultural Trade System (GATS).
  • Country centroid data from a President and Fellows of Harvard College 2015 data file.
  • US city locations from SimpleMaps.

We used a Bootswatch Bootstrap theme for the web application’s layout and Flask as the microframework.

Python packages used in the program include:

  • Pandas to create more refined dataframes for use within the application
  • NumPy for equations
  • geopy for calculating the great-circle distance between coordinates
  • Matplotlib’s pyplot for creating graphs

The point of the program is to take a user’s location, a food product, and the product’s country of origin, and generate the estimated distance the food traveled, the approximate amount of carbon dioxide that travel generated, and the water requirements for the product.

Conversions include: cubic meters of water to gallons, tons of crops to pounds, and grams of carbon dioxide per kilometer to pounds per mile.
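To make that flow concrete, here is a minimal sketch of the core calculation using geopy’s great_circle. The function name, parameters, and rate arguments are hypothetical stand-ins for illustration; the real factors come from the data sources listed above, and the actual implementation lives in the project repository.

from geopy.distance import great_circle

GRAMS_PER_POUND = 453.592
KM_PER_MILE = 1.60934

def food_footprint(user_coords, origin_coords,
                   co2_g_per_km, water_gal_per_lb, pounds):
    """Estimate travel distance, CO2 emissions, and water use for a product.

    user_coords / origin_coords: (latitude, longitude) tuples, e.g. a US
    city and a country centroid. The two rate arguments are placeholders
    for values drawn from the data sources listed above.
    """
    # Great-circle distance between the origin centroid and the user's city.
    distance_miles = great_circle(user_coords, origin_coords).miles

    # Convert grams of CO2 per kilometer into pounds per mile, then scale
    # by the distance traveled.
    co2_lb_per_mile = co2_g_per_km * KM_PER_MILE / GRAMS_PER_POUND
    co2_pounds = distance_miles * co2_lb_per_mile

    # Water footprint scales with the amount of product.
    water_gallons = water_gal_per_lb * pounds

    return distance_miles, co2_pounds, water_gallons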

[Image: Selected graphics from the FoodPact project]

One limitation of the calculator is that the carbon dioxide values assume travel entirely by ship, train, or truck, rather than a combination of the three methods. Emissions refer to the amount generated by shipping a twenty-foot equivalent unit (TEU) container full of the food product across the world. The country of origin is represented by its centroid rather than the exact location of food production. Similarly, the list of cities displays the five most populated cities in a given state. The only exception is New York, for which I considered New York City close enough in latitude and longitude to account for Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

The data referenced in the calculator is meant to give a relative idea of the inputs required to generate and transport food products to give perspective to consumers. Ideally, the calculator will encourage conversations about the food system and inspire people to reduce their personal food waste.

Reducing Plastic Use
[Image: Assorted plastic trash strewn alongside seaweed and rocks on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017]

In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report on 2014 municipal waste stream data for the United States. Plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.

[Image: Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA]
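Here is a simplified sketch of how a figure like this can be built with pandas and Matplotlib. The CSV filename and column names are assumptions for illustration, not the layout of the actual EPA table; the real code is in the repository linked below.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and columns; the real EPA table is laid out differently.
# Assumed columns: 'recycled', 'combusted', 'landfilled' (tons), indexed by
# product category (total plastics, containers and packaging, etc.).
waste = pd.read_csv('epa_2014_plastics.csv', index_col='category')

# Convert raw tonnage to the percentage of each category's total.
percentages = waste.div(waste.sum(axis=1), axis=0) * 100

# Grouped bar chart: one group per category, one bar per fate.
ax = percentages.plot(kind='bar', figsize=(8, 5))
ax.set_ylabel('Percent of category total')
ax.set_title('Fate of plastics in the 2014 US municipal waste stream')
plt.tight_layout()
plt.savefig('plastic_fates.png')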

The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.

Here are some ideas for cutting back on plastic use:

  • Bring reusable shopping bags to every store.
    • Utilize cloth bags for all purchases.
    • Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
  • Ditch party plasticware.
    • Buy an assortment of silverware from a thrift store for party use.
    • Snag a set of used glassware for drinks instead of buying single-use plastic cups.
  • Use Bee’s Wrap instead of plastic wrap.
    • Bee’s Wrap is beeswax covered cloth for food storage. It works exactly the same as plastic wrap, but it can be used over and over.
  • Choose glassware instead of plastic zip-locked bags for storing food.
    • Glass containers like Pyrex can be used in place of single-use plastic storage bags.
  • Say ‘no’ to plastic straws.
    • Get in the habit of refusing a straw at restaurants when you go out.
    • Bring a reusable straw made out of bamboo, stainless steel, or glass to your favorite drink spot.


To check out the code for the figure I created, here’s the repository for it.

How Does Environmental Science Relate to Computer Programming?

This is a question I have received quite frequently in recent weeks. Computer programming languages can make scientific analysis much easier. This applies directly to environmental science because there is a wealth of data within the world of ecological studies. Coding offers a way for scientists to automate repetitive tasks, thereby freeing up time for other work. It can also result in new software that scientists across disciplines can use.

Statistics and environmental science go hand in hand. Science experiments involve a natural order of determining a hypothesis, establishing test methods, collecting data, analyzing data, and drawing conclusions.

The data analysis portion is where statistical models are important. Oftentimes, scientists want to know whether the results of their experiments hold statistical significance. This means showing that the trends observed in the data are unlikely to be the result of random chance. Programming languages such as Python can help scientists carry out this statistical analysis. Python packages like NumPy provide a basis for computational work, and libraries like SciPy offer modules such as scipy.stats for hypothesis tests. These include t-tests and Analysis of Variance (ANOVA) tests on numerical data and the chi-square test for categorical data. In R, packages such as car offer a function for ANOVA tables, while base R itself includes functions such as t.test. Programming languages also offer packages for creating graphs and visuals to display analytical results, such as Matplotlib in Python and ggplot2 in R.
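As a quick illustration, a two-sample t-test in Python takes only a few lines with scipy.stats. The samples below are randomly generated stand-ins, not real measurements.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Randomly generated stand-ins for a measurement taken at two field sites.
site_a = rng.normal(loc=5.0, scale=1.2, size=30)
site_b = rng.normal(loc=5.8, scale=1.2, size=30)

# Two-sample t-test: is the difference between the site means significant?
t_stat, p_value = stats.ttest_ind(site_a, site_b)
print(f't = {t_stat:.2f}, p = {p_value:.4f}')

# scipy.stats.f_oneway performs a one-way ANOVA for three or more groups,
# and scipy.stats.chi2_contingency handles categorical data.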

Sections of environmental science, such as conservation biology, can benefit from programming because of different computer models. As a college student, the first software I was introduced to that was created specifically for use in conservation science was a population viability analysis (PVA) software called Vortex. Population viability measures the likelihood that a group of organisms will thrive or decline under a certain set of circumstances. The Vortex software allows users to adjust the circumstances for populations in areas such as genetic diversity, number of organisms, and mortality rate. I used the software in a classroom setting while studying in Peru, and I performed various tests to see what factors would be detrimental to the population of a theoretical species. This tool is one of many that can assist environmental science professionals, who can use PVA to inform management decisions for threatened species.

[Image: A treeline in the lowland Amazon rain forest along the Madre de Dios River in Peru, taken February 2015. Advancements in computer programming can provide tools to help increase scientific understanding of biologically rich areas, such as the Amazon.]

Within the field of environmental science, computer programming can be a great advantage because it allows scientists to analyze data in efficient ways that can make everyday tasks easier. The utilization of programming languages and modeling software offers opportunities to put computers to use where humans would have otherwise performed repetitive tasks. This can provide scientists with more time to make discoveries and inform decisions to make the world a better place.