In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualizations with Python using pandas and GeoPy. I’ll show how to make new pandas columns by encoding complex responses, geocoding locations, and measuring distances.
Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.
Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions and their watercrafts were inspected for aquatic invasive species.
This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.
Here’s an overview of how to use newsgrab to get news headlines from Google News. Then, the data can be analyzed using the spaCy natural language processing library.
The motivation behind newsgrab was to pull data on New York colleges to compare headlines about how institutions were being affected by COVID-19. I used the College Navigator from the National Center for Education Statistics to get a list of 4-year colleges in New York to use as the search data.
I had trouble finding a clean way to scrape headlines from Google News. My brother Randy helped me use JavaScript and Playwright to write the code for newsgrab.
Run a Search with newsgrab
First, install newsgrab globally through npm from the command line.
npm install -g newsgrab
Run a command with the package name and specify the file path (if outside the current working directory) of a line-separated list of desired search terms. For my example, I used the names of New York colleges.
newsgrab ny_colleges.txt
The output of newsgrab is a JSON file called output.json, which follows the array structure below:
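Based on how the file is normalized later in this walkthrough, a minimal sketch of that structure looks like this (the search term and headlines are placeholders):

[
    {
        "search_term": "Example College",
        "results": [
            "First headline mentioning the school",
            "Second headline mentioning the school"
        ]
    }
]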
Afterwards, the output can be handled with Python.
Analyze the JSON Data with spaCy
Import the necessary packages for handling the data: json, pandas, matplotlib, seaborn, re, and spaCy. Also bring in the Counter class from the collections module; the json_normalize function is used directly from pandas as pd.json_normalize.
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re
import spacy
from collections import Counter
Bring in one of the pre-trained models from spaCy. I use the model called en_core_web_sm. There are other options in their docs for English models, as well as those for different languages.
nlp = spacy.load("en_core_web_sm")
Read in the JSON data as a list and then normalize it with pandas. Specify the record path as ‘results’ and the meta as ‘search_term’ to correspond with the JSON array data structure from the output file.
with open('output.json', encoding="utf8") as raw_file1:
    list1 = json.load(raw_file1)

search_data = pd.json_normalize(list1, record_path='results', meta='search_term', record_prefix='results')
Gather the separate data through spaCy. I wanted to pull noun chunks, named entities, and tokens from my results column. For the token output, I use token attributes to keep all tokens except stop words and punctuation. Then, each output is put into a column of the main dataframe.
noun_chunks = []
named_entity = []
tokens = []
# df holds the normalized headline data with a lowercased results column
# (created in a step not shown above); doc.is_parsed works in spaCy 2.x,
# while spaCy 3 replaces it with doc.has_annotation("DEP")
for doc in nlp.pipe(df['results_lower'].astype('unicode').values, batch_size=50, n_process=5):
    if doc.is_parsed:
        noun_chunks.append([chunk.text for chunk in doc.noun_chunks])
        named_entity.append([ent.text for ent in doc.ents])
        tokens.append([token.text for token in doc if not token.is_stop and not token.is_punct])
    else:
        noun_chunks.append(None)
        named_entity.append(None)
        tokens.append(None)
df['results_noun_chunks'] = noun_chunks
df['results_named_entities'] = named_entity
df['results_tokens_clean'] = tokens
Process Tokens
Take the tokens column and flatten it into a list. Perform some general data cleaning, like removing special characters and taking out line breaks and the remnants of ampersands. Then, use Counter to get a frequency count of each of the words in the list.
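Here is a rough sketch of that flatten-and-clean step, assuming the tokens column created above; the exact cleaning rules are my own stand-ins:

# flatten the token lists into one list of words
all_tokens = [t for row in df['results_tokens_clean'].dropna() for t in row]
# light cleaning: drop special characters, line breaks, and ampersand remnants
string_list_of_words = [re.sub(r'[^a-z0-9\s]', '', t).strip() for t in all_tokens]
string_list_of_words = [t for t in string_list_of_words if t and t != 'amp']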
word_frequency = Counter(string_list_of_words)
Raw output from the counter module shows tokens and their associated value counts in the total text.
Before analyzing the list, I also remove the tokens for my list of original search terms to keep it more focused on the terms outside of these. Then, I create a dataframe of the top results and plot those with seaborn.
A countplot shows all keyword tokens with value counts over 21 for the college news headline data.
Process Noun Chunks
Perform some cleaning to separate the noun chunks lists per each individual search term. I remove excess characters after converting the output to strings, and then use the explode function from pandas to separate them.
Then, create a variable for the value count of each of the noun chunks, turn that into a dictionary, then map it to the dataframe for the following result.
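A rough sketch of those noun chunk steps, with column and variable names assumed:

# clean the stringified noun chunk lists and explode one chunk per row
df['noun_chunk'] = (df['results_noun_chunks'].astype(str)
                    .str.strip('[]')
                    .str.replace("'", "", regex=False)
                    .str.split(', '))
chunks = df.explode('noun_chunk')

# count each noun chunk across all headlines and map the counts back
chunk_counts = chunks['noun_chunk'].value_counts().to_dict()
chunks['chunk_count'] = chunks['noun_chunk'].map(chunk_counts)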
A dataframe shows headlines, search terms, noun chunks, and new columns for separated noun chunks and associated value counts.
Then, I sort the values in a new dataframe in descending order, remove duplicates, and narrow down to the top 20 noun chunks with frequencies above 10 to graph in a countplot.
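A minimal sketch of that selection and plot, using the assumed names from the previous snippet:

# keep the 20 most frequent noun chunks and plot their counts
top_chunks = (chunks.drop_duplicates('noun_chunk')
              .sort_values('chunk_count', ascending=False)
              .head(20))
sb.countplot(y=chunks['noun_chunk'], order=top_chunks['noun_chunk'])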
A countplot shows all noun chunks with value counts over 9 for the college news headline data.
Process Named Entities
Cleaning the named entity outputs for each headline is nearly the same in process as cleaning the noun chunks. The lists are converted to strings, are cleaned, and use the explode function to separate individually. The outputs for named entities can be customized depending on desired type.
After separating the individual named entities, I use spaCy to identify the type of each and create a new column for these.
named_entity_type = []
for doc in nlp.pipe(named['named_entity'].astype('unicode').values, batch_size=50, n_process=5):
    if doc.is_parsed:
        named_entity_type.append([ent.label_ for ent in doc.ents])
    else:
        named_entity_type.append(None)
named['named_entities_type'] = named_entity_type
Then, I get the value counts for the named entities and append these to a dictionary. I map the dictionary to the named entity column, and put the result in a new column.
As seen in the snippet of the full dataframe below, the model for identifying named entity values and types is not always accurate. There is documentation for training spaCy’s models for those interested in increased accuracy.
A dataframe shows headlines, search terms, named entities, and new columns for separated named entities, their type, and associated value counts.
From the dataframe, I narrow down the entity types to exclude cardinal and ordinal types to take out any numbers that may have high frequencies within the headlines. Then, I get the top named entity types with frequencies over 6 to graph.
A countplot shows all non-numerical named entities with value counts over 6 for the college news headline data.
For full details and the cleaning steps used to create the visualizations above, please reference the associated gist from GitHub below.
Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.
A map with Lana Del Rey’s Lust for Life album booklet.
Gather Bulk Song Lyrics Data
First, create an account with Genius to obtain an API key. This is used for making requests to scrape song lyrics data from a desired artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.
You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.
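As one possible sketch of this step (not necessarily the exact approach in the linked tutorial), the lyricsgenius package can pull an artist’s songs with the stored API key; the file names here are placeholders:

import lyricsgenius

# read the Genius API key saved in a local text file
with open('genius_key.txt') as key_file:
    genius = lyricsgenius.Genius(key_file.read().strip())

# request up to 300 songs for the artist and write all lyrics to one file
artist = genius.search_artist('Lana Del Rey', max_songs=300)
with open('lyrics.txt', 'w', encoding='utf-8') as f:
    for song in artist.songs:
        f.write(song.lyrics + '\n')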
Find Cities and Countries in the Data
After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.
Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.
Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.
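A minimal sketch of the geotext step, assuming the lyrics were saved to lyrics.txt; the country lookup per city is my own rough approach:

from geotext import GeoText
import pandas as pd

with open('lyrics.txt', encoding='utf-8') as f:
    lyrics_text = f.read()

# grab every city mention and hold it in a dataframe
places = GeoText(lyrics_text)
city_mentions = pd.DataFrame({'city': places.cities})

# map each detected city to a country code, keeping the first match
city_mentions['country'] = [
    next(iter(GeoText(c).country_mentions), None) for c in city_mentions['city']
]

# query only on United States cities
city_mentions = city_mentions[city_mentions['country'] == 'US']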
The results from geotext city and country mentions in a dataframe, with a cleaned country column.
In my example, I corrected Newport and Venice to include ‘Beach’. I understand this can be cumbersome with larger datasets, but I did not see it imperative to automate this task for my example.
city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')
Next, save a list and a dataframe with value counts for each city to be used later for the map. Reset the index as well to have the two columns as city and mentions.
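A minimal sketch of that step; only unique_list is required by the geocoding loop below, and the other names are assumptions:

counts = city_mentions['city'].value_counts()
unique_list = counts.index.tolist()
mentions_df = counts.rename_axis('city').reset_index(name='mentions')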
Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10, user_agent=chrome_user_agent)

lat_lon = []
for city in unique_list:
    try:
        location = geolocator.geocode(city)
        if location:
            lat_lon.append(location)
    except GeocoderTimedOut as e:
        print("Error: geocode failed on input %s with message %s" % (city, e))

city_data = pd.DataFrame(lat_lon, columns=['raw_data', 'raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]
This yields one column as the latitude and longitude and another with comma separated location data.
The raw output of GeoPy’s geocode function in a pandas dataframe, showing the coordinates and associated location fields in a list.
Reduce the Geocode Data to Desired Columns
I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.
To split the location data (raw_data) column, convert it to a string and then split it and create a new column (city) from the first indexed object.
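A short sketch of that split, using the column names from the geocoding step:

# convert the location data to strings and keep the city name before the first comma
city_data['raw_data'] = city_data['raw_data'].astype(str)
city_data['city'] = city_data['raw_data'].str.split(',', expand=True)[0]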
A dataframe with the outputs from GeoPy geocoder with one new column for string split city names.
Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.
#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)
#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)
#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))
Convert the latitude and longitude columns back to floats because this is the usable type for plotly.
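A minimal sketch of the float conversion and the merge that produces the dataframe used for the map; mentions_df is the assumed name from the counting step above:

# convert coordinates back to floats for plotting
city_data[['lat', 'lon']] = city_data[['lat', 'lon']].astype(float)

# combine coordinates with the number of mentions per city
merged = city_data.merge(mentions_df, on='city')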
The final merged dataframe with city, latitude, longitude, and number of song mentions.
Create an account with MapBox to obtain an API key to plot my song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting variables such as the color scale, the zoom extent, and the data that appears when hovering over the data.
import plotly.express as px

px.set_mapbox_access_token(open("mapbox_token.txt").read())

fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
                        color_continuous_scale=px.colors.sequential.Agsunset,
                        size_max=40, zoom=3, hover_data=['city'])
fig.update_layout(
    title={
        'text': 'US Cities Mentioned in Lana Del Rey Songs',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

# save the graph as HTML
with open('plotly_graph.html', 'w') as f:
    f.write(fig.to_html(include_plotlyjs='cdn'))
I used the Spotify Web API to pull the top songs from my personal account. I’ll go over how to get the fifty most popular songs from a user’s Spotify account using spotipy, clean the data, and produce visualizations in Python.
Top 50 Spotify Songs
Top 50 songs from my personal Spotify account, extracted using the Spotify API.
| # | Song | Artist | Album | Popularity |
| --- | --- | --- | --- | --- |
| 1 | Borderline | Tame Impala | Borderline | 77 |
| 2 | Groceries | Mallrat | In the Sky | 64 |
| 3 | Fading | Toro y Moi | Outer Peace | 48 |
| 4 | Fanfare | Magic City Hippies | Hippie Castle EP | 57 |
| 5 | Limestone | Magic City Hippies | Hippie Castle EP | 59 |
| 6 | High Steppin' | The Avett Brothers | Closer Than Together | 51 |
| 7 | I Think Your Nose Is Bleeding | The Front Bottoms | Ann | 43 |
| 8 | Die Die Die | The Avett Brothers | Emotionalism (Bonus Track Version) | 44 |
| 9 | Spice | Magic City Hippies | Modern Animal | 42 |
| 10 | Bleeding White | The Avett Brothers | Closer Than Together | 53 |
| 11 | Prom Queen | Beach Bunny | Prom Queen | 73 |
| 12 | Sports | Beach Bunny | Sports | 65 |
| 13 | February | Beach Bunny | Crybaby | 51 |
| 14 | Pale Beneath The Tan (Squeeze) | The Front Bottoms | Ann | 43 |
| 15 | 12 Feet Deep | The Front Bottoms | Rose | 49 |
| 16 | Au Revoir (Adios) | The Front Bottoms | Talon Of The Hawk | 50 |
| 17 | Freelance | Toro y Moi | Outer Peace | 57 |
| 18 | Spaceman | The Killers | Day & Age (Bonus Tracks) | 62 |
| 19 | Destroyed By Hippie Powers | Car Seat Headrest | Teens of Denial | 51 |
| 20 | Why Won't They Talk To Me? | Tame Impala | Lonerism | 59 |
| 21 | Fallingwater | Maggie Rogers | Heard It In A Past Life | 71 |
| 22 | Funny You Should Ask | The Front Bottoms | Talon Of The Hawk | 48 |
| 23 | You Used To Say (Holy Fuck) | The Front Bottoms | Going Grey | 47 |
| 24 | Today Is Not Real | The Front Bottoms | Ann | 41 |
| 25 | Father | The Front Bottoms | The Front Bottoms | 43 |
| 26 | Broken Boy | Cage The Elephant | Social Cues | 60 |
| 27 | Wait a Minute! | WILLOW | ARDIPITHECUS | 80 |
| 28 | Laugh Till I Cry | The Front Bottoms | Back On Top | 47 |
| 29 | Nobody's Home | Mallrat | Nobody's Home | 56 |
| 30 | Apocalypse Dreams | Tame Impala | Lonerism | 60 |
| 31 | Fill in the Blank | Car Seat Headrest | Teens of Denial | 56 |
| 32 | Spiderhead | Cage The Elephant | Melophobia | 57 |
| 33 | Tie Dye Dragon | The Front Bottoms | Ann | 47 |
| 34 | Summer Shandy | The Front Bottoms | Back On Top | 43 |
| 35 | At the Beach | The Avett Brothers | Mignonette | 51 |
| 36 | Motorcycle | The Front Bottoms | Back On Top | 41 |
| 37 | The New Love Song | The Avett Brothers | Mignonette | 42 |
| 38 | Paranoia in B Major | The Avett Brothers | Emotionalism (Bonus Track Version) | 49 |
| 39 | Aberdeen | Cage The Elephant | Thank You Happy Birthday | 54 |
| 40 | Losing Touch | The Killers | Day & Age (Bonus Tracks) | 51 |
| 41 | Four of a Kind | Magic City Hippies | Hippie Castle EP | 46 |
| 42 | Cosmic Hero (Live at the Tramshed, Cardiff, Wa... | Car Seat Headrest | Commit Yourself Completely | 34 |
| 43 | Locked Up | The Avett Brothers | Closer Than Together | 49 |
| 44 | Bull Ride | Magic City Hippies | Hippie Castle EP | 49 |
| 45 | The Weight of Lies | The Avett Brothers | Emotionalism (Bonus Track Version) | 51 |
| 46 | Heat Wave | Snail Mail | Lush | 60 |
| 47 | Awkward Conversations | The Front Bottoms | Rose | 42 |
| 48 | Baby Drive It Down | Toro y Moi | Outer Peace | 47 |
| 49 | Your Love | Middle Kids | Middle Kids EP | 29 |
| 50 | Ordinary Pleasure | Toro y Moi | Outer Peace | 58 |
Using Spotipy and the Spotify Web API
First, I created an account with Spotify for Developers and registered an application from the dashboard. This provides both a client ID and client secret for your application to be used when making requests to the API.
Next, from the application page, under ‘Edit Settings’, I add http://localhost:8888/callback to the Redirect URIs. This comes in handy later when logging into a specific Spotify account to pull data.
Then, I write the code to make the request to the API. This will pull the data and put it in a JSON file format.
I import the following libraries:
Python’s os library to set the client ID, client secret, and redirect URI as environment variables, so the credentials are temporarily available to the code through the operating system.
Spotipy to provide an authorization flow for logging into a Spotify account and obtaining the current top tracks for export.
import os
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
Next, I define the client ID and secret to match what has been assigned to my application from the Spotify API. Then, I set the environment variables to include the client ID, client secret, and the redirect URI.
Then, I work through the authorization flow from the Spotipy documentation. The first time this code is run, the user will have to provide their Spotify username and password when prompted in the web browser.
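A minimal sketch of that setup, assuming the credentials are already stored in local variables; the scope name is Spotify’s permission for reading top tracks:

# make the credentials available to Spotipy through environment variables
os.environ['SPOTIPY_CLIENT_ID'] = client_id
os.environ['SPOTIPY_CLIENT_SECRET'] = client_secret
os.environ['SPOTIPY_REDIRECT_URI'] = 'http://localhost:8888/callback'

# request an access token for reading the account's top tracks
username = 'my_spotify_username'  # placeholder
scope = 'user-top-read'
token = util.prompt_for_user_token(username, scope)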
In the results section, I specify the information to pull. The arguments I provide indicate 50 songs as the limit, the index of the first item to return, and the time range. The time range options, as specified in Spotify’s documentation, are:
short_term : approximately last 4 weeks of listening
medium_term : approximately last 6 months of listening
long_term : last several years of listening
For my query, I decided to use the medium term argument because I thought that would give the best picture of my listening habits for the past half year. Lastly, I create a list to append the results to and then write them to a JSON file.
if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_top_tracks(limit=50, offset=0, time_range='medium_term')

    # wrap the response in a list and write it out as JSON
    top_tracks = [results]
    with open('top50_data.json', 'w', encoding='utf-8') as f:
        json.dump(top_tracks, f, ensure_ascii=False, indent=4)
else:
    print("Can't get token for", username)
After compiling this code into a Python file, I run it from the command line. The output is top50_data.json, which will need to be cleaned before using it to create visualizations.
Cleaning JSON Data for Visualizations
The top song data JSON file output is nested according to different categories, as seen in the sample below.
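For reference, an abbreviated sketch of that nesting, based on the Spotify Web API top-tracks response format (values are placeholders and most fields are omitted):

[
    {
        "items": [
            {
                "album": {"name": "Album Name"},
                "artists": [{"name": "Artist Name"}],
                "name": "Song Name",
                "popularity": 77
            }
        ],
        "limit": 50,
        "total": 50
    }
]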
Before cleaning the JSON data and creating visualizations in a new file, I import json, pandas, matplotlib, and seaborn. Next, I load the JSON file with the top 50 song data.
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
with open('top50_data.json') as f:
    data = json.load(f)
I create a full list of all the data to start. Next, I create lists where I will append the specific JSON data. Using a loop, I access each of the items of interest for analysis and append them to the lists.
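A rough sketch of that loop; the dataframe and column names are assumptions chosen to match the plots below:

# pull the fields of interest out of the nested response
songs, artists, albums, popularity = [], [], [], []
for item in data[0]['items']:
    songs.append(item['name'])
    artists.append(item['artists'][0]['name'])
    albums.append(item['album']['name'])
    popularity.append(item['popularity'])

top50 = pd.DataFrame({'song': songs, 'artist': artists,
                      'album': albums, 'popularity': popularity})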
Using the DataFrame, I create two visualizations. The first is a count plot using seaborn to show how many top songs came from each artist represented in the top 50 tracks.
descending_order = top50['artist'].value_counts().sort_values(ascending=False).index
ax = sb.countplot(y = top50['artist'], order=descending_order)
sb.despine(fig=None, ax=None, top=True, right=True, left=False, trim=False)
sb.set(rc={'figure.figsize':(6,7.2)})
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_title('Songs per Artist in Top 50', fontsize=16, fontweight='heavy')
sb.set(font_scale = 1.4)
ax.axes.get_xaxis().set_visible(False)
ax.set_frame_on(False)
y = top50['artist'].value_counts()
for i, v in enumerate(y):
    ax.text(v + 0.2, i + .16, str(v), color='black', fontweight='light', fontsize=14)
plt.savefig('top50_songs_per_artist.jpg', bbox_inches="tight")
A countplot shows the number of songs per artists in the top 50 tracks from greatest to least.
The second graph is a seaborn box plot to show the popularity of songs within individual artists represented.
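A minimal sketch of that plot, with the styling kept simple:

plt.figure(figsize=(6, 7.2))
ax2 = sb.boxplot(data=top50, x='popularity', y='artist', order=descending_order)
ax2.set_title('Song Popularity by Artist in Top 50', fontsize=16, fontweight='heavy')
ax2.set_xlabel('')
ax2.set_ylabel('')
plt.savefig('top50_popularity_by_artist.jpg', bbox_inches="tight")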
A boxplot shows the different levels of song popularity per artist in top 50 Spotify tracks.
Further Considerations
For future interactions with the Spotify Web API, I would like to complete requests that pull top song data for each of the three term options and compare them. This would give a comprehensive view of listening habits and could lead to pulling further information from each artist.
To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.
Inspect the Data
First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking a desired element and selecting ‘Inspect’ or pressing Ctrl+Shift+I to open the browser’s developer tools.
Screen that appears after right-clicking an element on the web page.
Inspected element in the Developer Tools window showing the HTML behind the page.
After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.
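A minimal sketch of the scraping step; the URL is a placeholder for one of the collection’s pages:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://example.edu/animal-names'  # placeholder URL
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# pull the text of each table row into a one-column dataframe
rows = [row.get_text() for row in soup.find_all('tr')]
animals = pd.DataFrame(rows, columns=['raw'])
animals.head()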
Initial dataframe preview of animal data after scraping.
Clean the Data
Based on our output, I want to refine the dataframe so the row entries are in a position to be split into two columns. First, I remove rows in index locations 0 through 2.
Dataframe following row drop to leave only animal names and genus species.
Then, I strip the escape characters from the front and end of each cell entry using the lstrip() and rstrip() functions. I split the remaining column into ‘Animal’ and ‘Genus Species’ columns using str.split(), as sketched below.
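A rough sketch of both cleanup steps; the newline delimiter and column names are assumptions based on the described output:

# drop the header rows, strip escape characters, and split into two columns
animals = animals.drop(index=[0, 1, 2]).reset_index(drop=True)
animals['raw'] = animals['raw'].str.lstrip('\n').str.rstrip('\n')
animals[['Animal', 'Genus Species']] = animals['raw'].str.split('\n', n=1, expand=True)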
The United States Department of Agriculture PLANTS database provides general information about plant species across the country. Given 3 states, I wanted to visualize which plant families are present in each and which state(s) hold the most species in each family. To accomplish this task, I used Python’s pandas, matplotlib, and seaborn libraries for analysis.
Various plant species in Death Valley National Park, California.
Initial Setup
Before beginning, I import pandas, matplotlib, and seaborn.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
Gathering Data
I pulled data sets from the USDA website for New York, Idaho, and California. The default encoding is in Latin-1 for exported text files. When importing into pandas, the encoding must be specified to work properly.
I double check the files have been loaded correctly into dataframe format using head().
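A minimal sketch of the imports; the file names are assumptions:

# the USDA text exports are Latin-1 encoded
ny_list = pd.read_csv('ny_plants.txt', encoding='latin-1')
ca_list = pd.read_csv('ca_plants.txt', encoding='latin-1')
id_list = pd.read_csv('id_plants.txt', encoding='latin-1')
ny_list.head()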
Initial import of New York state plants list categorized by USDA symbol, synonym symbol, scientific name with author, national common name, and family.
Cleaning Data
The major point of interest in the imported dataframe is the ‘Family’ column. I create a new dataframe grouped by this column, returning a count for each family.
ny_fam = ny_list.groupby('Family').count()
Initial dataframe for New York plant data grouped by taxonomic family.
Next, I remove the unwanted columns. I’ve chosen only to keep the ‘Symbol’ column as a representation of count because this variable is required for every plant instance.
ny_fam_1 = ny_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
Then, I change the column name from ‘Symbol’ to ‘{State} Count’ to lend itself for merging the dataframes without confusion.
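For New York, that mirrors the renames shown for the other states below:

ny_fam_1 = ny_fam_1.rename(columns = {"Symbol":"NY Count"})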
Column before (left) and after (right) a name change.
I complete the same process for the California and Idaho data.
ca_fam = ca_list.groupby('Family').count()
ca_fam_1 = ca_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
ca_fam_1 = ca_fam_1.rename(columns = {"Symbol":"CA Count"})
id_fam = id_list.groupby('Family').count()
id_fam_1 = id_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
id_fam_1 = id_fam_1.rename(columns = {"Symbol":"ID Count"})
Reset the index to prepare the data frames for outer merges based on column names. The index is set to ‘Family’ by default from the groupby() and count() step above. To discourage any unwanted changes, I create a copy of each data frame as I go.
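A short sketch of that step, with the copy names assumed:

# copies with the index reset so 'Family' becomes a regular column
ny_fam_2 = ny_fam_1.copy().reset_index()
ca_fam_2 = ca_fam_1.copy().reset_index()
id_fam_2 = id_fam_1.copy().reset_index()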
New York dataframe before (left) and after (right) the index was reset to make the plant families a column.
Merging Data
To preserve all the plant species regardless of presence in each individual state, I perform outer merges. This will allow for areas without data to be filled with zeros after the family counts are combined.
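A minimal sketch of the merges, using the assumed names from the previous step:

# outer merges keep every family found in any state
all_fam = ny_fam_2.merge(ca_fam_2, on='Family', how='outer')
all_fam = all_fam.merge(id_fam_2, on='Family', how='outer')

# fill missing counts with zeros and drop the decimals
all_fam = all_fam.fillna(0).astype({'NY Count': int, 'CA Count': int, 'ID Count': int})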
Plant family table with counts from each state, formatted to drop decimals and replace NaNs with zeros.
Creating a New Column
I added a column to aid in visualizations. I created a function to return the state with the highest presence of each plant family based on the existing columns.
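A minimal sketch of that function, assuming the merged dataframe and column names above:

def highest_state(row):
    # return the state label with the largest count for this family
    counts = {'NY': row['NY Count'], 'CA': row['CA Count'], 'ID': row['ID Count']}
    return max(counts, key=counts.get)

all_fam['Highest State'] = all_fam.apply(highest_state, axis=1)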
Shows the count of plant families with the highest concentration in each state.
Further Considerations
There are many factors that play into plant family diversity. The comparison of plant families in New York, California, and Idaho was purely out of curiosity. Further investigations should take into account each state’s ecosystem types and land usage and ownership that may influence species diversity.
I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.
Preliminary Data Cleaning
For this project, I used the numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of math, reading, and writing scores and three columns for the test scores converted into letter grades (sketched after the imports below).
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
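A sketch of the derived columns described above, assuming the Kaggle CSV has been read into tests with the shortened column names used in the plots below; the letter-grade cutoffs are my own assumption:

# combined average of the three subject scores
tests['avg_score'] = tests[['math', 'reading', 'writing']].mean(axis=1)

def letter_grade(score):
    # assumed 90/80/70/60 cutoffs
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    return 'F'

for subject in ['math', 'reading', 'writing']:
    tests[subject + '_grade'] = tests[subject].apply(letter_grade)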
Histograms provide a sense of the spread of test scores across subjects. Count plots provide counts for test preparation course attendance and parental level of education.
Bivariate count plots of student test scores across parental levels of education.
Bivariate Analysis
Violin plots illustrate average test scores and test preparation course attendance. Box plots provide visual representation of the quartiles within each subject area. I sorted level of education from the lowest to highest level captured by the data.
base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores based on level of test preparation.
ed_order = ['some high school', 'high school', 'some college',
'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.
Multivariate Analysis
A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.
ed_order = ['some high school', 'high school', 'some college',
'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.
A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.
The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.
Import Libraries:
Import pandas, NumPy, and Python’s regular expression operations library (re).
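Those imports look like:

import pandas as pd
import numpy as np
import re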
In this example, the main data is in the Twitter archive file. I perform a left merge to maintain the original contents of this file and add the image prediction and tweet count files as the original tweet IDs aligned.
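A rough sketch of those merges; the dataframe names are assumptions, with tweet_id as the shared key across the WeRateDogs files:

# left merges keep every row of the original Twitter archive
df = twitter_archive.merge(image_predictions, on='tweet_id', how='left')
df = df.merge(tweet_counts, on='tweet_id', how='left')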
Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.
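For example (the column names here are purely illustrative):

df = df.drop(['source', 'expanded_urls'], axis='columns')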
The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.
First, I make sure the data only includes tweets where the ‘retweeted_status_id’ was null using the isnull function. Then, I drop the columns related to retweets.
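A short sketch of that filter; the two extra retweet column names are assumptions:

# keep only original tweets, then drop the retweet-related columns
df = df[df['retweeted_status_id'].isnull()]
df = df.drop(['retweeted_status_id', 'retweeted_status_user_id',
              'retweeted_status_timestamp'], axis='columns')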
First, create a new column: select the applicable columns to combine, choose a separator for the combined contents, and join the column rows as strings.
Next, use unique to verify all the possible combinations to re-map from the result.
Then, use map to replace row entries with preferred values.
In this case, I had four columns called ‘doggo’, ‘floofer’, ‘pupper’, and ‘puppo’ that indicate whether or not a tweet contains these words. I combine them into a single ‘dog type’ column. Then, I map the values to shorter versions of the combined column entries.
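A rough sketch of the combine-and-map steps above; the separator and the mapped labels are illustrative:

# join the four columns into one string per row
df['dog_type'] = df[['doggo', 'floofer', 'pupper', 'puppo']].apply(
    lambda row: '-'.join(row.astype(str)), axis=1)

# verify the combinations that actually occur
print(df['dog_type'].unique())

# map each combination to a shorter label (partial mapping shown)
df['dog_type'] = df['dog_type'].map({
    'None-None-None-None': 'none',
    'doggo-None-None-None': 'doggo',
    'None-floofer-None-None': 'floofer',
    'None-None-pupper-None': 'pupper',
    'None-None-None-puppo': 'puppo',
    # ...plus any multi-stage combinations revealed by unique()
})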
Assorted plastic trash on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017.
In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report for 2014 municipal waste stream data for the United States. Plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.
Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA.
The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.
Here are some ideas for cutting back on plastic use:
Bring reusable shopping bags to every store.
Utilize cloth bags for all purchases.
Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
Ditch party plasticware.
Buy an assortment of silverware from a thrift store for party use.
Snag a set of used glassware for drinks instead of buying single-use plastic cups.