How to Scrape Hypertext from Tables Using Beautiful Soup

To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.

Inspect the Data

First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking the desired element and selecting ‘Inspect’ or pressing Ctrl+Shift+I to open the browser’s developer tools.

Screen that appears after right-clicking an element on the web page.
Inspected element in the Developer Tools window showing the HTML behind the page.

After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.

Scrape the Data

Before beginning, import the packages we’ll need now (requests and Beautiful Soup) and later on (pandas).

import pandas as pd
import requests
from bs4 import BeautifulSoup

Then, we’ll use a request to gather the HTML from the webpage.

page = requests.get('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')
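It can be worth confirming the request succeeded before parsing; the requests library provides raise_for_status() for this:

page.raise_for_status()  # raises an HTTPError if the response status code indicates a failure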

Next, we’ll create a Beautiful Soup object referencing the page variable.

soup = BeautifulSoup(page.text, 'html.parser')

Using the object we just created, let’s gather all the row data and append it to a new list.

rows = soup.find_all('tr')
list_animals = []
for row in rows:         
    instance = row.get_text()
    list_animals.append(instance)
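As an aside, you could also pull each cell out separately instead of taking the full text of the row. The sketch below is only an alternative and assumes each row stores the common name and the genus/species in separate td cells, which may not match the actual markup; the rest of this post sticks with the get_text() approach above.

records = []
for row in rows:
    # grab the text of each cell in the row, stripped of surrounding whitespace
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) >= 2:
        records.append({'common_name': cells[0], 'genus_species': cells[1]})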

Afterwards, I create a pandas dataframe with the list we just generated and use the head() function to preview the output.

list1 = pd.DataFrame(list_animals)
print(list1.head(10))
Initial dataframe preview of animal data after scraping.

Clean the Data

Based on our output, I want to refine the dataframe so each row entry can be split into two columns. First, I remove the rows at index locations 0 through 2.

list1 = list1.drop([list1.index[0], list1.index[1], list1.index[2]])
Dataframe following row drop to leave only animal names and genus species.

Then, I strip the newline characters from the start and end of each cell entry using the lstrip() and rstrip() functions. I split the remaining column into two columns, ‘Animal’ and ‘Genus Species’, by using str.split() on the newline separator.

list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))
list1[['v1','v2']] = list1['v1'].str.split('\n',expand=True)
Separation of original column based on animal name and genus species.

Next, I reset the index and drop the index and original column. I rename columns ‘v1’ and ‘v2’ to their appropriate names.

list1 = list1.reset_index()
list1 = list1.drop(columns = ['index', 0])
list1 = list1.rename(columns={"v1": "common_name", "v2": "genus_species"})
Final dataframe with columns named for common names and genus species.

Lastly, I save the dataframe in a comma-separated values file for later use.

list1.to_csv(r'Path\...\animal_names_list.csv')

Analysis of USDA Plant Families Data Using Pandas

The United States Department of Agriculture PLANTS database provides general information about plant species across the country. Given three states, I wanted to visualize which plant families are present in each and which state (or states) holds the most species in each family. To accomplish this task, I used Python’s pandas, matplotlib, and seaborn libraries for analysis.

Various plant species in Death Valley National Park, California.

Initial Setup

Before beginning, I import pandas, matplotlib, and seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

Gathering Data

I pulled data sets from the USDA website for New York, Idaho, and California. The exported text files use Latin-1 encoding by default, so the encoding must be specified when importing into pandas.

ny_list = pd.read_csv('ny_list.txt', encoding='latin-1')
ca_list = pd.read_csv('ca_list.txt', encoding='latin-1')
id_list = pd.read_csv('id_list.txt', encoding='latin-1')

I double-check that the files loaded correctly into dataframes using head().
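For example, previewing each dataframe:

print(ny_list.head())
print(ca_list.head())
print(id_list.head())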

Initial import of New York state plants list categorized by USDA symbol, synonym symbol, scientific name with author, national common name, and family.

Cleaning Data

The major point of interest in the imported dataframe is the ‘Family’ column. I create a new dataframe grouped by this column, returning a count for each family.

ny_fam = ny_list.groupby('Family').count()
Initial dataframe for New York plant data grouped by taxonomic family.

Next, I remove the unwanted columns. I keep only the ‘Symbol’ column as a representation of the count because this variable is recorded for every plant instance.

ny_fam_1 = ny_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)

Then, I change the column name from ‘Symbol’ to ‘{State} Count’ so the dataframes can later be merged without confusion.

ny_fam_1 = ny_fam_1.rename(columns = {"Symbol":"NY Count"})
Column before (left) and after (right) a name change.

I complete the same process for the California and Idaho data.

ca_fam = ca_list.groupby('Family').count()
ca_fam_1 = ca_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
ca_fam_1 = ca_fam_1.rename(columns = {"Symbol":"CA Count"})
id_fam = id_list.groupby('Family').count()
id_fam_1 = id_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
id_fam_1 = id_fam_1.rename(columns = {"Symbol":"ID Count"})

Next, I reset the index to prepare the dataframes for outer merges based on column names. ‘Family’ became the index when the grouped dataframes were created with the count() function. To guard against unwanted changes, I create a copy of each dataframe as I go.

ny_fam_2 = ny_fam_1.copy()
ny_fam_2 = ny_fam_2.reset_index()
ca_fam_2 = ca_fam_1.copy()
ca_fam_2 = ca_fam_2.reset_index()
id_fam_2 = id_fam_1.copy()
id_fam_2 = id_fam_2.reset_index()
New York dataframe before (left) and after (right) the index was reset to make the plant families a column.

Merging Data

To preserve all the plant species regardless of presence in each individual state, I perform outer merges. This will allow for areas without data to be filled with zeros after the family counts are combined.

combo1 = pd.merge(ny_fam_2, ca_fam_2, how='outer')
combo2 = pd.merge(combo1, id_fam_2, how='outer')
Plant family table with counts from each state, before formatting.

To tidy the combined table, I set the display format to drop decimal places and fill the missing values with zeros.

pd.options.display.float_format = '{:,.0f}'.format
combo2 = combo2.fillna(0)
Plant family table with counts from each state, formatted to drop decimals and replace NaNs with zeros.

Creating a New Column

I add a column to aid in visualizations. The function below returns the state with the highest presence of each plant family based on the existing count columns.

def presence(row):
    if row['NY Count'] > row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'NY'
    elif row['CA Count'] > row['NY Count'] and row['CA Count'] > row['ID Count']:
        return 'CA'
    elif row['ID Count'] > row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID'
    elif row['NY Count'] == row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'CA/NY'
    elif row['CA Count'] == row['ID Count'] and row['CA Count'] > row['NY Count']:
        return 'CA/ID'
    elif row['ID Count'] == row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID/NY'
    else:
        return 'Same'
    
combo2['Highest Presence'] = combo2.apply(presence, axis=1)
Table with added column to indicate highest presence of species count within each plant family.

Below is the full table of all plant families in the dataframe.
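To print every row of the combined dataframe rather than the truncated preview pandas shows by default, one option is:

with pd.option_context('display.max_rows', None):
    print(combo2)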

Family | NY Count | CA Count | ID Count | Highest Presence
0Acanthaceae770CA/NY
1Acarosporaceae1118ID
2Aceraceae462924NY
3Acoraceae524NY
4Actinidiaceae300NY
5Adoxaceae200NY
6Agavaceae4480CA
7Aizoaceae5420CA
8Alismataceae645333NY
9Amaranthaceae788238CA
10Amblystegiaceae6400NY
11Anacardiaceae464722CA
12Andreaeaceae300NY
13Annonaceae300NY
14Anomodontaceae800NY
15Apiaceae190372257CA
16Apocynaceae486042CA
17Aquifoliaceae2130NY
18Araceae26183NY
19Araliaceae2668NY
20Aristolochiaceae2893NY
21Asclepiadaceae506720CA
22Aspleniaceae2376NY
23Asteraceae2,0573,8582,260CA
24Aulacomniaceae700NY
25Azollaceae440CA/NY
26Bacidiaceae110CA/NY
27Balsaminaceae10611ID
28Bartramiaceae1100NY
29Berberidaceae215925CA
30Betulaceae864353NY
31Bignoniaceae9110CA
32Blechnaceae5117CA
33Boraginaceae127478263CA
34Brachytheciaceae5900NY
35Brassicaceae4681,123774CA
36Bruchiaceae400NY
37Bryaceae2220NY
38Buddlejaceae450CA
39Butomaceae202ID/NY
40Buxaceae500NY
41Buxbaumiaceae500NY
42Cabombaceae663CA/NY
43Cactaceae1023878CA
44Callitrichaceae161811CA
45Calycanthaceae720NY
46Campanulaceae7614657CA
47Cannabaceae13810NY
48Cannaceae400NY
49Capparaceae184926CA
50Caprifoliaceae18210781NY
51Caryophyllaceae338506414CA
52Celastraceae24184NY
53Ceratophyllaceae855NY
54Cercidiphyllaceae200NY
55Chenopodiaceae245404245CA
56Cistaceae44270NY
57Cladoniaceae522NY
58Clethraceae400NY
59Climaciaceae500NY
60Clusiaceae521512NY
61Commelinaceae37140NY
62Convolvulaceae6813017CA
63Cornaceae553436NY
64Crassulaceae6223057CA
65Cucurbitaceae41496CA
66Cupressaceae3913030CA
67Cuscutaceae315631CA
68Cyperaceae1,016733663NY
69Dennstaedtiaceae855NY
70Diapensiaceae800NY
71Dicranaceae2000NY
72Dioscoreaceae600NY
73Dipsacaceae17108NY
74Ditrichaceae920NY
75Droseraceae8116CA
76Dryopteridaceae1216571NY
77Ebenaceae770CA/NY
78Elaeagnaceae161012NY
79Elatinaceae6144CA
80Empetraceae1360NY
81Entodontaceae800NY
82Ephemeraceae500NY
83Equisetaceae483651ID
84Ericaceae236310110CA
85Eriocaulaceae730NY
86Euphorbiaceae11123366CA
87Fabaceae6041,855871CA
88Fagaceae96910NY
89Fissidentaceae2211NY
90Flacourtiaceae200NY
91Fontinalaceae900NY
92Fumariaceae312820NY
93Funariaceae2100NY
94Gentianaceae72162122CA
95Geraniaceae348025CA
96Ginkgoaceae200NY
97Grimmiaceae233CA/ID
98Grossulariaceae4615072CA
99Haemodoraceae700NY
100Haloragaceae372820NY
101Hamamelidaceae820NY
102Hippocastanaceae1420NY
103Hippuridaceae222Same
104Hydrangeaceae245314CA
105Hydrocharitaceae454430NY
106Hydrophyllaceae1533689CA
107Hylocomiaceae800NY
108Hymeneliaceae114ID
109Hymenophyllaceae200NY
110Hypnaceae2000NY
111Iridaceae4611426CA
112Isoetaceae393028NY
113Juglandaceae52102NY
114Juncaceae162177143CA
115Juncaginaceae92213CA
116Lamiaceae399413182CA
117Lardizabalaceae200NY
118Lauraceae12110NY
119Lecanoraceae311NY
120Lemnaceae274517CA
121Lentibulariaceae372317NY
122Leucobryaceae200NY
123Liliaceae263741243CA
124Limnanthaceae3363CA
125Linaceae365220CA
126Lycopodiaceae114947NY
127Lygodiaceae200NY
128Lythraceae252616CA
129Magnoliaceae2000NY
130Malvaceae6628462CA
131Marsileaceae21512CA
132Melastomataceae1200NY
133Meliaceae330CA/NY
134Menispermaceae200NY
135Menyanthaceae1073NY
136Mniaceae1000NY
137Molluginaceae493CA
138Monotropaceae193121CA
139Moraceae21165NY
140Myricaceae2250NY
141Najadaceae252110NY
142Nelumbonaceae950NY
143Nyctaginaceae301578CA
144Nymphaeaceae502330NY
145Oleaceae39490CA
146Onagraceae237661314CA
147Ophioglossaceae605348NY
148Orchidaceae285175173NY
149Orobanchaceae227451CA
150Orthotrichaceae1100NY
151Osmundaceae1000NY
152Oxalidaceae705847NY
153Paeoniaceae242CA
154Papaveraceae389826CA
155Parmeliaceae111Same
156Pedaliaceae11258CA
157Phytolaccaceae470CA
158Pinaceae5211365CA
159Plantaginaceae647833CA
160Platanaceae980NY
161Plumbaginaceae16220CA
162Poaceae1,9272,3471,507CA
163Podostemaceae300NY
164Polemoniaceae35637279CA
165Polygalaceae29170NY
166Polygonaceae331917435CA
167Polypodiaceae9149CA
168Polytrichaceae2300NY
169Pontederiaceae17176CA/NY
170Portulacaceae24203120CA
171Potamogetonaceae11583103NY
172Pottiaceae3421NY
173Primulaceae67104102CA
174Pteridaceae1611139CA
175Pyrolaceae546567ID
176Ranunculaceae288434412CA
177Resedaceae780CA
178Rhamnaceae1820220CA
179Rosaceae1,305803609NY
180Rubiaceae15723072CA
181Ruppiaceae11177CA
182Rutaceae18120NY
183Salicaceae284352383ID
184Salviniaceae530NY
185Santalaceae959ID/NY
186Sapindaceae5373CA
187Sarraceniaceae11160CA
188Saururaceae230CA
189Saxifragaceae55219234ID
190Scheuchzeriaceae555Same
191Schistostegaceae202ID/NY
192Schizaeaceae200NY
193Scrophulariaceae4301,146556CA
194Selaginellaceae61514CA
195Sematophyllaceae600NY
196Simaroubaceae474CA
197Smilacaceae2230NY
198Solanaceae11824463CA
199Sparganiaceae242224ID/NY
200Sphagnaceae4231NY
201Staphyleaceae220CA/NY
202Sterculiaceae2300CA
203Styracaceae790CA
204Symplocaceae500NY
205Taxaceae1042NY
206Teloschistaceae111Same
207Tetraphidaceae200NY
208Theliaceae300NY
209Thelypteridaceae26913NY
210Thuidiaceae300NY
211Thymelaeaceae620NY
212Tiliaceae2200NY
213Trapaceae500NY
214Tropaeolaceae220CA/NY
215Typhaceae6105CA
216Ulmaceae231910NY
217Urticaceae526132CA
218Valerianaceae135927CA
219Verbenaceae531267CA
220Violaceae17612079NY
221Viscaceae12446CA
222Vitaceae55184NY
223Vittariaceae200NY
224Xyridaceae1500NY
225Zannichelliaceae555Same
226Zosteraceae580CA
227Zygophyllaceae5315CA
228Aloaceae020CA
229Aponogetonaceae020CA
230Arecaceae0150CA
231Basellaceae040CA
232Bataceae020CA
233Burseraceae020CA
234Caulerpaceae020CA
235Crossosomataceae0189CA
236Cyatheaceae030CA
237Cymodoceaceae050CA
238Datiscaceae020CA
239Elaeocarpaceae040CA
240Ephedraceae0160CA
241Fouquieriaceae030CA
242Frankeniaceae060CA
243Garryaceae0110CA
244Gracilariaceae020CA
245Gunneraceae030CA
246Halymeniaceae020CA
247Krameriaceae080CA
248Lennoaceae060CA
249Loasaceae08429CA
250Melianthaceae020CA
251Myoporaceae020CA
252Myrtaceae0320CA
253Parkeriaceae040CA
254Passifloraceae060CA
255Pittosporaceae080CA
256Punicaceae020CA
257Rafflesiaceae020CA
258Scouleriaceae033CA/ID
259Simmondsiaceae040CA
260Stereocaulaceae020CA
261Tamaricaceae0123CA
262Ulvaceae030CA
263Verrucariaceae002ID

Visualizing the Data

I created a count plot using seaborn to show which states, or state combinations, have the highest variety within each plant family.

base_color = "#B6D1BE"
sb.countplot(data=combo2, x='Highest Presence', color=base_color, order=combo2['Highest Presence'].value_counts().index)
cat_counts = combo2['Highest Presence'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    plt.text(loc, count+5, str(count), ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Highest Concentration of Plant Families by State', fontsize=14, y=1.05)
plt.ylim(0, 140)
sb.despine();
Count of plant families with the highest concentration in each state, from most to fewest: California, New York, California/New York, Idaho, ties across all three states, Idaho/New York, and California/Idaho.

Further Considerations

There are many factors that play into plant family diversity. The comparison of plant families in New York, California, and Idaho was purely out of curiosity. Further investigations should take into account each state’s ecosystem types and land usage and ownership that may influence species diversity.

How to Mindfully Summarize Data Insights


It can be difficult to transparently present key insights in a world saturated with fake news and clickbait. Study results are often distilled into share-worthy titles that get read and taken as gospel without anyone actually reading further, so be conscious of how you choose to share data findings. Here are a few ways to stay humble while summarizing your data, based on a headline from the Girlgaze newsletter about an article from Gay Times.

Screenshot of Girlgaze newsletter from June 29, 2019 with added underlining and highlighting.
  1. Avoid sweeping generalizations that encourage assumptions about entire populations. This headline implies the sexual identity of all young people rather than the small portion who were surveyed. A quick change would be to add ‘Surveyed’ after ‘Youth’ to reinforce that this was a limited sample (‘over 2,000 adults’) rather than all youth.
  2. Name the party or parties responsible for sponsoring the collection of data. This communication successfully names the commissioner of the study early in the description. This allows the audience to have full transparency about involved parties who may influence a study’s outcomes.
  3. Avoid the use of definitive language. The last sentence in this quick summary is too precise in saying ‘clear indication‘ and should say what the ‘much more fluid perspective‘ is in comparison to. Studies use a sample from a population to provide meaningful insights. Samples are not meant to determine a definitive stance for every member of a population. Surveys take into account an array of potential variables, and language such as ‘clear indication‘ would imply researchers have explored every possible avenue of bias.

Other ways to communicate transparently with an audience include sharing links to raw data, naming potential sources of error, and making suggestions for future method improvements. Providing an audience with every opportunity to explore your data and understand methods empowers people to consume insights responsibly.

Data Exploration with Student Test Scores


I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1,000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, and writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.

Preliminary Data Cleaning

For this project, I used the numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of math, reading, and writing scores and three columns for the test scores converted into letter grades.

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
Output of info() for student test scores.
tests.head(10)
Output of head() for student test scores.
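The code for those added columns isn’t shown here; a minimal sketch, assuming a simple ten-point letter-grade scale (the original grading scheme isn’t specified), might look like this:

# combined average of the three subject scores ('avg_score' is used in the plots below)
tests['avg_score'] = tests[['math', 'reading', 'writing']].mean(axis=1)

def letter_grade(score):
    # assumed 10-point scale, for illustration only
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    return 'F'

# hypothetical column names for the letter-grade versions of each score
for subject in ['math', 'reading', 'writing']:
    tests[subject + '_grade'] = tests[subject].apply(letter_grade)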

Univariate Analysis

Histograms provide a sense of the spread of test scores across each subject. Count plots provide counts for test preparation course attendance and parental level of education.

plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
Histograms showing the spread of student test scores across all topics.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = count
    plt.text(loc, count-35, pct_string, ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Count plot of parental education levels of student test takers.

Bivariate Analysis

Violin plots illustrate average test scores and test preparation course attendance. Box plots provide a visual representation of the quartiles within each subject area. I sorted level of education from the lowest to highest level captured by the data.

base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores based on level of test preparation.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.

Multivariate Analysis

A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.

ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.

Statistics Principles for Data Analysis

Four traditional dice on a game board

Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. Below, overarching concepts are from the test’s public posting and my notes are derived from a quantitative statistics textbook.

  • Central Tendency
    • Mean = the average of a distribution
    • Median = a distribution’s midpoint
    • Mode = the value that occurs most often in a distribution
  • Variability
    • The distribution of data, also known as spread
    • Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
      • Represented through boxplots graphically
    • Summarized through quartiles:
      • Q1: median of all values to left of Q2
      • Q2: median (50th percentile) of all values in distribution
      • Q3: median of all values to right of Q2
    • Variance: s^2 = Σ(x_i − x̄)^2 / (n − 1), where x̄ is the mean and n is the number of values (see the short code example after this list)
    • Standard deviation: s, the square root of the variance
  • Normal Distribution: bell-curve distribution of data
  • Hypothesis Testing: Examine evidence against a null hypothesis; hypotheses refer to populations or models, not a particular outcome
    • Compare claims
    • Null hypothesis: statement challenged in significance testing
      • Example: There is not a difference between means.
    • Alternative hypothesis: statement suspected as true instead of the null hypothesis
      • Example: The means are not the same.
    • Reject or fail to reject the null hypothesis based on the p-value.
    • p-value: the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true
    • Smaller p-values signify stronger evidence against the null hypothesis in question. Often, an alpha level of 0.05 is used: a result that extreme would be expected to occur by chance only 5 times out of every 100 if the null hypothesis were true.
  • Statistical Significance Testing: Achieved at the level where the p-value is equal to or less than alpha.
  • Probability: The proportion of times an outcome would occur given many repeated tests.
  • Correlation
    • A measure of the linear relationship between two quantitative variables, based on direction and strength.
    • Examples: strong, weak, or no correlation; positive or negative
    • Represented by r
    • r = (1/(n − 1)) Σ[((x_i − x̄)/s_x) × ((y_i − ȳ)/s_y)], where s_x and s_y are the standard deviations of x and y
  • Regression
    • Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
    • Multiple linear: statistical model with more than one explanatory variable
  • Parametric Statistics: Assume the data follow a normal distribution and typically use numerical data.
  • Nonparametric Statistics: Do not assume a normal distribution and typically use ordinal or categorical data.
  • Analysis of Variance (ANOVAs)
    • One-way: Compare population means based on 1 independent variable
    • Two-way: Compare population means classified based on 2 independent variables
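As a quick illustration of several of the definitions above (central tendency, the five-number summary, variance, standard deviation, and correlation), here is a short pandas sketch with made-up numbers:

import pandas as pd

x = pd.Series([2, 4, 4, 5, 7, 9, 12])
y = pd.Series([1, 3, 5, 4, 8, 10, 15])

print(x.mean(), x.median(), x.mode().tolist())   # central tendency
print(x.min(), x.quantile(0.25), x.median(),
      x.quantile(0.75), x.max())                 # five-number summary
print(x.var(ddof=1), x.std(ddof=1))              # sample variance and standard deviation
print(x.corr(y))                                 # Pearson correlation coefficient r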

Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics. Seventh edition/Student edition. New York: W.H. Freeman and Company, a Macmillan Higher Education Company.


Data Visualizations for Spending Habits


I am not a huge fan of bank-generated visuals to analyze my spending habits. My bank breaks up expenses into murky categories such as bills and utilities, shopping, other, and uncategorized. As a result, I began tracking all my expenses in a spreadsheet to better capture the data. This includes month, vendor, amount, and expense category for every purchase. To explore this data, I used a heat map and a waffle chart.

Here’s a look at how to use Python to create heat maps and waffle charts for spending habit data.

Setup

Libraries and packages I utilized for my spending data include pandas, NumPy, matplotlib, seaborn, and pywaffle. I load my data from a CSV file and turn it into a dataframe using pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from pywaffle import Waffle
%matplotlib inline

df = pd.read_csv('2018_expenses.csv')

Heat Map of Number of Purchases

Heat maps display numerical trends using a sequential scale of color intensity. Below, I graph number of purchases across spending categories and organize the information by month.

I customize the graph to have annotations on each block and color the area using the seaborn color palette ‘rocket’ set to be reversed in shading by adding ‘_r’. An additional argument (‘annot_kws’) adjusts the text size of the annotations. Other features clean up the axes labels and name the tick labels accordingly.

ct_counts=df.groupby(['category','month']).size()
ct_counts=ct_counts.reset_index(name='count')
ct_counts=ct_counts.pivot(index='month', columns='category', values='count')
plt.figure(figsize=(12, 6))
ax=sb.heatmap(ct_counts, annot=True, cmap='rocket_r', annot_kws={"size":14})
plt.title('Number of Monthly Expenses', fontsize=16)
plt.xticks(rotation=0)
plt.xlabel('')
plt.ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax.set_xticklabels(['Business', 'Education', 'Entertainment', 'Food', 'Health', 'Other', 'Transportation'])
ax.set_yticklabels(['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], rotation=0);

The heat map shows my most frequent purchases are on food, entertainment, other, and transportation. Categories such as business and health have the lowest frequency of purchases. Across months, purchases are fairly similar with only between 1 and 9 purchases per spending category.

Waffle Chart of Total Spending

Waffle charts display values in congregated squares and are an alternative to pie charts. Formatting options allow for a varied number of columns and rows. I stuck with ten columns and ten rows, with each square representing one percent of the total. This allows for a simple breakdown of where my money went. I created a dictionary of values for this example, rather than using my pandas dataframe.
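If you wanted to derive that dictionary from the dataframe instead, a rough sketch could look like the following; it assumes an ‘amount’ column holding the purchase totals alongside the ‘category’ column used above.

# percent of total spending per category, rounded to whole percents
# ('amount' is an assumed column name for the purchase totals)
totals = df.groupby('category')['amount'].sum()
data = (totals / totals.sum() * 100).round().astype(int).to_dict()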

data = {'Education': 63, 'Entertainment': 9, 'Other': 8, 'Health': 7, 'Food': 6, 'Transportation': 4, 'Business': 3}

fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    values=data,
    colors=('indianred', 'lightsalmon', 'peachpuff', 'darkkhaki',
            'cadetblue', 'paleturquoise', 'lightsteelblue'),
    title={'label': 'Total Spending by Category', 'loc': 'center'},
    labels=["{0} ({1}%)".format(k, v) for k, v in data.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)}
)
fig.set_tight_layout(False)
plt.show();

The waffle chart shows most of my spending was on education and the other categories were all under 10% of total expenses.

Cleaning Data with Pandas

A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.

The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.

Import Libraries:

Import pandas, NumPy, and Python’s regular expression operations library (re).

import pandas as pd
import numpy as np
import re

Import Files:

Use read_csv to load the files you wish to clean.

twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')

Create Copies:

Create copies of the original files using copy before cleaning just in case you need to restore some of the original contents.

twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()

Merge Data:

Combine specific files using the merge function.

In this example, the main data is in the Twitter archive file. I perform a left merge to maintain the original contents of this file and add the image prediction and tweet count files as the original tweet IDs aligned.

df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left')
df2 = pd.merge(df1, twt_counts_clean, how='left')

Drop Columns:

Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.

The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.

First, I make sure the data only includes tweets where the ‘retweeted_status_id’ was null using the isnull function. Then, I drop the columns related to retweets.

df2_clean = df2[df2['retweeted_status_id'].isnull()]

df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 
                          'retweeted_status_id','retweeted_status_user_id',                                                        
                          'retweeted_status_timestamp'], axis='columns')

Change Data Types:

Use astype by listing the preferred data type as the argument.

The Tweet IDs were uploaded as integers, so I convert them to objects.

df2_clean.tweet_id = df2_clean.tweet_id.astype(object)

Use to_datetime to convert a column to datetime by entering the selected column as the argument.

Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.

df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])

df2_clean = df2_clean.drop('timestamp', axis='columns')

Replace Text:

Use the replace function and list the old value to replace followed by the new value.

Text entries in this data set contained the HTML entity for an ampersand ('&amp;') instead of the symbol itself.

df2_clean['text'] = df2_clean['text'].replace('&amp;', '&', regex=True)

Combine and Map Columns:

First, create a new column: select the dataframe and the columns to combine, choose the separator for the combined contents, and join the row values as strings.

Next, use unique to verify all the possible combinations to re-map from the result.

Then, use map to replace row entries with preferred values.

In this case, I had 4 columns called ‘doggo’, ‘floofer’, ‘pupper’, and ‘puppo’ that indicate whether or not a tweet contains these words. I combine them into a single ‘dog_type’ column. Then, I map the values to shorter versions of the combined column entries.

df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(lambda x:                                                                    
                ','.join(x.dropna().astype(str)), axis=1)

df2_clean['dog_type'].unique()

df2_clean['dog_type'] = df2_clean.dog_type.map({'None,None,None,None': np.nan, 
                'doggo,None,None,None':'doggo',
                'None,None,None,puppo':'puppo', 
                'None,None,pupper,None':'pupper',
                'None,floofer,None,None':'floofer', 
                'doggo,None,None,puppo':'doggo/puppo',
                'doggo,floofer,None,None':'doggo/floofer', 
                'doggo,None,pupper,None':'doggo/pupper'})

Remove HTML Tags:

Write a function to remove HTML tags using re. Compile a pattern that matches the tags by specifying ‘<.*?>’, and use sub to replace any matches with empty strings.

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)


Project Overview: FoodPact

Logo for the FoodPact program: a fork and knife surrounding a plate with the Earth on it, with FoodPact written below.

A few months ago I began a project with my brother to create a calculator for the environmental footprint of food. It’s called FoodPact to merge food and ecological impact. It’s a work in progress and I’m excited to share the code for it.

Data sources to inform the calculator include:

  • Water footprint data for crops from a 2011 study by M.M. Mekonnen and A. Y. Hoekstra.
  • Greenhouse gas emissions data from Business for Social Responsibility (BSR) and Environmental Protection Agency (EPA) documents on transport via boat, rail, and freight.
  • Food waste data from the United States Department of Agriculture (USDA) Economic Research Service (ERS)
  • Global food import data from the USDA Foreign Agricultural Service’s Global Agricultural Trade System (GATS).
  • Country centroid data from a President and Fellows of Harvard College 2015 data file.
  • US city locations from SimpleMaps.

We used a Bootswatch Bootstrap theme for the web application’s layout and Flask as the microframework.

Python packages used in the program include:

  • Pandas to create more refined dataframes for use within the application
  • NumPy for equations
  • geopy for calculating great circle distance between latitudes and longitudes
  • Matplotlib and pyplot for creating graphs

The whole point of the program is to take a user’s location, food product, and the product’s country of origin to generate the estimated distance the food traveled, the approximate amount of carbon dioxide that travel generated, and the water requirements for the product.

Conversions include: cubic meters of water to gallons, tons of crops to pounds, and grams of carbon dioxide per kilometer to pounds per mile.
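As a rough sketch of how those pieces could fit together (the helper functions, variable names, and numbers below are hypothetical illustrations, not FoodPact’s actual code), the distance and unit conversions might look like:

from geopy.distance import great_circle

# conversion factors
GALLONS_PER_CUBIC_METER = 264.172
POUNDS_PER_GRAM = 1 / 453.592
MILES_PER_KM = 0.621371

def water_gallons(cubic_meters):
    # convert a crop's water footprint from cubic meters to gallons
    return cubic_meters * GALLONS_PER_CUBIC_METER

def travel_footprint(origin_latlon, destination_latlon, grams_co2_per_km):
    # great circle distance in miles and associated CO2 in pounds for one shipment
    distance_km = great_circle(origin_latlon, destination_latlon).kilometers
    co2_pounds = distance_km * grams_co2_per_km * POUNDS_PER_GRAM
    return distance_km * MILES_PER_KM, co2_pounds

# example call with illustrative coordinates (country centroid to a US city)
miles, co2_lbs = travel_footprint((20.59, 78.96), (40.71, -74.01), grams_co2_per_km=50)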


One limitation of the calculator is that the values for carbon dioxide consider either full travel by ship, train, or truck and not a combination of the three methods. Emissions refer to the amount it takes to ship a twenty-foot equivalent (TEU) container full of the food product across the world. The country of origin considers the centroid and not the exact location of food production. Similarly, the list of cities displays the 5 most populated cities in that given state. The only exception is New York, for which I considered New York City close enough in latitude and longitude to account for Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

The data referenced in the calculator is meant to give a relative idea of the inputs required to generate and transport food products to give perspective to consumers. Ideally, the calculator will encourage conversations about the food system and inspire people to reduce their personal food waste.

The Role of Open Access


Open access research provides an opportunity for the public to learn and use data as needed for free, but it is not overwhelmingly common. For researchers outside of academia, trying to pull together useful data can be difficult when considering accessibility barriers.

About two months ago, I began looking for data to create a model of biological inputs and energy requirements in the United States food system. Open data resources such as FAOSTAT, the Economic Research Service, and the Bureau of Transportation Statistics provided helpful figures on land use, food imports, and food transportation values. Aside from these resources, a lot of the information I wanted to reference in building a model came from scientific papers that require journal subscriptions or charge a per-article fee.

Three articles that may have been helpful in my research illustrate the cost of access:

Upon closer investigation, Appetite claims it ‘supports open access’ but charges authors $3000 to make articles available to everyone, according to publisher Elsevier. Clearly, providing affordable open access options doesn’t seem like a priority for publishers.

There may have been useful data in the articles mentioned above. However, I won’t find out because I’m sticking with open access resources for my food systems project.

Public government databases are great, but specific science studies may hold more value to independent researchers. Journals like PLOS ONE lead the way in open access articles for those looking for specific research to complement information from public databases. A 2016 article by Paul Basken in The Chronicle of Higher Education called ‘As an Open-Access Megajournal Cedes Some Ground, a Movement Gathers Steam’ shows a rise in open access papers, but I got the figures via Boston College because the article itself is ‘premium content for subscribers.’

Rise of published open access articles over time between 2008 and 2015. Data from: Basken, P. 2016. As an open-access megajournal cedes some ground, a movement gathers steam. The Chronicle for Higher Education, 62(19), 5-5.

Charging fees for accessibility can create an elitist barrier between academia and those who want to learn more about certain topics. I’m not proposing that everyone would take advantage of open access research articles if there were cheaper publishing options, or no access fees. If more studies were open access, it would create more opportunities for members of the public to digest scientific studies on their own terms.

There’s immense value in the open-source, collaborative culture of the tech community that I hope spills over into academia. I’m optimistic about a continued increase in open access publications in the science community. For now, I’m looking forward to creating open source projects that take advantage of public data.

Data Analysis and UFO Reports


Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife Linda Miller Costa co-authored a book that compiles UFO reports called UFO Sightings Desk Reference: United States of America 2001-2015.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Center). NUFORC’s database is public and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of their data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

The use of data analysis on UFO reports requires careful consideration of potential bias and reasonable explanations for numbers in question. For example, a high volume of reports in the summer could be because more people are spending time outside and would be more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculations should be met with careful criticism. The use of the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.