Sentiment Analysis of Product Reviews with Python Using NLTK

Here is a brief overview of how to use the Python package Natural Language Toolkit (NLTK) for sentiment analysis with Amazon food product reviews. This is a basic way to use text classification on a dataset of words to help determine whether a review is positive or negative. The following is a snippet of a more comprehensive tutorial I put together for a workshop for the Syracuse Women in Machine Learning and Data Science group.

Data

The data for this tutorial comes from the Grocery and Gourmet Food Amazon reviews set from Jianmo Ni found at Amazon Review Data (2018). Out of the review categories to choose from, this set seemed like it would have a diverse range of people’s sentiment about food products. The data set itself is fairly large, so I use a smaller subset of 20,000 reviews in the example below.

A data frame preview shows the categories available from the reviews data set.
A preview of the full Groceries and Gourmet Food reviews data set from Amazon shows the available data features.

Steps to clean the main data using pandas are detailed in the Jupyter Notebook. The reviews are rated on an overall scale of 1 to 5, with 1 being the lowest approval and 5 the highest. I label reviews rated 1 or 2 as negative and those rated 4 or 5 as positive. I omit ratings of 3 for this exercise because they could lean either negative or positive.
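As a rough sketch of that labeling step (assuming the raw data has been read into a pandas dataframe, here called raw_reviews, with the overall rating column; the actual cleaning lives in the notebook), it might look something like this:

# drop neutral ratings, then label the rest as negative or positive
raw_reviews = raw_reviews[raw_reviews['overall'] != 3]
raw_reviews['reaction'] = raw_reviews['overall'].apply(lambda r: 'negative' if r <= 2 else 'positive')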

Prepare Data for Classification

Import the necessary packages. The steps below assume the data has already been cleaned using pandas.

import pandas as pd
import random
import string
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords
from nltk import classify
from nltk import NaiveBayesClassifier

Load in the cleaned data from a CSV from a data folder using pandas.

reviews = pd.read_csv('data/combined_reviews.csv')

The main cleaned dataframe has three columns: overall, reviewText, and reaction. The overall column holds the numeric review rating, the reviewText column holds the product review as a string, and the reaction column is marked ‘positive’ or ‘negative’. Each row represents an individual review.

A condensed dataframe shows three columns: overall rating, review text, and reaction.
The cleaned pandas dataframe shows the three columns for overall rating, review text, and reaction type for the product reviews.

Reduce the main pandas dataframe to a smaller sample using the pandas sample function and a lambda function applied to each reaction group. I use an even split of 20,000 reviews (10,000 per label).

sample_df = reviews.groupby('reaction').apply(lambda x: x.sample(n=10000)).reset_index(drop = True)

Use this sample dataframe to create a list for each sentiment type. Use the loc function from pandas to select the entries that have ‘positive’ or ‘negative’ in the reaction column, respectively. Then, use the pandas tolist() function to convert the reviewText column to a list.

pos_df = sample_df.loc[sample_df['reaction'] == 'positive']
pos_list = pos_df['reviewText'].tolist()

neg_df = sample_df.loc[sample_df['reaction'] == 'negative']
neg_list = neg_df['reviewText'].tolist()

With these lists, use the lower() function and list comprehension to make each review lowercase. This reduces variance, so the same word is not counted as different tokens because of capitalization.

pos_list_lowered = [word.lower() for word in pos_list] 
neg_list_lowered = [word.lower() for word in neg_list]

Turn the lists into string types to more easily separate words and prepare for more cleaning. For this text classification, we will consider the frequency of words in each type of review.

pos_list_to_string = ' '.join([str(elem) for elem in pos_list_lowered])  
neg_list_to_string = ' '.join([str(elem) for elem in neg_list_lowered])

To eliminate noise in the data, stop words (examples: ‘and’, ‘how’, ‘but’) should be removed, along with punctuation. Use NLTK’s built-in function for stop words to specify a variable for both stop words and punctuation.

stop = set(stopwords.words('english') + list(string.punctuation))

Create a variable for the tokenizer. Tokenizing separates the text into individual words based on a chosen delimiter. In this example, I use a whitespace tokenizer, which means words are split on whitespace.

tokenizer = WhitespaceTokenizer()

Use list comprehension to tokenize the positive and negative strings, keeping any token that is not a stop word or punctuation.

filtered_pos_list = [w for w in tokenizer.tokenize(pos_list_to_string) if w not in stop] 

filtered_neg_list = [w for w in tokenizer.tokenize(neg_list_to_string) if w not in stop]

Remove any punctuation that may be leftover if it was attached to a word itself.

filtered_pos_list2 = [w.strip(string.punctuation) for w in filtered_pos_list]
filtered_neg_list2 = [w.strip(string.punctuation) for w in filtered_neg_list]

As an optional sidebar, use NLTK’s Frequency Distribution function to check some of the most common words and their number of appearances in the respective reviews.

fd_pos = nltk.FreqDist(filtered_pos_list2) 
fd_neg = nltk.FreqDist(filtered_neg_list2)
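For example, the most_common() method on a frequency distribution returns the top tokens with their counts:

print(fd_pos.most_common(15))
print(fd_neg.most_common(15))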
A frequency distribution for positive food product reviews shows common words and their counts.
A list shows individual words pulled from positive food product reviews and their relative frequency in the sample set.

Create a function to make the feature sets for text classification. This turns each entry into a dictionary of word features; sentiment labels are attached in the next step.

def word_features(words):
     return dict([(word, True) for word in words.split()])

Label the sets of word features and combine into one set to be split for training and testing for sentiment analysis.

positive_features = [(word_features(f), 'pos') for f in filtered_pos_list2]
negative_features = [(word_features(f), 'neg') for f in filtered_neg_list2]

labeledwords = positive_features + negative_features

Randomly shuffle the list of labeled feature sets before training the classifier to reduce the likelihood of bias toward a given label.

random.shuffle(labeledwords)

Training and Testing the Text Classifier for Sentiment

Create a training set and a test set from the list. Then call NLTK’s Naïve Bayes classifier and train it on the training set for sentiment analysis.

train_set, test_set = labeledwords[2000:], labeledwords[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

Calculate the accuracy of the model.

print(nltk.classify.accuracy(classifier, test_set))

Provide some test example reviews for proof of concept and print the results.

print(classifier.classify(word_features('I hate this product, it tasted weird')))

Use NLTK to show the most informative features of the text classifier. This lists the word features most strongly associated with one label, along with the ratio describing how much more likely each is to appear in positive versus negative reviews (or vice versa).

classifier.show_most_informative_features(15)
NLTK's output for most informative features shows a list of words, their feature labels, and the likelihood of their occurrence in each review classification.
Output from NLTK’s most informative features for the Naïve Bayes Classifier.

Further Steps

This was an overview of sentiment analysis with NLTK. There are opportunities to increase the accuracy of the classification model. One example would be to use part-of-speech tagging to train the model using descriptive adjectives or nouns. Another idea to pursue would be to use the results of the frequency distribution and select the most common positive and negative words to train the model.
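For instance, a minimal sketch of the part-of-speech idea (assuming the averaged perceptron tagger has been downloaded with nltk.download('averaged_perceptron_tagger')) could filter the positive token list down to adjectives before building features:

# keep only adjectives (tags starting with 'JJ') from the filtered positive tokens
tagged_pos = nltk.pos_tag(filtered_pos_list2)
adjectives_pos = [word for word, tag in tagged_pos if tag.startswith('JJ')]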

The full GitHub repository tutorial for this can be found here.

How to Build an Inventory App with Tkinter
An app shows features to edit and view an inventory database.

Here’s how to build an inventory app connected to a SQLite database using Python and tkinter. This is a basic GUI (graphical user interface) to view, edit, and calculate specific inventory sums. The example below is for an inventory of supplies to complement small-scale shopkeeping tasks. View the GitHub repository here.

1. Set up the initial SQLite database with desired column names.

First, we’ll have to create a SQLite database to connect to if one does not already exist. Import sqlite3 to start.

import sqlite3

Create a connection to a database. In this instance, a new database will be created if one does not already exist with this name.

connection = sqlite3.connect("inventory.db")
cursor = connection.cursor()

Establish a cursor with the connection, which we use to execute the creation of the desired database columns. After ‘CREATE TABLE’, provide a name for the table; in this case it is ‘items’. In parentheses, list the desired column names, each followed by the data type to store it in. The full list of data type options for SQLite can be found here.

cursor.execute("CREATE TABLE items (name TEXT, quantity INTEGER, price INTEGER)")

Commit the changes to the database, and close the connection.

connection.commit()
connection.close() 

2. Create the main window.

Import the necessary packages (tkinter and sqlite3).

from tkinter import *
import sqlite3

Form the initial window for the application. Specify the dimensions using geometry and the title which will be in the header for the window.

window = Tk()
window.geometry("400x450")
window.title("Inventory Summary")

3. Create the entry fields, labels, functions, and buttons to access the database.

Add a Record to the Database

Create entry boxes and associated labels for the database columns (name, quantity, and price). The Entry widget creates an entry field within the specified window, and the Label widget attaches a label to the feature.

item_name = Entry(window, width=20)
item_name.grid(row=0, column=1, pady=2, sticky=W)
item_quantity = Entry(window, width=20)
item_quantity.grid(row=1, column=1, pady=2, sticky=W)
item_price = Entry(window, width=20)
item_price.grid(row=2, column=1, pady=2, sticky=W)

item_name_label = Label(window, text='Name ')
item_name_label.grid(row=0, column=0, pady=2, sticky=E)
item_quantity_label = Label(window,  text='Quantity ')
item_quantity_label.grid(row=1, column=0, pady=2, sticky=E)
item_price_label = Label(window, text ='Price ($) ')
item_price_label.grid(row=2,column=0, pady=2, sticky=E)

Write a function to carry out adding the new record to the database. Each function follows the same basic format where we create a connection to the database and set up the cursor. We use insert for the values entered in the form, close the connection, and then clear out the entries.

def submit():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    cursor.execute("INSERT INTO items(name,quantity,price) VALUES (?,?,?)",(item_name.get(),item_quantity.get(),item_price.get()))
    connection.commit()
    connection.close()
    item_name.delete(0, END)
    item_quantity.delete(0, END)
    item_price.delete(0, END)

Create a button to click to add the record to the database.

submit_btn = Button(window, text="Add Record to Database", command=submit)
submit_btn.grid(row=3, column=0, columnspan=2, pady=2)

Show Records

Create the function to print out the records in the database. This selects all columns, including the original IDs, from the SQLite table. Provide formatting for the display of the data, including formatting the price as standard United States Dollars (USD).

def query():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    cursor.execute("SELECT *, oid FROM items")
    records = cursor.fetchall()
    print(records)
    print_records = ''
    for record in records:
        print_records += str(record[0]) + ", " + str(record[1]) + " items, $" + "{:.2f}".format(float(record[2])) + ", ID" + "\t" + str(record[3]) +"\n"
    query_label = Label(window, text=print_records)
    query_label.grid(row=5, column=0, columnspan=2)
    connection.commit()
    connection.close()

Create a button to show the database’s records.

query_btn = Button(window, text="Show Records", command=query)
query_btn.grid(row=4, column=0, columnspan=2, pady=2)

Update a Record

The update record feature runs based on the ID specified in the ‘Select ID’ field.

select_box=Entry(window, width=20)
select_box.grid(row=6, column=1, pady=2, sticky=W)

select_box_label = Label(window, text='Select ID ')
select_box_label.grid(row=6, column=0, pady=2, sticky=E)

Then, I create two functions: one for actually updating the database, and another for creating the separate window where this action takes place. The separate window incorporates many of the same elements we already have in the primary window.

def update():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    record_id = select_box.get()

    cursor.execute(
        'UPDATE items SET name=?, quantity=?, price=? WHERE oid=?',
        (item_name_editor.get(),item_quantity_editor.get(),item_price_editor.get(),record_id)
    )
    connection.commit()
    connection.close()
    editor.destroy()

def edit():
    global editor
    editor = Tk()
    editor.geometry("450x125")
    editor.title("Edit Inventory")
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    record_id = select_box.get()

    cursor.execute("SELECT * FROM items WHERE oid=?", (record_id,))  # trailing comma makes this a one-item tuple
    records = cursor.fetchall()

    global item_name_editor
    global item_quantity_editor
    global item_price_editor

    item_name_editor = Entry(editor, width=20)
    item_name_editor.grid(row=0, column=1, sticky=W)
    item_quantity_editor = Entry(editor, width=20)
    item_quantity_editor.grid(row=1, column=1, sticky=W)
    item_price_editor = Entry(editor, width=20)
    item_price_editor.grid(row=2, column=1, sticky=W)

    item_name_label_editor = Label(editor, text='Name ')
    item_name_label_editor.grid(row=0, column=0, sticky=E)
    item_quantity_label_editor = Label(editor,  text='Quantity ')
    item_quantity_label_editor.grid(row=1, column=0, sticky=E)
    item_price_label_editor = Label(editor, text ='Price ($) ')
    item_price_label_editor.grid(row=2,column=0, sticky=E)

    for record in records:
        item_name_editor.insert(0, record[0])
        item_quantity_editor.insert(0, record[1])
        item_price_editor.insert(0, record[2])
    save_btn = Button(editor, text="Save Record", command=update)
    save_btn.grid(row=11, column=0, columnspan=2, pady=10, padx=10, ipadx=145)
    connection.commit()
    connection.close()

Create the button for updating records.

edit_btn = Button(window, text="Update Record", command=edit)
edit_btn.grid(row=11, column=0, columnspan=2, pady=2)

Delete a Record

This function runs based on the ID specified in the ‘Select ID’ form.

def delete():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    cursor.execute("DELETE from items WHERE oid=?", (select_box.get(),))  # pass the ID as a one-item tuple
    connection.commit()
    connection.close()

Create the button to remove a record from the database.

delete_btn = Button(window, text="Delete Record", command=delete)
delete_btn.grid(row=12, column=0, columnspan=2, pady=2)

4. Create a calculator button to inform updates.

One feature I wanted was a calculator for the price of a given quantity out of the total price of an item. For example, if I used 3 units of item A, how much of that item’s total price would I potentially deduct? This uses the same ‘Select ID’ field mentioned above.

def calc_price():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM items WHERE oid=?", (select_box.get(),))  # pass the ID as a one-item tuple
    records = cursor.fetchall()
    price_sum = []
    for record in records:
        price_sum.append(round((record[2] / record[1]) * int(price_calc.get()),2))
    global calc_sum_label
    calc_sum_label = Label(window, text=price_sum)
    calc_sum_label.grid(row=9, column=0, columnspan=2, pady=2)
    connection.commit()
    connection.close()

After running the ‘Calculate Price Sum’ button, the output must be cleared each time before making another calculation.

def clear_output():
    connection = sqlite3.connect("inventory.db")
    cursor = connection.cursor()
    calc_sum_label.destroy()
    connection.commit()
    connection.close()

Create the entry and label for the price sum function.

price_calc=Entry(window, width=20)
price_calc.grid(row=7, column=1, pady=2, sticky=W)

price_calc_label = Label(window, text='Quantity for Price Sum ')
price_calc_label.grid(row=7, column=0, pady=2, sticky=E)

Create the ‘Calculate Price Sum’ button and the ‘Clear Output’ button.

calculate_price_btn = Button(window, text="Calculate Price Sum", command=calc_price)
calculate_price_btn.grid(row=8, column=0, columnspan=2, pady=2)

clear_output_btn = Button(window, text="Clear Output", command=clear_output)
clear_output_btn.grid(row=10, column=0, columnspan=2, pady=2)

Layout with Tkinter

There are two geometry managers for layout with tkinter: grid and pack. I use the grid method, which allows the app to be designed using column and row placement. I use additional arguments like columnspan to span more than one column for placement, and sticky to keep items to either the west or east side of the specified column.
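As a quick, hypothetical illustration of those grid arguments (the widget below is not part of the app), plus the mainloop() call every tkinter script needs at the end so the window stays open:

example_label = Label(window, text='Example ')                          # hypothetical widget for illustration
example_label.grid(row=13, column=0, columnspan=2, pady=2, sticky=E)    # span both columns, anchor to the east side

window.mainloop()  # start the tkinter event loop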

Transforming Categorical Survey Data with pandas and GeoPy

In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualizations with Python using pandas and GeoPy. I’ll show how to make new pandas columns from encoding complex responses, geocoding locations, and measuring distances.

Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.

Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions and their watercrafts were inspected for aquatic invasive species.

This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.
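For a flavor of the kind of transformation the workshop covers, here is a minimal, hypothetical sketch of geocoding locations and measuring distances with GeoPy (the column names, launch sites, user agent, and reference point below are all made up for illustration):

import pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

geolocator = Nominatim(user_agent="survey_workshop_example")            # placeholder user agent

# made-up example rows standing in for the steward survey's launch locations
df = pd.DataFrame({'launch_site': ['Oswego, NY', 'Sackets Harbor, NY']})
df['location'] = df['launch_site'].apply(geolocator.geocode)
df['coords'] = df['location'].apply(lambda loc: (loc.latitude, loc.longitude))

# distance in miles from an illustrative reference point on Lake Ontario
reference_point = (43.4553, -76.5105)
df['miles_from_reference'] = df['coords'].apply(lambda c: geodesic(reference_point, c).miles)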

Analyze News Headlines with newsgrab and spaCy

Here’s an overview of how to use newsgrab to get news headlines from Google News. Then, the data can be analyzed using the spaCy natural language processing library.

The motivation behind newsgrab was to pull data on New York colleges to compare headlines about how institutions were being affected by COVID-19. I used the College Navigator from the National Center for Education Statistics to get a list of 4-year colleges in New York to use as the search data.

I had trouble finding a clean way to scrape headlines from Google News. My brother Randy helped me use JavaScript and Playwright to write the code for newsgrab.

Run a Search with newsgrab

First, install newsgrab globally through npm from the command line.

npm install -g newsgrab

Run a line with the package name and specify the file path (if outside current working directory) of a line-separated list of desired search terms. For my example, I used the names of New York colleges.

newsgrab ny_colleges.txt

The output of newsgrab is a JSON file called output.json and follows the array structure below:

[{"search_term":"term1","results":["result1","result2","result3"]},{"search_term":"term2","results":["result1","result2","result3"]}...]

Afterwards, the output can be handled with Python.

Analyze the JSON Data with spaCy

Import the necessary packages for handling the data. These include: json, pandas, matplotlib, seaborn, re, and spaCy. Specific items to import are the json_normalize function from pandas and the Counter class from collections.

import json
import pandas as pd
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import seaborn as sb
import re
import spacy
from collections import Counter

Bring in one of the pre-trained models from spaCy. I use the model called en_core_web_sm. There are other options in their docs for English models, as well as those for different languages.

nlp = spacy.load("en_core_web_sm")

Read in the JSON data as a list and then normalize it with pandas. Specify the record path as ‘results’ and the meta as ‘search_term’ to correspond with the JSON array data structure from the output file.

with open('output.json',encoding="utf8") as raw_file1:
    list1 = json.load(raw_file1)

search_data = pd.json_normalize(list1, record_path='results', meta='search_term',record_prefix='results')

Gather the separate pieces of data through spaCy. I want to pull noun chunks, named entities, and tokens from my results column. For the token output, I use the token attributes is_stop and is_punct to keep every token that is not a stop word or punctuation. Then, each output is put into a column of the main dataframe.
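The loop below references a dataframe named df with a lowercased results_lower column; that intermediate step is not shown here, but presumably looks something like this (the original column name is an assumption based on the record prefix above):

df = search_data.rename(columns={'results0': 'results'})    # json_normalize prefixes the record column with 'results'
df['results_lower'] = df['results'].str.lower()              # lowercase the headlines for consistent matching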

noun_chunks = []
named_entity = []
tokens = []

for doc in nlp.pipe(df['results_lower'].astype('unicode').values, batch_size=50,
                        n_process=5):
    if doc.is_parsed:
        noun_chunks.append([chunk.text for chunk in doc.noun_chunks])
        named_entity.append([ent.text for ent in doc.ents])
        tokens.append([token.text for token in doc if not token.is_stop and not token.is_punct])
    else:
        noun_chunks.append(None)
        named_entity.append(None)       
        tokens.append(None)
        
df['results_noun_chunks'] = noun_chunks
df['results_named_entities'] = named_entity
df['results_tokens_clean'] = tokens

Process Tokens

Take the tokens column and flatten it into a single list. Perform some general data cleaning, like removing special characters, line breaks, and the remnants of ampersands. Then, use the Counter class to get a frequency count of each word in the list.
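The flattening and cleaning steps are not shown above, but a rough sketch (the exact cleaning in the notebook may differ) that produces the string_list_of_words list used below could be:

# flatten the per-headline token lists into one list of words
flat_tokens = [token for row in df['results_tokens_clean'] for token in row]

# strip special characters and line breaks, and drop ampersand remnants like 'amp'
cleaned = [re.sub(r"[^a-z0-9' ]", ' ', str(token)).strip() for token in flat_tokens]
string_list_of_words = [word for word in cleaned if word and word != 'amp']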

word_frequency = Counter(string_list_of_words)
A raw output from the counter in collections shows words and their associated frequency in the text.
Raw output from the counter module shows tokens and their associated value counts in the total text.

Before analyzing the list, I also remove the tokens that match my original search terms, to keep the focus on the words surrounding them. Then, I create a dataframe of the top results and plot those with seaborn.
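A sketch of that last step, drawn with a barplot since the counts are already computed (the cutoff of 30 tokens is arbitrary):

token_counts = pd.DataFrame(word_frequency.most_common(30), columns=['token', 'count'])   # top 30 tokens
plt.figure(figsize=(8, 10))
sb.barplot(x='count', y='token', data=token_counts, color='steelblue')
plt.title('Top Tokens in College News Headlines')
plt.show()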

A horizontal countplot shows descending value counts for the top tokens found in the text.
A countplot shows all keyword tokens with value counts over 21 for the college news headline data.

Process Noun Chunks

Perform some cleaning to separate the noun chunk lists for each individual search term. I remove excess characters after converting the output to strings, and then use the explode function from pandas to separate them into individual rows.

Then, create a variable for the value count of each of the noun chunks, turn that into a dictionary, then map it to the dataframe for the following result.
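A rough sketch of those steps, assuming the noun chunks live in the results_noun_chunks column created earlier (the exact string cleaning may differ):

chunks = df[['search_term', 'results', 'results_noun_chunks']].copy()
chunks['noun_chunk'] = chunks['results_noun_chunks'].astype(str).str.strip("[]").str.replace("'", "").str.split(", ")
chunks = chunks.explode('noun_chunk')                           # one row per noun chunk

chunk_counts = chunks['noun_chunk'].value_counts().to_dict()    # frequency of each chunk
chunks['chunk_count'] = chunks['noun_chunk'].map(chunk_counts)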

A pandas dataframe shows news headlines, noun chunks, and separated noun segments and value counts.
A dataframe shows headlines, search terms, noun chunks, and new columns for separated noun chunks and associated value counts.

Then, I sort the values in a new dataframe in descending order, remove duplicates, and narrow down to the top 20 noun chunks with frequencies above 10 to graph in a countplot.

A horizontal countplot shows descending value counts for the top noun chunks found in the text.
A countplot shows all noun chunks with value counts over 9 for the college news headline data.

Process Named Entities

Cleaning the named entity output for each headline is nearly the same process as cleaning the noun chunks. The lists are converted to strings, cleaned, and separated individually with the explode function. The named entity output can also be customized depending on the desired entity types.

After separating the individual named entities, I use spaCy to identify the type of each and create a new column for these.

named_entity_type = []

for doc in nlp.pipe(named['named_entity'].astype('unicode').values, batch_size=50,
                        n_process=5):
    if doc.is_parsed:
        named_entity_type.append([ent.label_ for ent in doc.ents])
    else:
        named_entity_type.append(None)        

named['named_entities_type'] = named_entity_type

Then, I get the value counts for the named entities and append these to a dictionary. I map the dictionary to the named entity column, and put the result in a new column.

As seen in the snippet of the full dataframe below, the model for identifying named entity values and types is not always accurate. There is documentation for training spaCy’s models for those interested in increased accuracy.

A pandas dataframe shows news headlines, named entities, and separated named entities, named entity type, and value counts.
A dataframe shows headlines, search terms, named entities, and new columns for separated named entities, their type, and associated value counts.

From the dataframe, I narrow down the entity types to exclude cardinal and ordinal types, taking out any numbers that may have high frequencies within the headlines. Then, I graph the top named entities with frequencies over 6.

A horizontal countplot shows descending value counts for the top non-numerical named entities found in the text.
A countplot shows all non-numerical named entities with value counts over 6 for the college news headline data.

For full details and cleaning steps to create the visualizations above, please reference the associated gist from GitHub below.

Additional Resources

Natural Language Processing with Python and spaCy by Yuli Vasiliev

Natural Language Processing with spaCy in Python by Taranjeet Singh

Mapping Song Lyric Locations in Python

Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.

A Lana Del Rey album booklet on a map
A map with Lana Del Rey’s Lust for Life album booklet.

Gather Bulk Song Lyrics Data

First, create an account with Genius to obtain an API key. This is used for making requests to scrape song lyrics data from a desired artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.

You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.

Find Cities and Countries in the Data

After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.
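Reading the lyrics file and importing GeoText are not shown below; a minimal version (the file name is a placeholder) might be:

import pandas as pd
from geotext import GeoText

# read the scraped lyrics into a single string (file name is a placeholder)
with open('lyrics.txt', encoding='utf-8') as f:
    content = f.read()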

places = GeoText(content)
cities_from_text = places.cities
city_mentions = pd.DataFrame(cities_from_text, columns=['city'])

Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.

Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.

A three column dataframe shows city and two country columns.
The results from geotext city and country mentions in a dataframe, with a cleaned country column.
f = lambda x: GeoText(x).country_mentions
origin = city_mentions['city'].apply(f)
city_mentions['country_raw'] = origin

fn = lambda x: list(x)[0]
city_mentions['country'] = city_mentions['country_raw'].apply(fn)

city_mentions = city_mentions[city_mentions['country'] == 'US']

Afterwards, remove the country columns and manually clean the city data. I removed city names that seemed inaccurate.

city_mentions.drop(columns=['country_raw', 'country'], inplace=True)

cities_to_remove = ['Paris','Mustang','Palm','Bradley','Sunset','Pontiac','Green','Paradise',
    'Mansfield','Eden','Crystal','Monroe','Columbia','Laredo','Joplin','Adrian',
    'York','Golden','Oklahoma','Kansas','Coachella','Kokomo','Woodstock']

city_mentions = city_mentions[~city_mentions['city'].isin(cities_to_remove)]

In my example, I corrected Newport and Venice to include ‘Beach’. I understand this can be cumbersome with larger datasets, but I did not think it necessary to automate this task for my example.

city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')

Next, save the value counts for each city in a dataframe to be used later for the map. Reset the index as well so the two columns are city and mentions.

city_val_counts = city_mentions['city'].value_counts()
city_counts = pd.DataFrame(city_val_counts)

city_counts = city_counts.reset_index()
city_counts.columns = ['city', 'mentions']
A two column dataframe shows cities and number of mentions.
A pandas dataframe shows city and number of song mentions.

Then, create a list of the unique city values.

unique_list = (city_mentions['city'].unique().tolist())

Geocode the City Names

Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10, user_agent=chrome_user_agent)

lat_lon = []
for city in unique_list: 
    try:
        location = geolocator.geocode(city)
        if location:
            lat_lon.append(location)
    except GeocoderTimedOut as e:
        print("Error: geocode failed on input %s with message %s"%
             (city, e))

city_data = pd.DataFrame(lat_lon, columns=['raw_data','raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]

This yields one column as the latitude and longitude and another with comma separated location data.

A two column dataframe showing coordinates and location data such as city, county, zip code and state
The raw output of GeoPy’s geocode function in a pandas dataframe, showing the coordinates and associated location fields in a list.

Reduce the Geocode Data to Desired Columns

I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.

To split the location data (raw_data) column, convert it to a string and then split it and create a new column (city) from the first indexed object.

city_data['city'] = city_data['raw_data'].str.split(',').str[0]
A three column datadrame shows two columns of geocoded output and one for city names.
A dataframe with the outputs from GeoPy geocoder with one new column for string split city names.

Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.

#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)

#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)

#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))

Convert the latitude and longitude columns back to floats because this is the usable type for plotly.

city_data = city_data.astype({'lat': 'float64', 'lon': 'float64'})

Next, drop all the unneeded columns.

city_data.drop(['raw_data2', 'raw_data'], axis = 1, inplace=True)

Drop any duplicates and end up with a clean set of city, latitude, and longitude.

city_data.drop_duplicates(keep=False,inplace=True)
A three column dataframe shows city, latitude, and longitude.
The cleaned dataframe for the city, latitude, and longitude.

Create the Final Merged DataFrame and Map

Merge the city coordinates dataframe and city mentions dataframe using a left join on city names.

merged = pd.merge(city_data, city_counts, on='city', how='left')
A four column dataframe shows city names, latitude, longitude, and number of mentions
The final merged dataframe with city, latitude, longitude, and number of song mentions.

Create an account with MapBox to obtain an API key to plot my song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting variables such as the color scale, the zoom extent, and the data that appears when hovering over the data.

import plotly.express as px

px.set_mapbox_access_token(open("mapbox_token.txt").read())
fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
                        color_continuous_scale=px.colors.sequential.Agsunset, size_max=40, zoom=3,
                        hover_data=['city'])
fig.update_layout(
    title={
        'text': 'US Cities Mentioned in Lana Del Rey Songs',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

#save graph as html
with open('plotly_graph.html', 'w') as f:
    f.write(fig.to_html(include_plotlyjs='cdn'))

Improving Visualizations of Hierarchical Qualitative Data

Visualizing qualitative data can be difficult if care is not taken for hierarchical characteristics. Variables representing levels of feelings can be presented in a horizontal range to improve comprehension. The online bank, Simple, includes a poll in its newsletter to account holders and often asks for levels of confidence with financial topics. Here’s how to present hierarchical qualitative data in a few different ways based on visualizations from Simple’s monthly newsletter.

To represent qualitative data, careful consideration should be given to:

  • Graph Type
  • Logical Order of Data
  • Color Scheme

Original Graphs

Graph 1

In September, Simple’s poll question was: “How confident do you feel making big purchases in today’s financial environment?” Here is the visualization that accompanied it.

A pie graph created by Simple bank shows levels of confidence for account holder confidence making big purchases in today's financial climate.
Simple’s pie chart of its September survey results for: “How confident do you feel making big purchases in today’s financial environment?”

Although the legend is presented in a sensible high-to-low order, this graph is pretty confusing. The choice of a pie chart muddles the range of emotions being presented. The viewer’s eye, if moving clockwise, hits ‘Not at all Confident’ at about the same time as ‘Very Confident’. The color palette has no inherent significance for the survey responses. It does not travel on an easily understood color spectrum of high to low.

Graph 2

In November, Simple’s poll question was: “How do you feel about the money you’ll be spending this holiday season?” Below is the graph that illustrated these results.

A bar chart shows Simple’s November survey results about feelings toward holiday spending.
Simple’s bar chart of November survey results for: “How do you feel about the money you’ll be spending this holiday season?”

Simple’s graph shows various emotions, but does not show them in any particular order, whether by percentage or type of feeling. Similar to the pie chart, the color palette does not have any particular significance.

Improved Graphs

Using Python and matplotlib’s horizontal stacked bar chart, I created different representations of the survey data for big purchase confidence and feelings about holiday spending. A bar chart presents results for viewers to read logically from left to right.

Graph 1

A horizontal bar chart shows Simple's survey results from high to low confidence levels for making big purchases in today's financial climate.
A horizontal stacked bar chart shows a variation of Simple’s September survey results.

I associated the levels of confidence with a green-to-red spectrum to signify the range of positive to negative feelings. Another variation could have been a monochrome spectrum where a dark shade moving to lighter shades would signify decreasing confidence.

Graph 2

A horizontal stacked bar chart shows a range of emotions for holiday spending.
A horizontal stacked bar chart shows a variation of Simple’s November survey results.

I arranged the emotions from negative to positive feelings so they could show a spectrum. The color palette reflects the movements from troubled to excited by moving from red to green.

References

The survey data, as mentioned, comes from Simple’s monthly newsletter.

This article from matplotlib on discrete distribution provided me with the base for these graphs. The main distinction is that I only included one bar to achieve the singular spectrum of survey results. I found variations of tree maps and waffle plots did not divide sections horizontally in rectangles as well as the stacked bar plot would.

Code

Visual #1 – September Survey Data

import numpy as np
import matplotlib.pyplot as plt

category_names1 = ['very \nconfident', 'somewhat \nconfident', 'mixed \nfeelings', 'not really \nconfident', 'not at all \nconfident']
results1 = {'': [14,16,30,19,21]}

def survey1(results, category_names):

    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn_r')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c))+'%', ha='center', va='center',
                    color=text_color, fontsize=20, fontweight='bold',
                   fontname='Gill Sans MT')
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0.007, 1),
              loc='lower left',prop={'family':'Gill Sans MT', 'size':'15'})
    ax.axis('off')
    return fig, ax

survey1(results1, category_names1)

plt.suptitle(t ='How confident do you feel making big purchases in today\'s financial environment?', x=0.515, y=1.16, 
    fontsize=22, style='italic', fontname='Gill Sans MT')
#plt.savefig('big_purchase_confidence.jpeg', bbox_inches = 'tight')
plt.show()

Visual #2 – November Survey Data

category_names2 = ['in a pickle','worried','fine','calm','excited']
results2 = {'': [14,32,16,29,9]}


def survey2(results, category_names):

    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(10.5, 4))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, 
category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c))+'%', ha='center', va='center',
                    color=text_color, fontsize=20, fontweight='bold', fontname='Gill Sans MT')
    ax.legend(ncol=len(category_names), bbox_to_anchor=(- 0.01, 1),
              loc='lower left', prop={'family':'Gill Sans MT', 'size':'16'})
    ax.axis('off')
    return fig, ax


survey2(results2, category_names2)
plt.suptitle(t ='How do you feel about the money you\'ll be spending this holiday season?', x=0.509, y=1.1, fontsize=22,
            style='italic', fontname='Gill Sans MT')
#plt.savefig('holiday_money.jpeg', bbox_inches = 'tight')
plt.show()

Spotify Web API: How to Pull and Clean Top Song Data using Python

I used the Spotify Web API to pull the top songs from my personal account. I’ll go over how to get the fifty most popular songs from a user’s Spotify account using spotipy, clean the data, and produce visualizations in Python.

Top 50 Spotify Songs

Top 50 songs from my personal Spotify account, extracted using the Spotify API.
#  | Song | Artist | Album | Popularity
1  | Borderline | Tame Impala | Borderline | 77
2  | Groceries | Mallrat | In the Sky | 64
3  | Fading | Toro y Moi | Outer Peace | 48
4  | Fanfare | Magic City Hippies | Hippie Castle EP | 57
5  | Limestone | Magic City Hippies | Hippie Castle EP | 59
6  | High Steppin' | The Avett Brothers | Closer Than Together | 51
7  | I Think Your Nose Is Bleeding | The Front Bottoms | Ann | 43
8  | Die Die Die | The Avett Brothers | Emotionalism (Bonus Track Version) | 44
9  | Spice | Magic City Hippies | Modern Animal | 42
10 | Bleeding White | The Avett Brothers | Closer Than Together | 53
11 | Prom Queen | Beach Bunny | Prom Queen | 73
12 | Sports | Beach Bunny | Sports | 65
13 | February | Beach Bunny | Crybaby | 51
14 | Pale Beneath The Tan (Squeeze) | The Front Bottoms | Ann | 43
15 | 12 Feet Deep | The Front Bottoms | Rose | 49
16 | Au Revoir (Adios) | The Front Bottoms | Talon Of The Hawk | 50
17 | Freelance | Toro y Moi | Outer Peace | 57
18 | Spaceman | The Killers | Day & Age (Bonus Tracks) | 62
19 | Destroyed By Hippie Powers | Car Seat Headrest | Teens of Denial | 51
20 | Why Won't They Talk To Me? | Tame Impala | Lonerism | 59
21 | Fallingwater | Maggie Rogers | Heard It In A Past Life | 71
22 | Funny You Should Ask | The Front Bottoms | Talon Of The Hawk | 48
23 | You Used To Say (Holy Fuck) | The Front Bottoms | Going Grey | 47
24 | Today Is Not Real | The Front Bottoms | Ann | 41
25 | Father | The Front Bottoms | The Front Bottoms | 43
26 | Broken Boy | Cage The Elephant | Social Cues | 60
27 | Wait a Minute! | WILLOW | ARDIPITHECUS | 80
28 | Laugh Till I Cry | The Front Bottoms | Back On Top | 47
29 | Nobody's Home | Mallrat | Nobody's Home | 56
30 | Apocalypse Dreams | Tame Impala | Lonerism | 60
31 | Fill in the Blank | Car Seat Headrest | Teens of Denial | 56
32 | Spiderhead | Cage The Elephant | Melophobia | 57
33 | Tie Dye Dragon | The Front Bottoms | Ann | 47
34 | Summer Shandy | The Front Bottoms | Back On Top | 43
35 | At the Beach | The Avett Brothers | Mignonette | 51
36 | Motorcycle | The Front Bottoms | Back On Top | 41
37 | The New Love Song | The Avett Brothers | Mignonette | 42
38 | Paranoia in B Major | The Avett Brothers | Emotionalism (Bonus Track Version) | 49
39 | Aberdeen | Cage The Elephant | Thank You Happy Birthday | 54
40 | Losing Touch | The Killers | Day & Age (Bonus Tracks) | 51
41 | Four of a Kind | Magic City Hippies | Hippie Castle EP | 46
42 | Cosmic Hero (Live at the Tramshed, Cardiff, Wa... | Car Seat Headrest | Commit Yourself Completely | 34
43 | Locked Up | The Avett Brothers | Closer Than Together | 49
44 | Bull Ride | Magic City Hippies | Hippie Castle EP | 49
45 | The Weight of Lies | The Avett Brothers | Emotionalism (Bonus Track Version) | 51
46 | Heat Wave | Snail Mail | Lush | 60
47 | Awkward Conversations | The Front Bottoms | Rose | 42
48 | Baby Drive It Down | Toro y Moi | Outer Peace | 47
49 | Your Love | Middle Kids | Middle Kids EP | 29
50 | Ordinary Pleasure | Toro y Moi | Outer Peace | 58

Using Spotipy and the Spotify Web API

First, I created an account with Spotify for Developers and registered an application from the dashboard. This provides both a client ID and a client secret to be used when making requests to the API.

Next, from the application page, under ‘Edit Settings’, I add http://localhost:8888/callback to the Redirect URIs. This will come in handy later when logging into a specific Spotify account to pull data.

Then, I write the code to make the request to the API. This will pull the data and put it in a JSON file format.

I import the following libraries:

  • Python’s os library to set the client ID, client secret, and redirect URI as environment variables through the computer’s operating system. This temporarily stores the credentials for the session.
  • Python’s json library to encode the data.
  • Spotipy to provide an authorization flow for logging in to a Spotify account and obtain the current top tracks for export.

import os
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

Next, I define the client ID and secret to match what has been assigned to my application by the Spotify API. Then, I set the environment variables to include the client ID, client secret, and the redirect URI.

cid ="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" 
secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

os.environ['SPOTIPY_CLIENT_ID']= cid
os.environ['SPOTIPY_CLIENT_SECRET']= secret
os.environ['SPOTIPY_REDIRECT_URI']='http://localhost:8888/callback'

Then, I work through the authorization flow from the Spotipy documentation. The first time this code is run, the user will have to provide their Spotify username and password when prompted in the web browser.

username = ""
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret) 
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
scope = 'user-top-read'
token = util.prompt_for_user_token(username, scope)

if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)

In the results section, I specify the information to pull. The arguments I provide indicate 50 songs as the limit, the index of the first item to return, and the time range. The time range options, as specified in Spotify’s documentation, are:

  • short_term : approximately last 4 weeks of listening
  • medium_term : approximately last 6 months of listening
  • long_term : last several years of listening

For my query, I decided to use the medium_term option because I thought that would give the best picture of my listening habits for the past half year. Lastly, I put the results in a list and write them to a JSON file.

if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_top_tracks(limit=50, offset=0, time_range='medium_term')
    # wrap the results dictionary in a list and write it out as JSON
    top_tracks = [results]
    with open('top50_data.json', 'w', encoding='utf-8') as f:
        json.dump(top_tracks, f, ensure_ascii=False, indent=4)
else:
    print("Can't get token for", username)

After compiling this code into a Python file, I run it from the command line. The output is top50_data.json, which will need to be cleaned before using it to create visualizations.

Cleaning JSON Data for Visualizations

The top song data JSON file output is nested according to different categories, as seen in the sample below.

 "artists": [
                    {
                        "external_urls": {
                            "spotify": "https://open.spotify.com/artist/5PbpKlxQE0Ktl5lcNABoFf"
                        },
                        "href": "https://api.spotify.com/v1/artists/5PbpKlxQE0Ktl5lcNABoFf",
                        "id": "5PbpKlxQE0Ktl5lcNABoFf",
                        "name": "Car Seat Headrest",
                        "type": "artist",
                        "uri": "spotify:artist:5PbpKlxQE0Ktl5lcNABoFf"
                    }
                ],
                "disc_number": 1,
                "duration_ms": 303573,
                "explicit": true,
                "href": "https://api.spotify.com/v1/tracks/5xy3350chgFfFcdTET4xz3",
                "id": "5xy3350chgFfFcdTET4xz3",
                "is_local": false,
                "name": "Destroyed By Hippie Powers",
                "popularity": 51,
                "preview_url": "https://p.scdn.co/mp3-preview/cd1a18f3f7c8ada17bb54c55524ef42e80719d1f?cid=39e9cdce36dc45e589ce5b564c0594a2",
                "track_number": 3,
                "type": "track",
                "uri": "spotify:track:5xy3350chgFfFcdTET4xz3"
            },

Before cleaning the JSON data and creating visualizations in a new file, I import json, pandas, matplotlib, and seaborn. Next, I load the JSON file with the top 50 song data.

import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

with open('top50_data.json') as f:
  data = json.load(f)

I create a full list of all the data to start. Next, I create lists where I will append the specific JSON data. Using a loop, I access each of the items of interest for analysis and append them to the lists.

list_of_results = data[0]["items"]
list_of_artist_names = []
list_of_artist_uri = []
list_of_song_names = []
list_of_song_uri = []
list_of_durations_ms = []
list_of_explicit = []
list_of_albums = []
list_of_popularity = []

for result in list_of_results:
    this_artists_name = result["artists"][0]["name"]
    this_artists_name = result["artists"][0]["name"]
    list_of_artist_names.append(this_artists_name)
    this_artists_uri = result["artists"][0]["uri"]
    list_of_artist_uri.append(this_artists_uri)
    list_of_songs = result["name"]
    list_of_song_names.append(list_of_songs)
    song_uri = result["uri"]
    list_of_song_uri.append(song_uri)
    list_of_duration = result["duration_ms"]
    list_of_durations_ms.append(list_of_duration)
    song_explicit = result["explicit"]
    list_of_explicit.append(song_explicit)
    this_album = result["album"]["name"]
    list_of_albums.append(this_album)
    song_popularity = result["popularity"]
    list_of_popularity.append(song_popularity)

Then, I create a pandas DataFrame, name each column and populate it with the above lists, and export it as a CSV for a backup copy.

all_songs = pd.DataFrame(
    {'artist': list_of_artist_names,
     'artist_uri': list_of_artist_uri,
     'song': list_of_song_names,
     'song_uri': list_of_song_uri,
     'duration_ms': list_of_durations_ms,
     'explicit': list_of_explicit,
     'album': list_of_albums,
     'popularity': list_of_popularity
     
    })

all_songs_saved = all_songs.to_csv('top50_songs.csv')

Using the DataFrame, I create two visualizations. The first is a count plot using seaborn to show how many top songs came from each artist represented in the top 50 tracks.
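The plotting code below refers to a dataframe named top50; presumably this is just the backup CSV read back in (or the all_songs dataframe under another name):

top50 = pd.read_csv('top50_songs.csv', index_col=0)   # or simply: top50 = all_songs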

descending_order = top50['artist'].value_counts().sort_values(ascending=False).index
ax = sb.countplot(y = top50['artist'], order=descending_order)

sb.despine(fig=None, ax=None, top=True, right=True, left=False, trim=False)
sb.set(rc={'figure.figsize':(6,7.2)})

ax.set_ylabel('')    
ax.set_xlabel('')
ax.set_title('Songs per Artist in Top 50', fontsize=16, fontweight='heavy')
sb.set(font_scale = 1.4)
ax.axes.get_xaxis().set_visible(False)
ax.set_frame_on(False)

y = top50['artist'].value_counts()
for i, v in enumerate(y):
    ax.text(v + 0.2, i + .16, str(v), color='black', fontweight='light', fontsize=14)
    
plt.savefig('top50_songs_per_artist.jpg', bbox_inches="tight")
A countplot shows artists in descending song counts in total top tracks from Spotify.
A countplot shows the number of songs per artists in the top 50 tracks from greatest to least.

The second graph is a seaborn box plot to show the popularity of songs within individual artists represented.

popularity = top50['popularity']
artists = top50['artist']

plt.figure(figsize=(10,6))

ax = sb.boxplot(x=popularity, y=artists, data=top50)
plt.xlim(20,90)
plt.xlabel('Popularity (0-100)')
plt.ylabel('')
plt.title('Song Popularity by Artist', fontweight='bold', fontsize=18)
plt.savefig('top50_artist_popularity.jpg', bbox_inches="tight")
A graph shows the varying levels of song popularity per artist in top tracks from Spotify.
A boxplot shows the different levels of song popularity per artist in top 50 Spotify tracks.

Further Considerations

For future interactions with the Spotify Web API, I would like to complete requests that pull top song data for each of the three term options and compare them. This would give a comprehensive view of listening habits and could lead to pulling further information from each artist.

How to Scrape Hypertext from Tables Using Beautiful Soup

To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.

Inspect the Data

First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking a desired element and selecting ‘Inspect’ or hitting Ctrl+Shift+I to open the browser’s developer tools.

A screenshot of the window that appears after right-clicking an element on Windows before inspection.
Screen that appears after right-clicking an element on the web page.
Screenshot of the DevTools window showing the selected element and its surrounding HTML that we'll need to scrape the data.
Inspected element in the Developer Tools window showing the HTML behind the page.

After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.

Scrape the Data

Before beginning, import the packages we’ll need now (requests and Beautiful Soup) and later on (pandas).

import pandas as pd
import requests
from bs4 import BeautifulSoup

Then, we’ll use a request to gather the HTML from the webpage.

page = requests.get('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')

Next, we’ll create a Beautiful Soup object referencing the page variable.

soup = BeautifulSoup(page.text, 'html.parser')

Using the object we just created, let’s gather all the row data by appending it to a new list.

rows = soup.find_all('tr')
list_animals = []
for row in rows:         
    instance = row.get_text()
    list_animals.append(instance)

Afterwards, I create a pandas dataframe with the list we just generated and use the head() function to preview the output.

list1 = pd.DataFrame(list_animals)
print(list1.head(10))
Preview of output from initial scraping of animal names and associated genus and species classification.
Initial dataframe preview of animal data after scraping.

Clean the Data

Based on our output, I want to refine the dataframe so the row entries are in a position to be split into two columns. First, I remove rows in index locations 0 through 2.

list1 = list1.drop([list1.index[0], list1.index[1], list1.index[2]])
Dataframe following row drop to leave only animal names and genus species.

Then, I strip the newline characters at the front and end of each cell entry using the lstrip() and rstrip() functions. I split the remaining column into two columns, for ‘Animal’ and ‘Genus Species’, by using str.split() on the newline separating them.

list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))
list1[['v1','v2']] = list1['v1'].str.split('\n',expand=True)
A table shows the dataframe with the original column, a new column with animal names, and another column with genus and species.
Separation of original column based on animal name and genus species.

Next, I reset the index and drop the index and original column. I rename columns ‘v1’ and ‘v2’ to their appropriate names.

list1 = list1.reset_index()
list1 = list1.drop(columns = ['index', 0])
list1 = list1.rename(columns={"v1": "common_name", "v2": "genus_species"})
A table shows the final dataframe resulting from cleaning with two columns as properly titled 'common_name' and 'genus_species'.
Final dataframe with columns named for common names and genus species.

Lastly, I save the dataframe in a comma-separated values file for later use.

animal_names_list = list1.to_csv(r'Path\...\animal_names_list.csv')

Analysis of USDA Plant Families Data Using Pandas

The United States Department of Agriculture PLANTS database provides general information about plant species across the country. Given 3 states, I wanted to visualize which plant families are present in each and which state(s) hold the most species in each family. To accomplish this task, I used Python’s pandas, matplotlib, and seaborn libraries for analysis.

Different groups of plants, including wildflowers, are arranged in a rock bed in the desert.
Various plant species in Death Valley National Park, California.

Initial Setup

Before beginning, I import pandas, matplotlib, and seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

Gathering Data

I pulled data sets from the USDA website for New York, Idaho, and California. The default encoding is in Latin-1 for exported text files. When importing into pandas, the encoding must be specified to work properly.

ny_list = pd.read_csv('ny_list.txt', encoding='latin-1')
ca_list = pd.read_csv('ca_list.txt', encoding='latin-1')
id_list = pd.read_csv('id_list.txt', encoding='latin-1')

I double-check that the files have loaded correctly into dataframe format using head().
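A minimal check on the New York dataframe might look like this; the same call works for the other two states.

# Preview the first five rows of the New York plant list.
print(ny_list.head())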

A preview of a table shows 5 entries of various plants that are present in New York state, including their USDA symbol, synonym symbol, scientific name, common name, and plant family.
Initial import of New York state plants list categorized by USDA symbol, synonym symbol, scientific name with author, national common name, and family.

Cleaning Data

The major point of interest in the imported dataframe is the ‘Family’ column. I create a new dataframe grouped by this column, with a count of rows for each family.

ny_fam = ny_list.groupby('Family').count()
A table shows plants grouped by taxonomic family for New York state.
Initial dataframe for New York plant data grouped by taxonomic family.

Next, I remove the unwanted columns. I keep only the ‘Symbol’ column to represent the count, because this field is populated for every plant entry.

ny_fam_1 = ny_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)

Then, I rename the ‘Symbol’ column to ‘{State} Count’ so the state counts stay distinct when the dataframes are merged.

ny_fam_1 = ny_fam_1.rename(columns = {"Symbol":"NY Count"})
Two tables show a before and after image of a table where the column named 'Symbol' changes to 'NY Count.'
Column before (left) and after (right) a name change.

I complete the same process for the California and Idaho data.

ca_fam = ca_list.groupby('Family').count()
ca_fam_1 = ca_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
ca_fam_1 = ca_fam_1.rename(columns = {"Symbol":"CA Count"})
id_fam = id_list.groupby('Family').count()
id_fam_1 = id_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
id_fam_1 = id_fam_1.rename(columns = {"Symbol":"ID Count"})
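As a side note, the repeated drop and rename steps could be wrapped in a small helper. This is only a possible refactor under the same column names, not part of the original workflow.

# Possible refactor (same column assumptions as above): one helper per state list.
def family_counts(state_df, state_code):
    fam = state_df.groupby('Family').count()
    fam = fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
    return fam.rename(columns={'Symbol': f'{state_code} Count'})

ny_fam_1 = family_counts(ny_list, 'NY')
ca_fam_1 = family_counts(ca_list, 'CA')
id_fam_1 = family_counts(id_list, 'ID')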

Reset the index to prepare the dataframes for outer merges on column names. The index defaults to ‘Family’ after the groupby() and count() step. To avoid unintentionally modifying the earlier dataframes, I create a copy of each as I go.

ny_fam_2 = ny_fam_1.copy()
ny_fam_2 = ny_fam_2.reset_index()
ca_fam_2 = ca_fam_1.copy()
ca_fam_2 = ca_fam_2.reset_index()
id_fam_2 = id_fam_1.copy()
id_fam_2 = id_fam_2.reset_index()
Two tables show a before and after image of New York state plant family data. The first table shows plant families as the index and the second table shows plant families as a column, with a new numerical index.
New York dataframe before (left) and after (right) the index was reset to make the plant families a column.

Merging Data

To preserve every plant family regardless of whether it appears in each individual state, I perform outer merges. This allows families missing from a state’s list to be filled with zeros after the counts are combined.

combo1 = pd.merge(ny_fam_2, ca_fam_2, how='outer')
combo2 = pd.merge(combo1, id_fam_2, how='outer')
A table shows the combined plant family data for New York, California, and Idaho.
Plant family table with counts from each state, before formatting.
I format the counts to display without decimal places and replace the missing values with zeros.

pd.options.display.float_format = '{:,.0f}'.format
combo2 = combo2.fillna(0)
A table shows combined data for plant families in New York, California, and Idaho where the numbers are formatted to show zero decimal places and any instances of no data are replaced by zeroes.
Plant family table with counts from each state, formatted to drop decimals and replace NaNs with zeros.
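An optional sanity check can confirm the merge kept every family and that no missing counts remain after fillna(0).

# Optional check: row/column counts and any remaining missing values in combo2.
print(combo2.shape)
print(combo2.isna().sum())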

Creating a New Column

I add a column to aid in visualization. I create a function that returns the state, or combination of states, with the highest count for each plant family based on the existing columns.

def presence(row):
    if row['NY Count'] > row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'NY'
    elif row['CA Count'] > row['NY Count'] and row['CA Count'] > row['ID Count']:
        return 'CA'
    elif row['ID Count'] > row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID'
    elif row['NY Count'] == row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'CA/NY'
    elif row['CA Count'] == row['ID Count'] and row['CA Count'] > row['NY Count']:
        return 'CA/ID'
    elif row['ID Count'] == row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID/NY'
    else:
        return 'Same'
    
combo2['Highest Presence'] = combo2.apply(presence, axis=1)
A table of plant families from New York, California, and Idaho shows counts for each family and a new column that names which state has the highest presence of each plant family.
Table with added column to indicate highest presence of species count within each plant family.
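As a quick check before plotting (optional, not part of the original steps), value_counts() summarizes how many plant families fall under each label.

# Optional: see how many plant families fall under each 'Highest Presence' label.
print(combo2['Highest Presence'].value_counts())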

Below is the full table of all plant families in the dataframe.

Family | NY Count | CA Count | ID Count | Highest Presence
0Acanthaceae770CA/NY
1Acarosporaceae1118ID
2Aceraceae462924NY
3Acoraceae524NY
4Actinidiaceae300NY
5Adoxaceae200NY
6Agavaceae4480CA
7Aizoaceae5420CA
8Alismataceae645333NY
9Amaranthaceae788238CA
10Amblystegiaceae6400NY
11Anacardiaceae464722CA
12Andreaeaceae300NY
13Annonaceae300NY
14Anomodontaceae800NY
15Apiaceae190372257CA
16Apocynaceae486042CA
17Aquifoliaceae2130NY
18Araceae26183NY
19Araliaceae2668NY
20Aristolochiaceae2893NY
21Asclepiadaceae506720CA
22Aspleniaceae2376NY
23Asteraceae2,0573,8582,260CA
24Aulacomniaceae700NY
25Azollaceae440CA/NY
26Bacidiaceae110CA/NY
27Balsaminaceae10611ID
28Bartramiaceae1100NY
29Berberidaceae215925CA
30Betulaceae864353NY
31Bignoniaceae9110CA
32Blechnaceae5117CA
33Boraginaceae127478263CA
34Brachytheciaceae5900NY
35Brassicaceae4681,123774CA
36Bruchiaceae400NY
37Bryaceae2220NY
38Buddlejaceae450CA
39Butomaceae202ID/NY
40Buxaceae500NY
41Buxbaumiaceae500NY
42Cabombaceae663CA/NY
43Cactaceae1023878CA
44Callitrichaceae161811CA
45Calycanthaceae720NY
46Campanulaceae7614657CA
47Cannabaceae13810NY
48Cannaceae400NY
49Capparaceae184926CA
50Caprifoliaceae18210781NY
51Caryophyllaceae338506414CA
52Celastraceae24184NY
53Ceratophyllaceae855NY
54Cercidiphyllaceae200NY
55Chenopodiaceae245404245CA
56Cistaceae44270NY
57Cladoniaceae522NY
58Clethraceae400NY
59Climaciaceae500NY
60Clusiaceae521512NY
61Commelinaceae37140NY
62Convolvulaceae6813017CA
63Cornaceae553436NY
64Crassulaceae6223057CA
65Cucurbitaceae41496CA
66Cupressaceae3913030CA
67Cuscutaceae315631CA
68Cyperaceae1,016733663NY
69Dennstaedtiaceae855NY
70Diapensiaceae800NY
71Dicranaceae2000NY
72Dioscoreaceae600NY
73Dipsacaceae17108NY
74Ditrichaceae920NY
75Droseraceae8116CA
76Dryopteridaceae1216571NY
77Ebenaceae770CA/NY
78Elaeagnaceae161012NY
79Elatinaceae6144CA
80Empetraceae1360NY
81Entodontaceae800NY
82Ephemeraceae500NY
83Equisetaceae483651ID
84Ericaceae236310110CA
85Eriocaulaceae730NY
86Euphorbiaceae11123366CA
87Fabaceae6041,855871CA
88Fagaceae96910NY
89Fissidentaceae2211NY
90Flacourtiaceae200NY
91Fontinalaceae900NY
92Fumariaceae312820NY
93Funariaceae2100NY
94Gentianaceae72162122CA
95Geraniaceae348025CA
96Ginkgoaceae200NY
97Grimmiaceae233CA/ID
98Grossulariaceae4615072CA
99Haemodoraceae700NY
100Haloragaceae372820NY
101Hamamelidaceae820NY
102Hippocastanaceae1420NY
103Hippuridaceae222Same
104Hydrangeaceae245314CA
105Hydrocharitaceae454430NY
106Hydrophyllaceae1533689CA
107Hylocomiaceae800NY
108Hymeneliaceae114ID
109Hymenophyllaceae200NY
110Hypnaceae2000NY
111Iridaceae4611426CA
112Isoetaceae393028NY
113Juglandaceae52102NY
114Juncaceae162177143CA
115Juncaginaceae92213CA
116Lamiaceae399413182CA
117Lardizabalaceae200NY
118Lauraceae12110NY
119Lecanoraceae311NY
120Lemnaceae274517CA
121Lentibulariaceae372317NY
122Leucobryaceae200NY
123Liliaceae263741243CA
124Limnanthaceae3363CA
125Linaceae365220CA
126Lycopodiaceae114947NY
127Lygodiaceae200NY
128Lythraceae252616CA
129Magnoliaceae2000NY
130Malvaceae6628462CA
131Marsileaceae21512CA
132Melastomataceae1200NY
133Meliaceae330CA/NY
134Menispermaceae200NY
135Menyanthaceae1073NY
136Mniaceae1000NY
137Molluginaceae493CA
138Monotropaceae193121CA
139Moraceae21165NY
140Myricaceae2250NY
141Najadaceae252110NY
142Nelumbonaceae950NY
143Nyctaginaceae301578CA
144Nymphaeaceae502330NY
145Oleaceae39490CA
146Onagraceae237661314CA
147Ophioglossaceae605348NY
148Orchidaceae285175173NY
149Orobanchaceae227451CA
150Orthotrichaceae1100NY
151Osmundaceae1000NY
152Oxalidaceae705847NY
153Paeoniaceae242CA
154Papaveraceae389826CA
155Parmeliaceae111Same
156Pedaliaceae11258CA
157Phytolaccaceae470CA
158Pinaceae5211365CA
159Plantaginaceae647833CA
160Platanaceae980NY
161Plumbaginaceae16220CA
162Poaceae1,9272,3471,507CA
163Podostemaceae300NY
164Polemoniaceae35637279CA
165Polygalaceae29170NY
166Polygonaceae331917435CA
167Polypodiaceae9149CA
168Polytrichaceae2300NY
169Pontederiaceae17176CA/NY
170Portulacaceae24203120CA
171Potamogetonaceae11583103NY
172Pottiaceae3421NY
173Primulaceae67104102CA
174Pteridaceae1611139CA
175Pyrolaceae546567ID
176Ranunculaceae288434412CA
177Resedaceae780CA
178Rhamnaceae1820220CA
179Rosaceae1,305803609NY
180Rubiaceae15723072CA
181Ruppiaceae11177CA
182Rutaceae18120NY
183Salicaceae284352383ID
184Salviniaceae530NY
185Santalaceae959ID/NY
186Sapindaceae5373CA
187Sarraceniaceae11160CA
188Saururaceae230CA
189Saxifragaceae55219234ID
190Scheuchzeriaceae555Same
191Schistostegaceae202ID/NY
192Schizaeaceae200NY
193Scrophulariaceae4301,146556CA
194Selaginellaceae61514CA
195Sematophyllaceae600NY
196Simaroubaceae474CA
197Smilacaceae2230NY
198Solanaceae11824463CA
199Sparganiaceae242224ID/NY
200Sphagnaceae4231NY
201Staphyleaceae220CA/NY
202Sterculiaceae2300CA
203Styracaceae790CA
204Symplocaceae500NY
205Taxaceae1042NY
206Teloschistaceae111Same
207Tetraphidaceae200NY
208Theliaceae300NY
209Thelypteridaceae26913NY
210Thuidiaceae300NY
211Thymelaeaceae620NY
212Tiliaceae2200NY
213Trapaceae500NY
214Tropaeolaceae220CA/NY
215Typhaceae6105CA
216Ulmaceae231910NY
217Urticaceae526132CA
218Valerianaceae135927CA
219Verbenaceae531267CA
220Violaceae17612079NY
221Viscaceae12446CA
222Vitaceae55184NY
223Vittariaceae200NY
224Xyridaceae1500NY
225Zannichelliaceae555Same
226Zosteraceae580CA
227Zygophyllaceae5315CA
228Aloaceae020CA
229Aponogetonaceae020CA
230Arecaceae0150CA
231Basellaceae040CA
232Bataceae020CA
233Burseraceae020CA
234Caulerpaceae020CA
235Crossosomataceae0189CA
236Cyatheaceae030CA
237Cymodoceaceae050CA
238Datiscaceae020CA
239Elaeocarpaceae040CA
240Ephedraceae0160CA
241Fouquieriaceae030CA
242Frankeniaceae060CA
243Garryaceae0110CA
244Gracilariaceae020CA
245Gunneraceae030CA
246Halymeniaceae020CA
247Krameriaceae080CA
248Lennoaceae060CA
249Loasaceae08429CA
250Melianthaceae020CA
251Myoporaceae020CA
252Myrtaceae0320CA
253Parkeriaceae040CA
254Passifloraceae060CA
255Pittosporaceae080CA
256Punicaceae020CA
257Rafflesiaceae020CA
258Scouleriaceae033CA/ID
259Simmondsiaceae040CA
260Stereocaulaceae020CA
261Tamaricaceae0123CA
262Ulvaceae030CA
263Verrucariaceae002ID

Visualizing the Data

I create a count plot using seaborn to show which states, or state combinations, have the highest variety within each plant family.

sb.countplot(data=combo2, x='Highest Presence', color="#B6D1BE", order=combo2['Highest Presence'].value_counts().index)
# Label each bar with its count.
cat_counts = combo2['Highest Presence'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    plt.text(loc, count + 5, str(count), ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Highest Concentration of Plant Families by State', fontsize=14, y=1.05)
plt.ylim(0, 140)
sb.despine();
A vertical bar graph shows the highest concentration of plant families organized by state, where the concentrations are high to low as follows: California, New York, California/New York, Idaho, all the states tie, Idaho/New York, and California/Idaho.
Shows the count of plant families with the highest concentration in each state.
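If the chart needs to be saved rather than only displayed inline, matplotlib’s savefig can be called in the same cell, before the figure is shown (hypothetical filename).

# Optional: save the finished plot to disk (hypothetical filename).
plt.savefig('plant_families_by_state.png', dpi=150, bbox_inches='tight')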

Further Considerations

There are many factors that play into plant family diversity. The comparison of plant families in New York, California, and Idaho was purely out of curiosity. Further investigations should take into account each state’s ecosystem types, land usage, and land ownership, all of which may influence species diversity.

How to Mindfully Summarize Data Insights

How to Mindfully Summarize Data Insights

It can be difficult to transparently present key insights in a world saturated with fake news and click-bait. Study results are often distilled into share-worthy titles that get read and taken as gospel without anyone actually reading further, so be conscious of how you choose to share data findings. Here are a few ways to stay humble while summarizing your data, based on a headline from the Girlgaze newsletter about an article from Gay Times.

Shot of newsletter article summary from Girlgaze.
Screenshot of Girlgaze newsletter from June 29, 2019 with added underlining and highlighting.
  1. Avoid sweeping generalizations that encourage assumptions about entire populations. This headline makes a claim about the sexual identity of all young people rather than the small portion of youth who were actually surveyed. A quick change would be to add ‘Surveyed’ after ‘Youth’ to reinforce that the finding comes from a limited sample (‘over 2,000 adults‘).
  2. Name the party or parties responsible for sponsoring the collection of data. This communication successfully names the commissioner of the study early in the description. This allows the audience to have full transparency about involved parties who may influence a study’s outcomes.
  3. Avoid the use of definitive language. The last sentence in this quick summary is too precise in saying ‘clear indication‘ and should say what the ‘much more fluid perspective‘ is in comparison to. Studies use a sample from a population to provide meaningful insights. Samples are not meant to determine a definitive stance for every member of a population. Surveys take into account an array of potential variables, and language such as ‘clear indication‘ implies the researchers have accounted for every possible source of bias.

Other ways to communicate transparently with an audience include sharing links to raw data, naming potential sources of error, and making suggestions for future method improvements. Providing an audience with every opportunity to explore your data and understand methods empowers people to consume insights responsibly.