This month Women Who Code (WWC) held their CONNECT Empower conference. It was a one-day virtual event with different technical sessions and social talks. I had not been able to attend the conference in previous years but highly recommend it. Women Who Code is a global non-profit organization that promotes women in technology and has resources for career-building, networking, and continuous learning. Their CONNECT Empower event incorporates sessions that represent the group’s mission goals.
The event is formatted in two parts to cater to attendees across different time zones. I opted for talks in the morning and afternoon in EST, but for those who cannot attend, sessions are recorded and can be found on WWC’s YouTube channel. Certain parts of the event, like the virtual career fair booths and networking, are draws for attending in real time as well. Compared to conferences I have attended previously, it impressed me that WWC chooses its speakers based on anonymous proposals to eliminate bias tied to background, job title, or years of experience.
Some of my main takeaways from the conference were:
Work to celebrate and help fellow technologists. Having a sense of community in the tech field can help build confidence and dispel feelings of imposter syndrome.
Practicing consistency with learning goals, like the 100 Days of Code challenge, can be helpful for progressively building programming skills.
Inclusion and access were two of the major themes for the WWC keynote, and both are core values worth building upon across organizations. This can be in the form of open-for-all events or content.
Be open to asking for feedback from technical interviews.
Don’t worry if you think a question you need to ask is stupid; it’s better to own up to not knowing something and be open to learning.
It can be helpful to expand your interests outside your main area of focus in order to increase adaptability.
Overall, the WWC CONNECT Empower event is a great opportunity to meet fellow tech-minded folks and attend talks that encourage further skill-building and a greater sense of community.
In fall 2021, I started Georgia Tech’s Online Master of Science in Analytics (OMSA). Here are some thoughts so far on the courses I’ve taken and my overall experience as I head into my sixth class of the program.
OMSA Program Background
Georgia Tech’s OMSA program is one of a few well-known online graduate programs in the data community. As data science and analytics become more mainstream and academia leans further into online curricula, I assume similar program offerings will continue to grow. I initially heard about Georgia Tech’s Master’s in Analytics through my brother, who knew about their online cybersecurity program. My main deciding factors were the online format and the price tag of about $10,000 USD. I completely support building data skills through free MOOCs (shoutout to Codecademy and O’Reilly, both of which I still find to be great resources). However, I figured the structure, schedule, and breadth of what Georgia Tech offered would keep me accountable in my studies. There are three different tracks for coursework: Analytical Tools, Business Analytics, and Computational Data Analytics. I am in the Analytical Tools track.
Class Reviews
CSE 6040: Computing for Data Analysis – Fall Semester 2021
This class works primarily in Python to illustrate computing concepts. The homework assignments and tests were auto-graded Jupyter notebooks and were open book and open internet. The instant feedback was helpful, but overall I found the time restrictions for the midterm and final to be challenging. Creating a comprehensive Python file with code snippets from the whole class was helpful for quick searches on the final. There were some optional course items, like a project that could be submitted for extra credit. Aside from some difficulties with overthinking, the class was fairly enjoyable for the range of information it covered.
ISYE 6501: Introduction to Analytics Modeling – Fall Semester 2021
ISYE 6501 is essentially a whirlwind tour of analytics modeling concepts through R. Homework assignments are peer-graded and quizzes allow for a restricted number of note sheets (1 or 2 pages). This was a solid refresher in R and I found the content helpful for other classes like Regression Analysis. The course provided a decent background to the theory and troubleshooting involved in real world analytics problems.
MGT 8803: Business Fundamentals for Analytics – Spring Semester 2022
The class covered as many business concepts as possible, including marketing, accounting, and supply chain optimization. Different professors guided the modules, so it was interesting to have that mix across the course. The level of straight memorization required to succeed on graded assignments was a bit much for me. Unfortunately, this was a required course, so I had to deal with the evaluation style whether or not it suited me.
Regression Analysis covered a breadth of model building and illustrated the underlying concepts behind each approach. The course uses R, and the material included cleaning and transforming data, variable selection, and linear and logistic regression. The homework assignments were fairly simple, and I found having code snippets prepared in one file for the open-book portion of exams to be helpful.
CSE 6242: Data and Visual Analytics – Fall Semester 2022
As one of the advanced requirements for the OMSA program, this course had a rumored learning curve that made me a little hesitant. The class was a grand tour of dabbling in different languages and programs like Python, SQL, Spark, and D3. A course project is a huge component of this class, and it was nice to collaborate with classmates and translate what we learned into something tangible. The homework assignments were a major bummer only because you could have a solution that looked exactly like the expected answer but still get zero or minimal points from the auto-grader. There are no explicit homework solutions, and I struggled to understand what exactly needed to be corrected when I did not get full credit. Luckily, there were ample opportunities for extra credit to make up for any missed points from the homework.
ISYE 7406: Data Mining and Statistical Learning – Spring Semester 2023
I’m currently a couple weeks into the course for the Spring 2023 semester but so far the blend of theoretical background for statistics and practical R analysis has been manageable.
Balancing Professional, Academic, and Social Obligations
I attended a meetup in December 2022 at the main Atlanta campus for the Analytics and Cybersecurity programs and reflected on my personal experience after speaking with fellow students and alumni. There seemed to be a decent mix of professional backgrounds and mostly everyone I spoke with also worked full time for the duration of the program.
Personally, I have found that one course a semester has worked best for me. The only semester I doubled up was my first semester for two of the required core classes: ISYE 6501 and CSE 6040. In hindsight I think I would have been better off just taking one course instead of constantly feeling like I was flipping back and forth between the two.
For time spent on homework and studying, I have found that chipping away a little bit every day has given me the best results thus far. It tends to keep concepts fresh, as opposed to taking multi-day breaks between learning material. There have still been times when I have taken breaks for travel or vacation, but the module release schedules in many courses allow some leniency to front-load assignments and learning.
Overall OMSA Experience
I’ve been pleasantly surprised in the program so far but still have plenty of days where I have the same frustrations that anyone else might with learning new material. The practical translation of concepts from the courses seems to be the main draw and highlight for folks in the OMSA program. I look forward to continuing to build skills in future classes. I am likely taking MGT 6203 (Data Analytics in Business) for summer 2023 and then ISYE 6669 (Deterministic Optimization) for fall 2023.
A sampler data package called lterdatasampler from the Long Term Ecological Research (LTER) program allows anyone to work with some neat environmental data. Data samples include weights for bison, fiddler crab body size, and meteorological data from a field station. The package homepage gives suggestions for modeling relationships, such as linear relationships and time series analysis. This will go over simple linear regression using R statistical software and a sugar maple dataset from the lterdatasampler package.
The data was collected at Hubbard Brook Experimental Forest in New Hampshire, which I am partial to, having collected and analyzed data there during an undergraduate internship.
Background
The sugar maple data comes from a study and paper from Stephanie Juice and Tim Fahey from Cornell University on the ‘Health of Sugar Maple (Acer saccharum) Seedlings in Response to Calcium Addition (2003-2004), Hubbard Brook LTER’. The data summary page points out the leaf samples were collected in transects from a watershed treated with calcium, and reference watershed sites. The data sample is 359 rows with the following 11 variables: year, watershed, elevation, transect, sample, stem_length, leaf1area, leaf2area, leaf_dry_mass, stem_dry_mass, and corrected_leaf_area.
R Code
The code that follows can be found in an associated GitHub repository here in an R-Markdown file. As mentioned above, the sugar maple data was chosen for simple linear regression due to the linear relationship noted from the data package site.
First, install the lterdatasampler package. If this is your first usage, use the following:
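A minimal example, assuming installation from CRAN (the package can also be installed from the LTER GitHub repository):
install.packages("lterdatasampler")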
Once the package is installed, call it using the library function and load the car (Companion to Applied Regression) package and caTools (Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc) package for later use.
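Based on the packages named above, the calls look like:
library(lterdatasampler)
library(car)
library(caTools)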
For background to the study, I plotted the corrected leaf area (in centimeters squared) between the reference and calcium treated Watershed 1. This shows the overall differences in samples collected between each area. This simple linear regression analysis will look at the overall measurements without respect to watershed, but this boxplot can help give an idea of the spread of data.
plot(hbr_maples$watershed, hbr_maples$corrected_leaf_area,
xlab='Watershed',
ylab='Corrected Leaf Area (cm^2)',
main='Sugar Maple Leaf Area for Watershed Samples')
Next, I created scatterplots between the corrected leaf area and stem length, stem dry mass, and leaf dry mass. These plots are meant to show a preliminary relationship between these variables to confirm a linear relationship appears to be proper to explore further.
par(mfrow=c(1,3))
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$stem_length,
xlab='Leaf Area (cm^2)',
ylab='Stem Length (mm)')
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$stem_dry_mass,
xlab='Leaf Area (cm^2)',
ylab='Stem Dry Mass (g)')
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$leaf_dry_mass,
xlab='Leaf Area (cm^2)',
ylab='Leaf Dry Mass (g)')
title("Scatterplots of Stem Length, Stem Dry Mass, and Leaf Dry Mass", line = -1, outer = TRUE)
I decided to work with the leaf dry mass as the predicting variable and corrected leaf area as the response for this regression. As a precaution, I check to see if there are null or NA values in the variable I plan on using for the response variable. Missing values here would not be helpful in training the model so those can be found and removed.
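A sketch of this check and removal, assuming the cleaned data frame is named hbr_maples_cleaned to match the code that follows:
# Count missing values in the response variable
sum(is.na(hbr_maples$corrected_leaf_area))
# Keep only rows with a recorded corrected leaf area
hbr_maples_cleaned <- hbr_maples[!is.na(hbr_maples$corrected_leaf_area), ]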
The preliminary graphs above seem to show at least some indication of an outlier. That can be checked with another quick plot and summary statistics for the leaf dry mass variable.
plot(hbr_maples_cleaned$leaf_dry_mass)
summary(hbr_maples_cleaned$leaf_dry_mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01170 0.03540 0.04745 0.05169 0.06105 0.38700
The scatterplot shows one value in particular that can be removed. Find the respective index and then drop it from a newly declared variable.
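One way to do this, assuming the largest leaf dry mass value is the outlier seen in the plot and using hbr_maples_final as a hypothetical name for the result:
# Index of the largest leaf_dry_mass value, then drop that row
outlier_index <- which.max(hbr_maples_cleaned$leaf_dry_mass)
hbr_maples_final <- hbr_maples_cleaned[-outlier_index, ]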
Split the data into training and test sets for later evaluation using the split function from caTools. With such a small dataset, the type of accuracy we will generate will not be reliable, but I would like to show these steps as good practice for larger datasets. I chose a 70% for training and 30% for test split. This splits as 167 records for the training set and 72 records for the test set.
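A sketch of the split with the sample.split() function from caTools; the seed and data frame names are assumptions, while train_data and test_data match the model and prediction code later on:
set.seed(123)  # assumed seed, only for reproducibility
split <- sample.split(hbr_maples_final$corrected_leaf_area, SplitRatio = 0.7)
train_data <- subset(hbr_maples_final, split == TRUE)
test_data <- subset(hbr_maples_final, split == FALSE)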
Create a simple linear regression model to generate the corrected leaf area based on the leaf dry mass using the lm() function. Preview the model output using summary().
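Based on the model call shown in the summary output below:
leafarea_model <- lm(corrected_leaf_area ~ leaf_dry_mass, data = train_data)
summary(leafarea_model)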
##
## Call:
## lm(formula = corrected_leaf_area ~ leaf_dry_mass, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5812 -1.7899 -0.3127 1.5761 6.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3121 0.5963 15.62 <2e-16 ***
## leaf_dry_mass 350.8474 11.0620 31.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 165 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.8582
## F-statistic: 1006 on 1 and 165 DF, p-value: < 2.2e-16
Check the confidence interval of the linear regression model using the confint() function. This shows at the default of 95% confidence, the coefficient for leaf dry mass is between about 329 and 373.
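For example:
confint(leafarea_model)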
Next, plot the residuals from the model against the fitted values. Declare variables for each and then plot in a scatterplot. The variance among the plotted values appears generally constant, but with some widely spread-out points as the fitted values increase.
leafarea_resids <- residuals(leafarea_model)
leafarea_fitted <- leafarea_model$fitted
plot(leafarea_fitted, leafarea_resids,
main='Residuals vs Fitted Values of Leaf Area Model',
xlab='Fitted Values',
ylab='Residual Values')
lines(lowess(leafarea_fitted, leafarea_resids), col='red')
Plot a histogram of the residuals from the model to check whether they are normally distributed. A roughly bell-shaped distribution of residuals would support the normality assumption for linear regression.
par(mfrow=c(1,2))
hist(leafarea_resids,main="Histogram of Residuals",xlab="Residuals")
qqnorm(leafarea_resids)
We can see from the histogram of residuals that the data might benefit from a transformation to give the errors a more normal distribution.
To identify outliers that may influence the model, we can use Cook’s distance. This shows which data points could potentially be influencing the model and points them out for potential removal. This provided the index of 175 as an outlier that may need to be removed.
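A sketch of this check; the 4/n cutoff is a common rule of thumb, not necessarily the one used originally:
cooks_d <- cooks.distance(leafarea_model)
plot(cooks_d, type = "h", main = "Cook's Distance for Leaf Area Model")
which(cooks_d > 4 / length(cooks_d))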
Then, we can locate the row that held the outlier and remove it. Overall, the handling of outliers depends on the goals for regression analysis and sometimes they are better to keep in.
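One way to do this, assuming 175 refers to the row name flagged above and matching the train_data_cleaned name used for the next model:
train_data_cleaned <- train_data[rownames(train_data) != "175", ]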
Create a new linear regression model on the training data with the outlier removed. We can see from the summary output that the model is the same as above, so removing the outlier did not have an effect.
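A sketch of the refit, with leafarea_model2 as a hypothetical name:
leafarea_model2 <- lm(corrected_leaf_area ~ leaf_dry_mass, data = train_data_cleaned)
summary(leafarea_model2)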
##
## Call:
## lm(formula = corrected_leaf_area ~ leaf_dry_mass, data = train_data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5812 -1.7899 -0.3127 1.5761 6.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3121 0.5963 15.62 <2e-16 ***
## leaf_dry_mass 350.8474 11.0620 31.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 165 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.8582
## F-statistic: 1006 on 1 and 165 DF, p-value: < 2.2e-16
Check to see if the model would benefit from a Box-Cox transformation to improve the model’s fit. This can help better meet the assumptions of simple linear regression, such as residual distribution. The lambda value given will dictate the transformation action that is suggested. From the output below, the optimal lambda value rounded to the nearest 0.5 would be 1. This means no transformation is suggested.
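A minimal sketch using boxCox() from the car package loaded earlier, carrying over the hypothetical model name from the refit above; the optimal lambda can be read from the resulting plot:
boxCox(leafarea_model2, lambda = seq(-2, 2, by = 0.1))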
The linear regression equation we can construct from the model is:
Corrected Leaf Area (cm^2) = 9.3121 + 350.8474 * (Leaf Dry Mass (g))
The Multiple R-squared value from the summary means about 86% of the variance in corrected leaf area can be explained by the model.
Last, use the test data to check the model’s performance by using the predict() function and calculate the mean squared prediction error (MSPE). The MSPE is about 9.347519 and that value can be used as a performance comparison if other models are created.
pred_test <- predict(leafarea_model, test_data)
mse.model <- mean((pred_test-test_data$corrected_leaf_area)^2)
cat("The mean squared prediction error is",mse.model,"\n")
## The mean squared prediction error is 9.347519
Overview
This is a simple example of linear regression, and more practice can be gained with lm() when looking at other variables in the data. Given the linear relationship, multiple linear regression models can be good to explore as well after working through standard variable selection processes. There is plenty more to discover in the other lterdatasampler datasets.
Source
Juice, S. and T. Fahey. 2019. Health and mycorrhizal colonization response of sugar maple (Acer saccharum) seedlings to calcium addition in Watershed 1 at the Hubbard Brook Experimental Forest ver 3. Environmental Data Initiative. https://doi.org/10.6073/pasta/0ade53ede9a916a36962799b2407097e
Here’s how to use Python and pandas to explore species data for the United States National Parks to find the average species richness and the distribution of species categories. This goes over some of the built-in functions in pandas and how to use them for exploratory data analysis. The source data is available via Kaggle, or the National Parks Species website. The associated GitHub repository, which holds the two Jupyter Notebooks with the code below, is available here.
Setup
Import the necessary packages, including pandas, matplotlib, seaborn, and math.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import math
Load the species dataset and the parks dataset with pandas. Given the size and that I used Jupyter Notebook for this, I give the low memory argument a value of False for the species data so it loads without too much trouble. The dataframes can be given any name.
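A sketch of the loading step, with the file names assumed from the Kaggle dataset:
species_data = pd.read_csv('species.csv', low_memory=False)
parks_data = pd.read_csv('parks.csv')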
Preview the datasets with the head() and info() pandas functions. The columns of focus will be park name and category for the species dataframe, and park name and acres for the parks dataframe.
species_data.head()
species_data.info()
parks_data.head()
Species Richness
Species richness is the quantification of species in a given area and can be a helpful metric for biodiversity.
For this, we’ll want to merge the species data and the parks data. The species data will need to be re-formatted beforehand.
Out of the box, the species data has one record per type of species per row. This will need to be turned into individual counts per column to get the number of species in each park. Use the groupby() function in pandas specifying the park name column, and then use count().
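A minimal sketch, assuming the park name column is labeled 'Park Name':
species_counts = species_data.groupby('Park Name').count()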
The parks data will need the names set as the index as well. Then, the two dataframes will be in good shape to merge. Specify the two dataframes, and give the left_index and right_index arguments values of true because they are the same between each.
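For example, continuing with the assumed column name and matching the richness_data name previewed below:
parks_indexed = parks_data.set_index('Park Name')
richness_data = parks_indexed.merge(species_counts, left_index=True, right_index=True)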
Preview the newly merged dataframe to confirm it looks correct.
richness_data.head()
To take this one step further, estimate the number of species per acre using the species counts. Create a function to take the species count column and acres column and divide them, and normalize the result to an integer with math’s floor() function.
Calculate the mean of species per acre in all the parks. There are a lot of variables to consider that might affect this number, but it serves its purpose as a quick summary statistic to give us about 525 species per acre.
print(richness_data['Species Abundance'].mean())
Species Type Distribution
Species distribution in this setting will be the ratio of species types throughout each park. The different species categories in this data set are: ‘Mammal’, ‘Bird’, ‘Reptile’, ‘Amphibian’, ‘Fish’, ‘Vascular Plant’, ‘Spider/Scorpion’, ‘Insect’, ‘Invertebrate’, ‘Fungi’, ‘Nonvascular Plant’, ‘Crab/Lobster/Shrimp’, ‘Slug/Snail’, ‘Algae’.
To extract just the park names and categories, create a new dataframe with just these columns.
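For example, assuming the columns are labeled 'Park Name' and 'Category':
category_data = species_data[['Park Name', 'Category']]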
Use pandas’ groupby() function to group by park name and category. Then, use the size() function to get the number of rows per group and unstack the result, which creates a column for each unique category value. I give the fill_value argument of unstack a value of 0 to keep any NaN values consistent for math operations.
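A sketch of this step, continuing from the hypothetical category_data dataframe above:
category_counts = category_data.groupby(['Park Name', 'Category']).size().unstack(fill_value=0)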
Next, take the counts from each and put all on the same scale of out of 100. Use pandas’ div() function to divide based on the sum of the dataframe per each row. Then, multiply each value by 100 for better readability and translation for visuals.
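For example, with ratios matching the rounding step below:
ratios = category_counts.div(category_counts.sum(axis=1), axis=0) * 100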
To clean the dataframe values further, round each variable to 2 decimal places using Python’s round() function.
rounded = ratios.round(2)
rounded.head()
Create a box plot using matplotlib and seaborn to show the breakdown of species categories across all parks.
f, ax = plt.subplots(figsize=(12, 8))
sb.boxplot(data=rounded, orient='h')
ax.xaxis.grid(True)
ax.set(ylabel="")
ax.set(xlim=(0,100))
sb.despine(trim=True, left=True)
plt.title('Species Category Distribution in the National Parks', fontsize=16)
plt.show()
Optionally, export the species categories dataframe as a CSV for further use.
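For example, with a hypothetical file name:
rounded.to_csv('species_category_ratios.csv')
The resulting percentages per park are shown in the table below.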
Park Name | Algae | Amphibian | Bird | Crab/Lobster/Shrimp | Fish | Fungi | Insect | Invertebrate | Mammal | Nonvascular Plant | Reptile | Slug/Snail | Spider/Scorpion | Vascular Plant
Acadia National Park | 0.0 | 0.88 | 21.3 | 0.0 | 2.22 | 0.0 | 0.0 | 0.0 | 3.22 | 0.0 | 0.64 | 0.0 | 0.0 | 71.74
Arches National Park | 0.0 | 0.76 | 19.56 | 0.0 | 1.05 | 0.0 | 0.0 | 0.0 | 5.63 | 0.0 | 1.91 | 0.0 | 0.0 | 71.09
Badlands National Park | 0.0 | 0.72 | 17.21 | 0.0 | 1.73 | 12.46 | 17.21 | 0.07 | 4.61 | 0.0 | 0.94 | 0.0 | 0.07 | 45.0
Big Bend National Park | 0.0 | 0.57 | 18.29 | 0.0 | 2.34 | 0.0 | 0.0 | 0.0 | 3.92 | 2.12 | 2.73 | 0.0 | 0.0 | 70.03
Biscayne National Park | 0.0 | 0.46 | 13.5 | 0.0 | 47.39 | 0.0 | 0.64 | 1.97 | 1.62 | 0.0 | 2.32 | 0.0 | 0.0 | 32.1
Black Canyon of the Gunnison National Park | 0.0 | 0.18 | 15.82 | 0.0 | 1.45 | 0.0 | 0.0 | 0.0 | 6.06 | 0.0 | 0.99 | 0.0 | 0.0 | 75.5
Bryce Canyon National Park | 0.0 | 0.31 | 16.87 | 0.0 | 0.08 | 0.0 | 0.0 | 0.0 | 5.91 | 0.0 | 1.01 | 0.0 | 0.0 | 75.82
Canyonlands National Park | 0.0 | 0.57 | 17.99 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 6.21 | 0.0 | 1.8 | 0.0 | 0.0 | 70.73
Capitol Reef National Park | 0.0 | 0.38 | 15.84 | 0.0 | 0.96 | 0.0 | 0.0 | 0.0 | 4.66 | 0.0 | 1.34 | 0.0 | 0.0 | 76.82
Carlsbad Caverns National Park | 0.0 | 0.98 | 23.89 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 5.99 | 0.0 | 4.04 | 0.0 | 0.0 | 64.78
Channel Islands National Park | 3.24 | 0.21 | 18.94 | 0.58 | 14.48 | 0.0 | 0.11 | 10.4 | 2.33 | 3.61 | 0.58 | 1.75 | 0.05 | 43.71
Congaree National Park | 3.19 | 1.85 | 8.62 | 0.26 | 2.8 | 12.02 | 26.58 | 0.65 | 1.68 | 0.3 | 2.15 | 0.9 | 0.9 | 38.09
Crater Lake National Park | 5.8 | 0.53 | 6.7 | 1.06 | 0.35 | 5.11 | 26.44 | 1.81 | 2.55 | 5.13 | 0.53 | 0.24 | 0.45 | 43.3
Cuyahoga Valley National Park | 0.0 | 1.24 | 12.67 | 0.41 | 4.38 | 0.0 | 11.7 | 1.29 | 2.42 | 0.0 | 1.18 | 0.77 | 0.1 | 63.83
Death Valley National Park | 1.35 | 1.6 | 11.96 | 0.86 | 0.2 | 2.77 | 20.34 | 0.41 | 4.78 | 0.5 | 2.03 | 1.62 | 0.72 | 50.87
Denali National Park and Preserve | 0.0 | 0.08 | 13.56 | 0.0 | 1.06 | 2.2 | 3.86 | 0.23 | 3.26 | 12.05 | 0.0 | 0.0 | 0.0 | 63.71
Dry Tortugas National Park | 0.0 | 0.0 | 33.37 | 0.0 | 33.14 | 0.0 | 0.0 | 4.95 | 0.71 | 0.0 | 0.59 | 0.0 | 0.0 | 27.24
Everglades National Park | 0.0 | 0.82 | 17.75 | 0.0 | 20.44 | 0.0 | 0.0 | 0.0 | 2.02 | 0.0 | 2.93 | 0.0 | 0.0 | 56.05
Gates Of The Arctic National Park and Preserve | 0.0 | 0.07 | 9.9 | 0.0 | 1.26 | 35.03 | 0.0 | 0.0 | 2.88 | 0.0 | 0.0 | 0.0 | 0.0 | 50.85
Glacier Bay National Park and Preserve | 3.27 | 0.26 | 13.18 | 4.91 | 18.34 | 0.1 | 1.79 | 5.98 | 2.96 | 5.01 | 0.15 | 1.89 | 0.15 | 42.0
Glacier National Park | 0.08 | 0.23 | 10.84 | 0.23 | 1.06 | 10.8 | 7.71 | 0.08 | 2.7 | 15.81 | 0.16 | 0.78 | 0.0 | 49.53
Grand Canyon National Park | 0.0 | 0.57 | 17.39 | 0.0 | 1.11 | 0.0 | 2.17 | 0.04 | 4.04 | 0.0 | 2.9 | 0.08 | 5.42 | 66.29
Grand Teton National Park | 0.05 | 0.34 | 13.1 | 0.05 | 1.03 | 1.38 | 7.59 | 0.25 | 3.65 | 0.0 | 0.25 | 0.2 | 0.05 | 72.07
Great Basin National Park | 0.0 | 0.83 | 12.44 | 0.23 | 0.79 | 0.57 | 17.19 | 1.09 | 3.88 | 0.6 | 2.19 | 1.13 | 0.68 | 58.39
Great Sand Dunes National Park and Preserve | 0.0 | 0.63 | 25.21 | 0.0 | 0.63 | 0.0 | 0.11 | 0.0 | 7.14 | 0.11 | 0.84 | 0.0 | 0.0 | 65.34
Great Smoky Mountains National Park | 0.0 | 0.92 | 4.11 | 0.15 | 1.62 | 9.54 | 36.45 | 1.27 | 1.42 | 7.97 | 0.77 | 1.39 | 1.57 | 32.83
Guadalupe Mountains National Park | 0.0 | 0.69 | 15.58 | 0.29 | 0.17 | 3.78 | 6.59 | 0.4 | 4.35 | 0.29 | 3.21 | 3.32 | 0.11 | 61.23
Haleakala National Park | 0.0 | 0.12 | 1.71 | 0.7 | 0.23 | 2.56 | 41.12 | 1.43 | 0.58 | 8.8 | 0.39 | 1.82 | 1.78 | 38.76
Hawaii Volcanoes National Park | 0.0 | 0.12 | 2.37 | 1.7 | 0.12 | 0.21 | 43.33 | 5.25 | 0.45 | 4.09 | 0.39 | 2.15 | 3.06 | 36.75
Hot Springs National Park | 1.23 | 1.38 | 19.85 | 0.46 | 4.62 | 0.0 | 0.77 | 1.13 | 2.67 | 0.92 | 2.67 | 0.1 | 0.0 | 64.21
Isle Royale National Park | 0.0 | 0.93 | 18.68 | 1.0 | 4.51 | 0.0 | 0.0 | 1.5 | 1.86 | 0.07 | 0.36 | 0.0 | 0.0 | 71.08
Joshua Tree National Park | 0.22 | 0.22 | 13.12 | 0.0 | 0.04 | 1.66 | 14.91 | 0.44 | 2.92 | 0.26 | 2.27 | 0.0 | 6.67 | 57.28
Katmai National Park and Preserve | 0.0 | 0.08 | 18.12 | 0.33 | 3.59 | 8.57 | 0.57 | 2.37 | 4.41 | 1.63 | 0.0 | 0.57 | 0.0 | 59.76
Kenai Fjords National Park | 0.0 | 0.09 | 22.23 | 0.0 | 3.97 | 0.19 | 0.28 | 3.97 | 5.2 | 0.0 | 0.0 | 0.38 | 0.0 | 63.67
Kobuk Valley National Park | 0.0 | 0.1 | 12.2 | 0.0 | 2.54 | 32.49 | 0.0 | 0.0 | 3.71 | 2.54 | 0.0 | 0.0 | 0.0 | 46.44
Lake Clark National Park and Preserve | 0.0 | 0.05 | 9.62 | 0.05 | 2.74 | 10.96 | 0.15 | 0.3 | 2.49 | 14.0 | 0.0 | 0.0 | 0.0 | 59.64
Lassen Volcanic National Park | 0.11 | 0.95 | 13.63 | 0.83 | 1.11 | 2.84 | 5.29 | 1.0 | 5.56 | 8.9 | 1.22 | 0.33 | 0.0 | 58.21
Mammoth Cave National Park | 0.0 | 1.32 | 8.4 | 0.04 | 4.88 | 0.0 | 10.8 | 2.16 | 2.24 | 0.04 | 1.68 | 0.0 | 0.08 | 68.35
Mesa Verde National Park | 0.0 | 0.64 | 19.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.92 | 0.0 | 1.69 | 0.0 | 0.0 | 71.68
Mount Rainier National Park | 0.0 | 0.92 | 10.73 | 0.0 | 1.32 | 1.78 | 2.7 | 0.0 | 3.96 | 20.48 | 0.29 | 0.0 | 0.0 | 57.83
North Cascades National Park | 0.0 | 0.36 | 6.72 | 0.0 | 0.98 | 16.03 | 16.62 | 0.0 | 2.35 | 11.45 | 0.3 | 0.0 | 1.25 | 43.95
Olympic National Park | 0.0 | 0.82 | 15.91 | 0.0 | 4.98 | 0.0 | 4.47 | 0.0 | 4.11 | 0.0 | 0.31 | 0.0 | 0.0 | 69.4
Petrified Forest National Park | 0.0 | 0.94 | 28.6 | 0.0 | 0.0 | 0.12 | 0.0 | 0.0 | 7.27 | 0.12 | 2.46 | 0.0 | 0.0 | 60.49
Pinnacles National Park | 0.0 | 0.71 | 12.01 | 0.56 | 0.42 | 1.91 | 22.81 | 2.19 | 4.24 | 0.0 | 2.05 | 0.78 | 0.71 | 51.62
Redwood National Park | 1.76 | 0.52 | 7.94 | 1.93 | 3.91 | 21.6 | 11.79 | 5.29 | 2.44 | 3.98 | 0.62 | 2.33 | 0.11 | 35.77
Rocky Mountain National Park | 4.76 | 0.16 | 8.79 | 1.24 | 0.38 | 9.71 | 21.45 | 1.52 | 2.35 | 13.2 | 0.1 | 0.32 | 0.7 | 35.34
Saguaro National Park | 0.0 | 0.55 | 13.41 | 0.05 | 0.0 | 0.11 | 0.0 | 0.0 | 5.56 | 0.0 | 3.38 | 0.0 | 0.0 | 76.94
Sequoia and Kings Canyon National Parks | 0.0 | 0.65 | 11.03 | 0.0 | 0.95 | 0.0 | 0.0 | 0.0 | 4.46 | 0.0 | 1.2 | 0.0 | 0.0 | 81.7
Shenandoah National Park | 0.0 | 0.86 | 5.76 | 0.06 | 0.88 | 16.43 | 6.83 | 0.04 | 1.35 | 7.52 | 0.82 | 0.04 | 0.09 | 59.31
Theodore Roosevelt National Park | 0.0 | 0.69 | 19.14 | 0.09 | 2.75 | 0.09 | 6.27 | 0.26 | 5.67 | 5.24 | 1.12 | 0.17 | 0.17 | 58.37
Voyageurs National Park | 0.0 | 1.03 | 16.38 | 0.0 | 3.99 | 0.21 | 2.27 | 0.48 | 4.34 | 0.76 | 0.41 | 0.0 | 0.0 | 70.13
Wind Cave National Park | 0.0 | 0.5 | 16.85 | 0.0 | 0.57 | 3.08 | 7.53 | 0.0 | 6.38 | 0.0 | 0.86 | 1.79 | 0.0 | 62.44
Wrangell - St Elias National Park and Preserve | 0.0 | 0.11 | 11.75 | 0.0 | 5.18 | 0.0 | 3.23 | 0.22 | 3.45 | 0.39 | 0.06 | 0.06 | 0.0 | 75.56
Yellowstone National Park | 5.14 | 0.23 | 8.32 | 1.59 | 0.48 | 0.28 | 41.05 | 1.97 | 1.97 | 0.38 | 0.23 | 1.49 | 1.08 | 35.8
Yosemite National Park | 0.0 | 0.72 | 12.93 | 0.0 | 0.48 | 0.0 | 0.0 | 0.0 | 4.21 | 0.0 | 1.05 | 0.0 | 0.0 | 80.6
Zion National Park | 0.0 | 0.39 | 16.76 | 0.0 | 0.84 | 0.0 | 0.0 | 0.0 | 4.45 | 0.0 | 1.67 | 0.0 | 0.0 | 75.89
Summary
Further exploration of the species category and count outputs might involve comparing a select number of parks against each other. Unique factors such as location, park size, and biomes provide opportunities for further analysis and insights.
Here is a brief overview of how to use the Python package Natural Language Toolkit (NLTK) for sentiment analysis with Amazon food product reviews. This is a basic way to use text classification on a dataset of words to help determine whether a review is positive or negative. The following is a snippet of a more comprehensive tutorial I put together for a workshop for the Syracuse Women in Machine Learning and Data Science group.
Data
The data for this tutorial comes from the Grocery and Gourmet Food Amazon reviews set from Jianmo Ni found at Amazon Review Data (2018). Out of the review categories to choose from, this set seemed like it would have a diverse range of people’s sentiment about food products. The data set itself is fairly large, so I use a smaller subset of 20,000 reviews in the example below.
Steps to clean the main data using pandas are detailed in the Jupyter Notebook. The reviews are categorized on an overall rating scale of 1 to 5, with 1 being the lowest approval and 5 being the highest. I split the data so that reviews rated 1 or 2 are labeled as negative and those rated 4 or 5 as positive. I omit ratings of 3 for this exercise because they could vary between negative and positive.
Prepare Data for Classification
Import the necessary packages. The steps below assume the data has already been cleaned using pandas.
import pandas as pd
import random
import string
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords
from nltk import classify
from nltk import NaiveBayesClassifier
Load in the cleaned data from a CSV from a data folder using pandas.
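For example, with a hypothetical file path and dataframe name:
reviews = pd.read_csv('data/cleaned_reviews.csv')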
The main cleaned dataframe has three columns: overview, reviewText, and reaction. The overview column has the numeric review rating, the reviewText column has the product reviews in strings, and the reaction column is marked with ‘positive’ or ‘negative’. Each row represents an individual review.
Reduce the main pandas dataframe to a smaller group using the sample function from the random package and a lambda function on the reaction column. I use an even split of 20,000 reviews.
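One way to draw an even sample of 10,000 positive and 10,000 negative reviews (a sketch; the exact sampling approach and the sample_df name are assumptions):
sample_df = reviews.groupby('reaction').apply(lambda x: x.sample(n=10000, random_state=1)).reset_index(drop=True)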
Use this sample dataframe to create a list for each sentiment type. Use the loc function from pandas to specify each entry that has ‘positive’ or ‘negative’ in the reaction column, respectively. Then, use the pandas tolist() function to convert the dataframe to a list type.
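For example, continuing from the hypothetical sample_df above and matching the pos_list and neg_list names used next:
pos_list = sample_df.loc[sample_df['reaction'] == 'positive', 'reviewText'].tolist()
neg_list = sample_df.loc[sample_df['reaction'] == 'negative', 'reviewText'].tolist()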
With these lists, use the lower() function and list comprehension to make each review lowercase. This reduces variance in the types of forms a word with various syntax can have.
pos_list_lowered = [word.lower() for word in pos_list]
neg_list_lowered = [word.lower() for word in neg_list]
Turn the lists into string types to more easily separate words and prepare for more cleaning. For this text classification, we will consider the frequency of words in each type of review.
pos_list_to_string = ' '.join([str(elem) for elem in pos_list_lowered])
neg_list_to_string = ' '.join([str(elem) for elem in neg_list_lowered])
To eliminate noise in the data, stop words (examples: ‘and’, ‘how’, ‘but’) should be removed, along with punctuation. Use NLTK’s built-in function for stop words to specify a variable for both stop words and punctuation.
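For example, matching the stop variable used in the list comprehensions below (this assumes the NLTK stopwords corpus has been downloaded with nltk.download('stopwords')):
stop = set(stopwords.words('english')) | set(string.punctuation)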
Create a variable for the tokenizer. Tokenizing will separate all the words in the list based on a specific variable. In this example, I chose to use a whitespace tokenizer. This means words will be separated based on whitespace.
tokenizer = WhitespaceTokenizer()
Use list comprehension on the positive and negative word lists to tokenize any word that is not a stop word or a punctuation item.
filtered_pos_list = [w for w in tokenizer.tokenize(pos_list_to_string) if w not in stop]
filtered_neg_list = [w for w in tokenizer.tokenize(neg_list_to_string) if w not in stop]
Remove any punctuation that may be leftover if it was attached to a word itself.
filtered_pos_list2 = [w.strip(string.punctuation) for w in filtered_pos_list]
filtered_neg_list2 = [w.strip(string.punctuation) for w in filtered_neg_list]
As an optional sidebar, use NLTK’s Frequency Distribution function to check some of the most common words and their number of appearances in the respective reviews.
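A minimal sketch for the positive reviews:
fdist_pos = nltk.FreqDist(filtered_pos_list2)
print(fdist_pos.most_common(10))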
Create a function to make the feature sets for text classification. This will take the lists and create dictionaries with the proper labels.
def word_features(words):
return dict([(word, True) for word in words.split()])
Label the sets of word features and combine into one set to be split for training and testing for sentiment analysis.
positive_features = [(word_features(f), 'pos') for f in filtered_pos_list2]
negative_features = [(word_features(f), 'neg') for f in filtered_neg_list2]
labeledwords = positive_features + negative_features
Randomly shuffle the list of words before use in the classifier to reduce the likelihood of bias toward a given feature label.
random.shuffle(labeledwords)
Training and Testing the Text Classifier for Sentiment
Create a training set and a test set from the list. From NLTK, call upon the Naïve Bayes Classifier model and specify the training set will train the model for sentiment analysis.
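A sketch of the split and training step; the 70/30 split size is an assumption, while the classifier name matches the calls below:
split_point = int(len(labeledwords) * 0.7)
train_set, test_set = labeledwords[:split_point], labeledwords[split_point:]
classifier = NaiveBayesClassifier.train(train_set)
print(classify.accuracy(classifier, test_set))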
Provide some test example reviews for proof of concept and print the results.
print(classifier.classify(word_features('I hate this product, it tasted weird')))
Use NLTK to show the most informative features of the text classifier. This generates a list based on certain features and shows the likelihood that they point to a specific classification of positive or negative review.
classifier.show_most_informative_features(15)
Further Steps
This was an overview of sentiment analysis with NLTK. There are opportunities to increase the accuracy of the classification model. One example would be to use part-of-speech tagging to train the model using descriptive adjectives or nouns. Another idea to pursue would be to use the results of the frequency distribution and select the most common positive and negative words to train the model.
The full GitHub repository tutorial for this can be found here.
In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualizations with Python using pandas and GeoPy. I’ll show how to make new pandas columns from encoding complex responses, geocoding locations, and measuring distances.
Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.
Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions, and their watercraft were inspected for aquatic invasive species.
This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.
Here’s an overview of how to use newsgrab to get news headlines from Google News. Then, the data can be analyzed using the spaCy natural language processing library.
The motivation behind newsgrab was to pull data on New York colleges to compare headlines about how institutions were being affected by COVID-19. I used the College Navigator from the National Center for Education Statistics to get a list of 4-year colleges in New York to use as the search data.
I had trouble finding a clean way to scrape headlines from Google News. My brother Randy helped me use JavaScript and Playwright to write the code for newsgrab.
Run a Search with newsgrab
First, install newsgrab globally through npm from the command line.
npm install -g newsgrab
Run a line with the package name and specify the file path (if outside current working directory) of a line-separated list of desired search terms. For my example, I used the names of New York colleges.
newsgrab ny_colleges.txt
The output of newsgrab is a JSON file called output and will follow the array structure below:
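Based on the record_path and meta arguments used when normalizing the file later on, the structure is roughly an array of objects like the following (placeholder values shown):
[
  {
    "search_term": "example college name",
    "results": ["example headline one", "example headline two"]
  }
]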
Afterwards, the output can be handled with Python.
Analyze the JSON Data with spaCy
Import the necessary packages for handling the data. These include: json, pandas, matplotlib, seaborn, re, and spaCy. Specific modules to import are the json_normalize module from pandas and the counter module from collections.
import json
import pandas as pd
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import seaborn as sb
import re
import spacy
from collections import Counter
Bring in one of the pre-trained models from spaCy. I use the model called en_core_web_sm. There are other options in their docs for English models, as well as those for different languages.
nlp = spacy.load("en_core_web_sm")
Read in the JSON data as a list and then normalize it with pandas. Specify the record path as ‘results’ and the meta as ‘search_term’ to correspond with the JSON array data structure from the output file.
with open('output.json',encoding="utf8") as raw_file1:
list1 = json.load(raw_file1)
search_data = pd.json_normalize(list1, record_path='results', meta='search_term',record_prefix='results')
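The loop below references a dataframe named df with a lowercased results column; a minimal bridging step (dataframe and column names are assumptions) might look like:
df = search_data.copy()
df['results_lower'] = df['results0'].str.lower()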
Gather all separate data through spaCy. I wanted to pull noun chunks, named entities, and tokens from my results column. For the token output, I use the attributes of rule-based matching to specify that I want all tokens except for stop words or punctuation. Then, each output is put into a column of the main dataframe.
noun_chunks = []
named_entity = []
tokens = []
for doc in nlp.pipe(df['results_lower'].astype('unicode').values, batch_size=50,
n_process=5):
if doc.is_parsed:
noun_chunks.append([chunk.text for chunk in doc.noun_chunks])
named_entity.append([ent.text for ent in doc.ents])
tokens.append([token.text for token in doc if not token.is_stop and not token.is_punct])
else:
noun_chunks.append(None)
named_entity.append(None)
tokens.append(None)
df['results_noun_chunks'] = noun_chunks
df['results_named_entities'] = named_entity
df['results_tokens_clean'] = tokens
Process Tokens
Take the tokens column and flatten it into a list. Perform some general data cleaning like removing special characters and taking out line breaks and the remnants of ampersands. Then, use the counter module to get a frequency count of each of the words in the list.
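A sketch of the flattening and light cleaning, with string_list_of_words matching the Counter call below:
# Flatten the token lists and strip special characters and leftover line breaks
string_list_of_words = [re.sub(r"[^a-z0-9' ]", '', str(word)) for tokens in df['results_tokens_clean'] if tokens for word in tokens]
string_list_of_words = [word for word in string_list_of_words if word]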
word_frequency = Counter(string_list_of_words)
Before analyzing the list, I also remove the tokens for my list of original search terms to keep it more focused on the terms outside of these. Then, I create a dataframe of the top results and plot those with seaborn.
Process Noun Chunks
Perform some cleaning to separate the noun chunks lists per each individual search term. I remove excess characters after converting the output to strings, and then use the explode function from pandas to separate them.
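A compact way to get one noun chunk per row (the original converts to strings and cleans before separating; explode() is used here as a sketch, and the chunks name is an assumption):
chunks = df[['search_term', 'results_noun_chunks']].explode('results_noun_chunks')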
Then, create a variable for the value count of each of the noun chunks, turn that into a dictionary, then map it to the dataframe for the following result.
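For example, continuing from the hypothetical chunks dataframe above:
chunk_counts = chunks['results_noun_chunks'].value_counts().to_dict()
chunks['chunk_frequency'] = chunks['results_noun_chunks'].map(chunk_counts)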
Then, I sort the values in a new dataframe in descending order, remove duplicates, and narrow down to the top 20 noun chunks with frequencies above 10 to graph in a countplot.
Process Named Entities
Cleaning the named entity outputs for each headline is nearly the same in process as cleaning the noun chunks. The lists are converted to strings, are cleaned, and use the explode function to separate individually. The outputs for named entities can be customized depending on desired type.
After separating the individual named entities, I use spaCy to identify the type of each and create a new column for these.
named_entity_type = []
for doc in nlp.pipe(named['named_entity'].astype('unicode').values, batch_size=50,
n_process=5):
if doc.is_parsed:
named_entity_type.append([ent.label_ for ent in doc.ents])
else:
named_entity_type.append(None)
named['named_entities_type'] = named_entity_type
Then, I get the value counts for the named entities and append these to a dictionary. I map the dictionary to the named entity column, and put the result in a new column.
As seen in the snippet of the full dataframe below, the model for identifying named entity values and types is not always accurate. There is documentation for training spaCy’s models for those interested in increased accuracy.
From the dataframe, I narrow down the entity types to exclude cardinal and ordinal types to take out any numbers that may have high frequencies within the headlines. Then, I get the top named entity types with frequencies over 6 to graph.
For full details and the cleaning steps used to create the visualizations above, please reference the associated gist from GitHub below.
Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.
Gather Bulk Song Lyrics Data
First, create an account with Genius to obtain an API key. This is used for making requests to scrape song lyrics data from a desired artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.
You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.
Find Cities and Countries in the Data
After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.
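A sketch of this step; the lyrics file name and dataframe name are assumptions:
from geotext import GeoText
import pandas as pd

# Read the scraped lyrics and extract city mentions
with open('lyrics.txt', encoding='utf-8') as f:
    lyrics_text = f.read()

places = GeoText(lyrics_text)
cities_df = pd.DataFrame(places.cities, columns=['city'])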
Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.
Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.
In my example, I corrected Newport and Venice to include ‘Beach’. I understand this can be cumbersome with larger datasets, but I did not see it imperative to automate this task for my example.
city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')
Next, save a list and a dataframe with value counts for each city to be used later for the map. Reset the index as well to have the two columns as city and mentions.
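A sketch of this step, assuming the mentions dataframe has a 'city' column and matching the 'mentions' name used in the mapping code later:
city_counts = city_mentions['city'].value_counts()
unique_list = city_counts.index.tolist()
mentions_df = city_counts.rename_axis('city').reset_index(name='mentions')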
Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10,user_agent=chrome_user_agent)
lat_lon = []
for city in unique_list:
try:
location = geolocator.geocode(city)
if location:
lat_lon.append(location)
except GeocoderTimedOut as e:
print("Error: geocode failed on input %s with message %s"%
(city, e))
city_data = pd.DataFrame(lat_lon, columns=['raw_data','raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]
This yields one column as the latitude and longitude and another with comma separated location data.
Reduce the Geocode Data to Desired Columns
I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.
To split the location data (raw_data) column, convert it to a string and then split it and create a new column (city) from the first indexed object.
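For example (treating the comma as the delimiter, based on the comma-separated output described above):
city_data['raw_data'] = city_data['raw_data'].astype(str)
city_data['city'] = city_data['raw_data'].str.split(',').str[0]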
Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.
#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)
#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)
#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))
Convert the latitude and longitude columns back to floats because this is the usable type for plotly.
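For example:
city_data['lat'] = city_data['lat'].astype(float)
city_data['lon'] = city_data['lon'].astype(float)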
Create an account with MapBox to obtain an API key to plot my song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting variables such as the color scale, the zoom extent, and the data that appears when hovering over the data.
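The plotting code below uses a dataframe named merged with lat, lon, mentions, and city columns; a bridging step (names carried over from the hypothetical sketches above) might look like:
merged = city_data.merge(mentions_df, on='city')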
import plotly.express as px

px.set_mapbox_access_token(open("mapbox_token.txt").read())
fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
color_continuous_scale=px.colors.sequential.Agsunset, size_max=40, zoom=3,
hover_data=['city'])
fig.update_layout(
title={
'text': 'US Cities Mentioned in Lana Del Rey Songs',
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
#save graph as html
with open('plotly_graph.html', 'w') as f:
f.write(fig.to_html(include_plotlyjs='cdn'))
Visualizing qualitative data can be difficult if care is not taken with its hierarchical characteristics. Variables representing levels of feelings can be presented in a horizontal range to improve comprehension. The online bank, Simple, includes a poll in its newsletter to account holders and often asks for levels of confidence with financial topics. Here’s how to present hierarchical qualitative data in a few different ways, based on visualizations from Simple’s monthly newsletter.
To represent qualitative data, careful consideration should be given to:
Graph Type
Logical Order of Data
Color Scheme
Original Graphs
Graph 1
In September, Simple’s poll question was: “How confident do you feel making big purchases in today’s financial environment?” Here is the visualization that accompanied it.
Although the legend is presented in a sensible high-to-low order, this graph is pretty confusing. The choice of a pie chart muddles the range of emotions being presented. The viewer’s eye, if moving clockwise, hits ‘Not at all Confident’ at about the same time as ‘Very Confident’. The color palette has no inherent significance for the survey responses. It does not travel on an easily understood color spectrum of high to low.
Graph 2
In November, Simple’s poll question was: “How do you feel about the money you’ll be spending this holiday season?” Below is the graph that illustrated these results.
Simple’s graph shows various emotions, but does not show them in any particular order, whether by percentage or type of feeling. Similar to the pie chart, the color palette does not have any particular significance.
Improved Graphs
Using Python and matplotlib’s horizontal stacked bar chart, I created different representations of the survey data for big purchase confidence and feelings about holiday spending. A bar chart presents results for viewers to read logically from left to right.
Graph 1
I associated the levels of confidence with a green-to-red spectrum to signify the range of positive to negative feelings. Another variation could have been a monochrome spectrum, where a dark shade moving to lighter shades would signify decreasing confidence.
Graph 2
I arranged the emotions from negative to positive feelings so they could show a spectrum. The color palette reflects the movements from troubled to excited by moving from red to green.
References
The survey data, as mentioned, comes from Simple‘s monthly newsletter.
This article from matplotlib on discrete distribution provided me with the base for these graphs. The main distinction is that I only included one bar to achieve the singular spectrum of survey results. I found variations of tree maps and waffle plots did not divide sections horizontally in rectangles as well as the stacked bar plot would.
Code
Visual #1 – September Survey Data
import numpy as np
import matplotlib.pyplot as plt

category_names1 = ['very \nconfident', 'somewhat \nconfident', 'mixed \nfeelings', 'not really \nconfident', 'not at all \nconfident']
results1 = {'': [14,16,30,19,21]}
def survey1(results, category_names):
labels = list(results.keys())
data = np.array(list(results.values()))
data_cum = data.cumsum(axis=1)
category_colors = plt.get_cmap('RdYlGn_r')(
np.linspace(0.15, 0.85, data.shape[1]))
fig, ax = plt.subplots(figsize=(12, 4))
ax.invert_yaxis()
ax.xaxis.set_visible(False)
ax.set_xlim(0, np.sum(data, axis=1).max())
for i, (colname, color) in enumerate(zip(category_names, category_colors)):
widths = data[:, i]
starts = data_cum[:, i] - widths
ax.barh(labels, widths, left=starts, height=0.5,
label=colname, color=color)
xcenters = starts + widths / 2
r, g, b, _ = color
text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
for y, (x, c) in enumerate(zip(xcenters, widths)):
ax.text(x, y, str(int(c))+'%', ha='center', va='center',
color=text_color, fontsize=20, fontweight='bold',
fontname='Gill Sans MT')
ax.legend(ncol=len(category_names), bbox_to_anchor=(0.007, 1),
loc='lower left',prop={'family':'Gill Sans MT', 'size':'15'})
ax.axis('off')
return fig, ax
survey1(results1, category_names1)
plt.suptitle(t ='How confident do you feel making big purchases in today\'s financial environment?', x=0.515, y=1.16,
fontsize=22, style='italic', fontname='Gill Sans MT')
#plt.savefig('big_purchase_confidence.jpeg', bbox_inches = 'tight')
plt.show()
Visual #2 – November Survey Data
category_names2 = ['in a pickle','worried','fine','calm','excited']
results2 = {'': [14,32,16,29,9]}
def survey2(results, category_names):
labels = list(results.keys())
data = np.array(list(results.values()))
data_cum = data.cumsum(axis=1)
category_colors = plt.get_cmap('RdYlGn')(
np.linspace(0.15, 0.85, data.shape[1]))
fig, ax = plt.subplots(figsize=(10.5, 4))
ax.invert_yaxis()
ax.xaxis.set_visible(False)
ax.set_xlim(0, np.sum(data, axis=1).max())
for i, (colname, color) in enumerate(zip(category_names,
category_colors)):
widths = data[:, i]
starts = data_cum[:, i] - widths
ax.barh(labels, widths, left=starts, height=0.5,
label=colname, color=color)
xcenters = starts + widths / 2
r, g, b, _ = color
text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
for y, (x, c) in enumerate(zip(xcenters, widths)):
ax.text(x, y, str(int(c))+'%', ha='center', va='center',
color=text_color, fontsize=20, fontweight='bold', fontname='Gill Sans MT')
ax.legend(ncol=len(category_names), bbox_to_anchor=(- 0.01, 1),
loc='lower left', prop={'family':'Gill Sans MT', 'size':'16'})
ax.axis('off')
return fig, ax
survey2(results2, category_names2)
plt.suptitle(t ='How do you feel about the money you\'ll be spending this holiday season?', x=0.509, y=1.1, fontsize=22,
style='italic', fontname='Gill Sans MT')
#plt.savefig('holiday_money.jpeg', bbox_inches = 'tight')
plt.show()
I used the Spotify Web API to pull the top songs from my personal account. I’ll go over how to get the fifty most popular songs from a user’s Spotify account using spotipy, clean the data, and produce visualizations in Python.
Top 50 Spotify Songs
Top 50 songs from my personal Spotify account, extracted using the Spotify API.
# | Song | Artist | Album | Popularity
1 | Borderline | Tame Impala | Borderline | 77
2 | Groceries | Mallrat | In the Sky | 64
3 | Fading | Toro y Moi | Outer Peace | 48
4 | Fanfare | Magic City Hippies | Hippie Castle EP | 57
5 | Limestone | Magic City Hippies | Hippie Castle EP | 59
6 | High Steppin' | The Avett Brothers | Closer Than Together | 51
7 | I Think Your Nose Is Bleeding | The Front Bottoms | Ann | 43
8 | Die Die Die | The Avett Brothers | Emotionalism (Bonus Track Version) | 44
9 | Spice | Magic City Hippies | Modern Animal | 42
10 | Bleeding White | The Avett Brothers | Closer Than Together | 53
11 | Prom Queen | Beach Bunny | Prom Queen | 73
12 | Sports | Beach Bunny | Sports | 65
13 | February | Beach Bunny | Crybaby | 51
14 | Pale Beneath The Tan (Squeeze) | The Front Bottoms | Ann | 43
15 | 12 Feet Deep | The Front Bottoms | Rose | 49
16 | Au Revoir (Adios) | The Front Bottoms | Talon Of The Hawk | 50
17 | Freelance | Toro y Moi | Outer Peace | 57
18 | Spaceman | The Killers | Day & Age (Bonus Tracks) | 62
19 | Destroyed By Hippie Powers | Car Seat Headrest | Teens of Denial | 51
20 | Why Won't They Talk To Me? | Tame Impala | Lonerism | 59
21 | Fallingwater | Maggie Rogers | Heard It In A Past Life | 71
22 | Funny You Should Ask | The Front Bottoms | Talon Of The Hawk | 48
23 | You Used To Say (Holy Fuck) | The Front Bottoms | Going Grey | 47
24 | Today Is Not Real | The Front Bottoms | Ann | 41
25 | Father | The Front Bottoms | The Front Bottoms | 43
26 | Broken Boy | Cage The Elephant | Social Cues | 60
27 | Wait a Minute! | WILLOW | ARDIPITHECUS | 80
28 | Laugh Till I Cry | The Front Bottoms | Back On Top | 47
29 | Nobody's Home | Mallrat | Nobody's Home | 56
30 | Apocalypse Dreams | Tame Impala | Lonerism | 60
31 | Fill in the Blank | Car Seat Headrest | Teens of Denial | 56
32 | Spiderhead | Cage The Elephant | Melophobia | 57
33 | Tie Dye Dragon | The Front Bottoms | Ann | 47
34 | Summer Shandy | The Front Bottoms | Back On Top | 43
35 | At the Beach | The Avett Brothers | Mignonette | 51
36 | Motorcycle | The Front Bottoms | Back On Top | 41
37 | The New Love Song | The Avett Brothers | Mignonette | 42
38 | Paranoia in B Major | The Avett Brothers | Emotionalism (Bonus Track Version) | 49
39 | Aberdeen | Cage The Elephant | Thank You Happy Birthday | 54
40 | Losing Touch | The Killers | Day & Age (Bonus Tracks) | 51
41 | Four of a Kind | Magic City Hippies | Hippie Castle EP | 46
42 | Cosmic Hero (Live at the Tramshed, Cardiff, Wa... | Car Seat Headrest | Commit Yourself Completely | 34
43 | Locked Up | The Avett Brothers | Closer Than Together | 49
44 | Bull Ride | Magic City Hippies | Hippie Castle EP | 49
45 | The Weight of Lies | The Avett Brothers | Emotionalism (Bonus Track Version) | 51
46 | Heat Wave | Snail Mail | Lush | 60
47 | Awkward Conversations | The Front Bottoms | Rose | 42
48 | Baby Drive It Down | Toro y Moi | Outer Peace | 47
49 | Your Love | Middle Kids | Middle Kids EP | 29
50 | Ordinary Pleasure | Toro y Moi | Outer Peace | 58
Using Spotipy and the Spotify Web API
First, I created an account with Spotify for Developers and created a client ID from the dashboard. This provides both a client ID and client secret for your application to be used when making requests to the API.
Next, from the application page, in ‘Edit Settings’, in Redirect URIs, I add http://localhost:8888/callback . This will come in handy later when logging into a specific Spotify account to pull data.
Then, I write the code to make the request to the API. This will pull the data and put it in a JSON file format.
I import the following libraries:
Python’s os library to set the client ID, client secret, and redirect URI for the code through the computer’s operating system. This temporarily stores the credentials in environment variables.
Spotipy to provide an authorization flow for logging in to a Spotify account and obtain current top tracks for export.
import os
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
Next, I define the client ID and secret to match what has been assigned to my application from the Spotify API. Then, I set the environment variables to include the client ID, client secret, and the redirect URI.
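A sketch of this step using placeholder credentials and the environment variable names spotipy expects:
# Placeholders for the credentials assigned to your application
cid = 'your-client-id'
secret = 'your-client-secret'
redirect_uri = 'http://localhost:8888/callback'

os.environ['SPOTIPY_CLIENT_ID'] = cid
os.environ['SPOTIPY_CLIENT_SECRET'] = secret
os.environ['SPOTIPY_REDIRECT_URI'] = redirect_uri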
Then, I work through the authorization flow from the Spotipy documentation. The first time this code is run, the user will have to provide their Spotify username and password when prompted in the web browser.
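For example, with a placeholder username; the user-top-read scope is needed to read top tracks:
scope = 'user-top-read'
username = 'your-spotify-username'
token = util.prompt_for_user_token(username, scope)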
In the results section, I specify the information to pull. The arguments I provide indicate 50 songs as the limit, the index of the first item to return, and the time range. The time range options, as specified in Spotify’s documentation, are:
short_term : approximately last 4 weeks of listening
medium_term : approximately last 6 months of listening
long_term : last several years of listening
For my query, I decided to use the medium term argument because I thought that would give the best picture of my listening habits for the past half year. Lastly, I create a list to append the results to and then write them to a JSON file.
if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_top_tracks(limit=50, offset=0, time_range='medium_term')
    # Collect the response and write it out as JSON
    top_songs = [results]
    with open('top50_data.json', 'w', encoding='utf-8') as f:
        json.dump(top_songs, f, ensure_ascii=False, indent=4)
else:
    print("Can't get token for", username)
After compiling this code into a Python file, I run it from the command line. The output is top50_data.JSON which will need to be cleaned before using it to create visualizations.
Cleaning JSON Data for Visualizations
The top song data JSON file output is nested according to different categories, as seen in the sample below.
Before cleaning the JSON data and creating visualizations in a new file, I import json, pandas, matplotlib, and seaborn. Next, I load the JSON file with the top 50 song data.
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
with open('top50_data.json') as f:
data = json.load(f)
I create a full list of all the data to start. Next, I create lists where I will append the specific JSON data. Using a loop, I access each of the items of interest for analysis and append them to the lists.
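A sketch of the parsing step, assuming the standard fields in the Spotify top tracks response; the top50 dataframe and artist column names match the plotting code below:
songs, artists, albums, popularity = [], [], [], []

# Each item in the response holds one track's nested metadata
for item in data[0]['items']:
    songs.append(item['name'])
    artists.append(item['artists'][0]['name'])
    albums.append(item['album']['name'])
    popularity.append(item['popularity'])

top50 = pd.DataFrame({'song': songs, 'artist': artists, 'album': albums, 'popularity': popularity})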
Using the DataFrame, I create two visualizations. The first is a count plot using seaborn to show how many top songs came from each artist represented in the top 50 tracks.
descending_order = top50['artist'].value_counts().sort_values(ascending=False).index
ax = sb.countplot(y = top50['artist'], order=descending_order)
sb.despine(fig=None, ax=None, top=True, right=True, left=False, trim=False)
sb.set(rc={'figure.figsize':(6,7.2)})
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_title('Songs per Artist in Top 50', fontsize=16, fontweight='heavy')
sb.set(font_scale = 1.4)
ax.axes.get_xaxis().set_visible(False)
ax.set_frame_on(False)
y = top50['artist'].value_counts()
for i, v in enumerate(y):
ax.text(v + 0.2, i + .16, str(v), color='black', fontweight='light', fontsize=14)
plt.savefig('top50_songs_per_artist.jpg', bbox_inches="tight")
The second graph is a seaborn box plot to show the popularity of songs within individual artists represented.
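A sketch of that plot, reusing the column names from the dataframe above:
fig, ax = plt.subplots(figsize=(6, 7.2))
sb.boxplot(x='popularity', y='artist', data=top50, ax=ax)
ax.set_xlabel('Popularity')
ax.set_ylabel('')
ax.set_title('Song Popularity per Artist', fontsize=16, fontweight='heavy')
plt.savefig('top50_popularity_per_artist.jpg', bbox_inches='tight')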
For future interactions with the Spotify Web API, I would like to complete requests that pull top song data for each of the three term options and compare them. This would give a comprehensive view of listening habits and could lead to pulling further information from each artist.