Data Exploration with Student Test Scores

Data Exploration with Student Test Scores

I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.

Preliminary Data Cleaning

For this project, I used numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of math, reading, and writing scores and three columns for the test scores converted into letter grade.

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
Output of info() for student test scores.
tests.head(10)
Output of head() for student test scores.

Univariate Analysis

Histograms provide a sense of the spread of test scores across subject. Count plots provide counts for test preparation course attendance and parental level of education.

plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
Histograms showing the spread of student test scores across all topics.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = count
    plt.text(loc, count-35, pct_string, ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Bivariate count plots of student test scores across parental levels of education.

Bivariate Analysis

Violin plots illustrate average test scores and test preparation course attendance. Box plots provide visual representation of the quartiles within each subject area. I sorted level of education from the lowest to highest level captured by the data.

base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores base on level of test preparation.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.

Multivariate Analysis

A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.

ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.

Data Visualizations for Spending Habits

Data Visualizations for Spending Habits

I am not a huge fan of bank-generated visuals to analyze my spending habits. My bank breaks up expenses into murky categories such as bills and utilities, shopping, other, and un-categorized. As a result, I began tracking all my expenses in a spreadsheet to better capture data. This includes month, vendor, amount, and expense category for every purchase. To explore this data, I used a heat map and a waffle chart.

Here’s a look at how to use Python to create heat maps and waffle charts for spending habit data.

Setup

Libraries and packages I utilized for my spending data include pandas, NumPy, matplotlib, seaborn and pywaffle. I upload my data from a CSV and turn it into a dataframe using pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from pywaffle import Waffle
%matplotlib inline

df = pd.read_csv('2018_expenses.csv')

Heat Map of Number of Purchases

Heat maps display numerical trends using a sequential scale of color intensity. Below, I graph number of purchases across spending categories and organize the information by month.

I customize the graph to have annotations on each block and color the area using the seaborn color palette ‘rocket’ set to be reversed in shading by adding ‘_r’. An additional argument (‘annot_kws’) adjusts the text size of the annotations. Other features clean up the axes labels and name the tick labels accordingly.

ct_counts=df.groupby(['category','month']).size()
ct_counts=ct_counts.reset_index(name='count')
ct_counts=ct_counts.pivot(index='month', columns='category', values='count')
plt.figure(figsize=(12, 6))
ax=sb.heatmap(ct_counts, annot=True, cmap='rocket_r', annot_kws={"size":14})
plt.title('Number of Monthly Expenses', fontsize=16)
plt.xticks(rotation=0)
plt.xlabel('')
plt.ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax.set_xticklabels(['Business', 'Education', 'Entertainment', 'Food', 'Health', 'Other', 'Transportation'])
ax.set_yticklabels(['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], rotation=0);

The heat map shows my most frequent purchases are on food, entertainment, other, and transportation. Categories such as business and health have the lowest frequency of purchases. Across months, purchases are fairly similar with only between 1 and 9 purchases per spending category.

Waffle Chart of Total Spending

Waffle charts display values in congregated squares and are an alternative to pie charts. Formatting options allow for a varied number of columns and rows. I stuck with ten columns with ten rows, with each square to represent one percent out of a total one hundred. This allows for a simple breakdown of where my money went. I created a dictionary of values for this example, rather than using my pandas dataframe.

 data={'Education':63, 'Entertainment':9, 'Other':8, 'Health':7, 'Food':6, 'Transportation':4, 'Business':3}

fig=plt.figure(
FigureClass=Waffle, 
rows=10, 
values=data, 
colors=('indianred', 'lightsalmon', 'peachpuff', 'darkkhaki', 
'cadetblue','paleturquoise', 'lightsteelblue'), 
title={'label': 'Total Spending by Category', 'loc': 'center'}, 
labels=["{0} ({1}%)".format(k, v) for k, v in data.items()],
legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)}
)
fig.set_tight_layout(False)
plt.show();

The waffle chart shows most of my spending was on education and the other categories were all under 10% of total expenses.

Reducing Plastic Use

Reducing Plastic Use

Various pieces of plastic trash debris are strewn alongside seaweed and rocks on a beach.
Assorted plastic trash on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017.

In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report for 2014 municipal waste stream data for the United States. Plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.

A graph compares different types of plastic products and their fate in the municipal waste stream.
Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA.

The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.

Here are some ideas for cutting back on plastic use:

  • Bring reusable shopping bags to every store.
    • Utilize cloth bags for all purchases.
    • Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
  • Ditch party plasticware.
    • Buy an assortment of silverware from a thrift store for party use.
    • Snag a set of used glassware for drinks instead of buying single-use plastic cups.
  • Use Bee’s Wrap instead of plastic wrap.
    • Bee’s Wrap is beeswax covered cloth for food storage. It works exactly the same as plastic wrap, but it can be used over and over.
  • Choose glassware instead of plastic zip-locked bags for storing food.
    • Glass containers like Pyrex can be used in place of single-use plastic storage bags.
  • Say ‘no’ to plastic straws.
    • Get in the habit of refusing a straw at restaurants when you go out.
    • Bring a reusable straw made out of bamboo, stainless steel, or glass to your favorite drink spot.

 

To check out the code for the figure I created, here’s the repository for it.