I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The dataset contains 1,000 student records with the following fields: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, and writing score. My main objective was to explore trends in three stages: univariate, bivariate, and multivariate analysis.
Preliminary Data Cleaning
For this project, I used the numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data stores all test scores as integers. I added a column for the combined average of the math, reading, and writing scores (avg_score in the plots that follow) and three columns holding each test score converted into a letter grade.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
tests.head(10)
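The derived columns described above are not shown in the loading code, so here is a minimal sketch of how they could be built. The name avg_score matches the column used in later plots; the tests_demo frame, the *_grade column names, and the grade cutoffs (60/70/80/90) are assumptions for illustration, not necessarily the ones used in the project.

```python
import pandas as pd

# Tiny stand-in frame (assumption); in the post this would be the `tests` DataFrame.
tests_demo = pd.DataFrame({
    'math': [72, 47, 90],
    'reading': [72, 57, 95],
    'writing': [74, 44, 93],
})

subjects = ['math', 'reading', 'writing']

# Combined average across the three subjects, one value per student.
tests_demo['avg_score'] = tests_demo[subjects].mean(axis=1)

# Hypothetical grade cutoffs (assumption): F below 60, then D, C, B, and A from 90 up.
grade_bins = [0, 60, 70, 80, 90, 101]
grade_labels = ['F', 'D', 'C', 'B', 'A']

for subject in subjects:
    # right=False makes each bin include its lower edge, so 90 falls in [90, 101).
    tests_demo[subject + '_grade'] = pd.cut(
        tests_demo[subject], bins=grade_bins, labels=grade_labels, right=False
    )
```

Using pd.cut keeps the cutoffs in one place, so changing the grading scale only means editing grade_bins and grade_labels.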
Univariate Analysis
Histograms give a sense of the spread of test scores in each subject. Count plots show how many students attended the test preparation course and how many fall into each parental level of education.
plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
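The three histogram panels above repeat the same calls, so the same figure could also be drawn in a loop. The sketch below is one possible variant, not the code used for the post: it runs on synthetic stand-in scores (an assumption) and uses explicit shared bin edges, whereas bins=20 lets each panel pick edges from its own data range.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in scores (assumption); the post uses the real `tests` DataFrame.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 101, size=(200, 3)),
                    columns=['math', 'reading', 'writing'])

# Twenty equal-width bins from 0 to 100, shared by every panel.
bin_edges = np.linspace(0, 100, 21)

# sharex/sharey keep the axis limits identical across the three panels.
fig, axes = plt.subplots(1, 3, figsize=(10, 4), sharex=True, sharey=True)
for ax, subject in zip(axes, demo.columns):
    ax.hist(demo[subject], bins=bin_edges)
    ax.set_title(subject.capitalize())
    ax.set_xlim(0, 100)
axes[0].set_ylabel('Number of Students', fontsize=12)
fig.suptitle('Test Scores', fontsize=16)
```

Sharing the axes replaces the repeated xlim and ylim calls, and the shared bin edges make bar heights directly comparable between subjects.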
ed_order = ['some high school', 'high school', 'some college',
            "associate's degree", "bachelor's degree", "master's degree"]
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
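# annotate each bar with its count, placed just below the top of the bar
# (tick_labels avoids overwriting the `labels` list of column names defined earlier)
locs, tick_labels = plt.xticks()
for loc, label in zip(locs, tick_labels):
    count = cat_counts[label.get_text()]
    count_string = str(count)
    plt.text(loc, count-35, count_string, ha='center', color='black', fontsize=12)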
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Bivariate Analysis
A violin plot shows the distribution of average test scores for students who did and did not complete the test preparation course. Box plots show the quartiles of scores within each subject area, grouped by parental level of education, which I sorted from the lowest to the highest level captured in the data.
base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
ed_order = ['some high school', 'high school', 'some college',
            "associate's degree", "bachelor's degree", "master's degree"]
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
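The low-to-high ordering of education levels is passed to each plot through the order argument. As an alternative sketch, the ranking can be stored on the column itself with an ordered categorical dtype. The small demo frame below is a stand-in (an assumption); in the post the column is tests['par_level_educ'].

```python
import pandas as pd

ed_order = ['some high school', 'high school', 'some college',
            "associate's degree", "bachelor's degree", "master's degree"]

# Small stand-in column (assumption), deliberately out of order.
demo = pd.DataFrame({'par_level_educ': ["master's degree", 'high school',
                                        'some college', 'high school']})

# An ordered categorical dtype records the low-to-high ranking on the column.
ed_dtype = pd.CategoricalDtype(categories=ed_order, ordered=True)
demo['par_level_educ'] = demo['par_level_educ'].astype(ed_dtype)

# Sorting now follows the education ranking instead of alphabetical order.
ranked = demo['par_level_educ'].sort_values()

# Ordered categories also support comparisons against a level.
above_high_school = demo['par_level_educ'] > 'high school'
```

With the dtype set once, the ed_order list would not need to be redefined before each plot, since seaborn generally follows a categorical column's category order when no order argument is given.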
Multivariate Analysis
A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.
ed_order = ['some high school', 'high school', 'some college',
            "associate's degree", "bachelor's degree", "master's degree"]
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);