Species Richness and Distribution in the National Parks with pandas

Here’s how to use Python and pandas to explore species data for the United States National Parks to find the average species richness and the distribution of species categories. This goes over some of the built-in functions in pandas and how to use those for exploratory data analysis. The source data is available via Kaggle, or the National Parks Species website. The associated Github repository for a more in-depth look at the two Jupyter Notebooks for the code below is available here.

An elk sits off a trail at Yellowstone National Park as visitors walk by.
An elk sits near a trail in Yellowstone National Park, May 2017.

Setup

Import the necessary packages, including pandas, matplotlib, seaborn, and math.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import math

Load the species dataset and the parks dataset with pandas. Given the size and that I used Jupyter Notebook for this, I give the low memory argument a value of False for the species data so it loads without too much trouble. The dataframes can be given any name.

species_data = pd.read_csv('species.csv', low_memory=False)
parks_data = pd.read_csv('parks.csv')

Preview the datasets with the head() and info() pandas functions. The columns of focus will be park name and category for the species dataframe, and park name and acres for the parks dataframe.

species_data.head()
A pandas dataframe shows columns and rows for species data.
species_data.info()
A preview of the species dataframe info shows all the columns, their counts, and data types.
parks_data.head()
A pandas dataframe shows columns and rows for parks data.

Species Richness

Species richness is the quantification of species in a given area and can be a helpful metric for biodiversity.

For this, we’ll want to merge the species data and the parks data. The species data will need to be re-formatted beforehand.

Out of the box, the species data has one record per type of species per row. This will need to be turned into individual counts per column to get the number of species in each park. Use the groupby() function in pandas specifying the park name column, and then use count().

all_species_data = species_data.groupby(['Park Name']).count()
A pandas dataframe shows columns per each national park and their respective counts.

Next, I narrow down to just the Species ID column and change the name of it to reflect just ‘Species’.

species_counts = all_species_data[['Species ID']].copy()
species_counts = species_counts.rename(columns={'Species ID' : 'Species'}

The parks data will need the names set as the index as well. Then, the two dataframes will be in good shape to merge. Specify the two dataframes, and give the left_index and right_index arguments values of true because they are the same between each.

parks_data = parks_data.set_index('Park Name')
richness_data = species_counts.merge(parks_data, left_index=True, right_index=True)

Preview the newly merged dataframe to confirm it looks correct.

richness_data.head()
A pandas dataframe shows merged species and park data columns.

To take this one step further, estimate the number of species per acre using the species counts. Create a function to take the species count column and acres column and divide them, and normalize the result to an integer with math’s floor() function.

def species_abundance(df):
   return df.apply(
       lambda row:
         math.floor(
           row['Acres'] / row['Species']),
       axis=1
   )

Create a new column in the dataframe and apply the species abundance function.

richness_data['Species Abundance'] = species_abundance(richness_data)
richness_data.head()
A dataframe show the number of species and abundance of species per acre in individual national parks.

Calculate the mean of species per acre in all the parks. There are a lot of variables to consider that might affect this number, but it serves its purpose as a quick summary statistic to give us about 525 species per acre.

print(richness_data['Species Abundance'].mean())

Species Type Distribution

Species distribution in this setting will be the ratio of species types throughout each park. The different species categories in this data set are: ‘Mammal’, ‘Bird’, ‘Reptile’, ‘Amphibian’, ‘Fish’, ‘Vascular Plant’, ‘Spider/Scorpion’, ‘Insect’, ‘Invertebrate’, ‘Fungi’, ‘Nonvascular Plant’, ‘Crab/Lobster/Shrimp’, ‘Slug/Snail’, ‘Algae’.

To extract just the park names and categories, create a new dataframe with just these columns.

types = species_data[['Park Name', 'Category']].copy()
A pandas dataframe shows one row per species record of a given park and category.

Use pandas’ groupby() function to group by park name and category. Then, use the size function to specify the number of rows for the unstack function, which will create a column for each of the unique row values. I give the argument of fill_value for the unstack function a value of 0, to keep anything with NaN values consistent for math operations.

df = types.groupby(['Park Name','Category']).size().unstack(fill_value=0)
df.head()
A pandas dataframe shows the number of species in each category per park.

Next, take the counts from each and put all on the same scale of out of 100. Use pandas’ div() function to divide based on the sum of the dataframe per each row. Then, multiply each value by 100 for better readability and translation for visuals.

ratios = df.div(df.sum(axis=1), axis=0).multiply(100)
ratios.head()
A pandas dataframe shows raw values for each park's species category ratios.

To clean the dataframe values further, round each variable to 2 decimal places using Python’s round() function.

rounded = ratios.round(2)
rounded.head()
A pandas dataframe shows rounded values for ratios in each park's species category.

Create a box plot using matplotlib and seaborn to show the breakdown of species categories across all parks.

f, ax = plt.subplots(figsize=(12, 8))
sb.boxplot(data=rounded, orient='h')
ax.xaxis.grid(True)
ax.set(ylabel="")
ax.set(xlim=(0,100))
sb.despine(trim=True, left=True)
plt.title('Species Category Distribution in the National Parks', fontsize=16)
plt.show()
A boxplot shows the species category types and their distributions.

Optionally, export the species categories dataframe as a CSV for further use.

Park NameAlgaeAmphibianBirdCrab/Lobster/ShrimpFishFungiInsectInvertebrateMammalNonvascular PlantReptileSlug/SnailSpider/ScorpionVascular Plant
Acadia National Park0.00.8821.30.02.220.00.00.03.220.00.640.00.071.74
Arches National Park0.00.7619.560.01.050.00.00.05.630.01.910.00.071.09
Badlands National Park0.00.7217.210.01.7312.4617.210.074.610.00.940.00.0745.0
Big Bend National Park0.00.5718.290.02.340.00.00.03.922.122.730.00.070.03
Biscayne National Park0.00.4613.50.047.390.00.641.971.620.02.320.00.032.1
Black Canyon of the Gunnison National Park0.00.1815.820.01.450.00.00.06.060.00.990.00.075.5
Bryce Canyon National Park0.00.3116.870.00.080.00.00.05.910.01.010.00.075.82
Canyonlands National Park0.00.5717.990.02.70.00.00.06.210.01.80.00.070.73
Capitol Reef National Park0.00.3815.840.00.960.00.00.04.660.01.340.00.076.82
Carlsbad Caverns National Park0.00.9823.890.00.330.00.00.05.990.04.040.00.064.78
Channel Islands National Park3.240.2118.940.5814.480.00.1110.42.333.610.581.750.0543.71
Congaree National Park3.191.858.620.262.812.0226.580.651.680.32.150.90.938.09
Crater Lake National Park5.80.536.71.060.355.1126.441.812.555.130.530.240.4543.3
Cuyahoga Valley National Park0.01.2412.670.414.380.011.71.292.420.01.180.770.163.83
Death Valley National Park1.351.611.960.860.22.7720.340.414.780.52.031.620.7250.87
Denali National Park and Preserve0.00.0813.560.01.062.23.860.233.2612.050.00.00.063.71
Dry Tortugas National Park0.00.033.370.033.140.00.04.950.710.00.590.00.027.24
Everglades National Park0.00.8217.750.020.440.00.00.02.020.02.930.00.056.05
Gates Of The Arctic National Park and Preserve0.00.079.90.01.2635.030.00.02.880.00.00.00.050.85
Glacier Bay National Park and Preserve3.270.2613.184.9118.340.11.795.982.965.010.151.890.1542.0
Glacier National Park0.080.2310.840.231.0610.87.710.082.715.810.160.780.049.53
Grand Canyon National Park0.00.5717.390.01.110.02.170.044.040.02.90.085.4266.29
Grand Teton National Park0.050.3413.10.051.031.387.590.253.650.00.250.20.0572.07
Great Basin National Park0.00.8312.440.230.790.5717.191.093.880.62.191.130.6858.39
Great Sand Dunes National Park and Preserve0.00.6325.210.00.630.00.110.07.140.110.840.00.065.34
Great Smoky Mountains National Park0.00.924.110.151.629.5436.451.271.427.970.771.391.5732.83
Guadalupe Mountains National Park0.00.6915.580.290.173.786.590.44.350.293.213.320.1161.23
Haleakala National Park0.00.121.710.70.232.5641.121.430.588.80.391.821.7838.76
Hawaii Volcanoes National Park0.00.122.371.70.120.2143.335.250.454.090.392.153.0636.75
Hot Springs National Park1.231.3819.850.464.620.00.771.132.670.922.670.10.064.21
Isle Royale National Park0.00.9318.681.04.510.00.01.51.860.070.360.00.071.08
Joshua Tree National Park0.220.2213.120.00.041.6614.910.442.920.262.270.06.6757.28
Katmai National Park and Preserve0.00.0818.120.333.598.570.572.374.411.630.00.570.059.76
Kenai Fjords National Park0.00.0922.230.03.970.190.283.975.20.00.00.380.063.67
Kobuk Valley National Park0.00.112.20.02.5432.490.00.03.712.540.00.00.046.44
Lake Clark National Park and Preserve0.00.059.620.052.7410.960.150.32.4914.00.00.00.059.64
Lassen Volcanic National Park0.110.9513.630.831.112.845.291.05.568.91.220.330.058.21
Mammoth Cave National Park0.01.328.40.044.880.010.82.162.240.041.680.00.0868.35
Mesa Verde National Park0.00.6419.070.00.00.00.00.06.920.01.690.00.071.68
Mount Rainier National Park0.00.9210.730.01.321.782.70.03.9620.480.290.00.057.83
North Cascades National Park0.00.366.720.00.9816.0316.620.02.3511.450.30.01.2543.95
Olympic National Park0.00.8215.910.04.980.04.470.04.110.00.310.00.069.4
Petrified Forest National Park0.00.9428.60.00.00.120.00.07.270.122.460.00.060.49
Pinnacles National Park0.00.7112.010.560.421.9122.812.194.240.02.050.780.7151.62
Redwood National Park1.760.527.941.933.9121.611.795.292.443.980.622.330.1135.77
Rocky Mountain National Park4.760.168.791.240.389.7121.451.522.3513.20.10.320.735.34
Saguaro National Park0.00.5513.410.050.00.110.00.05.560.03.380.00.076.94
Sequoia and Kings Canyon National Parks0.00.6511.030.00.950.00.00.04.460.01.20.00.081.7
Shenandoah National Park0.00.865.760.060.8816.436.830.041.357.520.820.040.0959.31
Theodore Roosevelt National Park0.00.6919.140.092.750.096.270.265.675.241.120.170.1758.37
Voyageurs National Park0.01.0316.380.03.990.212.270.484.340.760.410.00.070.13
Wind Cave National Park0.00.516.850.00.573.087.530.06.380.00.861.790.062.44
Wrangell - St Elias National Park and Preserve0.00.1111.750.05.180.03.230.223.450.390.060.060.075.56
Yellowstone National Park5.140.238.321.590.480.2841.051.971.970.380.231.491.0835.8
Yosemite National Park0.00.7212.930.00.480.00.00.04.210.01.050.00.080.6
Zion National Park0.00.3916.760.00.840.00.00.04.450.01.670.00.075.89

Summary

Further exploration of the species category and count outputs might involve comparing a select number of parks against each other. Unique factors such as location, park size, and biomes provide opportunities for further analysis and insights.

Data Exploration with Student Test Scores

Data Exploration with Student Test Scores

I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.

Preliminary Data Cleaning

For this project, I used numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of math, reading, and writing scores and three columns for the test scores converted into letter grade.

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
Output of info() for student test scores.
tests.head(10)
Output of head() for student test scores.

Univariate Analysis

Histograms provide a sense of the spread of test scores across subject. Count plots provide counts for test preparation course attendance and parental level of education.

plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
Histograms showing the spread of student test scores across all topics.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = count
    plt.text(loc, count-35, pct_string, ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Bivariate count plots of student test scores across parental levels of education.

Bivariate Analysis

Violin plots illustrate average test scores and test preparation course attendance. Box plots provide visual representation of the quartiles within each subject area. I sorted level of education from the lowest to highest level captured by the data.

base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores base on level of test preparation.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.

Multivariate Analysis

A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.

ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.