Species Richness and Distribution in the National Parks with pandas

Here’s how to use Python and pandas to explore species data for the United States national parks and find the average species richness and the distribution of species categories. The walkthrough covers some of the built-in functions in pandas and how to use them for exploratory data analysis. The source data is available via Kaggle or the National Parks Species website. The associated GitHub repository holds the two Jupyter Notebooks with the code below, for a more in-depth look.

An elk sits near a trail in Yellowstone National Park, May 2017.

Setup

Import the necessary packages, including pandas, matplotlib, seaborn, and math.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import math

Load the species dataset and the parks dataset with pandas. The species file is large, so I pass low_memory=False when reading it. This makes pandas process the file in one pass instead of in chunks, which avoids inconsistent type guesses (and the DtypeWarning they trigger) for columns with mixed values. The dataframes can be given any name.

species_data = pd.read_csv('species.csv', low_memory=False)
parks_data = pd.read_csv('parks.csv')

Preview the datasets with the head() and info() pandas functions. The columns of focus will be park name and category for the species dataframe, and park name and acres for the parks dataframe.

species_data.head()
A pandas dataframe shows columns and rows for species data.
species_data.info()
A preview of the species dataframe info shows all the columns, their counts, and data types.
parks_data.head()
A pandas dataframe shows columns and rows for parks data.

Species Richness

Species richness is the number of different species recorded in a given area and can be a helpful metric for biodiversity.

For this, we’ll want to merge the species data and the parks data. The species data will need to be re-formatted beforehand.

Out of the box, the species data has one row per species record in a park. To get the number of species in each park, these rows need to be collapsed into one count per park. Use the groupby() function in pandas on the park name column, and then call count(), which returns the number of non-null values in each remaining column.

all_species_data = species_data.groupby(['Park Name']).count()
A pandas dataframe shows columns per each national park and their respective counts.
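To see what groupby() followed by count() produces, here is a minimal sketch on a made-up frame. The park names, IDs, and the 'Order' values are invented for illustration and are not taken from the dataset. Because count() tallies non-null values, a column with missing entries can show a smaller number than the ID column.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_species = pd.DataFrame({
    'Park Name': ['Park A', 'Park A', 'Park B'],
    'Species ID': ['A-1', 'A-2', 'B-1'],
    'Order': ['Carnivora', None, 'Rodentia'],
})

# count() reports the number of non-null values per column within each park.
toy_counts = toy_species.groupby(['Park Name']).count()
print(toy_counts)
```

This is why a column without missing values, such as an ID column, is a reasonable choice for the per-park species count.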

Next, I narrow down to just the Species ID column and change the name of it to reflect just ‘Species’.

species_counts = all_species_data[['Species ID']].copy()
species_counts = species_counts.rename(columns={'Species ID': 'Species'})

The parks data will need the park names set as the index as well. Then the two dataframes will be in good shape to merge. Call merge() on one dataframe, pass the other, and set the left_index and right_index arguments to True so that both are joined on their shared park name index.

parks_data = parks_data.set_index('Park Name')
richness_data = species_counts.merge(parks_data, left_index=True, right_index=True)

Preview the newly merged dataframe to confirm it looks correct.

richness_data.head()
A pandas dataframe shows merged species and park data columns.
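One thing worth checking after an index merge is whether any parks were dropped, since merge() performs an inner join by default. The sketch below uses invented parks and numbers (assumptions for illustration only) and the indicator option to show which rows matched.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_counts = pd.DataFrame(
    {'Species': [2, 1]},
    index=pd.Index(['Park A', 'Park B'], name='Park Name'),
)
toy_parks = pd.DataFrame(
    {'Park Name': ['Park A', 'Park C'], 'Acres': [1000, 500]}
).set_index('Park Name')

# Default inner join: only parks present in both frames survive.
toy_merged = toy_counts.merge(toy_parks, left_index=True, right_index=True)

# Outer join with indicator=True adds a '_merge' column showing each row's origin.
toy_outer = toy_counts.merge(
    toy_parks, left_index=True, right_index=True, how='outer', indicator=True
)
print(toy_merged)
print(toy_outer)
```

If the outer join shows only 'both' in the '_merge' column, the inner join lost nothing.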

To take this one step further, relate the species counts to park size. Create a function that divides the acres column by the species count column, which gives the number of acres per recorded species, and round the result down to an integer with math’s floor() function.

def species_abundance(df):
    # Acres divided by species count, rounded down, one value per park (row)
    return df.apply(
        lambda row: math.floor(row['Acres'] / row['Species']),
        axis=1
    )

Create a new column in the dataframe and apply the species abundance function.

richness_data['Species Abundance'] = species_abundance(richness_data)
richness_data.head()
A dataframe shows the number of species and the acres per species in individual national parks.
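The row-wise function above can also be expressed as a floor division on whole columns, which avoids calling a Python function once per row. This is a minimal sketch on invented numbers (the park names, species counts, and acreages are made up for illustration), showing that both forms give the same values.

```python
import math
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {'Species': [3, 4], 'Acres': [1000, 500]},
    index=['Park A', 'Park B'],
)

# Row-wise version, as in the function above.
row_wise = toy.apply(
    lambda row: math.floor(row['Acres'] / row['Species']),
    axis=1,
)

# Column-wise version: floor division on the two columns.
vectorized = toy['Acres'] // toy['Species']

print(row_wise.tolist())
print(vectorized.tolist())
```

On a dataframe with a few dozen rows the difference in speed is negligible, so the choice is mostly about readability.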

Calculate the mean of this value across all the parks. Many variables could affect this number, but it serves its purpose as a quick summary statistic: about 525 acres per recorded species.

print(richness_data['Species Abundance'].mean())
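Because park sizes vary widely, a mean can be pulled upward by a few very large parks. As a hedged illustration on invented values (not the article's results), the sketch below compares the mean with the median, which is less sensitive to such outliers.

```python
import pandas as pd

# Toy values, invented for illustration only.
toy_values = pd.Series([100, 200, 300, 5000])

toy_mean = toy_values.mean()      # pulled up by the single large value
toy_median = toy_values.median()  # middle of the sorted values

print(toy_mean, toy_median)
```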

Species Type Distribution

Species distribution in this setting will be the share of each species category within each park. The different species categories in this dataset are: ‘Mammal’, ‘Bird’, ‘Reptile’, ‘Amphibian’, ‘Fish’, ‘Vascular Plant’, ‘Spider/Scorpion’, ‘Insect’, ‘Invertebrate’, ‘Fungi’, ‘Nonvascular Plant’, ‘Crab/Lobster/Shrimp’, ‘Slug/Snail’, ‘Algae’.

To extract just the park names and categories, create a new dataframe with just these columns.

types = species_data[['Park Name', 'Category']].copy()
A pandas dataframe shows one row per species record of a given park and category.

Use pandas’ groupby() function to group by park name and category. Then call size() to count the rows in each park and category pair, and unstack() to pivot the category level of the index into one column per category. I pass fill_value=0 to unstack() so that a category missing from a park shows 0 instead of NaN, which keeps later math operations consistent.

df = types.groupby(['Park Name','Category']).size().unstack(fill_value=0)
df.head()
A pandas dataframe shows the number of species in each category per park.
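The two steps are easier to follow on a small example. This sketch uses invented park names and records (assumptions for illustration only) to show the intermediate result of size() and the wide table that unstack() produces.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_types = pd.DataFrame({
    'Park Name': ['Park A', 'Park A', 'Park A', 'Park B'],
    'Category': ['Bird', 'Bird', 'Mammal', 'Bird'],
})

# size() returns a Series indexed by (Park Name, Category) with row counts.
toy_sizes = toy_types.groupby(['Park Name', 'Category']).size()

# unstack() moves the Category level into columns; Park B has no Mammal rows,
# so fill_value=0 puts a 0 there instead of NaN.
toy_wide = toy_sizes.unstack(fill_value=0)

print(toy_sizes)
print(toy_wide)
```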

Next, put the counts from each park on a common scale of 0 to 100. Use pandas’ div() function to divide each row by its row sum. Then multiply each value by 100 so the results read as percentages and translate well to visuals.

ratios = df.div(df.sum(axis=1), axis=0).multiply(100)
ratios.head()
A pandas dataframe shows raw values for each park's species category ratios.
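A quick way to confirm the normalization is to check that every row adds up to 100. The sketch below does this on invented counts (the parks and numbers are made up for illustration).

```python
import pandas as pd

# Toy counts, invented for illustration only.
toy_wide = pd.DataFrame(
    {'Bird': [2, 1], 'Mammal': [2, 3]},
    index=['Park A', 'Park B'],
)

# axis=1 sums across columns (one total per park);
# axis=0 in div() aligns those totals with the row index.
toy_ratios = toy_wide.div(toy_wide.sum(axis=1), axis=0).multiply(100)

print(toy_ratios)
print(toy_ratios.sum(axis=1))
```

After rounding to two decimals, row totals can drift slightly from 100, so this check is best run before rounding.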

To clean the dataframe values further, round each value to 2 decimal places using the dataframe’s round() method.

rounded = ratios.round(2)
rounded.head()
A pandas dataframe shows rounded values for ratios in each park's species category.

Create a box plot using matplotlib and seaborn to show the breakdown of species categories across all parks.

f, ax = plt.subplots(figsize=(12, 8))
sb.boxplot(data=rounded, orient='h')
ax.xaxis.grid(True)
ax.set(ylabel="")
ax.set(xlim=(0,100))
sb.despine(trim=True, left=True)
plt.title('Species Category Distribution in the National Parks', fontsize=16)
plt.show()
A boxplot shows the species category types and their distributions.

Optionally, export the species categories dataframe as a CSV for further use.

Park Name,Algae,Amphibian,Bird,Crab/Lobster/Shrimp,Fish,Fungi,Insect,Invertebrate,Mammal,Nonvascular Plant,Reptile,Slug/Snail,Spider/Scorpion,Vascular Plant
Acadia National Park,0.0,0.88,21.3,0.0,2.22,0.0,0.0,0.0,3.22,0.0,0.64,0.0,0.0,71.74
Arches National Park,0.0,0.76,19.56,0.0,1.05,0.0,0.0,0.0,5.63,0.0,1.91,0.0,0.0,71.09
Badlands National Park,0.0,0.72,17.21,0.0,1.73,12.46,17.21,0.07,4.61,0.0,0.94,0.0,0.07,45.0
Big Bend National Park,0.0,0.57,18.29,0.0,2.34,0.0,0.0,0.0,3.92,2.12,2.73,0.0,0.0,70.03
Biscayne National Park,0.0,0.46,13.5,0.0,47.39,0.0,0.64,1.97,1.62,0.0,2.32,0.0,0.0,32.1
Black Canyon of the Gunnison National Park,0.0,0.18,15.82,0.0,1.45,0.0,0.0,0.0,6.06,0.0,0.99,0.0,0.0,75.5
Bryce Canyon National Park,0.0,0.31,16.87,0.0,0.08,0.0,0.0,0.0,5.91,0.0,1.01,0.0,0.0,75.82
Canyonlands National Park,0.0,0.57,17.99,0.0,2.7,0.0,0.0,0.0,6.21,0.0,1.8,0.0,0.0,70.73
Capitol Reef National Park,0.0,0.38,15.84,0.0,0.96,0.0,0.0,0.0,4.66,0.0,1.34,0.0,0.0,76.82
Carlsbad Caverns National Park,0.0,0.98,23.89,0.0,0.33,0.0,0.0,0.0,5.99,0.0,4.04,0.0,0.0,64.78
Channel Islands National Park,3.24,0.21,18.94,0.58,14.48,0.0,0.11,10.4,2.33,3.61,0.58,1.75,0.05,43.71
Congaree National Park,3.19,1.85,8.62,0.26,2.8,12.02,26.58,0.65,1.68,0.3,2.15,0.9,0.9,38.09
Crater Lake National Park,5.8,0.53,6.7,1.06,0.35,5.11,26.44,1.81,2.55,5.13,0.53,0.24,0.45,43.3
Cuyahoga Valley National Park,0.0,1.24,12.67,0.41,4.38,0.0,11.7,1.29,2.42,0.0,1.18,0.77,0.1,63.83
Death Valley National Park,1.35,1.6,11.96,0.86,0.2,2.77,20.34,0.41,4.78,0.5,2.03,1.62,0.72,50.87
Denali National Park and Preserve,0.0,0.08,13.56,0.0,1.06,2.2,3.86,0.23,3.26,12.05,0.0,0.0,0.0,63.71
Dry Tortugas National Park,0.0,0.0,33.37,0.0,33.14,0.0,0.0,4.95,0.71,0.0,0.59,0.0,0.0,27.24
Everglades National Park,0.0,0.82,17.75,0.0,20.44,0.0,0.0,0.0,2.02,0.0,2.93,0.0,0.0,56.05
Gates Of The Arctic National Park and Preserve,0.0,0.07,9.9,0.0,1.26,35.03,0.0,0.0,2.88,0.0,0.0,0.0,0.0,50.85
Glacier Bay National Park and Preserve,3.27,0.26,13.18,4.91,18.34,0.1,1.79,5.98,2.96,5.01,0.15,1.89,0.15,42.0
Glacier National Park,0.08,0.23,10.84,0.23,1.06,10.8,7.71,0.08,2.7,15.81,0.16,0.78,0.0,49.53
Grand Canyon National Park,0.0,0.57,17.39,0.0,1.11,0.0,2.17,0.04,4.04,0.0,2.9,0.08,5.42,66.29
Grand Teton National Park,0.05,0.34,13.1,0.05,1.03,1.38,7.59,0.25,3.65,0.0,0.25,0.2,0.05,72.07
Great Basin National Park,0.0,0.83,12.44,0.23,0.79,0.57,17.19,1.09,3.88,0.6,2.19,1.13,0.68,58.39
Great Sand Dunes National Park and Preserve,0.0,0.63,25.21,0.0,0.63,0.0,0.11,0.0,7.14,0.11,0.84,0.0,0.0,65.34
Great Smoky Mountains National Park,0.0,0.92,4.11,0.15,1.62,9.54,36.45,1.27,1.42,7.97,0.77,1.39,1.57,32.83
Guadalupe Mountains National Park,0.0,0.69,15.58,0.29,0.17,3.78,6.59,0.4,4.35,0.29,3.21,3.32,0.11,61.23
Haleakala National Park,0.0,0.12,1.71,0.7,0.23,2.56,41.12,1.43,0.58,8.8,0.39,1.82,1.78,38.76
Hawaii Volcanoes National Park,0.0,0.12,2.37,1.7,0.12,0.21,43.33,5.25,0.45,4.09,0.39,2.15,3.06,36.75
Hot Springs National Park,1.23,1.38,19.85,0.46,4.62,0.0,0.77,1.13,2.67,0.92,2.67,0.1,0.0,64.21
Isle Royale National Park,0.0,0.93,18.68,1.0,4.51,0.0,0.0,1.5,1.86,0.07,0.36,0.0,0.0,71.08
Joshua Tree National Park,0.22,0.22,13.12,0.0,0.04,1.66,14.91,0.44,2.92,0.26,2.27,0.0,6.67,57.28
Katmai National Park and Preserve,0.0,0.08,18.12,0.33,3.59,8.57,0.57,2.37,4.41,1.63,0.0,0.57,0.0,59.76
Kenai Fjords National Park,0.0,0.09,22.23,0.0,3.97,0.19,0.28,3.97,5.2,0.0,0.0,0.38,0.0,63.67
Kobuk Valley National Park,0.0,0.1,12.2,0.0,2.54,32.49,0.0,0.0,3.71,2.54,0.0,0.0,0.0,46.44
Lake Clark National Park and Preserve,0.0,0.05,9.62,0.05,2.74,10.96,0.15,0.3,2.49,14.0,0.0,0.0,0.0,59.64
Lassen Volcanic National Park,0.11,0.95,13.63,0.83,1.11,2.84,5.29,1.0,5.56,8.9,1.22,0.33,0.0,58.21
Mammoth Cave National Park,0.0,1.32,8.4,0.04,4.88,0.0,10.8,2.16,2.24,0.04,1.68,0.0,0.08,68.35
Mesa Verde National Park,0.0,0.64,19.07,0.0,0.0,0.0,0.0,0.0,6.92,0.0,1.69,0.0,0.0,71.68
Mount Rainier National Park,0.0,0.92,10.73,0.0,1.32,1.78,2.7,0.0,3.96,20.48,0.29,0.0,0.0,57.83
North Cascades National Park,0.0,0.36,6.72,0.0,0.98,16.03,16.62,0.0,2.35,11.45,0.3,0.0,1.25,43.95
Olympic National Park,0.0,0.82,15.91,0.0,4.98,0.0,4.47,0.0,4.11,0.0,0.31,0.0,0.0,69.4
Petrified Forest National Park,0.0,0.94,28.6,0.0,0.0,0.12,0.0,0.0,7.27,0.12,2.46,0.0,0.0,60.49
Pinnacles National Park,0.0,0.71,12.01,0.56,0.42,1.91,22.81,2.19,4.24,0.0,2.05,0.78,0.71,51.62
Redwood National Park,1.76,0.52,7.94,1.93,3.91,21.6,11.79,5.29,2.44,3.98,0.62,2.33,0.11,35.77
Rocky Mountain National Park,4.76,0.16,8.79,1.24,0.38,9.71,21.45,1.52,2.35,13.2,0.1,0.32,0.7,35.34
Saguaro National Park,0.0,0.55,13.41,0.05,0.0,0.11,0.0,0.0,5.56,0.0,3.38,0.0,0.0,76.94
Sequoia and Kings Canyon National Parks,0.0,0.65,11.03,0.0,0.95,0.0,0.0,0.0,4.46,0.0,1.2,0.0,0.0,81.7
Shenandoah National Park,0.0,0.86,5.76,0.06,0.88,16.43,6.83,0.04,1.35,7.52,0.82,0.04,0.09,59.31
Theodore Roosevelt National Park,0.0,0.69,19.14,0.09,2.75,0.09,6.27,0.26,5.67,5.24,1.12,0.17,0.17,58.37
Voyageurs National Park,0.0,1.03,16.38,0.0,3.99,0.21,2.27,0.48,4.34,0.76,0.41,0.0,0.0,70.13
Wind Cave National Park,0.0,0.5,16.85,0.0,0.57,3.08,7.53,0.0,6.38,0.0,0.86,1.79,0.0,62.44
Wrangell - St Elias National Park and Preserve,0.0,0.11,11.75,0.0,5.18,0.0,3.23,0.22,3.45,0.39,0.06,0.06,0.0,75.56
Yellowstone National Park,5.14,0.23,8.32,1.59,0.48,0.28,41.05,1.97,1.97,0.38,0.23,1.49,1.08,35.8
Yosemite National Park,0.0,0.72,12.93,0.0,0.48,0.0,0.0,0.0,4.21,0.0,1.05,0.0,0.0,80.6
Zion National Park,0.0,0.39,16.76,0.0,0.84,0.0,0.0,0.0,4.45,0.0,1.67,0.0,0.0,75.89
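The export step mentioned before the table could be done with the dataframe’s to_csv() method. The sketch below is a minimal example on an invented two-park frame; the file name species_category_ratios.csv and the temporary directory are assumptions for illustration, not names from the original notebooks.

```python
import os
import tempfile
import pandas as pd

# Toy data, invented for illustration only.
toy_rounded = pd.DataFrame(
    {'Bird': [50.0, 25.0], 'Mammal': [50.0, 75.0]},
    index=pd.Index(['Park A', 'Park B'], name='Park Name'),
)

# Hypothetical output path; any writable location works.
out_path = os.path.join(tempfile.gettempdir(), 'species_category_ratios.csv')

# The index (park names) is written as the first column by default.
toy_rounded.to_csv(out_path)

# Read it back, restoring the park names as the index.
reloaded = pd.read_csv(out_path, index_col='Park Name')
print(reloaded)
```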

Summary

Further exploration of the species category and count outputs might involve comparing a select number of parks against each other. Unique factors such as location, park size, and biomes provide opportunities for further analysis and insights.
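As one possible starting point for such a comparison, rows can be selected by park name with loc and subtracted from each other. The sketch below uses invented parks and percentages (assumptions for illustration only), not values from the dataset.

```python
import pandas as pd

# Toy percentages, invented for illustration only.
toy_rounded = pd.DataFrame(
    {
        'Bird': [20.0, 10.0, 15.0],
        'Mammal': [5.0, 3.0, 4.0],
        'Vascular Plant': [75.0, 87.0, 81.0],
    },
    index=pd.Index(['Park A', 'Park B', 'Park C'], name='Park Name'),
)

# Keep only the parks of interest.
selected = toy_rounded.loc[['Park A', 'Park B']]

# Percentage-point difference per category between the two parks.
difference = selected.loc['Park A'] - selected.loc['Park B']

print(selected)
print(difference)
```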