Here’s how to use Python and pandas to explore species data for the United States National Parks, finding the average species richness and the distribution of species categories. This walkthrough covers some of the built-in functions in pandas and how to use them for exploratory data analysis. The source data is available via Kaggle or the National Parks Species website. For a more in-depth look, the two Jupyter Notebooks containing the code below are in the associated GitHub repository.
Setup
Import the necessary packages, including pandas, matplotlib, seaborn, and math.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import math
Load the species dataset and the parks dataset with pandas. The species file is large and I used Jupyter Notebook for this, so I set the low_memory argument to False for the species data. With that setting pandas reads the file in one pass instead of in chunks when inferring column types, which avoids mixed-type warnings. The dataframes can be given any name.
species_data = pd.read_csv('species.csv', low_memory=False)
parks_data = pd.read_csv('parks.csv')
Preview the datasets with the head() and info() pandas functions. The columns of focus will be park name and category for the species dataframe, and park name and acres for the parks dataframe.
species_data.head()
species_data.info()
parks_data.head()
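Since only a few columns matter for this analysis, one option is to load just those columns. The following is a minimal sketch, not part of the original notebooks: the in-memory CSV text and its rows are hypothetical stand-ins for species.csv, and only the column names Species ID, Park Name, and Category come from the steps in this walkthrough.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for species.csv; the real file is much larger.
sample_csv = io.StringIO(
    "Species ID,Park Name,Category,Scientific Name\n"
    "ACAD-1000,Acadia National Park,Mammal,Alces alces\n"
    "ACAD-1001,Acadia National Park,Bird,Accipiter cooperii\n"
    "ARCH-1000,Arches National Park,Mammal,Antilocapra americana\n"
)

# usecols keeps only the listed columns, which reduces memory use on wide files.
species_subset = pd.read_csv(
    sample_csv,
    usecols=["Species ID", "Park Name", "Category"],
)

print(species_subset.columns.tolist())
print(species_subset.shape)
```

With the real file, the same usecols argument could be passed alongside low_memory=False.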
Species Richness
Species richness is the number of different species in a given area and can be a helpful metric for biodiversity.
For this, we’ll want to merge the species data and the parks data. The species data will need to be re-formatted beforehand.
Out of the box, the species data has one row per species record in each park. To get the number of species in each park, group the rows by the park name column with pandas’ groupby() function, and then use count(), which returns the number of non-null values in each remaining column for every park.
all_species_data = species_data.groupby(['Park Name']).count()
Next, I keep only the Species ID column and rename it to ‘Species’.
species_counts = all_species_data[['Species ID']].copy()
species_counts = species_counts.rename(columns={'Species ID' : 'Species'})
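To see why a single column is picked after count(), it can help to try the grouping on a tiny example. This is a sketch under assumptions: the toy dataframe and its values are hypothetical, and size() is shown as an alternative I am suggesting, not something the original notebooks use.

```python
import pandas as pd

# Hypothetical toy data standing in for species_data.
toy_species = pd.DataFrame({
    "Species ID": ["A-1", "A-2", "A-3", "B-1"],
    "Park Name": ["Park A", "Park A", "Park A", "Park B"],
    "Category": ["Bird", "Mammal", None, "Bird"],
})

# count() tallies non-null values column by column, so columns can disagree.
per_column = toy_species.groupby("Park Name").count()

# size() counts rows per group regardless of missing values.
per_row = toy_species.groupby("Park Name").size().rename("Species")

print(per_column)
print(per_row)
```

Here the Category count for Park A is lower than its Species ID count because of the missing value, which is why a column without gaps (such as Species ID) is the safer one to keep.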
The parks data will need the park names set as the index as well. Then, the two dataframes will be in good shape to merge. Call merge() on one dataframe, pass in the other, and give the left_index and right_index arguments values of True so that rows are matched on the park name index the two dataframes share.
parks_data = parks_data.set_index('Park Name')
richness_data = species_counts.merge(parks_data, left_index=True, right_index=True)
Preview the newly merged dataframe to confirm it looks correct.
richness_data.head()
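One thing worth checking after an index merge is whether any parks were dropped. The sketch below uses hypothetical toy dataframes (the park names and numbers are made up) to show that merge() defaults to an inner join, and how the how and indicator arguments can surface rows that did not match.

```python
import pandas as pd

# Hypothetical toy frames, both indexed by park name like species_counts and parks_data.
toy_counts = pd.DataFrame(
    {"Species": [3, 1, 5]},
    index=pd.Index(["Park A", "Park B", "Park C"], name="Park Name"),
)
toy_parks = pd.DataFrame(
    {"Acres": [1000, 250]},
    index=pd.Index(["Park A", "Park B"], name="Park Name"),
)

# merge() defaults to an inner join, so "Park C" (no match in toy_parks) is dropped.
inner = toy_counts.merge(toy_parks, left_index=True, right_index=True)

# An outer join with indicator=True adds a _merge column showing where each row came from.
check = toy_counts.merge(
    toy_parks, left_index=True, right_index=True, how="outer", indicator=True
)
unmatched = check[check["_merge"] != "both"]

print(inner)
print(unmatched.index.tolist())
```

If the unmatched list is empty on the real data, every park appears in both files and the inner join loses nothing.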
To take this one step further, estimate the number of acres per species using the species counts. Create a function that divides the acres column by the species count column and rounds the result down to an integer with math’s floor() function.
def species_abundance(df):
    return df.apply(
        lambda row: math.floor(row['Acres'] / row['Species']),
        axis=1
    )
Create a new column in the dataframe and apply the species abundance function.
richness_data['Species Abundance'] = species_abundance(richness_data)
richness_data.head()
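The Acres / Species division in the function above can also be written without apply(). This is a sketch of an alternative, not the approach from the notebooks, and the toy values are hypothetical. It relies on the // operator, which performs floor division on whole columns at once.

```python
import math
import pandas as pd

# Hypothetical toy data with the same column names as richness_data.
toy = pd.DataFrame(
    {"Species": [3, 4], "Acres": [1000, 250]},
    index=pd.Index(["Park A", "Park B"], name="Park Name"),
)

# Row-by-row version, mirroring the function in the walkthrough.
row_wise = toy.apply(lambda row: math.floor(row["Acres"] / row["Species"]), axis=1)

# Vectorized version: floor division applied to the two columns directly.
vectorized = toy["Acres"] // toy["Species"]

print(vectorized)
```

Both versions give the same integers here. The vectorized form is shorter and usually faster on large dataframes.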
Calculate the mean acres per species across all the parks. There are a lot of variables to consider that might affect this number, but it serves its purpose as a quick summary statistic: about 525 acres per species.
print(richness_data['Species Abundance'].mean())
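Because park sizes vary widely, a mean like this can be pulled around by a few very large parks. As a small illustration with made-up numbers (not values from the dataset), comparing the mean to the median shows how much one outlier matters.

```python
import pandas as pd

# Hypothetical values; one very large entry dominates the mean.
toy_ratio = pd.Series([100, 200, 300, 9000], name="Species Abundance")

toy_mean = toy_ratio.mean()      # pulled upward by the 9000 outlier
toy_median = toy_ratio.median()  # middle of the distribution

print(toy_mean)
print(toy_median)
```

On the real column, richness_data['Species Abundance'].median() would be the equivalent call.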
Species Type Distribution
Species distribution in this setting is the share of each species category within each park. The different species categories in this data set are: ‘Mammal’, ‘Bird’, ‘Reptile’, ‘Amphibian’, ‘Fish’, ‘Vascular Plant’, ‘Spider/Scorpion’, ‘Insect’, ‘Invertebrate’, ‘Fungi’, ‘Nonvascular Plant’, ‘Crab/Lobster/Shrimp’, ‘Slug/Snail’, ‘Algae’.
To extract just the park names and categories, create a new dataframe with just these columns.
types = species_data[['Park Name', 'Category']].copy()
Use pandas’ groupby() function to group by park name and category. Then, use size() to count the rows in each park and category pair, and unstack() to pivot the category level into one column per category. I give the fill_value argument of the unstack function a value of 0, so that a category missing from a park shows as 0 rather than NaN, which keeps the values consistent for math operations.
df = types.groupby(['Park Name','Category']).size().unstack(fill_value=0)
df.head()
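The groupby, size, and unstack chain is easier to follow on a few rows. The sketch below uses a hypothetical toy dataframe, and it also shows pd.crosstab, which I am adding as an alternative that builds the same table of counts in one call.

```python
import pandas as pd

# Hypothetical toy data standing in for the types dataframe.
toy_types = pd.DataFrame({
    "Park Name": ["Park A", "Park A", "Park A", "Park B"],
    "Category": ["Bird", "Bird", "Mammal", "Bird"],
})

# size() gives one count per (park, category) pair as a Series with a two-level index.
counts = toy_types.groupby(["Park Name", "Category"]).size()

# unstack() pivots the inner index level (Category) into columns;
# fill_value=0 covers pairs that never occur, such as Mammal in Park B.
wide = counts.unstack(fill_value=0)

# pd.crosstab produces the same counts directly from the two columns.
same = pd.crosstab(toy_types["Park Name"], toy_types["Category"])

print(wide)
```

Park B has no Mammal rows, so without fill_value=0 that cell would be NaN.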
Next, put the counts on a common scale of 0 to 100. Use pandas’ div() function to divide each row by its row total. Then, multiply each value by 100 so the values read as percentages, which are easier to read and to plot.
ratios = df.div(df.sum(axis=1), axis=0).multiply(100)
ratios.head()
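A quick sanity check for this step is that every row should add up to 100. Here is a minimal sketch on hypothetical counts. The axis=0 argument to div() is what lines up each row total with its own row.

```python
import pandas as pd

# Hypothetical category counts for two parks.
toy_counts = pd.DataFrame(
    {"Bird": [2, 1], "Mammal": [1, 0], "Fish": [1, 3]},
    index=pd.Index(["Park A", "Park B"], name="Park Name"),
)

# Total species per park (one value per row).
row_totals = toy_counts.sum(axis=1)

# Divide each row by its own total, then scale to percentages.
toy_ratios = toy_counts.div(row_totals, axis=0).multiply(100)

print(toy_ratios)
print(toy_ratios.sum(axis=1))
```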
To clean the dataframe values further, round each value to 2 decimal places using pandas’ round() method.
rounded = ratios.round(2)
rounded.head()
Create a box plot using matplotlib and seaborn to show the breakdown of species categories across all parks.
f, ax = plt.subplots(figsize=(12, 8))
sb.boxplot(data=rounded, orient='h')
ax.xaxis.grid(True)
ax.set(ylabel="")
ax.set(xlim=(0,100))
sb.despine(trim=True, left=True)
plt.title('Species Category Distribution in the National Parks', fontsize=16)
plt.show()
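To read exact numbers off the plot, the quartiles that each box is drawn from can be computed directly. This is a sketch on hypothetical percentages, assuming the default linear interpolation in pandas’ quantile(), so the values should roughly match the box edges and center lines.

```python
import pandas as pd

# Hypothetical category percentages for four parks.
toy_pct = pd.DataFrame(
    {"Bird": [10.0, 20.0, 30.0, 40.0], "Fish": [0.0, 2.0, 4.0, 50.0]},
    index=["Park A", "Park B", "Park C", "Park D"],
)

# One row per category: lower quartile, median, and upper quartile across parks.
summary = toy_pct.quantile([0.25, 0.5, 0.75]).T
summary.columns = ["Q1", "Median", "Q3"]

print(summary)
```

On the real data, the same call on the rounded dataframe gives one row per species category.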
Optionally, export the species categories dataframe as a CSV for further use.
Park Name | Algae | Amphibian | Bird | Crab/Lobster/Shrimp | Fish | Fungi | Insect | Invertebrate | Mammal | Nonvascular Plant | Reptile | Slug/Snail | Spider/Scorpion | Vascular Plant |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Acadia National Park | 0.0 | 0.88 | 21.3 | 0.0 | 2.22 | 0.0 | 0.0 | 0.0 | 3.22 | 0.0 | 0.64 | 0.0 | 0.0 | 71.74 |
Arches National Park | 0.0 | 0.76 | 19.56 | 0.0 | 1.05 | 0.0 | 0.0 | 0.0 | 5.63 | 0.0 | 1.91 | 0.0 | 0.0 | 71.09 |
Badlands National Park | 0.0 | 0.72 | 17.21 | 0.0 | 1.73 | 12.46 | 17.21 | 0.07 | 4.61 | 0.0 | 0.94 | 0.0 | 0.07 | 45.0 |
Big Bend National Park | 0.0 | 0.57 | 18.29 | 0.0 | 2.34 | 0.0 | 0.0 | 0.0 | 3.92 | 2.12 | 2.73 | 0.0 | 0.0 | 70.03 |
Biscayne National Park | 0.0 | 0.46 | 13.5 | 0.0 | 47.39 | 0.0 | 0.64 | 1.97 | 1.62 | 0.0 | 2.32 | 0.0 | 0.0 | 32.1 |
Black Canyon of the Gunnison National Park | 0.0 | 0.18 | 15.82 | 0.0 | 1.45 | 0.0 | 0.0 | 0.0 | 6.06 | 0.0 | 0.99 | 0.0 | 0.0 | 75.5 |
Bryce Canyon National Park | 0.0 | 0.31 | 16.87 | 0.0 | 0.08 | 0.0 | 0.0 | 0.0 | 5.91 | 0.0 | 1.01 | 0.0 | 0.0 | 75.82 |
Canyonlands National Park | 0.0 | 0.57 | 17.99 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 6.21 | 0.0 | 1.8 | 0.0 | 0.0 | 70.73 |
Capitol Reef National Park | 0.0 | 0.38 | 15.84 | 0.0 | 0.96 | 0.0 | 0.0 | 0.0 | 4.66 | 0.0 | 1.34 | 0.0 | 0.0 | 76.82 |
Carlsbad Caverns National Park | 0.0 | 0.98 | 23.89 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 5.99 | 0.0 | 4.04 | 0.0 | 0.0 | 64.78 |
Channel Islands National Park | 3.24 | 0.21 | 18.94 | 0.58 | 14.48 | 0.0 | 0.11 | 10.4 | 2.33 | 3.61 | 0.58 | 1.75 | 0.05 | 43.71 |
Congaree National Park | 3.19 | 1.85 | 8.62 | 0.26 | 2.8 | 12.02 | 26.58 | 0.65 | 1.68 | 0.3 | 2.15 | 0.9 | 0.9 | 38.09 |
Crater Lake National Park | 5.8 | 0.53 | 6.7 | 1.06 | 0.35 | 5.11 | 26.44 | 1.81 | 2.55 | 5.13 | 0.53 | 0.24 | 0.45 | 43.3 |
Cuyahoga Valley National Park | 0.0 | 1.24 | 12.67 | 0.41 | 4.38 | 0.0 | 11.7 | 1.29 | 2.42 | 0.0 | 1.18 | 0.77 | 0.1 | 63.83 |
Death Valley National Park | 1.35 | 1.6 | 11.96 | 0.86 | 0.2 | 2.77 | 20.34 | 0.41 | 4.78 | 0.5 | 2.03 | 1.62 | 0.72 | 50.87 |
Denali National Park and Preserve | 0.0 | 0.08 | 13.56 | 0.0 | 1.06 | 2.2 | 3.86 | 0.23 | 3.26 | 12.05 | 0.0 | 0.0 | 0.0 | 63.71 |
Dry Tortugas National Park | 0.0 | 0.0 | 33.37 | 0.0 | 33.14 | 0.0 | 0.0 | 4.95 | 0.71 | 0.0 | 0.59 | 0.0 | 0.0 | 27.24 |
Everglades National Park | 0.0 | 0.82 | 17.75 | 0.0 | 20.44 | 0.0 | 0.0 | 0.0 | 2.02 | 0.0 | 2.93 | 0.0 | 0.0 | 56.05 |
Gates Of The Arctic National Park and Preserve | 0.0 | 0.07 | 9.9 | 0.0 | 1.26 | 35.03 | 0.0 | 0.0 | 2.88 | 0.0 | 0.0 | 0.0 | 0.0 | 50.85 |
Glacier Bay National Park and Preserve | 3.27 | 0.26 | 13.18 | 4.91 | 18.34 | 0.1 | 1.79 | 5.98 | 2.96 | 5.01 | 0.15 | 1.89 | 0.15 | 42.0 |
Glacier National Park | 0.08 | 0.23 | 10.84 | 0.23 | 1.06 | 10.8 | 7.71 | 0.08 | 2.7 | 15.81 | 0.16 | 0.78 | 0.0 | 49.53 |
Grand Canyon National Park | 0.0 | 0.57 | 17.39 | 0.0 | 1.11 | 0.0 | 2.17 | 0.04 | 4.04 | 0.0 | 2.9 | 0.08 | 5.42 | 66.29 |
Grand Teton National Park | 0.05 | 0.34 | 13.1 | 0.05 | 1.03 | 1.38 | 7.59 | 0.25 | 3.65 | 0.0 | 0.25 | 0.2 | 0.05 | 72.07 |
Great Basin National Park | 0.0 | 0.83 | 12.44 | 0.23 | 0.79 | 0.57 | 17.19 | 1.09 | 3.88 | 0.6 | 2.19 | 1.13 | 0.68 | 58.39 |
Great Sand Dunes National Park and Preserve | 0.0 | 0.63 | 25.21 | 0.0 | 0.63 | 0.0 | 0.11 | 0.0 | 7.14 | 0.11 | 0.84 | 0.0 | 0.0 | 65.34 |
Great Smoky Mountains National Park | 0.0 | 0.92 | 4.11 | 0.15 | 1.62 | 9.54 | 36.45 | 1.27 | 1.42 | 7.97 | 0.77 | 1.39 | 1.57 | 32.83 |
Guadalupe Mountains National Park | 0.0 | 0.69 | 15.58 | 0.29 | 0.17 | 3.78 | 6.59 | 0.4 | 4.35 | 0.29 | 3.21 | 3.32 | 0.11 | 61.23 |
Haleakala National Park | 0.0 | 0.12 | 1.71 | 0.7 | 0.23 | 2.56 | 41.12 | 1.43 | 0.58 | 8.8 | 0.39 | 1.82 | 1.78 | 38.76 |
Hawaii Volcanoes National Park | 0.0 | 0.12 | 2.37 | 1.7 | 0.12 | 0.21 | 43.33 | 5.25 | 0.45 | 4.09 | 0.39 | 2.15 | 3.06 | 36.75 |
Hot Springs National Park | 1.23 | 1.38 | 19.85 | 0.46 | 4.62 | 0.0 | 0.77 | 1.13 | 2.67 | 0.92 | 2.67 | 0.1 | 0.0 | 64.21 |
Isle Royale National Park | 0.0 | 0.93 | 18.68 | 1.0 | 4.51 | 0.0 | 0.0 | 1.5 | 1.86 | 0.07 | 0.36 | 0.0 | 0.0 | 71.08 |
Joshua Tree National Park | 0.22 | 0.22 | 13.12 | 0.0 | 0.04 | 1.66 | 14.91 | 0.44 | 2.92 | 0.26 | 2.27 | 0.0 | 6.67 | 57.28 |
Katmai National Park and Preserve | 0.0 | 0.08 | 18.12 | 0.33 | 3.59 | 8.57 | 0.57 | 2.37 | 4.41 | 1.63 | 0.0 | 0.57 | 0.0 | 59.76 |
Kenai Fjords National Park | 0.0 | 0.09 | 22.23 | 0.0 | 3.97 | 0.19 | 0.28 | 3.97 | 5.2 | 0.0 | 0.0 | 0.38 | 0.0 | 63.67 |
Kobuk Valley National Park | 0.0 | 0.1 | 12.2 | 0.0 | 2.54 | 32.49 | 0.0 | 0.0 | 3.71 | 2.54 | 0.0 | 0.0 | 0.0 | 46.44 |
Lake Clark National Park and Preserve | 0.0 | 0.05 | 9.62 | 0.05 | 2.74 | 10.96 | 0.15 | 0.3 | 2.49 | 14.0 | 0.0 | 0.0 | 0.0 | 59.64 |
Lassen Volcanic National Park | 0.11 | 0.95 | 13.63 | 0.83 | 1.11 | 2.84 | 5.29 | 1.0 | 5.56 | 8.9 | 1.22 | 0.33 | 0.0 | 58.21 |
Mammoth Cave National Park | 0.0 | 1.32 | 8.4 | 0.04 | 4.88 | 0.0 | 10.8 | 2.16 | 2.24 | 0.04 | 1.68 | 0.0 | 0.08 | 68.35 |
Mesa Verde National Park | 0.0 | 0.64 | 19.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.92 | 0.0 | 1.69 | 0.0 | 0.0 | 71.68 |
Mount Rainier National Park | 0.0 | 0.92 | 10.73 | 0.0 | 1.32 | 1.78 | 2.7 | 0.0 | 3.96 | 20.48 | 0.29 | 0.0 | 0.0 | 57.83 |
North Cascades National Park | 0.0 | 0.36 | 6.72 | 0.0 | 0.98 | 16.03 | 16.62 | 0.0 | 2.35 | 11.45 | 0.3 | 0.0 | 1.25 | 43.95 |
Olympic National Park | 0.0 | 0.82 | 15.91 | 0.0 | 4.98 | 0.0 | 4.47 | 0.0 | 4.11 | 0.0 | 0.31 | 0.0 | 0.0 | 69.4 |
Petrified Forest National Park | 0.0 | 0.94 | 28.6 | 0.0 | 0.0 | 0.12 | 0.0 | 0.0 | 7.27 | 0.12 | 2.46 | 0.0 | 0.0 | 60.49 |
Pinnacles National Park | 0.0 | 0.71 | 12.01 | 0.56 | 0.42 | 1.91 | 22.81 | 2.19 | 4.24 | 0.0 | 2.05 | 0.78 | 0.71 | 51.62 |
Redwood National Park | 1.76 | 0.52 | 7.94 | 1.93 | 3.91 | 21.6 | 11.79 | 5.29 | 2.44 | 3.98 | 0.62 | 2.33 | 0.11 | 35.77 |
Rocky Mountain National Park | 4.76 | 0.16 | 8.79 | 1.24 | 0.38 | 9.71 | 21.45 | 1.52 | 2.35 | 13.2 | 0.1 | 0.32 | 0.7 | 35.34 |
Saguaro National Park | 0.0 | 0.55 | 13.41 | 0.05 | 0.0 | 0.11 | 0.0 | 0.0 | 5.56 | 0.0 | 3.38 | 0.0 | 0.0 | 76.94 |
Sequoia and Kings Canyon National Parks | 0.0 | 0.65 | 11.03 | 0.0 | 0.95 | 0.0 | 0.0 | 0.0 | 4.46 | 0.0 | 1.2 | 0.0 | 0.0 | 81.7 |
Shenandoah National Park | 0.0 | 0.86 | 5.76 | 0.06 | 0.88 | 16.43 | 6.83 | 0.04 | 1.35 | 7.52 | 0.82 | 0.04 | 0.09 | 59.31 |
Theodore Roosevelt National Park | 0.0 | 0.69 | 19.14 | 0.09 | 2.75 | 0.09 | 6.27 | 0.26 | 5.67 | 5.24 | 1.12 | 0.17 | 0.17 | 58.37 |
Voyageurs National Park | 0.0 | 1.03 | 16.38 | 0.0 | 3.99 | 0.21 | 2.27 | 0.48 | 4.34 | 0.76 | 0.41 | 0.0 | 0.0 | 70.13 |
Wind Cave National Park | 0.0 | 0.5 | 16.85 | 0.0 | 0.57 | 3.08 | 7.53 | 0.0 | 6.38 | 0.0 | 0.86 | 1.79 | 0.0 | 62.44 |
Wrangell - St Elias National Park and Preserve | 0.0 | 0.11 | 11.75 | 0.0 | 5.18 | 0.0 | 3.23 | 0.22 | 3.45 | 0.39 | 0.06 | 0.06 | 0.0 | 75.56 |
Yellowstone National Park | 5.14 | 0.23 | 8.32 | 1.59 | 0.48 | 0.28 | 41.05 | 1.97 | 1.97 | 0.38 | 0.23 | 1.49 | 1.08 | 35.8 |
Yosemite National Park | 0.0 | 0.72 | 12.93 | 0.0 | 0.48 | 0.0 | 0.0 | 0.0 | 4.21 | 0.0 | 1.05 | 0.0 | 0.0 | 80.6 |
Zion National Park | 0.0 | 0.39 | 16.76 | 0.0 | 0.84 | 0.0 | 0.0 | 0.0 | 4.45 | 0.0 | 1.67 | 0.0 | 0.0 | 75.89 |
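The optional CSV export mentioned before the table could look something like the sketch below. The small dataframe reuses the Bird and Mammal values for the first two parks in the table, and the output filename in the comment is a hypothetical choice.

```python
import pandas as pd

# Two parks and two categories copied from the table above, for illustration.
toy_rounded = pd.DataFrame(
    {"Bird": [21.3, 19.56], "Mammal": [3.22, 5.63]},
    index=pd.Index(["Acadia National Park", "Arches National Park"], name="Park Name"),
)

# With no path, to_csv() returns the CSV text, which is handy for a quick check.
csv_text = toy_rounded.to_csv()
print(csv_text)

# Passing a path writes a file instead; the filename is an assumption.
# toy_rounded.to_csv("species_categories.csv")
```

Because Park Name is the index, it is written as the first column of the CSV by default.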
Summary
Further exploration of the species category and count outputs might involve comparing a select number of parks against each other. Unique factors such as location, park size, and biomes provide opportunities for further analysis and insights.