Georgia Tech’s OMSA: Halfway Point Reflections

In fall 2021, I started Georgia Tech’s Online Master’s of Science in Analytics (OMSA). Here are some thoughts so far on the courses I’ve taken and overall experience as I head into my sixth class of the program.

Lettie Pate Whitehead Evans Administration Building at Night
Lettie Pate Whitehead Evans Administration Building at Georgia Tech in Atlanta, December 2022.

OMSA Program Background

Georgia Tech’s OMSA program is one of a few well-known online graduate programs in the data community. As data science and analytics become more mainstream and academia leans further into online curricula, I assume similar program offerings will continue to grow. I heard about Georgia Tech’s Master’s in Analytics initially through my brother, who knew about their online cybersecurity program. My main deciding factors were the online format and price tag of about $10,000 USD. I completely support building data skills through free MOOCs (shoutout to Codecademy and O’Reilly, both of which I still find to be great resources). However, I figured the structure, schedule, and breadth of what Georgia Tech offered would keep me accountable in my studies. There are three different tracks for coursework: Analytical Tools, Business Analytics, and Computational Data Analytics. I am in the Analytical Tools track.

Class Reviews

CSE 6040: Computing for Data Analysis – Fall Semester 2021

This class works primarily in Python to illustrate computing concepts. The homework assignments and tests were auto-graded from Jupyter notebooks and were open book and open internet. The instant feedback was helpful, but overall I found the time restrictions for the midterm and final challenging. Creating a comprehensive Python file with code snippets from the whole class helped with quick searches during the final. There were some optional course items, like a project that could be submitted for extra credit. Aside from some difficulties with overthinking, the class was fairly enjoyable for the range of information it covered.

ISYE 6501: Introduction to Analytics Modeling – Fall Semester 2021

ISYE 6501 is essentially a whirlwind tour of analytics modeling concepts through R. Homework assignments are peer-graded and quizzes allow for a restricted number of note sheets (1 or 2 pages). This was a solid refresher in R and I found the content helpful for other classes like Regression Analysis. The course provided a decent background to the theory and troubleshooting involved in real world analytics problems.

MGT 8803: Business Fundamentals for Analytics – Spring Semester 2022

The class covered as many business concepts as possible, including marketing, accounting, and supply chain optimization. There were different professors who helped guide the modules, so it was interesting to have that mix for the course. The level of straight memorization required to succeed on graded assignments was a bit much for me. Unfortunately, this was a required course, so whether or not the evaluation style was in my lane, I had to deal with it.

ISYE 6414: Regression Analysis – Summer Semester 2022

Regression Analysis gave a breadth of model building and illustrated underlying concepts behind each. The course uses R and material included cleaning and transforming data, variable selection, and linear and logistic regression. The homework assignments were fairly simple and I found having code snippets prepared in one file for the open book portion of exams to be helpful.

CSE 6242: Data and Visual Analytics – Fall Semester 2022

As this is one of the advanced requirements for the OMSA program, I was a little hesitant about the learning curve this course was rumored to have. The class was a grand tour of dabbling in different languages and programs like Python, SQL, Spark, and D3. A course project is a huge component of this class, and it was nice to collaborate with classmates and translate what we learned into something tangible. The homework assignments were a major bummer, mainly because you could have a solution that looked exactly like the answer but still get zero or minimal points from the auto-grader. There are no explicit homework solutions, and I struggled to understand what exactly needed to be corrected when I did not get full credit. Luckily, there were ample opportunities for extra credit to make up for any missed points from the homework.

ISYE 7406: Data Mining and Statistical Learning – Spring Semester 2023

I’m currently a couple weeks into the course for the Spring 2023 semester but so far the blend of theoretical background for statistics and practical R analysis has been manageable.

Balancing Professional, Academic, and Social Obligations

I attended a meetup in December 2022 at the main Atlanta campus for the Analytics and Cybersecurity programs and reflected on my personal experience after speaking with fellow students and alumni. There seemed to be a decent mix of professional backgrounds and mostly everyone I spoke with also worked full time for the duration of the program.

Personally, I have found that one course a semester has worked best for me. The only semester I doubled up was my first semester for two of the required core classes: ISYE 6501 and CSE 6040. In hindsight I think I would have been better off just taking one course instead of constantly feeling like I was flipping back and forth between the two.

For time spent on homework and studying, I have found that chipping away a little bit every day has given me the best results thus far. It tends to keep concepts fresh, as opposed to taking breaks of a couple of days between learning material. There have still been times when I have taken breaks for travel or vacation, but a lot of the courses release modules on a schedule that leaves some room for front-loading assignments and learning.

Overall OMSA Experience

I’ve been pleasantly surprised in the program so far but still have plenty of days where I have the same frustrations that anyone else might with learning new material. The practical translation of concepts from the courses seems to be the main draw and highlight for folks in the OMSA program. I look forward to continuing to build skills in future classes. I am likely taking MGT 6203 (Data Analytics in Business) for summer 2023 and then ISYE 6669 (Deterministic Optimization) for fall 2023.

Species Richness and Distribution in the National Parks with pandas

Here’s how to use Python and pandas to explore species data for the United States National Parks to find the average species richness and the distribution of species categories. This goes over some of the built-in functions in pandas and how to use them for exploratory data analysis. The source data is available via Kaggle or the National Parks Species website. The associated GitHub repository, which gives a more in-depth look at the two Jupyter Notebooks for the code below, is available here.

An elk sits off a trail at Yellowstone National Park as visitors walk by.
An elk sits near a trail in Yellowstone National Park, May 2017.

Setup

Import the necessary packages, including pandas, matplotlib, seaborn, and math.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import math

Load the species dataset and the parks dataset with pandas. Given the size of the species data and that I used Jupyter Notebook for this, I set the low_memory argument to False for the species data so it loads without too much trouble. The dataframes can be given any name.

species_data = pd.read_csv('species.csv', low_memory=False)
parks_data = pd.read_csv('parks.csv')

Preview the datasets with the head() and info() pandas functions. The columns of focus will be park name and category for the species dataframe, and park name and acres for the parks dataframe.

species_data.head()
A pandas dataframe shows columns and rows for species data.
species_data.info()
A preview of the species dataframe info shows all the columns, their counts, and data types.
parks_data.head()
A pandas dataframe shows columns and rows for parks data.

Species Richness

Species richness is the quantification of species in a given area and can be a helpful metric for biodiversity.

For this, we’ll want to merge the species data and the parks data. The species data will need to be re-formatted beforehand.

Out of the box, the species data has one row per species record in a given park. These rows need to be aggregated into counts to get the number of species in each park. Use the groupby() function in pandas, specifying the park name column, and then use count().

all_species_data = species_data.groupby(['Park Name']).count()
A pandas dataframe shows columns per each national park and their respective counts.
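
To make the behavior of count() concrete, here is a minimal sketch on a made-up three-row dataframe (the park rows and Species ID values are hypothetical, not taken from the real dataset). It shows that count() tallies non-null values per column, which is why a column with no missing values, such as Species ID, is a sensible choice for the species total. size() is shown for comparison because it counts rows regardless of missing values.

```python
import pandas as pd

# Hypothetical toy data standing in for species.csv (not the real dataset).
toy_species = pd.DataFrame({
    'Species ID': ['ACAD-1000', 'ACAD-1001', 'ARCH-1000'],
    'Park Name': ['Acadia National Park', 'Acadia National Park',
                  'Arches National Park'],
    'Category': ['Mammal', None, 'Bird'],  # one missing category on purpose
})

# count() reports the number of non-null values per column within each park.
toy_counts = toy_species.groupby(['Park Name']).count()

# size() reports the number of rows per park, ignoring missing values entirely.
toy_sizes = toy_species.groupby(['Park Name']).size()

print(toy_counts)
print(toy_sizes)
```

In this toy case the Category count for Acadia is lower than its Species ID count because of the missing value, while size() matches the Species ID count.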

Next, I narrow down to just the Species ID column and change the name of it to reflect just ‘Species’.

species_counts = all_species_data[['Species ID']].copy()
species_counts = species_counts.rename(columns={'Species ID': 'Species'})

The parks data will need the park names set as the index as well. Then, the two dataframes will be in good shape to merge. Specify the two dataframes, and give the left_index and right_index arguments values of True because both dataframes are indexed by park name.

parks_data = parks_data.set_index('Park Name')
richness_data = species_counts.merge(parks_data, left_index=True, right_index=True)
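
As a small illustration of merging on the index, the sketch below uses two made-up dataframes (park names and numbers are hypothetical). One thing worth knowing is that merge() defaults to an inner join, so a park that appears in only one of the two dataframes is dropped from the result.

```python
import pandas as pd

# Hypothetical species counts, already indexed by park name.
toy_counts = pd.DataFrame(
    {'Species': [3, 5]},
    index=pd.Index(['Park A', 'Park B'], name='Park Name'),
)

# Hypothetical parks table with one extra park that has no species rows.
toy_parks = pd.DataFrame({
    'Park Name': ['Park A', 'Park B', 'Park C'],
    'Acres': [100, 250, 400],
}).set_index('Park Name')

# Both frames share the same index, so match rows on it.
toy_merged = toy_counts.merge(toy_parks, left_index=True, right_index=True)

print(toy_merged)  # 'Park C' is absent because the default join is inner
```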

Preview the newly merged dataframe to confirm it looks correct.

richness_data.head()
A pandas dataframe shows merged species and park data columns.

To take this one step further, estimate how much area there is per species using the species counts. Create a function that divides the acres column by the species count column, and round the result down to an integer with math’s floor() function. Because acres are divided by species, the result is the number of acres per species rather than species per acre.

def species_abundance(df):
   return df.apply(
       lambda row:
         math.floor(
           row['Acres'] / row['Species']),
       axis=1
   )

Create a new column in the dataframe and apply the species abundance function.

richness_data['Species Abundance'] = species_abundance(richness_data)
richness_data.head()
A dataframe shows the number of species and the acres per species in individual national parks.

Calculate the mean of this column across all the parks. There are a lot of variables to consider that might affect this number, but it serves its purpose as a quick summary statistic: since the function divides acres by species, the result is about 525 acres per species.

print(richness_data['Species Abundance'].mean())
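
As an alternative sketch, the same calculation can be written without apply() by using floor division on whole columns. The dataframe and the column name 'Acres per Species' below are made up for illustration. For positive numbers, // gives the same result as math.floor() of the quotient, although it returns floats rather than integers when the acres column is stored as floats.

```python
import pandas as pd

# Hypothetical merged data with a species count and an acreage per park.
toy_richness = pd.DataFrame(
    {'Species': [3, 7], 'Acres': [100, 250]},
    index=['Park A', 'Park B'],
)

# Floor division works on whole columns at once, so no row-wise apply is needed.
toy_richness['Acres per Species'] = toy_richness['Acres'] // toy_richness['Species']

print(toy_richness)  # 100 // 3 is 33 and 250 // 7 is 35
print(toy_richness['Acres per Species'].mean())
```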

Species Type Distribution

Species distribution in this setting will be the ratio of species types throughout each park. The different species categories in this data set are: ‘Mammal’, ‘Bird’, ‘Reptile’, ‘Amphibian’, ‘Fish’, ‘Vascular Plant’, ‘Spider/Scorpion’, ‘Insect’, ‘Invertebrate’, ‘Fungi’, ‘Nonvascular Plant’, ‘Crab/Lobster/Shrimp’, ‘Slug/Snail’, ‘Algae’.

To extract just the park names and categories, create a new dataframe with just these columns.

types = species_data[['Park Name', 'Category']].copy()
A pandas dataframe shows one row per species record of a given park and category.

Use pandas’ groupby() function to group by park name and category. Then, use size() to count the rows in each park and category group, and unstack() to pivot the category level into columns, one per unique category. I give the fill_value argument of unstack() a value of 0 so that park and category combinations with no records become 0 instead of NaN, which keeps the math operations consistent.

df = types.groupby(['Park Name','Category']).size().unstack(fill_value=0)
df.head()
A pandas dataframe shows the number of species in each category per park.
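
For a quick feel of what size() and unstack() produce, here is a sketch on a made-up four-row dataframe. It also compares the result with pd.crosstab(), which is another way to build the same table of counts. This is offered as an alternative to try, not as the approach used in the notebooks.

```python
import pandas as pd

# Hypothetical park and category records (one row per species record).
toy_types = pd.DataFrame({
    'Park Name': ['Park A', 'Park A', 'Park A', 'Park B'],
    'Category': ['Bird', 'Bird', 'Mammal', 'Bird'],
})

# Count rows per (park, category) pair, then pivot categories into columns.
toy_wide = (
    toy_types.groupby(['Park Name', 'Category'])
    .size()
    .unstack(fill_value=0)  # missing combinations become 0 instead of NaN
)

# crosstab builds the same park-by-category count table in one call.
toy_cross = pd.crosstab(toy_types['Park Name'], toy_types['Category'])

print(toy_wide)
print(toy_cross)
```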

Next, put the counts on a common scale of 0 to 100. Use pandas’ div() function to divide each row by its row sum. Then, multiply each value by 100 for better readability and translation for visuals.

ratios = df.div(df.sum(axis=1), axis=0).multiply(100)
ratios.head()
A pandas dataframe shows raw values for each park's species category ratios.
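
A simple sanity check for this step is that every row should add up to 100 once the counts are converted. The sketch below runs the same div() and multiply() chain on a made-up table of counts (the parks and numbers are hypothetical) and verifies the row totals.

```python
import pandas as pd

# Hypothetical category counts per park.
toy_wide = pd.DataFrame(
    {'Bird': [2, 1], 'Mammal': [1, 0], 'Fish': [1, 3]},
    index=['Park A', 'Park B'],
)

# Divide each row by its own total (axis=0 aligns the totals with the index),
# then scale to percentages.
toy_ratios = toy_wide.div(toy_wide.sum(axis=1), axis=0).multiply(100)

# Each park's percentages should add up to 100.
row_totals = toy_ratios.sum(axis=1)

print(toy_ratios)
print(row_totals)
```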

To clean the dataframe values further, round each value to 2 decimal places using the pandas round() method.

rounded = ratios.round(2)
rounded.head()
A pandas dataframe shows rounded values for ratios in each park's species category.

Create a box plot using matplotlib and seaborn to show the breakdown of species categories across all parks.

f, ax = plt.subplots(figsize=(12, 8))
sb.boxplot(data=rounded, orient='h')
ax.xaxis.grid(True)
ax.set(ylabel="")
ax.set(xlim=(0,100))
sb.despine(trim=True, left=True)
plt.title('Species Category Distribution in the National Parks', fontsize=16)
plt.show()
A boxplot shows the species category types and their distributions.

Optionally, export the species categories dataframe as a CSV for further use.
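
The export itself is not shown in the post, so the following is only a sketch of how it might look with pandas’ to_csv(). The dataframe, the temporary folder, and the file name species_categories.csv are all made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rounded percentages, indexed by park name.
toy_rounded = pd.DataFrame(
    {'Bird': [50.0, 25.0], 'Mammal': [50.0, 75.0]},
    index=pd.Index(['Park A', 'Park B'], name='Park Name'),
)

# Write to a temporary folder; the file name is an arbitrary example.
out_path = os.path.join(tempfile.mkdtemp(), 'species_categories.csv')
toy_rounded.to_csv(out_path)  # the index (park names) is written by default

# Read the file back to confirm the park names and values survived.
reloaded = pd.read_csv(out_path, index_col='Park Name')
print(reloaded)
```

With the dataframes from this walkthrough, the equivalent call would be rounded.to_csv() with a file name of your choice.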

Park NameAlgaeAmphibianBirdCrab/Lobster/ShrimpFishFungiInsectInvertebrateMammalNonvascular PlantReptileSlug/SnailSpider/ScorpionVascular Plant
Acadia National Park0.00.8821.30.02.220.00.00.03.220.00.640.00.071.74
Arches National Park0.00.7619.560.01.050.00.00.05.630.01.910.00.071.09
Badlands National Park0.00.7217.210.01.7312.4617.210.074.610.00.940.00.0745.0
Big Bend National Park0.00.5718.290.02.340.00.00.03.922.122.730.00.070.03
Biscayne National Park0.00.4613.50.047.390.00.641.971.620.02.320.00.032.1
Black Canyon of the Gunnison National Park0.00.1815.820.01.450.00.00.06.060.00.990.00.075.5
Bryce Canyon National Park0.00.3116.870.00.080.00.00.05.910.01.010.00.075.82
Canyonlands National Park0.00.5717.990.02.70.00.00.06.210.01.80.00.070.73
Capitol Reef National Park0.00.3815.840.00.960.00.00.04.660.01.340.00.076.82
Carlsbad Caverns National Park0.00.9823.890.00.330.00.00.05.990.04.040.00.064.78
Channel Islands National Park3.240.2118.940.5814.480.00.1110.42.333.610.581.750.0543.71
Congaree National Park3.191.858.620.262.812.0226.580.651.680.32.150.90.938.09
Crater Lake National Park5.80.536.71.060.355.1126.441.812.555.130.530.240.4543.3
Cuyahoga Valley National Park0.01.2412.670.414.380.011.71.292.420.01.180.770.163.83
Death Valley National Park1.351.611.960.860.22.7720.340.414.780.52.031.620.7250.87
Denali National Park and Preserve0.00.0813.560.01.062.23.860.233.2612.050.00.00.063.71
Dry Tortugas National Park0.00.033.370.033.140.00.04.950.710.00.590.00.027.24
Everglades National Park0.00.8217.750.020.440.00.00.02.020.02.930.00.056.05
Gates Of The Arctic National Park and Preserve0.00.079.90.01.2635.030.00.02.880.00.00.00.050.85
Glacier Bay National Park and Preserve3.270.2613.184.9118.340.11.795.982.965.010.151.890.1542.0
Glacier National Park0.080.2310.840.231.0610.87.710.082.715.810.160.780.049.53
Grand Canyon National Park0.00.5717.390.01.110.02.170.044.040.02.90.085.4266.29
Grand Teton National Park0.050.3413.10.051.031.387.590.253.650.00.250.20.0572.07
Great Basin National Park0.00.8312.440.230.790.5717.191.093.880.62.191.130.6858.39
Great Sand Dunes National Park and Preserve0.00.6325.210.00.630.00.110.07.140.110.840.00.065.34
Great Smoky Mountains National Park0.00.924.110.151.629.5436.451.271.427.970.771.391.5732.83
Guadalupe Mountains National Park0.00.6915.580.290.173.786.590.44.350.293.213.320.1161.23
Haleakala National Park0.00.121.710.70.232.5641.121.430.588.80.391.821.7838.76
Hawaii Volcanoes National Park0.00.122.371.70.120.2143.335.250.454.090.392.153.0636.75
Hot Springs National Park1.231.3819.850.464.620.00.771.132.670.922.670.10.064.21
Isle Royale National Park0.00.9318.681.04.510.00.01.51.860.070.360.00.071.08
Joshua Tree National Park0.220.2213.120.00.041.6614.910.442.920.262.270.06.6757.28
Katmai National Park and Preserve0.00.0818.120.333.598.570.572.374.411.630.00.570.059.76
Kenai Fjords National Park0.00.0922.230.03.970.190.283.975.20.00.00.380.063.67
Kobuk Valley National Park0.00.112.20.02.5432.490.00.03.712.540.00.00.046.44
Lake Clark National Park and Preserve0.00.059.620.052.7410.960.150.32.4914.00.00.00.059.64
Lassen Volcanic National Park0.110.9513.630.831.112.845.291.05.568.91.220.330.058.21
Mammoth Cave National Park0.01.328.40.044.880.010.82.162.240.041.680.00.0868.35
Mesa Verde National Park0.00.6419.070.00.00.00.00.06.920.01.690.00.071.68
Mount Rainier National Park0.00.9210.730.01.321.782.70.03.9620.480.290.00.057.83
North Cascades National Park0.00.366.720.00.9816.0316.620.02.3511.450.30.01.2543.95
Olympic National Park0.00.8215.910.04.980.04.470.04.110.00.310.00.069.4
Petrified Forest National Park0.00.9428.60.00.00.120.00.07.270.122.460.00.060.49
Pinnacles National Park0.00.7112.010.560.421.9122.812.194.240.02.050.780.7151.62
Redwood National Park1.760.527.941.933.9121.611.795.292.443.980.622.330.1135.77
Rocky Mountain National Park4.760.168.791.240.389.7121.451.522.3513.20.10.320.735.34
Saguaro National Park0.00.5513.410.050.00.110.00.05.560.03.380.00.076.94
Sequoia and Kings Canyon National Parks0.00.6511.030.00.950.00.00.04.460.01.20.00.081.7
Shenandoah National Park0.00.865.760.060.8816.436.830.041.357.520.820.040.0959.31
Theodore Roosevelt National Park0.00.6919.140.092.750.096.270.265.675.241.120.170.1758.37
Voyageurs National Park0.01.0316.380.03.990.212.270.484.340.760.410.00.070.13
Wind Cave National Park0.00.516.850.00.573.087.530.06.380.00.861.790.062.44
Wrangell - St Elias National Park and Preserve0.00.1111.750.05.180.03.230.223.450.390.060.060.075.56
Yellowstone National Park5.140.238.321.590.480.2841.051.971.970.380.231.491.0835.8
Yosemite National Park0.00.7212.930.00.480.00.00.04.210.01.050.00.080.6
Zion National Park0.00.3916.760.00.840.00.00.04.450.01.670.00.075.89

Summary

Further exploration of the species category and count outputs might involve comparing a select number of parks against each other. Unique factors such as location, park size, and biomes provide opportunities for further analysis and insights.

Analysis of USDA Plant Families Data Using Pandas

The United States Department of Agriculture PLANTS database provides general information about plant species across the country. Given 3 states, I wanted to visualize which plant families are present in each and which state(s) hold the most species in each family. To accomplish this task, I used Python’s pandas, matplotlib, and seaborn libraries for analysis.

Different groups of plants, including wildflowers, are arranged in a rock bed in the desert.
Various plant species in Death Valley National Park, California.

Initial Setup

Before beginning, I import pandas, matplotlib, and seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

Gathering Data

I pulled data sets from the USDA website for New York, Idaho, and California. The default encoding is in Latin-1 for exported text files. When importing into pandas, the encoding must be specified to work properly.

ny_list = pd.read_csv('ny_list.txt', encoding='latin-1')
ca_list = pd.read_csv('ca_list.txt', encoding='latin-1')
id_list = pd.read_csv('id_list.txt', encoding='latin-1')
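
To see why the encoding matters, here is a small sketch using a made-up file (the file name, symbol, and common name are hypothetical). A character such as ñ is stored as a single byte in Latin-1, and that byte sequence is not valid UTF-8, which is the default encoding pandas assumes.

```python
import os
import tempfile

import pandas as pd

# Write a tiny Latin-1 encoded file with one accented character.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_list.txt')
with open(toy_path, 'w', encoding='latin-1') as f:
    f.write('Symbol,National Common Name\n')
    f.write('ABCD,piñon\n')

# Reading with the matching encoding recovers the text correctly.
toy_list = pd.read_csv(toy_path, encoding='latin-1')
print(toy_list)

# The same bytes cannot be decoded as UTF-8.
try:
    'piñon'.encode('latin-1').decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```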

I double check the files have been loaded correctly into dataframe format using head().

A preview of a table shows 5 entries of various plants that are present in New York state, including their USDA symbol, synonym symbol, scientific name, common name, and plant family.
Initial import of New York state plants list categorized by USDA symbol, synonym symbol, scientific name with author, national common name, and family.

Cleaning Data

The major point of interest in the imported dataframe is the ‘Family’ column. I create a new dataframe organized by this column and returning the count from each row.

ny_fam = ny_list.groupby('Family').count()
A table shows plants grouped by taxonomic family for New York state.
Initial dataframe for New York plant data grouped by taxonomic family.

Next, I remove the unwanted columns. I’ve chosen only to keep the ‘Symbol’ column as a representation of count because this variable is required for every plant instance.

ny_fam_1 = ny_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)

Then, I change the column name from ‘Symbol’ to ‘{State} Count’ to lend itself for merging the dataframes without confusion.

ny_fam_1 = ny_fam_1.rename(columns = {"Symbol":"NY Count"})
Two tables show a before and after image of a table where the column named 'Symbol' changes to 'NY Count.'
Column before (left) and after (right) a name change.

I complete the same process for the California and Idaho data.

ca_fam = ca_list.groupby('Family').count()
ca_fam_1 = ca_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
ca_fam_1 = ca_fam_1.rename(columns = {"Symbol":"CA Count"})
id_fam = id_list.groupby('Family').count()
id_fam_1 = id_fam.drop(['Synonym Symbol', 'Scientific Name with Author', 'National Common Name'], axis=1)
id_fam_1 = id_fam_1.rename(columns = {"Symbol":"ID Count"})

Reset the index to prepare the data frames for outer merges based on column names. The index is set to ‘Family’ by default because the initial data frames were created by grouping on that column with groupby(). To avoid any unwanted changes, I create a copy of each data frame as I go.

ny_fam_2 = ny_fam_1.copy()
ny_fam_2 = ny_fam_2.reset_index()
ca_fam_2 = ca_fam_1.copy()
ca_fam_2 = ca_fam_2.reset_index()
id_fam_2 = id_fam_1.copy()
id_fam_2 = id_fam_2.reset_index()
Two tables show a before and after image of New York state plant family data. The first table shows plant families as the index and the second table shows plant families as a column, with a new numerical index.
New York dataframe before (left) and after (right) the index was reset to make the plant families a column.
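
The effect of reset_index() is easy to check on a made-up plant list (the symbols below are hypothetical). The sketch also shows groupby() with as_index=False, which keeps ‘Family’ as a regular column from the start. That is an alternative worth knowing about, not what the steps above use.

```python
import pandas as pd

# Hypothetical plant list with one row per plant.
toy_list = pd.DataFrame({
    'Symbol': ['A', 'B', 'C'],
    'Family': ['Rosaceae', 'Rosaceae', 'Poaceae'],
})

# Grouping makes 'Family' the index of the result.
toy_fam = toy_list.groupby('Family').count().rename(columns={'Symbol': 'NY Count'})

# reset_index() moves 'Family' back into a column with a fresh numeric index.
toy_fam_reset = toy_fam.reset_index()

# Alternative: ask groupby to keep 'Family' as a column in the first place.
toy_alt = toy_list.groupby('Family', as_index=False).count()

print(toy_fam_reset)
print(toy_alt)
```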

Merging Data

To preserve all the plant families regardless of presence in each individual state, I perform outer merges. This allows areas without data to be filled with zeros after the family counts are combined.

combo1 = pd.merge(ny_fam_2, ca_fam_2, how='outer')
combo2 = pd.merge(combo1, id_fam_2, how='outer')
A table shows the combined plant family data for New York, California, and Idaho.
Plant family table with counts from each state, before formatting.
pd.options.display.float_format = '{:,.0f}'.format
combo2 = combo2.fillna(0)
A table shows combined data for plant families in New York, California, and Idaho where the numbers are formatted to show zero decimal places and any instances of no data are replaced by zeroes.
Plant family table with counts from each state, formatted to drop decimals and replace NaNs with zeros.
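
Here is a minimal sketch of the outer merge and fill on two made-up state tables (the counts are hypothetical). Families found in only one state get NaN in the other state’s column, which fillna(0) turns into zeros. Converting the columns to integers with astype() is shown as an optional alternative to changing the global display format.

```python
import pandas as pd

# Hypothetical family counts for two states.
toy_ny = pd.DataFrame({'Family': ['Poaceae', 'Rosaceae'], 'NY Count': [3, 5]})
toy_ca = pd.DataFrame({'Family': ['Poaceae', 'Arecaceae'], 'CA Count': [4, 2]})

# An outer merge keeps families that appear in either table.
toy_combo = pd.merge(toy_ny, toy_ca, on='Family', how='outer')

# Missing counts become 0; casting to int removes the decimals for good.
toy_filled = toy_combo.fillna(0).astype({'NY Count': int, 'CA Count': int})

print(toy_filled)
```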

Creating a New Column

I added a column to aid in visualizations. I created a function to return the state with the highest presence of each plant family based on the existing columns.

def presence(row):
    if row['NY Count'] > row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'NY'
    elif row['CA Count'] > row['NY Count'] and row['CA Count'] > row['ID Count']:
        return 'CA'
    elif row['ID Count'] > row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID'
    elif row['NY Count'] == row['CA Count'] and row['NY Count'] > row['ID Count']:
        return 'CA/NY'
    elif row['CA Count'] == row['ID Count'] and row['CA Count'] > row['NY Count']:
        return 'CA/ID'
    elif row['ID Count'] == row['NY Count'] and row['ID Count'] > row['CA Count']:
        return 'ID/NY'
    else:
        return 'Same'
    
combo2['Highest Presence'] = combo2.apply(presence, axis=1)
A table of plant families from New York, California, and Idaho shows counts for each family and a new column that names which state has the highest presence of each plant family.
Table with added column to indicate highest presence of species count within each plant family.
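
The chain of comparisons above works for three states. As a sketch of how the same idea could be generalized, the hypothetical helper below (presence_generic is not part of the original analysis) finds the highest count in a row and joins the labels of every state that matches it. The toy counts are made up. Labels are sorted alphabetically so that ties come out as ‘CA/NY’, ‘CA/ID’, or ‘ID/NY’, as in the table.

```python
import pandas as pd

# Map each count column to the label used in the 'Highest Presence' column.
count_cols = {'NY Count': 'NY', 'CA Count': 'CA', 'ID Count': 'ID'}

def presence_generic(row):
    """Return the state label(s) with the highest count, or 'Same' if all tie."""
    top = max(row[col] for col in count_cols)
    leaders = sorted(label for col, label in count_cols.items() if row[col] == top)
    if len(leaders) == len(count_cols):
        return 'Same'
    return '/'.join(leaders)

# Hypothetical counts covering a two-way tie, a clear leader, and a full tie.
toy_combo = pd.DataFrame({
    'Family': ['A', 'B', 'C'],
    'NY Count': [7, 1, 2],
    'CA Count': [7, 1, 2],
    'ID Count': [0, 18, 2],
})
toy_combo['Highest Presence'] = toy_combo.apply(presence_generic, axis=1)

print(toy_combo)
```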

Below is the full table of all plant families in the dataframe.

FamilyNY CountCA CountID CountHighest Presence
0Acanthaceae770CA/NY
1Acarosporaceae1118ID
2Aceraceae462924NY
3Acoraceae524NY
4Actinidiaceae300NY
5Adoxaceae200NY
6Agavaceae4480CA
7Aizoaceae5420CA
8Alismataceae645333NY
9Amaranthaceae788238CA
10Amblystegiaceae6400NY
11Anacardiaceae464722CA
12Andreaeaceae300NY
13Annonaceae300NY
14Anomodontaceae800NY
15Apiaceae190372257CA
16Apocynaceae486042CA
17Aquifoliaceae2130NY
18Araceae26183NY
19Araliaceae2668NY
20Aristolochiaceae2893NY
21Asclepiadaceae506720CA
22Aspleniaceae2376NY
23Asteraceae2,0573,8582,260CA
24Aulacomniaceae700NY
25Azollaceae440CA/NY
26Bacidiaceae110CA/NY
27Balsaminaceae10611ID
28Bartramiaceae1100NY
29Berberidaceae215925CA
30Betulaceae864353NY
31Bignoniaceae9110CA
32Blechnaceae5117CA
33Boraginaceae127478263CA
34Brachytheciaceae5900NY
35Brassicaceae4681,123774CA
36Bruchiaceae400NY
37Bryaceae2220NY
38Buddlejaceae450CA
39Butomaceae202ID/NY
40Buxaceae500NY
41Buxbaumiaceae500NY
42Cabombaceae663CA/NY
43Cactaceae1023878CA
44Callitrichaceae161811CA
45Calycanthaceae720NY
46Campanulaceae7614657CA
47Cannabaceae13810NY
48Cannaceae400NY
49Capparaceae184926CA
50Caprifoliaceae18210781NY
51Caryophyllaceae338506414CA
52Celastraceae24184NY
53Ceratophyllaceae855NY
54Cercidiphyllaceae200NY
55Chenopodiaceae245404245CA
56Cistaceae44270NY
57Cladoniaceae522NY
58Clethraceae400NY
59Climaciaceae500NY
60Clusiaceae521512NY
61Commelinaceae37140NY
62Convolvulaceae6813017CA
63Cornaceae553436NY
64Crassulaceae6223057CA
65Cucurbitaceae41496CA
66Cupressaceae3913030CA
67Cuscutaceae315631CA
68Cyperaceae1,016733663NY
69Dennstaedtiaceae855NY
70Diapensiaceae800NY
71Dicranaceae2000NY
72Dioscoreaceae600NY
73Dipsacaceae17108NY
74Ditrichaceae920NY
75Droseraceae8116CA
76Dryopteridaceae1216571NY
77Ebenaceae770CA/NY
78Elaeagnaceae161012NY
79Elatinaceae6144CA
80Empetraceae1360NY
81Entodontaceae800NY
82Ephemeraceae500NY
83Equisetaceae483651ID
84Ericaceae236310110CA
85Eriocaulaceae730NY
86Euphorbiaceae11123366CA
87Fabaceae6041,855871CA
88Fagaceae96910NY
89Fissidentaceae2211NY
90Flacourtiaceae200NY
91Fontinalaceae900NY
92Fumariaceae312820NY
93Funariaceae2100NY
94Gentianaceae72162122CA
95Geraniaceae348025CA
96Ginkgoaceae200NY
97Grimmiaceae233CA/ID
98Grossulariaceae4615072CA
99Haemodoraceae700NY
100Haloragaceae372820NY
101Hamamelidaceae820NY
102Hippocastanaceae1420NY
103Hippuridaceae222Same
104Hydrangeaceae245314CA
105Hydrocharitaceae454430NY
106Hydrophyllaceae1533689CA
107Hylocomiaceae800NY
108Hymeneliaceae114ID
109Hymenophyllaceae200NY
110Hypnaceae2000NY
111Iridaceae4611426CA
112Isoetaceae393028NY
113Juglandaceae52102NY
114Juncaceae162177143CA
115Juncaginaceae92213CA
116Lamiaceae399413182CA
117Lardizabalaceae200NY
118Lauraceae12110NY
119Lecanoraceae311NY
120Lemnaceae274517CA
121Lentibulariaceae372317NY
122Leucobryaceae200NY
123Liliaceae263741243CA
124Limnanthaceae3363CA
125Linaceae365220CA
126Lycopodiaceae114947NY
127Lygodiaceae200NY
128Lythraceae252616CA
129Magnoliaceae2000NY
130Malvaceae6628462CA
131Marsileaceae21512CA
132Melastomataceae1200NY
133Meliaceae330CA/NY
134Menispermaceae200NY
135Menyanthaceae1073NY
136Mniaceae1000NY
137Molluginaceae493CA
138Monotropaceae193121CA
139Moraceae21165NY
140Myricaceae2250NY
141Najadaceae252110NY
142Nelumbonaceae950NY
143Nyctaginaceae301578CA
144Nymphaeaceae502330NY
145Oleaceae39490CA
146Onagraceae237661314CA
147Ophioglossaceae605348NY
148Orchidaceae285175173NY
149Orobanchaceae227451CA
150Orthotrichaceae1100NY
151Osmundaceae1000NY
152Oxalidaceae705847NY
153Paeoniaceae242CA
154Papaveraceae389826CA
155Parmeliaceae111Same
156Pedaliaceae11258CA
157Phytolaccaceae470CA
158Pinaceae5211365CA
159Plantaginaceae647833CA
160Platanaceae980NY
161Plumbaginaceae16220CA
162Poaceae1,9272,3471,507CA
163Podostemaceae300NY
164Polemoniaceae35637279CA
165Polygalaceae29170NY
166Polygonaceae331917435CA
167Polypodiaceae9149CA
168Polytrichaceae2300NY
169Pontederiaceae17176CA/NY
170Portulacaceae24203120CA
171Potamogetonaceae11583103NY
172Pottiaceae3421NY
173Primulaceae67104102CA
174Pteridaceae1611139CA
175Pyrolaceae546567ID
176Ranunculaceae288434412CA
177Resedaceae780CA
178Rhamnaceae1820220CA
179Rosaceae1,305803609NY
180Rubiaceae15723072CA
181Ruppiaceae11177CA
182Rutaceae18120NY
183Salicaceae284352383ID
184Salviniaceae530NY
185Santalaceae959ID/NY
186Sapindaceae5373CA
187Sarraceniaceae11160CA
188Saururaceae230CA
189Saxifragaceae55219234ID
190Scheuchzeriaceae555Same
191Schistostegaceae202ID/NY
192Schizaeaceae200NY
193Scrophulariaceae4301,146556CA
194Selaginellaceae61514CA
195Sematophyllaceae600NY
196Simaroubaceae474CA
197Smilacaceae2230NY
198Solanaceae11824463CA
199Sparganiaceae242224ID/NY
200Sphagnaceae4231NY
201Staphyleaceae220CA/NY
202Sterculiaceae2300CA
203Styracaceae790CA
204Symplocaceae500NY
205Taxaceae1042NY
206Teloschistaceae111Same
207Tetraphidaceae200NY
208Theliaceae300NY
209Thelypteridaceae26913NY
210Thuidiaceae300NY
211Thymelaeaceae620NY
212Tiliaceae2200NY
213Trapaceae500NY
214Tropaeolaceae220CA/NY
215Typhaceae6105CA
216Ulmaceae231910NY
217Urticaceae526132CA
218Valerianaceae135927CA
219Verbenaceae531267CA
220Violaceae17612079NY
221Viscaceae12446CA
222Vitaceae55184NY
223Vittariaceae200NY
224Xyridaceae1500NY
225Zannichelliaceae555Same
226Zosteraceae580CA
227Zygophyllaceae5315CA
228Aloaceae020CA
229Aponogetonaceae020CA
230Arecaceae0150CA
231Basellaceae040CA
232Bataceae020CA
233Burseraceae020CA
234Caulerpaceae020CA
235Crossosomataceae0189CA
236Cyatheaceae030CA
237Cymodoceaceae050CA
238Datiscaceae020CA
239Elaeocarpaceae040CA
240Ephedraceae0160CA
241Fouquieriaceae030CA
242Frankeniaceae060CA
243Garryaceae0110CA
244Gracilariaceae020CA
245Gunneraceae030CA
246Halymeniaceae020CA
247Krameriaceae080CA
248Lennoaceae060CA
249Loasaceae08429CA
250Melianthaceae020CA
251Myoporaceae020CA
252Myrtaceae0320CA
253Parkeriaceae040CA
254Passifloraceae060CA
255Pittosporaceae080CA
256Punicaceae020CA
257Rafflesiaceae020CA
258Scouleriaceae033CA/ID
259Simmondsiaceae040CA
260Stereocaulaceae020CA
261Tamaricaceae0123CA
262Ulvaceae030CA
263Verrucariaceae002ID

Visualizing the Data

I created a count plot using seaborn to show which states, or state combinations, have the highest variety within each plant family.

# Plot the number of plant families per 'Highest Presence' label, most common first
sb.countplot(data=combo2, x='Highest Presence', color="#B6D1BE", order=combo2['Highest Presence'].value_counts().index)
# Count of plant families for each label, used for the text above the bars
cat_counts = combo2['Highest Presence'].value_counts()
locs, labels = plt.xticks()
# Write each count slightly above its bar
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    plt.text(loc, count+5, count, ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Highest Concentration of Plant Families by State', fontsize=14, y=1.05)
plt.ylim(0, 140)
sb.despine();
A vertical bar graph shows the highest concentration of plant families organized by state, where the concentrations are high to low as follows: California, New York, California/New York, Idaho, all the states tie, Idaho/New York, and California/Idaho.
Shows the count of plant families with the highest concentration in each state.

Further Considerations

There are many factors that play into plant family diversity. The comparison of plant families in New York, California, and Idaho was purely out of curiosity. Further investigations should take into account each state’s ecosystem types and land usage and ownership that may influence species diversity.

Statistics Principles for Data Analysis

Four traditional dice on a game board

Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. Below, overarching concepts are from the test’s public posting and my notes are derived from a quantitative statistics textbook.

  • Central Tendency
    • Mean = the average of a distribution
    • Median = a distribution’s midpoint
    • Mode = the value that occurs most often in a distribution
  • Variability
    • How spread out the values in a distribution are, also known as spread
    • Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
      • Represented through boxplots graphically
    • Summarized through quartiles:
      • Q1: median of all values to left of Q2
      • Q2: median (50th percentile) of all values in distribution
      • Q3: median of all values to right of Q2
    • Variance: s^2 = SUM((x_i - mean of x)^2 for all values) / (n - 1), where n is the number of values
    • Standard deviation: square root of the variance (s^2)
  • Normal Distribution: symmetric, bell-shaped distribution of data, described by its mean and standard deviation
  • Hypothesis Testing: Examine the evidence against a null hypothesis. Hypotheses refer to populations or models, not to a particular outcome.
    • Compare claims
    • Null hypothesis: statement challenged in significance testing
      • Example: There is not a difference between means.
    • Alternative hypothesis: statement suspected as true instead of the null hypothesis
      • Example: The means are not the same.
    • Reject or fail to reject the null hypothesis by comparing the p-value to a chosen significance level (alpha).
    • p-value: the probability, assuming the null hypothesis is true, that the test statistic would take a value as extreme as or more extreme than the one observed
    • Smaller p-values signify stronger evidence against the null hypothesis in question. Often, an alpha value of 0.05 is used. At that level, a result this extreme would occur no more than 5 out of every 100 times if the null hypothesis were true.
  • Statistical Significance: Achieved when the p-value is less than or equal to alpha.
  • Probability: The proportion of times an outcome would occur over a very long series of repeated trials.
  • Correlation
    • A measure of the linear relationship between two quantitative variables, based on direction and strength.
    • Examples: strong, weak, or no correlation; positive or negative
    • Represented by r
    • r = (1/(n - 1)) * SUM(((x_i - mean of x) / s_x) * ((y_i - mean of y) / s_y)), where s_x and s_y are the standard deviations of the x-values and y-values
  • Regression
    • Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
    • Multiple linear: statistical model with more than one explanatory variable
  • Parametric Statistics: Assume the data follow a particular distribution, often the normal distribution. Typically used with numerical data.
  • Nonparametric Statistics: Do not assume a normal distribution. Often used with ordinal or categorical data, or when parametric assumptions are not met.
  • Analysis of Variance (ANOVA)
    • One-way: Compare population means based on one independent variable
    • Two-way: Compare population means classified by two independent variables
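
As a rough sketch of how the central tendency, variability, and correlation formulas above translate into code, here is a small Python example using only the standard library. The data (`hours` and `scores`) are made up for illustration, and the helper names `quartiles`, `sample_variance`, and `correlation` are hypothetical, not from the textbook.

```python
import math
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the
    values to the left and right of the overall median."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd number of values, the median itself is left out of both halves
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


def sample_variance(values):
    """s^2 = SUM((x_i - mean)^2) / (n - 1)"""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


def correlation(xs, ys):
    """r = (1 / (n - 1)) * SUM(((x_i - mean_x) / s_x) * ((y_i - mean_y) / s_y))"""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sample_variance(xs))
    s_y = math.sqrt(sample_variance(ys))
    products = (((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Made-up example data (assumption, for illustration only)
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 4, 4, 6]

print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
print("mode:", statistics.mode(scores))
print("quartiles:", quartiles(scores))
print("variance:", sample_variance(scores))
print("standard deviation:", math.sqrt(sample_variance(scores)))
print("r:", correlation(hours, scores))
```

For this made-up data, the mean, median, and mode are all 4, the quartiles are 3, 4, and 5, the sample variance is 2, and r is about 0.894 (a strong positive correlation). Other tools may compute quartiles with slightly different conventions, so Q1 and Q3 can differ a little between libraries.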

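To make the p-value and alpha ideas above more concrete, here is a minimal sketch of a significance test in Python. Note the assumptions: it uses an exact permutation test for a difference in means, which is a swapped-in technique chosen so the example runs with only the standard library (it is not the t-based procedure a textbook would typically lead with), and the two groups are made-up numbers. The function name `permutation_p_value` is hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Null hypothesis: there is no difference between the group means, so
    every way of splitting the pooled values into two groups is equally
    likely. The p-value is the share of splits whose difference in means
    is at least as extreme as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Made-up example data (assumption, for illustration only)
group_a = [1, 2, 3, 4]
group_b = [7, 8, 9, 10]
alpha = 0.05

p_value = permutation_p_value(group_a, group_b)
print("p-value:", round(p_value, 4))
if p_value <= alpha:
    print("Reject the null hypothesis at alpha =", alpha)
else:
    print("Fail to reject the null hypothesis at alpha =", alpha)
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 (about 0.0286), which is below 0.05. Enumerating every split is only practical for small samples like this one.
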
Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics. Seventh edition/Student edition. New York: W.H. Freeman and Company, a Macmillan Higher Education Company.

See here for an updated version of the textbook.

Data Analysis and UFO Reports

Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa, who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife Linda Miller Costa co-authored UFO Sightings Desk Reference: United States of America 2001-2015, a book that compiles UFO reports.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

A dark night sky with the moon barely visible and trees below.
Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Center). NUFORC’s database is public and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of its data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

The use of data analysis on UFO reports requires careful consideration of potential bias and reasonable explanations for numbers in question. For example, a high volume of reports in the summer could be because more people are spending time outside and would be more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculations should be met with careful criticism. The use of the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.