How to Mindfully Summarize Data Insights

It can be difficult to transparently present key insights in a world saturated with fake news and click-bait. Study results are often distilled into share-worthy titles that get read and taken as gospel without anyone actually reading further, so be conscious of how you choose to share data findings. Here are a few ways to stay humble while summarizing your data, based on a headline from the Girlgaze newsletter about an article from Gay Times.

Screenshot of Girlgaze newsletter from June 29, 2019 with added underlining and highlighting.
  1. Avoid sweeping generalizations that encourage assumptions about entire populations. This headline implies the sexual identity of all young people, rather than the small portion of youth actually surveyed. A quick change would be to add ‘Surveyed’ after ‘Youth’ to reinforce that this was not a large sample (‘over 2,000 adults’).
  2. Name the party or parties responsible for sponsoring the collection of data. This communication successfully names the commissioner of the study early in the description, giving the audience full transparency about parties who may influence a study’s outcomes.
  3. Avoid the use of definitive language. The last sentence in this quick summary is too absolute in saying ‘clear indication’, and it should state what the ‘much more fluid perspective’ is being compared to. Studies use a sample from a population to provide meaningful insights; samples are not meant to determine a definitive stance for every member of a population. Surveys account for an array of potential variables, and language such as ‘clear indication’ implies the researchers have explored every possible avenue of bias.

Other ways to communicate transparently with an audience include sharing links to raw data, naming potential sources of error, and making suggestions for future method improvements. Providing an audience with every opportunity to explore your data and understand methods empowers people to consume insights responsibly.

Data Exploration with Student Test Scores

I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.

Preliminary Data Cleaning

For this project, I used the numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of the math, reading, and writing scores and three columns with the test scores converted into letter grades (a sketch of these derived columns follows the loading snippet below).

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
Output of info() for student test scores.
tests.head(10)
Output of head() for student test scores.
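
The derived columns described above are not shown in the snippet. Here is a minimal sketch of how they might be created, assuming the ‘avg_score’ column name used in the later plots and hypothetical letter-grade cutoffs:

# Combined average of the three subject scores (column name used in later plots)
tests['avg_score'] = tests[['math', 'reading', 'writing']].mean(axis=1)

# Letter-grade versions of each score; these cutoffs are assumed for illustration
grade_bins = [0, 60, 70, 80, 90, 100]
grade_labels = ['F', 'D', 'C', 'B', 'A']
for subject in ['math', 'reading', 'writing']:
    tests[subject + '_grade'] = pd.cut(tests[subject], bins=grade_bins,
                                       labels=grade_labels, include_lowest=True)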

Univariate Analysis

Histograms provide a sense of the spread of test scores across subjects. Count plots provide counts for test preparation course attendance and parental level of education.

plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
Histograms showing the spread of student test scores across all topics.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
locs, labels = plt.xticks()
# label each bar with its count
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    plt.text(loc, count-35, str(count), ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Count plot of parental education levels among student test takers.

Bivariate Analysis

Violin plots illustrate average test scores by test preparation course attendance. Box plots provide a visual representation of the quartiles within each subject area. I sorted parental level of education from the lowest to the highest level captured by the data.

base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores based on level of test preparation.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.

Multivariate Analysis

A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.

ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.

Statistics Principles for Data Analysis

Four traditional dice on a game board

Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. The overarching concepts below are from the test’s public posting, and my notes are derived from a quantitative statistics textbook.

  • Central Tendency
    • Mean = the average of a distribution
    • Median = a distribution’s midpoint
    • Mode = the value that occurs most often in a distribution
  • Variability
    • How spread out the values in a distribution are, also known as spread
    • Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
      • Represented through boxplots graphically
    • Summarized through quartiles:
      • Q1: median of all values to left of Q2
      • Q2: median (50th percentile) of all values in distribution
      • Q3: median of all values to right of Q2
    • Variance: s^2 = sum of (x_i − mean)^2 over all values, divided by (n − 1), where n is the number of values (see the pandas sketch after this list)
    • Standard deviation: s, the square root of the variance s^2
  • Normal Distribution: bell-curve distribution of data
  • Hypothesis Testing: Examine the evidence against a null hypothesis; hypotheses refer to populations or models, not to a particular observed outcome
    • Compare claims
    • Null hypothesis: statement challenged in significance testing
      • Example: There is not a difference between means.
    • Alternative hypothesis: statement suspected as true instead of the null hypothesis
      • Example: The means are not the same.
    • Reject or fail to reject the null hypothesis based on a chosen significance level (alpha).
    • p-value: the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed
    • Smaller p-values signify stronger evidence against the null hypothesis. Often, an alpha level of 0.05 is used: a result at least this extreme would be expected by chance only 5 times out of 100 if the null hypothesis were true.
  • Statistical Significance: Achieved when the p-value is less than or equal to alpha.
  • Probability: The proportion of times an outcome would occur over many repeated trials.
  • Correlation
    • A measure of the linear relationship between two quantitative variables, based on direction and strength.
    • Examples: strong, weak, or no correlation; positive or negative
    • Represented by r
    • r = (1 / (n − 1)) * sum over all observations of ((x_i − mean of x) / s_x) * ((y_i − mean of y) / s_y), where s_x and s_y are the standard deviations of x and y
  • Regression
    • Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
    • Multiple linear: statistical model with more than one explanatory variable
  • Parametric Statistics: Used with numerical data; assumes the data follow a normal distribution.
  • Nonparametric Statistics: Used with ordinal or categorical data; does not assume a normal distribution.
  • Analysis of Variance (ANOVAs)
    • One-way: Compare population means based on 1 independent variable
    • Two-way: Compare population means classified based on 2 independent variables
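
These definitions translate directly into a few lines of Python. Below is a minimal sketch using pandas and SciPy; the numbers are made up purely for illustration:

import pandas as pd
from scipy import stats

# Made-up sample data for illustration only
scores = pd.Series([72, 85, 85, 90, 64, 78, 95, 70, 88, 81])
hours_studied = pd.Series([2, 5, 5, 6, 1, 3, 8, 2, 6, 4])

# Central tendency: mean, median, mode
print(scores.mean(), scores.median(), scores.mode()[0])

# Variability: five-number summary, variance (s^2), and standard deviation (s)
print(scores.describe()[['min', '25%', '50%', '75%', 'max']])
print(scores.var(ddof=1), scores.std(ddof=1))

# Correlation (Pearson's r) between two quantitative variables
print(scores.corr(hours_studied))

# Hypothesis test: compare two group means (null hypothesis: no difference)
group_a, group_b = scores[:5], scores[5:]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # reject the null only if p_value <= alpha (e.g. 0.05)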

Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics. Seventh edition/Student edition. New York: W.H. Freeman and Company, a Macmillan Higher Education Company.

An updated edition of the textbook is also available.

Data Visualizations for Spending Habits

I am not a huge fan of bank-generated visuals to analyze my spending habits. My bank breaks up expenses into murky categories such as bills and utilities, shopping, other, and un-categorized. As a result, I began tracking all my expenses in a spreadsheet to better capture data. This includes month, vendor, amount, and expense category for every purchase. To explore this data, I used a heat map and a waffle chart.

Here’s a look at how to use Python to create heat maps and waffle charts for spending habit data.

Setup

Libraries and packages I utilized for my spending data include pandas, NumPy, matplotlib, seaborn, and pywaffle. I load my data from a CSV and turn it into a dataframe using pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from pywaffle import Waffle
%matplotlib inline

df = pd.read_csv('2018_expenses.csv')

Heat Map of Number of Purchases

Heat maps display numerical trends using a sequential scale of color intensity. Below, I graph number of purchases across spending categories and organize the information by month.

I customize the graph to have annotations on each block and color the area using the seaborn color palette ‘rocket’ set to be reversed in shading by adding ‘_r’. An additional argument (‘annot_kws’) adjusts the text size of the annotations. Other features clean up the axes labels and name the tick labels accordingly.

# count purchases per category and month, then pivot into a month-by-category grid
ct_counts = df.groupby(['category', 'month']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index='month', columns='category', values='count')
plt.figure(figsize=(12, 6))
ax=sb.heatmap(ct_counts, annot=True, cmap='rocket_r', annot_kws={"size":14})
plt.title('Number of Monthly Expenses', fontsize=16)
plt.xticks(rotation=0)
plt.xlabel('')
plt.ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax.set_xticklabels(['Business', 'Education', 'Entertainment', 'Food', 'Health', 'Other', 'Transportation'])
ax.set_yticklabels(['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], rotation=0);

The heat map shows my most frequent purchases are in the food, entertainment, other, and transportation categories. Categories such as business and health have the lowest purchase frequency. Across months, purchase counts are fairly consistent, ranging from 1 to 9 per spending category.

Waffle Chart of Total Spending

Waffle charts display values as groups of squares and are an alternative to pie charts. Formatting options allow for a varied number of columns and rows. I stuck with ten columns and ten rows, with each square representing one percent of the total. This allows for a simple breakdown of where my money went. I created a dictionary of values for this example, rather than using my pandas dataframe.

data = {'Education': 63, 'Entertainment': 9, 'Other': 8, 'Health': 7, 'Food': 6, 'Transportation': 4, 'Business': 3}

fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    values=data,
    colors=('indianred', 'lightsalmon', 'peachpuff', 'darkkhaki',
            'cadetblue', 'paleturquoise', 'lightsteelblue'),
    title={'label': 'Total Spending by Category', 'loc': 'center'},
    labels=["{0} ({1}%)".format(k, v) for k, v in data.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)}
)
fig.set_tight_layout(False)
plt.show();

The waffle chart shows that most of my spending was on education, and the other categories were each under 10% of total expenses.

Cleaning Data with Pandas

A project for my Udacity Data Analyst Nanodegree Program involved wrangling messy data using pandas. Although my coursework reviewed data cleaning methods, I revisited documentation for specific functions. Here’s a breakdown of the steps I used with pandas to clean the data and complete the assignment.

The examples from my assignment involve a collection of WeRateDogs™ data retrieved from Twitter.

Import Libraries:

Import pandas, NumPy, and Python’s regular expression operations library (re).

import pandas as pd
import numpy as np
import re

Import Files:

Use read_csv to load the files you wish to clean.

twt_arc = pd.read_csv('twitter_archive.csv')
img_pred = pd.read_csv('image_predictions.csv')
twt_counts = pd.read_csv('tweet_counts.csv')

Create Copies:

Create copies of the original files using copy before cleaning just in case you need to restore some of the original contents.

twt_arc_clean = twt_arc.copy()
img_pred_clean = img_pred.copy()
twt_counts_clean = twt_counts.copy()

Merge Data:

Combine specific files using the merge function.

In this example, the main data is in the Twitter archive file. I perform a left merge to maintain the original contents of this file and add the image prediction and tweet count data where the tweet IDs align.

df1 = pd.merge(twt_arc_clean, img_pred_clean, how='left')
df2 = pd.merge(df1, twt_counts_clean, how='left')

Drop Columns:

Remove unwanted columns using the drop function. List the columns to remove and specify the axis as ‘columns’.

The Twitter data includes mostly individual tweets, but some of the data is repeated in the form of retweets.

First, I use the isnull function to make sure the data only includes tweets where ‘retweeted_status_id’ is null. Then, I drop the columns related to retweets.

df2_clean = df2[df2['retweeted_status_id'].isnull()]

df2_clean = df2_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 
                          'retweeted_status_id','retweeted_status_user_id',                                                        
                          'retweeted_status_timestamp'], axis='columns')

Change Data Types:

Use astype by listing the preferred data type as the argument.

The Tweet IDs were uploaded as integers, so I convert them to objects.

df2_clean.tweet_id = df2_clean.tweet_id.astype(object)

Use to_datetime to convert a column to datetime by entering the selected column as the argument.

Time stamps were objects instead of datetime objects. I create a new column called ‘time’ and delete the old ‘timestamp’ column.

df2_clean['time'] = pd.to_datetime(df2_clean['timestamp'])

df2_clean = df2_clean.drop('timestamp', axis='columns')

Replace Text:

Use the replace function and list the old value to replace followed by the new value.

Text entries for this data set had the shortened spelling of ampersand instead of the symbol itself.

df2_clean['text'] = df2_clean['text'].str.replace('&amp;', '&')

Combine and Map Columns:

First, create a new column: select the applicable columns from the data frame to combine, choose a separator for the combined contents, and join the row values as strings.

Next, use unique to verify all the possible combinations to re-map from the result.

Then, use map to replace row entries with preferred values.

In this case, I had 4 columns called ‘doggo’, ‘floofer’, ‘pupper’, and ‘puppo’ that indicate whether or not a tweet contains those words. I combine them into a single ‘dog_type’ column, then map the combined values to shorter versions.

df2_clean['dog_type'] = df2_clean[df2_clean.columns[6:10]].apply(lambda x:                                                                    
                ','.join(x.dropna().astype(str)), axis=1)

df2_clean['dog_type'].unique()

df2_clean['dog_type'] = df2_clean.dog_type.map({'None,None,None,None': np.nan, 
                'doggo,None,None,None':'doggo',
                'None,None,None,puppo':'puppo', 
                'None,None,pupper,None':'pupper',
                'None,floofer,None,None':'floofer', 
                'doggo,None,None,puppo':'doggo/puppo',
                'doggo,floofer,None,None':'doggo/floofer', 
                'doggo,None,pupper,None':'doggo/pupper'})

Remove HTML Tags:

Write a function to remove HTML tags using re. Compile a pattern matching the tags with ‘<.*?>’, and use sub to replace any matches with empty strings.

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

df2_clean['source'] = df2_clean['source'].apply(remove_html_tags)

What I Learned at NYC Uncubed 2018

Every year, Uncubed hosts a 1-day conference (NYC Uncubed) featuring a fair to connect people with start-up and tech careers. I attended the event in Brooklyn with data science positions in mind and here’s what I learned.

  • Bring physical copies of your resume.

    • Print out a few copies of your resume for the event because HR personnel like to mark it up with notes as they speak to you.
    • You can be as aggressive with this as you want. I saw a lot of people with fancy folders filled to the brim with resumes. I brought 5 copies and that was plenty. The number you bring should reflect how interested you are in the hiring companies announced ahead of the event.
  • Lower your expectations.

    • Don’t expect every single recruiter to blow your mind with the work you’ll be doing. Uncubed talks about how this is ‘New York City’s Top Recruiting Event for Digital Talent’, but the reality is that most companies in attendance weren’t revolutionary. If handling health insurance data floats your boat, this might be the conference for you.
    • Not every company is there to offer technical roles. Only a handful of companies out of the total 44 in attendance had openings in data science.
    • Instead of speaking to you about open roles, some companies will simply tell you to refer to their job boards because they’re not sure what specific openings are available.
    • This event isn’t going to change your life if you’re looking for entry-level positions. People straight up want you to have 3-5 years of experience for many of the technical roles.
  • Prepare to be judged.

    • Get ready for: “Have you considered attending a bootcamp?” followed by “Where did you even take these classes?” and “I don’t see how any of this experience is relevant.”
    • Telling recruiters that you’re self-taught can freak them out. Some people are super nice and understanding, while others can’t fathom the possibility that you might be changing careers.
  • You don’t need to stay the whole time.

    • The event has a content session from 10:00 AM-1:00 PM and the employer showcase from 1:00 PM-5:00 PM, followed by a casual networking session.
    • The content sessions are nice, but not essential. It’s mostly people talking about the work they do for their companies in various positions, including data science, human resources, and marketing.
    • Four hours is a bit excessive for the company showcase. If you’re looking for a role in a specific field, it’s best to give yourself 1 to 2 hours to speak to employers. I was finished visiting the companies with data analyst openings by 3 PM.

It was an educational experience to learn the expectations of recruiters and to attend the content sessions. Honestly, I don’t see myself attending NYC Uncubed again. Instead, I’ll focus on industry-specific talks and networking events as future professional growth opportunities.

Project Overview: FoodPact

A fork and knife surround a plate that has earth on it and FoodPact is written below.
Logo for FoodPact program

A few months ago I began a project with my brother to create a calculator for the environmental footprint of food. It’s called FoodPact to merge food and ecological impact. It’s a work in progress and I’m excited to share the code for it.

Data sources to inform the calculator include:

  • Water footprint data for crops from a 2011 study by M.M. Mekonnen and A. Y. Hoekstra.
  • Greenhouse gas emissions data from Business for Social Responsibility (BSR) and Environmental Protection Agency (EPA) documents on transport via boat, rail, and freight.
  • Food waste data from the United States Department of Agriculture (USDA) Economic Research Service (ERS)
  • Global food import data from the USDA Foreign Agricultural Service’s Global Agricultural Trade System (GATS).
  • Country centroid data from a President and Fellows of Harvard College 2015 data file.
  • US city locations from SimpleMaps.

We used a Bootswatch Bootstrap theme for the web application’s layout and Flask as the microframework.

Python packages used in the program include:

  • Pandas to create more refined dataframes for use within the application
  • NumPy for equations
  • geopy for calculating great circle distance between latitudes and longitudes
  • Matplotlib and pyplot for creating graphs

The program takes a user’s location, a food product, and the product’s country of origin, and generates the estimated distance the food traveled, the approximate amount of carbon dioxide that travel generated, and the water requirements for the product.

Conversions include: cubic meters of water to gallons, tons of crops to pounds, and grams of carbon dioxide per kilometer to pounds per mile.
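
To make the distance and conversion steps concrete, here is a minimal sketch of the kind of calculation involved; the coordinates, emission factor, and water footprint below are hypothetical placeholders, not FoodPact’s actual data or code:

from geopy.distance import great_circle

# Hypothetical coordinates: an origin country centroid and a destination US city
origin = (9.75, -83.75)            # illustrative centroid (roughly Costa Rica)
destination = (40.7128, -74.0060)  # New York City

distance_km = great_circle(origin, destination).kilometers
distance_miles = distance_km * 0.621371

# Placeholder emission factor: grams of CO2 per kilometer for a shipped container
grams_co2_per_km = 15000           # hypothetical value, not FoodPact data
co2_pounds = distance_km * grams_co2_per_km / 453.592

# Placeholder water footprint: cubic meters of water per ton of crop
water_m3_per_ton = 800             # hypothetical value, not FoodPact data
water_gallons = water_m3_per_ton * 264.172

print(round(distance_miles), round(co2_pounds), round(water_gallons))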

Selected graphics from FoodPact project:

One limitation of the calculator is that the values for carbon dioxide consider either full travel by ship, train, or truck and not a combination of the three methods. Emissions refer to the amount it takes to ship a twenty-foot equivalent (TEU) container full of the food product across the world. The country of origin considers the centroid and not the exact location of food production. Similarly, the list of cities displays the 5 most populated cities in that given state. The only exception is New York, for which I considered New York City close enough in latitude and longitude to account for Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

The data referenced in the calculator is meant to give a relative idea of the inputs required to generate and transport food products to give perspective to consumers. Ideally, the calculator will encourage conversations about the food system and inspire people to reduce their personal food waste.

The Role of Open Access

Open access research provides an opportunity for the public to learn and use data as needed for free, but it is not overwhelmingly common. For researchers outside of academia, trying to pull together useful data can be difficult when considering accessibility barriers.

About two months ago, I began looking for data to create a model of biological inputs and energy requirements in the United States food system. Open data resources such as FAOSTAT, the Economic Research Service, and the Bureau of Transportation Statistics provided helpful figures on land use, food imports, and food transportation values. Aside from these resources, a lot of the information I wanted to reference in building a model came from scientific papers that require journal subscriptions or charge a per-article fee.

Three articles that may have been helpful in my research illustrate the cost of access:

Upon closer investigation, Appetite claims it ‘supports open access’ but charges authors $3000 to make articles available to everyone, according to publisher Elsevier. Clearly, providing affordable open access options doesn’t seem like a priority for publishers.

There may have been useful data in the articles mentioned above. However, I won’t find out because I’m sticking with open access resources for my food systems project.

Public government databases are great, but specific science studies may hold more value to independent researchers. Journals like PLOS ONE lead the way in open access articles for those looking for specific research to complement information from public databases. A 2016 article by Paul Basken in The Chronicle of Higher Education called ‘As an Open-Access Megajournal Cedes Some Ground, a Movement Gathers Steam’ shows a rise in open access papers, but I got the figures via Boston College because the article itself is ‘premium content for subscribers.’

Rise in published open access articles between 2008 and 2015. Data from: Basken, P. 2016. As an open-access megajournal cedes some ground, a movement gathers steam. The Chronicle of Higher Education, 62(19), 5-5.

Charging fees for accessibility can create an elitist barrier between academia and those who want to learn more about certain topics. I’m not proposing that everyone would take advantage of open access research articles if there were cheaper publishing options or no access fees. But if more studies were open access, it would create more opportunities for members of the public to digest scientific studies on their own terms.

There’s immense value in the open-source, collaborative culture of the tech community that I hope spills over into academia. I’m optimistic about a continued increase in open access publications in the science community. For now, I’m looking forward to creating open source projects that take advantage of public data.

Data Analysis and UFO Reports

Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife Linda Miller Costa co-authored a book that compiles UFO reports called UFO Sightings Desk Reference: United States of America 2001-2015.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

A dark night sky with the moon barely visible and trees below.
Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Center). NUFORC’s database is public and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of its data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

The use of data analysis on UFO reports requires careful consideration of potential bias and reasonable explanations for numbers in question. For example, a high volume of reports in the summer could be because more people are spending time outside and would be more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculations should be met with careful criticism. The use of the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.

Why I Started Reading More Often

This year, I began to read more books thanks to a social media hiatus between January and March. I logged out of Facebook, Instagram, Twitter, and Snapchat and deleted the apps on my phone. I knew I spent way too many hours scrolling mindlessly through photos and status updates.

My intention was to renegotiate my use of free time to focus solely on learning and self-improvement. I decided to find books at the library on data science, coding, and other topics of general curiosity.

An assortment of books on a shelf.
A section of my personal library including books that I have not yet finished.

Self education through reading has helped me confront some of my general anxieties about topics I find challenging. In 2018, I have read books focusing on mindfulness, coding, personal finance, business management, and behavioral psychology. I still enjoy reading books in my comfort zone of science and conservation, but I think it’s helpful to understand other fields of interest.

Instead of basking in my blatant ignorance about retirement plans and investments, I’ve been trying to read more books about business and finance. Learning about topics I find totally foreign has forced me to realize it’s simple and rewarding to address ignorance head-on.

Additionally, a few of the books I have read this year are just for pure fun. Authors like Reshma Saujani (founder of Girls Who Code) and Tim Ferriss (lifestyle coach extraordinaire) inspire me endlessly so I chose to read books by each of them. I appreciate how books can provide a platform to connect readers with mentors from any field.

I hope to keep the momentum going and maintain my current reading habits for the rest of the year. In that spirit, I have designated a new page for my reading list and I include a few notes on each book.