Statistics Principles for Data Analysis

[Image: Four traditional dice on a game board]

Recently, I had to brush up on statistics terms for a data analyst exam, and I had trouble pulling together old course notes into a quick, cohesive study guide. The overarching concepts below come from the test’s public posting, and my notes are derived from a quantitative statistics textbook (see Source).

  • Central Tendency
    • Mean = the average of a distribution
    • Median = a distribution’s midpoint
    • Mode = the value that occurs most often in a distribution
  • Variability
    • How spread out the values in a distribution are, also known as spread
    • Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
      • Represented graphically through boxplots
    • Summarized through quartiles:
      • Q1: median of all values to the left of Q2
      • Q2: median (50th percentile) of all values in the distribution
      • Q3: median of all values to the right of Q2
    • Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean and $n$ is the number of values
    • Standard deviation: square root of the variance (s^2)
  • Normal Distribution: symmetric, bell-shaped distribution of data, described by its mean and standard deviation
  • Hypothesis Testing: Examine the evidence against a null hypothesis. Hypotheses refer to populations or models, not to a particular outcome.
    • Compare competing claims
    • Null hypothesis: the statement being tested in a significance test, usually a statement of no effect or no difference
      • Example: There is no difference between the means.
    • Alternative hypothesis: the statement suspected to be true instead of the null hypothesis
      • Example: The means are not the same.
    • Reject or fail to reject the null hypothesis by comparing the p-value to a chosen significance level (alpha).
    • p-value: the probability, computed assuming the null hypothesis is true, that the test statistic would take a value as extreme as or more extreme than the one observed
    • Smaller p-values signify stronger evidence against the null hypothesis in question. Often, an alpha value of 0.05 is used: the null hypothesis is rejected only when a result at least this extreme would occur no more than 5 out of every 100 times if the null hypothesis were true.
  • Statistical Significance: Achieved at level alpha when the p-value is less than or equal to alpha.
  • Probability: The proportion of times an outcome would occur over many repeated trials.
  • Correlation
    • A measure of the linear relationship between two quantitative variables, based on direction and strength.
    • Examples: strong, weak, or no correlation; positive or negative
    • Represented by r
    • $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, where $s_x$ and $s_y$ are the standard deviations of the x-values and y-values
  • Regression
    • Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
    • Multiple linear: statistical model with more than one explanatory variable
  • Parametric Statistics: Assume the data follow a particular distribution, often the normal distribution; typically applied to numerical data.
  • Nonparametric Statistics: Do not assume a normal distribution; often applied to ordinal or categorical data, or to numerical data that are clearly not normal.
  • Analysis of Variance (ANOVA)
    • One-way: Compare population means classified by one independent variable (factor)
    • Two-way: Compare population means classified by two independent variables (factors)
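
As a rough illustration of the central tendency, variability, and correlation definitions above, here is a minimal Python sketch that uses only the standard library. The `scores` list is made up for illustration, and the helper names (`five_number_summary`, `sample_variance`, `correlation`) are hypothetical, not taken from the textbook. Quartiles are computed the way the notes describe them (medians of the values on each side of Q2); statistical software often uses slightly different quartile conventions.

```python
from math import sqrt
from statistics import mean, median, multimode


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are taken as the medians of the values to the left and
    right of the overall median (Q2), as in the notes above.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                    # values to the left of Q2
    upper = data[half + len(data) % 2:]    # values to the right of Q2
    return data[0], median(lower), median(data), median(upper), data[-1]


def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def correlation(xs, ys):
    """r: average product of standardized x- and y-values, using n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sqrt(sample_variance(xs)), sqrt(sample_variance(ys))
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (len(xs) - 1)


# Made-up data, for illustration only.
scores = [2, 4, 4, 5, 7, 8, 12]

print(mean(scores), median(scores), multimode(scores))  # 6 5 [4]
print(five_number_summary(scores))                      # (2, 4, 5, 8, 12)
print(sample_variance(scores))                          # 11.0
print(sqrt(sample_variance(scores)))                    # standard deviation

# A perfectly linear, increasing relationship gives r very close to 1.
print(correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

Dividing by n - 1 rather than n matches the sample variance formula in the notes; `multimode` is used instead of `mode` so that ties are reported as a list rather than hidden.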

Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the Practice of Statistics (7th ed., student ed.). New York: W. H. Freeman and Company, a Macmillan Higher Education Company.


Data Analysis and UFO Reports


Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa, who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife, Linda Miller Costa, co-authored a book that compiles UFO reports, UFO Sightings Desk Reference: United States of America 2001-2015.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

[Image: A dark night sky with the moon barely visible and trees below. Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.]

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Center). NUFORC’s database is public, and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of its data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

Analyzing UFO reports requires careful consideration of potential bias and of reasonable explanations for the numbers in question. For example, a high volume of reports in the summer could simply reflect more people spending time outside, where they are more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculation should be met with careful scrutiny. Using the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.