Statistics Principles for Data Analysis

Statistics Principles for Data Analysis
Four traditional dice on a game board

Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. Below, overarching concepts are from the test’s public posting and my notes are derived from a quantitative statistics textbook.

  • Central Tendency
    • Mean = the average of a distribution
    • Median = a distribution’s midpoint
    • Mode = the variable which occurs most often in a distribution
  • Variability
    • The distribution of data, also known as spread
    • Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
      • Represented through boxplots graphically
    • Summarized through quartiles:
      • Q1: median of all values to left of Q2
      • Q2: median (50th percentile) of all values in distribution
      • Q3: median of all values to right of Q2
    • Variance: s^2 = SUM((value minus mean)^2 for all values)) / (number of values-1)
    • Standard deviation: square root of the variance (s^2)
  • Normal Distribution: bell-curve distribution of data
  • Hypothesis Testing: Examine evidence against a null hypothesis, hypotheses referring to populations or models and not a certain outcome
    • Compare claims
    • Null hypothesis: statement challenged in significance testing
      • Example: There is not a difference between means.
    • Alternative hypothesis: statement suspected as true instead of the null hypothesis
      • Example: The means are not the same.
    • Accept or reject null hypothesis based on a certain p-value.
    • p-value: the likelihood that the test statistic would be a value equal or higher than what is observed
    • Smaller p-values signify stronger evidence against the null hypothesis in question. Often, an alpha value of 0.05 is used. Evidence would be so strong that something outside the p-value should only occur 5 out of every 100 times.
  • Statistical Significance Testing: Achieved at the level where the p-value is equal or less than alpha.
  • Probability: The proportion of times an outcome would occur given many repeated tests.
  • Correlation
    • A measure of the linear relationship between two quantitative variables, based on direction and strength.
    • Examples: strong, weak, or no correlation; positive or negative
    • Represented by r
    • r = (1/n-1)*SUM((all x-values minus mean summed/standard deviation of all x-values),(all y-values minus mean summed/standard deviation of all y-values))
  • Regression
    • Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
    • Multiple linear: statistical model with more than one explanatory variable
  • Parametric Statistics: Use numerical data because this assumes data has a normal distribution.
  • Nonparametric statistics: Use ordinal or categorical data because this does not assume a normal distribution.
  • Analysis of Variance (ANOVAs)
    • One-way: Compare population means based on 1 independent variable
    • Two-way: Compare population means classified based on 2 independent variables

Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics. Seventh edition/Student edition. New York: W.H. Freeman and Company, a Macmillan Higher Education Company.

See here for an updated version of textbook