Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. Below, overarching concepts are from the test’s public posting and my notes are derived from a quantitative statistics textbook.
- Central Tendency
- Mean = the arithmetic average of a distribution (the sum of all values divided by the number of values)
- Median = a distribution’s midpoint
- Mode = the value that occurs most often in a distribution
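As a quick check of the three measures above, here is a minimal sketch using Python's standard `statistics` module. The `scores` list is made-up sample data, not taken from the textbook.

```python
import statistics

# Made-up sample data, for illustration only.
scores = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(scores)      # sum of values / number of values = 30 / 6
median = statistics.median(scores)  # even count, so the average of the two middle values (3 and 5)
mode = statistics.mode(scores)      # the value that appears most often

print(mean, median, mode)  # 5 4.0 3
```

Note how the single large value (10) pulls the mean above the median, which is why the median is often preferred for skewed data.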
- Variability
- How widely the values in a distribution are dispersed, also known as spread
- Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
- Represented graphically through boxplots
- Summarized through quartiles:
- Q1: median of all values to the left of Q2
- Q2: median (50th percentile) of all values in distribution
- Q3: median of all values to the right of Q2
- Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the mean and $n$ is the number of values
- Standard deviation: square root of the variance (s^2)
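The definitions above can be sketched in a few lines of Python. This is only an illustration: `five_number_summary` is a hypothetical helper name, the data are made up, and the quartile rule follows the "median of the values to the left/right of Q2" definition above (it needs at least four values; other software may use slightly different quartile conventions).

```python
import math
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values to the left of the median
    upper = data[half + n % 2:]  # values to the right; skips the middle value when n is odd
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

# Made-up sample data, for illustration only.
data = [1, 3, 4, 6, 8, 9, 11]

summary = five_number_summary(data)

# Sample variance: squared deviations from the mean, summed, divided by n - 1.
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
std_dev = math.sqrt(variance)

print(summary)  # (1, 3, 6, 9, 11)
print(round(variance, 3), round(std_dev, 3))
```

The hand-computed `variance` should agree with `statistics.variance(data)`, which also divides by n - 1.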
- Normal Distribution: symmetric, bell-shaped distribution of data, described by its mean and standard deviation
- Hypothesis Testing: Examine evidence against a null hypothesis. Hypotheses refer to populations or models, not to a particular outcome.
- Compare claims
- Null hypothesis: statement challenged in significance testing
- Example: There is no difference between the means.
- Alternative hypothesis: statement suspected as true instead of the null hypothesis
- Example: The means are not the same.
- Reject or fail to reject the null hypothesis by comparing the p-value with a chosen significance level (alpha).
- p-value: the probability, computed assuming the null hypothesis is true, that the test statistic would take a value as extreme as or more extreme than the one actually observed
- Smaller p-values signify stronger evidence against the null hypothesis in question. Often, a significance level (alpha) of 0.05 is used, which requires evidence so strong that a result this extreme would occur no more than 5 out of every 100 times if the null hypothesis were true.
- Statistical Significance: Achieved at level alpha when the p-value is less than or equal to alpha.
- Probability: The proportion of times an outcome would occur in a very long series of repeated trials.
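To make the p-value and alpha comparison concrete, here is a minimal sketch of one specific test, a two-sided one-sample z test. The choice of test and every number in it are assumptions made for illustration: it presumes the population standard deviation is known and the sample mean is approximately normal. It uses only Python's standard library.

```python
import math
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
mu_0 = 50.0         # mean claimed by the null hypothesis
sigma = 5.0         # population standard deviation, assumed known
n = 25              # sample size
sample_mean = 52.0  # observed sample mean
alpha = 0.05        # significance level

# Test statistic: how many standard errors the sample mean lies from mu_0.
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Two-sided p-value: the probability, if the null hypothesis is true,
# of a z value at least this far from 0 in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

significant = p_value <= alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, significant: {significant}")
```

Here z works out to 2.0 and the p-value to about 0.0455, so the null hypothesis is rejected at alpha = 0.05, although it would not be at a stricter level such as 0.01.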
- Correlation
- A measure of the linear relationship between two quantitative variables, based on direction and strength.
- Examples: strong, weak, or no correlation; positive or negative
- Represented by r
- $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the standard deviations of the x- and y-values
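The formula for r amounts to standardizing each x and y value, multiplying the pairs, and averaging over n - 1. A minimal sketch follows; `sample_std` and `correlation` are hypothetical helper names and the data are made up.

```python
import math

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation r: sum of products of standardized x and y values, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_std(xs), sample_std(ys)
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up paired data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))  # 0.7746
```

A value near 0.77 indicates a fairly strong positive linear relationship. Points lying exactly on an upward-sloping line give r = 1, and on a downward-sloping line r = -1.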
- Regression
- Simple linear: statistical model where the means of y occur on a line when plotted against x for one explanatory variable
- Multiple linear: statistical model with more than one explanatory variable
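For the simple linear case, the line is usually estimated by least squares. That fitting method is an assumption here, since the notes above do not name one. A minimal sketch with a hypothetical helper `fit_line` and made-up data:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up paired data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 6  # estimated mean of y when x = 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))  # 0.6 2.2 5.8
```

Multiple linear regression extends the same idea to several explanatory variables, but solving it by hand requires matrix algebra, so a statistics library is the practical choice there.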
- Parametric Statistics: Assume the data follow a particular distribution, usually the normal distribution, so they are typically used with numerical data.
- Nonparametric Statistics: Do not assume a normal distribution, so they are often used with ordinal or categorical data.
- Analysis of Variance (ANOVAs)
- One-way: Compare population means based on 1 independent variable
- Two-way: Compare population means classified based on 2 independent variables
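As a rough illustration of the one-way case, the sketch below computes the ANOVA F statistic by hand: variation between group means divided by variation within groups. The helper name `one_way_f` and the three groups are made up. Converting F to a p-value requires the F distribution, which Python's standard library does not provide, so that step is left out.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic: mean square between groups / mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: how far each value is from its own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up data: one independent variable with three levels.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_f(groups)
print(round(f_stat, 2))  # 13.0
```

A large F means the group means differ by more than the within-group scatter would suggest, which is evidence against the null hypothesis that all population means are equal.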
Source:
Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics. Seventh edition/Student edition. New York: W.H. Freeman and Company, a Macmillan Higher Education Company.
See here for an updated version of textbook