Statistics Principles for Data Analysis

Recently, I had to brush up on statistics terms for a data analyst exam. I had trouble pulling together old course notes to create a quick, cohesive study guide. Below, overarching concepts are from the test’s public posting and my notes are derived from a quantitative statistics textbook.

• Central Tendency
• Mean = the average of a distribution
• Median = a distribution’s midpoint
• Mode = the value which occurs most often in a distribution
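As a quick illustration of the three measures, here is a minimal Python sketch using the standard library's statistics module. The scores list is made-up example data, not something from the exam or the textbook.

```python
import statistics

# Made-up example data (an assumption for illustration only)
scores = [70, 85, 85, 90, 100]

mean = statistics.mean(scores)      # sum of the values divided by their count
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # value that occurs most often

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

In this data the median and mode both land on 85, while the mean is pulled slightly higher by the 100.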
• Variability
• How spread out the values in a distribution are, also known as spread
• Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
• Represented through boxplots graphically
• Summarized through quartiles:
• Q1: median of all values below Q2 in the sorted data (25th percentile)
• Q2: median (50th percentile) of all values in distribution
• Q3: median of all values above Q2 in the sorted data (75th percentile)
• Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $x_i$ are the values, $\bar{x}$ is their mean, and $n$ is the number of values
• Standard deviation: square root of the variance (s^2)
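The variability measures above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names five_number_summary and sample_variance are hypothetical helpers, the data is made up, and the quartiles use the "median of each half" rule described in the notes (other software may use slightly different quartile conventions).

```python
import math
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values to the left of the median's position
    upper = data[half + (n % 2):]  # values to the right of the median's position
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Made-up example data (an assumption for illustration only)
values = [1, 3, 5, 7, 9, 11, 13]

summary = five_number_summary(values)
variance = sample_variance(values)
std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance

print("five-number summary:", summary)
print("variance:", variance)
print("standard deviation:", std_dev)
```

The hand-written variance should agree with statistics.variance from the standard library, which also divides by n - 1.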
• Normal Distribution: symmetric, bell-shaped distribution of data, described by its mean and standard deviation
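To make the bell curve a little more concrete, the sketch below uses Python's statistics.NormalDist to compute how much of a normal distribution falls within 1, 2, and 3 standard deviations of the mean. The helper name share_within is hypothetical, and the choice of a standard normal (mean 0, standard deviation 1) is only for convenience.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(share_within(k), 4))
```

The three proportions come out to roughly 68%, 95%, and 99.7%.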
• Hypothesis Testing: Examine the evidence against a null hypothesis. Hypotheses refer to populations or models, not to a particular outcome.
• Compare claims
• Null hypothesis: statement challenged in significance testing
• Example: There is no difference between the means.
• Alternative hypothesis: statement suspected as true instead of the null hypothesis
• Example: The means are not the same.
• Reject or fail to reject the null hypothesis based on the p-value.
• p-value: the probability, assuming the null hypothesis is true, that the test statistic would take a value as extreme as or more extreme than the one observed
• Smaller p-values signify stronger evidence against the null hypothesis in question. Often, an alpha value of 0.05 is used as the cutoff: a result this extreme would occur no more than 5 out of every 100 times if the null hypothesis were true.
• Statistical significance: Achieved when the p-value is less than or equal to alpha.
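One way to see what a p-value measures is an exact permutation test for a difference between two means. This is a swapped-in technique chosen because it needs only the standard library; it is not the t-test procedure a textbook would typically use. The function name permutation_p_value and the two groups are assumptions for illustration.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Under the null hypothesis (no difference between means), group labels
    are interchangeable, so every split of the pooled data is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    extreme = 0
    splits = 0
    # Enumerate index sets so duplicate values are handled correctly
    for idx in combinations(range(len(pooled)), n_a):
        splits += 1
        # Small tolerance guards against floating-point rounding
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / splits

# Made-up example data (an assumption for illustration only)
group_a = [1, 2, 3, 4]
group_b = [5, 6, 7, 8]
alpha = 0.05

p_value = permutation_p_value(group_a, group_b)
print("p-value:", p_value)
print("statistically significant:", p_value <= alpha)
```

For these two groups, only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 (about 0.029), which is below an alpha of 0.05. Enumerating every split is only practical for small samples.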
• Probability: The proportion of times an outcome would occur in a very long series of repeated trials.
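The long-run idea behind probability can be sketched with a simulated fair coin. This is a minimal sketch: the function name long_run_proportion and the fixed seed are assumptions, chosen only so the run is repeatable.

```python
import random

def long_run_proportion(trials, seed=0):
    """Proportion of heads in a number of simulated fair coin flips."""
    rng = random.Random(seed)  # seeded generator so results are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for trials in (10, 1000, 100000):
    print(trials, "flips:", long_run_proportion(trials))
```

With only a few flips the proportion can be far from 0.5, but it tends to settle near 0.5 as the number of flips grows.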
• Correlation
• A measure of the linear relationship between two quantitative variables, based on direction and strength.
• Examples: strong, weak, or no correlation; positive or negative
• Represented by r
• $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the standard deviations of the x-values and y-values
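The correlation r can be computed directly from its definition: standardize each x and y, multiply the pairs, sum, and divide by n - 1. The sketch below is a minimal version of that calculation; the function name correlation and the hours/scores data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Average product of the standardized values
    products = (((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Made-up example data (an assumption for illustration only)
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 70, 72]

r = correlation(hours_studied, exam_scores)
print("r:", round(r, 3))
```

A value of r near 1 indicates a strong positive linear relationship, near -1 a strong negative one, and near 0 little or no linear relationship.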
• Regression
• Simple linear: statistical model with one explanatory variable x, where the mean of the response y falls on a straight line when plotted against x
• Multiple linear: statistical model with more than one explanatory variable
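For the simple linear case, the least-squares line can be fit by hand with the standard formulas for slope and intercept. This is a minimal sketch, not a full regression analysis; the function name fit_line and the data are assumptions, and multiple regression is left out because it generally calls for a numerical library.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation in x
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example data lying exactly on y = 2x + 1 (an assumption for illustration)
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("predicted y at x = 4:", slope * 4 + intercept)
```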
• Parametric Statistics: Assume the data follow a specific distribution, commonly the normal distribution. Typically used with numerical data.
• Nonparametric statistics: Do not assume a specific distribution. Often used with ordinal or categorical data, or when normality is doubtful.
• Analysis of Variance (ANOVA)
• One-way: Compare population means based on one independent variable
• Two-way: Compare population means classified by two independent variables
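For the one-way case, the F statistic compares the variation between group means to the variation within groups. The sketch below computes it by hand; the function name one_way_anova_f and the three groups are made up for illustration, and turning F into a p-value needs the F distribution, which is not in Python's standard library (a library routine such as SciPy's scipy.stats.f_oneway is one option).

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the values around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up example data: three groups of one independent variable
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_anova_f(groups)
print("F:", round(f_stat, 6))
```

For this data the between-group mean square is 13 and the within-group mean square is 1, so F is 13. A large F suggests the group means differ by more than the within-group scatter alone would explain.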

Source:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the Practice of Statistics (7th ed., student ed.). New York: W. H. Freeman and Company.
