Concepts
Null hypothesis
A default hypothesis that a quantity to be measured (for example, an effect or a difference) is zero (null).
- The null hypothesis is true if data values are drawn from the hypothesized distribution.
p-value
The probability of obtaining observed, or more extreme, results when the null hypothesis is true.
- Informally, how likely a difference at least as large as the one observed would be if only chance were at work. It is not the probability that the null hypothesis is true.
- A p-value near zero provides evidence against the null hypothesis.
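To make "observed, or more extreme" concrete, here is a minimal sketch using a hypothetical coin-flip experiment (the coin, the counts, and the variable names are assumptions for illustration, not part of the notes). The null hypothesis is that the coin is fair, and the p-value is the probability of seeing at least as many heads as were observed.

```r
# Hypothetical example: a coin lands heads 9 times in 10 flips.
# Null hypothesis: the coin is fair (probability of heads = 0.5).
n_flips <- 10
observed_heads <- 9

# One-sided p-value: P(9 or more heads) when the coin is fair.
p_value <- sum(dbinom(observed_heads:n_flips, size = n_flips, prob = 0.5))
print(p_value)  # 11/1024, about 0.0107
```

A value this close to zero would count as evidence against the coin being fair.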
t-test
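A one-sample t-test tests the null hypothesis that the population mean equals a specified value. In R, t.test(x) uses zero as that value by default.
- T-tests assume:
  - The data are drawn from a normal distribution
  - The distribution remains fixed (every observation comes from the same distribution)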
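The one-sample test can be reproduced by hand, which shows what t.test is doing. This is a sketch under the assumptions listed above; the sample y and the variable names are hypothetical.

```r
# Hypothetical sample; the null hypothesis is that the true mean is mu0 = 0.
y <- c(0.8, -0.3, 1.4, 0.6, 1.1)
mu0 <- 0
n <- length(y)

# t statistic: distance of the sample mean from mu0 in standard-error units.
t_stat <- (mean(y) - mu0) / (sd(y) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in test should agree with the manual calculation.
result <- t.test(y, mu = mu0)
print(c(manual = p_manual, builtin = result$p.value))
```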
Example
- A random sample of 5 customers spent $10.24, $12.31, $9.38, $14.03, and $11.72.
- Predict how much the next 100 customers will spend.
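x <- c(10.24, 12.31, 9.38, 14.03, 11.72)  # amounts spent by the 5 sampled customers, in dollars
t.test(x)  # tests H0: mean = 0 and reports a 95% confidence interval for the mean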
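One way to turn the t-test output into an answer for the next 100 customers is to scale the confidence interval for the mean by 100. This is a sketch, assuming the t-test assumptions hold and that future customers spend like the sampled ones; the result is an interval for the expected total, not a prediction interval for the actual total, which would be somewhat wider.

```r
x <- c(10.24, 12.31, 9.38, 14.03, 11.72)
n_future <- 100  # number of future customers

# 95% confidence interval for the mean amount spent per customer.
ci_mean <- t.test(x)$conf.int

# Scale the point estimate and the interval to the total for 100 customers.
expected_total <- n_future * mean(x)        # 100 * 11.536 = 1153.6
ci_total <- n_future * as.numeric(ci_mean)  # lower and upper bounds in dollars

print(expected_total)
print(ci_total)
```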
OSEMN data science framework
- Obtain: gather data from relevant sources
- Scrub: clean the data into formats that machines can process
- Explore: find significant patterns and trends using statistical methods
- Model: construct models to predict and forecast
- iNterpret: put the results to good use
Simpson’s Paradox
- A phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
Lessons:
- What is true for a population may be false for each subpopulation.
- Many proportions and differences of proportions can only be interpreted clearly in the context of multiple causally relevant variables (multivariate models).
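The reversal can be seen in a small table. The counts below are hypothetical and chosen only to illustrate the effect: treatment A has the higher success rate within each severity group, yet treatment B has the higher rate once the groups are combined, because A was mostly given to severe cases.

```r
# Hypothetical success counts for two treatments, split by case severity.
d <- data.frame(
  group     = c("mild", "mild", "severe", "severe"),
  treatment = c("A", "B", "A", "B"),
  success   = c(9, 80, 60, 5),
  total     = c(10, 100, 100, 10)
)

# Within each group, A does better: 0.9 vs 0.8 (mild), 0.6 vs 0.5 (severe).
d$rate <- d$success / d$total
print(d)

# Combine the groups by summing counts per treatment.
pooled <- aggregate(cbind(success, total) ~ treatment, data = d, FUN = sum)
pooled$rate <- pooled$success / pooled$total

# Pooled, B does better: 85/110 vs 69/110.
print(pooled)
```

Here severity is the causally relevant variable: ignoring it makes the pooled comparison misleading.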