Probability Models
Analysis starts with exploration, visualization, summary…
…and then often continues with building probability models (OSEMN framework).
The Data Science Process
- Ask an interesting question
- What is the scientific goal?
- What would you do if you had all the data?
- What do you want to predict or estimate?
- Get the data
- How was the data sampled?
- Which data is relevant?
- Are there privacy issues?
- Explore the data
- Plot the data.
- Anomalies?
- Patterns?
- Model the data
- Build a model.
- Fit the model.
- Validate the model.
- Communicate and visualize the results
- What did we learn?
- Do the results make sense?
- Can we tell a story?
Probability Distribution Models
Definitions:
- Probability model — predicts values of uncertain quantities.
- Uncertain quantities are modeled as random variables. The definitions here are stated for discrete quantities; continuous quantities are described by probability densities instead.
- Random variable — a variable with more than one possible value; each value occurs with some probability.
Notes:
- P(x) denotes the probability that random variable X has value x. It may also be denoted as P(X = x).
- For a single random variable X, the probability distribution P(x) is the probability model for its value.
- Probability Density Function (PDF) – Describes the relative likelihood of each value of a continuous random variable. Integrating it over an interval gives the probability that the variable falls in that interval.
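- Probability Density – The value of the PDF at a point. It is not itself a probability and can exceed 1; probabilities come from the area under the density over an interval.
- Example: If height is continuous, the probability that a man is exactly 6 feet tall is zero, but the probability that he is between 5 and 6 feet tall is positive.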
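The height example can be made concrete with a small sketch. It assumes, purely for illustration, that height follows a normal distribution with a mean of 5.75 feet and a standard deviation of 0.25 feet; these parameters and the function names are hypothetical and do not come from the notes. The interval probability is approximated by integrating the density with the trapezoid rule.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x (a density, not a probability)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def interval_probability(a, b, mean, sd, steps=10_000):
    """Approximate P(a <= X <= b) by integrating the density (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mean, sd) + normal_pdf(b, mean, sd))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mean, sd)
    return total * width

# Hypothetical parameters, in feet (assumed for illustration only).
MEAN_FT, SD_FT = 5.75, 0.25

density_at_mean = normal_pdf(MEAN_FT, MEAN_FT, SD_FT)          # a density value
p_exactly_6 = interval_probability(6.0, 6.0, MEAN_FT, SD_FT)   # zero-width interval
p_5_to_6 = interval_probability(5.0, 6.0, MEAN_FT, SD_FT)      # area under the curve

print(f"density at the mean: {density_at_mean:.3f}")
print(f"P(X = 6 ft): {p_exactly_6:.3f}")
print(f"P(5 ft <= X <= 6 ft): {p_5_to_6:.3f}")
```

Under these assumed parameters the density at the mean is larger than 1, which is a reminder that a density value is not a probability, while the zero-width interval has probability zero.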
Multivariate Probability Models
A multivariate probability model provides a conditional probability distribution for output y given observed values of input x.
- P(y | x) – conditional probability of y given x (R uses ~ instead of |)
- x = independent variables (aka explanatory variables, predictors)
- y = dependent variable (aka response, risk, output)
- The predicted probabilities of y depend on (are “conditioned on”) both the x values and the model assumptions.
- Examples of conditional probability relationships: disease ~ symptoms, future ~ past, classification ~ features, risk ~ factors, behavior ~ offers
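As a rough illustration of a conditional distribution such as disease ~ symptoms, the sketch below estimates P(y | x) as relative frequencies within the rows that share an x value. The records and the function name `conditional_distribution` are invented for this example; real models would usually smooth or regularize such estimates.

```python
from collections import Counter

# Hypothetical (symptom, disease) observations, invented for illustration.
records = [
    ("fever", "flu"), ("fever", "flu"), ("fever", "cold"),
    ("cough", "cold"), ("cough", "cold"), ("cough", "flu"),
    ("fever", "flu"), ("cough", "cold"),
]

def conditional_distribution(rows, x_value):
    """Estimate P(y | x = x_value) as relative frequencies among matching rows."""
    counts = Counter(y for x, y in rows if x == x_value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no records with x = {x_value!r}")
    return {y: n / total for y, n in counts.items()}

p_given_fever = conditional_distribution(records, "fever")
p_given_cough = conditional_distribution(records, "cough")

print("P(disease | fever):", p_given_fever)
print("P(disease | cough):", p_given_cough)
```

Each returned dictionary is a full probability distribution over y, so its values sum to 1 for any x value that appears in the data.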
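Cumulative Distribution Function (CDF) – Returns the probability that the outcome is less than or equal to a given value, P(X ≤ x). The probability that the outcome falls inside an interval is the difference between the CDF values at the interval's endpoints.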
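A minimal sketch of a CDF, assuming a normal distribution as the example (the notes do not tie the CDF to any particular distribution). The helper names `normal_cdf` and `prob_between` are hypothetical; the formula uses the error function from Python's standard library.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for a normal random variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_between(a, b, mean=0.0, sd=1.0):
    """P(a < X <= b), computed as the difference of two CDF values."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

print(f"P(X <= 0): {normal_cdf(0.0):.4f}")
print(f"P(-1 < X <= 1): {prob_between(-1.0, 1.0):.4f}")
```

Interval probabilities come from subtracting CDF values, so a single CDF is enough to answer any "between a and b" question about the variable.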
Main Steps in Probability Modeling
- Select a model to describe the data (or learn a model from the data, i.e., machine learning)
- Fit the model to data
- Estimate parameters (e.g., regression coefficients)
- Provide best guesses (“point estimates”) and confidence intervals
- Validate model’s predictive accuracy
- Use cross-validation to characterize prediction errors (split the data into subsets, fit the model on some subsets, and measure its errors on the subsets held out from fitting)
- Error measures: Mean Squared Error for continuous variables; misclassification rates for binary or discrete variables
- Use model to make predictions and characterize remaining uncertainties
- A probabilistic prediction is a probability distribution
- Apply the model to decisions
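The fit, validate, and predict steps above can be sketched with a one-variable linear model. Everything here is an assumption made for illustration: the synthetic data, the true coefficients (2 and 3), the noise level, and the helper names. The model is fit by ordinary least squares and validated with k-fold cross-validation using mean squared error.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares point estimates for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(model, xs, ys):
    """Average squared difference between observed and predicted y."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5):
    """Average held-out MSE: each fold is scored by a model fit on the other folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            mean_squared_error(model, [xs[i] for i in test], [ys[i] for i in test])
        )
    return sum(fold_errors) / k

# Synthetic data (assumed): y = 2 + 3x plus Gaussian noise with sd 0.5.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

intercept, slope = fit_line(xs, ys)   # fit the model, estimate parameters
cv_mse = k_fold_mse(xs, ys, k=5)      # validate predictive accuracy

print(f"intercept: {intercept:.3f}, slope: {slope:.3f}")
print(f"5-fold cross-validated MSE: {cv_mse:.3f}")
```

The sketch reports point estimates only; confidence intervals and a full predictive distribution would need the noise variance as well, which is left out to keep the example short.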
Useful Probability Models
These are all probability models because they all calculate conditional probabilities for outputs given observed inputs.
Probability distribution models
- Binomial distribution (2 outcomes)
- Normal distribution (continuous outcome; approximates sums of many independent random variables)
- Poisson distribution (count outcome, number of rare events)
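For the two discrete distributions in this list, the probability of each outcome can be computed directly from the standard formulas. This is a small sketch using only the standard library; the function names and the example parameters are chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k): k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(K = k): k events when events occur at the given average rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

# Example parameters (assumed): 10 fair coin tosses; 2 rare events on average.
print(f"P(5 heads in 10 tosses): {binomial_pmf(5, 10, 0.5):.4f}")
print(f"P(0 events | rate 2): {poisson_pmf(0, 2.0):.4f}")
print(f"binomial total: {sum(binomial_pmf(k, 10, 0.5) for k in range(11)):.4f}")
```

Summing the binomial probabilities over every possible count gives 1, which is a quick sanity check that the function defines a probability distribution.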
Regression models
- Logistic (for binary dependent variables, e.g., death, pregnancy)
- Linear (for continuous dependent variables)
- Poisson or generalized Poisson (for count dependent variables)
- Flexible regression models (nonparametric (smoothing) regression)
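To show how a regression model yields a conditional probability, here is a minimal logistic regression sketch with one predictor, fit by gradient ascent on the log-likelihood. The data (a hypothetical dose and a binary outcome), the learning rate, and the number of iterations are all assumptions for illustration; in practice a statistics library would be used instead.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

def predict_probability(coefficients, x):
    """Predicted P(y = 1 | x) under the fitted model."""
    b0, b1 = coefficients
    return sigmoid(b0 + b1 * x)

# Hypothetical data: x is a dose, y is 1 if the outcome occurred.
doses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

coefficients = fit_logistic(doses, outcomes)
print(f"P(y = 1 | dose 0.5): {predict_probability(coefficients, 0.5):.3f}")
print(f"P(y = 1 | dose 4.0): {predict_probability(coefficients, 4.0):.3f}")
```

The output of the model is a probability for each input rather than a hard label, which matches the idea of a probabilistic prediction.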
Others
- Time series and dynamic regression models (trends; changes over time)
- Bayesian networks (many variables affecting each other)
- Survival models
- Transition models and generalizations
- Dynamic causal models; simulation of changes over time
- Markov models of policy impacts
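As one concrete instance of a transition model, the sketch below propagates a probability distribution over states through a Markov chain. The two states and the transition probabilities are hypothetical values chosen for illustration, as is the helper name `step`.

```python
def step(distribution, transition):
    """Advance one period: new[j] = sum over i of distribution[i] * transition[i][j]."""
    states = list(transition)
    return {
        j: sum(distribution[i] * transition[i][j] for i in states)
        for j in states
    }

# Hypothetical transition probabilities; each row sums to 1.
transition = {
    "healthy": {"healthy": 0.9, "sick": 0.1},
    "sick": {"healthy": 0.5, "sick": 0.5},
}

# Start with everyone healthy and follow the distribution for three periods.
distribution = {"healthy": 1.0, "sick": 0.0}
history = [distribution]
for _ in range(3):
    distribution = step(distribution, transition)
    history.append(distribution)

for period, dist in enumerate(history):
    print(period, {state: round(p, 4) for state, p in dist.items()})
```

Changing the numbers in `transition` is a simple way to compare how different assumed policies would shift the long-run share of each state.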