ECDF

If we do not have a probability model (such as the normal CDF given by pnorm()), we can use the data itself as the model through its empirical cumulative distribution function (ECDF).

Prediction via simulation using a sampling function (rnorm(), etc.) and ecdf()

ecdf(x)(y) returns the fraction of values in the data vector x that are less than or equal to y.

  • x is a data vector
  • y is where we evaluate ecdf(x)
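
As a minimal sketch of the definition above, the toy vector below (hypothetical values, not from these notes) shows that ecdf(x) is itself a function and that evaluating it at y gives the same number as counting the values of x that are <= y directly.

```r
# Toy data vector (hypothetical values chosen for illustration).
x <- c(2, 5, 5, 7, 10)

Fx <- ecdf(x)          # ecdf() returns a function

Fx(5)                  # fraction of x that is <= 5, here 3/5 = 0.6
mean(x <= 5)           # the same fraction computed directly
Fx(c(1, 7, 12))        # vectorised over evaluation points: 0.0 0.8 1.0
```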

rnorm(n, mean, sd) draws n random values from a normal distribution with the given mean and standard deviation sd.

  • other distributions have analogous samplers: rbinom(), rexp(), rpois()
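
The sampling functions can be tried side by side. The sketch below draws five values from each; the parameter values (size, prob, rate, lambda) are arbitrary choices for illustration and do not come from these notes.

```r
set.seed(1)            # make the draws reproducible
n <- 5                 # number of values to draw from each distribution

samples <- list(
  normal      = rnorm(n, mean = 75, sd = 25),       # mean 75, SD 25
  binomial    = rbinom(n, size = 10, prob = 0.5),   # successes in 10 trials
  exponential = rexp(n, rate = 1 / 75),             # mean 1/rate = 75
  poisson     = rpois(n, lambda = 3)                # mean 3
)

str(samples)           # show the drawn values, one vector per distribution
```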

Example:

  • The waiting time at a shop is approximately normally distributed with mean = 75 minutes and SD = 25 minutes. What is the probability that a customer’s waiting time will exceed 90 minutes?
set.seed(1)                           # set the seed value for the RNG
data <- rnorm(100, mean=75, sd=25)    # draw 100 samples from the normal model
ecdf(data)(60)                        # empirical fraction of waiting times <= 60 minutes
1 - ecdf(data)(90)                    # empirical fraction of waiting times > 90 minutes (answers the question)
# with 100 samples each result is a multiple of 0.01; the value depends on the seed
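
Because this example does have a known model, the simulated answer can be checked against the exact one from pnorm(). The sketch below is one way to do that; estimate_exceed() is a hypothetical helper introduced here, and the sample sizes are arbitrary. The simulated estimate should move closer to the exact value as n grows.

```r
# Exact P(X > 90) under the normal model, about 0.274.
exact <- 1 - pnorm(90, mean = 75, sd = 25)

# Hypothetical helper: simulate n waiting times and return the
# empirical fraction that exceeds the threshold.
estimate_exceed <- function(n, threshold = 90, mean = 75, sd = 25) {
  sims <- rnorm(n, mean = mean, sd = sd)
  1 - ecdf(sims)(threshold)
}

set.seed(1)
sizes <- c(100, 1000, 100000)
estimates <- sapply(sizes, estimate_exceed)

# Compare each simulated estimate with the exact value.
data.frame(n = sizes, estimate = estimates, abs_error = abs(estimates - exact))
```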

Plotting an ECDF using plot()

plot(ecdf(dv))    # dv is a data vector
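
A slightly fuller sketch, assuming dv holds simulated waiting times like those in the example above: it labels the axes and overlays the model CDF so the step-shaped ECDF can be compared with the smooth curve. The titles, colour, and line type are illustrative choices.

```r
set.seed(1)
dv <- rnorm(100, mean = 75, sd = 25)   # simulated waiting times (minutes)

# Step plot of the ECDF.
plot(ecdf(dv),
     main = "ECDF of simulated waiting times",
     xlab = "Waiting time (minutes)",
     ylab = "Fraction of values <= x")

# Overlay the CDF of the model the data were drawn from.
curve(pnorm(x, mean = 75, sd = 25), add = TRUE, col = "red", lty = 2)
```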

Goodness-of-Fit Testing

  • Goodness of fit describes how well a model fits a set of observations.
  • A goodness-of-fit test is a hypothesis test of whether a model fits the data.
  • Even the best-fitting model may not describe the data adequately.
  • The null hypothesis is that the observations (data) are drawn from a specified distribution, or from a specified family of distributions (e.g., the normal family).
  • In ECDF terms, we test the null hypothesis that the CDF of the fitted model does not differ from the ECDF of the data.
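
These notes do not name a specific test. One common choice that compares an ECDF with a model CDF is the Kolmogorov-Smirnov test, available in R as ks.test(). The sketch below assumes that test, uses simulated data as a stand-in for real observations, and fixes the model parameters in advance; if the parameters were instead estimated from the same data, the p-value reported by ks.test() would no longer be reliable.

```r
set.seed(1)
obs <- rnorm(100, mean = 75, sd = 25)   # stand-in for observed data

# H0: obs are drawn from a normal distribution with mean 75 and SD 25.
fit_ok <- ks.test(obs, "pnorm", mean = 75, sd = 25)
fit_ok$statistic    # D: largest vertical gap between the ECDF and the model CDF
fit_ok$p.value      # a small p-value is evidence against H0

# The same data tested against a poorly chosen model (mean 120).
fit_bad <- ks.test(obs, "pnorm", mean = 120, sd = 25)
fit_bad$statistic   # a much larger gap
fit_bad$p.value     # very small, so H0 is rejected
```

A large p-value only means the data give no evidence against the model; it does not prove the model is correct.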