Covered in 9/12 lecture

Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Bayes' rule: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
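
As a small numeric illustration of Bayes' rule, here is a minimal sketch that updates a prior after observing evidence. The function name `bayes_posterior` and the test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are made up for illustration and are not from the lecture.

```python
def bayes_posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    # Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(f"P(condition | positive test) = {posterior:.3f}")  # about 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small.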

Expected value: $E[X] = \sum_i x_i \, p_i$, where $x_i$ are the values that $X$ takes and $p_i$ is the probability that $X$ takes the value $x_i$.
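
The expected value is just a probability-weighted sum, which can be sketched for a fair six-sided die. The helper name `expected_value` is an assumption for this example; `Fraction` is used only to keep the arithmetic exact.

```python
import math
from fractions import Fraction


def expected_value(values, probs):
    """Sum of value * probability over all outcomes."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))


faces = [1, 2, 3, 4, 5, 6]
fair = [Fraction(1, 6)] * 6  # each face is equally likely
die_mean = expected_value(faces, fair)
print(die_mean)  # 7/2
```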

Distributions

  • A distribution is a statistical function that gives the probability of a given outcome from an experiment
  • May be continuous, but the histogram of a large enough sample converges to the true distribution
  • Distributions are like histograms

Types:

  • Uniform distribution
    • Every outcome is equally likely, so the plot is just a horizontal line
  • Normal distribution (Gaussian)
    • Extremely common, often assume normal distribution unless reason to think otherwise
    • Continuous
  • Poisson distribution
    • Probabilities sum to 1; the mass is bunched up on one side, with a longer tail toward larger counts
    • Given the rate at which some event occurs, the Poisson distribution tells you the probability of each number of occurrences over a time period
    • Discrete
  • Zero-inflated Poisson distribution
    • Poisson distribution with a spike at 0
  • Bernoulli distribution
    • A single trial that has a fixed probability of success
    • Example: Flipping a coin once
  • Binomial distribution
    • The probability of getting a given number of successes from a fixed number of independent Bernoulli trials
    • Discrete
    • Looks like a normal distribution (if you have enough trials)
  • Power law distribution
    • e.g. number of close friends
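
To make the discrete distributions above concrete, here is a minimal sketch of the binomial and Poisson probability mass functions using only the standard library. The function names and the example rate of 3 events per hour are assumptions for illustration.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, rate: float) -> float:
    """P(exactly k events in one period, given `rate` events per period on average)."""
    return rate**k * math.exp(-rate) / math.factorial(k)


# A Bernoulli trial is the n = 1 case of the binomial: one fair coin flip.
p_heads = binomial_pmf(1, n=1, p=0.5)

# Probability of exactly 5 heads in 10 fair flips.
p_five_heads = binomial_pmf(5, n=10, p=0.5)

# Hypothetical rate: 3 events per hour; chance of seeing none in an hour.
p_no_events = poisson_pmf(0, rate=3.0)

# Discrete pmfs sum to 1 over all outcomes (the Poisson sum is truncated at 50 terms).
binomial_total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
poisson_total = sum(poisson_pmf(k, 3.0) for k in range(50))
```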

Central Limit Theorem

If you have a distribution, such as the outcome of rolling a die (uniform distribution), and you repeatedly draw samples from it and build a new distribution from the means of those samples, then that new distribution approaches a normal distribution as the sample size gets large enough.

Useful because if you have two funky-looking distributions, you can sample each of them a bunch of times, take the sample means, and now you have two nice, approximately normal distributions that you can compare.
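
The dice example can be checked with a quick simulation. This is a rough sketch, not a formal test of normality; the sample size of 30, the 5000 repetitions, and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(num_samples: int, sample_size: int, rng: random.Random) -> list:
    """Roll `sample_size` fair dice `num_samples` times; return the mean of each sample."""
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(num_samples=5000, sample_size=30, rng=rng)

# A single die has mean 3.5 and variance 35/12, so the sample means should
# cluster around 3.5 with standard deviation close to sqrt((35/12) / 30).
center = statistics.fmean(means)
spread = statistics.stdev(means)
expected_spread = (35 / 12 / 30) ** 0.5

# Rough normality check: about 68% of a normal distribution lies within
# one standard deviation of its mean.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(f"center={center:.3f} spread={spread:.3f} within 1 sd={within_one_sd:.1%}")
```

A single die roll is flat across 1 to 6, yet a histogram of `means` would look bell-shaped.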

Criteria for samples:

  • Picked at random
  • Representative of population
  • Big enough to draw conclusions (>=30)
  • Include less than 10% of the population, if you’re sampling without replacement

Summary Statistics

  • Examples: Mean, median, mode
  • Cannot rely solely on summary statistics; you first need to understand your data holistically

Measures of central tendency:

  • The Pythagorean means
    • Arithmetic mean - Very sensitive to outliers
    • Geometric mean - A measure of central tendency less sensitive to outliers
    • Harmonic mean - Primarily used for rates (we won’t use it)
  • Median
  • Mode
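
The measures above can be compared on a small data set with one outlier. This sketch uses Python's `statistics` module; the data values are made up for illustration.

```python
import statistics

# Hypothetical data: mostly small values plus one large outlier.
data = [2, 3, 3, 4, 5, 100]

arithmetic = statistics.mean(data)           # pulled far upward by the outlier
geometric = statistics.geometric_mean(data)  # less affected by the outlier
harmonic = statistics.harmonic_mean(data)    # dominated by the small values
median = statistics.median(data)             # middle of the sorted values
mode = statistics.mode(data)                 # most frequent value

print(arithmetic, round(geometric, 2), round(harmonic, 2), median, mode)
```

For positive data the three Pythagorean means always satisfy harmonic ≤ geometric ≤ arithmetic.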

Measures of variance:

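  • Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$, the average squared distance from the mean $\mu$
  • Standard deviation: $\sigma = \sqrt{\sigma^2}$, the square root of the variance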
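
As a worked example, variance and standard deviation can be computed by hand and checked against the standard library. The data values are hypothetical, and the population form (dividing by n) is assumed here; the sample form divides by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.fmean(data)

# Population variance: average squared distance from the mean.
variance_by_hand = sum((x - mean) ** 2 for x in data) / len(data)
std_by_hand = variance_by_hand ** 0.5

# Standard library equivalents for a full population.
variance = statistics.pvariance(data)
std = statistics.pstdev(data)

# Sample versions divide by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(variance_by_hand, std_by_hand)  # 4.0 2.0
```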

Other descriptors:

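  • Skew (left- or right-tailed): Which side of the distribution has the longer tail
    • Negative skew (left-tailed): Longer tail on the left, bulk of the data on the right; the mean is usually pulled left of the median
    • Positive skew (right-tailed): Longer tail on the right, bulk of the data on the left; the mean is usually pulled right of the median
  • Modality: How many modes (peaks) the distribution has (e.g. unimodal, bimodal, etc.)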
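
The sign of the skew can be checked numerically. This sketch assumes the population moment definition of skewness (mean cubed deviation divided by the standard deviation cubed); the `skewness` helper and the data are made up for illustration.

```python
import statistics


def skewness(data) -> float:
    """Population skewness: mean cubed deviation divided by sd cubed."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum((x - mean) ** 3 for x in data) / (len(data) * sd**3)


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]  # hypothetical: long tail to the right
left_tailed = [-x for x in right_tailed]  # mirror image: long tail to the left

pos = skewness(right_tailed)  # positive: tail on the right
neg = skewness(left_tailed)   # negative: tail on the left

# The long tail drags the mean toward it, past the median.
print(statistics.fmean(right_tailed), statistics.median(right_tailed))
print(statistics.fmean(left_tailed), statistics.median(left_tailed))
```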