Covered in 9/12 lecture
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Bayes' rule: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
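
A quick worked illustration of Bayes' rule (posterior = likelihood × prior / evidence), as a minimal Python sketch. The diagnostic-test scenario, the function name `bayes_posterior`, and all numbers are made up for illustration and are not from the lecture.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) given P(A), P(B | A), and P(B | not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% of people have a condition, the test catches
# 95% of true cases, and it gives a false positive 5% of the time.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {posterior:.3f}")  # 0.161
```

Even with a fairly accurate test, the posterior is only about 16% here because the condition is rare to begin with.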
Expected value: $E[X] = \sum_i x_i \, P(X = x_i)$, where $x_i$ are the values that $X$ takes and $P(X = x_i)$ is the probability that $X$ takes the value $x_i$.
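
To make the expected value concrete (each value weighted by its probability, then summed), here is a small sketch for a fair six-sided die. The helper name `expected_value` and the dictionary representation are illustrative choices, not something from the lecture.

```python
from fractions import Fraction

def expected_value(distribution):
    """`distribution` maps each value x to its probability P(X = x)."""
    assert sum(distribution.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in distribution.items())

# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

Using `Fraction` keeps the arithmetic exact, so the result is exactly 7/2 = 3.5 rather than a rounded float.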
Distributions
- A distribution is a statistical function that gives the probability of a given outcome from an experiment
- The underlying distribution is often a continuous curve, but the histogram of a large enough sample converges to it
- Distributions are like histograms
Types:
- Uniform distribution
    - Every outcome is equally likely, so the plot is just a horizontal line
- Normal distribution (Gaussian)
    - Extremely common; often assumed unless there is reason to think otherwise
    - Continuous
- Poisson distribution
    - Probabilities sum to 1; for low rates the mass is bunched up near 0, with a tail to the right
    - Given the average rate at which some event occurs, the Poisson distribution gives the probability of each number of occurrences over a fixed time period
    - Discrete
- Zero-inflated Poisson distribution
    - Poisson distribution with an extra spike at 0
- Bernoulli distribution
    - A single trial with two possible outcomes, where success has a fixed probability
    - Example: Flipping a coin once
- Binomial distribution
    - The probability of getting a given number of successes from a fixed number of independent Bernoulli trials
    - Discrete
    - Looks like a normal distribution (if you have enough trials)
- Power law distribution
    - e.g. number of close friends
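
The discrete distributions above can be written down directly as probability mass functions. The following is a minimal sketch using only the standard library; the function names, the mixture form used for the zero-inflated case, and the example rate of 2 events per period are assumptions for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(exactly k events) when events occur at an average of `rate` per period."""
    return rate**k * math.exp(-rate) / math.factorial(k)

def zero_inflated_poisson_pmf(k, rate, extra_zero_prob):
    """Poisson mixed with an extra point mass at 0 (the 'spike')."""
    base = (1 - extra_zero_prob) * poisson_pmf(k, rate)
    return base + extra_zero_prob if k == 0 else base

def binomial_pmf(k, n, p):
    """P(exactly k successes) in n independent Bernoulli trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 2 events per period on average.
print([round(poisson_pmf(k, 2.0), 3) for k in range(6)])

# A Bernoulli trial (one coin flip) is the n = 1 case of the binomial.
print(binomial_pmf(1, 1, 0.5))  # 0.5
```

Summing any of these functions over all possible `k` gives 1, which is what makes them probability distributions.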
Central Limit Theorem
If you have a distribution, such as the outcomes of rolling a die (uniform distribution), and you repeatedly draw samples from it and build a new distribution from the means of those samples, then that new distribution approaches a normal distribution as the sample size grows.
Useful because if you have two funky-looking distributions, you can sample each of them many times, and the sample means give you two approximately normal distributions that you can compare.
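
The die example can be simulated in a few lines. This is a rough sketch, not a proof: the sample size of 30, the 5,000 repetitions, and the fixed seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(sample_size):
    """Mean of `sample_size` rolls of a fair six-sided die."""
    return statistics.fmean(random.randint(1, 6) for _ in range(sample_size))

sample_size = 30      # rolls per sample
num_samples = 5_000   # how many sample means to collect
means = [sample_mean(sample_size) for _ in range(num_samples)]

# A single roll has mean 3.5 and standard deviation sqrt(35/12), about 1.71.
# The sample means should center near 3.5 with spread near 1.71 / sqrt(30).
print(f"mean of sample means:  {statistics.fmean(means):.3f}")
print(f"stdev of sample means: {statistics.stdev(means):.3f}")
```

Plotting a histogram of `means` should show a bell shape even though a single die roll is uniform.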
Criteria for samples:
- Picked at random
- Representative of population
- Big enough to draw conclusions (>=30)
- Include less than 10% of the population, if you’re sampling without replacement
Summary Statistics
- Examples: Mean, median, mode
- Cannot rely solely on summary statistics, need to first understand your data holistically
Measures of central tendency:
- The Pythagorean means
    - Arithmetic mean - Very sensitive to outliers
    - Geometric mean - A measure of central tendency less sensitive to outliers
    - Harmonic mean - Primarily used for rates (we won’t use it)
- Median
- Mode
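
To see how these measures react to an outlier, here is a small sketch using Python's `statistics` module. The data values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 4, 5]
with_outlier = data + [100]  # one extreme value added

for label, values in [("original", data), ("with outlier", with_outlier)]:
    print(label)
    print("  arithmetic mean:", round(statistics.mean(values), 2))
    print("  geometric mean: ", round(statistics.geometric_mean(values), 2))
    print("  harmonic mean:  ", round(statistics.harmonic_mean(values), 2))
    print("  median:         ", statistics.median(values))
    print("  mode:           ", statistics.mode(values))
```

The arithmetic mean jumps the most when the outlier is added, the geometric mean moves much less, and the median and mode barely change or do not change at all.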
Measures of variance:
- Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- Standard deviation: $\sigma = \sqrt{\sigma^2}$
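
As a worked example, the sketch below computes the variance as the average squared deviation from the mean and the standard deviation as its square root. It assumes the population form (dividing by N); the data values are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(data) / len(data)
# Population variance: average squared deviation from the mean.
variance = sum((x - mu) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

print(mu, variance, std_dev)  # 5.0 4.0 2.0

# The standard library agrees; pvariance/pstdev divide by N,
# while variance/stdev divide by N - 1 (the sample versions).
assert variance == statistics.pvariance(data)
```

The standard deviation is in the same units as the data, which usually makes it easier to interpret than the variance.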
Other descriptors:
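- Skew (left- or right-tailed): Whether the distribution has a longer tail on the left or on the right
    - Negative skew: Left-tailed; the mean is pulled to the left, typically below the median
    - Positive skew: Right-tailed; the mean is pulled to the right, typically above the median
- Modality: How many modes (peaks) the distribution has (e.g. unimodal, bimodal, etc.)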
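
The sign of the skew can be checked numerically. The sketch below uses one common definition (the average cubed z-score, population form); the helper name `skewness` and the data are assumptions for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail to the right
left_tailed = [-x for x in right_tailed]   # mirror image: tail to the left

for name, values in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    print(name, "skew:", round(skewness(values), 2),
          "mean:", statistics.fmean(values), "median:", statistics.median(values))
```

In the right-tailed data the skew is positive and the mean sits above the median; mirroring the data flips both.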