- Introduction:
- Potential concepts:
- Data Exploration
- Data Visualization
- Skills expected
- You know the name and purpose of each step we’ve covered so far
- Experiment Design:
- Concepts:
- The role of optimization criteria, proxy measures, and Goodhart’s law
- Types of experiments and all associated vocabulary: Observational studies (Cross-sectional, Retrospective, Prospective), Experiments, Synthetic data
- Skills expected
- Understand potential issues with experimental design, including sampling, as well as other potential sources of error. You should be comfortable designing an experiment or pointing out flaws in an experiment you are shown.
- Understand the general issues with surveys and self-reported data (Max only)
- Understand what blinding and placebo are for
- Pandas
- Concepts:
- Pandas: Understand what a dataframe and series are, and what pandas is generally used for.
- Skills expected
- Be able to write pandas statements of similar difficulty to the easier parts of the homework. You will not be provided with any sort of pandas cheat sheet.
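As a rough sketch of the kind of pandas statement meant here: the data and column names below are invented for illustration and are not taken from the homework.

```python
import pandas as pd

# Hypothetical data, made up for this example only
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "major": ["CS", "Math", "CS", "Bio"],
    "score": [88, 92, 75, 81],
})

# Selecting one column of a DataFrame gives a Series
scores = df["score"]

# Boolean filtering: keep rows where score is at least 80
high = df[df["score"] >= 80]

# Group by a column, then aggregate another column
mean_by_major = df.groupby("major")["score"].mean()
```

If you can read each of these statements and say what it returns (a Series, a filtered DataFrame, a Series indexed by major), you are probably at about the right level.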
- Data Types
- Concepts
- Data definition
- Types of Data and their subtypes
- Data Formats - TSV, CSV, JSON, HTML, XML
- Images: GIF, lossless and lossy compression
- Databases and what they’re for
- Skills expected
- Be able to describe or identify any of the listed filetypes and their purposes.
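To make the differences between the text formats concrete, here is a small sketch that holds the same two made-up records as CSV, TSV, and JSON. The field names and values are hypothetical.

```python
import csv
import io
import json

# The same hypothetical records in CSV (comma-separated) and TSV (tab-separated)
csv_text = "name,age\nAna,30\nBen,25\n"
tsv_text = "name\tage\nAna\t30\nBen\t25\n"

# DictReader uses the header row as keys; every value is read as a string
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# JSON stores the same records as a list of objects and can nest data
json_text = json.dumps(csv_rows)
json_rows = json.loads(json_text)
```

CSV and TSV differ only in the delimiter and are flat tables, while JSON (like XML and HTML) can represent nested structure.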
- Probability
- Concepts
- Bayes Rule
- Conditional Probabilities
- Expected Value
- All types of distributions covered: Uniform, Gaussian, Poisson, Zero-Inflated Poisson, Bernoulli, Binomial
- Central Limit Theorem
- Summary Statistics
- Measures of Central Tendency: Mean, median, mode (including the equation for the arithmetic mean)
- Measures of Spread: Standard deviation, variance (you should know what these are, how they work, and what they represent)
- Other Descriptors - Skew, Modality
- Skills expected
- Understand distributions conceptually
- Probability and statistics questions akin to those on the homework
- Name and identify distributions based on descriptions
- Understand and apply the central limit theorem
- Understand how mean, median and mode relate to each other, and how to use them to learn about data
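One way to check your intuition about how mean, median, and mode relate is to compute them on a tiny dataset. The numbers below are made up for illustration, and the sketch uses the population forms of variance and standard deviation.

```python
import statistics

# Hypothetical dataset with one large value pulling the mean upward
data = [2, 3, 3, 5, 12]

mean = statistics.mean(data)          # sum of values divided by the count
median = statistics.median(data)      # middle value once sorted
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # mean squared deviation from the mean
std_dev = statistics.pstdev(data)     # square root of the variance

# A mean noticeably above the median hints at right (positive) skew
right_skew_hint = mean > median
```

Here the mean is 5 while the median and mode are both 3, which is the pattern you would expect when a few large values stretch the right tail.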
- Hypothesis Testing
- Concepts
- Hypothesis Testing
- Null and Alternative Hypothesis
- Statistical Significance
- P-Values
- Type I and Type II Errors
- Different Types of Statistical Tests and when to apply them
- Z-Test
- T-Test (one-sample, two-sample, paired)
- Chi-Square Test
- ANOVA Test
- Skills expected
- Given a scenario, identify which hypothesis test applies and state its null and alternative hypotheses
- Understand what a p-value is and what it represents
- Understand why hypothesis tests are important and what they do
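As a minimal sketch of how the pieces fit together, here is a two-sided one-sample z-test on a hypothetical scenario. All numbers are invented, and the 0.05 significance level is an assumed convention rather than something this guide specifies.

```python
import math
from statistics import NormalDist

# Hypothetical scenario: the population mean is claimed to be 100 with a
# known standard deviation of 15; a sample of 36 has a mean of 105.
# H0: the true mean is 100.  H1: the true mean is not 100.
mu_0 = 100
sigma = 15
n = 36
sample_mean = 105

# z statistic: distance from the null mean in units of standard error
standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Two-sided p-value: probability of a result at least this extreme if H0 is true
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
```

The p-value is not the probability that the null hypothesis is true; it is the probability of seeing data this extreme assuming the null is true.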
- Data Visualization
- Concepts
- What Type of Visualization to Use?
- Comparison
- Correlation
- Part-To-Whole
- Data Over Time
- Distribution
- Design Principles
- Skills expected
- Given a scenario, identify the best type of visualization to use
- Be able to read, understand, and draw graphs
- Data Exploration
- Concepts
- What is exploratory data analysis (EDA)
- Pearson’s correlation coefficient and what it means
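To see what Pearson's correlation coefficient measures, the sketch below computes it directly from its definition. The function name `pearson_r` and the data are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data with a positive but imperfect linear relationship
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# A perfectly linear increasing relationship gives r = 1
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship, although a nonlinear one may still exist.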
- Data Cleaning
- Concepts
- Duplicated records
- Evolving data practices
- Outlier detection, z-score
- Data missing at random / Data missing not at random
- Different types of imputation
- Boundary conditions
- Data errors and what causes them
- How to detect corrupt or incorrect data
- Skills expected
- Be able to reason about missing data
- Know when to replace outliers and when not to
- Be able to pick the correct imputation type for the situation
- Be able to identify data errors from graphs
- Know what types of information people might lie about
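For the z-score approach to outlier detection listed above, a minimal sketch might look like this. The data are invented, and the cutoff of 3 standard deviations is a common convention assumed here, not a rule from this guide.

```python
import statistics

# Hypothetical measurements with one suspicious value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 95]

mean = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / std_dev for x in data]

threshold = 3  # assumed cutoff
outliers = [x for x, z in zip(data, z_scores) if abs(z) > threshold]
```

Note that the outlier itself inflates the mean and standard deviation, so in very small samples no value can reach a z-score of 3. Flagging a value is also only the first step: whether to replace, keep, or drop it depends on whether it looks like a data error or a real observation.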