Covered in lecture on 9/28 and 10/3
Common Errors
- If data is in the wrong format (see the sketch after this list):
    - Can sometimes use `df["column"].astype(<some type>)`
    - Otherwise, can use `df["column"].apply(...)`
- If labels change midway through the data, can do one of these:
- Divide the dataset in two
- Infer what the old ratings map to under the new labels
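A minimal sketch of both conversion approaches, using a made-up DataFrame with a numeric column stored as strings and a price column that needs cleanup before casting:

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2021", "2022", "2023"],     # numbers stored as strings
    "price": ["$1,200", "$350", "$42"],   # needs cleanup before casting
})

# astype() works when the values are already parseable as the target type
df["year"] = df["year"].astype(int)

# apply() handles cases that need custom logic first
df["price"] = df["price"].apply(lambda s: int(s.replace("$", "").replace(",", "")))

print(df.dtypes)  # both columns are now int64
```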
Duplicates
⚠️ Warning: Always check for duplicates
- Can detect duplicates by comparing `len(df)` and `len(df.drop_duplicates())` (see the sketch after this list)
- If exact duplicates, do `df.drop_duplicates()`
- If some of them are subtly different, identify the ones with the correct values and drop the rest
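A quick sketch of the checks above, on a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 2, 3], "score": [10, 20, 20, 25, 30]})

# Compare lengths to detect exact duplicates
n_dupes = len(df) - len(df.drop_duplicates())
print(f"{n_dupes} exact duplicate row(s)")

# Drop exact duplicates (keeps the first occurrence by default)
df = df.drop_duplicates()

# For subtle duplicates, inspect rows that share a key but differ elsewhere,
# then keep the correct version manually
suspects = df[df.duplicated(subset=["id"], keep=False)]
print(suspects)  # the two id=2 rows with scores 20 and 25
```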
Outlier Detection
You can use a box and whiskers chart to spot outliers
Use the z-score for detecting outliers:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the observed value, $\mu$ is the mean of the sample, and $\sigma$ is the standard deviation of the sample
The z-score of a point is basically how many standard deviations it’s away from the mean
Make sure you understand why your outliers are outliers
Once you have an outlier, you can keep it, delete it, or impute it (fill it in with something else, as if it’s a missing value)
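A minimal z-score sketch in pandas. The cutoff is a judgment call: $|z| > 3$ is a common choice for large samples, but a toy sample this small needs a lower one:

```python
import pandas as pd

df = pd.DataFrame({"value": [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.1, 55.0]})

# z-score: how many standard deviations each point is from the mean
z = (df["value"] - df["value"].mean()) / df["value"].std()

# Flag points beyond the cutoff; with this toy data the 55.0 row gets flagged
outliers = df[z.abs() > 2]
print(outliers)
```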
Missing Data
Check for missing values: `sum(df["column"].isna())` will tell you how many missing values are in a series
Data could be missing at random or not
Data missing at random
- No pattern
- If data is categorical, then you could just make “Missing” a category
- If data is numerical:
- If it's a very small portion of your data, can just drop it (`df.dropna()`)
    - ⚠️ Can only do this if the data is missing at random
- Otherwise, can use imputation (see the sketch after this list):
- Mean imputation: fill in missing values with mean
- Median imputation: fill in missing values with median
- Would use this over mean imputation if the data is skewed or has outliers, since the median is more robust
- Mode imputation: fill in missing values with most common value
- Can use this when data is categorical
- Hot-deck imputation: find row most similar to the row with a missing value, then copy the value from that row
- Can use this when data is categorical
- Bayesian imputation: fill in with most likely value (probably won’t have to use this? Max never has)
- Time series need to be handled specially (e.g., interpolating from neighboring points rather than filling with a global mean)
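A sketch of the simpler strategies above, on a toy DataFrame with one numerical and one categorical column:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, None, 165.0, 180.0],   # numerical
    "color":  ["red", "blue", None, "red"],  # categorical
})

# Drop rows with missing values -- only safe if the data is missing
# at random and the missing portion is small
dropped = df.dropna()

# Mean imputation (numerical)
df["height"] = df["height"].fillna(df["height"].mean())
# Median imputation would be df["height"].fillna(df["height"].median())

# Mode imputation (categorical: fill with the most common value)
df["color"] = df["color"].fillna(df["color"].mode()[0])
# Or just make "Missing" its own category:
# df["color"] = df["color"].fillna("Missing")
```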
Data not missing at random
Possible ways for data to be missing not at random:
- One particular value is missing
- One particular range of rows missing
- If you lost rows in the middle of your data, it might still be possible to do imputation
- Boundary conditions: e.g. birds flew out of sensor range or Geiger counter only goes up to a certain value
- ⚠️ Cannot do imputations with boundary conditions
- Possible ways to deal with boundary conditions (see the sketch after this list):
- Drop everything outside the boundary and only work with data within it
- Extrapolate the same distribution outside the boundary
- Get more data
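A hedged sketch of spotting and handling a boundary condition, assuming a hypothetical sensor that saturates at 100 (the column name and limit are made up):

```python
import pandas as pd

# Hypothetical readings: the instrument maxes out at 100, so anything
# above that gets recorded as exactly 100
df = pd.DataFrame({"reading": [12, 47, 88, 100, 100, 63, 100, 100]})
BOUNDARY = 100  # assumed known limit of the instrument

# A pile-up at the boundary suggests censoring, not a true mode
share_at_boundary = (df["reading"] == BOUNDARY).mean()
print(f"{share_at_boundary:.0%} of readings sit exactly at the boundary")

# Option 1 from above: only work with data inside the boundary
within = df[df["reading"] < BOUNDARY]
```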
Incorrect Data
Types of incorrect data:
- People lied to you
- Faulty instrumentation
- To repair, you can look at past sensor data (or a similar sensor) and adjust distribution to be closer to the past one
- Only works if you understand what sort of error you have
- You’ve been recording the wrong metrics
- Two entries that should be identical but have different values
- Illegal values
- Unclear default values
Detecting Incorrect Data
- If there’s a discontinuity, it might be because people round when self-reporting
- Max called it an attractor
- Takes advanced statistics to normalize: you can “smooth out” the spike by spreading it back across the surrounding curve
- e.g. someone who’s 5’11 or 6’1 might say they’re 6 feet because it’s nice and round
- e.g. people underreporting their income so they can get a scholarship, resulting in a spike just below the scholarship threshold
- Modes that don’t make sense (e.g. latitude/longitude 0, 0)
- Data outside valid bounds (e.g. someone playing a video game for a million hours)
- Signs of boundary conditions: your data has a discontinuity at a certain value, after which there is nothing
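A small sketch of two of these checks: looking for attractor spikes in the value counts, and flagging values outside valid bounds (the data and bounds are made up for illustration):

```python
import pandas as pd

# Hypothetical self-reported heights in inches: note the spike at 72 (6 feet)
heights = pd.Series([71, 72, 72, 72, 72, 69, 73, 68, 72, 70])
print(heights.value_counts().sort_index())  # a spike at a round value is suspicious

# Range check for impossible values (bounds are assumptions for illustration)
hours_played = pd.Series([120, 4, 1_000_000, 37])
invalid = hours_played[(hours_played < 0) | (hours_played > 100_000)]
print(invalid)  # the million-hour row
```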