Feature Engineering
Topics:
- One Hot Encoding
- Normalization:
- Put data into range
[0, 1]
- Put data into range
- Standardization:
- Each data point becomes the number of standard deviations away from the mean it was (z-score)
Example questions:
- Given some data, one hot encode it.
pd.get_dummies
- What sort of categorical columns might cause problems if you one-hot encoded them?
- Is it bad to one-hot encode ordinal data?
- Yes, because then it’s treated as categorical data
- Does the base matter when you do a log transform?
- No
- Normalize a few datapoints.
Supervised Learning
- Make sure you deeply understand accuracy, precision and recall.
- What happens if you flip precision and recall to be about the negative class, is that valid?
- Can you come up with three examples of when you’d use precision and recall?
- Precision:
- If you’re a (good) judge, you want to make sure everyone you convict is actually a bad person, even if you miss some bad people
- Loans
- Recall:
- If you’re a dictator executing people, you want to make sure all your enemies are executed, even if you execute a few innocent people
- Fraud alerts
- Precision:
- Do you ever care about both?
- Log Loss was mentioned briefly, but it would be fun to create a question that plays with the model’s confidence in its predictions, the thing log loss cares about.
- When might 80% accuracy not be meaningful? If your data is super heavy with one particular class, how might you better see what’s going on with the model?
- Class imbalance would make accuracy less useful
- Confusion matrix can help look at results. Also precision, recall
- Make sure you understand the difference between train test split and the different forms of cross validation. HW5 is a good guide here, since you got to practice a few different examples. These concepts are all a little similar, so make sure there’s no mental clashing between them. Would we ever do 2 fold cross validation? Why or why not?
- Over and underfitting are really important concepts. Try creating a dataset and designing and overfit tree and an underfit one. Understand the signs of over and underfitting, and how you might fix them.
- What do fit() and predict() do?
KNN
- What happens as k gets bigger or smaller? How does that relate to underfitting and overfitting?
- Draw a dataset and try classifying a point with k=3.
- When does spherical KNN do better than the normal algorithm?
- When there’s lots of data points, since searching all points is time-consuming
- When will a decision tree outperform KNN? When will KNN outperform a decision tree?
Decision Trees
- Try generating some data and drawing the decision tree it creates, doing the splits out at each level.
- Can you come up with the formula for info gain from scratch? Try writing out the algorithm and generating it without looking.
- Draw a strangely shaped decision tree. What would it say about your data?
- What does half a bit mean when it comes to entropy?
Unsupervised Learning
K-Means Clustering
- What happens as k goes up? What about down?
- What is the loss function? You should at the very least be able to describe it via words; worst case you should be able to construct an equation for it.
- Draw a bunch of points and some centroids, and mentally run through a few different iterations of your algorithm.
- How do you pick the best K? What is that one graph in the slides, and why are we looking for the minima?
- What is the point of doing k-means in the first place? When would you use agglomerative clustering instead of k-means?
PCA
- When do we use PCA, and why?
- When is PCA lossless and when is it lossy?
- When two or more columns are correlated, PCA could be lossless
- Otherwise, it will be lossy
- Draw some data with high covariance, and draw the first principle component.
- Draw some data with low covariance.
Regression
- What does mean? What’s the equation?
- Can ever be negative? What does it mean if it is?
- If is negative, then you’re doing worse than just predicting the mean for every data point, and something’s gone horribly wrong
- What does it mean if you have a situation where no single line can accurately classify your data?
- Maybe linear regression is the wrong choice, try other kinds of regression?
- Try finding a dataset and running your own regression on it.
Neural Networks
- Generate a small neural network and show the output for a small input.
- Generate a network that computes xor of its inputs.
- (can be done easily using ReLU)
- Be able to define neuron, activation value, activation function, sigmoid (no need to even look at the equation on this one), and other terms of that generality.
- Neuron: A computational unit; a node in the graph of the network
- Activation value: Represents how activated a particular neuron is
- Activation function: Calculates output of neuron
- Sigmoid function:
- An activation function with outputs in the range
[0, 1]
- Looks kinda like a very distorted S
- An activation function with outputs in the range
- Review how perceptrons work and how they classify things