Testing and Training

Generally split the data randomly into

  • 80% for training
  • 20% for testing
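
A minimal sketch of a random 80/20 split, assuming a NumPy feature matrix X and label vector y (the names and data here are made up for illustration); scikit-learn's train_test_split does the same thing in one call:

```python
import numpy as np

# Hypothetical data, just for illustration.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Shuffle the indices so the split is random rather than dependent on data order.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))

split = int(0.8 * len(X))                      # 80% for training
train_idx, test_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]  # training set
X_test, y_test = X[test_idx], y[test_idx]      # testing set
```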

K-Fold Cross Validation

Warning

Don’t use this if you only have a little data

  • Cut the data into chunks
  • Make splits. In each, use one chunk for testing and the other chunks for training
  • Train on those and report the average test accuracy across the folds (see the sketch below)
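
A sketch of the procedure using scikit-learn's KFold (the decision tree and the iris dataset are arbitrary stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # cut the data into 5 chunks
accuracies = []

for train_idx, test_idx in kf.split(X):
    # In each split, one chunk is held out for testing and the rest is used for training.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print(sum(accuracies) / len(accuracies))               # average test accuracy across folds
```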

Leave One Out Cross Validation

Warning

This is computationally expensive, so prefer K-Fold Cross Validation

  • Only use when every ounce of training data counts
  • Will let you see exactly where your algorithm falls down
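
A minimal sketch with scikit-learn's LeaveOneOut; it trains one model per example, which is why it gets so expensive (the dataset and classifier are arbitrary stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

misses = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train on everything except one held-out example.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    if model.predict(X[test_idx])[0] != y[test_idx][0]:
        misses.append(test_idx[0])          # record which examples the model got wrong

print(1 - len(misses) / len(X))             # accuracy
print(misses)                               # exactly where the algorithm falls down
```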

Validation Sets

One possible problem is that you, the person tuning the model, overfit instead of the model itself overfitting

  • You might tune the hyperparameters to fit the test data too well
  • Now information about the test set has leaked back into the algorithm

Solution: Split the data into training, validation, and test sets

Validation data is used to tune your model's hyperparameters; the test data is used only at the very end to evaluate accuracy

Most people don’t actually do this because it’s too much work
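
A sketch of the three-way split using two calls to scikit-learn's train_test_split (the 60/20/20 proportions are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data, just for illustration.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Carve off the test set first; it is only touched at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split what remains into training and validation sets (0.25 of the remaining 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune hyperparameters against (X_val, y_val); report final accuracy on (X_test, y_test).
```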

Training Failures

Main reasons for model not working:

  • Overfitting
    • Unable to generalize
    • Keys in on patterns in the training data that don’t actually exist in the real world
    • Can think of a decision tree with way too many nodes/rules
    • Kinda like superstitions
  • Underfitting
    • Too general
    • Can think of a decision tree with only a couple levels
  • Lack of model power
    • Some patterns are just too complicated for, say, decision trees
  • Lack of signal in data
    • It really is impossible to learn anything from the data

Combating Training Failures

Overfitting

Decision trees have a really bad overfitting problem; we can deal with it by:

  • Applying a depth limit where we chop everything off after 4 levels
    • Depth limit is a hyperparameter for decision trees
  • Trying different depth limits and seeing how they affect our accuracy on the test set

Play around with a bunch of hyperparameters and see what works best for the test set
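
A sketch of that sweep for a decision tree's depth limit, scoring each setting on held-out data (per the validation-set section above, that held-out data would ideally be a validation set; the dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_held, y_train, y_held = train_test_split(X, y, test_size=0.2, random_state=0)

# Try different depth limits and see how each one affects held-out accuracy.
for depth in [2, 4, 8, 16, None]:            # None = no limit, the overfitting-prone default
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(depth, model.score(X_held, y_held))
```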

Evaluation

Accuracy

The easiest way to see how a model is doing is accuracy: the fraction of predictions it gets right
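
A minimal sketch of the computation:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```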

Class Imbalance

But class imbalance is a problem

Happens when an overwhelming majority of your data is a single class

Examples of class imbalance:

  • Fraud - Most transactions aren’t fraudulent
  • Disease - Most people won’t have the disease
  • Product - Most people won’t buy any given product

It’s very common to have a rare positive signal surrounded by negatives
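
A toy illustration of why accuracy misleads here: a "model" that always predicts the majority class scores 99% on data where only 1% of examples are positive, while missing every positive case (the numbers are made up):

```python
# 1000 transactions, only 10 of which are fraudulent.
y_true = [1] * 10 + [0] * 990

# A useless model that predicts "not fraud" every single time.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)   # 0.99 -- looks great, yet every fraudulent transaction was missed
```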

Confusion Matrix

Shows you true positives, false positives, true negatives, and false negatives

                        Positive             Negative
Classified Positive     True positive %      False positive %
Classified Negative     False negative %     True negative %
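
A sketch of counting the four cells by hand (scikit-learn's confusion_matrix gives the same counts):

```python
def confusion_counts(y_true, y_pred):
    # Returns (true positives, false positives, false negatives, true negatives).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 2)
```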

Precision

Precision is how many of the things that we classified as positive were actually positive

Use precision when we only care about being correct about the things we identify as positive, e.g., Google doesn’t care if it turns away 1000 good engineers; it only cares that the ones it does hire are good
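
In terms of the confusion-matrix counts above, precision = TP / (TP + FP); a quick sketch with made-up hiring numbers:

```python
def precision(tp, fp):
    # Of everything classified positive, the fraction that really was positive.
    return tp / (tp + fp)

# Hypothetical numbers: 90 good hires (true positives), 10 bad hires (false positives).
print(precision(90, 10))   # 0.9 -- the rejected good candidates don't affect this at all
```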

Recall

Recall is how many of our positive class we didn’t miss

Use recall when we want to make sure we don’t miss anything, e.g., identifying people with disease
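
In the same terms, recall = TP / (TP + FN); a quick sketch with made-up screening numbers:

```python
def recall(tp, fn):
    # Of all the actual positives, the fraction we caught.
    return tp / (tp + fn)

# Hypothetical numbers: 45 sick patients flagged, 5 sick patients missed.
print(recall(45, 5))   # 0.9 -- every missed case drags this down
```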

Confidence

Most ML algorithms can tell you how confident they are in an answer

Depending on use case, you may want to take the confidence into account, e.g., launching nukes

Log loss: A measure of accuracy that heavily penalizes confident wrong answers
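
A sketch of log loss for a single binary prediction; a confident wrong answer costs far more than an unsure one (the probabilities are made up):

```python
import math

def log_loss_single(y_true, p_pred):
    # Negative log-likelihood of the true label under the predicted probability.
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

print(log_loss_single(1, 0.6))    # ~0.51  right answer, modest confidence
print(log_loss_single(1, 0.99))   # ~0.01  confident and correct: tiny loss
print(log_loss_single(1, 0.01))   # ~4.61  confident and wrong: huge penalty
```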

F1 score

Harmonic mean of precision and recall, used when both are important
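
A quick sketch of the harmonic mean:

```python
def f1(precision, recall):
    # Harmonic mean: low if either precision or recall is low.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.5))   # ~0.64 -- dragged toward the weaker of the two
```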

Warning

Max thinks this is an awful metric