Big problem for classifiers

Happens when an overwhelming majority of your data is a single class

Examples of class imbalance:

  • Fraud - Most transactions aren’t fraudulent
  • Disease - Most people won’t have the disease
  • Product - Most people won’t buy any given product

It’s very common to have a rare positive signal surrounded by negatives

Options for dealing with class imbalance:

  • Use class_weight parameter in sklearn
    • Weights each class inversely to its frequency (e.g. class_weight='balanced'), so errors on the rare class are penalized more heavily during training
  • Oversample the minority class - create multiple synthetic examples of the rare positive class with slight variations (e.g. SMOTE)
  • Undersample the majority class - remove extra majority-class rows so both classes appear in roughly equal numbers in the training set
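
The class_weight option above can be sketched like this - a minimal example, assuming a synthetic ~95/5 imbalanced dataset built with make_classification (the dataset and model choice here are illustrative, not from the notes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Hypothetical imbalanced dataset: ~95% negatives, ~5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Plain model: treats every example equally, tends to ignore the rare class
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# class_weight='balanced' reweights each class inversely to its frequency,
# so mistakes on the rare positive class cost more during training
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(
    X_train, y_train
)

print("recall (plain):   ", recall_score(y_test, plain.predict(X_test)))
print("recall (weighted):", recall_score(y_test, weighted.predict(X_test)))
```

The weighted model typically flags many more of the rare positives (higher recall), usually at the cost of some extra false positives - which trade-off you want depends on whether missing a fraud/disease case is worse than a false alarm.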