Very simple, not very good, outdated (but a building block of one of our strongest algorithms: Random Forests)
Warning
Don’t actually use decision trees on their own; they’re only good for demonstration purposes.
Use decision trees when you only want:
- A pretty picture to show to someone
- A general sense of your data that you can look at
Training
Smaller is better:
- Fewer rules: more generalization
- Many rules: more specificity
We want to use as few rules as possible to classify our data
- Deep trees are bad (deep = lots of levels, rules)
- Bushy trees are fine (bushy = lots of splits at a single level)
Entropy
See sub-page: Entropy
Algorithm for picking a split
- We want to get as close as possible to cutting the data in half with every split, to get the smallest possible tree
- So we want the splits with the most difference in information before and after the split
- If you had 1 bit of entropy before the split and have 0.97 bits of entropy after the split, your information gain is only 0.03
- Ideally you want an information gain of 1 (start with 1 bit of entropy, end with 0 bits after the split)
- A split can produce children of different sizes and entropies, e.g. you go from 1 bit of entropy before the split to 0 bits for one child node but 0.99 bits for the other
- So information gain uses the weighted sum of the children's entropies, weighted by the fraction of examples that land in each branch (worked through in the sketch below)
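A quick numeric sketch of that weighted gain, using the illustrative numbers from the bullets above (the 50/50 split into one pure child and one 0.99-bit child is an assumption, not real data):

```python
# Quick check of the example numbers above: a split that sends half the
# examples to a pure child (0 bits) and half to a 0.99-bit child.
parent_entropy = 1.0                      # 1 bit before the split

children = [                              # (examples in child, child entropy in bits)
    (50, 0.0),                            # assumed child sizes; purely illustrative
    (50, 0.99),
]

total = sum(n for n, _ in children)
weighted_entropy = sum((n / total) * h for n, h in children)

print(parent_entropy - weighted_entropy)  # information gain: 1.0 - 0.495 = 0.505
```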
Use a greedy heuristic:
- Start with all the data and no split
- For each column you can split on, compute the information gain from splitting on that column (using the weighted child entropies)
- Split on the column with the highest information gain
- Repeat on each child node until it has no entropy left (all elements in the node have the same class) — sketched below
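A minimal sketch of the greedy step in Python, assuming the data is a list of plain dicts with a separate list of labels; the `entropy`, `information_gain`, and `best_split` helpers and the toy columns are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, column):
    """Parent entropy minus the weighted entropy of the children
    produced by splitting on `column` (one child per distinct value)."""
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[column], []).append(label)
    weighted = sum(len(child) / len(labels) * entropy(child)
                   for child in children.values())
    return entropy(labels) - weighted

def best_split(rows, labels, columns):
    """Greedy step: pick the column with the highest information gain."""
    return max(columns, key=lambda col: information_gain(rows, labels, col))

# Tiny made-up dataset: two candidate columns, four examples.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(best_split(rows, labels, ["outlook", "windy"]))  # -> "outlook" (gain 1.0 vs 0.0)
```

A full trainer would recurse: split the rows by the winning column's values and call `best_split` again on each child until its labels are pure.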
Math for the algorithm
Weighted entropy after splitting $S$ into children $S_1, \dots, S_k$:
$$H_{\text{split}} = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, H(S_i)$$
Information gain is the drop in entropy from the split:
$$IG = H(S) - H_{\text{split}} = H(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, H(S_i)$$
The Classification Algorithm
Steps to use a decision tree once you’ve trained it:
- Get a piece of data with no label
- Start at the top of the tree
- At each node, follow the branch that matches the example's value for that node's feature
- When you reach a leaf, output that leaf's class
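A minimal sketch of that walk, assuming the trained tree is stored as nested dicts (internal nodes hold a feature name and a `children` map; leaves are class labels). That layout is an assumption, not a standard format:

```python
def classify(tree, example):
    """Walk a trained tree from the root to a leaf for one unlabeled example.
    Internal nodes are assumed to look like
        {"feature": <column name>, "children": {<value>: <subtree>, ...}}
    and leaves are just the predicted class label."""
    node = tree
    while isinstance(node, dict):            # keep descending until we hit a leaf
        value = example[node["feature"]]     # look up this node's feature in the example
        node = node["children"][value]       # follow the matching branch
    return node                              # a leaf: the predicted class

# Hand-built toy tree: split on "outlook", then on "windy" under "sunny".
tree = {
    "feature": "outlook",
    "children": {
        "rainy": "no",
        "sunny": {"feature": "windy",
                  "children": {True: "no", False: "yes"}},
    },
}

print(classify(tree, {"outlook": "sunny", "windy": False}))  # -> "yes"
```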