Term Frequency: How many times a word appears in a specific document
Document Frequency: How many documents a word appears in
Motivation: We want to downweight words with a high document frequency, because words that appear in almost every document do little to distinguish one document from another
- For topic modeling, if only two books have the word “dragon,” that’s probably an important word for determining the genre
- On the other hand, every book has the word “they,” so it doesn’t give you much information about the book’s genre
So we weight words using TF-IDF:
- TF-IDF is term frequency times inverse document frequency
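- IDF (inverse document frequency) is the logarithm of 1/(document frequency), with document frequency taken as the fraction of documents that contain the word; with raw counts this is log(N/df), where N is the total number of documents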
- Logarithm because it compresses differences at the high end: we don’t care whether a word appears in 100 documents or 110 documents, both are bad, so both should get nearly the same low weight
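
The weighting above can be sketched in a few lines of Python. This is only an illustration under some assumptions that the notes do not specify: term frequency is a raw count, IDF is the natural log of N/df (N = number of documents), and documents are already split into words. The names `tf_idf` and `books` are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Assumed IDF form: log(N / df). It is 0 for a word found in every document.
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency: raw count within this document
        weights.append({word: tf[word] * idf[word] for word in tf})
    return weights

# Toy corpus (hypothetical): "they" is in every book, "dragon" in only one.
books = [
    "they saw the dragon and the dragon saw them".split(),
    "they went to the market".split(),
    "they sailed to the island".split(),
]
scores = tf_idf(books)
```

Here `scores[0]["they"]` comes out as 0 because "they" appears in all three books, while `scores[0]["dragon"]` is 2 * log(3) because "dragon" appears twice in the first book and in no other. Library implementations often add smoothing or normalization on top of this basic form, so their numbers will differ.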