One way to turn text into vectors using one-hot encoding

Warning

This is a bad approach

Max encoded whole sentences as vectors, not individual words: a sentence's vector is the sum of the one-hot vectors of its words

  • If a word appears twice, put a 2 in its column rather than a 1

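Produces very sparse vectors: there is one column per vocabulary word, but any one sentence uses only a few of those words, so almost every entry is 0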
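
A minimal sketch of the encoding described above, in plain Python. The helper names (build_vocab, bag_of_words), the lowercase whitespace tokenization, and the third example sentence are assumptions made for illustration, not something from the lecture.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each distinct lowercase word to a column index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(sentence, vocab):
    """Return a count vector: entry i is how many times vocab word i occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored (an assumption)
            vec[vocab[word]] = n
    return vec

# Hypothetical toy corpus; the third sentence repeats a word on purpose.
sentences = ["School has homework", "Homework has school", "School is school"]
vocab = build_vocab(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]
```

Here the columns are school, has, homework, is. The first two sentences map to the same vector because they contain the same words, and "School is school" gets a 2 in the school column. With a realistic vocabulary of many thousands of words, nearly every column would be 0 for any given sentence.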

Note

Max also refers to this as the “unigram” approach. Not sure whether “unigram” will refer to bag-of-words specifically on the exam.

Problem: Bag of words loses information about the order of words within a sentence

  • Partial solution: Use bigrams (pairs of adjacent words) as the columns to keep some of the order
  • Now “School has homework” and “Homework has school” are no longer the same vector
  • But now dimensionality is way bigger: with a vocabulary of V words there can be up to V^2 bigram columns
    • Some people use trigrams (up to V^3 columns), but hardly anyone goes past that because too much memory is required
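
The bigram fix can be sketched the same way. The function names (ngrams, ngram_vectors) and the lowercase whitespace tokenization are assumptions for illustration; the two sentences are the ones from the notes.

```python
from collections import Counter

def ngrams(sentence, n):
    """Return the sentence's n-grams as tuples of n adjacent lowercase words."""
    words = sentence.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_vectors(sentences, n):
    """Build an n-gram vocabulary from the sentences and one count vector per sentence."""
    vocab = {}
    for s in sentences:
        for gram in ngrams(s, n):
            vocab.setdefault(gram, len(vocab))
    vectors = []
    for s in sentences:
        counts = Counter(ngrams(s, n))
        # dicts keep insertion order, so iterating vocab follows column order
        vectors.append([counts.get(gram, 0) for gram in vocab])
    return vocab, vectors

sentences = ["School has homework", "Homework has school"]
uni_vocab, uni_vecs = ngram_vectors(sentences, 1)  # unigram = plain bag of words
bi_vocab, bi_vecs = ngram_vectors(sentences, 2)    # bigram columns
```

With n = 1 the two sentences share one vector over 3 columns. With n = 2 they get different vectors, but already over 4 columns for this tiny corpus, and the number of possible columns grows as a power of the vocabulary size as n increases.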