What is bag of words and how is it used in text classification?

Bag of Words is a representation of text that describes the occurrence of words within a document. The order or structure of the words is not considered. For text classification, we look at the histogram of the words within the text and consider each word count as a feature.
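The idea above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a naive lowercase-and-split tokenizer; the helper names (tokenize, build_vocabulary, bag_of_words) and the two sample sentences are made up for the example and do not come from any particular library.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

def build_vocabulary(documents):
    # Sorted list of every distinct token seen across the documents.
    return sorted({token for doc in documents for token in tokenize(doc)})

def bag_of_words(text, vocabulary):
    # One count per vocabulary word, in vocabulary order; word order is discarded.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row of vectors is the feature vector a classifier would receive for the corresponding document.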

Advantages:

  1. Simple to understand and implement.

Disadvantages:

  1. The vocabulary requires careful design, particularly to manage its size, which determines the sparsity of the document representations.
  2. Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons (the model has to work with very little signal spread across a very large feature space).
  3. Discarding word order ignores the context, and in turn the meaning, of words in the document. Context and meaning, if modeled, would let the model tell apart the same words arranged differently (“this is interesting” vs “is this interesting”) and recognize synonyms (“old bike” vs “used bike”).
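The third disadvantage is easy to check directly. The following is a small sketch, again assuming a naive whitespace tokenizer; word_counts is a hypothetical helper written for this example.

```python
from collections import Counter

def word_counts(text):
    # Bag of words as a Counter; naive lowercase whitespace tokenization (an assumption).
    return Counter(text.lower().split())

statement = word_counts("this is interesting")
question = word_counts("is this interesting")

# Same words, different order: the two bags are indistinguishable.
same_bag = statement == question
print(same_bag)  # True

# Synonyms look like unrelated features: only "bike" is shared.
shared = set(word_counts("old bike")) & set(word_counts("used bike"))
print(shared)  # {'bike'}
```

A model fed these counts cannot tell the statement from the question, and it sees "old" and "used" as two unrelated features.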

N-grams

An n-gram is a contiguous sequence of n words taken from a text; tokenizing a text into n-grams produces every such sequence. Counting n-grams can be used to find the N most frequently co-occurring words (for example, how often word X is followed by word Y) in a given text.
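As a rough sketch of this, the snippet below builds bigrams (n = 2) with a sliding window and counts them. The ngrams helper and the sample sentence are made up for illustration, and the tokenization is a plain whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the tokens; each window is one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngrams(tokens, 2)

# Count how often each word pair occurs and keep the most frequent one.
top = Counter(bigrams).most_common(1)
print(top)  # [(('the', 'cat'), 2)]
```

Here "the" is followed by "cat" twice, so that pair is the most frequent bigram; passing a larger value to most_common returns the N most frequent pairs.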
