What is bag of words and how is it used in text classification?
Bag of Words is a representation of text that describes the occurrence of words within a document. The order or structure of the words is not considered. For text classification, we look at the histogram of the words within the text and consider each word count as a feature.
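As a minimal sketch of this idea, a bag-of-words vector can be built with nothing more than a word counter over a fixed vocabulary (the function name, vocabulary, and example document below are illustrative, not from any particular library):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document for illustration.
vocab = ["movie", "great", "boring", "plot"]
doc = "great movie great plot"

features = bag_of_words(doc, vocab)  # -> [1, 2, 0, 1]
```

Each position in the feature vector corresponds to one vocabulary word, and the value is that word's count in the document; these vectors are what a classifier would be trained on.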
Advantages:
- Simple to understand and implement.
Disadvantages:
- The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
- Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons: the model must extract signal from a very large, mostly empty feature space.
- Discarding word order ignores the context, and in turn the meaning, of words in the document. Context and meaning offer a lot to the model: if captured, they could distinguish between the same words arranged differently (“this is interesting” vs. “is this interesting”) and between synonyms (“old bike” vs. “used bike”).
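The word-order limitation can be demonstrated directly: because only counts are kept, two sentences with the same words in different orders produce identical representations (a small illustrative check using the standard library):

```python
from collections import Counter

# Word order is discarded: both sentences yield the same multiset of tokens,
# so their bag-of-words representations are identical.
a = Counter("this is interesting".split())
b = Counter("is this interesting".split())

print(a == b)  # -> True
```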
N-grams
An n-gram is a contiguous sequence of n words from a text; tokenizing into n-grams preserves some local word order that plain bag of words discards. N-grams can be used to count how often word sequences co-occur (for example, how often word X is followed by word Y) in a given corpus.
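A minimal sketch of n-gram extraction, assuming whitespace tokenization (the function name below is illustrative):

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is interesting".split()
print(ngrams(tokens, 2))  # -> [('this', 'is'), ('is', 'interesting')]
```

With n = 2 (bigrams), the counts now distinguish “this is interesting” from “is this interesting”, since the two sentences share no bigrams in the same positions.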