What are word embeddings?
Word embeddings are vector representations of words. Each word is mapped to a single vector that tries to capture characteristics of the word, so that similar words have similar vector representations.
Word embeddings help capture inter-word semantics and represent them as real-valued vectors.
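To make "similar words have similar vectors" concrete, here is a minimal sketch that compares vectors by cosine similarity, a common (but not the only) similarity measure. The three-dimensional vectors in `toy_embeddings` are made up for illustration; real embeddings are learned from a corpus and usually have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# "cat" is closer to "dog" than to "car" in this toy space.
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```

With these toy values the cat/dog similarity comes out much higher than cat/car, which is the behaviour a good embedding is meant to show for related words.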
Word2Vec is a method to construct such an embedding. It takes a text corpus as input and outputs a set of vectors that represent the words in that corpus.
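The embedding can be trained with one of two model architectures:
- Continuous Bag of Words (CBOW): predicts a word from its surrounding context words
- Skip-gram: predicts the surrounding context words from a word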
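Word2Vec's two training setups, CBOW and skip-gram, differ mainly in what is predicted from what. The sketch below only builds the training examples for each setup from a tokenized sentence; it does not train a model. The function names and the `window` parameter are illustrative choices, not part of any particular library.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, target_word) examples.

    CBOW predicts the target word from the words around it.
    """
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Build (center_word, context_word) pairs.

    Skip-gram predicts each context word from the center word.
    """
    pairs = []
    for context, center in cbow_examples(tokens, window):
        for context_word in context:
            pairs.append((center, context_word))
    return pairs


tokens = "the cat sat on the mat".split()

# With window=1, the example centered on "cat" is (["the", "sat"], "cat").
print(cbow_examples(tokens, window=1)[1])
# The first skip-gram pairs are ("the", "cat"), ("cat", "the"), ("cat", "sat").
print(skipgram_examples(tokens, window=1)[:3])
```

In an actual Word2Vec implementation these examples feed a shallow neural network, and the learned weights become the word vectors.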
If you have a sentence with multiple words, you may need to combine multiple word embeddings into one. How would you do it?
Approaches ranked from simple to more complex:
- Take an average over all words
- Take a weighted average over all words. Weighting can be done by inverse document frequency (the idf part of tf-idf), so that rare words count more than very common ones.
- Use an ML model such as an LSTM or a Transformer.
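The first two approaches can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the embeddings and documents are toy data, the helper names are hypothetical, and the idf formula used here, `log(N / df) + 1`, is just one common variant (the `+ 1` keeps words that appear in every document from getting zero weight).

```python
import math


def average_embedding(words, embeddings):
    """Plain average of the vectors of all known words."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words to average")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def idf_weights(documents):
    """idf per word, using the variant log(N / df) + 1 (an assumption)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}


def weighted_average_embedding(words, embeddings, weights):
    """Average of word vectors, each scaled by its weight."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in embeddings and w in weights:
            for d in range(dim):
                total[d] += weights[w] * embeddings[w][d]
            weight_sum += weights[w]
    if weight_sum == 0.0:
        raise ValueError("no known words to average")
    return [x / weight_sum for x in total]


# Toy data, invented for illustration.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
documents = [["the", "cat"], ["the", "dog"]]
sentence = ["the", "cat"]

plain = average_embedding(sentence, embeddings)
weighted = weighted_average_embedding(sentence, embeddings, idf_weights(documents))
print(plain)     # both words contribute equally
print(weighted)  # "cat" is rarer than "the", so it contributes more
```

Both averages ignore word order, which is the main reason to move to a sequence model such as an LSTM or a Transformer when order matters.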