Word Embeddings, Explained

In natural language processing, we work with words. However, computers cannot directly understand words, necessitating their conversion into numerical representations. These numeric representations, known as vectors or embeddings, comprise numbers that can be either interpretable or non-interpretable by humans. In this blog, we will delve into the advancements made in learning these word representations over time.

1 N-grams

Figure 1: N-gram vector representation of a sentence (image by author)

Let’s take the example of n-grams to understand the process better. Imagine we have a sentence that we want the computer to comprehend. To achieve this, we convert the sentence into a numeric representation. This representation includes various combinations of words, such as unigrams (single words), bigrams (pairs of words), trigrams (groups of three words), and even higher-order n-grams. The result is a vector that could represent any English sentence.

In Figure 1, consider encoding the sentence "This is a good day". Say the first position of the vector counts how many times the bigram "good day" occurs in the sentence. Since it occurs once, this position holds the value "1". In the same way, every unigram, bigram, and trigram can be assigned its own position in the vector.
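
To make this concrete, here is a minimal sketch of the encoding in Figure 1 using scikit-learn's CountVectorizer (the library choice is an assumption; the post does not prescribe a tool):

```python
# Sketch of the n-gram encoding from Figure 1 with scikit-learn's CountVectorizer.
# In practice the vocabulary of n-grams would come from a whole corpus,
# not from a single sentence.
from sklearn.feature_extraction.text import CountVectorizer

sentence = "This is a good day"

# ngram_range=(1, 3) builds unigrams, bigrams, and trigrams;
# the custom token_pattern keeps one-letter words such as "a".
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
vector = vectorizer.fit_transform([sentence]).toarray()[0]

for ngram, count in zip(vectorizer.get_feature_names_out(), vector):
    print(f"{ngram!r}: {count}")   # e.g. 'good day': 1
```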

A major upside of this model is interpretability. Each number in the vector has a meaning humans can associate with it, so when the model makes a prediction, it is not difficult to see what influenced the outcome. However, this representation has one major downside: the curse of dimensionality. The n-gram vector is enormous, so if it is used for statistical modeling, specific parts of it must be hand-picked. As the number of dimensions grows, sentence representations become increasingly sparse and the distances between them increase; more dimensions can encode more information, but beyond a point a statistical model can no longer tell which sentences are close to each other physically (and hence in meaning). Furthermore, hand-picking features is a manual process, and the developer might miss useful n-gram features along the way.
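
A quick way to see this at work is to fit the same kind of vectorizer on a few sentences and look at the size and sparsity of the resulting matrix (the toy corpus below is made up for illustration):

```python
# Even a tiny corpus produces a wide, mostly-zero n-gram matrix.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is a good day",
    "This is a bad day",
    "What a good day for a walk",
    "The weather is good today",
]

vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(corpus)

n_sentences, n_features = matrix.shape
sparsity = 1.0 - matrix.nnz / (n_sentences * n_features)
print(f"{n_features} n-gram features for just {n_sentences} sentences")
print(f"{sparsity:.0%} of the entries are zero")
```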

2 Neural Networks

Figure 2: Neural Probabilistic Language Model (Bengio et al., 2003)

To address this shortcoming, the neural probabilistic language model was introduced in 2003. Language models predict the word that comes next in a sequence. For example, given the sequence "I want a French", a trained language model should generate the next word, "Toast". The neural language model illustrated in Figure 2 works in much the same way, using the context of the N previous words to predict the next word.

For each word, we learn a dense representation, which is a vector containing a fixed number of numbers to represent each word. Unlike n-grams, these individual numbers in the vectors are not directly interpretable by humans. However, they capture various nuances and patterns that humans might not be aware of.

The exciting part is that since this is a neural network, we can train it end-to-end to grasp the concept of language modeling and learn all the word vector representations simultaneously. However, training such a model can be computationally expensive.

To illustrate, if we represent each word with a 100-number vector and concatenate the vectors of several context words, the input alone already runs into thousands of numbers. With a vocabulary of tens of thousands of words or more, the embedding table and the output layer together add up to millions or even tens of millions of parameters. This becomes a challenge when dealing with large vocabularies, a substantial number of training examples, or higher-dimensional word representations.
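
As a rough sketch of what such a model looks like (the framework, layer sizes, and vocabulary size below are illustrative assumptions, not the exact configuration from the paper):

```python
# A Bengio-style neural language model: embed N context words, concatenate,
# pass through a hidden layer, and score every word in the vocabulary.
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab_size=50_000, embed_dim=100, context_size=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # the learned word vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)        # one score per vocabulary word

    def forward(self, context_ids):                            # (batch, context_size)
        vectors = self.embed(context_ids)                      # (batch, context_size, embed_dim)
        concatenated = vectors.flatten(start_dim=1)            # concatenate the context embeddings
        return self.output(torch.tanh(self.hidden(concatenated)))  # logits for the next word

model = NeuralLM()
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~18 million for this modest setup
```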

Ideally, larger dimensions would enable us to capture more of the intricacies of language, given its inherently complex nature.

Over the next decade, various architectures were introduced to improve the quality of word embeddings. One such architecture is described in the paper Fast Semantic Extraction Using a Novel Neural Network Architecture, which incorporates positional information about each word to improve the embeddings. However, this approach still suffers from the drawback of being computationally expensive to train solely for learning word embeddings.

3 Word2Vec

Figure 3: Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)

In 2013, a significant breakthrough in generating word embeddings came with the introduction of Word2Vec. The paper presented two models, the Continuous Bag of Words (CBOW) model and the Skip-gram model, which aim to keep the architecture simple while still learning useful word embeddings. In the CBOW model, the current word is predicted from the two preceding and two succeeding words, and the projection layer holds the embedding for that word. The Skip-gram model performs the reverse task, predicting the surrounding context words given the current word; again, the projection layer holds the vector representation of the current word. After training either of these networks, we obtain a table of words and their corresponding embeddings. With far fewer parameters than earlier neural language models, this architecture marked the beginning of the era of pre-trained word embeddings.
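
The original paper ships its own training tool, but the same idea is easy to try with the gensim library (a hedged sketch; the toy corpus and hyperparameters are made up for illustration):

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "and", "queen", "ruled", "the", "land"],
    ["this", "is", "a", "good", "day"],
    ["the", "queen", "spoke", "to", "the", "king"],
]

# sg=0 trains CBOW (predict the word from its context);
# sg=1 trains Skip-gram (predict the context from the word).
# window=2 mirrors the two-words-either-side context described above.
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["queen"][:5])                    # first few numbers of the learned vector
print(model.wv.most_similar("king", topn=3))    # nearest words in the embedding space
```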

However, this approach also has some limitations. Firstly, it generates the exact same vector representation for every occurrence of a word, regardless of its context. For example, the word "queen" in "drag queen" and "queen" in "king and queen" would have identical word embeddings, even though they carry different meanings. Additionally, the generation of these word embeddings considers a limited context window, only looking at the previous two words and the next two words during the training phase. This limitation affects the model’s contextual awareness.
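
Continuing the gensim sketch above (this snippet reuses the `model` object from the previous one), the lookup table stores exactly one row per word type, so every occurrence of "queen" gets the same vector:

```python
import numpy as np

# One row per word type: the vector is identical no matter which sentence it came from.
queen_in_drag_queen = model.wv["queen"]
queen_in_king_and_queen = model.wv["queen"]
print(np.array_equal(queen_in_drag_queen, queen_in_king_and_queen))  # True -- context is ignored
```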

4 ELMo

Figure 4: Deep contextualized word representations (Peters et al., 2018)

To enhance the quality of the generated embeddings, ELMo (Embeddings from Language Models) was introduced in 2018. ELMo is a bidirectional LSTM (Long Short-Term Memory) model that learns a language model and dense, contextual word embeddings within the same training process. By leveraging LSTM cells, it can capture context from longer sentences. However, ELMo shares the usual drawbacks of LSTM models: training is slow, and it relies on backpropagation through time (BPTT), typically in a truncated form. Furthermore, it is not truly bidirectional, since the forward and backward contexts are learned separately and then concatenated, which may lose some contextual information.
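
A rough sketch of the ELMo idea in PyTorch (the sizes and the single LSTM layer are illustrative assumptions; the real model is deeper and trained as a language model): run a forward and a backward LSTM separately, then concatenate their states to get one context-dependent vector per token.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 6))        # one sentence of six (fake) token ids
vectors = embed(token_ids)

fwd_states, _ = forward_lstm(vectors)                   # reads left to right
bwd_states, _ = backward_lstm(vectors.flip(dims=[1]))   # reads right to left
bwd_states = bwd_states.flip(dims=[1])                  # realign with the original word order

contextual = torch.cat([fwd_states, bwd_states], dim=-1)
print(contextual.shape)   # (1, 6, 512): each token now has a context-dependent vector
```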

5 Transformers

Figure 5: Attention Is All You Need (Vaswani et al., 2017)

Shortly before the introduction of ELMo, the Attention Is All You Need paper presented the Transformer neural network architecture. Transformers consist of an encoder and a decoder that both incorporate positional encodings to generate word vectors with contextual awareness. For example, when inputting the sentence "I am Ajay," the encoder generates three dense word embedding representations, preserving the word meanings. Transformers also address the downsides of LSTM models. They are faster to train since data can be processed in parallel, leveraging GPUs. Furthermore, Transformers are deeply bidirectional because they employ an attention mechanism that allows words to focus on preceding and succeeding words simultaneously, enabling effective contextual understanding.
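
A minimal sketch of a Transformer encoder producing contextual word vectors in PyTorch (the dimensions are illustrative assumptions, and the positional encoding is simplified to a learned embedding):

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 10_000, 128, 32
token_embed = nn.Embedding(vocab_size, d_model)
position_embed = nn.Embedding(max_len, d_model)     # simplified positional encoding

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

token_ids = torch.randint(0, vocab_size, (1, 3))    # e.g. "I am Ajay" as three (fake) ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

hidden = token_embed(token_ids) + position_embed(positions)
contextual = encoder(hidden)                        # self-attention looks left and right at once
print(contextual.shape)                             # (1, 3, 128): one contextual vector per word
```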

The main issue with Transformers is that they need a lot of data for each new language task. Humans, by contrast, have an inherent understanding of language and don't need to see a huge number of examples to learn how to answer questions or translate.

6 BERT and GPT

Figure 6: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)

To overcome the limitations of the Transformer model in language tasks, two powerful models called BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) were introduced. These models utilize transfer learning, which involves two phases of training.

In the first phase, known as pretraining, the models learn about general language understanding, context, and grammar from a large amount of data. They acquire a strong foundation of knowledge during this phase. In the second phase, known as fine-tuning, the models are trained on specific tasks by providing them with task-specific data. This fine-tuning process allows the models to specialize in performing the desired task without requiring a massive amount of task-specific data.
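
A hedged sketch of the two-phase recipe with the Hugging Face transformers library (the library and checkpoint name are choices made for illustration): `from_pretrained` downloads the result of phase one, and phase two would then train the small task head on labeled examples.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This is a good day", return_tensors="pt")
outputs = model(**inputs)      # the classification head is untrained until fine-tuning
print(outputs.logits.shape)    # (1, 2): one score per class
```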

BERT is pretrained on two tasks: Masked Language Modeling and Next Sentence Prediction. Through this pretraining, BERT gains a deep understanding of the context and meaning of each word, resulting in improved word embeddings. It can then be fine-tuned on specific tasks such as question answering or translation using relatively little task-specific data.
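
The masked-language-modeling objective can be probed directly with the fill-mask pipeline (again using Hugging Face transformers and the standard `bert-base-uncased` checkpoint as an assumed setup):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The queen ruled the [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))   # BERT's top guesses for the blank
```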

Similarly, GPT is pretrained on language modeling, which involves predicting the next word in a sentence. This pretraining helps GPT develop a comprehensive understanding of language. Afterward, it can be fine-tuned on specific tasks to leverage its language understanding capabilities in the same way BERT can.
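
The same idea for GPT's pretraining objective, predicting the next word, seen through the text-generation pipeline with the small public `gpt2` checkpoint (an assumed setup for illustration):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("I want a French", max_new_tokens=5)[0]["generated_text"])  # continues the prompt
```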

Both BERT and GPT, with their Transformer architecture and the ability to learn various language tasks, offer superior word embeddings compared to earlier approaches. This is why GPT, in particular, serves as the foundation for many modern language models like ChatGPT, enabling advanced natural language processing and generation.

7 Conclusion

In this blog, we have explored how computers comprehend language through representations known as "embeddings". We have witnessed the advancements made in recent years, particularly with the rise of transformers as the foundation of modern language models. If you’re interested in building your own transformer model from scratch, check out this playlist of videos that delve into the code and theory behind it. Happy learning!

