Word Representation in Natural Language Processing Part I

Nurzat Rakhmanberdieva
Towards Data Science
Dec 9, 2018


In this blog post, I will discuss the representation of words in natural language processing (NLP). It is one of the basic building blocks in NLP, especially for neural networks, and it has a significant influence on the performance of deep learning models. In this part of the series, I will describe relatively simple approaches and their characteristics.

Dictionary Lookup

The simplest approach is a word ID lookup in a dictionary. The basic steps of this approach are as follows.

First, take the corpus, which can be a collection of words, sentences or texts, and pre-process it into the intended format. One common step is lemmatization, the process of converting a word to its base form. For example, given the words walk, walking, walks and walked, their lemma would be walk. Then, save the vocabulary of pre-processed words into a file such as “vocabulary.txt”.
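To make this concrete, here is a minimal sketch of the pre-processing step using NLTK’s WordNetLemmatizer. The choice of library is mine, not part of the approach itself, and it assumes NLTK and its WordNet data are installed.

```python
# A minimal lemmatization sketch using NLTK (assumes `pip install nltk`
# and that the WordNet data has been downloaded via nltk.download("wordnet")).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["walk", "walking", "walks", "walked"]
# pos="v" tells the lemmatizer to treat each token as a verb.
lemmas = [lemmatizer.lemmatize(w, pos="v") for w in words]
print(lemmas)  # ['walk', 'walk', 'walk', 'walk']

# Save the (deduplicated) vocabulary of lemmas to a file.
with open("vocabulary.txt", "w") as f:
    for lemma in sorted(set(lemmas)):
        f.write(lemma + "\n")
```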

After that, build the lookup dictionary by creating a mapping between words and IDs, i.e. each unique word in the vocabulary is assigned an integer ID.

As a result, a simple lookup dictionary is constructed, as shown below, from which one can look up word IDs.

An example of a lookup dictionary.

Then, for each given word, return the corresponding integer representation by looking it up in the dictionary. If the word is not present in the dictionary, the integer corresponding to the out-of-vocabulary (OOV) token should be returned. In practice, the value of the OOV token is usually set to the size of the dictionary plus one, i.e. length(dictionary) + 1.
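A minimal sketch of such a lookup in Python might look like the following; the variable names and the OOV convention simply mirror the description above.

```python
# A minimal sketch of a word-to-ID lookup dictionary with an
# out-of-vocabulary (OOV) fallback, following the convention above.
with open("vocabulary.txt") as f:
    vocabulary = [line.strip() for line in f if line.strip()]

# Assign each unique word an integer ID (here starting from 1).
word_to_id = {word: idx for idx, word in enumerate(vocabulary, start=1)}

# OOV words map to length(dictionary) + 1.
OOV_ID = len(word_to_id) + 1

def lookup(word):
    """Return the integer ID of a word, or the OOV ID if unseen."""
    return word_to_id.get(word, OOV_ID)

print(lookup(vocabulary[0]))  # a known word -> its ID
print(lookup("zeppelin"))     # an unseen word -> OOV_ID
```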

While this is a relatively easy approach, it has drawbacks that need to be considered. By treating tokens as integers, a model might incorrectly assume the existence of a natural ordering. For example, suppose the dictionary contains the entries 1: “airport” and 2: “plane”. A deep learning model might treat the token with the greater ID as more important than the token with the smaller ID, which is a wrong assumption, and models trained on this kind of representation are prone to failure. In contrast, data with genuinely ordinal values, such as the size measures 1: “small”, 2: “medium”, 3: “large”, is suitable for this encoding, because there the ordering is meaningful.

One-Hot Encoding

The second approach to word representation is one-hot encoding. The main idea is to create a vector of vocabulary size filled with zeros, except for a single position: for a given word, only the corresponding column is set to 1 and the rest remain 0. Each encoded token is therefore a vector of dimension 1 × (N + 1), where N is the size of the dictionary and the extra slot is reserved for the out-of-vocabulary token. Let’s look at how the words in our dictionary are converted to one-hot encodings:

As we can see, only the column corresponding to each word is activated.

The advantage of this encoding over the ordinal representation is that it does not suffer from the undesirable ordering bias. However, its immense and sparse vectors require a large amount of memory to compute with.
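For illustration, here is a rough sketch of the conversion in Python, reusing the toy dictionary 1: “airport”, 2: “plane” from earlier and the OOV convention described above.

```python
import numpy as np

def one_hot(word, word_to_id):
    """Encode a word as a 1 x (N + 1) one-hot vector, where N is the
    dictionary size and the extra slot is reserved for OOV tokens."""
    n = len(word_to_id)
    vector = np.zeros(n + 1)
    # IDs run from 1..N, OOV maps to N + 1; shift by 1 for 0-based indexing.
    idx = word_to_id.get(word, n + 1) - 1
    vector[idx] = 1.0
    return vector

word_to_id = {"airport": 1, "plane": 2}  # toy dictionary from the example above
print(one_hot("plane", word_to_id))      # [0. 1. 0.]
print(one_hot("zeppelin", word_to_id))   # [0. 0. 1.]  (OOV slot)
```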

Distributional Representation

A third approach is the family of distributional representations. The main idea behind this approach is that words appearing in similar contexts tend to have similar meanings. The idea is to build a word-context co-occurrence matrix F in which rows represent words in the vocabulary and columns represent contexts. A context could be a sliding window over the training sentences, or even a whole document. The matrix entries are frequency counts or tf-idf (term frequency-inverse document frequency) scores. Here is a simple example:

Sentence 1: Boston has available flights to major US cities.

Sentence 2: Flights to Boston were cancelled due to bad weather conditions.
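Treating each sentence as one context, a rough sketch of building the co-occurrence matrix F from these two sentences might look like this (the whitespace tokenization is a simplification on my part):

```python
from collections import Counter

sentences = [
    "Boston has available flights to major US cities",
    "Flights to Boston were cancelled due to bad weather conditions",
]

# Simple tokenization: lowercase and split on whitespace (an assumption;
# a real pipeline would also lemmatize and strip punctuation).
tokenized = [s.lower().split() for s in sentences]
vocab = sorted({w for tokens in tokenized for w in tokens})

# F[i][j] = how often word i appears in context (sentence) j.
counts = [Counter(tokens) for tokens in tokenized]
F = [[counts[j][word] for j in range(len(sentences))] for word in vocab]

for word, row in zip(vocab, F):
    print(f"{word:12s} {row}")
# e.g. 'boston' [1, 1], 'flights' [1, 1], 'cities' [1, 0], 'to' [1, 2]
```

Each row of F is already a crude vector representation of a word: words that occur in the same contexts, such as boston and flights here, end up with similar rows.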

Moreover, some function g can be applied to F to reduce noise, smooth the frequencies, or reduce the dimension of the representation vectors. The function g could be a simple transformation such as a linear decomposition, but there are also more advanced approaches such as Latent Dirichlet Allocation. Since the number of contexts can be very large, e.g. a document might contain thousands of sentences, these methods are known for being computationally inefficient.
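As one concrete, but by no means the only, choice of g, a truncated singular value decomposition can compress the rows of F into shorter, denser vectors. The sketch below uses a toy matrix like the one from the previous example.

```python
import numpy as np

# Toy word-context count matrix F (rows: words, columns: contexts),
# e.g. the matrix built from the two example sentences above.
F = np.array([
    [1, 1],   # boston
    [1, 1],   # flights
    [1, 0],   # cities
    [0, 1],   # weather
    [1, 2],   # to
], dtype=float)

# One possible choice of g: a truncated SVD that keeps the top-k
# singular directions as a denser, lower-dimensional word representation.
k = 2  # target dimension, chosen arbitrarily for illustration
U, S, Vt = np.linalg.svd(F, full_matrices=False)
word_vectors = U[:, :k] * S[:k]   # one k-dimensional vector per word

print(word_vectors.shape)  # (5, 2)
```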

The aforementioned methods are easy to use, but their drawbacks make them hard to train with and memory-hungry. Moreover, they do not incorporate word meaning into the representation as more advanced methods do. In the next parts of this series on word representation in natural language processing, I will describe those more advanced methods. The next part can be found here.
