Enabling machines to understand text has been a daunting task. Although a lot of progress has been made on this front, we are still a long way from a method that seamlessly converts our language into data our machines can read.
Let’s face it: our languages are complicated.
Unlike tables that are neatly composed of rows and columns and images that are composed of pixels with RGB values within a fixed range, the words we speak and write don’t adhere to a strict and structured system.
Given our flexibility, we have been able to get away with this throughout human history (it’s been a nice run). Unfortunately, the quirks of our language now come back to haunt us as we attempt to teach computers to make sense of our complicated system.
For a better understanding of the challenges facing NLP practitioners, it is ideal to examine our language from the perspective of a linguist.
Let’s examine some of the components that make our language convoluted and sometimes even nonsensical.
Note: This article mainly focuses on English. Some of the features discussed may not apply to other languages.
Word Sequence
In spoken and written language, the sequential order of our words is important. The semantic value of a text lies in the order of the words just as much as the words themselves.
Consider the following pairs of sentences:
Case 1:
"The cat ate the mouse."
"The mouse ate the cat."
Case 2:
"I had fixed my laptop."
"I had my laptop fixed."
In both cases, even though the sentences have the same words, their semantic meaning is different.
Synonyms
Synonyms refer to words that share the same or similar meaning to each other. You are most likely familiar with them, but here are a few examples:
1: {great, amazing, fantastic}
2: {big, huge, enormous}
3: {costly, expensive, pricey}
Having multiple ways to express the same information introduces added complexity that NLP models will have to account for.
Homonyms
Homonyms refer to words with the same spelling and pronunciation but multiple meanings.
In dialogue, it is easy to determine which meaning is being referred to given the context. Here is an example:
"The fishes are swimming in a tank."
"The military unit was supplied with a tank."
We can easily tell what the definition of "tank" is in each scenario due to the provided context. However, getting a computer to do the same is a challenge.
Sarcasm
Sarcasm, in layman’s terms, means saying something whose literal meaning is the opposite of what you actually intend (usually as a form of mockery or ridicule). It has long been incorporated into our day-to-day dialogue. It is also present in text, often seen in informal forms of communication such as personal chats.
I’m sure we have all seen online reviews resembling the following:
"What a book! After 50 pages in, I only dozed off twice!"
Again, this phenomenon is easy to detect for humans, but not so much for computers. Unfortunately, failing to detect sarcasm would hamper the performance of NLP applications that require detecting emotion (e.g., sentiment analysis).
Unknown words
NLP models also face the chance of running into words that they don’t recognize. These are words that were not included in the data used to train the models. Examples of such words include new vocabulary, misspelled words, slang, and abbreviations.
Current Models in NLP
Researchers have conducted countless studies to develop algorithms and models that enable computers to vectorize text despite the intricacies of our languages.
Let’s go over a few of them.
Count-based models
Models like the bag of words or TF-IDF are often introduced to novices starting out with natural language processing. Their methods of vectorizing text are simple, evaluating a text mainly by the frequency of its words.
These models are easy to deploy at scale and can be used in many applications. However, their approach towards vectorization disregards the words’ sequential order as well as the semantic value of the individual words.
For example, here are two very simple sentences:
"Pigeons fly."
"Eagles soar.
Pretty similar, right?
Unfortunately, with the bag of words or TF-IDF model, these sentences would have a cosine similarity of 0.
With count-based models, synonyms like "fly" and "soar" and words from the same category like "pigeons" and "eagles" would be treated as completely different entities.
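To see this in action, here is a minimal sketch using scikit-learn (my choice of library for illustration; the article itself doesn't prescribe one). It vectorizes the two sentences with TF-IDF and computes their cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["Pigeons fly.", "Eagles soar."]

# Build a TF-IDF matrix: one row per sentence, one column per unique word.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences)

# The two sentences share no words, so their vectors are orthogonal.
print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]]
```

Because the two sentences have no words in common, their vectors have no overlapping non-zero dimensions, and the similarity comes out as exactly zero despite their nearly identical meaning.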
Deep learning models
To address the limitations of count-based models, some studies have turned to deep learning models as a means of vectorizing text.
The word2vec model, for instance, uses a shallow neural network to evaluate each word based on the words surrounding it. This addresses the count-based models’ inability to preserve the semantic value of a given text.
Unfortunately, the word2vec model comes with its own limitations.
Firstly, it is unable to properly recognize the different meanings of homonyms. Each word is assigned a single vector, so the model cannot identify which sense of the word is being used in a given body of text.
Secondly, it is unable to accommodate unknown words that weren’t used to train the model.
Finally, as a deep learning model, it requires copious amounts of data. The performance of a deep learning model can only reach satisfactory levels when it is trained with data of high quality and quantity.
To demonstrate, let’s create a word2vec model using BBC articles obtained from Kaggle (copyright-free). The data for this demonstration is listed in the references at the end of this article.
First, let’s load the .txt files with the os module and merge all the text in the business category into a single corpus.
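A rough sketch of that step is shown below. The folder path reflects how the Kaggle download is typically laid out, which is an assumption on my part; adjust it if your copy is organized differently:

```python
import os

# Path to the business articles in the BBC News Summary dataset
# (assumed layout of the Kaggle download -- adjust as needed).
BUSINESS_DIR = "BBC News Summary/News Articles/business"

corpus = ""
for filename in sorted(os.listdir(BUSINESS_DIR)):
    if filename.endswith(".txt"):
        path = os.path.join(BUSINESS_DIR, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            corpus += f.read() + "\n"

print(f"Corpus length: {len(corpus)} characters")
```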
After preprocessing the corpus with the NLTK library, let’s use it to train a word2vec model.
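One way to do this, assuming NLTK and Gensim are installed (the hyperparameters below are illustrative defaults, not values taken from the original experiment):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Split the corpus into sentences, then into lowercase word tokens,
# dropping stop words and anything that isn't alphabetic.
sentences = []
for sent in sent_tokenize(corpus):
    tokens = [w.lower() for w in word_tokenize(sent)
              if w.isalpha() and w.lower() not in stop_words]
    if tokens:
        sentences.append(tokens)

# Train a small word2vec model on the tokenized sentences.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
```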
For this model, let’s see which 5 words are most similar to the word "finance".
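With Gensim, that query looks something like this (the exact neighbors you get will vary with the corpus and random seed):

```python
# Five nearest neighbors of "finance" in the learned embedding space.
for word, score in model.wv.most_similar("finance", topn=5):
    print(f"{word}: {score:.3f}")

# Note: querying a word that never appeared in the training corpus raises
# a KeyError -- the out-of-vocabulary limitation mentioned above.
```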

As you can see, the 5 words deemed most similar to "finance" aren’t really that closely tied to it in a business context. However, such a result is to be expected, as the model’s output is limited by the corpus used to train it. A model trained on so little data cannot be expected to perform adequately.
Transformer-based models
Extensive research has led to the advent of transformer-based models. Their attention-based architecture (built from encoder and/or decoder stacks) allows computers to comprehend text with much more nuance. They are able to deal with homonyms, since each word’s representation depends on its surrounding context, and even with unknown words, thanks to subword tokenization.
Such models can be used to perform advanced tasks such as text summarization and machine translation.
Examples of transformer-based models include Google’s Bidirectional Encoder Representations from Transformers (BERT) model and OpenAI’s GPT-2.
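To illustrate how a contextual model handles a homonym like "tank", here is a rough sketch using the Hugging Face transformers library (my addition, not part of the original demonstration). It compares BERT’s embeddings for "tank" in the two sentences from earlier:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The fish are swimming in a tank.",
    "The military unit was supplied with a tank.",
]

def tank_embedding(sentence):
    """Return BERT's contextual embedding for the token 'tank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("tank")
    return outputs.last_hidden_state[0, idx]

emb_fish, emb_military = (tank_embedding(s) for s in sentences)

# The two vectors differ because BERT encodes the surrounding context;
# a static embedding like word2vec would give "tank" a single vector.
print(torch.cosine_similarity(emb_fish, emb_military, dim=0).item())
```

Unlike word2vec, the same word receives a different vector in each sentence, which is what lets these models tell the aquarium apart from the armored vehicle.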
Although these models are very sophisticated, their performance comes at a high price. Literally.
These models have complex architectures and are trained on billions of words. As a result, training and deploying them incurs a high cost.
Naturally, they are reserved for more advanced applications.
Using transformers for simple NLP tasks would be akin to hiring a limo for a trip to the grocery store.
Conclusion

You now have some insight into how the makeup of our language makes it so hard to get computers to comprehend it.
There are obviously other difficult hurdles to face in NLP (e.g., lack of accessibility to data), but it is important to realize how many obstacles stem from the disorganized nature of our languages.
As we continue to pursue more sophisticated NLP technologies, getting our computers to understand our languages despite their quirks will remain a persistent challenge.
I wish you the best of luck in your NLP endeavors!
References
- Sharif, P. (2018). BBC News Summary, Version 2. Retrieved January 30, 2022 from https://www.kaggle.com/pariza/bbc-news-summary.