
What makes NLP a Unique Branch of Machine Learning?


Photo by Scott Graham on Unsplash

The Machine Learning era started by analyzing the binary data stored in spreadsheets, traditional databases, and even CSVs. With the splendid success of such models, data scientists extended these techniques to text data. What is so different about text data that its processing differs so much from processing binary data? In this brief article, I will give you a few convincing reasons and also walk you through an entire pipeline for processing text data. Let me first briefly describe the pipeline for binary data pre-processing.

Binary Data Pre-processing Pipeline

Data stored in conventional databases is rarely clean. Several data points may contain missing values (nulls), so you start by cleansing the data. The dataset may contain many correlated columns, so you reduce the column count (features) by eliminating the unwanted ones. The data in individual columns may have wide ranges, so you will need to scale such columns. After this pre-processing, your major task is to reduce the feature count; the higher the count, the more training time and resources you need. Below is a minimal sketch of these steps; after that, let us look at what kind of pre-processing is required for text data.
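
To make this concrete, here is a minimal sketch of the cleansing, feature-reduction, and scaling steps using pandas and scikit-learn on a small, made-up DataFrame (the column names are purely illustrative):

# A minimal tabular pre-processing sketch on a hypothetical DataFrame.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40_000, 52_000, 61_000, None],
    "income_usd": [40_000, 52_000, 61_000, 75_000],  # redundant, correlated column
})

# 1. Cleansing: drop rows with missing values (imputation is another option).
df = df.dropna()

# 2. Feature reduction: drop a column that duplicates existing information.
df = df.drop(columns=["income_usd"])

# 3. Scaling: bring wide-ranging columns onto a comparable scale.
scaled = StandardScaler().fit_transform(df[["age", "income"]])
print(scaled)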

What Makes Text Pre-processing So Different?

A database table typically comprises a few columns; the number is usually in the tens. Each column in a database is a potential candidate for a feature in ML model training. Now, consider the case of text data. For text data, each word or sentence is considered a potential feature. To understand why, we need to consider the ML applications that we seek to develop on text data.

Consider the email spam filter that all of us are well aware of. The machine learning model looks for a few offending words in an unseen email and, if found, marks the corresponding email as spam. Note that we are looking for words. So for model training, we will tokenize the email text from the training dataset into word tokens. Using supervised learning, we will make the model learn to classify an email based on the presence of certain offending words.
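
Here is a toy sketch of such a classifier using scikit-learn; the emails and labels are invented purely for illustration:

# A toy spam classifier: tokenize emails into word features,
# then learn which words signal spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",
    "claim your free lottery reward",
    "meeting agenda for tomorrow",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()          # word-level tokenization and counting
X = vectorizer.fit_transform(emails)    # each word becomes a feature column
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["free prize inside"])))  # likely [1]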

Now, consider the situation where you are asked to train a machine to summarize a novel. You are well aware of the back-cover passage in any book. An excellent summary compels the potential buyer to make a purchase. This is a very advanced NLP (Natural Language Processing) application. You need to tokenize the entire novel text into sentences and pick the most important sentences for the book summary.
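
As a first step toward such an extractive summary, you would split the text into sentence tokens. Here is a minimal sketch using NLTK (assuming the library and its tokenizer models are available):

# Sentence tokenization, the first step of extractive summarization.
import nltk
nltk.download("punkt", quiet=True)   # newer NLTK versions may need "punkt_tab" instead
from nltk.tokenize import sent_tokenize

text = ("The storm had passed. The village slowly returned to life. "
        "Nobody spoke of the lighthouse keeper again.")
sentences = sent_tokenize(text)
print(sentences)   # each sentence is a candidate for the summary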

These two examples should convince you why we need to tokenize the text corpus into words or sentences rather than individual characters. Now, here comes the major problem. Any text passage contains thousands, even tens of thousands, of words and sentences. Each such token is a potential candidate for a feature in the ML model. We must substantially reduce this feature count to make training viable.

The second most important requirement for NLP is that we need to vectorize our word and sentence tokens. Compare this with binary data, where the data is already in a machine-readable format and can easily be represented as tensors. Though the individual characters in text data are in binary format (ASCII, machine-readable), in NLP we work with words and sentences. We need to convert these into vectors, using utilities like Word2Vec and GloVe.
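
Here is a minimal vectorization sketch using gensim's Word2Vec on a toy corpus (real models are trained on far larger corpora):

# Turning word tokens into dense vectors with Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"].shape)              # a 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=2))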

To further understand why text processing is so different from processing binary data, I will give you the complete text pre-processing pipeline.

Text Pre-processing Pipeline

Our aim in text pre-processing is to reduce the size of our word vocabulary. So, the first thing you may like to do is remove all the punctuation marks from the text corpus. Note that a tokenizer will most likely emit the punctuation marks as separate word tokens. You may use regular expressions to remove all such unwanted characters. Next, you would remove words like "the", "a", "this", "was", and so on. There is little value in keeping them as features. We call them stop words.

You may also like to remove the numbers/digits or convert them into text, depending on your application. Lowercasing all the words also helps bring the word count down when you include only the unique words in your vocabulary. The words "John" and "john" mean the same to a machine learning model, while the tokenizer creates two distinct tokens for them.
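
Here is a minimal cleanup sketch that combines these steps: punctuation removal, digit removal, lowercasing, and stop-word filtering (it assumes NLTK's stop-word list has been downloaded):

# A minimal text-cleanup sketch.
import re
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                      # "John" and "john" become one token
    text = re.sub(r"[^\w\s]", " ", text)     # strip punctuation marks
    text = re.sub(r"\d+", " ", text)         # drop digits (or convert, if needed)
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean("John bought 3 books, and this was the best deal!"))
# ['john', 'bought', 'books', 'best', 'deal']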

Further reduction in vocabulary can be achieved by using stemming and lemmatization. Both reduce an inflected word to its root form, except that lemmatization ensures the root word belongs to the language. Inflection is the modification of a word by adding a prefix, suffix, or infix to it. As an example, the words "playing", "plays", and "played" would be reduced to the common root "play", which is also a valid word in English. A stemmer would reduce the word "troubling" to "troubl"; likewise, it would reduce "troubled" to "troubl" after removing the suffix. In both cases, the reduced word is not a valid word in our dictionary. If you want only valid words in a language, use lemmatization.
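
A small NLTK sketch contrasting the two (the WordNet data must be downloaded once):

# Stemming vs. lemmatization: both reduce inflected forms,
# but only the lemmatizer guarantees a valid dictionary word.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "plays", "played", "troubling", "troubled"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# e.g. "troubling" stems to "troubl" but lemmatizes to "trouble"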

After you finish the text pre-processing, you will tokenize the entire corpus, making it ready to input to your machine learning model.

All these processes reduce the word count substantially. I am listing the various steps below for your quick reference. It is not necessary to follow these steps in order as long as you understand the purpose behind each step.

  • Removing punctuation
  • Removing stop words
  • Removing/converting numbers
  • Lowercasing
  • Selecting unique words
  • Stemming and lemmatization
  • Tokenization

For a machine learning algorithm, the word count after running all the above processes is typically still high. We consider each word a feature, so we must reduce this count further. Depending on the type of NLP application that we are developing, we will use a few more techniques to reduce it.

Advanced Processing

Text data is unstructured: documents do not have a fixed length, while machine learning algorithms require fixed-length inputs. A common solution is to represent each document as a vector whose length equals the number of words in the vocabulary. For a large text corpus, this number would still be dangerously high. To make it smaller, you use a technique called bag-of-words. In this technique, you collect the most common words from your text corpus in a bag; the frequency of their occurrence decides which words are the most common. The size of this bag then determines the vector length.
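
Here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer, with the bag size capped via max_features (the documents are invented for illustration):

# Bag-of-words with a capped vocabulary: only the most frequent words
# are kept, so every document maps to a fixed-length vector.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the plot of the novel is gripping",
    "the characters in the novel feel real",
    "a gripping plot and real characters",
]
vectorizer = CountVectorizer(max_features=5)   # bag size = vector length = 5
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())      # the 5 most common words
print(X.toarray())                             # one fixed-length row per document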

A high frequency does not mean that a word is highly relevant for representing the information in a document. Common words like "this" and "that" have very high frequencies but carry no document-specific information. This is where tf-idf (term frequency – inverse document frequency) comes in. You may also bring bi-grams and n-grams into your feature set. A full description of these techniques is beyond the scope of this article, but you can now certainly start appreciating why NLP gets special treatment in Machine Learning.
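
For a taste of tf-idf in practice, here is a brief scikit-learn sketch that also adds bi-grams via ngram_range (again on invented documents):

# tf-idf down-weights words that appear in every document; ngram_range
# adds bi-grams alongside single words.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this movie was a great movie",
    "this book was a boring book",
    "a great book about movies",
]
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=10)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # unigrams and bi-grams kept
print(X.toarray().round(2))                 # tf-idf weights per document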

The requirements of text processing with NLP do not stop here. You require NLU (Natural Language Understanding) to develop advanced applications like language translation. Neural network architectures like RNNs and LSTMs were developed for this purpose. These have now been largely superseded by the transformer architecture, and we now have many models like BERT, GPT, and others for advanced NLP applications on text data.
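
As a taste of what these pre-trained models offer, here is a hedged sketch using the Hugging Face transformers library (assumed to be installed; the default summarization model is downloaded on first use):

# Summarization with a pre-trained transformer via the transformers pipeline.
from transformers import pipeline

summarizer = pipeline("summarization")   # default model chosen by the library
text = ("Natural language processing turns raw text into features a model can "
        "learn from. Classical pipelines rely on tokenization, stop-word removal, "
        "and tf-idf, while modern systems use pre-trained transformer models.")
print(summarizer(text, max_length=30, min_length=10)[0]["summary_text"])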

Concluding Remarks

Machine learning model development on binary datasets is comparatively straightforward because of the limited number of features inherent in a database table and the machine-readable formats the data comes in. Contrary to this, developing models on text data is very challenging. This is because of the size of the text corpus and the fact that each word is a potential candidate for a feature in the training dataset. Second, representing the tokenized words and sentences in machine-readable formats, that is, as numeric vectors, is another enormous challenge.

As you have seen above, not only is exhaustive text pre-processing required to reduce the feature count, but you also need to apply advanced techniques like BoW and tf-idf while developing NLP applications. More advanced NLP applications require language and context understanding, which is the domain of another branch called NLU. You will now easily appreciate why NLP/NLU requires special attention and treatment in Machine Learning and why it is not hype. So, as a data scientist or a machine learning engineer, you will definitely need to gain NLP skills to work on text corpora.

Credits

Pooja Gramopadhye – Copy editing

