
My First NLP Pipeline

Creating an NLP Pipeline for a Supervised Text Classification Problem

Photo by Sigmund on Unsplash

This post is based on a workshop created by Adi Shalev and me for CodeFest 2021.

When taking our first steps in data science, and specifically in NLP, we often encounter a whole new set of terms and phrases: "fit", "transform", "inference", "metrics" and many others. While the available resources explain each of these terms, it is not always clear how the different tools eventually join together to form a production machine-learning system we can use on new data.

The ordered set of stages that takes us from a labeled dataset to a classifier we can apply to new samples (also known as supervised machine learning classification) is called the NLP pipeline.

In this post, we will create such a pipeline for a supervised classification problem: Classifying whether a movie is a drama movie based on its description. The full code can be found here.

NLP Pipeline for Supervised Machine Learning Classification: What Is It, Anyway?

We start our journey with a dataset, a table of textual records for which the classification is known. Our data is like a stream of water we want to use to create our product. To do that, we will create a pipeline: a multi-level system where every level gets its input from the previous level and passes its output on as the input to the next level.

Our pipeline will be composed of the following stages:

Image by Author

While the way we build each stage can vary between problems, each of these stages has a role in our final product, a shiny text classifier!

The Data

The data we will use is based on this movies dataset from Kaggle. We will use it to determine whether a given movie’s genre is drama: yes or no, a binary classification problem. Our data is a pandas DataFrame with 45,466 records, 44,512 after removing NULL entries.
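As a rough sketch of how such a DataFrame might be prepared (the file name, column names, and labeling rule below are assumptions based on the Kaggle dataset, not the exact code from the original post):

```python
import pandas as pd

# Assumed file and column names from the Kaggle movies dataset; adjust to your copy.
df = pd.read_csv("movies_metadata.csv")

# Keep only what we need and drop rows with missing overviews.
df = df[["overview", "genres"]].dropna(subset=["overview"])

# Binary label: 1 if "Drama" appears in the genres field, 0 otherwise.
df["is_drama"] = df["genres"].str.contains("Drama", na=False).astype(int)

print(len(df))
```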

1. Exploring the Data

Whenever we start working with a new dataset, and before we move on to making design decisions and creating a model, we have to get to know our data. Let’s answer some questions regarding our dataset:

  • What do overviews look like?

As we are hoping to use the overviews to determine if a movie is a drama movie, let’s look at some overviews.

  • How long are the overviews? The longest overview? Shortest overview?

  • How many drama movies do we have?

  • What are the most frequent words in the overviews? In the overviews of a specific genre?

  • Anything else that can help us understand our dataset better!

Exploring the data is important for the stages to come, as it gives us the information we need to make decisions in the following steps. A short exploration sketch covering these questions is shown below.
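Here is a minimal exploration sketch along these lines, assuming the DataFrame and column names introduced above:

```python
from collections import Counter

# A few raw overviews.
print(df["overview"].sample(3, random_state=42).tolist())

# Overview lengths: shortest, longest, average.
lengths = df["overview"].str.len()
print(lengths.min(), lengths.max(), lengths.mean())

# Class balance: how many drama movies do we have?
print(df["is_drama"].value_counts())

# Most frequent words in drama overviews (very naive tokenization).
drama_words = " ".join(df.loc[df["is_drama"] == 1, "overview"]).lower().split()
print(Counter(drama_words).most_common(10))
```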

2. Cleaning & Preprocessing the Data

Now that we have a better understanding of the data in our hands, we can move on to cleaning it. The goal of this stage is to remove irrelevant parts of the data before we use it to create our model. What counts as "irrelevant parts" varies between problems and datasets, and in real-world problems it is often settled by experimentation.

  • Removing very short entries

During data exploration, we noticed that some of the overviews are very short. As we want to use the description to classify the genre, we can remove records with overviews shorter than 15 characters, since such short overviews are unlikely to be informative (this step is included in the cleaning sketch at the end of this section).

  • Removing punctuation

As we want to capture the difference between the words used to describe drama movies and those used to describe other movies, we can remove punctuation.

  • Lemmatization

Lemmatization is used to group together the different inflected forms of a word so they can be analysed as a single item. Let’s see an example:
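Below is a minimal sketch using NLTK's WordNetLemmatizer; the choice of library here is an assumption for illustration, and the original workshop may have used a different lemmatizer:

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("movies"))            # -> "movie"
print(lemmatizer.lemmatize("was", pos="v"))      # -> "be"
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"
```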

We can now apply lemmatization to our data:
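The sketch below applies all three cleaning steps (short-entry removal, punctuation removal, and lemmatization) to the DataFrame. It reuses the lemmatizer from the previous snippet; the clean_overview column name is my own choice for illustration:

```python
import string

# Drop records whose overview is shorter than the threshold discussed above.
df = df[df["overview"].str.len() >= 15]

def clean_text(text):
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lemmatize every token, word by word.
    return " ".join(lemmatizer.lemmatize(token) for token in text.lower().split())

df["clean_overview"] = df["overview"].apply(clean_text)
```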

After cleaning and preprocessing, our text is ready to be converted into features.

3. Train Test Split

Before we can create a model, we should split our data into a training set and a test set. The training set will be used to "teach" the model, and the test set will be used to evaluate how good the model is. In real-world scenarios we often split off a validation set as well, which we can use for hyperparameter tuning.

The code below splits our data into training and test sets. More examples can be found in the [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
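A minimal split sketch, with an illustrative 80/20 ratio and random_state (not necessarily the values used in the original post):

```python
from sklearn.model_selection import train_test_split

# Split the cleaned overviews and labels into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_overview"], df["is_drama"],
    test_size=0.2, random_state=42, stratify=df["is_drama"],
)
```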

4. Feature Engineering

At this point in building our pipeline, we are already familiar with our textual data, and it is "clean" and preprocessed. It is still text, though. As our final goal is to train a classifier on our data, it’s time to convert the text into a format machine learning algorithms can process and learn from, a format also known as "vectors".

There are many ways of encoding textual data into vectors, from basic and intuitive ones to state-of-the-art neural-network-based ones. Let’s examine some of them:

Count vectors

Count vectors describe text as a vector in which we store the number of occurrences of every word in our vocabulary. We can choose the length of the vectors, which will also be the size of our vocabulary.

You can read more about count vectors and the different parameters used when creating them here.
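A minimal count-vector sketch with scikit-learn's CountVectorizer; the vocabulary size and stop-word handling are illustrative choices, not necessarily those of the original post:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build count vectors; max_features caps the vocabulary size.
count_vectorizer = CountVectorizer(max_features=5000, stop_words="english")
X_train_counts = count_vectorizer.fit_transform(X_train)  # fit on training data only
X_test_counts = count_vectorizer.transform(X_test)        # reuse the fitted vocabulary
```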

TF-IDF vectors

Using TF-IDF vectors, we can address one of the problems we encounter with count vectors: non-meaningful words that appear many times within our textual data. When creating TF-IDF vectors, we multiply the number of occurrences of every word in our vocabulary by a factor that reflects how rare the word is across the other documents in our corpus. This way, words that appear in many documents within our corpus get smaller values in our feature vector.

You can read more about TF-IDF vectors and the different parameters used when creating them here.
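A matching sketch using TfidfVectorizer, with the same illustrative parameters as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectors: term counts weighted down for words that appear in many documents.
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```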

5. Modeling

After using a vectorizer, our movie overviews are no longer represented as text but as vectors, which allows us to use our data to train a model.

We can now use our training set to train a model. Here we train a Multinomial Naïve Bayes classifier, but many other algorithms available in scikit-learn can be applied similarly, and when solving a new problem we will often try more than one.
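A minimal training sketch, assuming the TF-IDF features from the previous step (count vectors would work just as well):

```python
from sklearn.naive_bayes import MultinomialNB

# Train a Multinomial Naive Bayes classifier on the training features.
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predict labels for the test set; these predictions are used in the evaluation step.
y_pred = model.predict(X_test_tfidf)
```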

6. Evaluation

After training a model, we want to evaluate how well it performs on new data. For this reason, we divided our data and set aside a test set. We will now use the predictions our model makes for the test set to evaluate it.

We will create a confusion matrix based on the correct answers and the errors our model has made.
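A minimal sketch using scikit-learn's confusion_matrix on the test-set predictions from the previous step:

```python
from sklearn.metrics import confusion_matrix

# Rows are the true labels, columns the predicted labels:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
```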

Confusion Matrix for Binary Classification

We can now use the values created by our confusion matrix to calculate some other metrics:
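For example, accuracy, precision, recall, and F1 can be computed directly from the predictions, a sketch using scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
```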

Usually, our first attempt at building a model won’t have amazing metrics. This is a great time to go back to the earlier stages and try doing some of them differently.

7. Inference: Using Our Model for New Samples

After using our data to create a model that classifies whether a movie is a drama based on its overview, and after evaluating that model, we can use it to decide for new samples: drama or not drama?

As our original overviews have had a journey before we used them to train/evaluate our model, we need to process our new samples the same way:
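A sketch with two made-up overviews, reusing the clean_text helper and the fitted TF-IDF vectorizer from the earlier snippets (both names are assumptions from those sketches, not the original code):

```python
# Hypothetical new overviews, made up for illustration.
new_overviews = [
    "A grieving mother struggles to rebuild her life after a tragic loss.",
    "A team of wisecracking raccoons plans the ultimate candy heist.",
]

# Apply the same cleaning and vectorization used for the training data.
new_clean = [clean_text(text) for text in new_overviews]
new_features = tfidf_vectorizer.transform(new_clean)
```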

We can now use our model to predict the genre of our new samples:
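Continuing the sketch above:

```python
# 1 means "drama", 0 means "not drama".
print(model.predict(new_features))
```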

Our predictions are correct!

Summary

In this post, we created a pipeline for a supervised text classification problem. Our pipeline is composed of several parts that are linked to one another (like an actual pipeline!). We hope this post helped you understand the role each part of the process plays in creating the final product: a text classifier.

