Natural Language Processing Classification Using Deep Learning And Word2Vec

Published in Towards Data Science · 11 min read · Jun 15, 2019

Introduction

I have used machine learning algorithms before for different problems, like predicting currency exchange rates or classifying images. Recently I had to work on a text classification project, and I read a lot of literature on the subject. NLP (Natural Language Processing) is a fascinating case. When you begin to think about it, you realize that it’s not that simple, and before the classification there is still this question:
“How the hell can an algorithm read words?” One solution is to transform words into vectors to have a numerical representation of them. This solution is far from new: a few years ago, an article presented Google’s unsupervised Word2Vec algorithm: Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013). A lot of documentation about it can be found, but the point of this article is to detail from A to Z how to build a machine learning algorithm for text classification. I will demonstrate how to use Word2Vec with the pre-trained Google News dataset, and how to train it yourself on your own data. I will then demonstrate two techniques: one is to take the mean of the word vectors of each document, and the other is to keep all the word vectors as they are, which keeps more information but is a bit more complicated and takes more time to train. So it is up to you to decide which is better for your case and your data.

1 FIRST WE NEED TO IMPORT THE DATA

For this step, make sure that the folder that contains your reviews is in the same folder as the notebook.

The data I used are movie reviews that can be found here: Movie reviews. I took the “sentence polarity dataset v1.0”. I chose it because I can compare my results with the paper Convolutional Neural Networks for Sentence Classification (Yoon Kim, 2014). This paper not only presents a neural network for this dataset, it also compares its results to other algorithms in its Table 2, which is really interesting because it gives us many algorithms from different papers to compare our results with.

Extract the file that you downloaded from the link.
We now have one folder named “rt-polaritydata”, with two files in it named “rt-polarity.neg” and “rt-polarity.pos” (negative reviews and positive reviews, respectively). Our job here is to put all the data into pandas data frames to analyze them. We begin by converting the files into CSV files.

Now we create the “labels” of our data: 1 means a positive review, and 0 a negative review.
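As a minimal sketch of these two steps, the function below reads both files and attaches the labels. The function name load_reviews, the column names “text” and “label”, and the latin-1 encoding are assumptions made for illustration, not details taken from the original notebook.

```python
import pandas as pd
from pathlib import Path


def load_reviews(folder="rt-polaritydata"):
    """Read both polarity files into one labelled DataFrame."""
    frames = []
    # 1 marks a positive review, 0 a negative one.
    for filename, label in [("rt-polarity.pos", 1), ("rt-polarity.neg", 0)]:
        path = Path(folder) / filename
        # latin-1 is an assumption about the encoding of the files.
        with open(path, encoding="latin-1") as handle:
            lines = [line.strip() for line in handle if line.strip()]
        frames.append(pd.DataFrame({"text": lines, "label": label}))
    return pd.concat(frames, ignore_index=True)
```

Calling reviews = load_reviews() from the notebook folder should then give one row per review, and reviews.to_csv("reviews.csv", index=False) would write it out as a CSV file if you want to keep a copy.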

The result should now look as follows:

Figure 1: Our Dataframe, with the text of the review, and its label

Okay, it looks great! Now we have every review in our pandas dataframe named “reviews”, each with its label (1 for a positive review, 0 for a negative review).

2 USING Word2Vec TO SEE SIMILARITY DISTANCE BETWEEN OUR WORDS

Word2Vec is a neural network model used for word embedding. It is mainly used to capture the similarity of the contexts in which words appear. We will use the model to get a distance between all of our words, to see which ones are semantically close to each other. There are other models, but I chose this one for 2 reasons:

It’s the one used by Yoon Kim in his article
It’s the model developed by Google. It seems to be widely recommended, documentation can easily be found, and the article Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013) explains the whole process well.

2.1 Tokenization

Now, your dataframe should look like this:

Figure 2 : The dataframe, with the tokens

It is important for the training that every review is represented as a list of words, as in the “tokens” column.
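One possible way to produce such a “tokens” column is sketched below. The regular expression and the lowercasing are assumptions of this sketch (the original notebook may have used a dedicated tokenizer), and the two reviews are made up.

```python
import re

import pandas as pd

# Keeps letters, digits and apostrophes; everything else separates tokens.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Two made-up reviews standing in for the real dataframe.
reviews = pd.DataFrame({
    "text": ["A charming, funny film.", "It doesn't work."],
    "label": [1, 0],
})
reviews["tokens"] = reviews["text"].apply(tokenize)
print(reviews["tokens"].tolist())
```

Keep in mind that the pre-trained Google News vectors are case-sensitive, so lowercasing is a design choice that trades some vocabulary coverage for a smaller vocabulary.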

2.2 Use the pre-trained Google News dataset

First, you need to download the dataset here: Google news Dataset.
Then, extract it into your folder. I extracted it into a subfolder named “model”.

It’s as simple as that! Now you have your model named “w2v_model” that is trained and contains every word in the dataset represented as vectors.

2.2.1 Training the model on your data

You could also train the model on your own data. However, I do not recommend this for short documents, because Word2Vec will not be able to properly capture the context of your words, and the result will not be satisfying. I tested it on my data for this article, and the results were notably better with the pre-trained Google Word2Vec. On another dataset with a mean of 200 words per document, training on my own data was more reliable and in some cases showed even better results than the pre-trained model.

We will separate the work into 3 steps:

Word2Vec(): initialize the model with all its parameters
.build_vocab(): build the vocabulary from a sequence of sentences
.train(): train the model

2.3 The results

Now we can test our model with some words, to see which ones are the most similar to them.
We test it with:

movie
fiction
good
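With gensim, such a query is typically a single call like w2v_model.most_similar("good"), which ranks words by cosine similarity. To show what that ranking does, here is a small numpy sketch of the same idea on made-up 3-dimensional vectors. The vectors and the helper function are purely illustrative.

```python
import numpy as np

# Made-up 3-dimensional vectors; real Word2Vec vectors have 300 dimensions.
vectors = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.3]),
    "bad": np.array([0.7, -0.4, 0.3]),
    "fiction": np.array([-0.2, 0.9, 0.1]),
}


def most_similar(word, vectors, topn=3):
    """Rank the other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = {
        other: float(vec @ query / np.linalg.norm(vec))
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


print(most_similar("good", vectors))
```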

For the word “good”, I get these results:

Figure 3 : Words that are the most similar to “good”

These results are obtained with the pre-trained Google News dataset.

However, we can see that the model is not perfect and does not fully capture the meaning of words, because we get [great, bad, terrific, decent]. This could be a problem, because “good” is “semantically” close to “bad” here. The two words can indeed be used in the same contexts, but their meanings are opposite.

2.5 A little bit of data visualization

Figure 4 is a plot of 10 000 words of our dataset. Words that are semantically close are next to each other on the map. I used bokeh to make the map dynamic: you can interact with it and hover over a dot to see the word that corresponds to it.
We can now clearly see the relations between the words, and which ones are close or distant.
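A map like this needs two ingredients: a projection of the 300-dimensional vectors to 2 dimensions, and an interactive scatter plot. The sketch below uses a plain PCA for the projection, which is an assumption (the article does not say which method was used, and t-SNE is a common alternative). The function names and the random vectors are also illustrative.

```python
import numpy as np


def project_2d(matrix):
    """Project row vectors onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def plot_words(words, coords, path="words.html"):
    """Write an interactive scatter plot; hovering over a dot shows its word."""
    # Imported here so the projection can be used without bokeh installed.
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure, output_file, save

    source = ColumnDataSource({"x": coords[:, 0], "y": coords[:, 1], "word": list(words)})
    plot = figure(title="Word map", tools="pan,wheel_zoom,reset,hover",
                  tooltips=[("word", "@word")])
    plot.scatter("x", "y", source=source, size=6)
    output_file(path)
    save(plot)


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 300))  # stand-in for real word vectors
coords = project_2d(toy_vectors)
print(coords.shape)
```

Calling plot_words(words, coords) with the matching list of words then writes an HTML file that can be opened in a browser.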

Figure 4 : Bokeh chart of 10000 words of our dataset

3 A LITTLE BIT OF WORK ON THE DATA

3.1 Train test split

Now that we have our data frame, we need to separate our data into a training set and a testing set. With the training set, our algorithm will learn its parameters, and with the testing set, we will evaluate them.
We separate training and testing to check for overfitting, which is a recurrent problem in deep learning. Overfitting means that our model has great results on the data it learned from, but struggles to generalize and has bad results on other data, which is clearly not the goal.
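A split like this is commonly done with scikit-learn’s train_test_split. In the sketch below, test_size=0.2 and random_state=42 are assumptions, and the data are placeholders. With 10662 items, a 20 % test share yields the 8529 training texts and 2133 test texts mentioned later in this article.

```python
from sklearn.model_selection import train_test_split

# Placeholders for the "tokens" and "label" columns of the dataframe.
tokens = [["review", str(i)] for i in range(10662)]
labels = [i % 2 for i in range(10662)]

# test_size and random_state are assumed values, not taken from the article.
X_train, X_test, y_train, y_test = train_test_split(
    tokens, labels, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```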

3.2 Building the vectors

Here we use the TfidfVectorizer from sklearn. It gives each word a weight that reflects its importance in a document relative to the whole corpus.
We use the line tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)) to put every word and its inverse document frequency (idf) weight in a dictionary named tfidf. Note that recent versions of scikit-learn replace get_feature_names() with get_feature_names_out().
It is a tip that I found on this amazing blog by Ahmed BESBES. It is really interesting and deserves to be read.

Now just for fun, and to visualize it, I used WordCloud to picture the 100 most important words of the dictionary we built. We can see words like play, movie, scene, and story, which are obviously important for a dataset of movie reviews.
I used another blog post by Ahmed BESBES to learn how to use this library.

Figure 5 : The most “important” words in our corpus

Now we will build a function that calculates the “mean” of a given review. Our w2v_model gives us a vector for each word, so we multiply each word vector by the word’s importance in the “dictionary”: w2v_model[word].reshape((1, size)) * tfidf[word].
Note: we use the reshape function so that each text yields a row of shape (1, 300). As we have for example 8529 texts in X_train, applying this function to all of them gives a two-dimensional matrix of shape (8529, 300).

8529 stands for the number of texts in our corpus
300 stands for the size of the vector created by Word2Vec.

And that’s it: we sum these weighted vectors, divide by the number of words, and we get the mean.
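A sketch of such a function is shown below. Unlike the article’s buildWordVector(tokens, size), this version takes the model and the tfidf dictionary as explicit arguments so that it runs on its own. Skipping words that are missing from either lookup is an assumption, and the toy lookups are made up.

```python
import numpy as np


def build_word_vector(tokens, size, w2v_model, tfidf):
    """Average the word vectors of `tokens`, each weighted by its tfidf value."""
    vec = np.zeros((1, size))
    count = 0
    for word in tokens:
        try:
            vec += w2v_model[word].reshape((1, size)) * tfidf[word]
            count += 1
        except KeyError:
            # Assumption: words missing from either lookup are skipped.
            continue
    if count != 0:
        vec /= count
    return vec


# Toy lookups standing in for the real model and tfidf dictionary.
toy_w2v = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 2.0])}
toy_tfidf = {"good": 2.0, "movie": 1.0}

toy_reviews = [["good", "movie"], ["good", "unknown"]]
matrix = np.concatenate(
    [build_word_vector(tokens, 2, toy_w2v, toy_tfidf) for tokens in toy_reviews]
)
print(matrix.shape)
```

Stacking the (1, size) rows with np.concatenate is what produces the two-dimensional matrix with one row per text.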

The calculation can be summarized as follows:

Figure 6: Formula of the mean of the word vectors, weighted by their tf-idf

Where :

n is the number of words in the text
Wi is the Word2Vec vector of size 300 for a given word i
Ti is the tfidf value for a given word i
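Written out with these symbols, the formula presumably reads as follows. This is a reconstruction from the definitions above, and V is a name introduced here for the resulting mean vector.

```latex
V = \frac{1}{n} \sum_{i=1}^{n} W_i \, T_i
```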

Now we apply this function to our data.
buildWordVector takes two arguments, tokens and size. The size is 300 because the vectors of our Word2Vec model have 300 dimensions. The tokens change at each iteration of a loop that covers all of the 8529 texts of our training corpus and the 2133 texts of our test corpus.

4 THE FIRST NEURAL NETWORK

The first neural network is a simple artificial neural network with only two dense layers and a dropout of 0.7 to avoid overfitting. As input, it takes the mean of the word vectors of a given review.

4.1 Build the neural network

Here are the characteristics of this simple classifier.

Number of Dense layers : 2
Activation Function : relu, and sigmoid for the last dense layer
Dropout : 0.7
Optimizer : Adadelta
Loss : Binary Cross Entropy
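These characteristics can be sketched in Keras as follows. The width of the hidden layer (128 units) and the function name are assumptions, since the list above does not give the layer sizes.

```python
from tensorflow import keras


def build_dense_classifier(input_dim=300, hidden_units=128):
    """Two dense layers with dropout in between, ending in a sigmoid unit."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        # hidden_units is an assumed width, not a value from the article.
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dropout(0.7),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_dense_classifier()
model.summary()
```

The single sigmoid output gives a value between 0 and 1 that can be read as the probability of a positive review.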

4.2 Train the neural network

Now we train our neural network on our training data with a batch_size of 50, for 20 epochs.
Running more epochs does not seem to change the accuracy. It may be useful to do a grid search over different values of batch_size and number of epochs to find the best parameters.

Finally, we plot the history of the training to see its evolution, and to compare the training and test accuracy.
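The training call and the history it returns can be sketched as follows. The data are random placeholders, so the accuracies are meaningless, and the hidden layer width is again an assumption. The point is only to show the batch size, the number of epochs, and where the plotted curves come from.

```python
import numpy as np
from tensorflow import keras

# Random placeholders for the mean review vectors and their labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 300)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 300)), rng.integers(0, 2, size=50)

model = keras.Sequential([
    keras.Input(shape=(300,)),
    keras.layers.Dense(128, activation="relu"),  # 128 units is an assumed width
    keras.layers.Dropout(0.7),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    batch_size=50, epochs=20, verbose=0)

# history.history holds one value per epoch for each curve to plot.
for name, values in history.history.items():
    print(name, round(values[-1], 4))
```

Plotting history.history["accuracy"] against history.history["val_accuracy"], and likewise for the loss, gives curves like the ones in the figure.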

Figure 8: accuracy and loss for the first classifier

In the end, we have a training accuracy of 0.8342 and a test accuracy of 0.7286. That’s not bad, and what is important to notice is that we don’t have a lot of overfitting.

5 A CONVOLUTIONAL NEURAL NETWORK

CNNs are mainly used for image classification, because their filters can recognize patterns. But in 2014, Yoon Kim showed in his article that they could be useful for text classification too. This idea is not totally crazy, because sentences have patterns too.

5.1 Build the neural network

First, we need to find all the parameters to construct our neural network. This one will be a CNN, but instead of feeding it the mean of all the word vectors in a sentence, we’ll give it every word vector of the sentence.
Also, the structure changes a little bit, with more neurons in each layer.

Our neural network closely follows the one constructed by Yoon Kim (2014) that I mentioned above.

Number of Convolutional layers : 3
Number of Dense layers : 2
Number of Feature maps : 128 per Convolution
Activation Function : relu, and sigmoid for the last dense layer
Filter size : 3, 4, and 5
Dropout : 0.5
Optimizer : Adadelta
Loss : Binary Cross Entropy

There are a few differences between this CNN and the one used by Yoon Kim:
1. He had just 1 Dense layer
2. He never used sigmoid
3. He used 100 feature maps per convolution instead of 128

However, I had better results with these little changes, so I kept them.

To build it, we need some parameters: the embedding dimension (the size of a word2vec vector), the maximum vocabulary size (how many unique words we have), and the maximum sequence length (the maximum number of words per review).
All of these parameters can be computed from the data. If you test it with another dataset, just update the three variables accordingly.
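A minimal way to obtain the three parameters from tokenized reviews is sketched below. The variable names and the toy reviews are assumptions. Only the embedding dimension of 300 comes from the article.

```python
# Toy tokenized reviews standing in for the "tokens" column.
token_lists = [
    ["a", "good", "movie"],
    ["a", "bad", "movie", "with", "a", "good", "story"],
]

# Every distinct word across the corpus.
vocabulary = sorted({word for tokens in token_lists for word in tokens})

EMBEDDING_DIM = 300  # size of one Word2Vec vector, as stated in the article
MAX_VOCAB_SIZE = len(vocabulary)
MAX_SEQUENCE_LENGTH = max(len(tokens) for tokens in token_lists)

print(MAX_VOCAB_SIZE, MAX_SEQUENCE_LENGTH)
```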

Now we create the train and test inputs we will use in our CNN. Each document that has fewer words than the maximum is completed with “0”. This does not change our results, because a CNN recognizes patterns, and a pattern stays the same whether it appears at one position or another. For an image, the equivalent would be adding black borders to an image that is smaller than the others: this does not change the image.
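The padding itself can be sketched with numpy as follows. The word-to-id mapping, the function name, and the choice to pad at the end of each review are assumptions. Keras also offers a pad_sequences helper, which pads at the start by default.

```python
import numpy as np


def to_padded_ids(token_lists, word_index, max_len):
    """Map words to integer ids and right-pad each review with zeros."""
    padded = np.zeros((len(token_lists), max_len), dtype="int32")
    for row, tokens in enumerate(token_lists):
        # Unknown words are dropped here, which is an assumption of this sketch.
        ids = [word_index[word] for word in tokens if word in word_index][:max_len]
        padded[row, :len(ids)] = ids
    return padded


# Id 0 is reserved for padding, so real words start at 1.
word_index = {"a": 1, "good": 2, "movie": 3, "bad": 4}
X = to_padded_ids([["a", "good", "movie"], ["bad"]], word_index, max_len=5)
print(X)
```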

5.2 Define the CNN
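The architecture described in 5.1 can be sketched in Keras as follows. Arranging the three convolutions in parallel follows Yoon Kim’s design and is an assumption about the original notebook. The same holds for the hidden layer width, the frozen embedding, and the function name. The embedding matrix here is random and only stands in for the real Word2Vec vectors.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers


def build_cnn(embedding_matrix, max_len, filter_sizes=(3, 4, 5),
              feature_maps=128, hidden_units=128):
    """Parallel convolutions over word embeddings, then two dense layers."""
    vocab_size, embedding_dim = embedding_matrix.shape
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    embedded = layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,  # keeping the vectors frozen is an assumption
    )(inputs)
    # One convolution per filter size, each reduced to its strongest activation.
    pooled = [
        layers.GlobalMaxPooling1D()(
            layers.Conv1D(feature_maps, size, activation="relu")(embedded))
        for size in filter_sizes
    ]
    merged = layers.Dropout(0.5)(layers.concatenate(pooled))
    hidden = layers.Dense(hidden_units, activation="relu")(merged)
    outputs = layers.Dense(1, activation="sigmoid")(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Random matrix standing in for the real (vocabulary size, 300) embedding matrix.
toy_matrix = np.random.default_rng(0).normal(size=(50, 300))
model = build_cnn(toy_matrix, max_len=20)
model.summary()
```

Row 0 of the embedding matrix corresponds to the padding id, so the matrix needs one more row than there are words in the vocabulary.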

The result of the summary should be as follows:

And let’s go for a training session of 10 epochs, again with a batch size of 50!

Figure 11: accuracy and loss for the cnn

At the end of the 10 epochs, we have an accuracy of 0.915 on the training set and of 0.7768 on the testing set. We have a little overfitting, and the validation loss is quite unstable, but the results are there. I trained it for more epochs, but this seems to be the best we can get.

6 CONCLUSIONS

We can clearly see that the CNN is better for this task, and I personally had the same results with my other data frames.
But it still has a drawback: it is much deeper, has a lot more parameters, and takes more time to train. For this small dataset the difference is not that important, but I had to train it on data for my job, and the simple classifier took 13 minutes to train while the CNN took 5 hours! So it’s up to you to decide which one you want to use.
Both classifiers still show good results, and I noticed that the more data they have and the longer the documents are, the better they perform. For a dataset of 70 000 documents with a maximum document length of 2387, my test accuracy was 0.9829, so it’s pretty encouraging!

7 PERSPECTIVES

I have two main ideas to try to get better results. First, instead of the first classifier, we could use a more complex neural network, like a Recurrent Neural Network (CA-RNN: Using Context-Aligned Recurrent Neural Networks for Modeling Sentence Similarity (Chen, Hu et al., 2018)), or an attention network, which is beginning to be used now (Hierarchical Attention Networks for Document Classification (Yang et al., 2016)).
The second idea concerns the word embedding. In 2018, Google presented a new model called BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang et al., 2018)), which has the advantage of splitting words into sub-word tokens. For example, if we have the word “archeologist” in our data, it can learn the pieces this word is made of, and when a related word like “archeology” appears, it can relate it to “archeologist”, whereas Word2Vec would just ignore a word it doesn’t know.