
Model Selection in Text Classification

Christophe Pere
Towards Data Science
9 min read · May 20, 2020


Update 2020–11–24: Added a resource in the conclusion.

Update 2020–08–21: The pipeline can now be used with binary and multiclass classification problems. There is still an error with the transformers when saving the model.

Update 2020–07–16: The pipeline can now save models; bugs corrected inside the main function. AdaBoost, CatBoost, LightGBM, ExtraTreesClassifier added.

Introduction:

In the beginning, there was a simple problem. My manager came to me and asked if we could classify emails and their associated documents with NLP methods.

Doesn’t sound very funky, but I had thousands of samples to start with. The first thing asked was to use “XGBoost” because “we can do everything with XGBoost.” An interesting job, if data science all comes down to XGBoost…

After implementing a notebook with different algorithms, metrics, and visualizations, something was still on my mind: I couldn’t select between the different models in one run, simply because you can get lucky with one model and not know how to reproduce its good accuracy, precision, etc.

So at this moment, I asked myself: how do you do model selection? I looked on the web, read posts, articles, and so on. It was very interesting to see the different ways to implement this kind of thing. But it was blurry when it came to neural networks. I had one thing in mind: how to compare classical methods (multinomial Naïve Bayes, SVM, logistic regression, boosting…) with neural networks (shallow, deep, LSTM, RNN, CNN…).

I present here a short explanation of the notebook. Comments are welcome.

The notebook is available on GitHub: here
The notebook is available on Colab: here

How to start?

Every project starts with exploratory data analysis (EDA for short), followed directly by preprocessing (the texts were very dirty: signatures in emails, URLs, email headers, etc.). The different functions are presented in the GitHub repository.

A quick way to see if the preprocessing is correct is to look at the most common n-grams (uni-, bi-, tri-… grams). Another post will guide you through this.

Data

We will apply the model selection method to the IMDB dataset. If you are not familiar with it, the IMDB dataset contains movie reviews (text) for sentiment analysis (binary: positive or negative).
More details can be found here. To download it:

$ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar -xzf aclImdb_v1.tar.gz

Vectorizing methods

One-Hot encoding (CountVectorizer):
It’s the method where words are replaced by vectors of 0’s and 1’s. The goal is to take a corpus (a large body of text) and build a vector for each unique word it contains. Each word is then projected into this vector space, where 0 indicates that the word is absent and 1 indicates that it is present.

       | bird | cat | dog | monkey |
bird   |  1   |  0  |  0  |   0    |
cat    |  0   |  1  |  0  |   0    |
dog    |  0   |  0  |  1  |   0    |
monkey |  0   |  0  |  0  |   1    |

The corresponding python code:

from sklearn.feature_extraction.text import CountVectorizer

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df[TEXT])  # text without stopwords
# transform the training and validation data using the count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

TF-IDF:
Term Frequency-Inverse Document Frequency is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus (source: tf-idf).
This method is powerful when dealing with a large number of stopwords (words that carry no information: I, me, my, myself, we, our, ours, ourselves, you… in English). The IDF term brings out the important and rare words.

from sklearn.feature_extraction.text import TfidfVectorizer

# word-level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
tfidf_vect.fit(df[TEXT])
xtrain_tfidf = tfidf_vect.transform(train_x_sw)
xvalid_tfidf = tfidf_vect.transform(valid_x_sw)

TF-IDF n-grams:
Unlike the previous tf-idf, which is based on single words, the tf-idf n-grams take n successive words into account.

# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram.fit(df[TEXT])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x_sw)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x_sw)

TF-IDF chars n-grams:
Same as the previous method, but at the character level: here the method looks at n successive characters.

# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram_chars.fit(df[TEXT])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x_sw)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x_sw)

Pre-trained model — FastText

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (source: here)

How to load fastText? From the official documentation:

$ git clone https://github.com/facebookresearch/fastText.git 
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Download the right model. You have models for 157 languages here. To download the English model:

$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
$ unzip crawl-300d-2M-subword.zip

When the download is complete, load it in Python:

import fasttext

pretrained = fasttext.load_model('crawl-300d-2M-subword.bin')
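
As a quick sanity check (the word 'movie' here is just an example), each word maps to a 300-dimensional vector:

vec = pretrained.get_word_vector('movie')  # works even for out-of-vocabulary words, via subwords
print(vec.shape)  # (300,)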

Word Embeddings or Word vectors (WE):

Another popular and powerful way to associate a vector with a word is the use of dense “word vectors”, also called “word embeddings”. While the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros) and very high-dimensional (same dimensionality as the number of words in the vocabulary), “word embeddings” are low-dimensional floating point vectors (i.e. “dense” vectors, as opposed to sparse vectors). Unlike word vectors obtained via one-hot encoding, word embeddings are learned from data. It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or higher (capturing a vocabulary of 20,000 tokens in this case). So, word embeddings pack more information into far fewer dimensions. (source: Deep Learning with Python, François Chollet 2017)

How to map sentences to sequences of integers:

import numpy as np
from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

# create a tokenizer
token = Tokenizer()
token.fit_on_texts(df[TEXT])
word_index = token.word_index
# convert text to sequences of tokens and pad them to ensure equal-length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=300)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=300)
# create the token-embedding mapping from the pretrained fastText model
embedding_matrix = np.zeros((len(word_index) + 1, 300))
words = []
for word, i in tqdm(word_index.items()):
    embedding_vector = pretrained.get_word_vector(word)
    words.append(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Model Selection

What is model selection in computer science? Specifically in the field of AI? Model selection is the process of choosing between different machine learning approaches. So in short, different models.

But how can we compare them? To do that, we need metrics (see this link for more details). The dataset will be split into train and test parts (a validation set will be carved out inside the deep learning models).

What sort of metrics do we use in this binary or multiclass classification model selection?

An aside:
For classification we will use the following terms:
- tp: true positive prediction
- tn: true negative prediction
- fp: false positive prediction
- fn: false negative prediction
Here is a link for more details.

  • Accuracy: the proportion of correct predictions among all predictions
    (tp + tn) / (tp + tn + fp + fn)
  • Balanced Accuracy: defined as the average of the recall obtained on each class, useful for imbalanced datasets
  • Precision: intuitively, the ability of the classifier not to label as positive a sample that is negative: tp / (tp + fp)
  • Recall (or sensitivity): intuitively, the ability of the classifier to find all the positive samples: tp / (tp + fn)
  • f1-score: can be interpreted as a weighted average of precision and recall -> 2 * (precision * recall) / (precision + recall)
  • Cohen’s kappa: a score that expresses the level of agreement between two annotators on a classification problem. A value below 0.4 is pretty bad, between 0.4 and 0.6 it’s comparable to a human, 0.6 to 0.8 is a great value, and above 0.8 is exceptional.
  • Matthews Correlation Coefficient: “The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications […] The coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.” (source: Wikipedia)
  • Area Under the Receiver Operating Characteristic Curve (ROC AUC)
  • Fit time: the time needed to train the model
  • Score time: the time needed to predict results
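
As a quick illustration (with toy arrays, not the IMDB data), most of these metrics are one-liners in scikit-learn:

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef,
                             roc_auc_score)

y_true = [0, 1, 1, 0, 1]  # toy ground truth
y_pred = [0, 1, 0, 0, 1]  # toy predictions

print(accuracy_score(y_true, y_pred))           # (tp + tn) / total
print(balanced_accuracy_score(y_true, y_pred))  # mean recall per class
print(precision_score(y_true, y_pred))          # tp / (tp + fp)
print(recall_score(y_true, y_pred))             # tp / (tp + fn)
print(f1_score(y_true, y_pred))                 # combines precision and recall
print(cohen_kappa_score(y_true, y_pred))
print(matthews_corrcoef(y_true, y_pred))
print(roc_auc_score(y_true, y_pred))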

Great! We have our metrics. What’s next?

Cross-Validation

To make a robust comparison between models, we need to validate the robustness of each one.

The figure below shows that the global dataset first needs to be split into train and test data: train data to train the model, test data to test it. Cross-validation is then the process of splitting the training data into k folds, where k is the number of partitions of the data.

(Figure: k-fold cross-validation. Source: sklearn)

Generally, k is 5 or 10; it depends on the size of the dataset (small dataset, small k; big dataset, big k).

The goal is to compute each metric on each fold and compute their average (mean) and the standard deviation (std).

In Python, this process is done with the cross_validate function from scikit-learn.
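
For example, a minimal call (assuming the xtrain_count features built earlier and a train_y label array) looks like this:

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

scoring = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv = cross_validate(MultinomialNB(), xtrain_count, train_y, cv=5, scoring=scoring)

# one value per fold for each metric, plus fit and score times
print(cv['test_accuracy'].mean(), '+/-', cv['test_accuracy'].std())
print(cv['fit_time'].mean(), cv['score_time'].mean())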

What models will we compare?

Models evaluated

We will test machine learning models, deep learning models, and NLP specialized models.

Machine Learning Models

  • Multinomial Naïve Bayes (NB)
  • Logistic Regression (LR)
  • SVM (SVM)
  • Stochastic Gradient Descent (SGD)
  • k-Nearest-Neighbors (kNN)
  • RandomForest (RF)
  • Gradient Boosting (GB)
  • XGBoost (the famous) (XGB)
  • AdaBoost
  • CatBoost
  • LightGBM
  • ExtraTreesClassifier

Deep Learning Models

  • Shallow Neural Network
  • Deep neural network (and 2 variations)
  • Recurrent Neural Network (RNN)
  • Long Short Term Memory (LSTM)
  • Convolutional Neural Network (CNN)
  • Gated Recurrent Unit (GRU)
  • CNN+LSTM
  • CNN+GRU
  • Bidirectional RNN
  • Bidirectional LSTM
  • Bidirectional GRU
  • Recurrent Convolutional Neural Network (RCNN) (and 3 variations)
  • Transformers

That’s all, but 30 models is not bad.


Let’s show some code

Machine Learning

I will use the cross_validate() function from sklearn (version 0.23) for the classical algorithms, in order to take multiple metrics into account. The function below, report, takes a classifier, the X and y data, and a custom list of metrics, and computes the cross-validation with them. It returns a dataframe containing the values of all the metrics, plus the mean and the standard deviation (std) of each.

Function to compute different metrics with Machine Learning Algorithm
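
The gist embed does not render here, so below is a minimal sketch of what report does, under the assumptions described above (sklearn >= 0.23, pandas available); the exact implementation is in the notebook:

import pandas as pd
from sklearn.model_selection import cross_validate

def report(clf, X, y, metrics, cv=5):
    """Cross-validate clf and return a one-row dataframe with mean and std per metric."""
    results = cross_validate(clf, X, y, cv=cv, scoring=metrics)
    row = {}
    for metric in metrics:
        scores = results['test_' + metric]
        row[metric + '_mean'] = scores.mean()
        row[metric + '_std'] = scores.std()
    # cross_validate also reports fit and score times (the last two metrics listed above)
    row['fit_time_mean'] = results['fit_time'].mean()
    row['score_time_mean'] = results['score_time'].mean()
    return pd.DataFrame([row])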

How to use it?

Here is an example for multinomial Naïve Bayes:

The if multinomial_naive_bayes: guard is there because this code is part of the notebook, which exposes boolean parameters at the beginning to toggle each model. All the code is available on GitHub and Colab.
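
A sketch of that pattern (the multinomial_naive_bayes flag and the report function come from the notebook; the feature matrices were built earlier):

from sklearn.naive_bayes import MultinomialNB

multinomial_naive_bayes = True  # notebook flag: toggles this model on or off
metrics = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1', 'roc_auc']

if multinomial_naive_bayes:
    results_nb = report(MultinomialNB(), xtrain_count, train_y, metrics)
    print(results_nb)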

Deep Learning

I haven’t found a function like cross_validate for deep learning, only posts about using k-fold cross-validation with neural networks. So here I will share a custom cross_validate function for deep learning, with the same inputs and outputs as the report function. It lets us compute the same metrics and compare all the models together.

k-Folds CrossValidation for Deep Learning with metrics for binary and multiclass classification
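
The gist does not render here, so below is a minimal sketch of such a function (not the notebook’s exact code); it assumes binary labels, numpy arrays for X and y, and a Keras model builder passed as build_fn:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def cross_validate_NN(build_fn, X, y, n_splits=5, epochs=5, batch_size=128):
    """k-fold cross-validation for a Keras model returned by build_fn."""
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        model = build_fn()  # fresh, untrained model for each fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs,
                  batch_size=batch_size, validation_split=0.1, verbose=0)
        pred = (model.predict(X[test_idx]) > 0.5).astype(int).ravel()
        scores.append({'accuracy': accuracy_score(y[test_idx], pred),
                       'precision': precision_score(y[test_idx], pred),
                       'recall': recall_score(y[test_idx], pred),
                       'f1': f1_score(y[test_idx], pred)})
    return scores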

The goal is to take a neural network builder function such as:
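
This is a sketch only, since the gist does not render here; it reuses the embedding_matrix and the maxlen of 300 from the word-embedding step:

from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

def shallow_network():
    """A shallow binary classifier on top of the pretrained embedding matrix."""
    model = Sequential([
        Embedding(embedding_matrix.shape[0], 300,
                  weights=[embedding_matrix], input_length=300,
                  trainable=False),
        GlobalAveragePooling1D(),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model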

And call this function inside the cross_validate_NN function. All the deep learning models receive the same treatment and can be compared together. For the full implementation of the different models, go to the notebook.

Results

Once all the models have computed the different folds and metrics, we can easily compare them in the dataframe. On the IMDB dataset, the best-performing models are:

Here I only show the results with accuracy > 80%, and only for the accuracy, precision, and recall metrics.

How to improve?

  • Build a function that takes a dictionary of models as an argument and chains all the work in a loop
  • Use distributed Deep Learning
  • Use TensorNetwork to accelerate Neural Networks
  • Use GridSearch for hyperparameter tuning

Conclusion

You now have a complete pipeline to perform model selection for text classification, with lots of parameters. All of the models run automatically. Enjoy.

Another great resource about Model Selection in Machine Learning Era (more theoretic article) was written by Samadrita Ghosh on Neptune.ai blog.
