fastText for Text Classification

I explore a fastText classifier for multi-class classification.

Shraddha Anala
Towards Data Science


I’ve explored 2 different NLP models for the task of text classification in my previous article. While I hadn’t planned on making it a series, I came across some newer models in the NLP space and decided to write about them.

If you’d like to, you can check the first article out, in which I focus on training my own word embedding and compare it against a pre-trained GloVe word embedding model.

I’m expanding with more posts on ML concepts + tutorials over at my blog!

We will use the fastText classifier to classify the quality of questions asked on Stack Overflow. Download the dataset from here, if you are following along with this tutorial.

What is fastText?

fastText is an open-source library, developed by the Facebook AI Research lab. Its main focus is on achieving scalable solutions for the tasks of text classification and representation while processing large datasets quickly and accurately.


I highly recommend going through Facebook’s own blog post and research paper regarding the motivation behind fastText and to understand how it does what it’s developed to do.

According to their research, fastText stacks up impressively against previously published state-of-the-art models, both in accuracy and in training and testing times.

It achieves this computational efficiency and accuracy by employing two techniques to address classification and the training of word representations of text.

1. Hierarchical Softmax

A softmax function is often used as the output activation in multi-class classification problems, producing the probability that a given input belongs to each of the k classes.

Hierarchical softmax proves to be very efficient when there is a large number of categories and a class imbalance is present in the data. Here, the classes are arranged in a tree structure instead of a flat, list-like one.

The hierarchical softmax layer is built on a Huffman coding tree, which assigns shorter paths to the more frequently occurring classes and longer paths to the rarer, more infrequent ones.

The probability that a given text belongs to a class is computed via a depth-first search down the nodes of the tree, and branches (or, equivalently, classes) with low probability can be discarded along the way.

For data with a huge number of classes, this reduces the cost of computing class probabilities from roughly linear to roughly logarithmic in the number of classes (on the order of O(log k) rather than O(k) for k classes), thereby speeding up classification significantly compared to a traditional flat softmax.

2. Word n-grams

Using only a bag-of-words representation of the text leaves out crucial sequential information, but taking full word order into account is computationally expensive for large datasets.

So as a happy medium, fastText incorporates a bag of n-grams representation along with word vectors to preserve some information about the surrounding words appearing near each word.

This representation is very useful for classification applications, because the contextual meaning of a few words strung together often carries the sentiment echoed by the whole piece of text, as the small sketch below illustrates.
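To make the idea concrete, here is a tiny illustrative sketch (my own toy function, not fastText's internal code) of what word bigrams look like for a short piece of text:

def word_ngrams(tokens, n=2):
    # Return the list of contiguous word n-grams for a list of tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "would love to visit again".split()
print(word_ngrams(tokens, n=2))
# ['would love', 'love to', 'to visit', 'visit again']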

Now that we have looked at some of the main features of fastText, let’s take a look at implementing fastText and achieving the classification task.

Installation

Follow these installation and setup instructions from FAIR. We will be implementing this project using Python.
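For reference, one common route (an assumption on my part, not the only option in FAIR's instructions) is to install the official Python bindings with pip, i.e. pip install fasttext, and then confirm the import works:

import fasttext  # should load without errors once the package is built and installed

print(fasttext.__name__)  # quick sanity check that the module imported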

Data Preparation

In order to train and evaluate this classifier, we’ll have to prepare our data in a format fastText expects.

fastText expects each line to start with the category, prefixed by ‘__label__’, followed by the input text, like so:

__label__positive I really enjoyed this restaurant. Would love to visit again.

Of course, we will apply some NLP preprocessing techniques to remove unwanted symbols and punctuation and to convert the text to lower case.

The code below takes care of adding the prefix ‘__label__’ to each row in the category column. We will use gensim’s simple_preprocess method to tokenize our questions and remove symbols.
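Here is a minimal sketch of that preparation step. The column names (‘Title’ for the question text, ‘Y’ for the quality label) and the file names are assumptions about the downloaded Stack Overflow CSVs, so adjust them to match your copy:

import pandas as pd
from gensim.utils import simple_preprocess

def to_fasttext_format(df, text_col="Title", label_col="Y"):
    # Prefix each label with '__label__' and append the cleaned, tokenized question text
    labels = "__label__" + df[label_col].astype(str)
    text = df[text_col].apply(lambda t: " ".join(simple_preprocess(str(t))))
    return labels + " " + text

train_df = pd.read_csv("train.csv")  # assumed file names
test_df = pd.read_csv("valid.csv")

with open("questions.train", "w", encoding="utf-8") as f:
    f.write("\n".join(to_fasttext_format(train_df)) + "\n")
with open("questions.test", "w", encoding="utf-8") as f:
    f.write("\n".join(to_fasttext_format(test_df)) + "\n")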

Training & Evaluation

After saving our DataFrames as text files, the next step is training and testing our model.

To improve the performance of our model, the wordNgrams parameter is set to 2. In other words, the model considers bigrams in addition to individual words.
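A minimal sketch of the training call, assuming the prepared training file is named questions.train as in the data-preparation sketch above; all other hyperparameters are left at their defaults here:

import fasttext

# wordNgrams=2 makes the classifier use word bigrams in addition to individual words
model = fasttext.train_supervised(input="questions.train", wordNgrams=2)

model.save_model("question_quality.bin")  # optional: persist the trained classifier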

There are two methods for testing our model, and they differ slightly.

The predict method returns the most likely label for a given input. I chose an observation that belonged to the category “HQ” and tested the model against it; as shown below, it correctly predicted the category “HQ”, and did so with a probability of 95%.
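A sketch of that single-example check; the question text here is illustrative rather than the exact observation from the test set:

# predict returns a tuple of (labels, probabilities); k defaults to 1
labels, probs = model.predict("how do i reverse a list in python without using the reverse method")
print(labels[0], probs[0])  # e.g. __label__HQ 0.95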


I used the test method to evaluate my classifier on the entire test dataset (15,000 samples), which yielded a precision at one (P@1) of 0.83 and a recall at one (R@1) of 0.83 as well.
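A sketch of that evaluation call, again assuming the held-out file is named questions.test:

# test returns (number of samples, precision@1, recall@1)
n_samples, precision_at_1, recall_at_1 = model.test("questions.test")
print(n_samples, precision_at_1, recall_at_1)  # e.g. 15000 0.83 0.83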


The precision is the number of correct labels among the labels predicted by the classifier, and the recall is the number of labels successfully predicted among all the real labels.

- Text Classification • fastText blog

In our case, as I haven’t specified a value for the parameter k, the model will by default predict only the one class it thinks the given input question belongs to.
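If you do want more than one candidate label, you can raise k; the snippet below (with an illustrative question) asks for the top three along with their probabilities:

labels, probs = model.predict("how do i reverse a list in python without using the reverse method", k=3)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 3))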

Conclusion

Compared to my previous models of training my own embedding and using the pre-trained GloVe embedding, fastText performed much better.

fastText was much, much faster than training neural networks on multi-dimensional word vectors, and also achieved good performance on the test set.

Thanks for stopping by and reading the article. Be on the lookout for more articles in this series, as I’ll be posting a couple more tutorials and learning about newer models.

Have a nice day and see you in my next article!
