Accuracy and performance measures for classifying some text corpora indicate that the naive Bayes classifier is a strong contender…
This is the third and final article in this series on using word-vectors with document-vectors for NLP tasks. The focus here is text classification. We summarize the results of combining word-vectors with document-vectors, and the recommendations that follow from them. The code can be downloaded from github. Let us jump right in with a quick summary of the past two articles.
- Similarity: A word-vector is a representation of a word as a numerical vector of some chosen length p. Word-vectors are derived by applying tools such as Word2vec, GloVe, and FastText to a text corpus. Words with similar meaning typically yield word-vectors whose cosine similarity is closer to 1 than to 0.
- Order Reduction: Combining word-vectors with bag-of-words document-vectors yields a much smaller model Z for the text corpus, usually a few orders of magnitude smaller, by a factor of m/p, where m is the number of unique words in the corpus and p is the length of the word-vectors. We built a scikit pipeline (vectorize => embed words => classify) to derive the reduced Z from the higher-order X with help from the word-vector matrix W. Each row of W is a p-dimensional numerical representation of a word in the text corpus.
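As a quick reminder of how this order reduction works, here is a minimal sketch. The matrix names X, W, and Z follow the earlier articles; the shapes and random values are placeholders, and the actual pipeline in the repo wraps this step in a scikit transformer.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_docs, m, p = 1000, 39000, 300          # documents, vocabulary size, word-vector length
X = sparse_random(n_docs, m, density=0.01, format='csr')   # sparse bag-of-words document vectors
W = np.random.rand(m, p)                 # word-vector matrix, one p-dimensional row per word

Z = X @ W                                # reduced document+word vectors
print(X.shape, '=>', Z.shape)            # (1000, 39000) => (1000, 300)
```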

With that we are ready to evaluate the impact of word-vectors on text classification, and to qualify the results against the options open to us at each pipeline step. We stick to a few major options for each of the three steps.
- Tokenization. We will use scikit’s CountVectorizer and TfidfVectorizer.
- Word-Vectors. We will evaluate Word2Vec/SGNS and FastText, both pre-trained and custom-generated (via Gensim).
- Classifiers. We will use scikit’s Multinomial Naive Bayes, Linear Support Vectors, and Neural Nets.
1. Simulations
We work with two document repositories: the large movie review data set from Stanford for binary sentiment classification, and the 20-news data set from the scikit-learn pages for multiclass classification. To keep it simple we stick to a single training set and a single test set. In the case of 20-news we do a stratified split with 80% for training and 20% for test. The IMDB movie review data set comes with predefined train and test sets. The code snippet (check the github code repo) uses a Tokens class that has methods to pull tokens from the index.
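The repo’s snippet itself is not reproduced here; the stratified 80/20 split for the 20-news case can be done along the following lines. This is a minimal sketch that pulls the corpus directly via scikit-learn rather than through the repo’s Tokens class.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# fetch the 20-news corpus and do a stratified split: 80% train, 20% test
news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
train_docs, test_docs, train_labels, test_labels = train_test_split(
    news.data, news.target, test_size=0.2, stratify=news.target, random_state=0)
```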

The model to run is an instance of the pipeline: a specific combination of vectorizer, transformer, and classifier. The snippet in the repo defines the model and runs the predictions. The simulations are driven by a shell script that loops over the different options for the pipeline steps.
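For illustration, a minimal sketch of what one such model could look like, and of looping over a couple of the options, is below. The option names and the plain vectorizer => classifier pipeline are simplifications; in the repo a word-embedding transformer sits between the two steps when document+word vectors are used.

```python
from itertools import product
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

vectorizers = {'counts': CountVectorizer(), 'tfidf': TfidfVectorizer()}
classifiers = {'nb': MultinomialNB(), 'linearsvc': LinearSVC()}

# each model is one combination of vectorizer and classifier
# (train_docs / train_labels / test_docs come from the split sketched earlier)
for vec_name, clf_name in product(vectorizers, classifiers):
    model = Pipeline([('vectorizer', vectorizers[vec_name]),
                      ('classifier', classifiers[clf_name])])
    model.fit(train_docs, train_labels)
    predicted = model.predict(test_docs)
```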

The mlp classifier actually consists of 9 variants, with 1, 2, or 3 hidden layers each having 50, 100, or 200 neurons. The results reported here are for 2 hidden layers with 100 neurons each. The other mlp runs basically serve to verify that the quality of the classification was not very sensitive to the number of hidden layers and neurons at this scale.
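A minimal sketch of how those nine variants can be set up with scikit-learn’s MLPClassifier is below; any settings beyond hidden_layer_sizes are left at their defaults here, which may differ from the repo.

```python
from itertools import product
from sklearn.neural_network import MLPClassifier

# 9 variants: 1, 2, or 3 hidden layers, each layer with 50, 100, or 200 neurons
mlp_variants = {
    (layers, neurons): MLPClassifier(hidden_layer_sizes=(neurons,) * layers)
    for layers, neurons in product([1, 2, 3], [50, 100, 200])
}
mlp = mlp_variants[(2, 100)]   # the variant whose results are reported in the figures
```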
Some combinations are not allowed, however, and they are skipped in the Python implementation. Those are:
- The naive Bayes classifier does not allow for negative values in the document vectors. But when we use document+word vectors, Z will have some negatives. It should be possible to translate/scale all vectors uniformly to avoid negatives, but we do not bother as we have enough simulations to run anyway. So the naive Bayes classifier is used ONLY with pure document vectors here.
- The pre-trained word-vectors are only available for normal words, not stemmed ones. So we skip the runs with the combination of stemmed tokens and pre-trained vectors.
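The skip logic can be expressed with a simple check along these lines; the option names here are made up for illustration, and the repo’s implementation may look different.

```python
def is_allowed(classifier, word_vectors, tokens):
    # naive Bayes cannot handle the negative values that show up in
    # document+word vectors, so it is paired only with pure document vectors
    if classifier == 'nb' and word_vectors != 'none':
        return False
    # pre-trained vectors cover normal words only, so skip them with stemmed tokens
    if tokens == 'stemmed' and word_vectors.startswith('pretrained'):
        return False
    return True
```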
2. Results
The only metrics we look at are the F-score for the quality of classification and the CPU time for efficiency. In the case of multiclass classification for the 20-news data set, the F-score is an average over all 20 classes. The run time for a classifier+vectors combination is averaged across all the runs with the same combination.
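For reference, a minimal sketch of measuring these two metrics is below, assuming a macro average of the per-class F-scores and process CPU time; the repo may average and time things differently.

```python
import time
from sklearn.metrics import f1_score

# model, train_docs, test_docs, train_labels, test_labels as in the earlier sketches
start = time.process_time()                       # CPU time rather than wall-clock time
model.fit(train_docs, train_labels)
predicted = model.predict(test_docs)
cpu_time = time.process_time() - start

fscore = f1_score(test_labels, predicted, average='macro')   # averaged over the 20 classes
print(f'F-score: {fscore:.3f}  CPU time: {cpu_time:.1f}s')
```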
2.1 Multiclass Classification of 20-News Data Set
The results for the 20-news data set are summarized in Figure 1 below.

There is a lot of detail crammed into the figure, so let us summarize it point by point.
1. Document Vectors vs Document+Word Vectors: A glance at 1A and 1B tells us that the classification quality in 1A is better, not by much perhaps, but consistently and across the board. That is, if classification quality is paramount, then pure document vectors seem to have an edge in this case.
2. Stopped vs Stemmed: A stemmed vocabulary yields shorter vectors, which is better for performance with all classifiers. This is especially true for the mlp classifier, where the number of input neurons equals the size of the incoming document vectors. When the words are stemmed, the number of unique words drops by about 30%, from 39k to 28k, a big reduction in the size of the pure document vectors.
- Document Vectors. Figure 1A shows that when pure document vectors are the basis for classification, stemming has no material impact on the F-scores obtained.
- Document+Word Vectors. There does seem to be some benefit, however, to using stemmed tokens in this case. While the improvements are small, the custom vectors obtained by training on stemmed tokens show better F-scores than those trained on stopped tokens, as Figure 1B shows.
3. Frequency Counts vs Tf-Idf: Tf-idf vectorization weights words differentially based on how commonly they occur in the corpus; for keyword-based search schemes it helps improve the relevance of search results (a tiny illustration follows this list).
- Document Vectors. While naive Bayes was not impressed by tf-idf, both the linearsvc and mlp classifiers yield better F-scores with tf-idf vectorization, as shown in Figure 1A.
- Document+Word Vectors. Figure 1B shows a good improvement in F-scores when tf-idf vectorization is used, both with pre-trained and custom word-vectors. The only exception seems to be when pre-trained word2vec vectors are used in conjunction with an mlp classifier. But on increasing the number of hidden layers to 3 (from 2) and the neurons per layer to 200 (from 100), tf-idf vectorization again yielded a better score.
4. Pre-trained Vectors vs Custom Vectors: This applies to Figure 1B alone. Custom word-vectors seem to have an edge.
- Word2Vec: Custom vectors clearly yield better F-scores, especially with tf-idf vectorization.
- FastText: Pre-trained vectors seem to be marginally better.
5. Timing Results: Figure 1C shows the average CPU time for the fit & predict runs.
- The much larger run time for the mlp classifier when pure document vectors are used is understandable: there are many input neurons to work with (39k if stopped, 28k if stemmed). This is one of the reasons for the emergence of word embeddings, as we discussed in the earlier posts.
- With a smaller but dense Z, the linearsvc classifier takes longer to converge.
- The naive Bayes classifier is the fastest of the lot.
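As a small aside on the tf-idf weighting mentioned in point 3 above, here is a tiny, self-contained illustration (plain scikit-learn, not tied to the repo’s code or data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the movie was good', 'the movie was bad', 'the plot was thin']

counts = CountVectorizer().fit_transform(docs)    # raw frequency counts
tfidf = TfidfVectorizer().fit_transform(docs)     # differentially weighted

# 'the' and 'was' occur in every document and get down-weighted by tf-idf,
# while rarer words like 'good', 'bad', and 'thin' keep relatively higher weights
print(counts.toarray())
print(tfidf.toarray().round(2))
```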
2.2 Binary Classification of Movie Reviews
Figure 2 below shows the binary classification results obtained for the movie review data set.

The observations here are not qualitatively different from the above, so we will not spend much time on them. The overall quality of classification is better here, with F-scores north of 0.8 (compared to around 0.6 for the 20-news corpus). But that is just the nature of this data set.
- Document Vectors vs Document+Word Vectors: Glancing at 2A and 2B we can say that document+word vectors seem to have an edge for classification quality overall (except for the one case when linearsvc is used with tf-idf). The opposite was the case for the 20-news data set.
- Stopped vs Stemmed: Stopped tokens seem to perform better in most cases. The opposite was true with the 20-news data set. Stemming results in a 34% reduction in the size of the vocabulary, from 44k to 29k. We applied the same stemmer to both data sets, but perhaps it was too aggressive for the nature of the text in this corpus.
- Frequency Counts vs Tf-Idf: Tf-Idf vectors performed better in most cases, just as they did in the 20-news data set.
- Pre-trained Vectors vs Custom Vectors: Custom word-vectors yield better F-scores than pre-trained vectors in all cases. This kind of confirms our assessment from the 20-news data set, where it was not so clear-cut.
- Timing Results: Naive Bayes is still the fastest of the lot.
3. So What Are the Big Takeaways?
We unfortunately cannot draw definite conclusions based on our somewhat shallow testing on just two data sets. Plus, as we noted above, there was some variance in classification quality for the same pipeline across the data sets. But at a high level we can perhaps conclude the following (take it with a grain of salt!).
1. When in doubt, simplify. Clearly this deserved to make it to the title of the post. Do the following and you will not be too far wrong, and you will get your work done in a jiffy to boot (a minimal sketch of this baseline appears at the end of the post). Use:
- Naive Bayes classifier
- Tf-idf document vectors
- Stem if you want to.
2. Understand the corpus. The specific pipeline (tokenization scheme => vectorization scheme => word-embedding algorithm => classifier) that works well for one corpus may not be the best for a different corpus. Even if the general pipeline carries over, the details (the specific stemmer, the number of hidden layers and neurons, etc.) will need to be tuned to get the same performance on a different corpus.
3. Word embeddings are great. For dimensionality reduction and the concurrent reduction in run times, for sure. They work pretty well, but there is more work to do. If you have to use neural nets for document classification, you should try them.
4. Use custom vectors. Custom word-vectors generated from the corpus at hand are likely to yield better-quality classification results.
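To close, here is what takeaway 1 looks like in code: a minimal baseline sketch with plain scikit-learn, leaving stemming out; train_docs, train_labels, test_docs, and test_labels stand in for your own corpus.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# the "when in doubt" baseline: tf-idf document vectors + multinomial naive Bayes
baseline = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('classifier', MultinomialNB())])
baseline.fit(train_docs, train_labels)
print(baseline.score(test_docs, test_labels))     # mean accuracy on the held-out set
```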