Multiclass Classification with Word Bags and Word Sequences

SVM with tf-idf vectors edges out LSTM in quality and performance for classifying the 20-newsgroups text corpus.

A document is a specific sequence of words. But not all sequences of words are documents. Teaching the difference to an algorithm is a tall order. Taking the sequence of words into account for text analysis is in general computationally expensive. Deep learning approaches such as LSTM allow us to model a document as a string-of-words, and they have indeed found some success in NLP tasks recently.

On the other hand, when we shred the document and make bags by word, we end up with a vector of weights/counts of these words. The mapping from a document to this vector is many-to-one, as all possible sequences of those words yield the same vector. So deciphering the meaning of the original document (much less resurrecting it!) from this vector is not possible. Nevertheless, this decades-old bags-of-words approach to modeling documents has been the mainstay of NLP tasks.

When the sequence of words is important in determining the class of a document, string-of-words approaches will outshine bags-of-words. We demonstrated this with synthetic documents, where LSTM trounced the bags-of-words approach (Naive Bayes working with tf-idf vectors) for classification. But for a real text corpus of movie reviews and binary sentiment classification, we showed that LSTM and SVM (with tf-idf vectors) were comparable in quality, even while the former took much longer.

The objective of this post is to further evaluate "bags vs strings" in a multiclass setting. We work with the 20-newsgroups text corpus that is available through the scikit-learn API. We also look at the impact of using word-embeddings – both pre-trained and custom. We go through some code snippets here, but the complete code to reproduce the results can be downloaded from GitHub.

1. Tokenize the 20news Corpus

This corpus consists of posts made to 20 news groups, so they are well-labeled. There are over 18,000 posts, more or less evenly distributed across the 20 topics. In the code snippet below we fetch these posts, clean and tokenize them, and get them ready for classification.

  • Lines #7–9: Tokenization. Remove all punctuation and NLTK stop words. Make sure all words/tokens start with a letter, and only retain words between 3 and 15 characters long.
  • Line #13: Use the scikit-learn API to fetch the posts, but make sure to remove the "dead give-away" clues as to which topic a given post belongs to.
  • Line #22: Taking note of the number of words in each document helps us choose a reasonable sequence length for LSTM later. The percentile stats on nTokens show that over 92% of the documents have fewer than 200 words in them.
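A minimal sketch of this tokenization step is below, assuming NLTK stop words and scikit-learn's fetch_20newsgroups; variable names are illustrative, and the line numbering differs from the full script on GitHub.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

nltk_stopw = set(stopwords.words('english'))
# Words that start with a letter and are 3 to 15 characters long
tokenizer = RegexpTokenizer(r'\b[a-zA-Z][a-zA-Z0-9]{2,14}\b')

def tokenize(text):
    return [w for w in tokenizer.tokenize(text.lower()) if w not in nltk_stopw]

# remove= strips the headers/footers/quotes that give the topic away
twenty_news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'),
                                 shuffle=True, random_state=42)
docs = [tokenize(raw) for raw in twenty_news.data]
labels = twenty_news.target

nTokens = [len(doc) for doc in docs]
print(np.percentile(nTokens, [50, 80, 92, 99]))    # over 92% of posts have < 200 tokens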

2. Word-Embeddings

Word-embeddings are short (of length p, much smaller than the size of the vocabulary nWords) numerical vector representations of words. They allow us to reduce the dimensionality of the word space from the length of the corpus vocabulary (about 107,000 here) to a much shorter length such as the 300 used here. The pre-trained fasttext word vectors are downloaded, while the custom fasttext vectors for this corpus are generated offline via Gensim. In either case, once in hand they are simply read off of the disk.

  • Lines #6–7: We have the custom Gensim-generated word vectors in a json file structured as {word: vector, ...}, so we simply read it in as a dictionary.
  • Lines #8–21: In the case of pre-trained vectors, we read the downloaded file and process it with some error checking.
  • Lines #25–27: Prepare the nWords x 300 embedding matrix, where each row holds the 300-long numerical vector for the corresponding word.
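A sketch of how the embedding matrix can be assembled is below. It assumes the custom vectors sit in a json file and the pre-trained fasttext vectors in a .vec text file (both file names are placeholders), and that word_index maps each word in the vocabulary to its integer index.

import json
import numpy as np

embed_dim = 300

def get_word_vectors(embedding_source, word_index):
    if embedding_source == 'custom-fasttext':
        # Custom Gensim-generated vectors saved as { word : vector, ... }
        with open('custom-fasttext.json') as f:
            wvec = json.load(f)
    else:
        # Pre-trained fasttext file: one "word v1 v2 ... v300" entry per line
        wvec, badVectors = {}, 0
        with open('crawl-300d-2M.vec', encoding='utf-8') as f:
            next(f)                                  # skip the header line
            for line in f:
                parts = line.rstrip().split(' ')
                if len(parts) != embed_dim + 1:      # error checking
                    badVectors += 1
                    continue
                wvec[parts[0]] = np.asarray(parts[1:], dtype='float32')
        print('# Bad Word Vectors:', badVectors)
    # One row per index; row 0 stays all-zero when the indices start at 1 (Keras padding)
    embedding_matrix = np.zeros((max(word_index.values()) + 1, embed_dim))
    for word, idx in word_index.items():
        if word in wvec:
            embedding_matrix[idx] = wvec[word]
    return embedding_matrix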

The end result is a matrix where each row represents a 300-long vector for a word. The words/rows are ordered as per the integer index in the word_index dictionary, {word: index}. In the case of Keras, the words are ordered based on their frequency. In the case of the tf-idf vectorizer, a word gets its index based on its alphabetical order in the vocabulary. Just bookkeeping, nothing complex.

3. Pack Bags and Sequences

LSTM works with word sequences as input, while the traditional classifiers work with word bags such as tf-idf vectors. Having each document in hand as a list of tokens, we are ready for either.

3.1 Tf-Idf Vectors for SVM

We use scikit-learn's TfidfVectorizer to build the vocabulary (the word_index dict variable in Line #6 below) and the document vectors (Line #7) from the tokens.

Xencoded is a sparse nDocs x nWords matrix. When using word-embeddings, we convert that to a dense nDocs x 300 matrix by multiplying with the embedding matrix computed in Section 2. These shorter, 300-long dense vectors are then classified.
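A sketch of this step with scikit-learn's TfidfVectorizer, reusing the get_word_vectors sketch from Section 2; the exact vectorizer settings in the actual script may differ.

from sklearn.feature_extraction.text import TfidfVectorizer

# docs is the list of token lists from Section 1, so bypass the built-in analyzer
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
Xencoded = vectorizer.fit_transform(docs)        # sparse nDocs x nWords tf-idf matrix
word_index = vectorizer.vocabulary_              # {word: column index}

# When using word-embeddings: project onto the embedding matrix from Section 2
embedding_matrix = get_word_vectors('fasttext', word_index)
Xdense = Xencoded.dot(embedding_matrix)          # dense nDocs x 300 matrix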

3.2 Sequences for LSTM

The text processor in Keras turns each document into a sequence/string of integers, where the integer value indicates the actual word as per the {word: index} dictionary generated by the same processing. The index values start at 1, skipping 0, which is reserved for padding. We use 200-long sequences, as the stats on the tokens show that over 92% of the documents have fewer than 200 words. In Line #7 in the code below, documents with fewer than 200 words are 'post' padded with the index value 0, which is ignored by the embedding layer (mask_zero=True is set in the definition of the embedding layer in Section 4).
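A sketch of this step with the Keras text utilities; the 200-long sequence length is as discussed above, and the other settings are kept at their defaults.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

sequenceLength = 200
kTokenizer = Tokenizer()
kTokenizer.fit_on_texts(docs)                    # docs: token lists from Section 1
word_index = kTokenizer.word_index               # {word: index}, indices start at 1
encoded = kTokenizer.texts_to_sequences(docs)
# Shorter documents are 'post' padded with 0, which the embedding layer masks out
Xencoded = pad_sequences(encoded, maxlen=sequenceLength, padding='post')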

4. Models

LSTM is implemented via Keras, while SVM is implemented via scikit-learn. Both work with the same train/test split, so the comparison is fair. Twenty percent of the overall corpus (i.e., 3660 documents) is set aside for test, while training is done on the remaining 14636 documents.
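One way to get an identical, stratified 80/20 split for both models is sketched below; the seed value here is illustrative.

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_indices, test_indices = next(sss.split(Xencoded, labels))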

4.1 LSTM

As in the earlier articles in this series, we use the simplest possible LSTM model: an embedding layer, one LSTM layer, and the output layer. When using external word-embeddings, the embedding layer is not trained, i.e., the weights are what we read from the disk in Section 2.

Figure 1. A simple LSTM model for multiclass classification

The embedding layer in Figure 1 reduces the number of features from 107196 (the number of unique words in the corpus) to 300. The LSTM layer outputs a 150-long vector that is fed to the output layer for classification. The model itself is defined quite simply below.

  • Lines #3–7: The embedding layer is trained only when not using external word-embeddings.
  • Line #9: The dropout fields help prevent overfitting.
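A minimal sketch of this model follows; the layer sizes are as described above, while the dropout values are assumptions.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def getModel(embedding_matrix=None):
    model = Sequential()
    if embedding_matrix is not None:
        # External word-embeddings: freeze the layer with the weights read in Section 2
        model.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                            input_length=sequenceLength, trainable=False, mask_zero=True))
    else:
        model.add(Embedding(len(word_index) + 1, 300,
                            input_length=sequenceLength, trainable=True, mask_zero=True))
    model.add(LSTM(units=150, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(20, activation='softmax'))   # one output per newsgroup
    return model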

Training is done with early stopping to prevent overtraining, in Line #5 in the code below. The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label.
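A sketch of the training and prediction steps; the patience and batch size are assumptions.

import numpy as np
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical

early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=2)
model = getModel(embedding_matrix)               # embedding_matrix built from the Keras word_index
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.fit(Xencoded[train_indices], to_categorical(labels[train_indices]),
          epochs=50, batch_size=32, verbose=2, callbacks=[early_stop],
          validation_data=(Xencoded[test_indices], to_categorical(labels[test_indices])))
predicted_labels = np.argmax(model.predict(Xencoded[test_indices]), axis=1)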

4.2 SVM

The model for SVM is much less involved, as there are far fewer moving parts and parameters to decide upon. That is always a good thing, of course.
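A sketch with scikit-learn's LinearSVC; the choice of this particular SVM variant and its parameters here is an assumption.

from sklearn.svm import LinearSVC

# Works with either the sparse tf-idf matrix or the dense 300-long vectors
svm = LinearSVC(tol=1.0e-6, max_iter=20000)
svm.fit(Xencoded[train_indices], labels[train_indices])
predicted_labels = svm.predict(Xencoded[test_indices])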

5. Simulations

The confusion matrix and the F1-scores are what we are interested in. With the predicted labels in hand from either approach, we use the scikit-learn API to compute them.
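For example, assuming the predictions from either run are in predicted_labels:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(labels[test_indices], predicted_labels))
print(classification_report(labels[test_indices], predicted_labels,
                            digits=4, target_names=twenty_news.target_names))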

While we have gone through some snippets in a different order, the complete code (lstm-20news.py for running LSTM and svm-20news.py for running SVM) is on GitHub. As indicated in the earlier articles, various random seeds are initialized for repeatability. The simulations are carried out with the help of a shell script that loops over the variations we are considering.

5.1 LSTM

The embedding layer should contribute

(107196 + 1) * 300 = 32,159,100 parameters

where the extra row in the embedding matrix is the one reserved for the padding index 0 (the embedding layer itself has no bias terms).

This matches the number of non-trainable parameters in Line #11 below for an LSTM run with external word-embeddings.

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 200, 300)          32159100  
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               270600    
_________________________________________________________________
dense_1 (Dense)              (None, 20)                3020      
=================================================================
Total params: 32,432,720
Trainable params: 273,620
Non-trainable params: 32,159,100
_________________________________________________________________
None
Train on 14636 samples, validate on 3660 samples
Epoch 1/50
 - 224s - loss: 2.5373 - acc: 0.1837 - val_loss: 2.1770 - val_acc: 0.2757
Epoch 2/50
 - 223s - loss: 2.1440 - acc: 0.2913 - val_loss: 2.0411 - val_acc: 0.3437
Epoch 3/50
...
...
Epoch 35/50
 - 223s - loss: 0.5122 - acc: 0.8351 - val_loss: 0.9211 - val_acc: 0.7295
Epoch 00035: early stopping
...
...
micro avg     0.7295    0.7295    0.7295      3660
macro avg     0.7209    0.7167    0.7137      3660
weighted avg  0.7300    0.7295    0.7255      3660

Time Taken: 7859.074368476868

The run takes over 2 hours, stops due to the early-stopping criterion, and obtains an F1-score of 0.73. Figure 2 shows the rate of convergence flattening out a good bit by about 20 epochs or so.

Figure 2. Convergence of the LSTM model with fasttext word-embeddings.

5.2 SVM

SVM has far fewer moving parts, and it finishes much more quickly as well. With fasttext embeddings, it works with an 18296 x 300 dense matrix (Line #7 below) and obtains an F1-score of 0.68.

X, labels #classes classes 18296 (18296,) 20 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Vocab sparse-Xencoded 107196 (18296, 107196)
# Bad Word Vectors: 1
Dense-Xencoded (18296, 300)
...
...
               micro avg     0.6899    0.6899    0.6899      3660
               macro avg     0.6757    0.6766    0.6722      3660
            weighted avg     0.6835    0.6899    0.6839      3660

Time Taken: 140.8366870880127

6. Results

We have the results in hand to not only compare bags and sequences for multiclass classification but also to assess the impact of using pre-trained and custom word-embeddings. Figure 3 shows the F1-scores obtained and the time taken in all cases. SVM with direct tf-idf vectors does the best both for quality and performance. Pre-trained word-embeddings help LSTM improve its F1-score. The larger run times for LSTM are expected, and they are in line with what we have seen in the earlier articles in this series.

Figure 3. Quality and performance comparison of bags vs strings for 20-news classification. SVM is clearly the leader in quality and performance. Word-embeddings seem to help LSTM achieve better results.

Figure 4 below compares the best confusion matrices obtained by either approach.

Figure 4. Diagonal dominance is observed in either case. Interestingly, both approaches seem to be confused between more or less the same pairs of classes.

7. Conclusions

So what are we to make of the results obtained in this three-part series? For a synthetic text corpus dominated by sequences, word strings beat word bags handily. For a binary classification task, the score was even. In this multiclass classification task, the scales have tilted towards word bags. Given that deep learning approaches have so many knobs, one can never be sure whether the obtained results could not be improved by tweaking some (which ones, pray tell me… units/layers/batches/cells/… and by how much, too, while you are at it…). So here are some loose assessments for whatever they are worth.

  • For a real text corpus, word bags are tough to beat – especially given the much shorter run times.
  • Word-bag vectors do not really benefit from the use of word-embeddings. We have seen this in earlier articles such as Word Embeddings and Document Vectors: Part 2. Classification.
  • Word-embeddings can improve the quality of results for word-string based approaches.
