Surprising Findings in Document Classification

Sometimes Simplicity Wins

Grant Holtes
Towards Data Science


Document Classification: the task of assigning labels to large bodies of text. In this case the task is to classify BBC news articles into one of five labels, such as sport or tech. The dataset used wasn’t ideally suited to deep learning, with only a few thousand examples, but that is a realistic situation for anyone working outside a large firm.

Normally this type of technical article would run through a few models before concluding with a comparison of results and an overall evaluation, but today I thought I’d save you a scroll and start with the unexpected results.

Simple models worked best. Like, really simple models. Logistic regression matched or outperformed a variety of deep neural network approaches. I hypothesise that this is due to the smaller size of the dataset and the long texts in question. Unlike a task such as sentence classification, where the number of words is low and the order of words greatly influences the meaning of the sentence, the document classification problem has a huge number of words available to classify each document. This makes word order less predictive of class, reducing the effectiveness of techniques such as Convolutional Networks, LSTMs, and Hierarchical Attention Networks (HAN), all of which perform highly on massive datasets. Instead, methods that use just the presence of words in the text excelled, with counts or a binary encoding (0: word not in document, 1: word in document) of the top n words working best.
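
To make that encoding concrete, here is a minimal sketch using the Keras `Tokenizer`. The exact preprocessing pipeline isn’t spelled out in the article, so the vocabulary size and the toy documents below are my own assumptions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical stand-ins for the BBC articles
docs = [
    "the match ended with a late penalty and a red card",
    "the new phone ships with a faster chip and better camera",
]

# Keep only the top n most common words (n = 10000 is an assumed value)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(docs)

# mode="binary": 1 if the word appears in the document, 0 otherwise
x = tokenizer.texts_to_matrix(docs, mode="binary")
print(x.shape)  # (2, 10000) - one 0/1 row per document
```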

Ok, now let’s check out some models. Models that performed appallingly were not included.

Starting Complex | Convolutional Neural Network | 96.63%

The CNN approach aims to use the order of the words in the text to provide meaning. The same could be done with a fully connected layer, but that presents two issues:

  1. The number of weights to optimise would be enormous, risking overfitting.
  2. The weights are sensitive to the relative position of words in the text.

The CNN solves these issues by assuming that sequences of words in the text have predictive power over class, but that the position of these sequences in the text is irrelevant. By sliding the same kernel over the words in the text, we get the same activation from a filter no matter where the phrase appears. The only positional trace is where that activation sits in the column vectors of the first layer, and this is eliminated by a max pooling layer that reduces the column vectors to a single row vector. As a result, one can imagine each of the 512 kernels searching for a specific phrase in the text.

Of course, the exact same phrase is unlikely to be in more than a few texts. This is mitigated by using word vectors to encode the relative meaning of each word, hence the phrases “I love my phone” and “I adore my phone” will deliver similar activations despite not being the same word-for-word. This also reduces the dimensionality of the word representations, allowing a 3.5M word vocabulary in just 50 to 300 dimensions.
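
The article doesn’t give the exact architecture, so here is a rough Keras sketch of this kind of text CNN under my own assumptions: the sequence length, vocabulary size and kernel width are guesses, while the 512 filters, the 50–300 dimensional word vectors and the five output classes come from the text.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # assumed size of the word index
SEQ_LEN = 500        # assumed padded document length
EMBED_DIM = 100      # word-vector dimensionality (the text cites 50 to 300)
NUM_CLASSES = 5      # the five BBC article labels

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    # Map each word index to a dense word vector
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # 512 kernels, each effectively scanning for a short phrase
    layers.Conv1D(512, kernel_size=3, activation="relu"),
    # Keep only each filter's maximum activation, discarding position
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```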

Take it down a notch | Dense Neural Network | 95.96%

Let’s simplify things a little. In the dense network we have just one hidden layer of 64 nodes, then an output prediction layer as in the CNN model. As mentioned in the introduction, the input used is also simplified as a binary encoding, where 0 represents ‘word not in document’ and 1 represents ‘word in document’ for the most common n words.
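
A minimal Keras sketch of that architecture follows. The single 64-node hidden layer and the binary bag-of-words input come from the text; the input width n and the training settings are assumptions.

```python
from tensorflow.keras import layers, models

NUM_WORDS = 10000   # assumed n for the top-n-words binary encoding
NUM_CLASSES = 5

model = models.Sequential([
    layers.Input(shape=(NUM_WORDS,)),
    # Input is a 0/1 vector: did each of the top n words appear in the document?
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```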

This method works extremely well, giving similar performance to the CNN approach but with far faster training and processing times. It is also more memory efficient, requiring only 24% of the parameters of the CNN model.

Simplify! | Multinomial Logistic Regression | 96.18%

An even simpler and more statistically rigorous model is logistic regression. In this case the same feature engineering as the dense network is used. This is fed into a logistic model, which resembles a neural network with no hidden layers and a softmax activation on its output layer, although the method and loss function used to fit it are specific to the model.
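
A hedged scikit-learn sketch of this setup, reusing the same binary top-n-words features; the vocabulary size and iteration cap are assumptions, not values from the article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# binary=True gives the same 0/1 "word present" encoding as the dense network
vectorizer = CountVectorizer(binary=True, max_features=10000)

# With the default lbfgs solver and more than two classes, scikit-learn fits
# a multinomial (softmax) logistic regression over the five article labels
clf = LogisticRegression(max_iter=1000)

pipeline = make_pipeline(vectorizer, clf)
# pipeline.fit(train_texts, train_labels)
# pipeline.score(test_texts, test_labels)
```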

This model is extremely simple, disregarding the information about the order of words and phrases that the CNN would capture. It also has no notion of word meaning, treating “love” and “adore” the same as it would treat “dog” and “sandwich”. However, despite all these concessions, it sacrifices only half a percent of accuracy and gains so much more: minuscule memory usage, statistical rigour, and comparatively blistering training and evaluation times.

Maybe, just sometimes, simplicity does win and is more than good enough.

If you are seeing this, cheers for reading / scrolling this far. If you want a conclusion, just read the introduction again! ❤️
