Fine-grained Sentiment Analysis in Python (Part 2)

In this post, we’ll generate explanations for various classification results for fine-grained sentiment using LIME

Prashanth Rao
Towards Data Science



“Why Should I Trust You?” — Ribeiro et al.

This is Part 2 of a series on fine-grained sentiment analysis in Python. Part 1 covered how to train and evaluate various fine-grained sentiment classifiers in Python. In this post, we’ll discuss why a classifier made a specific class prediction — that is, how to explain a sentiment classifier’s results using a popular method called LIME.

To recap, the following six models were used to make fine-grained sentiment class predictions on the Stanford Sentiment Treebank (SST-5) dataset.

  • Rule-based models: TextBlob and VADER
  • Feature-based models: Logistic regression and Support Vector Machine
  • Embedding-based models: FastText and Flair

A linear process was used to analyze and explain the sentiment classification results using each method.

Sentiment classification: Training & Evaluation pipeline

The below sections explain the final step in the process — generating explanations for the predictions of each method.

Local Interpretable Model-Agnostic Explanations

LIME, for short, is a technique [original paper] that explains the prediction of a classifier in an interpretable manner. The term “explain” here means generating textual or visual aids, such as highlighted plots and charts, that provide a qualitative understanding of the relationship between a model’s features and its predictions. According to the paper, the following key points make LIME very effective at interpreting a complex model’s classification results:

It is model-agnostic: it treats the original model as a black box, so it can be applied to virtually any classifier.

It is locally faithful: the explanation corresponds to how the model behaves in the vicinity of the instance being predicted. For a given set of features in a test sample, the explanation is meaningful in the decision space occupied by those features, which may or may not apply globally.

The model explanation process

To provide an explanation for any classifier, LIME uses a very clever approach: the only thing it needs from the larger, black-box classifier is its prediction probabilities for each class (from the model’s output layer after the softmax function is applied). The below steps explain the process in a nutshell.

  • Generate thousands of variations of the text sample in which random words are blanked out.
  • Use the original, complex black-box model to generate class probabilities and labels for each individual variation (whose words are partially blanked out). For example, the case “It _ not horrible _ _ _ _” would have the class probabilities [0, 0, 0.6, 0.3, 0.1] (i.e. it would belong to class 3), whereas “It ‘s _ horrible _ _ _ _” would have the probabilities [0.9, 0.1, 0, 0, 0] and belong to class 1.
  • Train a smaller, simpler, linear model on the labels predicted by the black-box model for each variation.
  • Observe the features that carry the most weight in the simple, linear model (which is much easier to interpret).
  • Finally, generate a per-word (or per-token) visualization based on feature importance (a minimal code sketch of this process follows the list).
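
To make these steps concrete, here is a minimal, self-contained sketch of LIME’s local-surrogate idea. This illustrates the concept only; it is not the lime library’s internals, and the predict_proba stub below stands in for a real black-box sentiment model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

def predict_proba(texts):
    """Stub for a black-box 5-class sentiment model (an assumption, for illustration).
    Replace with the real model's probability function."""
    out = np.full((len(texts), 5), 0.1)
    out[:, 2] += [0.5 if ("not" in t and "horrible" in t) else 0.0 for t in texts]
    return out / out.sum(axis=1, keepdims=True)

rng = np.random.RandomState(42)
tokens = "It 's not horrible , just horribly mediocre .".split()

# 1. Generate variations of the sample by randomly blanking out words.
n_samples = 5000
masks = rng.randint(0, 2, size=(n_samples, len(tokens)))
variations = [" ".join(t if keep else "_" for t, keep in zip(tokens, mask))
              for mask in masks]

# 2. Query the black-box model for class probabilities of each variation.
probs = predict_proba(variations)                      # shape (n_samples, 5)

# 3. Weight each variation by its similarity to the original sentence.
weights = cosine_similarity(masks, np.ones((1, len(tokens)))).ravel()

# 4. Fit a weighted linear surrogate for one class of interest (class 3 here).
surrogate = Ridge(alpha=1.0)
surrogate.fit(masks, probs[:, 2], sample_weight=weights)

# 5. The surrogate's coefficients act as per-token feature importances.
for token, coef in sorted(zip(tokens, surrogate.coef_), key=lambda x: -abs(x[1])):
    print(f"{token:>10s}  {coef:+.3f}")
```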

How do explanations work with multiple classes?

For a multi-class text classification task such as SST-5, LIME uses the prediction probabilities to highlight the effect of each feature (i.e. token) on the predicted class using a one-vs-rest method. An example multi-class explanation from LIME is shown below. The model predicts a class of 2 for this sentence. The highlight colours in the text being explained are randomly generated, with more intense colours signifying a greater feature importance of that token (in the below example, “dullness”) for the predicted class.

Prediction class probabilities and feature importance plots generated by LIME

Explainer Class

Just as in the previous post, an object-oriented approach is applied to reuse code where possible. All code for the explainer is provided in this project’s GitHub repo [explainer.py]. A Python class is defined that takes in the list of variations generated by LIME (random text samples with tokens blanked out) and outputs the class probabilities for each sample as a NumPy array.
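
As an illustration, such a wrapper might look like the sketch below. The class name and attributes here are placeholders rather than the exact ones in explainer.py, and the pipeline is assumed to be a fitted scikit-learn Pipeline (e.g. vectorizer plus logistic regression).

```python
import numpy as np

class FeatureModelExplainer:
    """Minimal sketch of a predictor wrapper for LIME.

    LIME passes in a list of perturbed text samples; the wrapper must
    return an (n_samples, n_classes) array of class probabilities.
    """

    def __init__(self, pipeline):
        self.pipeline = pipeline   # assumed: a fitted scikit-learn Pipeline

    def predict(self, texts):
        # LimeTextExplainer expects a NumPy array of class probabilities
        return np.array(self.pipeline.predict_proba(texts))
```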

Once the class probabilities for each variation are returned, they can be fed to the LimeTextExplainer class. Enabling bag-of-words (bow) would mean that LIME doesn’t consider word order when generating variations. However, the FastText and Flair models were trained considering n-grams and contextual ordering respectively, so for a fair comparison between models, the bow flag is disabled for all explanations on SST-5.

The exp object returned by the LIME explainer exposes a method that converts the local linear model’s predictions (in numerical form) into a visual form that can be interpreted by humans, which is written out as an HTML file.
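
A hedged sketch of how these pieces fit together is shown below. Here, predictor stands in for the wrapper’s prediction method described above, and the argument values are illustrative rather than the exact ones used in explainer.py.

```python
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(
    class_names=["1", "2", "3", "4", "5"],
    bow=False,                 # respect word order when generating variations
)

sample = "It 's not horrible , just horribly mediocre ."
exp = explainer.explain_instance(
    sample,
    predictor,                 # callable returning (n_samples, 5) probabilities
    num_features=10,           # number of tokens to include in the explanation
    num_samples=5000,          # number of blanked-out variations to generate
    top_labels=1,              # explain only the most probable class
)
exp.save_to_file("explanation.html")   # human-readable visualization
```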

Hack for the Rule-based Methods

Since the rule-based methods (TextBlob and VADER) do not output class probabilities (they output just a single sentiment score), we have to artificially generate class probabilities for these methods in order to explain their results using LIME. Although this isn’t a formal procedure, a simple workaround is to simulate class probabilities from the continuous-valued sentiment score (in the range [-1, +1]): normalize the float score to the range [0, 1], and then convert it to a discrete integer class by scaling it up 5 times in magnitude. This is done for both TextBlob and VADER.

In addition, a rule-based model outputs one and only one prediction, so to avoid assigning zero probabilities to the other classes, a normal distribution over the five classes, centred on the predicted integer class, is used to assign small non-zero probabilities to the remaining classes. The below plots show how this is done, and a code sketch follows them. This array of simulated probabilities can now be used by LIME to generate explanations for the rule-based models.

Example of simulated probabilities for rule-based classifier scores
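
Here is a minimal sketch of this idea for TextBlob. Note that the exact score-to-class mapping and distribution parameters in explainer.py may differ; simulate_probabilities is an illustrative name, not the function in the repo.

```python
import numpy as np
from scipy.stats import norm
from textblob import TextBlob

def simulate_probabilities(score):
    """Map a continuous sentiment score in [-1, +1] to a simulated
    5-class probability distribution (illustrative implementation)."""
    normalized = (score + 1.0) / 2.0                   # [-1, +1] -> [0, 1]
    predicted_class = int(round(normalized * 4)) + 1   # discrete class in 1..5

    # Spread small non-zero probabilities over neighbouring classes with a
    # normal distribution centred on the predicted class, then normalize.
    classes = np.arange(1, 6)
    probs = norm.pdf(classes, loc=predicted_class, scale=1.0)
    return probs / probs.sum()

score = TextBlob("It's not horrible, just horribly mediocre.").sentiment.polarity
print(simulate_probabilities(score))   # five probabilities summing to 1
```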

What does each classifier focus on?

To understand the predictions, the file explainer.py is run for each of the six trained classifiers; this outputs an HTML file with visual content that helps us interpret which features each model relies on.

From the EDA done in the previous post, we know that classes 1 and 3 are the minority classes in the SST-5 dataset, so two samples are chosen from these two classes.

It’s not horrible, just horribly mediocre. (True label 1)

The cast is uniformly excellent … but the film itself is merely mildly charming. (True label 3)

Each of these samples contains modifiers, conflicting vocabulary and rapidly varying polarities across the sentence, so they should, in principle, help expose what each classifier is focusing on.

TextBlob

Example 1: It’s not horrible, just horribly mediocre.

True: 1 — Predicted: 3

For the sentence “It’s not horrible, just horribly mediocre”, the use of a negation term before the word “horrible” convinces TextBlob that the item isn’t horrible; however, the second clause, “just horribly mediocre”, reaffirms that the item is indeed very mediocre. The sentence as a whole is still overwhelmingly negative, making the true label 1, but TextBlob focuses heavily on the negation term and pushes the sentiment rating upward to 3.

Example 2: The cast is uniformly excellent … but the film itself is merely mildly charming.

True: 3 — Predicted: 5

The above sentence is really challenging, because the first half is overwhelmingly positive, while the second half is quite negative — making the overall sentence neutral in sentiment. The words “mildly”, “charming” and “excellent” contribute heavily to TextBlob incorrectly classifying this sentence as strongly positive (label 5). In general, multiple occurrences of strongly positive words tend to push the rule-based algorithm in TextBlob to classify the sentence as positive overall.

VADER

True: 1 — Predicted: 5

In the above example, VADER predicts a completely opposite sentiment rating to what was expected. VADER tends to heavily penalize negative polarity words (such as “horrible”), but also heavily reward negation terms that precede negative polarity words (“not horrible”). The presence of “horribly mediocre” in the latter half of the sentence has no impact whatsoever because the strong positive score in the first half outweighs everything that comes after — the rule-based approach in VADER proved a little too clever for itself in this example.

True: 3 — Predicted: 5

Once again, the rule-based approach in VADER heavily weights positive-sounding words like “excellent” and “charming” to incorrectly give this sentence a label 5. The words “merely” and “mildly”, which should be bringing down the overall sentence score, are ignored by VADER; this is most likely because these words do not appear in VADER’s sentiment lexicon or rule-based modifiers.

Logistic Regression

True: 1 — Predicted: 2

Logistic regression does seem to learn individual features (tokens) that modify the overall sentiment rating of the sentence — as can be seen above, the words “just” and “mediocre” contribute more towards label 2, whereas “horrible” and “horribly” contribute to “not 2” (presumably label 1). The prediction probabilities for both labels 1 and 2 are nearly equal, so the feature-based approach in this case errs by a very fine margin towards label 2.

True: 3 — Predicted: 3

In the second example, it is clear that the words “but”, “mildly” and “merely” are correctly identified by the logistic regression model as swinging the overall sentence sentiment towards a neutral state.

Support Vector Machine

True: 1 — Predicted: 1

The SVM, unlike the logistic regression, focuses more heavily on the strongly negative words “horrible” and “horribly” to assign this a label of 1. The negation term “not” and the word “mediocre” are identified correctly as pushing the rating away from 1, but their effect is low and the overall sentiment label is still 1.

True: 3 — Predicted: 3

In the second example, it is clear that the SVM learned the right features that make the overall sentence neutral. The words “merely”, “mildly” and “but” decrease the probability that the sentence is assigned labels 2, 4 and 5, making the label 3 the most likely.

FastText

True: 1 — Predicted: 1

FastText focuses heavily on the strongly negative words in example 1 — the other class labels have a zero probability. It does get the overall prediction right, but this could just have been because FastText learned that “horrible” is a word that regularly appears in strongly negative sentences.

True: 3 — Predicted: 3

In example 2, it is clearer why the FastText model correctly predicts the neutral label. Since the model was trained using word trigrams and a context window of 5 (see the previous post for training parameters), it looks at sequences of tokens while making a prediction. For example, the words “uniformly excellent” are immediately followed by “but”, and the words “merely mildly” immediately precede “charming”, so judging by the highlighted words in the visualization, the model is aware of the sequences of word co-occurrences (and not just individual words) that make this sentence neutral.

Flair + ELMo

True: 1 — Predicted: 2
True: 3 — Predicted: 2

The Flair + ELMo embedding model gets both example predictions wrong by a fine margin. In both of the above visualizations, the model seems to give a high weight to the period (.) token at the end of each sentence; this wasn’t the case with the other classifiers. The probabilities for the predicted label are very close to those of the correct label in both cases, so the model seems to be on the right track in terms of how it learns from the data.

An important point to note: The Flair + ELMo model was underfitting during the training stage, i.e. the validation loss was still decreasing even after 25 epochs of training (see Part 1 of this series) — this means that further training could push the classifier towards the correct probability outputs for these and other examples.

Analysis

Studying the explained results across the various methods, we can make some observations on the merits and demerits of each method, as well as the effects of key variables on its performance.

Effect of strongly polar words

TextBlob and VADER tend to heavily weight words with strong polarity, even when there are other words with a milder polarity (or negation terms) that alter the sentiment of the overall sentence. Hardcoded rules work well in many cases, but natural language from the real world has too much variability for these kinds of rules to work well in practice, at least for fine-grained sentiment analysis.

Embedding-based methods have the best handle on cases that involve strongly polar words. The FastText model was trained with trigrams, so it learned to pick up sequences of words that preceded or succeeded a strongly polar word — so it wasn’t as easily fooled, allowing it to make better predictions on the neutral class compared to the rule-based and feature-based methods. The Flair model, because of its contextual embeddings and strong underlying language model, was able to more accurately identify patterns involving sequences of strong polarity words.

Effect of sentence length

Long samples, especially multi-sentence samples, can cause trouble when using rule-based methods, which tend to apply some kind of weighted averaging to capture overall sentence polarity. Hence, the longer the sentence, the greater the chance of diluting the actual sentiment of individual clauses in the long sample.

Really short (one- or two-word) samples pose a different challenge to the models: they either contain unseen words or provide too little context, such as punctuation or similar subwords, that a model could have used to categorize the sample. In general, rule-based models fare badly in these situations, because the lack of modifiers (“very” or “too”) does not provide enough of a notion of sentiment intensity in our fine-grained classification scenario.

Effect of unseen words

In samples with numerous unseen words (or really short samples, where the chances that a word was unseen during training are quite high), models that lack a sequential (n-gram) or contextual representation (embeddings) tend to struggle to make reliable fine-grained sentiment predictions. The feature-based models in scikit-learn showcase this issue quite clearly. Since they rely on word co-occurrence counts during training, if an unseen word shows up in the test set, the word is simply ignored by the feature-based classifier, which misses out on features that might have given the model more context.
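
A tiny illustration of this behaviour in scikit-learn (a toy example, not taken from the project code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count-based features silently drop words never seen during training.
vectorizer = CountVectorizer()
vectorizer.fit(["the film is charming", "the film is horrible"])

# "transcendent" was not in the training vocabulary, so it contributes
# nothing to the test-time feature vector.
vec = vectorizer.transform(["the film is transcendent"])
print(vectorizer.get_feature_names_out())
print(vec.toarray())   # the unseen word is simply ignored
```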

FastText is better equipped to deal with unseen words due to its character n-gram subword representations, but it doesn’t have as deep a pre-trained representation as the Flair model to properly handle them. The Flair + ELMo embedding model uses contextual representations from a pretrained vocabulary (1 billion words of news-crawl data). This gives it a significant advantage over the shallower models: the contextualized embeddings carry useful information that identifies words similar in meaning, as well as their subword representations.

Model cost (training and/or inference)

All rule-based methods involve zero training time and are very quick during the prediction stage; however, this comes at the cost of unstable results on unseen, real-world data. The feature-based methods are also very quick to train and make inferences with, thanks to the highly efficient vectorized representations in scikit-learn. However, these models come with limitations of their own: they do not capture relationships between words and handle unseen words poorly.

FastText is a good compromise between cost of computation (it is blazing fast due to its underlying C++ and Cython bindings) and classification accuracy. For fine-grained sentiment analysis, training the model using trigram representations is a must, in order to capture the finer gradations involved in sequences of words that occur in long sentences.

Flair is the most expensive option of all, mainly because it is a large deep learning model that uses pre-trained representations from a combination of string and word embeddings. Training this model can take on the order of several hours (or days, depending on the size of the dataset). Part of the reason that the Flair + ELMo model is slow during training and inference is the way ELMo embeddings are looked up at run time; this can be sped up using mini-batching. The reality, however, is that models like Flair might well prove too expensive for fast and efficient inference on real-world datasets.

Interactive Dashboard for Model Testing

To make it easier to see explanations from a range of models on a number of different examples, a dashboard was created using the Flask micro-framework and deployed on Heroku (in a similar way to this example). To test the dashboard interactively, input a text sample and choose the classifier whose prediction you want explained, as shown below.
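
For reference, a hypothetical skeleton of such a dashboard route is sketched below. The real app in the project repo is more complete; explain_with_lime here is an assumed helper name standing in for the explainer code shown earlier.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

def explain_with_lime(method, text):
    """Assumed helper: run the chosen classifier + LIME and return the
    explanation as an HTML string (stubbed here for illustration)."""
    return f"<p>Explanation for <b>{method}</b> on: {text}</p>"

@app.route("/", methods=["GET", "POST"])
def index():
    explanation_html = ""
    if request.method == "POST":
        text = request.form["text"]
        method = request.form["method"]        # e.g. "textblob", "fasttext"
        explanation_html = explain_with_lime(method, text)
    return render_template_string(
        """
        <form method="post">
          <textarea name="text"></textarea>
          <select name="method">
            <option>textblob</option><option>vader</option>
            <option>logistic</option><option>svm</option>
            <option>fasttext</option><option>flair</option>
          </select>
          <button type="submit">Explain</button>
        </form>
        {{ explanation|safe }}
        """,
        explanation=explanation_html,
    )

if __name__ == "__main__":
    app.run(debug=True)
```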

Try out the dashboard and generate your own explanations: https://sst5-explainer.herokuapp.com

Conclusions

In this post, we discussed how to generate and interpret LIME explanations for fine-grained sentiment using six different classifiers in Python. The differences in the features each model focuses on are clearly apparent, which makes it more straightforward to choose the right classifier for the task at hand. The more complex the underlying representation of the vocabulary (especially contextual embeddings), the more reliable the model’s predictions across the five sentiment classes of the SST-5 dataset.

The below accuracy/F1-scores on the SST-5 dataset were obtained thus far:

While the Flair + ELMo embedding-based model does achieve a decent accuracy of 48.9%, this is still very far from the state-of-the-art (64.4%). In addition, the Flair model is very expensive to train and make predictions with, so a more computationally efficient, yet contextually aware model is desirable.

In Part 3 of this series, we’ll see how to improve on these results further using a transformer model with transfer learning. Thanks for reading!

Acknowledgements

Code
