
Using Transformer-Based Language Models for Sentiment Analysis

How to easily beat state-of-the-art sentiment models

--

Sentiment analysis can be useful for many businesses by helping to assess customers’ mood towards the company or a specific product. It is the task of categorizing texts according to their polarity, i.e. identifying whether the author’s feeling regarding the topic is positive, negative or neutral.

To perform sentiment analysis automatically at scale, we need to train models on annotated datasets. This article will show how to beat current benchmarks by a significant margin (improvements of around 5 percentage points) by adapting state-of-the-art transformer models to sentiment analysis in a fast and easy way using the open-source framework FARM.

Datasets

As mentioned, we need annotated data to train a model in a supervised fashion. For this purpose, I used the datasets of two shared tasks: SemEval 2017 and Germeval 2017.
While the SemEval dataset consists of English tweets, the Germeval dataset contains German texts from different social media and web sources about Deutsche Bahn (the German railway company). The texts of both collections are labeled with regard to their polarity (i.e. positive, negative or neutral).

Fine-tuning Transformer Models with FARM

Transformer models like BERT are the current state of the art for many NLP tasks, such as text classification, named entity recognition and question answering. There is a variety of pre-trained models to choose from. For the Germeval 2017 dataset, deepset’s GermanBERT was used since it showed strong performance across a range of tasks.
For the SemEval 2017 dataset, however, the training data consists of English texts while the purpose of the resulting model is to assess sentiment in texts from German customers. A zero-shot learning approach was therefore attempted, which required a multilingual model. For this reason, XLM-RoBERTa-large was utilized, which is trained on 100 different languages. If you want to know more about XLM-RoBERTa, I highly recommend this blog article.

Since pre-trained models only capture a general understanding of language and not the nuances of specific downstream NLP tasks, we need to adapt them to our particular purpose. To achieve this, the Framework for Adapting Representation Models (FARM) was used. FARM makes it easy to adjust transformer models to different NLP tasks. To build a reliable sentiment classifier, I followed FARM’s doc_classification example. The following code snippets show how I adapted GermanBERT to the task of sentiment analysis in a few steps:

Data Processing

To be able to load the training instances, the dataset needs to be in a CSV file where each row consists of a training example together with its label. The TextClassificationProcessor loads and converts the data so that it can be used by the modeling components. For this purpose, we need to specify the set of possible labels, the data directory containing the train and test sets, and the name of the column containing the label for each training instance. Here, we also have to indicate the maximum sequence length that is used in our model.
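A minimal sketch of this step, closely following FARM’s doc_classification example, might look like the following. The file names, label column name and maximum sequence length are assumptions for illustration, and exact argument names can differ slightly between FARM versions.

```python
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor

# Load the tokenizer that matches the language model we want to fine-tune
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

# The processor reads the CSV files from data_dir and converts each row
# (text + label) into features the model can consume
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,                                  # assumed maximum sequence length
    data_dir="data/germeval17",                       # assumed data directory
    train_filename="train.csv",                       # assumed file names
    test_filename="test.csv",
    delimiter=",",
    label_list=["negative", "neutral", "positive"],   # the set of possible labels
    label_column_name="sentiment",                    # assumed name of the label column
    metric="f1_macro")                                # metric reported during evaluation
```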
The converted data is then passed on to the DataSilo. Its purpose is to store the data and provide it to the model batch by batch. Therefore, the batch size has to be specified at this step.
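A corresponding sketch, with the batch size of 32 chosen purely for illustration:

```python
from farm.data_handler.data_silo import DataSilo

# The DataSilo runs the processor on the data, stores the resulting datasets
# and hands out batches during training and evaluation
data_silo = DataSilo(processor=processor, batch_size=32)
```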

Modeling

The next step is to define the model architecture. First, we need to decide which language model we want to fine-tune. In this case, we are using bert-base-german-cased. Then, we have to choose the right prediction head for our specific task. As we want to classify texts into discrete sentiment classes, we need to select a TextClassificationHead. The final step is to stack the prediction head on top of the language model, which is done by the AdaptiveModel.
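A sketch of this step, again following FARM’s doc_classification example. The dropout probability is an assumption, and older FARM versions expect layer_dims instead of num_labels for the prediction head:

```python
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.utils import initialize_device_settings

# Use a GPU if one is available
device, n_gpu = initialize_device_settings(use_cuda=True)

# The pre-trained language model we want to fine-tune
language_model = LanguageModel.load("bert-base-german-cased")

# A prediction head that classifies whole texts into the three sentiment classes
prediction_head = TextClassificationHead(num_labels=3)

# Stack the prediction head on top of the language model;
# "per_sequence" means the head receives one vector per input text
model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=0.1,          # assumed dropout on the embeddings
    lm_output_types=["per_sequence"],
    device=device)
```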

Training

Now that the data is loaded and the model architecture is defined, we can start training the model. First, we have to initialize an optimizer. Here, we set both the learning rate and the number of epochs we want to train our model for. initialize_optimizer not only initializes an optimizer, but also a learning rate scheduler. The default is a linear warmup of the learning rate for the first 10% of all training steps.
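A sketch of the optimizer setup; the learning rate and number of epochs shown here are illustrative values, not necessarily the ones used for the reported results (those are listed in the hyperparameter table at the end):

```python
from farm.modeling.optimization import initialize_optimizer

n_epochs = 2           # illustrative value
learning_rate = 2e-5   # illustrative value

# Returns the model, the optimizer and a learning rate schedule;
# by default the schedule warms the learning rate up linearly
# over the first 10% of all training steps
model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=learning_rate,
    device=device,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs)
```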
Finally, we can feed all of the components to the Trainer, start the training and save the resulting model for later use. Fine-tuning GermanBERT on the Germeval-17 dataset took less than 16 minutes on a Tesla V100 16GB GPU; adjusting XLM-RoBERTa on the SemEval-17 data needed a bit more than 28 minutes.
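A sketch of this final step; the evaluation interval and save directory are assumptions, and depending on the FARM version trainer.train() may need the model passed in explicitly:

```python
from farm.train import Trainer

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=n_epochs,
    n_gpu=n_gpu,
    lr_schedule=lr_schedule,
    evaluate_every=100,     # evaluate on the dev set every 100 steps (assumption)
    device=device)

# Run the fine-tuning
trainer.train()

# Save model and processor together so they can be reloaded for inference later
save_dir = "saved_models/germanbert-sentiment"   # assumed path
model.save(save_dir)
processor.save(save_dir)
```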

Results

For the Germeval 2017 shared task, the performance of the submitted models was assessed with micro-averaged F1-score. Two different test sets were provided: one contains tweets from the same period as the training dataset (synchronic test set), the other test set contains tweets from a later period of time (diachronic test set).
The best submission (Naderalvojoud et al. 2017) achieved a micro-averaged F1-score of 74.9% on the synchronic test set and 73.6% on the diachronic test set. The model that was trained using GermanBERT and FARM outperforms these scores by over 5 percentage points, achieving a micro-averaged F1-score of 80.1% on the synchronic test set and 80.2% on the diachronic test set.

Evaluation results for the Germeval-2017 dataset

The performance of the submissions to the SemEval 2017 shared task, however, was assessed by macro-averaged recall. There, the best submissions (Cliche 2017, Baziotis et al. 2017) achieved a macro-averaged recall of 68.1%. Again, the model trained using XLM-RoBERTa-large and FARM outperforms these submissions by over 5 percentage points, achieving a macro-averaged recall of 73.6%.

Evaluation results for the SemEval-2017 data set

As the reason for using XLM-RoBERTa instead of a monolingual model was to apply the model to German data, the XLM-RoBERTa sentiment model was also evaluated on the Germeval-17 test sets. Here, we achieved a micro-averaged F1-score of 59.1% on the synchronic test set and 57.5% on the diachronic test set. This performance is worse than a basic majority class baseline with scores of 65.6% and 67.2% respectively.
One reason for this could be that the class distributions of the two datasets differ considerably. While most instances in the Germeval dataset are labeled as neutral, with almost no cases of positive sentiment, the majority class of the SemEval dataset is positive.
Another problem might be that the two datasets consist of thematically different texts. The Germeval dataset is very limited with regard to its topics, containing mainly texts from different social media and web sources about Deutsche Bahn. The SemEval dataset, in contrast, is not topically restricted.

Class distributions for Germeval-2017 and SemEval-2017 data sets

All of the aforementioned results were achieved using the following hyperparameters:

Hyperparameters used to train the models

Conclusion

This blog article showed how we can train our own sentiment models using transformer models. We saw that fine-tuning a pre-trained model can easily be done with the help of the FARM framework, and that we are even able to beat the leaderboards of different shared tasks. If you are interested in trying out the GermanBERT sentiment model yourself, you can access it via Hugging Face’s model hub.
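Once published on the model hub, such a model could be loaded for inference with the standard transformers pipeline. The model identifier below is only a placeholder, not the actual name of the published model:

```python
from transformers import pipeline

# Replace the placeholder id with the actual model name from the model hub
sentiment = pipeline("sentiment-analysis", model="<model-hub-id-of-the-sentiment-model>")

print(sentiment("Der Zug war mal wieder zu spät."))
# e.g. [{'label': 'negative', 'score': 0.98}]
```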
