
A three-level sentiment classification task using SVM with an imbalanced Twitter dataset


Machine learning to classify emotions from live reaction tweets to the first televised GOP debate in 2016

Photo by History in HD on Unsplash

As a news addict, I love seeing how politics can garner such emotive responses across social media, and I wondered if this anecdotal sense of passion could translate into Machine Learning classification. I found a dataset of Tweets made in reaction to the first Republican Presidential debate in 2016 (here) and wanted to create a three-level sentiment classifier that could interpret emotions from the text of the Tweets. This article is part of a suite of methodologies and techniques I put together; for now I will focus on just one aspect: the humble Support Vector Machine. As a secondary task, I noticed the dataset was severely imbalanced, so I wanted to try upsampling the minority classes in an effort to improve the usefulness of the classifier across all labels.


Data Exploration & Cleaning

Who would have guessed Twitter was so Negative? Image by Author

From looking at the raw figures in the breakdown of the dataset we immediately see an issue in the spread of the Tweets. Negative Tweets are prevalent, outnumbering the neutral and positive Tweets combined and appearing at more than twice the rate of either class individually. This could have ramifications on how well the classifier works in practice. (Most likely, if trained like this, the classifier will be great at understanding a negative tweet but won’t have much practice identifying anything else!)

The first step of ‘cleaning’ the data was to convert all letters to lowercase; punctuation, numbers, URLs and usernames were then removed from the Tweets.

Stop-words were removed using the ‘NLTK’ stop-words corpus, extra white space was stripped, and each tweet was tokenised so that every word is represented as an individual unit of data.

Duplicate tweets were then omitted. Duplicates were removed after the other pre-processing steps because, given the nature of Twitter (retweets, replies to other users, remarks at usernames), tweets may end up with exactly the same content once cleaned. This left 9,836 unique tweets prepared for classification: Negative: 5,692, Neutral: 2,521 and Positive: 1,623. The dataset was split into 80% for training and 20% for testing.
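A minimal sketch of what this cleaning pipeline could look like in Python is shown below. The regular expressions, file name and column names are assumptions for illustration rather than the exact code from the project notebook.

```python
import re
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK 'stopwords' and 'punkt' resources have been downloaded
STOP_WORDS = set(stopwords.words("english"))


def clean_tweet(text):
    """Lowercase, strip URLs/usernames/numbers/punctuation, then tokenise and drop stop-words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)               # usernames
    text = re.sub(r"\d+", " ", text)                # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in word_tokenize(text.strip()) if tok not in STOP_WORDS]


# Hypothetical file and column names for the GOP debate dataset
df = pd.read_csv("Sentiment.csv")[["text", "sentiment"]]
df["clean_text"] = df["text"].apply(lambda t: " ".join(clean_tweet(t)))
df = df.drop_duplicates(subset="clean_text")        # remove duplicates last
```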


Vectorisation – TF/IDF

For most mathematical modelling performed on text, and for this experimentation, different processes of ‘vectorisation’ were implemented.

Raw text cannot be manipulated in mathematical space on its own; it has to be converted into numbers before a machine learning algorithm can read it.

That is why, for the supervised methods in this project, different types of vectorisation were used to convert the qualitative data into quantitative data that can be mathematically manipulated. These vectors become the embedded features for the models.

Term Frequency/Inverse Document Frequency (TF/IDF)

This was the vectorisation technique used for the Support Vector Machine model. TF/IDF was deployed on the training data with a unigram approach, which counts each individual word as a term. ‘Term frequency’ measures how frequently a certain word appears in a document, while ‘inverse document frequency’ reduces the significance of words which appear most often across all of the documents.

This serves to highlight words that are seen frequently in a given document but not necessarily in all of the documents.
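As a sketch of how unigram TF/IDF features might be built with scikit-learn (continuing the hypothetical variable names from the cleaning snippet above; the 80/20 split mirrors the one described earlier, and random_state is an arbitrary choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# 80% training / 20% testing, as described in the data cleaning section
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["clean_text"], df["sentiment"], test_size=0.2, random_state=42
)

# Unigram TF/IDF: fit on the training text only, then transform both splits
vectoriser = TfidfVectorizer(ngram_range=(1, 1))
X_train = vectoriser.fit_transform(X_train_text)
X_test = vectoriser.transform(X_test_text)
```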


Data Balancing & Sampling Techniques

Data balancing in the form of Random Oversampling, Synthetic Minority Oversampling (SMOTE) and a real-world unbalanced approach were all utilised and compared.

Sampling Techniques

It is crucial to take data balancing issues into consideration: every reasonable step should be taken to reduce bias and overfitting and to obtain a more honest representation of the model’s potential. Upsampling techniques are worth considering because they can make it easier for a model to outline its decision boundaries. Undersampling techniques were not used, as they seemed unlikely to improve performance in this case; the dataset was quite small to begin with and was reduced further by the preprocessing steps and the training split.

No sampling

Classes may be left unbalanced, with models trained on the tweets exactly as they would appear in a real-life context. It would be naive to assume in advance that models cannot perform well without upsampling: if data in a given domain are naturally severely unbalanced, then training on unbalanced data may produce the most useful outputs.

Random Over Sampling

Random over-sampling is the process of duplicating examples from the two minority classes and adding them to the training set. Examples are chosen at random from the minority classes, duplicated and added back to the training set, where they have the potential to be chosen again.

Because the duplicates are exact and the same example can appear multiple times, there is a risk of overfitting the minority classes with this approach, and models implementing this technique can suffer from poorer generalisation.

For the purposes of this experiment, the minority classes were both up-sampled to the same value as the majority negative class so that each class had 5,692 examples after upsampling was applied.
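One common way to implement this is with the RandomOverSampler from the imbalanced-learn library, applied to the TF/IDF training features only. This is a sketch under the variable names assumed in the snippets above; the original code may differ.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler

# Duplicate random minority-class examples until every class matches the majority count
ros = RandomOverSampler(sampling_strategy="not majority", random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print(Counter(y_train_ros))  # each class should now hold the same number of examples
```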

Synthetic Minority Oversampling Technique (SMOTE)

Researchers Chawla, Bowyer, Hall and Kegelmeyer created this upsampling technique in their paper "SMOTE: Synthetic Minority Over-sampling Technique" (read it here!).

SMOTE is another useful upsampling method. As opposed to random oversampling, which creates exact duplicates of data points from the minority class, SMOTE uses a form of data augmentation to ‘synthesise’ completely new and unique examples. SMOTE works by choosing minority-class instances that are close to each other in the feature space, drawing a line between two of them, and creating a new sample at a point along that line. This approach tends to be effective: the synthetic tweets sit close to existing examples in the feature space, so they are potentially closer in polarity than randomly duplicated examples, and because they are not exact copies, the likelihood of overfitting is reduced.
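imbalanced-learn also provides SMOTE, which can be swapped in for the random oversampler. The snippet below is a minimal sketch under the same assumed variable names, with k_neighbors left at its default of 5.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Synthesise new minority examples by interpolating between an instance and
# one of its k nearest minority-class neighbours in the feature space
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print(Counter(y_train_sm))  # minority classes brought up to the majority count
```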


Evaluation

Experimental evaluation metrics vary and usually depend on the nature of the task being conducted. Some of the typically used evaluation metrics for analytical procedures include, but are not limited to: accuracy, precision, recall, Mean Squared Error, analysis of the loss function, Area Under the Curve and the F1-Score. Different models in different domains will score differently on each metric, so suitable metrics must be chosen to meet the evaluation criteria of the task.

Accuracy

Accuracy is one of the most frequently measured evaluation metrics in classification tasks and is most often defined as the number of correctly classified labels in proportion to the number of predictions in total.

F1-Score

The ‘F1 Score’, ‘F-Score’ or ‘F-Measure’ is a common metric used for the evaluation of Natural Language based tasks. It is the harmonic mean of precision and recall, and so conveys the balance between the two.

As accuracy only gives the percentage of correct results and does not show how adept the model is at finding true positives, both measures have merit, depending on the need.
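With scikit-learn, accuracy and the per-class precision, recall and F1-Score can be produced with a couple of calls. The labels below are made up purely to illustrate the API, not taken from the experiment.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Toy true labels and predictions, just to show the metric functions
y_true = ["Negative", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred = ["Negative", "Neutral", "Neutral", "Negative", "Negative", "Neutral"]

print("Accuracy:", accuracy_score(y_true, y_pred))             # correct / total predictions
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print(classification_report(y_true, y_pred))                   # per-class precision, recall, F1
```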

Receiver Operating Characteristic (ROC)

The ROC is a graph which displays the performance of a classification model in terms of its True Positive Rate (TPR) and its False Positive Rate (FPR). TPR is defined as the number of true positives output by the model divided by the sum of true positives and false negatives.

Area Under the Curve (AUC)

The AUC statistic is a measure of the area underneath the ROC curve. This figure gives an aggregated score of model performance across all of the potential classification thresholds. One interpretation is that the AUC is the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. The AUC is always a figure between 0 and 1.

The ROC, along with the AUC, is useful because it is invariant to prior class probabilities or class prevalence in the data. This is important for this study, as the classes are severely unequal: with the negative class so dominant, a model that leans on that class can achieve deceptively good headline figures, so a metric that is invariant to class prevalence gives a fairer picture.
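For a three-class problem the ROC and AUC are typically computed one-vs-rest, one curve per class. The sketch below assumes `model` is a fitted classifier exposing decision_function (such as the linear SVM sketched in the Results section) and that `X_test`, `y_test` are the TF/IDF test features and labels from the earlier snippets.

```python
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = list(model.classes_)                         # e.g. Negative / Neutral / Positive
y_test_bin = label_binarize(y_test, classes=classes)   # one binary column per class
y_scores = model.decision_function(X_test)             # one score column per class

# One-vs-rest AUC for each class; unaffected by how prevalent each class is
for i, cls in enumerate(classes):
    print(cls, roc_auc_score(y_test_bin[:, i], y_scores[:, i]))
```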


Results

The following are the results obtained from the Support Vector Machine trained on Term Frequency/Inverse Document Frequency vectors, with the various oversampling techniques applied to the minority classes.
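The article does not spell out the exact SVM configuration, so the sketch below assumes a plain linear SVM fitted on the TF/IDF features with default parameters, consistent with the ‘simple linear approach’ mentioned in the conclusion.

```python
from sklearn.svm import LinearSVC

# Baseline run: fit on the unbalanced TF/IDF training features as-is
model = LinearSVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# The oversampled runs reuse the same model class on the resampled features, e.g.
#   model.fit(X_train_ros, y_train_ros)   # Random Oversampling
#   model.fit(X_train_sm, y_train_sm)     # SMOTE
```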

The figure below shows the results for the Support Vector Machine model trained on unbalanced training data. The overall accuracy of the model here is 60%, but looking at the precision, recall and F1 score for this approach we see how poorly the model performs when categorising the smaller classes. The model understands the negative class but fails to learn much from the smaller classes, as is clear from the quite low F1 scores of 18% and 19% for the neutral and positive classes.

Image by author

From the ROC curve and AUC shown below we see a more rounded perspective of the model’s performance. The true positive rate of all three classes was almost the same, except for a 1% lower AUC for neutral compared to the other classes. The model’s overall ability at classification is not particularly strong, even though it reports an overall accuracy of 60%.

From the confusion matrix shown we see the actual predicted values of the SVM model. The model clearly shows its best performance when classifying the negative class, with 1,086 correct predictions. However, we see only 37 correct predictions for the positive class, slightly less than 10% correct for that class. The neutral class has 55 correct predictions, a slight improvement on the positive class. It is interesting to note how this model incorrectly labelled tweets as negative substantially more than any other class, showing how heavily the model relied on the negative class to influence its decision making.

Image by Author
Image by author

Support Vector Machine TF/IDF Randomly Oversampled Classes

The classification report displayed below shows the results of the SVM TF/IDF model when the random oversampling technique is applied to up-sample the minority classes. It is noteworthy that the overall accuracy for this approach does not differ from the same approach with unbalanced classes, but the model’s performance in correctly classifying the smaller classes does improve slightly, as shown by the improved F1-scores on the minority classes.

Image by Author

Shown below is a representation of the ROC curve and AUC figure for the SVM model with random oversampling. The true positive rate of this model improves by at least 4% across all classes. Not only did upsampling the minority classes improve their classification, the model improved considerably across all classes. The model is still best at finding the negative class, and it did not lose any of that knowledge when presented with more diverse training examples.

Image by Author

The figure below shows a confusion matrix for the SVM ROS model. It is noteworthy that the classifier is marginally worse at correctly classifying the negative class when compared with the unbalanced dataset (1,086 versus 977) but almost doubles its correct classifications of the positive class (37 versus 70). The number of correct predictions of the neutral class improved considerably (from 55 to 140). It is also relevant to point out that the total number of tweets incorrectly classified as negative was reduced significantly.

Image by Author

Support Vector Machine TF/IDF SMOTE

The figure below shows the classification report of the SVM TF/IDF with the SMOTE upsampling technique applied. The overall accuracy of the model remains static at 60%; however, we do see an improved F1 score for the two minority classes when compared to the unbalanced approach, though not when compared to the randomly up-sampled method.

Image by Author

The figure below displays the ROC curve and AUC figure for the SVM with SMOTE. Comparing these graphs and metrics with the ROS approach, there is a clear drop in the true positive rate. This approach is only marginally improved when assessed next to the unbalanced approach: classification of the negative class improved by 2% compared to no upsampling, but the model is generally 1% to 2% less capable than the ROS model at classifying each of the classes.

Image by Author

Finally, the plot below shows the confusion matrix for the SVM with SMOTE. The negative class is still the label that the classifier identifies correctly most often, but it is interesting to note how the correct predictions for the neutral class drop by almost half when using this technique compared to ROS (72 versus 140). The classifier here also misclassifies Tweets as negative a lot more using this technique compared to ROS. Rather than diversifying the range of predictions that it makes, the classifier relies heavily on the negative class label in this instance. It is also worth noting how this model not only misclassified the positive class more than the ROS model but makes far fewer total predictions for this label.

Image by Author

Evaluation & Conclusion

The Support Vector Machine found it extremely difficult to make correct classifications when trained on imbalanced training data. The lack of parameter tuning and the use of a simple linear approach may also have caused issues.

SVMs are sensitive to imbalanced data and work best with naturally balanced classes, which may explain the decreased performance and why the unbalanced experiment yielded the least useful results.

In terms of overall accuracy, all three sampling approaches gave the same figure, but it is clear that the best performance was consistently on labelling the negative class.

Referring to the F1-Measure of the classifier, the randomly upsampled model gave the best results. However, it must be remembered that random oversampling creates exact copies of instances, which has the potential to lead to overfitting.


Jupyter Notebooks with all of the Python code to accompany this report can be found on the GitHub repo here! 🙂


Report and code made by Alan Coyne, Freelance Data Scientist based in Dublin, Ireland

