
In this article, we will work through a classification problem: labelling a review as either positive or negative. The reviews used here were left by customers of a service, ABC.
Data Collection and Pre-processing
The data used in this project were scraped from the web, and the data cleaning was done in a separate notebook. After scraping, the data was saved to a .txt file. Here is an example of one line of the file (representing one data point):
{'socialShareUrl': 'https://www.abc.com/reviews/5ed0251025e5d20a88a2057d', 'businessUnitId': '5090eace00006400051ded85', 'businessUnitDisplayName': 'ABC', 'consumerId': '5ed0250fdfdf8632f9ee7ab6', 'consumerName': 'May', 'reviewId': '5ed0251025e5d20a88a2057d', 'reviewHeader': 'Wow - Great Service', 'reviewBody': 'Wow. Great Service with no issues. Money was available same day in no time.', 'stars': 5}
The data point is a dictionary, and we are interested in the reviewBody and stars fields.
We will categorize the reviews by star rating as follows:
- 1 and 2: Negative
- 3: Neutral
- 4 and 5: Positive
At the time of data collection, there were 36,456 reviews on the site. The data is highly imbalanced: 94% of the reviews are positive, 4% are negative and 2% are neutral. In this project, we will fit different Sklearn models on the imbalanced data and also on balanced data (dropping excess positive reviews so that we have the same number of positive and negative reviews).
Below is a plot showing the composition of the data:

In Fig 2 and the figures above, we can see that the data is highly imbalanced. Could this be a sign of a problem? We shall see.
Let’s start by importing the necessary packages and defining the class Review that we will use to categorize a given review message. Here, we will load the data and use the Review class to label each review message as positive, negative or neutral.
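A minimal sketch of such a class, assuming each line of the .txt file is a Python dict literal like the example above (the class and constant names here follow the article's description but are not necessarily the author's exact code):

```python
import ast

# Assumed label constants; names are illustrative.
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # 1 and 2 stars -> negative, 3 -> neutral, 4 and 5 -> positive
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        return Sentiment.POSITIVE

# Each line of the file is a dict literal, so ast.literal_eval
# parses it safely without using eval().
line = "{'reviewBody': 'Wow. Great Service with no issues. Money was available same day in no time.', 'stars': 5}"
point = ast.literal_eval(line)
review = Review(point["reviewBody"], point["stars"])
print(review.text)       # Wow. Great Service with no issues. ...
print(review.sentiment)  # POSITIVE
```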
Wow. Great Service with no issues. Money was available same day in no time.
POSITIVE
Split Data into Training and Test Set
Size of train set: 25519
Size of test set: 10937
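The split can be sketched with sklearn's train_test_split; a toy example assuming a 70/30 ratio, which matches the 25,519 / 10,937 sizes above (the data here is a stand-in, not the actual reviews):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the list of Review objects.
reviews = [f"review {i}" for i in range(10)]

# test_size=0.3 reproduces the roughly 70/30 split used above.
train, test = train_test_split(reviews, test_size=0.3, random_state=42)
print(f"Size of train set: {len(train)}")  # Size of train set: 7
print(f"Size of test set: {len(test)}")    # Size of test set: 3
```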
Before we continue further, we need to understand the concept of Bag-of-Words.
Bag-of-Words
As we know, a computer only understands numbers, so we need to convert the review messages we have into lists of numbers using the bag-of-words model.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: A vocabulary of known words. A measure of the presence of known words.
The bag-of-words model is a supporting model used in document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
Example:
Consider these two reviews
- Excellent Services by the ABC remit team. Recommend.
- Bad Services. Transaction delayed for three days. Don’t recommend.
From the above two sentences, we can derive the following vocabulary
[Excellent, Services, by, the, ABC, remit, team, recommend, bad, transaction, delayed, for, three, days, don’t]
We now vectorize the reviews against this vocabulary to generate the following two data points, which can then be used to train the classifier

In Python, tokenization is done as follows
['abc', 'bad', 'by', 'days', 'delayed', 'don', 'excellent', 'for', 'recommend', 'remit', 'services', 'team', 'the', 'three', 'transaction']
[[1 0 1 0 0 0 1 0 1 1 1 1 1 0 0]
[0 1 0 1 1 1 0 1 1 0 1 0 0 1 1]]
Now that we understand the concept of bag-of-words, let’s apply that knowledge to our train_x and test_x.
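A sketch of that step with toy stand-ins for train_x and test_x: the key point is that the vectorizer is fitted on the training texts only and then reused to transform the test texts, so both share one vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the actual review texts.
train_x = ["great service", "bad service", "excellent support"]
test_x = ["great support"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn vocabulary + vectorize
test_x_vectors = vectorizer.transform(test_x)        # reuse the same vocabulary

print(train_x_vectors.shape)  # (3, 5)
print(test_x_vectors.shape)   # (1, 5)
```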
Training the Models on Imbalanced Data
At this point, we have the vectors that we can use to fit the models and we can go ahead and do just that
Support Vector Machine
Review Message: easy efficient first class
Actual: POSITIVE
Prediction: ['POSITIVE']
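A self-contained sketch of training an SVM on bag-of-words vectors; the toy data and the linear kernel are assumptions for illustration, not necessarily the author's exact setup:

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Toy training data standing in for the vectorized reviews.
train_x = [
    "easy efficient first class",
    "great service",
    "terrible delays",
    "bad experience",
]
train_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

# Linear kernel is a common default for sparse text features.
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)

test_vector = vectorizer.transform(["easy efficient first class"])
print(clf_svm.predict(test_vector))  # ['POSITIVE']
```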

Other models trained include Random Forest (an ensemble method), Naive Bayes, Decision Tree and Logistic Regression. The full code is linked at the end of the article.
Evaluation of Model Performance on Imbalanced data
- Accuracy
The models were evaluated using the accuracy metric, and the results were as follows

We are getting an accuracy of 90%. Is that right, or is something wrong? The answer: there is something wrong.
The data is imbalanced, and using accuracy as an evaluation metric is not a good idea. Below is the distribution of the categories
----------TRAIN SET ---------------
Positive reviews on train set: 23961 (93.89%)
Negative reviews on train set: 1055 (4.13%)
Neutral reviews on train set: 503 (1.97%)
----------TEST SET ---------------
Positive reviews on test set: 10225 (93.48%)
Negative reviews on test set: 499 (4.56%)
Neutral reviews on test set: 213 (1.95%)
What happens if the classifier predicts all positive reviews correctly and none of the negative and neutral reviews in the test set? The classifier will attain an accuracy of 93.48%!
This means that our model will be 93.48% accurate and we will think that the model is good, but in reality the model just knows best how to predict one class (positive reviews). In fact, from Fig 4, SVM predicted no neutral reviews at all.
To understand this problem further, let us introduce another metric: F1 score, and use it to evaluate our models.
- F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
Precision measures how many of the cases predicted as a class actually belong to it, while recall measures how many cases of that class the model correctly identifies.
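Per-class F1 scores can be computed with sklearn's f1_score; a small sketch with hypothetical labels showing how a model that over-predicts the majority class scores well on one class and poorly on the other:

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions for illustration.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

# average=None returns one F1 score per class, in label order.
scores = f1_score(y_true, y_pred, average=None,
                  labels=["POSITIVE", "NEGATIVE"])
print(scores)  # approximately [0.8, 0.667]
```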
When we evaluate our models on this metric the result is as follows


From Fig 6 and Fig 7, we now know that the models are very good at classifying positive reviews and poor at predicting negative and neutral reviews.
Working with Balanced Data
As a way of balancing the data, we decided to randomly drop some positive reviews so that the model is trained on evenly distributed reviews. This time we are training the model on 1055 positive reviews and 1055 negative reviews. We are also dropping the neutral class.

(You can also consider using oversampling techniques to fix the problem of imbalanced data)
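A sketch of this undersampling step using Python's random module; the lists here are hypothetical stand-ins for the two classes of reviews:

```python
import random

random.seed(42)

# Hypothetical stand-ins: many positives, few negatives.
positive = [f"pos {i}" for i in range(100)]
negative = [f"neg {i}" for i in range(10)]

# Randomly keep only as many positives as there are negatives,
# then shuffle so the classes are mixed for training.
positive_sample = random.sample(positive, len(negative))
balanced = positive_sample + negative
random.shuffle(balanced)

print(len(balanced))  # 20
```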
After training the models we ended up with the following results

SVM attains the best result of 88.9% accuracy, and upon checking the F1 scores (below) we can see that the model now predicts negative reviews as well as positive ones.


If we look at the confusion matrix showing the results of SVM, we notice that the model is good at predicting both classes.

Find the full code here
Conclusion
After going through this project I hope you were able to learn that:
- Fitting a model or models on imbalanced data might (and in most cases does) lead to undesirable results.
- Accuracy is not a good metric when dealing with imbalanced data.
- Most of the work is done in the pre-processing stage.
Thank you for reading 🙂