
End to End Machine Learning Project: Reviews Classification

A project to classify a review as either positive or negative

Fig 1 (Source: Author)

In this article, we will go through a Classification problem that involves classifying a review as either positive or negative. The reviews used here are the reviews made by customers on a service ABC.

Data Collection and Pre-processing

The data used in this project were scraped from the web, and the data cleaning was done in this notebook.

Web Scraping: Scraping Table Data

After scraping, the data was saved to a .txt file. Here is an example of one line of the file (representing one data point):

{'socialShareUrl': 'https://www.abc.com/reviews/5ed0251025e5d20a88a2057d', 'businessUnitId': '5090eace00006400051ded85', 'businessUnitDisplayName': 'ABC', 'consumerId': '5ed0250fdfdf8632f9ee7ab6', 'consumerName': 'May', 'reviewId': '5ed0251025e5d20a88a2057d', 'reviewHeader': 'Wow - Great Service', 'reviewBody': 'Wow. Great Service with no issues.  Money was available same day in no time.', 'stars': 5}

The data point is a dictionary and we are interested in the reviewBody and stars.

We will categorize the reviews by star rating as follows:

  • 1 and 2 stars - Negative
  • 3 stars - Neutral
  • 4 and 5 stars - Positive

At the time of data collection, there were 36,456 reviews on the site. The data is highly imbalanced: 94% of the reviews are positive, 4% are negative, and 2% are neutral. In this project, we will fit different Sklearn models on the imbalanced data and also on balanced data (dropping excess positive reviews so that we have the same number of positive and negative reviews).

Below is a plot showing the composition of the data:

Fig 2: Data composition (Source: Author)

In Fig 2, we can see that the data is highly imbalanced. Could this be a sign of a problem? We shall see.

Let’s start by importing the necessary packages and defining the class Review, which we will use to categorize a given review message.
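A minimal sketch of such a class (the Sentiment helper and the exact names are illustrative, not necessarily the author's original code):

```python
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # Map star ratings to sentiment: 1-2 negative, 3 neutral, 4-5 positive
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE
```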

Here, we will load the data and use the Review class to categorize each review message as positive, negative, or neutral:
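A hedged sketch of the loading step; the file name reviews.txt is an assumption, and ast.literal_eval is used because each line of the file is a Python-style dict with single quotes:

```python
import ast

reviews = []
with open("reviews.txt") as f:  # file name is an assumption
    for line in f:
        data = ast.literal_eval(line)  # each line is a Python-style dict
        reviews.append(Review(data["reviewBody"], data["stars"]))

print(reviews[0].text)
print(reviews[0].sentiment)
```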

Wow. Great Service with no issues.  Money was available same day in no time.
POSITIVE

Split Data into Training and Test Set
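One way to produce the split below with scikit-learn; test_size=0.3 is inferred from the printed sizes (10937 is about 30% of 36456), and the random_state is illustrative:

```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.3, random_state=42)

train_x = [review.text for review in train]
train_y = [review.sentiment for review in train]
test_x = [review.text for review in test]
test_y = [review.sentiment for review in test]

print("Size of train set: ", len(train))
print("Size of test set: ", len(test))
```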

Size of train set:  25519
Size of test set:  10937

Before we continue further, we need to understand the concept of Bag-of-Words.

Bag-of-Words

See scikit-learn's [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

As we know, a computer only understands numbers, so we need to convert our review messages into lists of numbers using the bag-of-words model.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those words.

The bag-of-words model is a supporting model used in document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.

Example:

Consider these two reviews:

  • Excellent Services by the ABC remit team. Recommend.
  • Bad Services. Transaction delayed for three days. Don’t recommend.

From the above two reviews, we can derive the following vocabulary

[Excellent, Services, by, the, ABC, remit, team, recommend, bad, transaction, delayed, for, three, days, don’t]

We now vectorize the reviews against this vocabulary to generate the following two data points, which can be used to train the classifier

Fig 3 (Source: Author)

In Python, tokenization and vectorization are done as follows:
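A sketch using scikit-learn's CountVectorizer, which lowercases the text and drops single-character tokens (which is why "Don't" appears as "don" in the vocabulary below):

```python
from sklearn.feature_extraction.text import CountVectorizer

example_reviews = [
    "Excellent Services by the ABC remit team. Recommend.",
    "Bad Services. Transaction delayed for three days. Don't recommend.",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(example_reviews)

print(vectorizer.get_feature_names_out().tolist())
print(vectors.toarray())
```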

['abc', 'bad', 'by', 'days', 'delayed', 'don', 'excellent', 'for', 'recommend', 'remit', 'services', 'team', 'the', 'three', 'transaction']
[[1 0 1 0 0 0 1 0 1 1 1 1 1 0 0]
 [0 1 0 1 1 1 0 1 1 0 1 0 0 1 1]]

Now that we understand the concept of Bag-of-Words, let’s apply that knowledge to our train_x and test_x.
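A minimal sketch of that step: the vectorizer is fit on the training set only, and the test set is transformed with the same vocabulary so both share one feature space:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn vocabulary from train
test_x_vectors = vectorizer.transform(test_x)        # reuse the same vocabulary
```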

Training the Models on Imbalanced Data

At this point, we have the vectors we need to fit the models, so let’s go ahead and do just that.

Support Vector Machine
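A hedged sketch of fitting the SVM; the linear kernel is an assumption (a common choice for sparse text features):

```python
from sklearn import svm

clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)

# Inspect one prediction against its true label
print("Review Message: ", test_x[0])
print("Actual: ", test_y[0])
print("Prediction: ", clf_svm.predict(test_x_vectors[0]))
```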

Review Message:  easy efficient  first class
Actual:  POSITIVE
Prediction:  ['POSITIVE']
Fig 4: Confusion matrix resulting from fitting SVM (Source: Author)

Other models trained include Ensemble Random Forest, Naive Bayes, Decision Tree, and Logistic Regression. The full code is linked at the end of the article.

Evaluation of Model Performance on Imbalanced data

  1. Accuracy

The models were evaluated using the accuracy metric, with the following results:
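Accuracy here is simply the fraction of test reviews classified correctly; with scikit-learn it can be read straight off the fitted classifier (a sketch, reusing the clf_svm from above):

```python
# ClassifierMixin.score returns mean accuracy on the given test data
print("SVM accuracy:", clf_svm.score(test_x_vectors, test_y))
```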

Fig 5: Performance of the models on imbalanced data (Source: Author)

We are getting an accuracy of about 90%. Is that right, or is something wrong? The answer: something is wrong.

The data is imbalanced, and using accuracy as an evaluation metric is not a good idea. Below is the distribution across the categories:

----------TRAIN SET ---------------
Positive reviews on train set: 23961 (93.89%)
Negative reviews on train set: 1055 (4.13%)
Neutral reviews on train set: 503 (1.97%)
----------TEST SET ---------------
Positive reviews on test set: 10225 (93.48%)
Negative reviews on test set: 499 (4.56%)
Neutral reviews on test set: 213 (1.95%)

What happens if the classifier predicts all positive reviews correctly and none of the negative and neutral reviews in the test set? The classifier will attain an accuracy of 10225 / 10937 ≈ 93.48%!

This means that our model will be 93.48% accurate and we will think that the model is good, but in reality the model just knows how to predict one class (positive reviews) well. In fact, from Fig 4, SVM predicted no neutral reviews at all.

To understand this problem further, let us introduce another metric: F1 score, and use it to evaluate our models.

  2. F1 Score

The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall) (Wikipedia).

Precision and recall measure how well the model correctly classifies the positive cases and the negative cases. Read more here.
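A sketch of computing per-class F1 scores with scikit-learn; average=None returns one score per label, which is exactly what exposes the weakness on the minority classes:

```python
from sklearn.metrics import f1_score

scores = f1_score(
    test_y,
    clf_svm.predict(test_x_vectors),
    average=None,
    labels=["POSITIVE", "NEGATIVE", "NEUTRAL"],
)
print(scores)  # one F1 score per class, in the order of the labels list
```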

When we evaluate our models on this metric, the results are as follows:

Fig 6 (Table): F1 scores for different classifiers (Source: Author)
Fig 7 (Plot): F1 scores for different classifiers (Source: Author)

From Fig 6 and Fig 7, we can now see that the models are very good at classifying positive reviews but poor at predicting negative and neutral ones.

Working with Balanced Data

As a way of balancing the data, we decided to randomly drop some positive reviews so that we train the models on an evenly distributed set. This time around we are training the models on 1055 positive reviews and 1055 negative reviews. We are also dropping the neutral class.
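A minimal sketch of that undersampling step under the assumptions above (random.sample and the seed are illustrative):

```python
import random

random.seed(42)  # illustrative seed for reproducibility

positive = [r for r in train if r.sentiment == Sentiment.POSITIVE]
negative = [r for r in train if r.sentiment == Sentiment.NEGATIVE]

# Drop excess positives so both classes have 1055 reviews each
positive_shrunk = random.sample(positive, len(negative))
balanced_train = positive_shrunk + negative
random.shuffle(balanced_train)
```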

Fig 8: Distribution of balanced data (Source: Author)

(You can also consider using oversampling techniques to fix the problem of imbalanced data.)

After training the models, we ended up with the following results:

Fig 9: Model accuracies on balanced data (Source: Author)

SVM attains the best result of 88.9% accuracy, and upon checking the F1 scores (below) we can now see that the models predict negative reviews as well as positive ones.

Fig 10 (Table): F1 scores for different classifiers (Source: Author)
Fig 11 (Plot): F1 scores for different classifiers (Source: Author)

If we look at the confusion matrix showing the results of SVM, we notice that the model is good at predicting both classes.

Fig 12: Confusion matrix generated using SVM results (Source: Author)

Find the full code here

Conclusion

After going through this project I hope you were able to learn that:

  • Fitting a model or models on imbalanced data might (and in most cases does) lead to undesirable results.
  • Accuracy is not a good metric when dealing with imbalanced data.
  • Most of the work is done in the pre-processing stage.

Thank you for reading 🙂
