Review Rating Prediction: A Combined Approach

Combining Review Text Content (RTC) with a user similarity matrix to obtain more information and improve Review Rating Prediction (RRP)

Yereya Berdugo
Towards Data Science



Opening

The rise of e-commerce has brought with it a significant rise in the importance of customer reviews. There are hundreds of review sites online and massive numbers of reviews for every product. Customers have changed the way they shop, and according to a recent survey, 70 percent of customers say they use rating filters to filter out low-rated items in their searches.

The ability to successfully decide whether a review will be helpful to other customers, and thus give the product more exposure, is vital to companies that host these reviews, such as Google, Amazon, and Yelp.

There are two main methods to approach this problem. The first is based on review text content analysis and uses the principles of natural language processing (the NLP method). This method lacks the insights that can be drawn from the relationship between customers and items. The second is based on recommender systems, specifically on collaborative filtering, and focuses on the reviewer's point of view. Building a user similarity matrix and applying neighbor analysis are both part of this method. It, in turn, ignores any information contained in the review text.

In an effort to obtain more information and to improve review rating prediction, the researchers whose work I follow here proposed a framework combining review text content with the user similarity matrix. They then ran experiments on two movie review datasets to examine the efficiency of their hypothesis. Their results showed that the framework indeed improved review rating prediction. This article describes my attempt to follow the work done in their research through examples from the Amazon reviews dataset. The notebook documenting this work is available here, and I encourage you to run the code on your computer and report the results.

The data

The dataset used here was made available by Dr. Julian McAuley of UCSD. It contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014. The product reviews dataset contains the user ID, product ID, rating, helpfulness votes, and review text for each review.
The data can be found here.
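As a quick orientation, here is a minimal sketch of loading one product category into pandas. The file name is hypothetical, and I assume the raw data was exported to CSV beforehand (which would explain why the 'helpful' column is parsed with eval in the preprocessing step below):

import pandas as pd

# Hypothetical file name: one product category exported to CSV
reviews_df = pd.read_csv('amazon_reviews.csv')
print(reviews_df[['reviewerID', 'asin', 'overall', 'helpful', 'reviewText']].head())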

The hypothesis

In this work, my goal was to check the researchers' thesis, not to find the best model for the problem. I will try to show that combining previously known data about each user's similarity to other users with the sentiment analysis of the review text itself helps improve the model's prediction of what rating the user's review will get.


The workflow

As a first step, I will perform RRP based on RTC analysis. The next step will be to apply a neighbor analysis to perform RRP based on the similarity between users. The final step will be to compare the three methods (RRP based on RTC, RRP based on neighbor analysis, and the combination of the two) and to check the hypothesis.

Preprocessing

Preprocessing is a key step in any analysis, and this project is no exception.
The head of the primary table is as follows:

(figure: head of the primary table)

First, I deleted rows with no review text, duplicate rows, and extra columns that I would not use.
The second step was to create a column containing the result of dividing the helpfulness numerator by the helpfulness denominator, and then to segment these values into bins. It looked like this:

import pandas as pd
import numpy as np

# Drop reviews with no text, duplicate rows, and the leftover index column
reviews_df = reviews_df[~pd.isnull(reviews_df['reviewText'])]
reviews_df.drop_duplicates(subset=['reviewerID', 'asin', 'unixReviewTime'], inplace=True)
reviews_df.drop('Unnamed: 0', axis=1, inplace=True)
reviews_df.reset_index(inplace=True)

# The 'helpful' column holds a "[numerator, denominator]" string, e.g. "[3, 5]"
reviews_df['helpful_numerator'] = reviews_df['helpful'].apply(lambda x: eval(x)[0])
reviews_df['helpful_denominator'] = reviews_df['helpful'].apply(lambda x: eval(x)[1])
reviews_df['helpful%'] = np.where(reviews_df['helpful_denominator'] > 0,
                                  reviews_df['helpful_numerator'] / reviews_df['helpful_denominator'], -1)

# Segment the helpfulness ratio into bins ('empty' marks reviews with no votes)
reviews_df['helpfulness_range'] = pd.cut(x=reviews_df['helpful%'],
                                         bins=[-1, 0, 0.2, 0.4, 0.6, 0.8, 1.0],
                                         labels=['empty', '1', '2', '3', '4', '5'],
                                         include_lowest=True)

The last step was to create a text processor that extracted the meaningful words from the messy review text.

import string
from nltk.corpus import stopwords

def text_process(reviewText):
    # Remove punctuation characters
    nopunc = [char for char in reviewText if char not in string.punctuation]
    # Rejoin the characters and convert to lowercase
    nopunc_text = ''.join(nopunc).lower()
    # Drop English stop words
    return [word for word in nopunc_text.split() if word not in stopwords.words('english')]

After being applied, this function has:
1. Removed punctuation
2. Converted the text to lowercase
3. Removed stop words (words that are not relevant in the context of training the model)
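A quick usage example (this assumes the NLTK stopword corpus has already been downloaded via nltk.download('stopwords')):

text_process("This is a GREAT product, really!")
# -> ['great', 'product', 'really']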

A look at the data

The head of the primary table, after all the preprocessing, looks like this:

(figure: head of the preprocessed table)

The figures below show how the users' helpfulness range is distributed over the product ratings:

(figures: heatmap and bar plot of helpfulness range vs. product rating)

One can easily see the bias towards higher ratings. This phenomenon is well known and is also supported by the same survey mentioned above. According to that survey:

“Reviews are increasingly shifting from being a place where consumers air their grievances to being a place to recommend items after a positive experience”.

The problem of this skewed data was addressed later in the pipeline with resampling methods.
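The resampling itself lives in the notebook; below is a minimal sketch of one possible approach, upsampling the smaller classes with scikit-learn. The method actually used in the notebook may differ:

from sklearn.utils import resample
import pandas as pd

# Upsample every helpfulness class to the size of the largest class (illustrative only)
majority_size = reviews_df['helpfulness_range'].value_counts().max()
balanced_parts = [resample(group, replace=True, n_samples=majority_size, random_state=42)
                  for _, group in reviews_df.groupby('helpfulness_range')]
balanced_df = pd.concat(balanced_parts)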

Step one: RRP based on Review Text Content

The Models

In order to check and choose the best model, I constructed a pipeline that performs the following steps. The pipeline first applies TF-IDF term weighting and vectorizing and then runs the classification algorithm. In general, TF-IDF will process the text using the "text_process" function from above, convert the processed text to a count vector, and then apply a calculation that assigns a higher weight to rarer, more informative words.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

pipeline = Pipeline([
    ('Tf-Idf', TfidfVectorizer(ngram_range=(1, 2), analyzer=text_process)),
    ('classifier', MultinomialNB())
])
X = reviews_df['reviewText']
y = reviews_df['helpfulness_range']
review_train, review_test, label_train, label_test = train_test_split(X, y, test_size=0.5)
pipeline.fit(review_train, label_train)
pip_pred = pipeline.predict(review_test)
print(metrics.classification_report(label_test, pip_pred))

Note that I chose ngram_range=(1, 2) and that the algorithm was Multinomial Naïve Bayes. Those decisions were made based on the results of a cross-validation test. The full cross-validation test is beyond the scope of this article, but you can find it in the notebook; a sketch of such a comparison appears below.
The models checked were:
1. Multinomial logistic regression, as a benchmark
2. Multinomial Naïve Bayes
3. Decision Tree
4. Random forest

Multinomial Naïve Bayes gave the best accuracy score¹ (0.61), and the predictions made by it were therefore chosen to represent RRP based on RTC.
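For illustration, here is a minimal sketch of how such a comparison could be run with cross-validation. The candidate models and parameters shown are my assumptions, not the exact notebook code:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Vectorize once, then score each candidate with 5-fold cross-validation
X_tfidf = TfidfVectorizer(ngram_range=(1, 2), analyzer=text_process).fit_transform(X)
candidates = {'Logistic regression': LogisticRegression(max_iter=1000),
              'Multinomial Naive Bayes': MultinomialNB(),
              'Decision tree': DecisionTreeClassifier(),
              'Random forest': RandomForestClassifier()}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_tfidf, y, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy = {scores.mean():.2f}')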

The final part of this step is to export the predictions made by the chosen model to a CSV file:

# Save the test reviews and the corresponding NB predictions for the combination step
rev_test_pred_NB_df = pd.DataFrame(data={'review_test': review_test, 'prediction': pip_pred})
rev_test_pred_NB_df.to_csv('rev_test_pred_NB_df.csv')

Step two: RRP based on User Similarity

Preprocessing

In this step, the user-item matrix is constructed; it is the basis on which I will calculate the cosine similarity between users. Some problems occurred when I constructed the matrix using the names of the items, but they were solved by converting the names to a sequence of unique integers (similar to the IDENTITY property in SQL).

# Map each reviewer ID and each item ID to a unique integer
temp_df = pd.DataFrame(np.unique(reviewers_rating_df['reviewerID']), columns=['unique_ID'])
temp_df['unique_asin'] = pd.Series(np.unique(reviewers_rating_df['asin']))
temp_df['unique_ID_int'] = range(20000, 35998)
temp_df['unique_asin_int'] = range(1, 15999)

reviewers_rating_df = pd.merge(reviewers_rating_df, temp_df.drop(['unique_asin', 'unique_asin_int'], axis=1),
                               left_on='reviewerID', right_on='unique_ID')
reviewers_rating_df = pd.merge(reviewers_rating_df, temp_df.drop(['unique_ID', 'unique_ID_int'], axis=1),
                               left_on='asin', right_on='unique_asin')
reviewers_rating_df['overall_rating'] = reviewers_rating_df['overall']
id_asin_helpfulness_df = reviewers_rating_df[['reviewerID', 'unique_ID_int', 'helpfulness_range']].copy()

# Drop the columns that are no longer in use
reviewers_rating_df.drop(['asin', 'unique_asin', 'reviewerID', 'unique_ID', 'overall', 'helpfulness_range'],
                         axis=1, inplace=True)

Constructing the matrix: I used pivot to bring the data into the proper shape and then "csr_matrix" to convert it to a sparse matrix in order to save processing time.

from scipy import sparse

# Pivot to a user-by-item rating matrix, fill missing entries with zeros, then sparsify
matrix = reviewers_rating_df.pivot(index='unique_ID_int', columns='unique_asin_int', values='overall_rating')
matrix = matrix.fillna(0)
user_item_matrix = sparse.csr_matrix(matrix.values)

KNN model

I used the K-Nearest Neighbors algorithm to produce the neighbor analysis. The KNN model is easy to implement and to interpret. The similarity measure was cosine similarity, and the number of desired neighbors was ten.

from sklearn.neighbors import NearestNeighbors

# n_neighbors=11: kneighbors returns each user as its own first neighbor, so we ask for one extra
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=11)
model_knn.fit(user_item_matrix)

After the training stage, I extracted the list of neighbors and stored it as a NumPy array. That yielded a 2-D array of users and the ten users most similar to each of them.

# Row i holds the indices of the users most similar to user i (user i itself appears first)
neighbors = np.asarray(model_knn.kneighbors(user_item_matrix, return_distance=False))

The next step was to grab the ten closest neighbors and store them in a dataframe:

unique_id = []
k_neigh = []
for i in range(15998):
    unique_id.append(i + 20000)
    k_neigh.append(list(neighbors[i][1:11]))  # skip index 0 (the user itself); keep the ten closest neighbors

neighbors_df = pd.DataFrame(data={'unique_ID_int': unique_id,
                                  'k_neigh': k_neigh})
id_asin_helpfulness_df = pd.merge(id_asin_helpfulness_df, neighbors_df, on='unique_ID_int')
# Placeholder column, filled with the neighbor-based scores below
id_asin_helpfulness_df['neigh_based_helpf'] = id_asin_helpfulness_df['unique_ID_int']

Finally, to calculate the mean score of the reviews written by each user's ten closest neighbors, I coded a nested loop that iterates through each row. The inner loop iterates over each of the user's ten neighbors, collects the scores their reviews got, and stores the mean of those scores.

for index, row in id_asin_helpfulness_df.iterrows():
    neigh_list = row['k_neigh']
    scores = []
    for i in neigh_list:
        # Look up the helpfulness score of each neighbor's review
        p = id_asin_helpfulness_df.loc[i]['helpfulness_range']
        scores.append(p)
    # The neighbor-based prediction is the mean of the neighbors' scores
    id_asin_helpfulness_df.loc[index, 'neigh_based_helpf'] = np.nanmean(scores)

Step three: Combination


As a third step, I exported the results of the calculation above and merged them with the predictions from the chosen model. I then had a file consisting of four columns:
1) The original reviews
2) The scores they got (the ground truth)
3) The predictions from the first step (the NLP approach)
4) The predictions from the second step (the user similarity approach)
The two methods can be combined in many different ways. In this work I chose the simple arithmetic average, but other methods would work as well. In addition to the four columns above, I now have a fifth column:
5) The arithmetic average of each row in columns 3) and 4)
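A minimal sketch of this combination step, assuming the merged file is named 'combined_df.csv' (a hypothetical name) and that the prediction columns carry the names used in the results below:

import pandas as pd

combined_df = pd.read_csv('combined_df.csv')  # hypothetical merged file with the four columns above
# Column 5: the simple arithmetic average of the two predictions (cast to numeric first)
combined_df['combined_prediction'] = (combined_df['neigh_based_helpf'].astype(float)
                                      + combined_df['NBprediction'].astype(float)) / 2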

Final step: Reporting

The metric used for comparing the models was Root Mean Squared Error (RMSE), a common and reliable measure for comparing models. In addition, I chose to present the Mean Absolute Error (MAE) because it uses the same scale as the data being measured and can therefore be interpreted easily. The results² are shown below:

RMSE for neigh_based_helpf: 1.0338002581383618
RMSE for NBprediction: 1.074619472976386
RMSE for the combination of the two methods: 0.9920521481819871
MAE for the combined prediction: 0.6618020568763793

The RMSE for the combined method was lower than the RMSE of each method alone.
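For completeness, here is a minimal sketch of how these figures can be computed with scikit-learn, reusing the hypothetical combined_df from the sketch above:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Assumes the ground-truth column is numeric (e.g. rows with no helpfulness votes dropped beforehand)
y_true = combined_df['helpfulness_range'].astype(float)
rmse = np.sqrt(mean_squared_error(y_true, combined_df['combined_prediction']))
mae = mean_absolute_error(y_true, combined_df['combined_prediction'])
print(f'RMSE: {rmse:.3f}, MAE: {mae:.3f}')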

Conclusion

In conclusion, the results support my thesis: combining formerly known data about each user's similarity to other users with the sentiment analysis of the review text itself does help improve the model's prediction of what rating the user's review will get.

¹ The goal of this paper was to compare the methods and to see whether the framework offered by the researchers would improve prediction accuracy, not to find the most accurate model for RRP based on RTC.

² Although an MAE of 0.66 is not impressive, the main aim of this work was to check the hypothesis, not necessarily to seek the best RRP model.
