Wine Ratings Prediction using Machine Learning

Olivier Goutay
Towards Data Science
5 min read · Jun 15, 2018


Photo by Hermes Rivera on Unsplash

Not a day goes by without hearing “Machine Learning”, “Deep Learning” or “AI” from a colleague, Hacker News, etc…the hype is super strong nowadays!

After reading through a few books, articles and tutorials about ML, I wanted to graduate from the beginner theory level. I needed to experiment on a real-life example.
It always works better when the subject is something I’m passionate about. So for this practice, I picked wine (><).
Wine is awesome, I have to say it! Can I say it? Well, it’s awesome!

Through all these years of drinking wine, there’s been one thing I always sought before buying a bottle: ratings. In all forms: points, descriptions, etc…

A simple goal was set: is it possible, through machine learning, to predict a wine’s rating (in points) based on its description?
Some people call this sentiment analysis, or text analysis. Let’s start!

Dataset

Ok, I have to admit, I was lazy. I didn’t want to write a scraper for a wine magazine like Robert Parker or Wine Spectator…
Luckily though, after a few Google searches, the providential dataset was served on a silver platter: a collection of 130k wines (with their ratings, descriptions and prices, just to name a few fields) from WineMag.

By the way, thanks to zackthoutt for this awesome dataset.

First look at the data

As usual with a new dataset, the first step is to remove duplicates and NaN (null) values:
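The original code block didn’t survive the export; a minimal sketch of that cleanup in pandas might look like this (toy rows stand in for the real 130k reviews, and the `description`/`points` column names are assumed from the WineMag dataset):

```python
import pandas as pd

# Toy stand-in for the WineMag CSV: one duplicate review and one NaN
data = pd.DataFrame({
    "description": ["Ripe and fruity", "Ripe and fruity", None, "Earthy, firm tannins"],
    "points": [87, 87, 90, 92],
})

# Drop exact duplicate reviews, then rows missing a description or rating
data = data.drop_duplicates("description")
data = data.dropna(subset=["description", "points"])

print(len(data))  # 2: one duplicate and one NaN row removed
```

On the real file, the same two calls are what take the 130k rows down to the 92k kept below.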

We’re left with 92k wine reviews, which is plenty to play with!
Let’s now look at the distribution of our dataset; in our case, the number of wines per point value:
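The distribution chart itself is gone from this export, but it is just a count of wines per point value; a quick sketch (with a toy series in place of the real data):

```python
import pandas as pd

# Toy points column; the real one has 92k values between 80 and 100
points = pd.Series([83, 87, 87, 88, 90, 90, 90, 93, 96])

distribution = points.value_counts().sort_index()
print(distribution)
# A bar chart of this is one line: distribution.plot.bar()
```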

A lot of wines sit between 83 and 93 points, which also matches the market (there aren’t a lot of excellent wines).

As a funny note, just by reading through some of the data, I found out that the better the wine, the longer the description seemed to be. It’s kinda logical that people would be eager to leave a longer comment on a wine they really appreciated, but I did not think the effect would be this significant in the data:

THE model

It looks like our dataset has too many possible target values, which would probably burden the predictions. A 90-point wine is not that different from a 91-point wine after all, so the descriptions are probably not that different either.

Let’s try to simplify the model down to 5 classes:
1 -> Points 80 to 84 (below-average wines)
2 -> Points 84 to 88 (average wines)
3 -> Points 88 to 92 (good wines)
4 -> Points 92 to 96 (very good wines)
5 -> Points 96 to 100 (excellent wines)
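This bucketing is a one-liner with `pd.cut`; a sketch, assuming right-inclusive bin edges so each point value lands in exactly one class:

```python
import pandas as pd

points = pd.Series([80, 83, 86, 90, 94, 98, 100])

# Five buckets matching the article's ranges; right edge inclusive by default
quality = pd.cut(points, bins=[79, 84, 88, 92, 96, 100], labels=[1, 2, 3, 4, 5])
print(quality.tolist())  # [1, 1, 2, 3, 4, 5, 5]
```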

Now let’s look at our new distribution:

Vectorization

One of the simplest methods for classifying text with ML nowadays is called bag-of-words, or vectorization.

Basically, you represent each text as a vector of weights (number of occurrences, etc…) so that your classification algorithm can interpret it.

A few vectorization algorithms are available, the most famous (to my knowledge) being:
- CountVectorizer: weights are simply word counts, as stated by its name
- TF-IDF Vectorizer: the weight increases proportionally to the count, but is offset by the frequency of the word across the whole corpus. This offset is the IDF (Inverse Document Frequency), and it lets the vectorizer down-weight very frequent words like “the”, “a”, etc…
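Both are available in scikit-learn with the same fit/transform interface; a small sketch on two made-up reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the wine is fruity", "the wine is tannic and earthy"]

# Same vocabulary either way; only the weights differ
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.shape, tfidf.shape)  # (2, 7) (2, 7): 2 docs, 7 distinct words
```

Words shared by every document (“the”, “wine”, “is”) get lower TF-IDF weights than words unique to one review, which is exactly the adjustment described above.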

Training and testing the model

In machine learning, this is the last part of an experiment.
You train your model on a percentage of your dataset, then test its accuracy by comparing its predictions against the remainder of the dataset.

For this experiment, 90% of the dataset will be used for training (about 80k wines). 10% of the dataset will be used for testing (about 9k wines).
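That 90/10 split is one call to scikit-learn’s `train_test_split`; a sketch with stand-in data:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))           # stand-in for the vectorized descriptions
y = [i % 5 + 1 for i in X]     # stand-in for the 1-5 quality classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)

print(len(X_train), len(X_test))  # 90 10
```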

The classifier we’ll be using is a RandomForestClassifier (RFC), because it’s cool and it works well in a lot of situations (><).
Seriously though, RFC is not as performant (memory- and CPU-wise) as some other classifiers, but I’ve always found it very effective on small datasets.
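Wired together, vectorizer plus classifier is only a few lines; a minimal end-to-end sketch on made-up reviews (the real notebook trains on the 80k vectorized descriptions instead):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, trivially separable reviews standing in for the real dataset
descriptions = [
    "thin, watery and short",
    "pleasant red fruit, easy drinking",
    "complex, layered, remarkable depth and a very long finish",
] * 10
quality = [1, 3, 5] * 10

X = TfidfVectorizer().fit_transform(descriptions)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, quality)
print(clf.predict(X[:3]))
```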

Results

Sugoiiii! Those are some amazing results! 97% of the time we were able to correctly predict the quality class of a wine based only on its description.
Let’s go quickly through these numbers and their meanings:
- Precision: 0.97 -> we did not have a lot of false positives
- Recall: 0.97 -> we did not have a lot of false negatives
(the F1-score combines both precision and recall)
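Numbers like these come from comparing the test-set predictions against the true labels; scikit-learn computes them directly (the labels below are hypothetical, not the article’s real output):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 2, 3, 3, 4, 5]
y_pred = [1, 2, 3, 3, 4, 4]  # one misclassified wine

# Weighted averaging matches multi-class reports like the one above
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted")
print(precision, recall)
```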

End note

Those results are pretty awesome, but we could definitely improve them:
- All the data (training and test) come from WineMag. Adding ratings from other wine magazines would make the model more generic
- RFC is a great classifier, but it is pretty memory- and CPU-heavy. With a larger dataset, a Multinomial Naive Bayes might be just as good and more performant
- We did not look much at the other columns (region, price, etc…). We could binarize / encode them and feed them to the classifier
- Publishing the code as a Flask or Django API would be a nice thing to do

Kaggle Notebook

The full dataset and Python code are available at:
https://www.kaggle.com/olivierg13/wine-ratings-analysis-w-supervised-ml

See you soon in a winery! (><)
