Dionysos 2.0: Wine Recommender System and Interactive Tastespace
How to match a person's tastes with wines they'll potentially like (plus: the most advanced visualization of wines according to their taste)
Hello there. Today I will follow up on my previous adventure in the world of wine and what can be done with its data. Indeed, in my previous post's closing remarks I wondered how interesting it would be to have an individual taste profile for each bottle, which could then be matched to the taste preferences of a person. It turns out there's no need to work at a wine-selling website to do this. By the way, it's called a "recommender system".
The style of today's anecdote will differ. It won't be a coding tutorial, as there are plenty of those, made by more qualified teachers than me. Still, it may be of some interest to know that everything below can be done in Python, and there are many free resources online to learn how.
Anyhow, I'd like to talk about two things. The first is how to match a person's tastes to wines they'll potentially like, which turns out to be a mind-numbingly simple process. The second is an innovative, interactive, data-driven map of wines according to their taste, as judged by hundreds of thousands of users. To my knowledge, this is the first time a map like this has been made, so I'm quite excited to present it.
But first things first: we need data about wines, specifically about individual bottles of wine to which people assign tastes. There isn't such a dataset out there in the wild, so it has to be built from scratch. One way to do this is by coding a "webscraper" to scrape such data off a wine-selling website. Now, this is technically not allowed, but there are plenty of blog posts out there that openly say they've done it, so count me in too. Also, in a romantic way it could be seen as a modern twist on the Robin Hood legend, where the websites that extract data from us every day are themselves the victims of the scheme. Immodest literary comparisons aside, I'm not earning anything from this, so it should be alright.
Thus, we magically have a dataset of wines with their taste features estimated, which is pretty neat. Obviously, to mitigate noise in the data, we'll pick some threshold and include only wines that have a certain number of user votes. Then we'll need the taste profile of a user, which consists of all the ratings he or she has left on bottles of wine, each with its own taste features. Luckily for us, this too is easily and freely accessible. I randomly picked a user as a case study, since he happens to like my favorite bottle of wine, thereby clearly having good taste. We'll call him "Em".
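The vote-count filter is a one-liner in pandas. A minimal sketch, with made-up column names ("wine_id", "taste_votes") and a made-up threshold standing in for the real dataset:

```python
import pandas as pd

# Hypothetical columns: "wine_id" and "taste_votes" (total user taste votes per bottle).
wines = pd.DataFrame({
    "wine_id": [1, 2, 3, 4],
    "taste_votes": [12, 250, 87, 1040],
})

MIN_VOTES = 50  # arbitrary threshold; tune it to your data
reliable = wines[wines["taste_votes"] >= MIN_VOTES]
print(reliable["wine_id"].tolist())
```

Bottles with only a handful of votes get dropped before any modeling happens.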
Em likes a lot of different wines, whose taste features can differ considerably, so we'll narrow the model down to his favorite type: Bordeaux Reds. The next step consists in running a linear regression (that is, "a machine learning algorithm" in marketing speak) on his user data, and then generalizing our model to the rest of the bottles in the wine data.
Let's walk through the process:
First, there are way too many taste features assigned to the wines, so we'll need to compress them without losing signal. These features are also highly collinear, which essentially means they're correlated, and that's bad for the model. As a simple example, consider flavors such as "coffee", "mocha" and "espresso": it makes sense that votes from various users will be scattered across all three features while actually coming from the same origin, a coffee-like taste. The sole thought of dealing with this issue manually is daunting. Luckily, there are better options; one of these is Principal Component Analysis (PCA).
What PCA does, in a very non-technical explanation, is "squash" the features in the dataset while preserving their descriptiveness. Exactly what we need! Thanks to PCA, we can go from over 200 taste features to just 20, as it effectively discovers new, prototypical tastes along the way. The drawback is that these compressed features aren't as interpretable as the original ones; however, that isn't an issue here, and you'll see why in a moment.
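In scikit-learn this compression is a few lines. A sketch on synthetic stand-in data (500 wines, 200 raw taste features; the real dataset's shapes will differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((500, 200))        # 500 wines x 200 raw taste features (synthetic stand-in)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=20)        # squash 200 features down to 20 components
X_compressed = pca.fit_transform(X_scaled)

print(X_compressed.shape)                   # (500, 20)
print(pca.explained_variance_ratio_.sum())  # fraction of variance the 20 components keep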
We apply PCA to all the data together, that is, the user (or "train") dataset and the wines (or "test") dataset. Otherwise, the compressed features wouldn't match between datasets. Then we can run the linear regression, or "hypothesis", on the user data, with the rating score as the dependent variable. What this does is fit a line between the user's ratings and the compressed taste features. By doing so, it "learns" what numbers to assign to the coefficients that get multiplied with the values of the taste features to determine the ratings. If that wasn't 100% clear, and I have a feeling it might not be, I'll try to make it so with the good ol' linear regression formula:
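Here it is, with ŷ as Em's predicted rating for a bottle, x₁ … x₂₀ the bottle's 20 PCA-compressed taste features, and θ₀ … θ₂₀ the coefficients learned from his ratings:

ŷ = θ₀ + θ₁·x₁ + θ₂·x₂ + … + θ₂₀·x₂₀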
Anyhow, as you can see, these betas/thetas/coefficients we obtained can be multiplied with the values in the corresponding taste features of any wine, giving Em's predicted rating for that bottle.
In other words, they represent Em's unique combination of taste preferences for Bordeaux Reds. I mentioned it was going to be easy, didn't I?
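The fit-then-predict step, sketched with scikit-learn on synthetic stand-ins (random data in place of Em's 80-odd rated bottles and their PCA features; the real numbers come from the datasets above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins: 80 rated bottles x 20 PCA features, ratings on a 1-5 scale.
X_user = rng.random((80, 20))
y_user = rng.uniform(1, 5, size=80)

model = LinearRegression()
model.fit(X_user, y_user)          # "learns" the thetas from Em's ratings

# Apply the learned coefficients to unseen bottles (same 20 PCA features):
X_new = rng.random((307, 20))
predicted = model.predict(X_new)   # Em's predicted rating for each new bottle
print(predicted.shape)             # (307,)
```

`model.coef_` holds the 20 thetas, i.e. Em's taste profile in compressed-feature space.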
Obviously, this model should be applied to wines that are coherent with the training set. I did that for 307 bottles of the Bordeaux Médoc variety (Em's most rated). Below you can scroll and see the values of all the taste features of Em's potentially liked and disliked wines (i.e. those whose predicted rating is above or below certain thresholds). The names and prices of the wines have been purposely omitted.
We're looking at the original data before PCA compression, so the taste features still have meaning. Indeed, the results seem to suggest that Em would prefer earthy and smoky wines with blackcurrant and red-fruit hints and Petit Verdot grapes, while tending to dislike those with a strong chocolate and leathery taste and hints of tomato.
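That "which tastes separate the likes from the dislikes" comparison boils down to a difference of group means over the original features. A toy sketch, with two made-up taste columns and made-up predicted ratings in place of the real 307-bottle table:

```python
import pandas as pd

# Hypothetical pre-PCA taste columns plus the model's predicted rating per bottle.
df = pd.DataFrame({
    "earthy":    [0.8, 0.7, 0.1, 0.2],
    "chocolate": [0.1, 0.2, 0.9, 0.8],
    "predicted": [4.5, 4.2, 2.1, 2.4],
})

LIKE, DISLIKE = 4.0, 2.5           # arbitrary rating thresholds
liked    = df[df["predicted"] >= LIKE]
disliked = df[df["predicted"] <= DISLIKE]

# Positive values: tastes Em is predicted to prefer; negative: tastes he'd avoid.
diff = (liked.mean() - disliked.mean()).drop("predicted")
print(diff.sort_values(ascending=False))
```

On the toy numbers, "earthy" comes out positive and "chocolate" negative, mirroring the pattern described above.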
Of course, the only way to actually test this would be to have Em taste these wines and let us know what he thinks. As I don't know him personally, I'm keeping track of which wines I like and dislike so I can eventually experiment on myself. If you have this type of data and want to try, let me know!
Recently, one of the biggest wine-selling websites introduced such a taste-wine matching service for their users. I'm willing to bet it doesn't work too differently from what I presented here.
P.S. Machine learning aficionados will notice I skipped the "cross-validation" step, where part of the wines in the user data would be held out from training and used instead to assess the validity of the learnt betas, by comparing the predicted ratings against the actual ratings. Given the scope of this post, this step was glossed over, but in a more rigorous approach it would be fundamental: if we predicted the ratings poorly, some corrections would be warranted, such as considering another learning algorithm. The general principle stays the same, though!
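For the curious, the held-out check is a short detour with scikit-learn. A sketch on synthetic data (80 bottles with an artificial linear relation between features and ratings, so the model should score well by construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((80, 20))                    # Em's rated bottles, 20 PCA features each
y = 3 + X @ rng.normal(size=20) * 0.3       # synthetic ratings with real linear signal

# Hold out 20% of the rated bottles to evaluate the learned betas on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"held-out MSE: {mse:.6f}")           # a large error would mean: rethink the model
```

A high held-out error is exactly the signal that would have prompted trying another learning algorithm.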
Now that we've dealt with that, let us proceed to the second part of the post. It is linked to the first in the sense that we'll apply an evolved version of PCA to all the wines in the dataset. Think of it as PCA on steroids, which compresses the data while representing the various relationships between the entities as distances on a 2- (or 3-) dimensional map. In this case I am using UMAP, and if you're drawn to this sort of algorithm, I suggest you look into the work of Alex Telea, who is doing some rather interesting things on the topic.
Anyways, through this algorithm I made a meaningful, explorable map of wines arranged according to their taste, as judged by thousands of users. Or, to be more specific, a map of 3834 wines from 110 different specialties, arranged according to their scores on 169 different tastes, as voted by approximately 492,516 unique people.
Not to beef with any sommelier and their rigorous study of the subject, but this half-million-person hive mind is likely a more reliable estimator of the taste of wines than the brief descriptions generally found on bottles.
Let me present the interactive plot where you can explore this map. As far as I know, you'll be among the first to have the occasion to do so. It isn't the same as exploring a new continent, but it's still quite fascinating, no?
With some guidance from a friend knowledgeable in JavaScript, I included buttons that highlight the wines linked to the corresponding taste(s). These tastes could be assigned at the bottle-by-bottle level or to entire varieties. I opted for the latter, as it gives more insight into why the algorithm chose to arrange the wines in the 2-dimensional space as it did, even though the only information it used was the individual bottles' taste scores.
Unfortunately, I can't embed the interactive plot directly here, so you'll have to go to this link: https://dionysus-stempio.netlify.app/. In the meantime, here's a GIF preview:
Once again, the names and prices of the wines were purposely omitted, although I briefly tinkered with the idea of turning the plot into a tool where one could find inexpensive alternatives to high-luxury wines judged similarly in terms of taste. In the end, I decided not to: humans are notoriously bad at separating the price paid for a wine from its objective taste (proof here). Besides, this is true for other luxury goods as well (did someone say Apple or Gucci?).
I think it may be a neat idea to have a more visually embellished version of this map, perhaps even in 3 dimensions, at a museum like Bordeaux's Cité du Vin for visitors to play around with. I proposed this idea to them, but never heard back ¯\_(ツ)_/¯
This concludes the series on wine and its data, unless I get some new inspiration! Although few things are as motivating as good wine, I'll be looking out for new serendipities in the age of data.
Thanks for reading!
Bonus colorful figure, in case you were wondering which wine specialties are the most appreciated: