
How I built a Cannabis Recommendation app using Topic Models and Latent Dirichlet Allocation (LDA)

This is a non-technical article that will appeal to entrepreneurs, developers and data scientists!

How I built www.rightstrain.co, a cannabis recommendation tool used by online dispensaries

Background:

On October 17th, 2018, cannabis became legal in Canada.

As an entrepreneur, I’m always reading about the latest tech startups, following how markets are developing and sniffing out emerging opportunities. As a data scientist, I’m always looking for data driven solutions to problems I’ve identified.

As a resident of Toronto, naturally, I began to look at the cannabis market.

The following article is a brief summary of how I built a cannabis recommendation system. I’ll keep it only lightly technical, so those who would like to build a similar app can follow along – while those who are just curious will still find it an enjoyable read.

The Problem:

In 2013, talks began about legalization of cannabis in Canada. It was at this time I knew legalization was going to happen, so I became an early stage investor and kept up with the market.

Fast forward to legalization in 2018, leveraging my network in the industry, I got on the phone with a number of cannabis dispensaries and learned that they all had one problem in common.

When customers made orders online, and found that their favorite item (e.g., a certain cannabis strain) was sold-out, they would stop shopping and find the same item at another retailer.

I learned that this is called shopping cart abandonment, which, according to Statista, costs e-commerce stores over 75% of their sales.

Doing further research, I found that Amazon attributes 35% of its revenue to recommender systems.

So I built a recommender system for cannabis products.

Building a recommender system 101:

What is a recommender system?

Basically, for e-commerce purposes, a recommender system finds products similar to the ones the shopper is purchasing and recommends them in the hope of increasing sales.

So, how does it do this?

Imagine you own an e-commerce shop on Amazon that sells home accessories. You notice that when customers purchase, say, toilet paper (A), they also always purchase towels (B). Further, you find that this pattern repeats itself across different customer segments, with some variation.

This variation occurs when other customers purchase products A and B along with a new product, C.

Knowing this pattern, we can now recommend product C to customers who traditionally purchase only products A and B.

That’s it!

How to build a recommender system:

First, you need to start with data.

In my case, I wanted to build a recommender system for cannabis products, so I needed data on the various cannabis strains on the market. Doing a quick search, I found a number of databases that contain reviews on cannabis strains.

In order to get at the data, I wrote a simple Python scraper and tunneled it through the Tor network. This allows the scraper to run continuously and reduces the chance that your IP will be banned. Please see my write-up on developing a Python scraping tool (TBA).

IMPORTANT NOTE: when scraping data from the internet, please respect the server you’re scraping from, as you can easily overwhelm a server that isn’t prepared to handle the amount of requests you throw at it.
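To make the setup above concrete, here is a minimal sketch of a polite, Tor-routed scraper. It assumes a local Tor client listening on the default SOCKS port 9050, a made-up User-Agent string, and hypothetical review URLs – and it deliberately sleeps between requests so it never overwhelms the server:

```python
import time
import requests

def make_tor_session() -> requests.Session:
    """Return a requests session whose traffic is routed through Tor.

    Assumes a local Tor client on port 9050 (requires `requests[socks]`).
    """
    session = requests.Session()
    session.proxies = {
        # socks5h = DNS lookups are also resolved through Tor
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }
    # Identify yourself honestly; the name here is a placeholder.
    session.headers["User-Agent"] = "strain-research-bot/0.1"
    return session

def fetch_pages(session: requests.Session, urls, delay: float = 2.0):
    """Fetch each page with a fixed delay, so the server is never hammered."""
    pages = []
    for url in urls:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        pages.append(resp.text)
        time.sleep(delay)  # be polite: at most one request every `delay` seconds
    return pages
```

The fixed delay is the simplest form of rate limiting; in practice you would also want retries with backoff and to honour the site’s robots.txt.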

Exploratory Data Analysis:

I managed to scrape 500k reviews that contained both a 1–5 star rating and a comment. I figured I could create a recommender system by looking at the patterns of ratings users gave to each strain (see previous section for the logic behind this).

However, looking at the data:

Figure 1: Ratings of cannabis strains (x-axis) vs number of reviews (y-axis)

I found a severe range restriction. In other words, the majority of the reviews were in the range of 4–5 stars.

I guess cannabis enthusiasts are easily impressed 🙂

Given this finding, I decided against using the quantitative ratings, because the recommender system would basically be recommending other strains with high ratings – and virtually all strains were rated highly.

Which would defeat the purpose of a recommender system.

Enter Natural Language Processing:

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence that focuses on enabling computers to understand and process human languages.

So how does a computer read human languages?

Let’s start with understanding how computers… compute. Basically, computers think in terms of binary, 1s and 0s. This makes things complicated because words aren’t numbers.

So how does NLP work?

In its simplest form, NLP works by transforming words into numbers! For instance, if we had a document with three words in it, "dog eats food", each word would be converted into a vector (think of it as a string of numbers). So the word "dog" may be represented by (1,0,1,1,1,0) and "eats" by (1,1,0,0,0,0), and so on. Once all words are vectorized, a computer can recognize that the vector (1,0,1,1,1,0) represents the word "dog".

The tricky thing now is getting the computer to understand meaning. This topic is out of the scope of the current article, but I may follow up with another article on it. Although NLP can successfully vectorize words so computers can recognize them, it is extremely difficult to get computers to understand the meanings of words. If you’re interested in this topic, read more about Deep Learning and NLP.

Topic Modelling & Latent Dirichlet Allocation:

Now that all our text data has been vectorized, we begin looking for patterns in the data.

Topic modelling is perfect for this type of task. It’s a statistical modelling technique used to discover the abstract "topics" that occur in a collection of documents. A very simple explanation is that it combs through a document and recognizes: 1) the most frequently appearing words and 2) words that appear next to those frequently appearing words. The logic here is that if these words always appear together, they must form some sort of topic.

Now you’re probably wondering, how many topics does the algorithm create?

Well, that’s up to you.

The art of topic modelling comes into play when you choose how many topics you want to keep in your model.

I generally look at two things: 1) the Coherence Value and 2) The Intruder test. Let’s elaborate a bit more on the two.

The Coherence Value can be thought of as the probability of each topic being a "good" topic. To read more about it, check out this great article on Coherence Values.

In order to choose the best fitting model, you need to qualitatively evaluate each topic using the Intruder Method.

Above, I’ve plotted the number of topics against their corresponding coherence values. Notice the large drop after 14 topics. According to coherence values, the optimal model here would be 14 topics; however, an 8-topic model reduces the coherence value by only 1 point. For the sake of parsimony and explanatory power, I always choose to stick with the simpler model.

So what’s the Intruder Method?

The Intruder test is a great follow-up to using Coherence Values. Once you determine how many topics you want, you look at the topics individually and assess them qualitatively. In other words, you want to be asking: "What words don’t belong in these topics?"

Let’s look at Topic 1 (or 0 in this case). It’s a bit confusing because we see words like "great" and "favorite", but also words that appear opposite in meaning, like "stress" and "depression". This is an example of a topic that isn’t very interpretable by humans but scores high under Latent Dirichlet Allocation (LDA), meaning I’ll have to fine-tune the model’s hyperparameters to get a better output.

Let’s look at topic 2 (model 1). This topic is a bit clearer about what it’s getting at; it looks like it could refer to something like "likeability".

You repeat this for every topic in your model, evaluating each one individually and looking for words that may or may not fit.

Creating the Recommender System:

Although it may not be clear from the above, my final model produced 8 topics with some very interesting insights. For instance, I found that cannabis consumers enjoy smoking for a few reasons: 1) some people smoke because they enjoy the flavors and aromas of cannabis, 2) others do so because it makes them feel creative, 3) another segment does so because it makes them feel energized, and finally 4) the majority of users do so because it helps with pain relief.

That’s so cool!

Essentially, what the topic models did was separate my data into customer segments. If I was in the business of marketing and/or writing copy, I’d be better able to target customer segments with this information.

Anyways, back to the Data Science.

Using these 8 topics, I estimated how much of each strain’s reviews was devoted to each topic. Doing so gave me 8 features on which to separate my strains. In other words, I created a dataset describing how each strain (I was working with over 200 strains) differed on the topics I found (e.g., some strains reportedly encouraged more creative thinking, while others increased energy).

Once this was complete, I was ready to create the recommender system based on similarities.

Choosing a similarity metric:

Now that we have our strains described by 8 different features (i.e., topics), it’s time to choose how we recommend them. There are a number of similarity metrics that can be used (e.g., Cosine Similarity, Euclidean Distance, Manhattan Distance).

The important considerations when choosing a distance metric are: 1) how does it behave in high-dimensional space? (e.g., Euclidean distance begins to fail in high dimensions) and 2) how accurate are the resulting recommendations?
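With cosine similarity, the final recommendation step is only a few lines. The strain names and topic proportions below are illustrative, not output from the real model:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy strain-by-topic matrix (rows: strains, columns: topic proportions).
strains = ["Strain A", "Strain B", "Strain C"]
topics = np.array([
    [0.8, 0.1, 0.1],  # mostly "flavor/aroma"
    [0.7, 0.2, 0.1],  # similar profile to Strain A
    [0.1, 0.1, 0.8],  # mostly "pain relief"
])

sim = cosine_similarity(topics)  # 3x3 matrix of pairwise similarities
np.fill_diagonal(sim, -1)        # never recommend the item itself

def recommend(strain: str) -> str:
    """Return the most similar other strain."""
    i = strains.index(strain)
    return strains[int(sim[i].argmax())]

print(recommend("Strain A"))  # -> Strain B
```

Because cosine similarity compares the direction of the topic vectors rather than their magnitude, two strains with the same profile are close even if one has far more reviews behind it.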

This brings us to the question, how do we validate a recommender system?

There are a number of ways to do so. I think the best bang for your buck is to back-test your data. For example, say our dataset contains data from customer A, who purchased products A, B, C, and D.

One method of validation would be to use data from customer A to predict what they would’ve purchased next after product A, then validate that prediction against what they actually purchased!
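The back-test described above can be sketched as a simple hold-out loop: hide each customer’s last purchase, ask the recommender to predict it from the rest, and measure the hit rate. The customers and the stand-in recommender below are entirely made up:

```python
# Toy purchase histories (last item is what we try to predict).
purchase_history = {
    "customer_1": ["A", "B", "C"],
    "customer_2": ["A", "B", "C"],
    "customer_3": ["A", "D", "B"],
}

def recommend(basket):
    """Stand-in for any trained item-to-item recommender."""
    co_purchase = {("A", "B"): "C", ("A", "D"): "B"}
    return co_purchase.get(tuple(basket))

hits = 0
for customer, items in purchase_history.items():
    seen, held_out = items[:-1], items[-1]  # hide the final purchase
    if recommend(seen) == held_out:
        hits += 1

hit_rate = hits / len(purchase_history)
print(f"hit rate: {hit_rate:.2f}")
```

In a real evaluation you would plug your similarity-based recommender into `recommend` and compare hit rates across candidate similarity metrics.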

Conclusion

I used topic modelling and an LDA approach to find customer segments in the emerging cannabis market. From this, I created a recommendation system. The most difficult part of this project was procuring and cleaning the data – something that is common in all data science projects.

If I get enough interest in the article, I’ll write a technical post where I can share my code of my MVP. The product has since evolved and is currently being used by a number of dispensaries! Check it out here: www.rightstrain.co

If you would like to see a technical post, let me know in the comments!
