Topic Model Based Recommendation Systems

A very quick and (hopefully) easy-to-follow introduction to the intuition (and very low-level maths) involved in Topic Model Based Recommendation Systems.

--

Check out my GitHub for a working simple recommendation system based on Topic Modelling.

In today’s world, it sometimes feels like we are plagued with never-ending decisions, whether it is the Friday night movie or the next song to keep people dancing at an NYE party.

So how do recommendation systems actually work? In this article I’m going to explain one approach based on Topic Modelling using Latent Dirichlet Allocation (LDA).

Topic Modelling

Before we talk about how to model a topic, we need to first understand what a topic actually is.


This is not an intuitive idea to think about, so we will describe it in terms of collections of words.

If we have a collection of documents randomly selected from a database, we can imagine that some of the words contained in these documents may be semantically similar, or be related to the same area.

For example, if these documents were a collection of film reviews, we might be able to form groups of positive and negative reviews based on the words contained within them. Alternatively, we may wish to form collections of documents relating to Sci-Fi, Comedy, Romance and so on.

So, as we start to reorganise this collection of documents into many smaller collections, we begin to see that there are many layers of possibility, where each possibility is a different selection of topics.

You may then ask the question, how can we get a computer to organise these documents into topics and how do we know what topics it will pick? To answer this, we are going to think at a slightly deeper level…

The Less General Idea

In this article, I am going to describe one option for how to get a computer to perform topic modelling. However, there are many other algorithms, approaches and methodologies out in the wild.

Going back to our collection of documents and sticking with the approach of looking at the words contained within them, we can build up a vocabulary containing all of the unique words in the database.

Say we have 1000 unique words across 4 documents and we want to characterise each document by which words it contains (and how many times each one appears).


So now we can imagine that, for each document, we have a vector with dimension 1000 (one dimension for each unique word), where each position in the vector holds the count of how many times the corresponding word appears in the document.

For example, the first position in the vector corresponds to the first word in the vocabulary, which we will say is “Robot”. The first document is a film review about Terminator 23 (or whatever number we are on now…) and so the word “Robot” is mentioned 19 times. Therefore in the first position of the vector corresponding to the first document, we have (19,…).

Diagram showing connections between documents and words in vocabulary.

The second position corresponds to the word “Sport”, which is mentioned zero times in the Terminator review, giving us (19, 0, …), and so on…

The second document happens to be a review of a Tennis documentary, and so for this document we have the vector (0, 10, …), since “Robot” is mentioned 0 times and “Sport” 10 times.
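
To make these count vectors concrete, here is a minimal sketch using scikit-learn’s CountVectorizer. The two tiny reviews are made up purely for illustration, standing in for the Terminator review and the Tennis documentary review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up snippets standing in for the Terminator review and the
# tennis documentary review described above.
reviews = [
    "robot fights robot while another robot watches",
    "sport fans love this tennis sport documentary",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reviews)   # documents x vocabulary word counts

print(vectorizer.get_feature_names_out())    # the shared vocabulary
print(counts.toarray())                      # one count vector per document
```

In the real setting each row would have 1000 entries, one per word in the vocabulary, most of them zero.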

Too many topics?

Yes, way too many. I agree, as will your computer.

Diagram showing the fabled ‘middle ground’.

At the moment we have essentially defined a “topic” for each word in the vocabulary, which is clearly not ideal and will not provide us with a small set of clearly separated topics to play with later on.

The next step, then, is to find a middle ground where each of our reviews belongs to a broader topic defined by a number of words in the vocabulary.

Latent Dirichlet Allocation

We are now looking to reduce the number of connections coming from each document by introducing a hidden layer of topics between the documents and the individual words in the vocabulary.

This is exactly what we are going to use Latent Dirichlet Allocation (LDA) for.

LDA requires us to specify in advance the number of topics that we want. This is what we call a hyperparameter (a parameter which is defined before the algorithm is run).

In the real world we can imagine scanning over many numbers of topics to find the best outcome (aptly called a hyperparameter scan).

Let’s say we are looking for 10 topics. This means we wish to add a hidden layer with 10 nodes between the connections we had before, which linked documents directly to words (remember our large vectors (19, 0, …) and (0, 10, …)).

Mathematically and computationally this is very desirable for us, since we can replace these huge vectors describing each document with new vectors of size 10 (or however many topics you have chosen).
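
As a rough sketch of this step (assuming scikit-learn; the tiny count matrix below is made up, and I use 2 topics rather than 10 so that it fits the toy data), the number of topics is simply the n_components hyperparameter:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# A tiny made-up documents x vocabulary count matrix (4 documents, 6 words),
# standing in for the much larger matrix described above.
counts = np.array([
    [19,  0, 3, 0, 1, 0],   # e.g. the Terminator review
    [ 0, 10, 0, 4, 0, 2],   # e.g. the tennis documentary review
    [15,  1, 2, 0, 0, 1],
    [ 0,  8, 0, 5, 1, 0],
])

# n_components is the number of topics: the hyperparameter chosen up front
# (10 in the running example; 2 here because the toy data is so small).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)      # (4, 2): one short topic vector per document
print(doc_topics.round(2))   # each row is a distribution over the topics
```

A hyperparameter scan would simply repeat this fit for several values of n_components and keep the one that scores best on held-out documents.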

To describe what the LDA is attempting to achieve, it is easiest to look at a matrix formulation below…

Matrix formulation for what LDA is attempting to achieve.

So in our original, perfect description of the documents, we had the matrix S. This matrix is the most complete picture we can have of all the documents. There is no information lost, since each word is its own topic and, if we ignore the ordering of the words, we are able to perfectly recreate each document.

However, some of the information is too fine grained. For example, we don’t really need a separate topic for “Robot” and “Android”. These can just be combined into a coarser topic of “Sci-Fi” or whatever you want to name it.

In this case, we can see that we have sacrificed some information. So if we were to recreate the document, there is no guarantee that we would get back the word “Robot”, since we only have information that a similar word from the same topic was mentioned.

Diagram showing the end goal of LDA with words connected to topics which are then connected to the documents.

This is what the matrices M and N are doing. The latent dimension K is our hidden layer of topics (i.e. the 10 topics), and given this dimension K, LDA learns the matrices M and N in an attempt to best recreate the matrix S.
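
Reading that diagram as plain matrix shapes (D and V, for the number of documents and the size of the vocabulary, are my own labels; K is the latent topic dimension from the text), the picture is roughly:

```latex
S \approx M N, \qquad
S \in \mathbb{R}^{D \times V}, \quad
M \in \mathbb{R}^{D \times K}, \quad
N \in \mathbb{R}^{K \times V}, \quad
K \ll V
```

So M holds one small K-dimensional topic vector per document, N describes each topic as weights over the vocabulary, and shrinking K is exactly where the information loss happens.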

It is not essential to understand the maths here to know what is going on; the takeaway points are:

  • LDA creates coarser grained topics based on the documents given to it.
  • As the LDA model does this, we lose specific information about the individual documents.

If you think about how you would sort film reviews into 10 topics, this will hopefully start to make sense. Imagine if I asked you to summarise one of the topics you had created. You wouldn’t be able to recite every word of each document, but you’d probably be able to give a few of the most common words describing the overall topic. Hence, you have lost information.

Why do we want to lose information?

Losing information may sound like a bad thing, but it actually helps Machine Learning models find patterns in data that they would have missed otherwise.

The trick is not to lose the useful information; in some cases an algorithm can learn what to keep, and in others it must be controlled via a hyperparameter (as for LDA with the number of topics).

Let’s look at an example… say we have a noisy set of data points loosely describing a quadratic function (plot A).

Example of losing information being a good thing. Basically overfitting and underfitting in a different context.

In plot B, we have tried to keep all the information we have to describe the data points, but does this look right? Probably not: our model is too specific and has not really captured the general trend.

In plot C, we have lost too much information. The model is too simple for the data.

In plot D, we have a good description of the data. We have found a good hyperparameter to be able to lose the less useful information and keep the important parts.
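
As a minimal sketch of plots B, C and D (the data and polynomial degrees below are made up, not the article’s actual figures), fitting polynomials of different complexity to noisy quadratic data shows the same trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=2.0, size=x.shape)   # noisy quadratic data (plot A)

for degree, label in [(9, "B: too complex, chases the noise"),
                      (1, "C: too simple, misses the curve"),
                      (2, "D: about right")]:
    coeffs = np.polyfit(x, y, degree)   # degree plays the role of the hyperparameter
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree} -> training error {mse:.2f}   ({label})")
```

The degree-9 fit will report the lowest training error precisely because it has kept the noise; judging the models on new points (or simply plotting them) is what reveals plot D as the best description.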

This may all sound familiar because it is exactly the description of underfitting and overfitting data, just in the context of NLP and topic modelling.

Back to Topics…

Diagram showing the end goal of LDA.

Hopefully that brief interlude was useful. If not, sorry about that but we’re back on track now!

So we have our LDA model, which has sorted all these documents into distributions over topics. Note that these are not hard cluster assignments; they are distributions. So if 3 of the 10 topics we had were Sci-Fi, Documentary and Technology, a film review for a Documentary about astronauts might have a distribution of (0.2, 0.4, 0.3, …).

On the diagram above, this corresponds to the top layer of connections between the documents and the topics.

These distributions are actually called embeddings, since we have embedded information about the document in a usable mathematical format.

Take our embedding from earlier, a = (0.2, 0.4, 0.3, …), and suppose we have two more documents with embeddings b = (0.1, 0.5, 0.2, …) and c = (0.8, 0, 0.1, …). Is b or c more similar to embedding a?

There are a few ways to answer this question, but we will choose the simple (and very effective) cosine similarity, which you may remember from maths courses at school or college. This measures the angle between the two embeddings: the more closely they point in the same direction, the more similar they are.

In this case a and b are more similar than a and c. So if we were to ask the computer which one out of b and c it would recommend given a, we would expect it to recommend document b.
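
Here is a small sketch of that comparison. Only the first three topic weights were given above, so a, b and c are treated as three-topic embeddings purely for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Only the first three topic weights appear in the text; the vectors are
# treated as complete three-topic embeddings purely for illustration.
a = np.array([0.2, 0.4, 0.3])
b = np.array([0.1, 0.5, 0.2])
c = np.array([0.8, 0.0, 0.1])

print("sim(a, b) =", round(cosine_similarity(a, b), 3))   # ~0.95
print("sim(a, c) =", round(cosine_similarity(a, c), 3))   # ~0.44
# b scores higher, so given a we would recommend document b.
```

Ranking every candidate document by its cosine similarity to the one the user just watched (or liked) and returning the top few is essentially the whole recommendation step.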

This is how we can use topic modelling to create recommendations.

Round up

So we have looked at what topics are, then at what the LDA algorithm gives us and then finally how we can use these mathematical objects to produce recommendations.

The key point for recommendations in topic modelling based on similarity is that we are assessing how similar the encoded information of each document is. As we have discussed, these recommendations may be completely terrible depending on how our topics are found and distributed (plot B or plot C from the crude explanation earlier), or they might be fantastic and hit the sweet spot (plot D).

The other point to note is that although we can control the number of topics, we have less control over what these topics are. They are determined from the data, which we are able to manipulate, for example by removing unimportant words. But if we don’t have a representative sample of Sci-Fi film reviews in our database then the likelihood is that this topic will not exist.

Another reason why it’s all about the DATA…

This is a really short and low-level insight into how these types of algorithms can be used to give recommendations. There are loads of much more extensive descriptions out there, so I encourage you to read around.

If you are interested, I have a working recommendation system code on my GitHub.

