
How Advertisements Read Your Brain: An Introduction to Recommender Systems

Recommender systems can be scarily accurate. Let's figure out how they work.

Photo by Wojtek Witkowski on Unsplash

Online advertising is ubiquitous these days. You can’t go many places online without being offered something to buy – perhaps a shirt, or maybe a pair of headphones. What’s more, these ads can be uncannily accurate. I’m often surprised at how many of the items shown to me I actually would buy. I’ve been equal parts creeped out and impressed by these ads, and decided to figure out how they work. It turns out that the systems behind them, called recommender systems, are actually quite intuitive. In this article we’ll take a look at a simple yet effective recommender system and explain how it works.

The Problem

We first have to define a concrete problem. Let’s say we’re working for Amazon, and we have access to a database of product ratings in the form of (Username, Product Id, rating). Since Amazon is such a large company, we have a lot of these ratings. Our goal is to predict the ratings for pairs of (Username, Product Id) that do not exist in the database yet.

How might we approach this problem? Well, for each (Username, Product Id) pair we’re trying to predict, we might look at the other ratings that Username has given out. For example, if we see that Username has given high ratings to guitar strings and guitar picks, we can probably infer that this person is a guitar player, and that they would rate other guitar accessories, like guitar tuners, highly as well. We could do the same thing for other hobbies. For example, if someone were to give baseball bats and baseball shoes a bad rating, we’d expect a bad rating for baseball gloves as well.

So how do we represent this idea mathematically? We can use something called an embedding. An embedding is a compact representation (usually a vector) of a complicated object – for example, a user’s preferences. Concretely, what we’re going to do is represent both the user and the product as vectors. Each element of the user vector represents how much the user likes a particular type of product. For example, we could set the first element of the user vector to represent baseball products, and the second element to represent guitar products. Then the user vector (3, -2) would mean that this user likes baseball products and doesn’t like guitar products.

We set each element of the product vector to represent how much of that type of product it is. For example, a product vector of (5, 0) would represent something that is definitely a baseball product, and definitely not a guitar product – something like a baseball bat. A product vector of (2, 2) would represent something that is related to both baseball and guitars – maybe a baseball-themed guitar.

Once we have the user and product vectors, we can take the dot product to get a single number representing the user’s preference for a product. For example, with the user vector (3, -2) and the product vector (5, 0) we would get a dot product of 15, meaning the user would like this item and give it a high rating. If the product vector was (2, 2) instead, we would get a dot product of 2, meaning the user is lukewarm about this product. This makes sense, as a (2, 2) product is equally related to baseball and guitars, and our (3, -2) user likes baseball but doesn’t like guitars. We can extend this idea to vectors of arbitrary size. Instead of just baseball and guitars, we could have 100-dimensional vectors, representing preferences for 100 different types of products.
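As a quick illustration, here is that dot-product calculation in NumPy, using the same made-up two-dimensional vectors from above (the library choice and the specific numbers are just for illustration):

```python
import numpy as np

# Each position means: [baseball, guitar]
user = np.array([3.0, -2.0])          # likes baseball, dislikes guitars
baseball_bat = np.array([5.0, 0.0])   # purely a baseball product
themed_guitar = np.array([2.0, 2.0])  # equal parts baseball and guitar

print(np.dot(user, baseball_bat))   # 15.0 -> strong preference, high rating
print(np.dot(user, themed_guitar))  # 2.0  -> lukewarm
```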

Now that we have a way of representing users and products, as well as a way to compute a user/product pair rating, we need to figure out how to train our system. In other words, we need to come up with a way to get accurate user and product embeddings. At the beginning of training, we have no information about the user or product embeddings, so it seems reasonable to set those embeddings to random values. Going back to our baseball/guitar example, let’s say we randomly set a user embedding to (3, -3) and a product embedding to (1, 1). The predicted rating is 0, and let’s assume this user’s true rating is 5.

Our predicted rating is too low. We can then adjust both the user embedding and the product embedding to better match the true rating. For example, we can change the user embedding to (3.1, -2.9) and the product embedding to (1.1, 0.9). This moves us closer to the true rating, but we aren’t there yet. We can then try (3.2, -2.8) and (1.2, 0.8), check again, readjust, and so on and so forth. This process we’ve described turns out to be gradient descent, which is great since it is easy to implement.
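Here is a minimal sketch of one such update in code, minimizing the squared error between the predicted and true rating for a single example (the learning rate of 0.01 is an arbitrary choice, so the exact numbers differ slightly from the ones above):

```python
import numpy as np

user = np.array([3.0, -3.0])     # random initial user embedding
product = np.array([1.0, 1.0])   # random initial product embedding
true_rating = 5.0
lr = 0.01                        # learning rate (arbitrary choice)

for step in range(3):
    pred = user @ product            # predicted rating (dot product)
    error = pred - true_rating       # signed prediction error
    # Gradients of the squared error (pred - true_rating)^2
    grad_user = 2 * error * product
    grad_product = 2 * error * user
    user -= lr * grad_user
    product -= lr * grad_product
    print(step, user @ product)      # prediction creeps toward 5
```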

We have one more problem, though: how do we choose which product types to include in our embeddings? Obviously, a real-life system needs more than just baseball and guitar. However, we can’t possibly include every product type in the universe, as that would make things computationally infeasible. Instead, we sidestep the problem by fixing only the size of the embeddings and letting the training algorithm decide what each dimension ends up representing. Now let’s take a look at the full training algorithm (a code sketch follows the list).

Training Algorithm:

  • Decide on a dimension k for the user and product embeddings (k = 40 is reasonable).
  • For each user in the database, initialize a k-dimensional vector with random values as that user’s embedding.
  • For each product in the database, initialize a k-dimensional vector with random values as that product’s embedding.
  • For each user/product combination in the database that already has a rating (our training set), take the dot product between the appropriate embeddings and get a predicted rating.
  • Get the sum of the squared errors between all our predicted ratings and our actual ratings. This is the total error.
  • Take the gradient of the total error with respect to the user and product embeddings. Do gradient descent on all the embeddings.
  • Repeat until total error meets some predefined criteria.
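
Here is the code sketch promised above: a minimal NumPy implementation of this loop. It assumes ratings come as (user index, product index, rating) triples, uses plain full-batch gradient descent, and leaves out things a production system would want (regularization, stochastic updates); the function name and hyperparameter values are my own choices.

```python
import numpy as np

def train_embeddings(ratings, num_users, num_products, k=40,
                     lr=0.01, max_epochs=1000, tol=1e-3):
    """ratings: list of (user_index, product_index, rating) triples."""
    rng = np.random.default_rng(0)
    # Steps 1-3: pick k and give every user and product a random k-dim embedding
    U = rng.normal(scale=0.1, size=(num_users, k))
    P = rng.normal(scale=0.1, size=(num_products, k))

    users = np.array([u for u, _, _ in ratings])
    products = np.array([p for _, p, _ in ratings])
    targets = np.array([r for _, _, r in ratings], dtype=float)

    for _ in range(max_epochs):
        # Step 4: predicted rating = dot product of the matching embeddings
        preds = np.sum(U[users] * P[products], axis=1)
        errors = preds - targets
        # Step 5: total error = sum of squared errors over the training set
        total_error = np.sum(errors ** 2)
        if total_error < tol:   # Step 7: stop once the error is small enough
            break
        # Step 6: gradient of the total error, then one gradient descent step
        grad_U = np.zeros_like(U)
        grad_P = np.zeros_like(P)
        np.add.at(grad_U, users, 2 * errors[:, None] * P[products])
        np.add.at(grad_P, products, 2 * errors[:, None] * U[users])
        U -= lr * grad_U
        P -= lr * grad_P
    return U, P

# Predicted rating for a pair not in the training set: U[user_index] @ P[product_index]
```

Full-batch updates keep the sketch short; with millions of ratings you would more likely update the embeddings one rating (or one mini-batch) at a time.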

As you can see, in our algorithm we don’t need to handpick the product types – we let gradient descent do that for us. Intuitively, this algorithm is quite simple: all we are doing is computing predicted ratings from our embeddings, comparing them to the true ratings, and updating the embeddings accordingly. It also performs well in practice. A variation of it placed in the top 10 of the Netflix Prize, Netflix’s movie rating competition. In the years since, many new algorithms have been proposed, but they seem to offer only marginal (~2%) improvement and have reproducibility issues.

There is one last thing worth mentioning. Our algorithm is closely related to singular value decomposition (SVD), a well-known matrix factorization method – in fact, in the recommender systems literature this style of model is often simply called SVD. Because of this connection, we can take advantage of mature, heavily optimized implementations instead of writing the training loop ourselves. The full topic is for another time, but it deserves a mention.
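
As a pointer, here is roughly what using an off-the-shelf implementation looks like. The scikit-surprise library’s SVD model is a matrix factorization in the same family as the algorithm we derived; the tiny DataFrame below is made up purely for illustration.

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# A made-up ratings table in (Username, Product Id, rating) form
df = pd.DataFrame({
    "user":   ["alice", "alice", "bob", "carol"],
    "item":   ["guitar strings", "guitar picks", "baseball bat", "baseball glove"],
    "rating": [5, 4, 1, 2],
})

data = Dataset.load_from_df(df[["user", "item", "rating"]],
                            Reader(rating_scale=(1, 5)))
model = SVD(n_factors=40)              # k = 40 embedding dimension
model.fit(data.build_full_trainset())

# Predicted rating for a pair that isn't in the data
print(model.predict("alice", "guitar tuner").est)
```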


In this article, we reasoned from scratch and came up with a recommender system that is competitive with far more complicated models. Additionally, our system has the advantage of being intuitive and easy to implement. Please leave any comments or questions, and let me know what you want to see next!

