“What Should I Watch Next?” — Exploring Movie Recommender Systems, part 1: Popularity

Eve Ben Ezra
Towards Data Science
10 min read · Jul 8, 2019


Recommender systems. What are they, and why should you care?

Well, it turns out recommender systems are everywhere these days. The New York Times, Reddit, YouTube, and Amazon (to name a few) all use them in various ways to drive traffic and sales, and to bring you, the user, what you’re looking for.

When people think of movie recommenders, they most frequently think of Netflix, whose algorithm is designed to keep users coming back again and again to watch new and exciting things.

I decided to make my own recommender system, so that I could recommend new movies to myself. I created four different systems, from simple to more complex: a popularity filter, a content-based recommender, a collaborative recommender using SVD matrix factorization, and a hybrid of the collaborative and content-based recommenders.

In this first post of the series, I’ll walk through how I chose and gathered my data, and explore how I built my popularity filter.

Deciding on data

There are several ready-to-use movie datasets out there; the most well known are probably the MovieLens datasets. If you had a website of your own like Netflix, you could use internal, explicit data from users to help recommend movies. YouTube, for example, uses average view duration (how many minutes, on average, of your video is watched) in its algorithm. Reddit uses weighted user votes to put popular posts on its ‘front page’. Instagram looks at which ads you click and how you navigate the web as a whole (apparently it depends on whether you’re logged into your Facebook account, since Facebook owns Instagram). As you might already know if you’ve found your way here: data is valuable.

Of course, I don’t have access to internal metrics like ratings, demographics, or anything else, since I don’t have an app or site that actually recommends movies. For the collaborative portion of my recommender system, therefore, I decided to use the MovieLens 20M Dataset from GroupLens. This gave me 20 million ratings from 138,000 users (138,001, since I added my own ratings) on 27,000 movies. A good set to start with, for sure!

However, it was missing a few key features that I really wanted for my content-based recommender. First, I didn’t want to limit myself to 27,000 movies. I wanted a movie recommender that could dive deep into the pits of filmography and pull out some dusty diamond of a movie that would appease even the mustiest of critics (think Anton Ego from Pixar’s 2007 movie, Ratatouille). So, for my popularity and content-based recommenders, I built my own dataset: I used IMDB’s available datasets to gather movie IDs, then used OMDBAPI to scrape metadata for 265,000 movies.

In recommender system terms, I wanted to keep an eye on four metrics: diversity, coverage, serendipity, and novelty. While I won’t be talking in this post about how to explore these metrics mathematically, I’ll define them quickly below.

Diversity measures how dissimilar recommended items are to each other. Am I putting in Iron Man and getting back Iron Man 2, Iron Man 3, The Avengers, Avengers: Age of Ultron, Avengers: Infinity War, etc.? Or am I getting some weird, thematically linked movie about an eccentric rich bachelor who gets into a sticky situation, à la 20,000 Leagues Under the Sea?

Consider as well, though, that 100% diversity might just look like picking films at random and spitting them back out. That would not a very good recommender system make.

Coverage addresses how much of your catalog is actually being used and recommended. Most human behavior follows a long-tail Pareto distribution, the familiar 80–20 rule, and both the MovieLens 20M rating dataset and my scraped data follow it. In my scraped dataset of 265,417 movies with 733,228,772 total votes, just 1.5% of the movies account for 79.57% of all votes. By the time we reach the top 20% of the dataset by vote count, we’ve covered 99.16% of the total votes.

The KDE of movie votes from IMDB, showing the majority of votes going to a small fraction of the movies. This kind of long-tail distribution describes many aspects of human behavior and society, such as the distribution of wealth.
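For the curious, those cumulative-share numbers fall out of a few lines of pandas. A minimal sketch, where the file name and ‘votes’ column are assumptions about my scraped data:

```python
import pandas as pd

# Hypothetical file and column names for the scraped IMDB data.
movies = pd.read_csv('imdb_movies.csv')

# Sort movies by vote count, descending, then compute each movie's
# cumulative share of all votes cast.
votes = movies['votes'].sort_values(ascending=False)
cum_share = votes.cumsum() / votes.sum()

# Share of total votes captured by the top 1.5% and top 20% of movies.
print(cum_share.iloc[int(len(votes) * 0.015)])  # ~0.80 on my data
print(cum_share.iloc[int(len(votes) * 0.20)])   # ~0.99 on my data
```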

Serendipity is the measure of how surprising and relevant returned recommendations are.

Novelty determines how unknown recommended items are to a user. It’s the ‘surprise’ without the strict ‘relevance’, which can be hard to quantitatively evaluate.

Rec. System 1: Popularity


If I were starting a theoretical website to recommend movies, I’d have to have somewhere to start while I gathered internal user data, whether explicit (votes/ratings) or implicit (links clicked on, minutes watched, purchases made, etc).

A place to start is a popularity filter. This returns ‘top hits’. On Reddit, it’s their front page. The New York Times includes popularity filters like their ‘Most Emailed’ and ‘Popular on Facebook’ lists (scroll to the bottom). IMDB has their ‘top 250 movies’.

Simply put, a popularity filter works like this: You decide on a threshold. What is ‘popular’ to you? The top 5%? Top 1%? If you have the info available, you can use the number of ratings or votes (how IMDB does it, and how I did it) to filter this. If you’re starting from scratch, you can use metrics like total sales or box office. Once you’ve set your thresholds, you decide how to return the results.

For my scraped data (265,000 movies), I had access to how many votes each movie had received and its average rating on IMDB. Here’s what the votes looked like, in brief:

Votes per movie on IMDB: the average was 4,049; the fewest any single movie had received was 5, and the most was 2,084,570.

I decided to look at movies in the 95th percentile and above. This gave me an initial threshold of 7,691 votes. I also decided to look at movies whose average unweighted rating on IMDB was the mean or above. In this case, an unweighted rating at or above 6.14.
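In pandas, both thresholds are one-liners. A minimal sketch, with the same assumed file and column names as above:

```python
import pandas as pd

movies = pd.read_csv('imdb_movies.csv')  # hypothetical file name

m = movies['votes'].quantile(0.95)  # 95th-percentile vote count (7,691 on my data)
C = movies['rating'].mean()         # dataset-wide mean rating (6.14 on my data)

# Movies that clear both bars qualify for the chart.
qualified = movies[(movies['votes'] >= m) & (movies['rating'] >= C)]
```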

But how do we decide which movies to display? Above our vote threshold, average ratings are generally fairly stable. Still, a better approach is to create a weighted rating: if we ever decide to rank our top 250 charts by rating, we wouldn’t want a movie with five votes of 10 each to overtake a movie with 1,000,000 votes and an average of 8.9. Stability is important in a popularity chart. We can therefore transform the raw average ratings into weighted ratings, using a True Bayesian Estimate formula:

WR = (v ÷ (v + m)) × R + (m ÷ (v + m)) × C

Where:

R = average rating for the movie
v = the number of votes cast for the movie
m = the minimum vote threshold required
C = the mean rating of all movies in the dataset

If that doesn’t make much sense, here’s essentially what’s happening: we pick a vote threshold m (in my first case, 7,691 votes). A movie’s v real votes, averaging R, are blended with m ‘virtual’ votes, each equal to C, the mean rating of the entire dataset. For movies with many more real votes than the threshold, not much changes. For movies with very few votes, the virtual votes dominate and drag the rating toward the dataset mean. This adds a lot of stability to the ratings, and therefore stability to the chart.
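As a quick sketch, the formula is one line of Python, and plugging in the two extremes from above shows the stabilizing effect (the default threshold and mean are my dataset’s values):

```python
def weighted_rating(R, v, m=7_691, C=6.14):
    """True Bayesian Estimate: blend a movie's v real votes (average R)
    with m 'virtual' votes at the dataset mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A blockbuster far above the vote threshold barely moves...
print(round(weighted_rating(R=8.9, v=1_000_000), 2))  # 8.88
# ...while five perfect 10s are pulled nearly all the way to the mean.
print(round(weighted_rating(R=10.0, v=5), 2))         # 6.14
```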

We’ll also use the weighted rating later, in the content-based recommender and our hybrid.

The transformation of ratings for the movies in my dataset with the most votes on IMDB vs. the least. The weighted ratings of movies with many more votes than our threshold barely change, whereas movies with very few votes are pulled toward the mean of the dataset as a whole.

What we’re left with after we filter movies by our threshold (and average unweighted rating, if you feel so inclined) is a smaller list of qualifying movies. From there, we simply sort and return the recommendations based on our preferences. For example, here are the top 20 from the qualifying list gathered by using the entire dataset of 265,000 movies, returned by weighted rating:

Popular movies returned

A couple of things about this first popularity result: it includes all movies worldwide, made between 1891 and 2019. A useful list, but is it one someone would expect to see as soon as they enter the site? Probably not. A lot of US-based users, for instance, think Hollywood when they think movies. They don’t expect to be shown foreign films (The Chaos Class is a Turkish comedy), and they might be confused if a silent movie from 1891 were recommended right after Avengers: Endgame or The Lion King.

Fortunately, this problem is easily solved through further filtering. For instance, here’s another popularity chart accounting for movies only released in 1990 or later. Note that because I’m changing the shape of my initial set, the thresholds for my qualified movies will change. In this case, my threshold of total votes changed from 7,691 to 11,707 votes, and I looked at movies with an unweighted average rating of 6.13 or above.

Top 12 ‘popular’ movies from 1990 on

We can also choose to filter further, creating popularity lists for unique countries, directors, actors, genres, languages, and more, making sure to change the thresholds accordingly each time.
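One way to make that re-thresholding automatic is to wrap the whole chart in a small function that computes its thresholds on whatever subset it’s given. A sketch, reusing the assumed file and column names from above (plus a ‘year’ column):

```python
import pandas as pd

def popularity_chart(movies: pd.DataFrame, n: int = 20, pct: float = 0.95) -> pd.DataFrame:
    """Qualify movies against thresholds computed on this subset,
    then rank them by weighted (True Bayesian Estimate) rating."""
    m = movies['votes'].quantile(pct)
    C = movies['rating'].mean()
    qualified = movies[(movies['votes'] >= m) & (movies['rating'] >= C)].copy()
    v, R = qualified['votes'], qualified['rating']
    qualified['weighted_rating'] = (v / (v + m)) * R + (m / (v + m)) * C
    return qualified.sort_values('weighted_rating', ascending=False).head(n)

movies = pd.read_csv('imdb_movies.csv')  # hypothetical file name
top_recent = popularity_chart(movies[movies['year'] >= 1990])
```

Because m and C are recomputed inside the function, the 1990-and-later chart above, and any country, genre, or language chart, gets thresholds appropriate to its own subset.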

For these filters, in Python, you’ll often have several items per feature in your dataframe (e.g., The Dark Knight has Action, Crime, Drama, and Thriller all attributed as genres). One solution is pandas.DataFrame.stack, which lets you expand each movie into one row per feature value and then filter on a single value of your target feature, while still including every movie.
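Here’s a minimal sketch of that stack pattern; the toy dataframe is mine, purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['The Dark Knight', 'The Grand Budapest Hotel'],
    'genre': ['Action, Crime, Drama, Thriller', 'Adventure, Comedy, Crime'],
})

# Split the comma-separated genres into one column per genre, then stack
# those columns into a long Series: one row per (movie, genre) pair.
genres = (
    df['genre']
    .str.split(', ', expand=True)
    .stack()
    .reset_index(level=1, drop=True)
    .rename('genre')
)

# Join back so each movie appears once per genre, then filter to one genre.
exploded = df.drop(columns='genre').join(genres)
comedies = exploded[exploded['genre'] == 'Comedy']
```

Each row of exploded now carries a single genre, so a slice like comedies can be passed straight into the popularity_chart sketch above.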

Top 10 movies for Japanese language and top 10 movies for Comedy

Overall, popularity or Top N lists are a great starting point for recommender systems, whether you’re building them for a personal blog or for movie recommendations. There are some key points to keep in mind:

By default, Top N lists might not deliver the metrics you want. For film buffs or critics looking for their next diamond in the rough, a popularity chart is not going to cut it. It returns the popular and the known. Remember the kernel density estimate plot above, with the long tail?

With our thresholds on the full dataset (7,691 votes and an average unweighted rating of 6.14), the initial qualifying list contains 6,532 movies. Recall that the entire dataset has 265,417 movies. That means only 2.46% of movies were even considered for these Top N charts (and remember, 1.5% of the movies in my scraped dataset accounted for nearly 80% of the total votes). Not exactly what we were hoping for in terms of coverage. The other metrics are trickier to judge here: we have as much diversity as we ask for. Our very first chart spans different countries, years, genres, and actors. Filtering down to ‘Japanese comedies’ gives us less diversity, though that narrowness is exactly what the user asked for.

Novelty, in a popularity chart, is also probably lacking. Someone might look at a Top 250 movies chart and think, ‘Oh yeah, I forgot about that movie.’ But they probably won’t be surprised to see Titanic or Pulp Fiction.

The second thing to consider with popularity charts is that they are 100% unpersonalized. If you implement a popularity filter or chart on your blog or website, everyone who visits will see the same results. Look at Reddit: the top posts in the history of the site are all years old. That’s why you aren’t shown them on the ‘hot’ page, whose algorithm weights posts by age and generally only surfaces posts less than six hours old.

The last thing to consider with straight popularity charts is that we’re making an assumption about what people like. The entire idea behind a popularity recommender is that because a lot of people liked something, a person chosen at random will probably like it too. This obviously isn’t always the case. Just because 1,000,000 people rave about The Lord of the Rings: The Fellowship of the Ring doesn’t mean a person who dislikes the franchise will suddenly like it. I don’t like horror movies, and no metric of popularity is going to change my mind. And while further-filtered charts do address this, I doubt very much that the parent of a five-year-old would look at our very first chart and find anything suitable to watch with their child.

Overall, popularity charts are: simple, easy to implement, and good first steps into recommending products, pages, or other services to users. Popularity charts are not: personalized, deep-diving, or going to recommend to you the movie you never knew you needed (unless that movie is Avatar).

Thanks for reading! Leave any questions in the comments below, and check out my python notebook or my github repo for this project if you’re so inclined.

My next post will deal with content-based recommenders: using movie metadata and tags such as genre, MPAA rating, plot keywords, cast & crew, languages, and more to recommend movies using NLP vectorization and distance functions, while considering scalability. Ready to read it? Click here.
