MovieLens-1M Deep Dive — Part I

A hands-on recommendation systems tour using the popular benchmark dataset

Elad Rapaport
Towards Data Science



Recommendations. We all consume them, be it through our favorite movie streaming apps, online shopping, or even passively as targets of advertising campaigns. How are these recommendations created? How does a recommendation system utilize enormous datasets of internet transactions to generate high-quality, personalized recommendations? I find these questions fascinating, so I decided to embark on a learning journey, and this post is intended to share my findings with you, dear readers :)

There are already plenty of theoretical blog posts on the basics of recommendation systems, so I decided to make this a hands-on, practical tour of one of the most popular recommendation-focused datasets available — MovieLens-1M [1] (used with permission). In this dataset, we are given ~1 million historical ratings of 2894 movies by 6040 unique users.

The structure of this post is as follows:

  • Part I — Exploratory data analysis (EDA)
  • Part II — Data pre-processing (retrieving movie content information)
  • Part III — Setting up the recommendation problem and obtaining a baseline score
  • Part IV — Recommending movies with collaborative filtering
  • Part V — Recommending movies with content-based filtering
  • Part VI — Conclusions and thoughts ahead

The notebook containing all of this project’s code and more is available here: https://colab.research.google.com/drive/132N2lGQTT1NnzAU6cg1IfwpaeFKK9QQe?usp=sharing

This is the first of a two-part deep dive into MovieLens-1M. The next part, which will be released in the near future, will focus on more advanced algorithms and continue the never-ending EDA process. Happy reading!

Part I — EDA

First, download the MovieLens-1M dataset from the GroupLens website. Then, let's take care of the necessary imports for this project:
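The notebook's exact import list isn't reproduced here; a minimal set that covers the EDA below would look something like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```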

The MovieLens-1M dataset consists of 3 files — users.dat, ratings.dat and movies.dat. Let's explore each of these files and understand what we are dealing with. We will start with the movies data.
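Loading the three files with pandas might look like the following sketch. The column names are my own choice; the "::" separator and the latin-1 encoding come from the dataset's own format:

```python
import pandas as pd

# The .dat files are '::'-separated with no header row.
movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["movie_id", "title", "genres"], encoding="latin-1")
users = pd.read_csv("users.dat", sep="::", engine="python",
                    names=["user_id", "gender", "age", "occupation", "zip_code"])
ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["user_id", "movie_id", "rating", "timestamp"])
```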

Let’s plot the distribution of movies per genre. For this, we will have to turn the genres column into a list and “explode” it. Note that each movie may belong to several genres, in which case it will be counted multiple times. We will also plot the distribution of movies per release year.
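A sketch of both plots, assuming the movies dataframe from above (the year-extraction regex is mine):

```python
import matplotlib.pyplot as plt

# Genres are pipe-separated, e.g. "Animation|Children's|Comedy".
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
exploded["genre"].value_counts().plot(kind="bar", title="Movies per genre")
plt.show()

# The release year is embedded at the end of each title, e.g. "Toy Story (1995)".
movies["year"] = (movies["title"]
                  .str.extract(r"\((\d{4})\)\s*$", expand=False)
                  .astype(float))
movies["year"].plot(kind="hist", bins=50, title="Movies per release year")
plt.show()
```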

Let’s move on to the users’ data and inspect it. The “occupation” column is encoded with a number representing each occupation, but for EDA we are interested in the actual data. We will get it by extracting the relevant rows from the README file of the dataset and swapping values in the dataset accordingly.
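Rather than reproducing the README-parsing code, here is the resulting mapping itself, as documented in the MovieLens-1M README:

```python
# Occupation codes from the MovieLens-1M README.
occupation_map = {
    0: "other", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
    4: "college/grad student", 5: "customer service", 6: "doctor/health care",
    7: "executive/managerial", 8: "farmer", 9: "homemaker", 10: "K-12 student",
    11: "lawyer", 12: "programmer", 13: "retired", 14: "sales/marketing",
    15: "scientist", 16: "self-employed", 17: "technician/engineer",
    18: "tradesman/craftsman", 19: "unemployed", 20: "writer",
}
users["occupation"] = users["occupation"].map(occupation_map)
```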

Let’s view the ratings now.

Let’s rank movie genres by their average rating. We will also plot how many times each genre was rated.
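One way to compute both at once, reusing the exploded genre table from earlier:

```python
# Average rating and rating count per genre, sorted by average rating.
genre_stats = (ratings.merge(exploded, on="movie_id")
               .groupby("genre")["rating"]
               .agg(["mean", "count"])
               .sort_values("mean", ascending=False))
print(genre_stats)
```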

Now let’s combine all three tables (movies, users, and ratings) and see if there are any differences between male and female ratings per movie genre. Notice that here, too, we will normalize by the total number of males/females to get a better answer to our question.
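A sketch of the three-way merge and the gender comparison (variable names are mine):

```python
# Merge ratings with the genre-exploded movies and with the users table.
full = (ratings.merge(exploded, on="movie_id")
               .merge(users, on="user_id"))

# Mean rating per genre, split by gender.
mean_by_gender = full.pivot_table(index="genre", columns="gender",
                                  values="rating", aggfunc="mean")
print(mean_by_gender)

# Rating counts per genre, normalized by how many users of each gender exist.
counts = full.pivot_table(index="genre", columns="gender",
                          values="rating", aggfunc="count")
normalized = counts / users["gender"].value_counts()
normalized.plot(kind="bar", figsize=(12, 4),
                title="Ratings per genre, normalized by gender")
plt.show()
```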

Again, we see some typical male/female differences in this dataset (men rated action movies higher than women did, and vice versa for romance movies). This is a good sanity check, as the plots make sense.

Part II — Data pre-processing (retrieving movie content information)

In this part we will retrieve an important data source that will help us recommend movies later: the movie plot summaries. We will obtain them from Wikipedia, using the wikipedia Python package.
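A minimal sketch of the scraping loop, using the wikipedia package's page and section APIs. In the notebook the titles may be cleaned of their year suffix first; the error handling here is deliberately coarse:

```python
import wikipedia

def get_plot(title):
    """Fetch the 'Plot' section of a movie's Wikipedia page, or None on failure."""
    try:
        page = wikipedia.page(title, auto_suggest=False)
        return page.section("Plot")
    except Exception:  # disambiguation pages, missing pages, network errors
        return None

movies["plot"] = movies["title"].apply(get_plot)
print(f"There are {movies['plot'].isna().sum()} NaN movie plots")
```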

There are 615 NaN movie plots

We see that we were not able to obtain movie plots for 615 movies. This might be because their Wikipedia page doesn’t have a “Plot” section, or for some other reason. Since web crawling isn’t the main point here, we’ll just drop these movies and all ratings associated with them, as sketched below.
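Dropping those movies and their ratings is a two-liner:

```python
movies = movies.dropna(subset=["plot"])
ratings = ratings[ratings["movie_id"].isin(movies["movie_id"])]
```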

Part III — Setting up the recommendation problem and obtaining a baseline score

We will choose a user to serve as a case study, for whom we will make actual predictions and see if they make sense to us. Most conveniently, we will choose the user with ID #1. Shown below are the user and the movies that she rated. We can see that she is a female K-12 student (an age value of 1 denotes “Under 18” in this dataset). The top-20 movies she likes are mainly classic children’s hits with some anomalies here and there (Schindler’s List, One Flew Over the Cuckoo’s Nest).
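Pulling up the case-study user and her top-rated movies:

```python
print(users[users["user_id"] == 1])

user_1_ratings = (ratings[ratings["user_id"] == 1]
                  .merge(movies, on="movie_id")
                  .sort_values("rating", ascending=False))
print(user_1_ratings[["title", "genres", "rating"]].head(20))
```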

We will also shine our spotlight on the movie Toy Story (1995), and for each prediction method we will check which are the movies that are closest to this movie in the embedding space.

Finally, let’s start recommending stuff! We will create two dataset splits:

1. Standard train/test split. This will be used for the rating prediction regression task

2. Leave-one-out cross-validation split. This will be used for hit-rate evaluation (which is generally regarded as a more relevant metric in the field of recommender systems). The split is performed by taking one movie per user out of the dataset and using it as the test set; all the user's other movies form the train set. Then, the hit rate is the percentage of users for whom the held-out movie appears in their top-K recommendations. A sketch of both splits follows below.
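Both splits come straight out of the Surprise package (the random seeds below are arbitrary):

```python
from surprise import Dataset, Reader
from surprise.model_selection import LeaveOneOut, train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user_id", "movie_id", "rating"]], reader)

# 1. Standard split for the rating-prediction (RMSE) task.
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

# 2. Leave-one-out split for hit rate: one held-out rating per user.
loo = LeaveOneOut(n_splits=1, random_state=42)
loo_trainset, loo_testset = next(loo.split(data))
```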

For each method we will calculate:

  • RMSE on test set
  • Hit rate on the leave-one-out cross-validation set
  • Top predictions for user #1 (case study)
  • Top-10 most similar movies to Toy Story (1995)

Now, let’s define some auxiliary functions to help us evaluate our algorithms (mainly — the hit rate). We will inspect two measures for each algorithm — the RMSE and hit rate. Some of the code here is taken from the fabulous course on recommender systems by Sundog Education. I will also implement a simple tweak to the original KNN and SVD algorithms because calculating the hit rate takes a long time and I want to at least be able to measure the progress in real-time.
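The helpers might look like this — a simplified version of the Sundog-style evaluation code, with names of my own choosing:

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Group Surprise predictions by user, keeping the n highest estimates."""
    top_n = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid in top_n:
        top_n[uid].sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = top_n[uid][:n]
    return top_n

def hit_rate(top_n, loo_testset):
    """Fraction of users whose held-out movie landed in their top-n list."""
    hits = 0
    for uid, left_out_iid, _ in loo_testset:
        if any(iid == left_out_iid for iid, _ in top_n[uid]):
            hits += 1
    return hits / len(loo_testset)
```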

Let’s obtain results for a completely random recommender system, so we will be able to tell whether our algorithms are better or worse than random (after all, we want to make things better, not worse…).
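Surprise ships a NormalPredictor, which rates items by sampling from a normal distribution fitted to the training ratings; that serves as our random baseline here:

```python
from surprise import NormalPredictor, accuracy

algo = NormalPredictor()
algo.fit(trainset)
accuracy.rmse(algo.test(testset))

# Hit rate: retrain on the leave-one-out trainset, rank every unseen movie
# per user, and check whether the held-out movie made the top-10.
algo.fit(loo_trainset)
predictions = algo.test(loo_trainset.build_anti_testset())
print("HitRate:", hit_rate(get_top_n(predictions, n=10), loo_testset))
```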

  • Note — hit rate takes a lot of time to compute
RMSE: 1.4960
HitRate: 0.02185430463576159

Part IV — Recommending movies with collaborative filtering

We will use the classic vanilla algorithms, out-of-the-box from the Surprise package (with some minor tweaks to show progress in KNN). I won’t go into the algorithmic details here because I think there are many great tutorials out there for learning these things (just google recommender systems SVD/KNN and you're good to go). I will check these algorithms:

  1. SVD
  2. User-based KNN
  3. Item-based KNN

We will start with SVD:
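Out of the box, this is just:

```python
from surprise import SVD

algo = SVD(random_state=42)
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
# The hit rate is computed exactly as for the random baseline above.
```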

RMSE: 0.8809
HitRate: 0.03923841059602649

Nice! That’s a major improvement over the random score. But when we look at the predictions for user #1 they still feel a bit “off”. The only children’s movie is Babe (1995) and I personally would expect more of these to appear in a good recommendation for this user.

Now onto KNN. We will start with user-based KNN. This means that predictions for movie m and user u will be based upon the tastes of users that are similar to u.
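With Surprise's KNNBasic (the notebook subclasses it to report progress; cosine similarity is my choice of similarity measure here):

```python
from surprise import KNNBasic

algo = KNNBasic(sim_options={"name": "cosine", "user_based": True})
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
```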

RMSE: 0.9595
HitRate: 0.002152317880794702

The RMSE is better than random but the hit rate is worse. We can also see that the top predictions for user_1 seem even less relevant than those given by the SVD algorithm. In user-based KNN there is no notion of similarity between items, so we won’t calculate the most similar movies to Toy Story (1995) for this algorithm.

Let us now check the performance of item-based KNN, which recommends items to users based on the similarity of items they have already rated to items they have not yet rated. Code-wise, only a slight change to the class instantiation is required.
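Only the sim_options change:

```python
algo = KNNBasic(sim_options={"name": "cosine", "user_based": False})
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
```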

RMSE: 0.9851
HitRate: 0.002152317880794702

The RMSE is a bit worse than user-based KNN, and the hit rate is almost zero. The most similar movies to Toy Story (1995) seem reasonable, but the predictions for user_1 are once again unsatisfying. So far the KNN algorithms have failed to outperform SVD, which is generally regarded as the stronger algorithm for collaborative filtering.

Up until now, we haven’t taken the features of the movies themselves into account. Maybe knowing something about the movies will help us make better recommendations? Let’s check it out. This is called content-based filtering.

Part V — Recommending movies with content-based filtering

For the content-based filtering we will use KNN-based algorithms in three approaches (two of them item-based and one user-based):

1. Movie plots (item-based): Create a vector representation of all of the movies based on the plot descriptions. We will do this by first stemming all of the words in the plot description and then applying TF-IDF to vectorize each document. The similarity matrices we will generate will be based on:

a. Using the complete TF-IDF matrix

b. Using the TF-IDF matrix after feature selection

c. Using the TF-IDF matrix after feature selection and after removing people’s names

2. Movie genres (item-based): We will use the movie genres as the only source for recommendations and see how that goes.

3. User age+gender (user-based): We will use user data as features for our KNN predictor.

We will create a class that inherits from Surprise’s KNNBasic. Its functionality will be the same as item-based KNN for collaborative filtering, with the only difference being that we will supply a pre-calculated similarity matrix to the fit function, instead of having it calculated from the rating data.
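One way to write such a class; the name and structure are mine, and it relies on the fact that KNNBasic stores its similarity matrix in self.sim after fit:

```python
from surprise import KNNBasic

class PrecomputedSimKNN(KNNBasic):
    """Item-based KNN that swaps in an externally computed similarity matrix,
    indexed by the trainset's inner item ids."""

    def __init__(self, sim_matrix, **kwargs):
        super().__init__(**kwargs)
        self.sim_matrix = sim_matrix

    def fit(self, trainset):
        # Let KNNBasic set up its trainset bookkeeping, then override
        # the rating-based similarities with our content-based ones.
        super().fit(trainset)
        self.sim = self.sim_matrix
        return self
```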

Part V.I — Content-based filtering using movie plots

Now, let’s create the TF-IDF similarity matrix:
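A sketch with NLTK's Snowball stemmer and scikit-learn, assuming the plot column from Part II:

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = SnowballStemmer("english")

def stem_text(text):
    return " ".join(stemmer.stem(word) for word in text.split())

stemmed_plots = movies["plot"].apply(stem_text)
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(stemmed_plots)
plot_sim = cosine_similarity(tfidf_matrix)  # shape: (n_movies, n_movies)
```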

Now, let’s get the results for approach #1 (using the complete TF-IDF matrix). Note that we need a different cosine-similarity matrix for the regular trainset and for the leave-one-out trainset used to calculate hit rate, because each assigns different inner ids to the movies.
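A helper for that remapping might look like this, where movie_ids is the list of raw ids in the same row order as the similarity matrix:

```python
import numpy as np

def sim_in_inner_order(trainset, sim, movie_ids):
    """Reorder a raw-id-ordered similarity matrix to match a Surprise
    trainset's inner item ids."""
    row_of = {mid: i for i, mid in enumerate(movie_ids)}
    order = [row_of[trainset.to_raw_iid(inner)] for inner in trainset.all_items()]
    return sim[np.ix_(order, order)]

algo = PrecomputedSimKNN(
    sim_in_inner_order(trainset, plot_sim, movies["movie_id"].tolist()),
    sim_options={"user_based": False})
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
```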

RMSE: 1.0268
HitRate: 0.003973509933774834

Worse than collaborative filtering, without a doubt. We also see that the similarity matrix doesn’t really generate meaningful relationships between movies (the only children’s movie which is similar to Toy Story is Toy Story 2). Let’s see if reducing the number of features helps.
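The notebook doesn't pin down a particular selector; one simple unsupervised option is a variance threshold over the TF-IDF columns (the cutoff below is hypothetical):

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=1e-4)  # hypothetical cutoff
tfidf_selected = selector.fit_transform(tfidf_matrix)
selected_features = tfidf.get_feature_names_out()[selector.get_support()]
print("Number of selected features:", len(selected_features))
print("List of selected features:", sorted(selected_features))
plot_sim_selected = cosine_similarity(tfidf_selected)
```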

Number of selected features: 784
List of selected features: ['abbi', 'abel', 'ace', 'adam', 'adel', 'adrian', 'adrien', 'affair', 'agent', 'agn', 'al', 'aladdin', 'alan', 'albert', 'alex', 'alfr', 'ali', 'alic', 'alicia', 'alien', 'allen', 'alli', 'alvin', 'alyssa', 'amanda', 'amelia', 'american', 'ami', 'amo', 'andi', 'andrea', 'andrew', 'angel', 'angela', 'angelo', 'angus', 'ann', 'anna', 'anni', 'antoin', 'anton', 'antonio', 'ape', 'archer', 'archi', 'ariel', 'arjun', 'armstrong', 'arni', 'arroway', 'art', 'arthur', 'arturo', 'ash', 'audrey', 'aurora', 'austin', 'axel', 'babe', 'babi', 'balto', 'bambi', 'band', 'bank', 'barbara', 'barn', 'barney', 'bastian', 'bate', 'bateman', 'batman', 'beach', 'beal', 'bean', 'beatric', 'beckett', 'becki', 'beldar', 'bella', 'ben', 'bendrix', ...]
RMSE: 1.0314
HitRate: 0.003642384105960265

That didn’t really help… We see that a large portion of the informative features are names. We don’t want that, because names don’t really say anything about the movie’s content. Let’s remove them.
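One way to filter names is with NLTK's names corpus (the notebook may use a different name list; note the names must be stemmed the same way as the plots):

```python
import nltk
from nltk.corpus import names

nltk.download("names", quiet=True)
name_stems = {stemmer.stem(n.lower())
              for n in names.words("male.txt") + names.words("female.txt")}

feature_names = tfidf.get_feature_names_out()
keep = [i for i, feat in enumerate(feature_names) if feat not in name_stems]
plot_sim_no_names = cosine_similarity(tfidf_matrix[:, keep])
```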

RMSE: 1.0284
HitRate: 0.0041390728476821195

OK, this is somewhat more reasonable than the previous two approaches. First of all, the hit rate is higher (though still far below the collaborative filtering methods); second, the movies similar to Toy Story (1995) are actually similar; and third, the top predictions for user_1 even seem passable. Let us now try the movie-genre approach and see how that goes.

Part V.II — Content-based filtering using movie genres

In this part I will dial down the algorithmic complexity and create a KNN recommender based only on the movie genres. This means that if a person rated children’s movies very highly, we should expect the algorithm to recommend children’s movies. We will create all possible combinations of genres using Python’s itertools package and then apply our old friend TF-IDF to generate a feature matrix from this data.
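My reading of the combination scheme, sketched below: each movie's document contains its individual genres plus every pair of its genres as a single token, so genre combinations get their own TF-IDF weight (the exact scheme in the notebook may differ):

```python
from itertools import combinations

def genre_doc(genre_list):
    """Individual genres plus each pair of genres as one combined token."""
    tokens = list(genre_list)
    tokens += ["+".join(sorted(pair)) for pair in combinations(genre_list, 2)]
    return " ".join(tokens)

genre_docs = movies["genres"].str.split("|").apply(genre_doc)
# Genre names contain characters like '-' and "'", so split on whitespace only.
genre_tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(genre_docs)
genre_sim = cosine_similarity(genre_tfidf)
```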

RMSE: 1.0138
HitRate: 0.0031456953642384107

Unsurprisingly, the most similar movies to Toy Story (1995) are those with identical or nearly identical genres. Other than that, the results are quite confusing: why did user_1 get recommendations only for war movies? Recalling the table in Part III of this article (which shows all of user_1’s ratings), we see that there are two movies with the genre “war” in user_1’s rating list, and both got a score of 5. The other genres user_1 rated appear more frequently, so if we average user_1’s scores by genre, the “war” genre receives a perfect score while the other genres don’t.

Because the item-based, content-based KNN recommender predicts new items for user_1 by their similarity to items user_1 has already scored, it will give a perfect score to every war movie it sees in the context of user_1! The KNN recommender ignores ratings belonging to zero-similarity items, so when scoring a war movie for user_1 it only takes the war movie ratings into account and thus always predicts a perfect score.

Part V.III — Content-based filtering using user age+gender

Let us start by creating a feature vector from the users’ age and gender. We will assign 0 to ‘Male’ and 1 to ‘Female’, and we will take the age column as-is. Finally, we will normalize the columns so as not to be affected by differences in column magnitudes. Using these values we will create a cosine-similarity matrix as we did for the movies, but this time between each pair of users. This similarity matrix will be an input to our user-based KNN recommender.
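A sketch of the user-feature matrix and the resulting similarity matrix (min-max scaling is my choice of normalization):

```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

user_features = users[["gender", "age"]].copy()
user_features["gender"] = user_features["gender"].map({"M": 0, "F": 1})
user_features = MinMaxScaler().fit_transform(user_features)
user_sim = cosine_similarity(user_features)  # shape: (n_users, n_users)
```

This matrix can then be fed to the same precomputed-similarity KNN class from earlier, this time with user_based=True in the sim_options.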

RMSE: 0.9784
HitRate: 0.00347682119205298

We see that the RMSE is better than for the item-based methods, similar to the results we got in Part IV (collaborative filtering), where user-based recommendations outperformed item-based ones. The hit rate, however, is a bit worse than some of the item-based methods we tried. It seems that each method has its strengths and weaknesses, and a good approach would probably be to somehow combine them all; this will be the focus of the next part.

Part VI — Conclusions and thoughts ahead

We covered a lot in this post.

First, we inspected the MovieLens-1M dataset and got some pretty interesting insights from the graphs we saw, such as which movie genres tend to score higher than others on average. We then used some vanilla recommender-system algorithms from the Surprise Python package and got some pretty good results with the SVD algorithm.

After that, we tried our luck with content-based filtering, but unfortunately, this turned out to be a largely futile attempt. Despite that, in the process we unveiled an inherent weakness of KNN recommender systems (and content-based filtering in general): recommendations for a specific user are based solely on the items that he/she has interacted with in the past, and any bias in those interactions will show up in the KNN recommendations. This is why our spotlight user (user_1) got top scores for war movies when we used genres as the basis of our recommendations, even though we as humans wouldn’t necessarily agree. A more plausible recommendation for this user would be children’s movies, musicals, and the like.

Additionally, we saw that when dealing with natural text, such as movie plots, it is worthwhile to understand which features most affect the recommendations. When we removed person names, which were ranked very high by the TF-IDF algorithm, we suddenly got better results, and this was clearly visible when the movies most similar to Toy Story (1995) suddenly made sense.

In the next part, I will combine content-based filtering and collaborative filtering to get the best of both worlds. I already have my eye on the TensorFlow Recommenders package and I can’t wait to give it a go.

Until next time!

Elad

References:

[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872
