Transformation of a simple movie dataset into a functional Recommender System

All the steps from the creation of the models to the deployment of the web application using Python

Amine Zaamoun
Towards Data Science


Started as a dataset, now a Movie Recommender System by Amine Zaamoun
(Left) Small version of the latest ratings.csv movie dataset created and indexed by MovieLens | (Right) Recommender system web application I implemented and deployed locally, movie poster images from IMDb

Why a Recommender System?

Who among you hasn’t spent several minutes, or even hours, trying to choose a movie to watch alone or with your family, only to give up without success? Admit that at such times you would like someone to make up your mind for you: that is exactly the role of a recommender system. It is also one of the main reasons for the current success of giants such as Netflix and Amazon. I wrote this article to show that anyone with a little creativity and some experience in Data Science and programming can implement their own recommender system by following the few steps I’m going to describe. I carried out this project during my 8-month Data Science internship at DEUTSCHE TELEKOM AG, Innovation Hub (IHUB). The idea is also to focus on the practical aspect rather than on the theoretical and mathematical aspects, for which you can find scientific documentation everywhere on the Internet.

Overview and architecture of the system

Architecture of Amine Zaamoun’s Movie Recommender System
Movie Recommender System architecture, schema by author

The recommender system presented in this article was built in 4 major steps:
- Step 1: Calculation of the weighted average score of each movie in order to offer the end-user a catalog of the 100 most popular movies
- Step 2: Setting up the recommendation of 5 “popular” movies using a machine learning algorithm: k-Nearest Neighbors (kNN) with Scikit-learn
- Step 3: Setting up the recommendation of 5 “less known” movies using a deep learning algorithm: Deep Neural Matrix Factorization (DNMF) with Tensorflow and Keras
- Step 4: Deployment of the final system using the pre-computed results from the previous models on Flask, the Python web development framework

But first of all, let’s briefly explain why we specifically used a dataset in which users have rated the movies they have seen.

The Collaborative Filtering method

Amine Zaamoun’s Movie Recommender System Collaborative Filtering method
Collaborative Filtering method, image by Emma Grimaldi on Towards Data Science

This method makes it possible to build a model based on a user’s past behavior and similar decisions made by other users. Indeed, it is based on the movies selected in the dataset and the numerical ratings given to these movies. The model is then used to predict the movies that the user may have an interest in, through the predicted ratings of these movies.

MovieLens’ ratings.csv dataset

MovieLens’ ratings.csv dataset used for Amine Zaamoun’s Movie Recommender System
MovieLens’ ratings.csv file, source here

The highlighted line in this dataset reads as follows: user number 4 watched movie number 21 and gave it a rating of 3.0/5.0.
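
To make this reading concrete, here is a minimal sketch that parses one such row with Python's standard library. The column names follow the header of MovieLens' ratings.csv; the timestamp value is a placeholder of mine, not copied from the dataset.

```python
import csv
import io

# Illustrative excerpt only: the header matches MovieLens' ratings.csv,
# but the timestamp below is a placeholder, not a value from the dataset.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "4,21,3.0,0\n"
)

for row in csv.DictReader(sample):
    user_id = int(row["userId"])
    movie_id = int(row["movieId"])
    rating = float(row["rating"])
    print(f"user {user_id} rated movie {movie_id}: {rating}/5.0")
```

With the real file, you would pass an open file handle to csv.DictReader instead of the in-memory sample.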

All information about this dataset has been directly retrieved from the “Summary” section of the “README.html” page from the following link: https://grouplens.org/datasets/movielens/latest/

I quote: “This dataset [1] (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996, and September 24, 2018. This dataset was generated on September 26, 2018.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.”

Please also note that for the recommender system presented in this article, only the ratings of the movies were used, not the tag applications.

Step 1: Calculation of the weighted average score of each movie

The goal of this very first step is to offer to the end-users of our recommender system a catalog of popular movies from which they can choose their favorites.

Data visualization of the most popular movies by weighted average score used by Amine Zaamoun’s Movie Recommender System
Most popular movies by weighted average scores, chart by author

The code itself is quite self-explanatory; the only notable element is the use of PySpark to perform this calculation. This library exposes SQL-style functions such as “mean” and “col”, which help keep the code organized and readable. However, the same calculation is perfectly feasible with the Pandas library, which is a little more popular among Data Science beginners.

code of the first step in our Movie Recommender System implementation
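
For readers who prefer Pandas, here is a minimal sketch of such a calculation. The article does not spell out the exact formula, so I am assuming the common IMDb-style weighted rating, WR = v/(v+m) * R + m/(v+m) * C, where v is the number of ratings of a movie, R its mean rating, C the mean rating over the whole dataset and m a minimum-votes threshold. The function name, the toy data and the value of m are hypothetical.

```python
import pandas as pd

def weighted_scores(ratings: pd.DataFrame, min_votes: int) -> pd.DataFrame:
    """Rank movies by an IMDb-style weighted rating (assumed formula)."""
    stats = ratings.groupby("movieId")["rating"].agg(votes="count", mean_rating="mean")
    global_mean = ratings["rating"].mean()  # C in the formula
    v, r = stats["votes"], stats["mean_rating"]
    # Movies with few votes are pulled toward the global mean.
    stats["weighted_score"] = (v / (v + min_votes)) * r + (min_votes / (v + min_votes)) * global_mean
    return stats.sort_values("weighted_score", ascending=False)

# Toy ratings (hypothetical): movie 10 has many good ratings,
# movie 20 a single perfect one, movie 30 a single poor one.
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 2],
    "movieId": [10, 10, 10, 10, 20, 30],
    "rating":  [5.0, 5.0, 4.0, 5.0, 5.0, 2.0],
})
catalog = weighted_scores(ratings, min_votes=2)
print(catalog)
```

In this toy example, the movie with a single 5.0 rating ends up below the movie with four strong ratings, which is the point of weighting by the number of votes. On the real dataset, keeping the first 100 rows of the result would give the catalog.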

Step 2: Setting up the recommendation of 5 “popular” movies using k-Nearest Neighbors (kNN)

The objective of this second step is to recommend to the end-user a series of movies that can be described as “popular”.
First of all, it helps to reassure the user in the sense that he will recognize at least one of the recommended movies. Indeed, if he doesn’t recognize any of the recommended movies, he may reject the usefulness of our system. This factor, psychological and human, is unfortunately not quantifiable. It also proves that the best mathematical and statistical model may not be suitable for some users if the cultural aspect is not taken into account.
Secondly, the fact that the movies recommended by the kNN algorithm are all “popular” is a direct consequence of the filtering applied to the data before training the machine learning model. As a matter of fact, the frequency of ratings in our dataset follows a “long tail” distribution. This means that most movies were rated only very rarely, while an “overwhelming minority” were rated far more often than all the rest (more details in this excellent article by Kevin Liao). This filter therefore allowed only the most popular movies to be used to train the kNN algorithm, so the resulting recommendations could only be popular movies as well.
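
As an illustration of this kind of prior filtering, here is a minimal Pandas sketch that keeps only the ratings of movies rated at least a given number of times. The function name, the threshold and the toy data are hypothetical; the article does not state which threshold was actually used.

```python
import pandas as pd

def keep_popular_movies(ratings: pd.DataFrame, min_ratings: int) -> pd.DataFrame:
    """Keep only the rows whose movie has at least `min_ratings` ratings."""
    counts = ratings["movieId"].value_counts()
    popular_ids = counts[counts >= min_ratings].index
    return ratings[ratings["movieId"].isin(popular_ids)]

# Toy data (hypothetical): movie 30 sits in the "long tail" with one rating.
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0, 4.5, 5.0],
})
popular = keep_popular_movies(toy, min_ratings=2)
print(popular)
```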

This algorithm also has the advantage of being fairly easy to understand and fairly easy to explain as well. This is especially true with non-technical people such as your company’s sales team or simply your friends and family who are not necessarily in the field of Data Science. As Kevin Liao explained in his article: “When KNN makes inferences about a movie, KNN will calculate the ‘distance’ between the target movie and every other movie in its database, then it ranks its distances and returns the top K nearest neighbor movies as the most similar movie recommendations”.
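
The procedure described in this quote can be sketched by hand in a few lines of plain Python. This is only an illustration of the idea, not the code used in the project: each movie is represented by a hypothetical vector of ratings (one entry per user, 0.0 meaning "not rated"), and every movie is assumed to have at least one rating so that no vector is all zeros.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two rating vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def k_nearest(target, vectors, k):
    """Compute the distance to every other movie, rank, keep the k closest."""
    distances = [
        (title, cosine_distance(vectors[target], vec))
        for title, vec in vectors.items()
        if title != target
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Toy rating vectors (hypothetical), one column per user.
ratings_by_movie = {
    "Movie A": [5.0, 4.0, 0.0, 0.0],
    "Movie B": [4.0, 5.0, 0.0, 1.0],
    "Movie C": [0.0, 0.0, 5.0, 4.0],
}
neighbors = k_nearest("Movie A", ratings_by_movie, k=2)
print(neighbors)
```

"Movie B" comes out first because it was rated by the same users as "Movie A", while "Movie C" shares no rater with it.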

Schema that explains how the kNN algorithm used by Amine Zaamoun’s Movie Recommender System works
“Illustration of how KNN makes classification about new sample”, schema by Kevin Liao on Towards Data Science
Top 10 recommendations for “Iron Man (2008)” resulting from the kNN model used by Amine Zaamoun’s Movies Recommender System
Top 10 closest neighbor movies to “Iron Man (2008)” according to the kNN algorithm used in my recommender system, results by author

As you can see in this example, the closest neighbor movie to “Iron Man (2008)” is “The Dark Knight (2008)”, with a cosine distance of approximately 0.33. From a subjective and personal point of view, this result seems very coherent, since both are superhero movies. We can also notice the presence of “Avatar (2009)” and “Inception (2010)”, which are both science fiction movies. I think it is worth noting the magic of this machine learning algorithm because, as a reminder, only the ratings given on a scale of 0.5 to 5.0 have been used. As a matter of fact, the genres of these movies have not been used in order to provide these recommendations. Here is an associated snippet of the code to show you how to implement this algorithm using the Scikit-Learn library and obtain recommendations according to a chosen movie title:

snippet of the kNN algorithm from the second step in our Movie Recommender System implementation
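
As a rough idea of what such an implementation can look like, here is a minimal sketch built on Scikit-Learn's NearestNeighbors with the cosine metric. It is not the project's actual snippet: the toy ratings, the choice to fill missing ratings with 0 and the helper name similar_movies are assumptions of mine, and it looks movies up by id rather than by title to stay short.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings (hypothetical): movies 1 and 2 share raters, movie 3 does not.
ratings = pd.DataFrame({
    "userId":  [1, 2, 1, 2, 3, 4],
    "movieId": [1, 1, 2, 2, 3, 3],
    "rating":  [5.0, 4.0, 4.0, 5.0, 5.0, 4.0],
})

# One row per movie, one column per user; unrated cells are filled with 0 (assumption).
movie_user = ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0.0)

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(movie_user.values)

def similar_movies(movie_id, k):
    """Return up to k (movieId, cosine distance) pairs closest to movie_id."""
    row = movie_user.index.get_loc(movie_id)
    n_neighbors = min(k + 1, len(movie_user))  # +1 because the query movie is returned too
    distances, indices = model.kneighbors(movie_user.values[[row]], n_neighbors=n_neighbors)
    pairs = [(int(movie_user.index[i]), float(d)) for d, i in zip(distances[0], indices[0])]
    return [pair for pair in pairs if pair[0] != movie_id][:k]

print(similar_movies(1, k=2))
```

To query by title, you would add a mapping from titles to movieId values, for example built from the movies.csv file of the same dataset.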

Step 3: Setting up the recommendation of 5 “less known” movies using Deep Neural Matrix Factorization (DNMF)

The purpose of this third step, and of the choice of this algorithm, is to recommend to the end-user a series of movies that tend to be “less known”. Without going into too much detail, just remember that no prior filtering is needed here: a movie can be used as training data regardless of its popularity. This algorithm is mathematically quite complex, as it combines two models frequently used in Data Science. The first is Matrix Factorization, for example through the Alternating Least Squares (ALS) algorithm. The second is a deep neural network, such as the Multi-Layer Perceptron (MLP). Explaining it correctly would require a whole article and, as I announced earlier, the goal here is not to give a statistics course. I will therefore let you read two resources that already explain these concepts very well: the article “Prototyping a Recommender System Step by Step Part 2: Alternating Least Square (ALS) Matrix Factorization in Collaborative Filtering”, written by Kevin Liao at the end of 2018, and the article “Building A Deep Learning Model using Keras”, written by Eijaz Allibhai in 2018 as well. Assuming you now have at least basic knowledge of these two models, the Deep Neural Matrix Factorization (DNMF) algorithm used in this third step has the following architecture:

Architecture of the Deep Neural Matrix Factorization (DNMF) algorithm used by Amine Zaamoun’s Movie Recommender System
Deep Neural Matrix Factorization architecture, schema by author

The principle of this algorithm is the same as that of a classical matrix factorization: using this model, we try to predict the rating that a certain user would have given to a certain movie. I say “would have given” because this algorithm fills in the blank values that currently exist in the ‘ratings.csv’ dataset. Let me explain: even a big movie fan may not have seen or rated all 9742 movies in the dataset at our disposal. The idea is then to assign predicted ratings to the movies he hasn’t rated yet, which indicate whether he is likely to enjoy them. This is exactly what the matrix factorization part of our algorithm does. The addition of the neural network then aims to further increase the predictive performance of the model, thus reducing the error between predicted and actual ratings. Here is a code snippet to show you how to implement such a model using the Tensorflow and Keras libraries. We will use it to predict the rating associated with a (userId, movieId) pair that does not exist yet in the dataset.

snippet of the DNMF algorithm from the third step in our Movie Recommender System implementation
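
To give an idea of how a model of this family can be assembled with Keras, here is a minimal sketch. It is not the project's actual snippet and may differ from the schema above: I am assuming a layout with a matrix factorization branch (element-wise product of user and movie embeddings) and an MLP branch (concatenated embeddings passed through dense layers), merged into a single rating output. The function name build_dnmf, the embedding sizes and the layer widths are all hypothetical, and user and movie ids are assumed to be re-indexed from 0.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_dnmf(n_users, n_movies, mf_dim=8, mlp_dim=16):
    """Assumed layout: MF branch + MLP branch, merged into one rating output."""
    user_in = keras.Input(shape=(1,), dtype="int32", name="userId")
    movie_in = keras.Input(shape=(1,), dtype="int32", name="movieId")

    # Matrix factorization branch: element-wise product of the two embeddings.
    mf_user = layers.Flatten()(layers.Embedding(n_users, mf_dim)(user_in))
    mf_movie = layers.Flatten()(layers.Embedding(n_movies, mf_dim)(movie_in))
    mf_vec = layers.Multiply()([mf_user, mf_movie])

    # MLP branch: separate embeddings, concatenated and fed to dense layers.
    mlp_user = layers.Flatten()(layers.Embedding(n_users, mlp_dim)(user_in))
    mlp_movie = layers.Flatten()(layers.Embedding(n_movies, mlp_dim)(movie_in))
    mlp_vec = layers.Concatenate()([mlp_user, mlp_movie])
    mlp_vec = layers.Dense(32, activation="relu")(mlp_vec)
    mlp_vec = layers.Dense(16, activation="relu")(mlp_vec)

    # Merge both branches and regress the rating.
    merged = layers.Concatenate()([mf_vec, mlp_vec])
    rating = layers.Dense(1, name="rating")(merged)

    model = keras.Model(inputs=[user_in, movie_in], outputs=rating)
    model.compile(optimizer="adam", loss="mse")
    return model

# Tiny untrained model, just to show the input and output shapes.
model = build_dnmf(n_users=5, n_movies=7)
user_ids = np.array([[0], [3]], dtype="int32")
movie_ids = np.array([[1], [6]], dtype="int32")
predictions = model.predict([user_ids, movie_ids], verbose=0)
print(predictions.shape)
```

After calling model.fit on the known (userId, movieId, rating) triples, the same predict call can be used on pairs that are absent from the dataset.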

We can now follow the same logic to predict the scores of all (userId, movieId) pairs that do not exist yet in our ‘ratings.csv’ dataset. Let’s take user n°401 as an example; the top 10 of his movie ratings predicted by the DNMF algorithm is as follows:

Top 10 movie rating predictions for user n°401 resulting from the DNMF model used by Amine Zaamoun’s Movie Recommender System
Top 10 movie rating predictions for user n°401 according to the DNMF algorithm used in my recommender system, results by author
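
The ranking logic behind such a table can be sketched independently of the model. In the snippet below, predict is a hypothetical stand-in for the trained DNMF model (any callable returning a predicted rating for a user and a movie), and the function name and toy values are mine.

```python
def top_n_for_user(user_id, all_movie_ids, rated_pairs, predict, n=10):
    """Rank the movies the user has not rated yet by predicted rating."""
    candidates = [m for m in all_movie_ids if (user_id, m) not in rated_pairs]
    scored = [(m, predict(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy setup (hypothetical): user 401 already rated movies 4 and 5.
all_movies = [1, 2, 3, 4, 5]
already_rated = {(401, 4), (401, 5)}

def fake_predict(user_id, movie_id):
    # Placeholder for the trained model: higher movie ids get higher scores.
    return movie_id / 10

top = top_n_for_user(401, all_movies, already_rated, fake_predict, n=2)
print(top)
```

Only pairs missing from the dataset are scored, which matches the idea of filling in the blank values rather than re-predicting known ratings.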

We can now save the results of the 2 tables generated using this model in 2 different csv files: the top 10 users recommended for each movie and the top 10 movies recommended for each user.

pdUserRecs.to_csv(os.path.join(trained_datapath, 'DNMF_MovieRecommendationsForAllUsers.csv'), index=False)
pdMovieRecs.to_csv(os.path.join(trained_datapath, 'DNMF_UserRecommendationsForAllMovies.csv'), index=False)

Step 4: Deployment of the final system using the pre-computed results from the previous models on Flask

Here we are finally at the last step, which this time requires some basic knowledge of web development, useful for properly deploying the system as a real application. This web application ties together all the work done in the previous steps of this article. The user starts by choosing 3 movies from a catalog of the 100 most popular movies, ranked by their weighted average scores from step 1. These 3 movies are used as input data for our 2 models in order to obtain a final recommendation of 10 movies, of which 5 come from the kNN and 5 from the DNMF. Also, in order to offer the end-user a fast and smooth experience, the predictions given by the DNMF model have been pre-calculated. What does this mean concretely? It means that for each of the 3 movies chosen, the system will search the ‘DNMF_UserRecommendationsForAllMovies.csv’ table for the 5 users who “match” the most according to the predicted scores:

Top 10 user matching predictions for movie n°6726 from the DNMF model used by Amine Zaamoun’s Movie Recommender System
Top 10 user matching predictions for movie n°6726 according to the DNMF algorithm used in my recommender system, results in image by author

The system then uses this list of best-matching users to repeat the same process in the other direction. In other words, using the other saved table, it adds the 5 favorite movies of each of these users to another list, from which 5 are randomly kept at the end. This allows us to give the end-user of our web application movie recommendations based on similar user profiles. Another very important point is that these recommendations are given quickly and accurately, without having to wait hours for the model to be re-trained, hence the major interest of having pre-calculated the DNMF results.

code of the make_recommendations() function from the last step in our Movie Recommender System implementation
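
The DNMF part of this lookup can be sketched with two plain dictionaries standing in for the two saved CSV tables. This is an illustration under my own assumptions, not the project's function: the names, the dictionary layout, and the choices to drop duplicates and to exclude the movies the user already picked are all hypothetical.

```python
import random

def recommend_from_tables(chosen_movies, users_for_movie, movies_for_user,
                          n_users=5, n_per_user=5, n_final=5, seed=None):
    """Chosen movies -> best-matching users -> their top movies -> random pick."""
    rng = random.Random(seed)
    pool = []
    for movie_id in chosen_movies:
        # Best-matching users for this movie (first pre-computed table).
        for user_id in users_for_movie.get(movie_id, [])[:n_users]:
            # Favorite movies of each of these users (second pre-computed table).
            for candidate in movies_for_user.get(user_id, [])[:n_per_user]:
                if candidate not in chosen_movies and candidate not in pool:
                    pool.append(candidate)
    return rng.sample(pool, min(n_final, len(pool)))

# Toy tables (hypothetical ids), each list already sorted by predicted score.
users_for_movie = {1: [10, 11], 2: [11, 12], 3: [10]}
movies_for_user = {10: [1, 4, 5, 6], 11: [2, 7, 8], 12: [9, 4, 3]}

recs = recommend_from_tables([1, 2, 3], users_for_movie, movies_for_user, seed=0)
print(recs)
```

Since both tables are read from disk rather than recomputed, this step involves only lookups and a random draw, which is why the web application can answer quickly.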

As you can notice, a POST request is sent to the server when the user has chosen his 3 movies and presses the button to get his recommendations. When this request is processed, the presented function returns several variables that are associated with templates. Here is how to use them in the ‘index.html’ file of our deployed web application:

snippet of the index.html file from the last step in our Movie Recommender System implementation
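
For readers unfamiliar with Flask, here is a minimal, self-contained sketch of the request flow described above: a form posts the chosen movies, the view function computes recommendations and passes them to a template as variables. It is not the project's code. To stay runnable in a single file it uses render_template_string with a small inline template standing in for ‘index.html’, and the catalog, the form field name "movies" and the placeholder recommend_for function are hypothetical.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Inline stand-in for the real index.html (hypothetical markup).
INDEX_TEMPLATE = """
<form method="post">
  {% for title in catalog %}
  <label><input type="checkbox" name="movies" value="{{ title }}">{{ title }}</label>
  {% endfor %}
  <button type="submit">Get recommendations</button>
</form>
{% if recommendations %}
<ul>
  {% for title in recommendations %}<li>{{ title }}</li>{% endfor %}
</ul>
{% endif %}
"""

CATALOG = ["Movie A", "Movie B", "Movie C", "Movie D"]

def recommend_for(chosen):
    """Placeholder for the kNN and DNMF lookups: returns the unchosen titles."""
    return [title for title in CATALOG if title not in chosen]

@app.route("/", methods=["GET", "POST"])
def make_recommendations():
    recommendations = []
    if request.method == "POST":
        # All checked boxes share the same field name, hence getlist().
        chosen = request.form.getlist("movies")
        recommendations = recommend_for(chosen)
    # The keyword arguments become variables usable inside the template.
    return render_template_string(
        INDEX_TEMPLATE, catalog=CATALOG, recommendations=recommendations
    )
```

In a real project, the template would live in templates/index.html and be rendered with render_template('index.html', ...) instead; the way variables are passed stays the same.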

Thus, the end-user will be able to benefit from nice-looking movie recommendations accompanied by their posters in a functional web application. Here is what the final result looks like:

Recommender System Web Application recommendations with movie posters by Amine Zaamoun
Deployed system final recommendations, image by author

And that’s it! You can now drink a good coffee after this reading and then try to implement your very own version of this system.

Summary

In this article, we have seen how to transform a simple dataset into a truly functional movie recommender system using the Python programming language, and how to deploy it as a web application. We also learned that a recommender system usually relies on several interconnected algorithms, which is useful for providing recommendations for each type of product, whether “popular” or “less known”. I tried my best to present the topic in a practical rather than theoretical way, so that anyone can follow along, and I hope you liked it. The source code is available in my GitHub Movie_Recommender_System-Python repo.

References

[1] F. Maxwell Harper and Joseph A. Konstan. The MovieLens Datasets: History and Context (2015), ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19.

Don’t hesitate to visit my GitHub to follow my Data Science projects. You can also contact me directly on LinkedIn if you have any questions; I’ll be happy to help you! I would also like to thank my supervisor, Dipl.-Ing. Aykan Aydin, Chapter AI & Data Science, for his precious tips and advice during the realization of this project.


BI Engineer @Amazon | President of DOT ESSEC & CentraleSupélec in the DSBA Master's degree | Centrale Lille accredited Engineer