This summer I had the privilege of collaborating with Made With ML on a data science incubation project, which gave me meaningful hands-on experience. I chose the awesome MovieLens dataset and built a movie recommendation system that imitates, on a small scale, some of the most successful recommendation engines, such as those of TikTok, YouTube, and Netflix.
This article explains how I worked through the entire life cycle of the project and shares my solutions to some technical issues.

Ideas
The dataset contains three tables:
- movies.csv: The table that contains all the information about the movies, including title, tagline, description, etc. It has 21 columns (features) in total, so we can either focus on a few of them or try to use them all.
- ratings_small.csv: A table that records all the users’ ratings, along with the timestamp of each rating.
- links.csv: A table that records each movie’s unique ID in two movie databases: IMDb and TMDB.
There are two common recommendation techniques: collaborative filtering and content filtering. Collaborative filtering has the model learn the similarity between users, so that it can generate recommendations based on users’ previous choices, preferences, or tastes. Content filtering needs profiles of both the users and the items, so that the system can make recommendations according to the properties that users and items have in common.
Now I am going to try both of them step by step.
Collaborative Filtering
Collaborative filtering only requires me to keep track of users’ previous behavior, say, how much they liked a movie in the past. Fortunately, this is exactly the information recorded in ratings_small.csv. To implement the technique, I used the wonderful Python library Surprise, which provides a set of built-in algorithms commonly used in recommendation system development. I chose five of them and compared their accuracy, using RMSE (lower is better) as the measure.
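RMSE is the square root of the mean squared difference between predicted and actual ratings. To make the comparison idea concrete, here is a toy, pure-Python sketch rather than the Surprise benchmark from this project. It assumes a handful of made-up ratings and compares a global-mean baseline with a small matrix factorization trained by SGD, which is the same kind of model as Surprise's SVD. All names, data, and hyperparameters are illustrative.

```python
import math
import random


def rmse(predict, ratings):
    """Root-mean-square error of predict(user, item) over (user, item, rating) triples."""
    squared = [(predict(u, i) - r) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


def train_mf(train, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i with plain SGD (toy version)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in train) / len(train)
    bu, bi, p, q = {}, {}, {}, {}
    for u, i, _ in train:
        bu.setdefault(u, 0.0)
        bi.setdefault(i, 0.0)
        p.setdefault(u, [rng.gauss(0, 0.1) for _ in range(n_factors)])
        q.setdefault(i, [rng.gauss(0, 0.1) for _ in range(n_factors)])
    for _ in range(n_epochs):
        for u, i, r in train:
            err = r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i])))
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)

    def predict(u, i):
        # Unknown users or items contribute no bias and no factor term.
        dot = sum(a * b for a, b in zip(p.get(u, []), q.get(i, [])))
        return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

    return predict


# Made-up ratings: (user, movie, rating on a 1-5 scale).
train = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.5), ("u1", "m3", 1.0),
    ("u2", "m1", 4.5), ("u2", "m3", 1.5), ("u2", "m4", 1.0),
    ("u3", "m2", 1.0), ("u3", "m3", 5.0), ("u3", "m4", 4.5),
    ("u4", "m1", 1.5), ("u4", "m2", 1.0), ("u4", "m4", 5.0),
]
held_out = [("u1", "m4", 1.0), ("u2", "m2", 4.5), ("u3", "m1", 1.0), ("u4", "m3", 4.5)]

global_mean = sum(r for _, _, r in train) / len(train)


def baseline(user, item):
    """Always predict the average training rating."""
    return global_mean


mf = train_mf(train)
for name, model in [("global mean", baseline), ("toy MF", mf)]:
    print(f"{name}: train RMSE {rmse(model, train):.3f}, "
          f"held-out RMSE {rmse(model, held_out):.3f}")
```

With a real dataset, the held-out RMSE is the number to compare across algorithms; the training RMSE only shows that the model has fit the data it saw.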

SVD outperformed the other four algorithms, so I configured the final recommender with it and generated a recommendation list for each user.
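As an illustration of how a per-user list can be assembled from a model's predictions, here is a minimal sketch. It assumes the predictions are already available as (user, movie, estimated rating) triples and that movies a user has already rated should be skipped. The function name, the `seen` mapping, and the sample values are hypothetical, not taken from the project.

```python
from collections import defaultdict


def top_n_per_user(predictions, seen=None, n=10):
    """Group (user, movie, estimate) triples by user and keep the n best unseen movies."""
    seen = seen or {}
    by_user = defaultdict(list)
    for user_id, movie_id, estimate in predictions:
        if movie_id in seen.get(user_id, set()):
            continue  # do not recommend what the user has already rated
        by_user[user_id].append((movie_id, estimate))
    return {
        user_id: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, items in by_user.items()
    }


# Made-up predictions for two users.
predictions = [
    (1, "A", 4.2), (1, "B", 3.1), (1, "C", 4.8),
    (2, "A", 2.5), (2, "B", 4.0),
]
already_rated = {1: {"A"}}

recommendations = top_n_per_user(predictions, seen=already_rated, n=2)
print(recommendations)
```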

The most obvious advantage of collaborative filtering is how easy it is to implement. It does not require detailed information about the users or the items, and ideally it can be done in five lines of code.
Content Filtering
Although collaborative filtering has this outstanding advantage, the other side of the coin is just as apparent: it cannot resolve the "cold start" problem. This refers to the situation where a new item or a new user is added and the system has no way to either promote the item to consumers or suggest anything to the user, because it does not keep track of the properties of users and items. Until users start rating the new item, it will not be promoted; likewise, the system has no idea what to recommend until the new user starts rating.

Content filtering is the solution. It lets the system understand users’ preferences once user and item profiles are provided. For example, if a user’s playlist contains Justice League, Avengers, Aquaman, and The Shining, chances are that he/she prefers the action and horror genres. With collaborative filtering, this user might be suggested some comedies simply because other viewers who watched those four movies also watched comedies, which makes little sense if this particular user doesn’t like comedies at all. With content filtering, the issue can be avoided because the system already knows what this user prefers.
To implement a content-filtering recommendation system, I used TF-IDF to capture the importance of each genre in a movie (I only considered genres at this stage). I then calculated the sum product of these importance weights and the user’s preferences for the different genres (given in the user profile). Sorting the movies by this sum product, the system can simply suggest the top N candidates to the user.
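The scoring step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each movie's genre list is treated as a tiny document, the unsmoothed TF-IDF variant tf × ln(N / df) is used, and the user profile is a dictionary of genre weights. The movie titles, genres, and profile values are made up, and the project's own code may differ.

```python
import math
from collections import Counter


def genre_tfidf(movie_genres):
    """Return {title: {genre: tf-idf weight}} treating each genre list as a document."""
    n_movies = len(movie_genres)
    doc_freq = Counter(g for genres in movie_genres.values() for g in set(genres))
    weights = {}
    for title, genres in movie_genres.items():
        term_counts = Counter(genres)
        weights[title] = {
            g: (count / len(genres)) * math.log(n_movies / doc_freq[g])
            for g, count in term_counts.items()
        }
    return weights


def recommend(user_profile, weights, top_n=3):
    """Score each movie by the sum product of genre weights and user preferences."""
    scores = {
        title: sum(w * user_profile.get(genre, 0.0) for genre, w in genre_weights.items())
        for title, genre_weights in weights.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical catalogue and user profile (preference per genre between 0 and 1).
movie_genres = {
    "Movie 1": ["Action", "Adventure"],
    "Movie 2": ["Horror"],
    "Movie 3": ["Action", "Comedy"],
    "Movie 4": ["Comedy", "Romance"],
}
user_profile = {"Action": 0.9, "Horror": 0.7, "Comedy": 0.1}

weights = genre_tfidf(movie_genres)
ranking = recommend(user_profile, weights, top_n=4)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

Note that with this unsmoothed variant a genre shared by every movie gets a weight of zero, so it never influences the ranking.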
What if I’m new?
So far, I have built the user and movie profiles from existing users’ rating history. That does not entirely solve the cold start problem, because the system still has no idea what to do for new users or with new movies. Later in this article I will explain how I extract genre information from movie posters; first, I am going to show how the system responds to a new user.
I assume that new users fall into two groups: those who know what kinds of movies they want and those who don’t. For the first group, I let them pick whichever genres they like, and the system simply returns recommendations based on these self-provided preferences.
For those who do not know what they want yet, I implemented part of the work of Tobias Dörsch, Andreas Lommatzsch, and Christian Rakow. As soon as a new user without any preferences makes a request, the system scrapes the most popular Twitter accounts that focus on movies. I then match the most frequently mentioned named entities, recognized by spaCy, against the movies. The matched movies are supposed to be the ones most likely to be popular, given their close association with the people and movies currently in the spotlight.
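The matching step might look roughly like the sketch below. It assumes the named entities have already been extracted from the tweets (for example with spaCy) into a plain list of strings, and that each movie comes with a list of associated people. The function name, the exact-match rule, and all sample names are hypothetical.

```python
from collections import Counter


def trending_movies(entity_mentions, movies, top_n=5):
    """Rank movies by how often their title or associated people are mentioned."""
    counts = Counter(name.strip().lower() for name in entity_mentions)
    scored = []
    for title, people in movies.items():
        keys = {title.lower(), *(person.lower() for person in people)}
        score = sum(counts[key] for key in keys)
        if score > 0:
            scored.append((title, score))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Made-up entities, as if extracted from recent tweets.
entity_mentions = ["Jane Doe", "jane doe", "Space Saga", "John Roe"]
movies = {
    "Space Saga": ["Jane Doe"],
    "Quiet Town": ["John Roe"],
    "Old Film": ["Nobody Famous"],
}

trending = trending_movies(entity_mentions, movies)
print(trending)
```

Exact string matching is the simplest possible rule; a real system would likely need fuzzy matching to cope with nicknames and misspellings.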

How to release new movies?
A well-established movie streaming platform introduces new movies constantly. I wanted to simulate this behavior: whenever new movies start streaming, they should be recommendable by the content filtering recommendation system even if their production companies do not provide genre information. I developed a method that applies computer vision (CV) to generate the genres automatically from the movie posters; for the details, please visit this article.
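Because a movie can belong to several genres at once, this is a multi-label problem: the classifier produces one independent score per genre instead of a single winner. The sketch below shows only that last step, assuming a poster model (not shown here) outputs one raw score (logit) per genre. The genre list, threshold, and scores are made up for illustration.

```python
import math


def scores_to_genres(logits, genre_names, threshold=0.5):
    """Apply a sigmoid to each logit and keep every genre whose probability passes the threshold."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [genre for genre, p in zip(genre_names, probabilities) if p >= threshold]


genre_names = ["Action", "Comedy", "Horror"]
poster_logits = [2.0, -1.5, 0.3]  # hypothetical model output for one poster

predicted = scores_to_genres(poster_logits, genre_names)
print(predicted)
```

The predicted genres can then be fed into the same genre-based scoring used for content filtering, so a brand-new movie becomes recommendable without any ratings.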

Deployment
I wrapped up what I researched in the previous sections in a web application built with Streamlit. Feel free to have fun with it at https://recommendation-sys.herokuapp.com/.

Conclusion
This is my first simulation of a state-of-the-art recommendation engine. I leveraged my knowledge of NLP and CV, especially content/collaborative filtering recommendation and multi-label classification.
I should admit that there is still plenty of room for this project to improve. Here are some directions I plan to focus on:
- Utilize more of the information in the dataset. You might remember that movies.csv has 21 columns, but I only used genres to keep the demonstration simple. With more information as input, I expect the recommendations to become more personalized and targeted.
- Use more advanced recommendation techniques. The recommendation engines developed by big-name brands are increasingly sophisticated, and the logic behind them is increasingly deep. Model-based filtering and hybrid filtering are some of the recently emerging technologies.
- Eliminate the "filter bubble". If a user only watches the movies recommended by the system, and the recommendations are based on his/her previous watch history, then movies that don’t align with those preferences never reach the user, who ends up trapped in a "bubble". Some users are nonetheless open to other types of movies, so the recommendation system should find a balance between recommending similar movies and different ones.
- Recommend movies based on recent events. This sounds similar to what I did for new users who do not provide their preferences. The difference is that I simply recommended movies involving people (actors, filmmakers, etc.) who are currently popular, whereas a smarter system would understand what is happening right now, even if it is not movie-related, and recommend related movies. This would be an interesting direction, given that millions, if not billions, of people are stuck at home because of COVID-19 these days and might be willing to watch how humans eventually conquer a virus outbreak in a movie.
If you are interested in my project and willing to contribute to it, please feel free to visit here: