Just as I did a couple of days ago with the Kickstart Campaigns analysis project, today I'll walk through an analysis of movie ratings. Here’s the link to the previous article, if you’ve missed it:
Recommender systems are at the core of most of the bigger (and smaller) webshops, movie/TV show sites like Netflix, and many others. Here’s the ‘official’ definition, according to Wikipedia:
A recommender system or a recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. They are primarily used in commercial applications.[1]
There exist two main types of recommender systems:
- Collaborative Filtering – based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past
- Content-Based – based on a description of the item and a profile of the user’s preferences. They treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on product features.
There, of course, exist some derivations of those two, but this is more than enough for now. Let’s focus on what this article will talk about.
Why should you read this article?
There are two main benefits:
- You’ll dive into the world of recommendation systems – a lot of fun to do, and a nice thing to add to your resume
- You’ll walk through yet another exploratory data analysis process, further advancing your skills in the area
Sounds like something that would be beneficial to you? Good, let’s get started.
How is this post structured?
This is the first of two posts I will write on recommendation systems. It covers the process of data gathering, exploratory data analysis, and data visualization. Although this post won’t exactly cover the process of creating the recommender system, every data science project should start with getting familiar with the data itself.
In case you’re wondering, the second post in this series will cover building a recommender system from scratch in Python, so that post will be the meat and potatoes of the series, and this one is here to warm you up.
The contents of the post are as follows:
- Data gathering and importing
- Basic data preparation
- Exploring movie publish years
- Rating exploration
- Movie genre exploration
- Calculating ratings by genre
- Visualizing the number of ratings
- Conclusion
Yeah, I know what you are thinking, and I know that it’s a lot of ground to cover. But I’ll try to keep it as short and as to the point as possible.
Data Gathering and Importing
The dataset can be found on the official GroupLens website. So go there and choose the one you like. They are divided into a few sizes:
I’ve downloaded the first one, the 100K dataset, because the larger ones caused memory issues on my machine. If your PC is stronger than mine (i5-8300H, 8 GB RAM), feel free to download any of the larger ones. But then again, if you don’t want to wait long for computations to finish, download the 100K dataset; the results, in general, won’t vary much.
Okay, dataset downloaded? Let’s import it in Python. Here are all of the libraries you’ll need, along with the CSV reads and the merge process:
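A minimal sketch of the import and merge step, assuming the download contains ratings.csv and movies.csv that share a movieId column. To keep the snippet runnable on its own, two tiny in-memory stand-ins (illustrative rows only) replace the real files; with the actual download you would pass the file paths to pd.read_csv instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for ratings.csv and movies.csv (illustrative rows only).
# With the real download, pass the file paths to pd.read_csv instead.
RATINGS_CSV = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,3.5,1445714835\n"
)
MOVIES_CSV = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

ratings = pd.read_csv(RATINGS_CSV)
movies = pd.read_csv(MOVIES_CSV)

# One row per rating, enriched with the title and genres of the rated movie.
df = ratings.merge(movies, on="movieId", how="left")
print(df.head())
```

A left merge keeps every rating even if a movie were missing from movies.csv, which makes gaps visible as NaN instead of silently dropping rows.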
Once that’s done, here’s what the dataset will look like:

Basic Data Preparation
I have to admit something to you – I was lying about the whole data cleaning process. There won’t be one for this dataset, because well, it is as clean as it can be:

Not a single value missing, such a rare occasion. God bless educational datasets!

As the dataset’s name suggests, there are 100K entries. But how many movies are there?

Around 10K. Pretty good actually, I think it will be enough to create some general conclusions on the movie market.
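The two checks in this section (missing values and the number of distinct movies) can be sketched as follows. The small DataFrame is a made-up stand-in for the merged dataset, so the printed numbers will differ from the real ones.

```python
import pandas as pd

# Illustrative stand-in for the merged ratings/movies DataFrame.
df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 3, 1, 6],
    "rating": [4.0, 4.0, 3.5, 5.0],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)",
              "Toy Story (1995)", "Heat (1995)"],
})

# Missing values per column; all zeros means there is nothing to impute or drop.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Every row is one rating, so count distinct movieId values to get the number of movies.
n_ratings = len(df)
n_movies = df["movieId"].nunique()
print(f"{n_ratings} ratings across {n_movies} movies")
```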
Exploring Movie Publish Years
There is some preparation necessary for this step. The goal is to analyze the distribution of movie publish years, but this information isn’t available in the desired form, yet.
The publishing year is available in the title column, surrounded by brackets at the end of the title. For some movies, however, that is not the case, which makes the extraction a bit harder. So, the logic is as follows:
- If publishing year is present, remove the brackets and keep integer representation of the year
- Else, put 9999 as the year – an obvious indicator that year is missing
This block of code implements the described logic:
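One way to sketch the described logic uses a regular expression instead of manual string slicing. The sample titles are illustrative (the last one is a hypothetical title without a year), and the name pub_year is my own choice.

```python
import pandas as pd

MISSING_YEAR = 9999  # sentinel from the text: the title carries no year

titles = pd.Series([
    "Toy Story (1995)",
    "Heat (1995) ",             # trailing whitespace, handled by str.strip()
    "Some Title Without Year",  # hypothetical title with no year
])

# Capture four digits wrapped in brackets at the very end of the title.
# Titles that don't match yield NaN.
years = titles.str.strip().str.extract(r"\((\d{4})\)$", expand=False)

# Convert to numbers, replace NaN with the sentinel, then cast to int.
pub_year = pd.to_numeric(years).fillna(MISSING_YEAR).astype(int)

n_missing = int((pub_year == MISSING_YEAR).sum())
print(pub_year.tolist(), n_missing)
```

Anchoring the pattern with `$` avoids picking up a bracketed number that appears in the middle of a title.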
As it turns out, year information is missing for only 30 instances, which isn’t terrible (as we have 100K rows). Now let’s plot a nice histogram of those years (excluding the year 9999). Here’s the code:
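A minimal sketch of such a histogram helper, assuming matplotlib is installed. The function name make_histogram and the column moviePubYear follow the names used in this post; the parameter list is my own assumption, and the data is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

MISSING_YEAR = 9999  # sentinel for titles without a year


def make_histogram(dataset, attribute, bins=25, title="", xlab="", ylab="Count"):
    """Draw a histogram of one numeric column and return the Axes."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.hist(dataset[attribute], bins=bins, edgecolor="black")
    ax.set_title(title)
    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)
    return ax


# Illustrative rows only.
df = pd.DataFrame({"moviePubYear": [1994, 1995, 1995, 1999, 2003, 9999]})

# Drop the sentinel first, otherwise it would stretch the x-axis out to 9999.
known_years = df[df["moviePubYear"] != MISSING_YEAR]
ax = make_histogram(known_years, "moviePubYear", bins=10,
                    title="Movies by publish year", xlab="Year")
```

In a script, call plt.show() afterwards to display the figure; in a notebook the chart renders on its own.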
Once this cell of code is executed, here’s the resulting chart:

It looks like most movies came out between 1995 and 2005.
Rating Exploration
Logic from the make_histogram function can be used here, without any other preparation or data manipulation, to draw a histogram of movie ratings. Just make sure to pass rating as the attribute, instead of moviePubYear.

Looks like the average rating would be somewhere around 3.5, and it also looks like users are more prone to give a full-star rating than a half-star one.
Movie Genre Exploration
As with the publishing years, some preparation will be required. Take a look at what the genre column looks like:

There’s no way to analyze it the way it currently is. What I want to accomplish is the following:
- Split the string on the pipe (|) character
- Create a new entry for each genre
So, 1 row of Adventure|Animation|Children|Comedy|Fantasy should become 5 rows, with other information remaining the same. Pandas provides a nice way of accomplishing this:
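A minimal sketch of the split, using DataFrame.explode (available in pandas 0.25 and later). This is one possible way to do it, not necessarily the one used for the charts here; the column is assumed to be named genres, and genre_df is my own name for the result.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# Turn each genre string into a list, then give every list element its own row.
# All other columns are repeated for each new row.
genre_df = (
    df.assign(genres=df["genres"].str.split("|"))
      .explode("genres")
      .reset_index(drop=True)
)
print(genre_df)
```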
This way a new DataFrame is created, and the first couple of rows look like this:

These 5 rows now represent 1 row from the original dataset, so yeah, this is where my laptop failed on the larger datasets – it wasn’t capable of holding something like 100 million rows in memory.
Similarly, I can now declare a function for plotting bar charts:
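A minimal sketch of a bar chart helper, assuming matplotlib is installed. The name make_bar_chart and its parameters are hypothetical, and the data is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def make_bar_chart(dataset, attribute, title="", xlab="Count", ylab=""):
    """Plot how often each value of a categorical column occurs; return the Axes."""
    # Ascending sort puts the most frequent category at the top of the chart.
    counts = dataset[attribute].value_counts().sort_values()
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.barh(list(counts.index), counts.values)
    ax.set_title(title)
    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)
    return ax


# Illustrative stand-in for the one-genre-per-row DataFrame.
genre_df = pd.DataFrame(
    {"genres": ["Comedy", "Drama", "Comedy", "Romance", "Drama", "Comedy"]}
)
ax = make_bar_chart(genre_df, "genres", title="Ratings per genre")
```

Horizontal bars keep long genre names readable, which is why barh is used here instead of bar.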
A lot of prep for just one chart, I know, but it was worth it:

Most movies are in the drama or comedy category – nothing unusual here.
Calculating Ratings by Genre
Obtaining this information will probably be the most difficult task in this post. That’s mainly because it involves a good bit of preparation, and the logic behind it may not be as intuitive as in the earlier steps. You will need to:
- Calculate the rating for each movie on individual genre level (genre string split by | )
- Append a list of ratings for every genre to a dictionary
- Calculate the average rating as the mean of the list in the dictionary
Sounds confusing? Maybe a bit, but try to read and execute the following code block line by line, to get a gist of it:
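The three steps above can be sketched like this. I'm assuming that every individual rating is filed under each genre of the rated movie; the names ratings_by_genre and avg_rating_by_genre are my own, and the data is illustrative.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    "rating": [4.0, 3.0, 5.0, 2.0],
    "genres": ["Comedy|Romance", "Comedy", "Drama|War", "Horror"],
})

# Steps 1 and 2: split each genre string and file the rating under every genre.
ratings_by_genre = defaultdict(list)
for genre_string, rating in zip(df["genres"], df["rating"]):
    for genre in genre_string.split("|"):
        ratings_by_genre[genre].append(rating)

# Step 3: the average rating of a genre is the mean of its list.
avg_rating_by_genre = pd.Series(
    {genre: sum(values) / len(values) for genre, values in ratings_by_genre.items()}
).sort_values(ascending=False)
print(avg_rating_by_genre)
```

If you already have a DataFrame with one genre per row, grouping it by genre and taking the mean of rating gives the same numbers in one line.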
From here you can easily plot the ratings by genre:

Seems like ‘Film-Noir’ did the best, but then, that genre has the fewest movies. Horror movies did the worst, which makes sense, as there are a ton of just terrible horror movies. Both documentaries and war movies tend to do better than the majority, which also makes sense – everybody loves a good WW2 movie, and if it’s presented as a documentary that’s a five-star rating from me!
Visualizing the Number of Ratings
You’ve made it to the last part, good work! Let’s now dive further into visualizing the number of ratings. It will also involve some prep work, but nothing demanding. You will have to:
- Create a DataFrame with movieId column grouped, and count the instances
- Merge it with the original dataset
- Rename columns that were messed up upon merging
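The steps above can be sketched as follows. The column name numRatings is the one used later in this post; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": [1, 1, 1, 3, 3, 6],
    "title": ["Toy Story (1995)"] * 3 + ["Grumpier Old Men (1995)"] * 2 + ["Heat (1995)"],
    "rating": [4.0, 3.5, 5.0, 4.0, 3.0, 4.5],
})

# Step 1: group by movieId and count the ratings each movie received.
num_ratings = df.groupby("movieId")["rating"].count().reset_index()

# Step 2: merge the counts back. Both frames have a "rating" column,
# so pandas disambiguates them with the suffixes _x and _y.
merged = df.merge(num_ratings, on="movieId")

# Step 3: restore readable column names.
merged = merged.rename(columns={"rating_x": "rating", "rating_y": "numRatings"})
print(merged)
```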
Before jumping into visualization, let’s take a look at the top 10 movies according to the number of ratings:
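A minimal sketch of how such a top-10 table could be produced, assuming a ratings-level DataFrame with title and rating columns. The avgRating column is a hypothetical addition for context, and the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Toy Story (1995)"] * 3 + ["Grumpier Old Men (1995)"] * 2 + ["Heat (1995)"],
    "rating": [4.0, 3.5, 5.0, 4.0, 3.0, 4.5],
})

# Count and average the ratings per title, then keep the most-rated movies.
top_movies = (
    df.groupby("title")
      .agg(numRatings=("rating", "size"), avgRating=("rating", "mean"))
      .sort_values("numRatings", ascending=False)
      .head(10)
)
print(top_movies)
```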
I’m quite satisfied with the results. These most-rated movies are classics and deserve to be where they are.
Why did you sort by the number of ratings instead of the pure rating?
Ah, excellent question. The reason is, I want to avoid movies which were rated with 5 stars, but only by 1 or a couple of users. Those movies are not relevant; if they were popular enough, more users would have rated them.
You can now draw a histogram of the numRatings column:

This was expected. Most movies don’t have a good budget, and in general, that leads to a not so popular movie. If the movie isn’t popular, the majority of people won’t watch it, and therefore, won’t rate it. Love it or hate it, that’s how it works.
And now finally, let’s conclude this section with a nice scatter plot of rating vs. the number of ratings. Here’s the code:
Here’s what my chart looks like:
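A minimal sketch of such a scatter plot, assuming matplotlib is installed and that each point represents one movie (its average rating against its number of ratings). The avgRating name is my own, and the data is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "movieId": [1, 1, 1, 3, 3, 6],
    "rating": [4.0, 3.5, 5.0, 4.0, 3.0, 4.5],
})

# One row per movie: how many ratings it received and their mean.
per_movie = df.groupby("movieId")["rating"].agg(numRatings="count", avgRating="mean")

fig, ax = plt.subplots(figsize=(10, 6))
# A low alpha keeps dense regions readable when many points overlap.
ax.scatter(per_movie["numRatings"], per_movie["avgRating"], alpha=0.5)
ax.set_title("Average rating vs. number of ratings")
ax.set_xlabel("Number of ratings")
ax.set_ylabel("Average rating")
```

In a script, call plt.show() afterwards to display the figure.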

You can see a trend here – as a movie gets more ratings, its average rating tends to increase. This also makes perfect sense if you think about it. If more and more people are watching a particular movie, it probably has a good budget and good marketing, which would mean that it’s a blockbuster of some sort, and those are generally highly rated.
Conclusion
This was a long post, I’m aware of that. However, it covered every essential part of exploratory data analysis. Now you are more familiar with the data itself, which means you are less likely to make some stupid mistake in further analysis.
The next post will be up in a couple of days, and in it, you’ll create your first (probably) recommender system. It will be linked down here, so keep this article close to you (this sounded weird).
Recommender System in Python – Part 2 (Content-Based System)
What are your thoughts? Did you cover something else in your analysis? If so, please share it in the comment section.