
Introduction
Data visualization is one of the fundamental skills in the data scientist's toolkit. Given the right data, the ability to tell compelling stories with it can unlock a goldmine of opportunities for any organisation to drive value for the people it serves – and, let's not forget, to make its employees more efficient too.
In the past, I've shared some tips for effective data visualization; however, in that post I did not use a single dataset to explore all of the ideas. So in this post we are going to get our hands dirty, apply those tips, and dive deep into the MovieLens dataset.
For full access to the code used in this post, visit my GitHub repository.
The Data
As mentioned earlier, we are going to be using the MovieLens dataset. Specifically, we will be using the MovieLens 100K movie ratings dataset, which consists of 100,000 ratings from 943 users on 1,682 movies. The data was collected through the MovieLens website over the seven-month period from September 19th, 1997 to April 22nd, 1998. The data has been cleaned up – users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set.
For us to perform our visualizations effectively, we are concerned with three specific files from the collection:
u.data – the full ratings data set: 100,000 ratings by 943 users on 1,682 items
u.item – information about the items (movies)
u.user – demographic information about the users
In this project, I use popular data science libraries such as Pandas for data manipulation, Matplotlib for data visualization and NumPy for working with arrays. Additionally, I leverage Python's datetime module for general calendar-related functions and IPython for interactive computing.
We begin by importing the frameworks and loading the data using Pandas' read_csv – see the documentation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from IPython.display import IFrame
import warnings
warnings.filterwarnings("ignore")
# read in the three data files
rating_df = pd.read_csv("../data/u.data", sep="\t",
                        names=["user_id", "item_id", "rating", "timestamp"])
item_df = pd.read_csv("../data/u.item", sep="|", encoding="latin-1",
                      names=["movie_id", "movie_title", "release_date", "video_release_date",
                             "imbd_url", "unknown", "action", "adventure", "animation",
                             "childrens", "comedy", "crime", "documentary", "drama", "fantasy",
                             "film_noir", "horror", "musical", "mystery", "romance",
                             "sci-fi", "thriller", "war", "western"])
user_df = pd.read_csv("../data/u.user", sep="|", encoding="latin-1",
                      names=["user_id", "age", "gender", "occupation", "zip_code"])
Taking the three files we were provided – u.data, u.item, and u.user – we have converted them into Pandas DataFrames and stored them in the following variables:
rating_df – the full ratings data set holding all the ratings given by users
item_df – information about the items (movies)
user_df – demographic information about the users
Cross-Checking the Data
A general rule of thumb I stand by is to always check that what I am told I have been given is exactly what has been provided. Pandas makes this easy with the df.info() and df.head() (or df.tail()) methods, which give us more information about the DataFrame and let us preview the data.
To start, I view rating_df, which we expect to contain 100,000 ratings by 943 users on 1,682 items.
# peek at rating_df
print(rating_df.info())
rating_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 100000 non-null int64
1 item_id 100000 non-null int64
2 rating 100000 non-null int64
3 timestamp 100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB
None

We can see that we have 100,000 ratings, but we want to ensure there are 943 users and 1,682 items.
# checking unique users
print(f"# of Unique Users: {rating_df['user_id'].nunique()}")
# checking number of items
print(f"# of items: {rating_df['item_id'].nunique()}")
# of Unique Users: 943
# of items: 1682
Good. We can confirm that rating_df has exactly what it says it will have. However, upon further inspection, I noticed that the timestamp variable is currently stored as an int64. From the README I identified that the timestamp column is in Unix seconds since 1/1/1970 UTC. Hence, we use datetime (from Python's standard library) to convert the dtype of the timestamp column to datetime64.
# convert the unix-seconds timestamp column to datetime
rating_df["timestamp"] = rating_df.timestamp.apply(lambda x: datetime.fromtimestamp(x))
# check if change has been applied
print(rating_df.info())
rating_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 100000 non-null int64
1 item_id 100000 non-null int64
2 rating 100000 non-null int64
3 timestamp 100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 3.1 MB
None

You can now see in the printout that the timestamp column is of data type datetime64[ns].
Now that I am much more comfortable with rating_df, I can move on to exploring item_df, which we expect to give us more information about the movies.
# peek at item_df
print(item_df.info())
item_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_id 1682 non-null int64
1 movie_title 1682 non-null object
2 release_date 1681 non-null object
3 video_release_date 0 non-null float64
4 imbd_url 1679 non-null object
5 unknown 1682 non-null int64
6 action 1682 non-null int64
7 adventure 1682 non-null int64
8 animation 1682 non-null int64
9 childrens 1682 non-null int64
10 comedy 1682 non-null int64
11 crime 1682 non-null int64
12 documentary 1682 non-null int64
13 drama 1682 non-null int64
14 fantasy 1682 non-null int64
15 film_noir 1682 non-null int64
16 horror 1682 non-null int64
17 musical 1682 non-null int64
18 mystery 1682 non-null int64
19 romance 1682 non-null int64
20 sci-fi 1682 non-null int64
21 thriller 1682 non-null int64
22 war 1682 non-null int64
23 western 1682 non-null int64
dtypes: float64(1), int64(20), object(3)
memory usage: 315.5+ KB
None

We already know from rating_df that we have 1,682 unique items, so seeing 1,682 non-null values in the movie_id and movie_title columns was instantly reassuring. However, video_release_date is completely empty: it provides no information about the movies, so we can remove this column.
I noticed that release_date and imbd_url are also missing some values, but not enough that we need to delete those columns – if worst comes to worst, we can manually impute these values by visiting the IMDb website and using the movie title to look up the URL and release date.
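Before deciding, it can help to see exactly which rows are affected – a quick check, assuming item_df as loaded above:
# list the movies we would have to look up manually on IMDb
missing_mask = item_df["release_date"].isna() | item_df["imbd_url"].isna()
item_df.loc[missing_mask, ["movie_id", "movie_title", "release_date", "imbd_url"]]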
Another of my so-called "rituals" is to think about what data types to expect when reading in data. I expected release_date to be of type datetime64, but upon inspection it was of type object, so I followed the necessary processing steps to convert it to a datetime.
# drop empty column
item_df.drop("video_release_date", axis=1, inplace= True)
# convert non-null values to datetime in release_date
item_df["release_date"] = item_df[item_df.release_date.notna()]["release_date"].apply(lambda x: datetime.strptime(x, "%d-%b-%Y"))
# check if change is applied
print(item_df.info(), item_df.shape)
item_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_id 1682 non-null int64
1 movie_title 1682 non-null object
2 release_date 1681 non-null datetime64[ns]
3 imbd_url 1679 non-null object
4 unknown 1682 non-null int64
5 action 1682 non-null int64
6 adventure 1682 non-null int64
7 animation 1682 non-null int64
8 childrens 1682 non-null int64
9 comedy 1682 non-null int64
10 crime 1682 non-null int64
11 documentary 1682 non-null int64
12 drama 1682 non-null int64
13 fantasy 1682 non-null int64
14 film_noir 1682 non-null int64
15 horror 1682 non-null int64
16 musical 1682 non-null int64
17 mystery 1682 non-null int64
18 romance 1682 non-null int64
19 sci-fi 1682 non-null int64
20 thriller 1682 non-null int64
21 war 1682 non-null int64
22 western 1682 non-null int64
dtypes: datetime64[ns](1), int64(20), object(2)
memory usage: 302.4+ KB
None (1682, 23)

After our processing steps we can see that we no longer have the video_release_date column and release_date is now displayed as a datetime64 data type.
Since we were provided with some URLs, I thought it would be cool to take advantage of this and view some of the URLs in imbd_url using IFrame from the IPython library.
Note: The URLs in the imbd_url column may have moved to a new address permanently or may be down at the time of writing. Also, I could not connect to the IMDb page when I manually entered the URL for a movie (i.e. I manually entered the Copycat (1995) URL in IFrame and it refused to connect). I have not found a workaround for this yet, but will update the notebook once I have. In the meantime, I've simply used the IMDb homepage URL to give an idea of how it would work – essentially, we have full access to the webpage from our notebook.
# viewing random imbd_urls
IFrame("https://www.imdb.com", width=800, height=400)
Last but not least, we have user_df. If you remember correctly, this holds the demographic information about the users, so I am expecting 943 rows, since we have already confirmed that there are 943 unique users in the data.
# peek at user data
print(user_df.info())
user_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 943 non-null int64
1 age 943 non-null int64
2 gender 943 non-null object
3 occupation 943 non-null object
4 zip_code 943 non-null object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
None

Great, we can confirm that we have 943 users. In summary, we have 100K ratings from 943 users on 1,682 movies. To make data visualization simpler at various points in the notebook, I decided to combine the DataFrames – I discuss how to do this in more detail in the PyTrix series on combining data.
# store full dataframe
full_df = pd.merge(user_df, rating_df, how="left", on="user_id")
full_df = pd.merge(full_df, item_df, how="left", right_on="movie_id", left_on="item_id")
full_df.head()

Fabulous! We have successfully confirmed that the expected data is exactly what we have. This is sufficient information for us to dive deeper into our data and get a better understanding.
Asking Questions and Answering with Data
Following the protocols of effective data visualization, my next step is to think of some questions that would give me more insight into the data at hand, then identify the best method to visualise the answer to each question – where the best method may be defined as the simplest and clearest way to express the answer.
Note: In this section my thoughts usually jump around as I am querying the data. Consequently, I prefer to use Jupyter Notebooks when exploring data.
What are the top 10 most rated movies?
# return number of rows associated to each title
top_ten_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[:10]
# plot the counts
plt.figure(figsize=(12, 5))
plt.barh(y=top_ten_movies.index, width=top_ten_movies.values)
plt.title("10 Most Rated Movies in the Data", fontsize=16)
plt.ylabel("Movie", fontsize=14)
plt.xlabel("Count", fontsize=14)
plt.show()

In our dataset, Star Wars (1977) was the most rated film. This information is valuable because we may decide to recommend the most rated movies in our dataset to new users to overcome the cold start problem – a quick sketch of such a popularity fallback follows below. We can also go further with this question and begin to think about what sort of genres are associated with the most rated films – in this case, we only looked at the genres associated with Star Wars.
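A minimal sketch of such a popularity fallback, assuming full_df as built above; the MIN_RATINGS cutoff is an arbitrary, illustrative threshold rather than anything from the original analysis:
# rank movies by rating count, breaking ties by mean rating (illustrative only)
MIN_RATINGS = 50  # arbitrary cutoff to avoid one-rating wonders
popularity = (full_df.groupby("movie_title")["rating"]
              .agg(["count", "mean"])
              .query("count >= @MIN_RATINGS")
              .sort_values(["count", "mean"], ascending=False))
popularity.head(10)  # candidate list to show a brand-new user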
genres= ["unknown", "action", "adventure", "animation", "childrens", "comedy", "crime", "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"]
full_df[full_df.movie_title == "Star Wars (1977)"][genres].iloc[0].sort_values(ascending=False)
action 1
sci-fi 1
romance 1
adventure 1
war 1
western 0
documentary 0
animation 0
childrens 0
comedy 0
crime 0
fantasy 0
drama 0
film_noir 0
horror 0
musical 0
mystery 0
thriller 0
unknown 0
Name: 204, dtype: int64
I am not a major Star Wars fan, although I have watched many of the films, but I mention that simply to confirm that genres such as action, sci-fi, adventure, war and romance sound about right for this movie.
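If we wanted to sanity-check the genres of the other most rated films too, a small sketch (assuming the genres list and top_ten_movies defined above) could pull the genre flags for all ten at once:
# genre flags for each of the ten most rated movies
top_ten_genres = (full_df[full_df["movie_title"].isin(top_ten_movies.index)]
                  .drop_duplicates("movie_title")
                  .set_index("movie_title")[genres])
top_ten_genres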
This question completely ignored the least rated movies, but if we were building a recommender system we could not ignore them, as there may be many reasons why a movie has not received many ratings. Let's take a look at some of the least rated movies in the dataset.
# the least rated movies
least_10_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[-10:]
least_10_movies
movie_title
Coldblooded (1995) 1
MURDER and murder (1996) 1
Big Bang Theory, The (1994) 1
Mad Dog Time (1996) 1
Mamma Roma (1962) 1
Man from Down Under, The (1943) 1
Marlene Dietrich: Shadow and Light (1996) 1
Mat' i syn (1997) 1
Mille bolle blu (1993) 1
Á köldum klaka (Cold Fever) (1994) 1
dtype: int64
Big Bang Theory, The (1994) was a surprise appearance on this list for me; other than that, I am unfamiliar with the other movies – I am not much of a movie person anyway, so that doesn't mean much.
What are the Max/Min number of Movies rated by One user?
From the README provided with the dataset we were told that the minimum number of movies rated by a single user was 20, but we don’t know the maximum number of movies rated by a single user.
movies_rated = rating_df.groupby("user_id").size().sort_values(ascending=False)
print(f"Max movies rated by one user: {max(movies_rated)}nMin movies rated by one user: {min(movies_rated)}")
Max movies rated by one user: 737
Min movies rated by one user: 20
rating_df.user_id.value_counts().plot.box(figsize=(12, 5))
plt.title("Number of Movies rated by a Single user", fontsize=16)
plt.show()

The maximum number of movies rated by a single user is 737 – whoever that is must be a very loyal movie watcher and rater – and the median number of movies rated per user is around 70. There are plenty of outliers who have rated more than roughly 320 movies, which is my estimate of the upper whisker from the plot above – we can compute it exactly, as sketched below.
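Rather than eyeballing the whisker, we can compute the boxplot's upper fence (Q3 + 1.5 × IQR) directly – a short sketch, assuming rating_df as above:
# compute the upper fence of the boxplot instead of estimating it by eye
counts = rating_df["user_id"].value_counts()
q1, q3 = counts.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
print(f"Upper fence: {upper_fence:.0f} ratings; "
      f"{(counts > upper_fence).sum()} users fall above it")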
How many movies were released per year?
# create the year column from the movie title, e.g. "Star Wars (1977)" -> "1977"
full_df["year"] = full_df["movie_title"].str.extract(r"\((\d{4})\)", expand=False)
# return number of rows by the year
year_counts = full_df[["movie_title", "year"]].groupby("year").size()
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(year_counts.index, year_counts.values)
ax.xaxis.set_major_locator(plt.MaxNLocator(9)) # changes the number of xticks we see
plt.title("Number of movies per Annum", fontsize=16)
plt.xlabel("Year", fontsize= 14)
plt.ylabel("# of Movies Released", fontsize=14)
plt.show()

It's pretty hard to miss the massive spike and dip between 1988 and 1998. It's worth doing some research and asking a domain expert what could have happened during this period – the exact yearly counts for that window are shown below.
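To pin down exactly where the spike and dip sit, we can print the yearly counts for that window – a quick look, assuming year_counts from the cell above:
# zoom in on the 1988-1998 window
window = year_counts.loc["1988":"1998"]
print(window)
print(f"Peak year in window: {window.idxmax()} ({window.max()} movies)")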
How many Men/Women rated movies?
# count the number of male and female raters
gender_counts = user_df.gender.value_counts()
# plot the counts
plt.figure(figsize=(12, 5))
plt.bar(x= gender_counts.index[0], height=gender_counts.values[0], color="blue")
plt.bar(x= gender_counts.index[1], height=gender_counts.values[1], color="orange")
plt.title("Number of Male and Female Participants", fontsize=16)
plt.xlabel("Gender", fontsize=14)
plt.ylabel("Counts", fontsize=14)
plt.show()

There are clearly a lot more males in this sample than females and this may have a major influence on the genres of movies watched.
What are the most popular Movie Genres among Males and Females?
full_df[genres+["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12,5), color=["orange", "blue"])
plt.xlabel("Counts",fontsize=14)
plt.ylabel("Genre", fontsize=14)
plt.title("Popular Genres Among Genders", fontsize=16)
plt.show()

To my surprise, it turns out males and females appreciate quite similar genres: for both, the most popular genre was drama, followed by comedy. Of course, there are more males in this dataset than females, and we must take this imbalance into account when we think about building our recommendation system – one option, sketched below, is to compare proportions rather than raw counts.
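One way to account for the imbalance is to compare each genre's share of all genre tags within each gender, rather than raw counts. A sketch, assuming full_df and genres as defined above:
# compare genre shares within each gender so the male/female imbalance
# doesn't dominate the picture
genre_by_gender = full_df[genres + ["gender"]].groupby("gender").sum()
genre_share = genre_by_gender.div(genre_by_gender.sum(axis=1), axis=0)
genre_share.T.plot(kind="barh", figsize=(12, 5), color=["orange", "blue"])
plt.xlabel("Share of genre tags", fontsize=14)
plt.ylabel("Genre", fontsize=14)
plt.title("Genre Share by Gender", fontsize=16)
plt.show()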
Something worth knowing is whether interests change when we put a constraint on the ages of the raters.
What are the most popular Movie Genres among Children by gender?
Note: Using UK standards, an adult can be defined as someone who is at least 18 years old, hence a child is anyone under 18.
full_df[full_df["age"] < 18][genres + ["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12, 5), color=["orange", "blue"])
plt.xlabel("Counts",fontsize=14)
plt.ylabel("Genre", fontsize=14)
plt.title("Popular Genres Among Children by Gender", fontsize=16)
plt.show()

Drama is still quite popular with under-18 males, but more of them preferred comedy and action films. On the other hand, the preferences of females under 18 barely changed: it's still drama and comedy.
These figures were interesting, but I was wondering whether the popularity of drama and comedy amongst both genders was due to those types of movies being generally regarded as the best type of movies (hence they get the most views and ratings) or whether it is because those tags are associated with the most movies.
What Genre is associated with the most Movies?
Note: Multiple genres can be associated with a single movie (e.g. a movie can be animation, childrens and comedy at the same time).
# get the genre names in the dataframe and their counts
label= item_df.loc[:, "unknown":].sum().index
label_counts= item_df.loc[:, "unknown":].sum().values
# plot a bar chart
plt.figure(figsize=(12, 5))
plt.barh(y= label, width= label_counts)
plt.title("Genre Popularity", fontsize=16)
plt.ylabel("Genres", fontsize=14)
plt.xlabel("Counts", fontsize=14)
plt.show()

Just as I thought, the drama and comedy tags are associated with the most films in the sample. Maybe filmmakers are aware of our need for a laugh and some drama, and they play on it – this is something we could research. To separate how many movies carry a tag from how often those movies actually get rated, see the sketch below.
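One rough way to separate the two explanations is to look at how many ratings each genre attracts per movie carrying that tag – a sketch, assuming item_df, full_df and genres as above:
# ratings per movie by genre: are drama/comedy rated a lot per film,
# or simply attached to the most films?
ratings_per_movie = {}
for g in genres:
    n_movies = item_df[g].sum()                    # movies carrying the tag
    n_ratings = full_df[full_df[g] == 1].shape[0]  # ratings on those movies
    ratings_per_movie[g] = n_ratings / n_movies if n_movies else 0
pd.Series(ratings_per_movie).sort_values(ascending=False)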
Next we observe the average ratings per genre…
What are the Distribution of Ratings per Genre?
Note: Density plots are used to observe the distribution of a variable in a dataset.
# https://github.com/HarilalOP/movielens-data-exploration/blob/master/src/main/code/exploratory_analysis.ipynb
df_temp = full_df[['movie_id','rating']].groupby('movie_id').mean()
# Histogram of all ratings
df_temp.hist(bins=25, grid=False, edgecolor='b', density=True, label ='Overall', figsize=(15,8))
# KDE plot per genre
for genre in genres:
    df_temp = full_df[full_df[genre]==True][['movie_id','rating']].groupby('movie_id').mean()
    df_temp.rating.plot(grid=True, alpha=0.9, kind='kde', label=genre)
plt.legend()
plt.xlim(0,5)
plt.xlabel('Rating')
plt.title('Rating Density plot')
plt.show()

The distribution is left-skewed for most genres – this could possibly be down to users being more willing to rate movies they enjoyed, since people rarely finish a movie they aren't enjoying. We would have to conduct some research to confirm whether this is the case here; we can also check the skew numerically, as below.
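We can back the visual impression up with a number: the skewness of the per-movie mean rating within each genre (negative values indicate a left skew). A sketch, assuming full_df and genres as above:
# skewness of per-movie mean ratings by genre; negative = left-skewed
skews = {}
for genre in genres:
    means = (full_df[full_df[genre] == 1][["movie_id", "rating"]]
             .groupby("movie_id")["rating"].mean())
    skews[genre] = means.skew()
pd.Series(skews).sort_values()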
Ok, the last plot was more complicated. We can simplify things again by looking more specifically at the users.
What’s the Age Distribution by Gender?
# creating new variable for ages of all males and females
female_age_dist = user_df[user_df["gender"] == "F"]["age"]
male_age_dist = user_df[user_df["gender"] == "M"]["age"]
# plotting boxplots
plt.figure(figsize=(12,5))
plt.boxplot([female_age_dist, male_age_dist])
plt.xticks([1, 2], ["Female", "Male"], fontsize=14)
plt.title("Age Distribution by Gender", fontsize=16)
plt.show()

The male age distribution has some outliers, and the female median age is slightly higher than the male median. Additionally, the female box is longer than the male box, meaning female ages are more dispersed than male ages – the quartiles below back this up.
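To put numbers on that, we can compare the quartiles and interquartile range by gender – a short check, assuming user_df as above:
# quartiles and IQR of age by gender
age_summary = user_df.groupby("gender")["age"].describe()[["25%", "50%", "75%"]]
age_summary["IQR"] = age_summary["75%"] - age_summary["25%"]
age_summary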
What’s the most common occupation amongst the users?
# creating the index and values variables for occupation
occ_label= user_df.occupation.value_counts().index
occ_label_counts = user_df.occupation.value_counts().values
# plot horizontal bar chart
plt.figure(figsize=(12,5))
plt.barh(y=occ_label, width=occ_label_counts)
plt.title("Most common User Occupations", fontsize=16)
plt.show()

To nobody's surprise, the most common occupation among users in the dataset is student. Let's see the average ratings given by each occupation.
What is the average rating of a given occupation?
# creating an empty df to store data
df_temp = pd.DataFrame(columns=["occupation", "avg_rating"])
# loop through all the occupations
for idx, occ in enumerate(occ_label):
    df_temp.loc[idx, "occupation"] = occ
    df_temp.loc[idx, "avg_rating"] = round(full_df[full_df["occupation"] == occ]["rating"].mean(), 2)
# sort from highest to lowest
df_temp = df_temp.sort_values("avg_rating", ascending=False).reset_index(drop=True)
df_temp
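For what it's worth, the same table can be produced with a single groupby rather than an explicit loop – an equivalent sketch, assuming full_df as above:
# mean rating per occupation, highest first
full_df.groupby("occupation")["rating"].mean().round(2).sort_values(ascending=False)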

Wrap Up
Stopping at this point was difficult because there are so many more insights we can extract from this data. I personally believe that data visualization never really has an end, so it is down to the person doing the visualizations to decide when to stop. A good indicator may be when we believe we have sufficient understanding of the data to build an effective baseline model (if we don't have one already). Once the model is built, we can always come back and iterate on our visualizations to get more insight from our data based on the model's predictions.
This article was made using jupyter_to_medium.
Let’s continue the conversation on LinkedIn…