The world’s leading publication for data science, AI, and ML professionals.

Comprehensive Data Explorations with Matplotlib

Deep Diving into The Movie Lens Dataset

Photo by Franki Chamaki on Unsplash

Introduction

Data Visualization is one of the fundamental skills in the Data Scientist toolkit. Given the right data, the ability to tell compelling stories with it can unlock a goldmine of opportunities for any organisation to drive value for whomever it serves – and let's not forget making employees more efficient too.

In the past, I've written some tips for effective data visualization; however, in that post I did not use a single dataset to explore all of the ideas that were shared. So, with this post, we are going to get our hands dirty and do some visualizations using the tips I shared, diving deep into the MovieLens dataset.

Effective Data Visualization

For full access to the code used in this post, visit my Github repository.

kurtispykes/recommender_system


The Data

As mentioned earlier, we are going to be using the MovieLens dataset. Specifically, we will be using the MovieLens 100K movie ratings dataset, which consists of 100,000 ratings from 1000 users on 1700 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up – users who had fewer than 20 ratings or did not have complete demographic information were removed from the dataset.

MovieLens 100K Dataset

For us to effectively perform our visualizations, we are concerned with 3 specific files from the collection:

  • u.data – Consists of the full dataset: 100,000 ratings by 943 users on 1682 items.
  • u.item – Information about the items (movies)
  • u.user – Demographic information about the users

In this project, I use popular Data Science libraries such as Pandas for data manipulation, Matplotlib for data visualization and NumPy for working with arrays. Additionally, I leverage Python’s datetime module for general calendar related functions and IPython for interactive computing.

We begin by simply importing the frameworks and loading the data using Pandas read_csv – See Documentation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from datetime import datetime
from IPython.display import IFrame

import warnings 
warnings.filterwarnings("ignore")
# read data
rating_df = pd.read_csv("../data/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])

item_df = pd.read_csv("../data/u.item", sep="|",encoding="latin-1", 
                      names=["movie_id", "movie_title", "release_date", "video_release_date",
                             "imbd_url", "unknown", "action", "adventure", "animation",
                             "childrens", "comedy", "crime", "documentary", "drama", "fantasy", 
                             "film_noir", "horror", "musical", "mystery", "romance", 
                             "sci-fi", "thriller", "war", "western"])

user_df = pd.read_csv("../data/u.user", sep="|", encoding="latin-1", names=["user_id", "age", "gender",
                                                                            "occupation", "zip_code"])

Taking the 3 files we were provided – u.data, u.item, and u.user – we have converted them into Pandas DataFrames and stored them in the following variables:

  • rating_df – The full u.data set holding all the ratings given by users
  • item_df – Information about the items (movies)
  • user_df – Demographic information about the users

Cross-Checking the Data

A general rule of thumb I stand by is to always check that what I am told I have been given is exactly what has been provided. Pandas makes this easy with the df.info() and df.head() (or df.tail()) methods, which give us more information about a DataFrame and let us preview the data.

To start, I begin by viewing rating_df, in which we are expecting 100,000 ratings by 943 users on 1682 items.

# peek at rating_df
print(rating_df.info())
rating_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    100000 non-null  int64
 1   item_id    100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB
None

We can see we have 100000 ratings, but we want to ensure there are 943 users and 1682 items.

# checking unique users
print(f"# of Unique Users: {rating_df['user_id'].nunique()}")
# checking number of items
print(f"# of items: {rating_df['item_id'].nunique()}")
# of Unique Users: 943
# of items: 1682

Good. We can confirm that rating_df has exactly what it says it should have. However, upon further inspection, I noticed that the timestamp variable is currently shown as an int64 data type. From the README I identified that the timestamp column of this dataframe is in unix seconds since 1/1/1970 UTC. Hence, we use datetime (a Python built-in) to convert the Dtype of the timestamp column to datetime64.

# convert the timestamp column from unix seconds to datetime 
rating_df["timestamp"] = rating_df.timestamp.apply(datetime.fromtimestamp)

# check if change has been applied 
print(rating_df.info())
rating_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   user_id    100000 non-null  int64         
 1   item_id    100000 non-null  int64         
 2   rating     100000 non-null  int64         
 3   timestamp  100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 3.1 MB
None

You can now see in the terminal printout that the timestamp column has the data type datetime64[ns].
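As an aside, the same conversion can be done in a single vectorised call with pd.to_datetime and unit="s", which is typically faster than .apply over 100K rows. A minimal sketch on a few sample unix-second timestamps (the values here are illustrative, not taken from u.data):

```python
import pandas as pd

# a few unix-second timestamps of the kind found in u.data
ts = pd.Series([881250949, 891717742, 878887116])

# vectorised conversion -- one call instead of .apply with datetime.fromtimestamp
converted = pd.to_datetime(ts, unit="s")
print(converted.dtype)  # datetime64[ns]
```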

Now that I am much more comfortable with rating_df, I can move on to exploring item_df, which we expect to give us more information about the movies.

# peek at item_df 
print(item_df.info())
item_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   movie_title         1682 non-null   object 
 2   release_date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imbd_url            1679 non-null   object 
 5   unknown             1682 non-null   int64  
 6   action              1682 non-null   int64  
 7   adventure           1682 non-null   int64  
 8   animation           1682 non-null   int64  
 9   childrens           1682 non-null   int64  
 10  comedy              1682 non-null   int64  
 11  crime               1682 non-null   int64  
 12  documentary         1682 non-null   int64  
 13  drama               1682 non-null   int64  
 14  fantasy             1682 non-null   int64  
 15  film_noir           1682 non-null   int64  
 16  horror              1682 non-null   int64  
 17  musical             1682 non-null   int64  
 18  mystery             1682 non-null   int64  
 19  romance             1682 non-null   int64  
 20  sci-fi              1682 non-null   int64  
 21  thriller            1682 non-null   int64  
 22  war                 1682 non-null   int64  
 23  western             1682 non-null   int64  
dtypes: float64(1), int64(20), object(3)
memory usage: 315.5+ KB
None

We already know from rating_df that we have 1682 unique items in our data, so seeing 1682 non-null values in the movie_id and movie_title columns was an instant relief. However, video_release_date is completely empty, meaning it provides no information about the movies, so we can remove this column.

I noticed that release_date and imbd_url are also missing some values, though not enough that we need to delete the columns – if worst comes to worst, we can manually impute these values by visiting the IMDb website and using the movie title to find the missing url and release date.
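To see exactly which rows would need that manual treatment, we can filter on the missing values. A small sketch, using a hypothetical three-row stand-in for item_df (the real frame is loaded from u.item):

```python
import pandas as pd
import numpy as np

# a tiny stand-in for item_df -- column names match the real frame
item_df = pd.DataFrame({
    "movie_title": ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"],
    "release_date": ["01-Jan-1995", np.nan, "01-Jan-1995"],
    "imbd_url": ["url_1", "url_2", np.nan],
})

# rows with a missing release_date or imbd_url are the ones to impute by hand
to_impute = item_df[item_df[["release_date", "imbd_url"]].isna().any(axis=1)]
print(to_impute["movie_title"].tolist())  # ['GoldenEye (1995)', 'Four Rooms (1995)']
```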

Another one of my so-called "rituals" is to think about what data types to expect when reading in a dataset. I expected release_date to be of the datetime64 data type, but upon inspection it was of type object, so I followed the necessary processing steps to convert it to a datetime.

# drop empty column 
item_df.drop("video_release_date", axis=1, inplace= True)

# convert non-null values to datetime in release_date
item_df["release_date"] = item_df[item_df.release_date.notna()]["release_date"].apply(lambda x: datetime.strptime(x, "%d-%b-%Y"))

# check if change is applied
print(item_df.info(), item_df.shape)
item_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   movie_id      1682 non-null   int64         
 1   movie_title   1682 non-null   object        
 2   release_date  1681 non-null   datetime64[ns]
 3   imbd_url      1679 non-null   object        
 4   unknown       1682 non-null   int64         
 5   action        1682 non-null   int64         
 6   adventure     1682 non-null   int64         
 7   animation     1682 non-null   int64         
 8   childrens     1682 non-null   int64         
 9   comedy        1682 non-null   int64         
 10  crime         1682 non-null   int64         
 11  documentary   1682 non-null   int64         
 12  drama         1682 non-null   int64         
 13  fantasy       1682 non-null   int64         
 14  film_noir     1682 non-null   int64         
 15  horror        1682 non-null   int64         
 16  musical       1682 non-null   int64         
 17  mystery       1682 non-null   int64         
 18  romance       1682 non-null   int64         
 19  sci-fi        1682 non-null   int64         
 20  thriller      1682 non-null   int64         
 21  war           1682 non-null   int64         
 22  western       1682 non-null   int64         
dtypes: datetime64[ns](1), int64(20), object(2)
memory usage: 302.4+ KB
None (1682, 23)

After our processing steps we can see that we no longer have the video_release_date column and release_date is now displayed as a datetime64 data type.

Since we were provided with some urls, I thought it may be cool to take advantage of this and view some of the pages from imbd_url using IFrame from the IPython library.

Note: The urls in the imbd_url column may have moved to a new address permanently or may be down at the time of implementing. Also, I could not connect to the IMDb webpage when I manually entered the url for a movie (i.e. I manually entered the Copycat (1995) url in IFrame and it returned that it refused to connect). I have not found a workaround for this yet, but will update the notebook once I have. In the meantime, I've simply used the IMDb homepage url to give an idea of how it would work – essentially, we have full access to the webpage from our notebook.

# viewing random imbd_urls
IFrame("https://www.imdb.com", width=800, height=400)


Last but not least we have user_df. If you remember correctly, this holds the demographic information about the users, hence I am expecting there to be 943 rows (especially in the user_id column), since we have already confirmed that there are 943 unique users in the data.

# peek at user data
print(user_df.info())
user_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
None

Great, we can confirm that we have 943 users. In summary, we have 100K ratings from 943 users on 1682 movies. To make Data Visualization simple at various points in the notebook, I decided to combine the DataFrames – I discuss how to do this in more detail in the PyTrix Series post on combining data.

Pandas: Combining Data

# store full dataframe 
full_df = pd.merge(user_df, rating_df, how="left", on="user_id")
full_df = pd.merge(full_df, item_df, how="left", right_on="movie_id", left_on="item_id")
full_df.head()
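A left merge should not silently drop or duplicate ratings, so it is worth asserting the row count afterwards. A hedged sketch on toy stand-ins for user_df and rating_df – pandas' indicator=True flags any rows that failed to match:

```python
import pandas as pd

# toy stand-ins for user_df and rating_df
users = pd.DataFrame({"user_id": [1, 2], "age": [24, 53]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 11, 10], "rating": [4, 5, 3]})

merged = pd.merge(users, ratings, how="left", on="user_id", indicator=True)

# each user row fans out to one row per rating, so the merged length
# should equal the number of ratings when every user has rated something
assert len(merged) == len(ratings)
assert (merged["_merge"] == "both").all()  # no users left unmatched
```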

Fabulous! We have successfully confirmed that the expected data is exactly what we have. This is sufficient information for us to dive deeper into our data and get a better understanding.


Asking Questions and Answering with Data

Following the protocols of Effective Data Visualization, my next step is to think of some questions that would give me more insight into the data at hand, then identify the best method to visualise the answer to each question – the best method being the simplest and clearest way to express the answer.

Note: In this section my thoughts usually jump around as I am querying the data. Consequently, I prefer to use Jupyter Notebooks when exploring data.

What are the top 10 most rated movies?

# return number of rows associated to each title
top_ten_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[:10]

# plot the counts
plt.figure(figsize=(12, 5))
plt.barh(y= top_ten_movies.index,
         width= top_ten_movies.values)
plt.title("10 Most Rated Movies in the Data", fontsize=16)
plt.ylabel("Movie", fontsize=14)
plt.xlabel("Count", fontsize=14)
plt.show()

In our dataset, Star Wars (1977) was the most rated film. This information is valuable, as we may decide to recommend the most rated movies in our dataset to new users to overcome the cold start problem. We can go further with this question and begin to think about what sort of genres are associated with the most rated films – in this case, we only look at the genres associated with Star Wars.

genres= ["unknown", "action", "adventure", "animation", "childrens", "comedy", "crime", "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"]

full_df[full_df.movie_title == "Star Wars (1977)"][genres].iloc[0].sort_values(ascending=False)
action         1
sci-fi         1
romance        1
adventure      1
war            1
western        0
documentary    0
animation      0
childrens      0
comedy         0
crime          0
fantasy        0
drama          0
film_noir      0
horror         0
musical        0
mystery        0
thriller       0
unknown        0
Name: 204, dtype: int64

I am not a major Star Wars fan, although I have watched many of the films, but I mention that simply to confirm that genres such as action, sci-fi, adventure, war and romance sound about right for this movie.

This question completely ignored the least rated movies, but if we were building a recommender system we could not ignore them, as there may be many reasons why a movie has not received many ratings. Let's take a look at some of the least rated movies in the dataset.

# the least rated movies 
least_10_movies = full_df.groupby("movie_title").size().sort_values(ascending=False)[-10:]
least_10_movies
movie_title
Coldblooded (1995)                            1
MURDER and murder (1996)                      1
Big Bang Theory, The (1994)                   1
Mad Dog Time (1996)                           1
Mamma Roma (1962)                             1
Man from Down Under, The (1943)               1
Marlene Dietrich: Shadow and Light (1996)     1
Mat' i syn (1997)                             1
Mille bolle blu (1993)                        1
Á köldum klaka (Cold Fever) (1994)            1
dtype: int64

Big Bang Theory, The (1994) was a surprise occurrence on this list for me; other than that, I am unfamiliar with the other movies – I am not much of a movie person anyway, so that doesn't mean much.

What are the Max/Min number of Movies rated by One user?

From the README provided with the dataset we were told that the minimum number of movies rated by a single user was 20, but we don’t know the maximum number of movies rated by a single user.

movies_rated = rating_df.groupby("user_id").size().sort_values(ascending=False)
print(f"Max movies rated by one user: {max(movies_rated)}\nMin movies rated by one user: {min(movies_rated)}")
Max movies rated by one user: 737
Min movies rated by one user: 20
rating_df.user_id.value_counts().plot.box(figsize=(12, 5))
plt.title("Number of Movies rated by a Single user", fontsize=16)
plt.show()

The maximum number of movies rated by a single user in the dataset is 737 – whoever that is, is a very loyal movie watcher and rater – and the median number of movies rated by a user is 70. There are plenty of outliers that have rated more than roughly 320 movies, which is what I approximate the upper whisker to be from the plot above.
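Rather than eyeballing that cut-off, the upper whisker matplotlib draws (Q3 + 1.5 × IQR, clipped to the largest data point below it) can be computed directly. A sketch on a hypothetical counts series – on the real rating_df.user_id.value_counts() the same computation would give the exact threshold behind the plot:

```python
import pandas as pd

# hypothetical per-user rating counts standing in for
# rating_df.user_id.value_counts()
movies_rated = pd.Series([20, 33, 51, 70, 95, 140, 210, 320, 500, 737])

q1, q3 = movies_rated.quantile([0.25, 0.75])
upper_whisker = q3 + 1.5 * (q3 - q1)

# anything beyond the whisker is drawn as an outlier point
outliers = movies_rated[movies_rated > upper_whisker]
print(upper_whisker, outliers.tolist())
```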

How many movies were released per year?

# create the year column from Movie title 
full_df["year"] = full_df["movie_title"].str.extract(r"\((\d{4})\)", expand=True)

# return number of rows by the year 
year_counts = full_df[["movie_title", "year"]].groupby("year").size()

fig, ax = plt.subplots(figsize=(12, 5)) 
ax.plot(year_counts.index, year_counts.values)
ax.xaxis.set_major_locator(plt.MaxNLocator(9)) # changes the number of xticks we see
plt.title("Number of movies per Annum", fontsize=16)
plt.xlabel("Year", fontsize= 14)
plt.ylabel("# of Movies Released", fontsize=14)
plt.show()

It’s pretty hard to miss the massive spike and dip between 1988–1998. It’s worth doing some research and asking a domain expert what could have happened during this period.

How many Men/Women rated movies?

# count the number of male and female raters
gender_counts = user_df.gender.value_counts()

# plot the counts 
plt.figure(figsize=(12, 5))
plt.bar(x= gender_counts.index[0], height=gender_counts.values[0], color="blue")
plt.bar(x= gender_counts.index[1], height=gender_counts.values[1], color="orange")
plt.title("Number of Male and Female Participants", fontsize=16)
plt.xlabel("Gender", fontsize=14)
plt.ylabel("Counts", fontsize=14)
plt.show()

There are clearly a lot more males in this sample than females and this may have a major influence on the genres of movies watched.

What are the most popular Movie Genres among Males and Females?

full_df[genres+["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12,5), color=["orange", "blue"])
plt.xlabel("Counts",fontsize=14)
plt.ylabel("Genre", fontsize=14)
plt.title("Popular Genres Among Genders", fontsize=16)
plt.show()

To my surprise, it turns out males and females appreciate similar genres. Both genders' most popular genre was drama, followed by comedy. Of course, there are more males in this dataset than females, and we must take this into account when we think about building our recommendation system.

Something worth knowing is whether there is a change in interests when we put a constraint on the ages of the raters.

What are the most popular Movie Genres among Children by gender?

Note: Using UK standards, an adult can be defined as someone that is >= 18 years old, hence a Child would be < 18.

full_df[full_df["age"] < 18][genres + ["gender"]].groupby("gender").sum().T.plot(kind="barh", figsize=(12, 5), color=["orange", "blue"])
plt.xlabel("Counts",fontsize=14)
plt.ylabel("Genre", fontsize=14)
plt.title("Popular Genres Among Children by Gender", fontsize=16)
plt.show()

Drama is still quite popular among males under 18, but more of them preferred comedy and action films. On the other hand, the preferences of females under 18 barely changed: it's still drama and comedy.

These figures were interesting, but I wondered whether the popularity of drama and comedy amongst both genders was because those types of movies are generally regarded as the best (hence getting the most views and ratings) or because those tags are associated with the most movies.

What Genre is associated with the most Movies?

Note: Multiple genres can be associated with a single movie (i.e. a movie can be animation, childrens and comedy at the same time)

# get the genre names in the dataframe and their counts
label= item_df.loc[:, "unknown":].sum().index
label_counts= item_df.loc[:, "unknown":].sum().values
# plot a bar chart
plt.figure(figsize=(12, 5))
plt.barh(y= label, width= label_counts)
plt.title("Genre Popularity", fontsize=16)
plt.ylabel("Genres", fontsize=14)
plt.xlabel("Counts", fontsize=14)
plt.show()

Just as I thought, the drama and comedy tags are associated with the most films in the sample. Maybe filmmakers are aware of our need for a laugh and some drama, hence they play on it – this is something we could research.

Next we observe the average ratings per genre…

What are the Distribution of Ratings per Genre?

Note: Density plots are used to observe the distribution of a variable in a dataset.

# https://github.com/HarilalOP/movielens-data-exploration/blob/master/src/main/code/exploratory_analysis.ipynb
df_temp = full_df[['movie_id','rating']].groupby('movie_id').mean()
# Histogram of all ratings
df_temp.hist(bins=25, grid=False, edgecolor='b', density=True, label ='Overall', figsize=(15,8))
# KDE plot per genre
for genre in genres:
    df_temp = full_df[full_df[genre]==True][['movie_id','rating']].groupby('movie_id').mean()
    df_temp.rating.plot(grid=True, alpha=0.9, kind='kde', label=genre)
plt.legend()
plt.xlim(0,5)
plt.xlabel('Rating')
plt.title('Rating Density plot')
plt.show()

The plot is predominantly left-skewed for most genres – this could possibly be down to users being more willing to rate movies they enjoyed, since people do not usually finish watching a movie they aren't enjoying. We would have to conduct some research to determine whether this is the case in our instance.
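The left skew can also be quantified rather than judged by eye: pandas' .skew() returns a negative value when the tail stretches to the left. A minimal sketch on a hypothetical handful of per-movie mean ratings, clustered high with a few low stragglers:

```python
import pandas as pd

# hypothetical per-movie mean ratings standing in for df_temp.rating
mean_ratings = pd.Series([4.5, 4.2, 4.4, 3.9, 4.1, 2.0, 1.5])

# sample skewness: negative => left-skewed (long tail of low ratings)
print(mean_ratings.skew())
```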

Ok, the last plot was more complicated. We can simplify things again by looking more specifically at the users.

What’s the Age Distribution by Gender?

# creating new variable for ages of all males and females
female_age_dist = user_df[user_df["gender"] == "F"]["age"]
male_age_dist = user_df[user_df["gender"] == "M"]["age"]

# plotting boxplots 
plt.figure(figsize=(12,5))
plt.boxplot([female_age_dist, male_age_dist])
plt.xticks([1, 2], ["Female", "Male"], fontsize=14)
plt.title("Age Distribution by Gender", fontsize=16)

plt.show()

The male age distribution has some outliers, and the female median age is slightly higher than males. Additionally, the female age distribution box is longer than the male box meaning it’s more dispersed than the male ages.
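The same comparison can be read off numerically with a grouped describe. A sketch on a hypothetical handful of users – on the real user_df this would give the exact medians and quartiles behind the boxplot:

```python
import pandas as pd

# hypothetical stand-in for user_df[["gender", "age"]]
user_df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M"],
    "age":    [25, 33, 48, 19, 24, 30, 65],
})

# quartiles and median per gender -- the numbers the boxplot is drawn from
summary = user_df.groupby("gender")["age"].describe()[["25%", "50%", "75%"]]
print(summary)
```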

What’s the most common occupation amongst the users?

# creating the index and values variables for occupation
occ_label= user_df.occupation.value_counts().index
occ_label_counts = user_df.occupation.value_counts().values

# plot horizontal bar chart
plt.figure(figsize=(12,5))
plt.barh(y=occ_label, width=occ_label_counts)
plt.title("Most common User Occupations", fontsize=16)
plt.show()

To nobody's surprise, the majority of people in the dataset are students. Let's see the average rating given by each occupation.

What is the average rating of a given occupation?

# creating an empty df to store data
df_temp = pd.DataFrame(columns=["occupation", "avg_rating"])

# loop through all the occupations 
for idx, occ in enumerate(occ_label):
    df_temp.loc[idx, "occupation"] = occ 
    df_temp.loc[idx, "avg_rating"] = round(full_df[full_df["occupation"] == occ]["rating"].mean(), 2)

# sort from highest to lowest
df_temp = df_temp.sort_values("avg_rating", ascending=False).reset_index(drop=True)
df_temp
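For what it's worth, the explicit loop above can be collapsed into a single groupby. A sketch on a hypothetical two-column slice of full_df:

```python
import pandas as pd

# hypothetical stand-in for full_df[["occupation", "rating"]]
full_df = pd.DataFrame({
    "occupation": ["student", "student", "engineer", "engineer", "artist"],
    "rating": [4, 5, 3, 4, 2],
})

# one vectorised groupby replaces the loop over occupations
avg_rating = (full_df.groupby("occupation")["rating"]
                     .mean()
                     .round(2)
                     .sort_values(ascending=False))
print(avg_rating.to_dict())  # {'student': 4.5, 'engineer': 3.5, 'artist': 2.0}
```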

Wrap Up

Stopping at this point was difficult because there are so many more insights we can extract from this data. I personally believe that Data Visualization never really has an end, so it is down to the person doing the visualizations to decide when to stop. A good indicator may be when we believe we have sufficient understanding of the data to build an effective baseline model (if we don't have one already). After building the model, we can always come back and iterate on our visualizations to get more insight from our data based on our model's predictions.

This article was made using jupyter_to_medium.

Publishing to Medium From Jupyter Notebooks

Let’s continue the conversation on LinkedIn…

Kurtis Pykes – AI Writer – Towards Data Science | LinkedIn


Related Articles
