Machine Learning Recommender Engine with AWS SageMaker

A preference-based clustering technique for item recommendation using AWS SageMaker and its built-in k-Means algorithm.

Marc Saint-Félix
Towards Data Science

--

Objective

The objective of this project is to build a recommender engine with AWS SageMaker: recommend a top 20 movie selection to users who are clustered by preference similarity.


Context

For the “A Cloud Guru — Machine Learning Challenge” proposed by Kesha Williams, ML enthusiasts were tasked with implementing their own solution to a movie recommendation problem.

Environment and tools

We chose to work with the MovieLens dataset, downloaded from www.grouplens.org.

More specifically, we used movies.csv and ratings.csv from the ml-latest-small.zip file, which was uploaded to our AWS S3 bucket. This project was developed in a Jupyter notebook with the conda_python3 kernel in an AWS SageMaker environment. We used popular ML libraries such as Pandas and Scikit-Learn to process the data.

Our solution

Library imports

Scikit-Learn has its own k-Means algorithm, but we chose the SageMaker built-in k-Means algorithm for this project. We also chose Pandas to manipulate dataframes, with Matplotlib and Seaborn as visualization tools. Finally, we imported boto3, the AWS SDK for Python.

import io
from datetime import datetime

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

import boto3                             # AWS SDK for Python
import sagemaker
import sagemaker.amazon.common as smac   # recordIO-protobuf helpers
from sagemaker import get_execution_role
from sagemaker import KMeans             # SageMaker built-in k-Means estimator

Loading data

Let’s get the appropriate credentials for SageMaker to access our S3 bucket.

role = get_execution_role()
bucket='guruchallenge-bucket'
sub_folder = 'movielens_dataset'

Let’s load the CSV files into Pandas dataframes to start the preprocessing steps.

data_key = 'movies.csv'
data_location = 's3://{}/{}/{}'.format(bucket, sub_folder, data_key)
movies = pd.read_csv(data_location, low_memory=False, delimiter=',', encoding='utf-8')
data_key = 'ratings.csv'
data_location = 's3://{}/{}/{}'.format(bucket, sub_folder, data_key)
ratings = pd.read_csv(data_location, low_memory=False, delimiter=',', encoding='utf-8')

We observe that the dataset contains 100,836 ratings of 9,742 movies by 610 users.
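
As a quick sanity check, these counts can be read directly off the loaded dataframes:

# Quick check of the dataset dimensions:
print(f"{len(ratings)} ratings of {movies['movieId'].nunique()} movies "
      f"by {ratings['userId'].nunique()} users")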

Feature engineering

Let’s remove unnecessary columns, merge the two dataframes into one, and one-hot encode all genre categories, as sketched below.
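
Here is a minimal sketch of that step (the intermediate names df and genre_dummies are ours; genres come pipe-separated in movies.csv):

# Drop the unused timestamp column and merge ratings with movie genres.
df = ratings.drop(columns=['timestamp']).merge(movies[['movieId', 'genres']], on='movieId')
# Genres are pipe-separated in MovieLens (e.g. 'Adventure|Comedy'):
genre_dummies = df['genres'].str.get_dummies(sep='|')
df = pd.concat([df.drop(columns=['genres']), genre_dummies], axis=1)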

Matplotlib allows us to visualize the data and identify 4 outliers. Combined, these 4 users submitted a total of 9,148 ratings, nearly 10% of all ratings from just 4 users out of 610. Since this would significantly skew our clustering, and since the MinMaxScaling we apply later is sensitive to outliers, let’s remove them.
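
One way to drop them, selecting the four heaviest raters by rating count:

# Remove the 4 heaviest raters identified above.
rating_counts = df.groupby('userId')['rating'].count()
outliers = rating_counts.nlargest(4).index      # the 4 outlier userIds
df = df[~df['userId'].isin(outliers)]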

The one-hot encoding of genres generates a set of vectors for all movies seen by a given user. For each user, let’s multiply each vector by its matching rating. Summing all columns for a given user then yields a “scoring vector” that represents their tastes, based on their viewing history. Finally, dividing each scoring vector by the user’s matching UserRatingCount gives the mean score per genre for each user.
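
A sketch of these scoring vectors, reusing the names from the snippet above:

# Weight each one-hot genre vector by its rating, then aggregate per user.
genre_cols = genre_dummies.columns
weighted = df[genre_cols].mul(df['rating'], axis=0)
weighted['userId'] = df['userId']
scores = weighted.groupby('userId').sum()                      # scoring vectors
user_rating_count = df.groupby('userId')['rating'].count()     # UserRatingCount
df_scores = scores.div(user_rating_count, axis=0)              # mean score per genre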

Let’s visualize a portion of the heatmap of our dataframe in its current state:
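
A call along these lines renders it (figure size and colormap are arbitrary choices):

# Plot a slice of the users x genres scoring matrix:
plt.figure(figsize=(14, 6))
sns.heatmap(df_scores.head(40), cmap='YlGnBu')
plt.show()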

The most-viewed movies are tagged drama or comedy, usually the most popular genres among users. Even for users who rate these genres lower, these features inevitably gain more weight in the dataset and affect clustering. Conversely, less frequent genres like documentary or film-noir tend to disappear from the heatmap, making clustering nearly impossible for those genres.

To address this concern, we need a more balanced clustering approach. Let’s perform feature scaling for better user segmentation: MinMaxScaler scales each feature to the [0, 1] range, which emphasizes users who have a stronger preference than others for any given genre.
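
A minimal sketch, producing the df_scaled dataframe used by the training step below:

# Scale each genre column to [0, 1], keeping userIds as the index.
scaler = MinMaxScaler()
df_scaled = DataFrame(scaler.fit_transform(df_scores),
                      columns=df_scores.columns, index=df_scores.index)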

Users with common interests will be clustered together and will receive mainstream recommendations. Users with more specific preferences will be clustered accordingly and will receive appropriate movie suggestions based on what similar users (from the same cluster) rated highly.

Let’s have a look at the newly generated heatmap, which now shows a wider representation of genres.

Training and deployment

The built-in k-Means algorithm expects float32 input. Our dataframe is converted to float32 and used as the training dataset. The model artifacts are stored on S3, and training is performed on a single ml.c4.xlarge instance.

For now, let’s define k = 10 clusters for this training.

data_train = df_scaled.astype('float32')   # built-in k-Means expects float32
num_clusters = 10
output_location = 's3://' + bucket + '/model-artifacts'

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                output_path=output_location,
                k=num_clusters)

Now let’s call fit on the k-Means estimator to train our model. We first build a unique job name from a timestamp (hence the datetime import; the prefix is arbitrary). This process usually takes 3 to 5 minutes:

job_name = 'kmeans-movielens-' + datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

%%time
kmeans.fit(kmeans.record_set(data_train.values), job_name=job_name)

Now let’s deploy our model to an endpoint. This step usually takes up to 10 minutes.

kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

Now let’s make inferences for the entire dataset: every user gets a label from their closest cluster in just 297 ms.

%%time 
result = kmeans_predictor.predict(data_train.values[0:len(data_train)])
cluster_labels = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
CPU times: user 51.3 ms, sys: 82 µs, total: 51.4 ms
Wall time: 297 ms

Now let’s delete the endpoint to avoid incurring additional costs.

sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint)

Let’s recommend items!

We are going to generate the top 20 movie recommendations for a given cluster of users; let’s try cluster #8. First of all, let’s identify all user IDs within our cluster:
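
One way to do it, assuming df_scaled’s index still carries the userIds:

labels = np.array(cluster_labels)             # one label per row of data_train
users_cluster = df_scaled.index[labels == 8]  # userIds assigned to cluster #8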

Let’s define ‘clust_movieRatings’ as the dataframe that contains all ratings for all movies viewed by users within this cluster; our movie recommendations will be drawn from this data.

clust_movieRatings = ratings.loc[ratings['userId'].isin(users_cluster)]

Now let’s visualize the cluster’s heatmap, which better reflects the specific preferences of its users:
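
Continuing the sketch, we can slice df_scaled by those userIds and plot:

# Scaled genre scores for the users of cluster #8:
sns.heatmap(df_scaled.loc[users_cluster], cmap='YlGnBu')
plt.show()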

As we can see, users with a stronger taste than others for “film-noir” and “mystery” have been clustered together. The homogeneous colors in each column also confirm that these users share similar preferences compared to the rest of the dataset.

Now let’s identify the top 20 movies as rated by these particular users, in order to recommend them to any other user with similar tastes. Sorting items by their average rating alone would not produce a proper ranking, since the number of times each item was rated matters too. To address this, IMDb uses the following Bayesian weighted-rating formula:

Weighted Rating (WR) = (v / (v + m)) · R + (m / (v + m)) · C

Where:

R = average rating for the movie (within the cluster)

v = number of votes for the movie (within the cluster)

m = minimum votes required to be listed (‘bayesianThreshold’)

C = the mean vote across the whole report (‘bayesianConstant’)

For an item to be listed in this particular project, we chose to set the minimum vote count m to 3.

Let’s add the ‘weightedRating’ field to our ‘clust_movieRatings’ dataframe and have a look at our final rankings:
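
One way to compute the ranking, aggregating per movie (stats and top20 are our names):

# Per-movie rating mean (R) and count (v) within the cluster:
stats = clust_movieRatings.groupby('movieId')['rating'].agg(['mean', 'count'])
m = 3                                        # bayesianThreshold: minimum votes
C = clust_movieRatings['rating'].mean()      # bayesianConstant: cluster-wide mean
stats = stats[stats['count'] >= m]           # drop movies below the threshold
stats['weightedRating'] = (stats['count'] / (stats['count'] + m)) * stats['mean'] \
                        + (m / (stats['count'] + m)) * C
top20 = stats.sort_values('weightedRating', ascending=False).head(20)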

Let’s convert these movieIds into actual movie titles:
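
A join back to the movies dataframe recovers the titles:

top20_titles = top20.reset_index().merge(movies[['movieId', 'title']], on='movieId')
print(top20_titles[['title', 'weightedRating']])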

This is the top 20 movie selection that would be recommended to any user labeled with cluster #8.

Let’s play around with our algorithm and visualize the heatmap of cluster #6:

These users seem to have a stronger taste for musicals than other users, among other genres such as children and animation. Let’s have a look at their top 20 movie selection, using the same weighted rating formula:

Conclusion

We can generate relevant, specific movie selections for every cluster of users, even with complex combinations of preferences. Any new user assigned to their closest cluster will be recommended the 20 movies they are most likely to enjoy.

For further improvement, we still need to implement the “elbow method” to determine the optimal value of k (a rough sketch is shown below). We will also have to train our model on larger datasets.
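
As a sketch of what that could look like, here is a local elbow check with scikit-learn’s own k-Means (we alias it to avoid clashing with the SageMaker KMeans import; running it locally rather than on SageMaker is a shortcut on our part):

from sklearn.cluster import KMeans as SkKMeans

# Plot inertia (within-cluster sum of squares) against k; the "elbow"
# in the curve suggests a reasonable number of clusters.
inertias = [SkKMeans(n_clusters=k, random_state=0).fit(data_train).inertia_
            for k in range(2, 21)]
plt.plot(range(2, 21), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()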

Here’s the link to our code. Thank you for reading!
