Beginner’s Guide to Creating the SVD Recommender System

Mayukh Bhattacharyya
Towards Data Science
7 min read · Nov 22, 2019


Photo by freestocks on Unsplash

Prologue

Ever logged into Netflix and seen it suggest you watch Gravity after you spent the previous night watching Interstellar? Or bought something on Amazon and noticed it recommending products you might be interested in? Or wondered how online ad agencies show you ads based on your browsing habits? It all boils down to something called a recommendation system, which predicts what we may be interested in based on our own and others’ history of interacting with products.

As promised, we’ll make a recommender system. And just so you don’t feel bad about yourself, we’ll make a pretty cool one too: a collaborative filtering system using the SVD (Singular Value Decomposition) technique, which is quite a notch above a basic content-based recommender system.

Collaborative filtering captures the underlying pattern of interests of like-minded users and uses the choices and preferences of similar users to suggest new items.

Requirements

So let’s get started. What we’ll need is listed below; you most probably know and already have these if you are reading this.

1. python >= 2.7
2. pandas >= 0.17
3. numpy
4. scipy

For the unaware, pandas, numpy and scipy are Python packages that will make our life easy. You can install them using pip from a terminal or command prompt; Google it if you don’t know how. For example, the command below installs the pandas package.

$ pip install pandas
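Or, to install all three at once:

$ pip install pandas numpy scipy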

Dataset

We’ll definitely need a dataset to work on. We’ll use the famous Movielens dataset for making our recommendation system. Head over to http://grouplens.org/datasets/movielens/ and download the movielens100k dataset.

The dataset contains about 100,000 ratings of different movies by different users. Let’s explore it. Create a new script exploration.py and add the following code. Note: here we’ll be using separate scripts, but you can just as well use a single Jupyter notebook, which is a lot more convenient.

import pandas as pd
import numpy as np
data = pd.read_csv('movielens100k.csv')
data['userId'] = data['userId'].astype('str')
data['movieId'] = data['movieId'].astype('str')
users = data['userId'].unique() #list of all users
movies = data['movieId'].unique() #list of all movies
print("Number of users", len(users))
print("Number of movies", len(movies))
print(data.head())

There you go! You will see there are 718 users and 8915 movies in the dataset.

Number of users 718
Number of movies 8915
+----+----------+-----------+----------+-------------+
| | userId | movieId | rating | timestamp |
|----+----------+-----------+----------+-------------|
| 0 | 1 | 1 | 5 | 847117005 |
| 1 | 1 | 2 | 3 | 847642142 |
| 2 | 1 | 10 | 3 | 847641896 |
| 3 | 1 | 32 | 4 | 847642008 |
| 4 | 1 | 34 | 4 | 847641956 |
+----+----------+-----------+----------+-------------+

Train & Test split

We could have used a normal random train-test split on the dataset. But since we have timestamps available, let’s do something fancier and better. Let’s make a new script workspace.py where we’ll do all our work, and add the following code at the start.

import pandas as pd
import numpy as np
import scipy
data = pd.read_csv('movielens100k.csv')
data['userId'] = data['userId'].astype('str')
data['movieId'] = data['movieId'].astype('str')
users = data['userId'].unique() #list of all users
movies = data['movieId'].unique() #list of all movies
test = pd.DataFrame(columns=data.columns)
train = pd.DataFrame(columns=data.columns)
test_ratio = 0.2 # fraction of data to be used as the test set

for u in users:
    temp = data[data['userId'] == u]
    n = len(temp)
    test_size = int(test_ratio*n)

    # sort by timestamp so the most recent ratings sit at the bottom
    temp = temp.sort_values('timestamp').reset_index(drop=True)

    dummy_test = temp.iloc[n-test_size:]
    dummy_train = temp.iloc[:n-test_size]

    test = pd.concat([test, dummy_test])
    train = pd.concat([train, dummy_train])

What this does is: for every user, we sort that user’s ratings by timestamp so the more recent ones sit at the bottom, and take the last 20% as the test set. So, instead of a random selection, we hold out each user’s most recent ratings. This is more logical in the sense that the goal of a recommender is to predict ratings for products a user hasn’t encountered yet, based on their historical ratings of similar products.
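For comparison, a plain random split (not what we use here, shown only for contrast) would be a two-liner with pandas:

# for contrast only: a random 80/20 split that ignores the timestamps
test_random = data.sample(frac=test_ratio, random_state=42)
train_random = data.drop(test_random.index)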

Utility Matrix

The dataset in its current form is of no use to us. In order to feed the data to the recommender engine, we need to transform it into a form called a utility matrix. We’ll make a function create_utility_matrix in a new script; name it recsys.py. We shall use the functions in this script to work on our train and test sets.

Along with the dataset `data`, we pass a formatizer dictionary that tells the function which column holds which field: the userId column (column 0) for the key ‘user’, the movieId column (column 1) for the key ‘item’, and the rating column (column 2) for the key ‘value’.

A utility matrix is nothing but a 2D matrix where one axis belongs to the users and the other axis belongs to the items (movies in this case). The value at the (i, j) location of the matrix is the rating that user i gave to movie j.

Let’s work through an example to clear things up a bit. Suppose we have this dataset of 5 ratings.

+----+----------+-----------+----------+
|    |  userId  |  movieId  |  rating  |
|----+----------+-----------+----------|
| 0  |  mark    |  movie1   |  5       |
| 1  |  lucy    |  movie2   |  2       |
| 2  |  mark    |  movie3   |  3       |
| 3  |  shane   |  movie2   |  1       |
| 4  |  lisa    |  movie3   |  4       |
+----+----------+-----------+----------+

If we pass this dataset through the create_utility_matrix function described below, it will return a utility matrix like the one shown here, along with the auxiliary dictionaries users_index and items_index.

+----+----+----+
| 5  | nan| 3  | # users_index = {mark: 0, lucy: 1, shane: 2, lisa: 3}
+----+----+----+ # items_index = {movie1: 0, movie2: 1, movie3: 2}
| nan| 2  | nan|
+----+----+----+
| nan| 1  | nan| # The nan values are for user-item combinations
+----+----+----+ # where the ratings are unavailable.
| nan| nan| 4  |
+----+----+----+

Let’s see the function now.

import numpy as np
import pandas as pd
from scipy.linalg import sqrtm
def create_utility_matrix(data, formatizer={'user': 0, 'item': 1, 'value': 2}):
    """
    :param data: Array-like, 2D, nx3
    :param formatizer: dict mapping the keys 'user', 'item' and 'value'
                       to their column positions in data
    :return: utility matrix (n x m), n=users, m=items
    """
    itemField = formatizer['item']
    userField = formatizer['user']
    valueField = formatizer['value']

    userList = data.iloc[:, userField].tolist()
    itemList = data.iloc[:, itemField].tolist()
    valueList = data.iloc[:, valueField].tolist()

    users = list(set(userList))
    items = list(set(itemList))

    users_index = {users[i]: i for i in range(len(users))}

    # one column per item, initialized to nan for every user
    pd_dict = {item: [np.nan for i in range(len(users))] for item in items}

    # fill in the known ratings
    for i in range(0, len(data)):
        item = itemList[i]
        user = userList[i]
        value = valueList[i]
        pd_dict[item][users_index[user]] = value

    X = pd.DataFrame(pd_dict)
    X.index = users

    itemcols = list(X.columns)
    items_index = {itemcols[i]: i for i in range(len(itemcols))}

    # users_index maps each user id to its row index in X
    # items_index does the same for items
    return X, users_index, items_index
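To see it in action, here is a quick sketch that runs the toy dataset from earlier through the function. Keep in mind that the user and item orderings come from a Python set, so the exact index assignments may vary between runs.

import pandas as pd
from recsys import create_utility_matrix

# the toy dataset from the example above
toy = pd.DataFrame([['mark', 'movie1', 5],
                    ['lucy', 'movie2', 2],
                    ['mark', 'movie3', 3],
                    ['shane', 'movie2', 1],
                    ['lisa', 'movie3', 4]],
                   columns=['userId', 'movieId', 'rating'])

X, users_index, items_index = create_utility_matrix(toy)
print(X)            # 4 x 3 matrix with nan where ratings are missing
print(users_index)  # e.g. {'mark': 0, 'lucy': 1, 'shane': 2, 'lisa': 3}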

SVD Computation

SVD stands for Singular Value Decomposition. What it does is decompose a matrix into constituent arrays of feature vectors corresponding to each row and each column. Let’s add another function to recsys.py. It will take the output from create_utility_matrix and a parameter k, the number of latent features into which each user and movie will be resolved.
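Before we write it, here is a minimal standalone sketch (toy values, purely for intuition) of what numpy’s SVD returns and how a rank-k reconstruction works:

import numpy as np

# a tiny 3 x 2 matrix with arbitrary values, just for illustration
A = np.array([[5., 3.],
              [4., 1.],
              [2., 5.]])

# A = U @ diag(s) @ V, where numpy's V is actually V-transpose
U, s, V = np.linalg.svd(A, full_matrices=False)

# keep only the k largest singular values for a rank-k approximation
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ V[:k, :]
print(A_k)  # the best rank-1 approximation of A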

The SVD technique was introduced into the recommendation-system domain by Brandyn Webb, much more famously known as Simon Funk, during the Netflix Prize challenge. Here we aren’t doing Funk’s iterative version of SVD (FunkSVD, as it is called) but instead using numpy’s standard SVD implementation.

def svd(train, k):
    utilMat = np.array(train)

    # the nan (unavailable) entries are masked
    mask = np.isnan(utilMat)
    masked_arr = np.ma.masked_array(utilMat, mask)
    item_means = np.mean(masked_arr, axis=0)

    # nan entries are replaced by the average rating of each item
    utilMat = masked_arr.filled(item_means)

    # remove the per-item average from all entries;
    # the formerly-nan entries become essentially zero
    x = np.tile(item_means, (utilMat.shape[0], 1))
    utilMat = utilMat - x

    # The magic happens here. U and V are the user and item features
    U, s, V = np.linalg.svd(utilMat, full_matrices=False)
    s = np.diag(s)

    # we take only the k most significant features
    s = s[0:k, 0:k]
    U = U[:, 0:k]
    V = V[0:k, :]

    # split sqrt(s) between the user and item factors, then recombine
    s_root = sqrtm(s)
    Usk = np.dot(U, s_root)
    skV = np.dot(s_root, V)
    UsV = np.dot(Usk, skV)

    # add the item means back in to get predicted ratings
    UsV = UsV + x
    print("svd done")
    return UsV

Binding it all together

Back to workspace.py, where we’ll use the functions above. We’ll compute the root mean square error of the predicted ratings on the test set against the true ratings. Besides an rmse function, we’ll also create a list holding different numbers of features, which will let us analyze how performance varies with k.

from recsys import svd, create_utility_matrix

def rmse(true, pred):
    # this will be used towards the end
    x = np.array(true, dtype=float) - np.array(pred, dtype=float)
    return np.sqrt(sum([xi*xi for xi in x])/len(x))

# to test the performance over different numbers of features
no_of_features = [8, 10, 12, 14, 17]
utilMat, users_index, items_index = create_utility_matrix(train)

for f in no_of_features:
    svdout = svd(utilMat, k=f)
    pred = [] # to store the predicted ratings

    for _, row in test.iterrows():
        user = row['userId']
        item = row['movieId']
        u_index = users_index[user]
        if item in items_index:
            i_index = items_index[item]
            pred_rating = svdout[u_index, i_index]
        else:
            # item never seen in training: fall back to the user's mean
            pred_rating = np.mean(svdout[u_index, :])
        pred.append(pred_rating)

    print(rmse(test['rating'], pred))

For test_ratio = 0.2, the RMSE scores come out at around 0.96.

This is a modest score for Movielens100k. With a little tweaking, you can perhaps beat 0.945. But that’s up to you.
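As one example of a cheap tweak (a sketch, not benchmarked here): the SVD reconstruction can produce predictions outside the valid 1-5 rating range, and clipping them back into range before scoring can never hurt the RMSE.

# clip predictions into the valid rating range before scoring
pred = np.clip(pred, 1.0, 5.0)
print(rmse(test['rating'], pred))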

If you liked the article, let me know! And here are 3 links for your perusal:

  1. https://github.com/mayukh18/reco (Complete codes of SVD as well as implementations of other famous RecSys algorithms)
  2. https://paperswithcode.com/sota/collaborative-filtering-on-movielens-100k ( State of the Art results on Movielens100k. These are validated on the official test set)
  3. https://sifter.org/~simon/journal/20061211.html (Most famous blog of Simon Funk detailing his SVD approach)
