How to Build a Content-Based Movie Recommender System
Creating a content-based movie recommender system
In this article, I will explain how to create a recommender system without user data. I'll also share an example I built with Python and walk through how it works step by step.
I am not going to explain the types of recommender systems separately, since that information is easy to find online. So, let's start talking about the content-based recommender system!
What is a Content-Based Recommender System?
A content-based recommender system does not use data collected from users other than you. It simply helps you by identifying products that are similar to a product you already like.
For example, you have a website that sells stuff online and you don’t have a registered user yet, but you still want to recommend products to the visitors of the website. In this case, the content-based recommender system would be an ideal option for you.
However, content-based recommender systems are limited because they do not use other users' data, and they don't help a user discover tastes beyond their known preferences.
For example, let's say that user A and user B both like drama movies, and user A also likes comedies. A collaborative system could use that overlap to suggest comedies to user B, but a content-based system doesn't have that knowledge, so it keeps offering user B dramas. Eventually, you're eliminating other options that user B might potentially like.
Anyway! Let’s talk about a few terms we’re going to use before we create this system. Let’s start with Kernel Density Estimation!
Kernel Density Estimation
Kernel density estimation is a really useful statistical tool with an intimidating name. Often shortened to KDE, it's a technique that lets you create a smooth curve from a set of data.
KDE is a method for estimating the density of a data distribution. It shows where many points are concentrated and where they are sparse. So in a one-dimensional array, it helps you cluster by separating the lowest-density points (local minima) from the highest-density points (local maxima). Just follow these steps:
- Compute densities
- Find local minima and local maxima values
- Create clusters
Cosine Similarity
Cosine similarity is a method for measuring the similarity between vectors. Mathematically, it calculates the cosine of the angle between the two vectors. If the angle between the two vectors is zero, the similarity is 1 because the cosine of zero is 1, meaning the two vectors point in the same direction. For vectors with non-negative components, such as TF-IDF vectors, the cosine varies from 0 to 1, so similarity scores will also vary from 0 to 1. The formula is expressed as follows:

similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
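The formula is easy to check directly with NumPy. Here is a minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a
c = np.array([3.0, 0.0, 0.0])  # different direction

print(cosine_similarity(a, b))  # identical direction, so similarity is 1
print(cosine_similarity(a, c))  # somewhere between 0 and 1
```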
That’s enough for now! Let’s code it!
Feature Importances
I want to set a score for each movie or series, and I need a coefficient for each feature, so I’m going to look at feature importances.
The dataset is as follows,
You can easily obtain this dataset by merging the files shared at https://datasets.imdbws.com. I obtained this data by merging title.basics.tsv.gz with title.ratings.tsv.gz, and afterwards I deleted some features. For example, the end_year field contained too many null values, so I removed it. For more detailed information, please see my repository, which is shared at the end of this article.
I have to mention one more detail: I converted the kind field into an integer field by using a label encoder.
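That encoding step looks roughly like this with scikit-learn's LabelEncoder (a minimal sketch; the kind values here are illustrative examples of IMDb title types, not the full set):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative values for the `kind` field
df = pd.DataFrame({'kind': ['movie', 'tvSeries', 'movie', 'tvMovie']})

le = LabelEncoder()
# Each distinct label is mapped to an integer (classes are sorted alphabetically)
df['kind'] = le.fit_transform(df['kind'])

print(list(le.classes_))       # ['movie', 'tvMovie', 'tvSeries']
print(df['kind'].tolist())     # [0, 2, 0, 1]
```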
As you can see above, I tried three different methods. The first is the feature importance provided directly by the Random Forest model.
Another is Permutation Importance. This approach directly measures a feature's effect on the model by randomly re-shuffling that feature's values, one predictor at a time. Because it only shuffles values, it preserves the distribution of the variable.
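A minimal sketch of permutation importance using scikit-learn's permutation_importance, run here on synthetic stand-in data rather than the movie dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the movie features (4 columns, 2 of them informative)
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column in turn and measure how much the model's score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```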
The last is Drop-Column Feature Importance. This method is completely intuitive: each time, it drops one feature, retrains, and compares the result against the model that uses all columns. It is usually much more reliable, but it can take a long time to run. Processing time is not a concern for our dataset.
The results are like this:
I chose the Drop-Column Feature Importance method among these. As indicated before, it is much more reliable, and when we glance at the results, they make much more sense for calculating scores.
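A minimal sketch of the drop-column approach, again on synthetic stand-in data (the real calculation runs against the movie dataset and its four features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie features (4 columns, 2 of them informative)
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_score(X_train, X_test):
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_tr)
    return model.score(X_test, y_te)

baseline = fit_score(X_tr, X_te)

# Drop one column at a time, retrain, and record the loss in R^2
importances = [
    baseline - fit_score(np.delete(X_tr, col, axis=1), np.delete(X_te, col, axis=1))
    for col in range(X_tr.shape[1])
]
print(np.round(importances, 3))
```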
dataset['score'] = (
0.4576 * dataset['num_votes'] +
0.3271 * dataset['runtime'] +
0.3517 * dataset['start_year'] +
0.0493 * dataset['kind']
)
Clustering
I'm going to use the scores to create the clusters, so I can recommend movies whose scores fall in the same range.
I have a 1-dimensional array of scores, and I can use KDE to cluster. I used this code to see the distribution of scores:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(9, 6))
sns.distplot(dataset['score'])
plt.axvline(18000, color='r');
And I got a graph like this,
I added a vertical line at 18,000 because most of the density lies between 650 and 18,000. If I include the points above 18,000 when applying KDE, it collects all the points below 18,000 into one cluster, and that's not what we want, because it reduces diversity.
I applied the three steps for clustering with KDE that I mentioned at the beginning of this article.
1. Compute densities
import numpy as np
from sklearn.neighbors import KernelDensity

vals = dataset['score'].values.reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(vals)
s = np.linspace(650, 18000)
e = kde.score_samples(s.reshape(-1, 1))
2. Find local minima and local maxima values
from scipy.signal import argrelextrema

mi = argrelextrema(e, np.less)[0]
ma = argrelextrema(e, np.greater)[0]
points = np.concatenate((s[mi], s[ma]), axis=0)

buckets = np.sort(points)
3. Create clusters
dataset['cluster'] = buckets.searchsorted(dataset.score)
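To see what that one-liner does, here is a tiny example with made-up boundaries and scores. searchsorted returns, for each score, the insertion index that would keep the sorted boundary array ordered, so every score falling between the same two boundaries gets the same cluster id:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster boundaries (the sorted local minima/maxima found by KDE)
buckets = np.array([1000.0, 4000.0, 9000.0])
scores = pd.Series([650.0, 2500.0, 4000.0, 15000.0])

# Each score is assigned the number of boundaries strictly below it
clusters = buckets.searchsorted(scores)
print(clusters.tolist())  # [0, 1, 1, 3]
```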
Text Similarity
Finally, I calculated the similarities between the genres in order to recommend movies of the same type as accurately as possible. I used TF-IDF and a linear kernel for this; consequently, cosine similarity was used in the background to find the similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf_vectorizer = TfidfVectorizer()
matrix = tfidf_vectorizer.fit_transform(dataset['genres'])
kernel = linear_kernel(matrix, matrix)
Let’s see the recommendations now!
def get_recommendations2(movie_index):
    print(dataset.iloc[movie_index])
    print('**' * 40)
    sim_ = list(enumerate(kernel[movie_index]))
    sim = sorted(sim_, key=lambda x: x[1], reverse=True)
    index = [i[0] for i in sim if i[0] != movie_index and i[1] > .5]
    cond1 = dataset.index.isin(index)
    cond2 = dataset.cluster == dataset.iloc[movie_index]['cluster']
    selected = dataset.loc[cond1 & cond2] \
        .sort_values(by='score', ascending=False).head(20)
    print(selected[['title', 'cluster', 'genres']])
That seemed pretty useful to me! If you want to see the code in more detail, serve it with Flask, index movies with Elasticsearch, and run it all with Docker, you can look at my repository:
Thank you for reading!
References
- Eryk Lewinson, Explaining Feature Importance by example of a Random Forest (2019)
- Matthew Conlen, Kernel Density Estimation
- Matthew Overby, 1D Clustering with KDE (2017)
- CountVectorizer, TfidfVectorizer, Predict Comments (2018)