What to Bring? — Item Suggestions with Collaborative Filtering

Personalized item suggestions with actual data for a real case

Malte Bleeker
Towards Data Science


Item Suggestions (image by author)

A birthday party, a trip together, or the summer gathering of one's local sports club: events like these are great until one's own contribution to the preparation is required. It often starts with a highly devoted person who takes the initiative and gets things started, but sooner or later, as the number of already committed items keeps growing, one has to ask the question: What will I bring to the table?

Lists of items that are commonly brought to such events have been created in abundance, and still it is always a struggle to come up with a suitable idea that at least relieves you of the social obligation (most commonly by defaulting to the silver bullet, a bottle of wine). Thanks to the hundreds of thousands of such lists available through a web application of ours, we were able to tackle this challenge by turning it into the quest of suggesting promising items to users, based on the items already in a list.

We approached the challenge by utilizing the similarities between the ~100k lists in our database and the current list of interest to a user. Using such list-to-list similarities to recommend/suggest suitable items is also called collaborative filtering, given that the goal is to filter the gazillion different items in the database that could be suggested down to just the few most suitable ones (okay, it's actually just about 100k different items in our case). The analogous case often found in, e.g., book or movie recommendation systems pursues a similar goal: the similarity between what users have watched (in our case, which items have been added to a list) is used to suggest movies a user might have missed out on so far and that have been watched by similar users (in our case, items that have commonly been added to similar lists).

How does it work? — A simple explanation

The first step is to create a table with the lists (IDs) as the rows and all item names as the columns. If all our lists contained only 20 unique words, we would therefore only need 20 columns, and the number of rows would equal the number of lists. Whenever a word is contained in a list, we put the value "1" in the corresponding column and row, and whenever a word is not contained in a list, a "0" (in text analysis this is also known as a Bag of Words). We do the same for the list we would like to make suggestions for.

A simple illustration of the list-item matrix (image by author)

For movie or book recommendations, the same procedure would be applied, but the columns would contain the different movies or books in the database, the rows would contain the different users (IDs), and the values could indicate whether a book has been read or if and how it has been rated.
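
For the snippets below, I assume the raw data is available as a long-format pandas DataFrame df, with one row per list-item pair and the columns 'id', 'what', and 'value' used in step 1; the toy frame here is merely a made-up stand-in for the real database:

import pandas as pd

# Hypothetical stand-in for the real data: one row per (list, item) pair
df = pd.DataFrame({
    'id':    ['L1', 'L1', 'L2', 'L2', 'L2', 'TestList', 'TestList'],
    'what':  ['wine', 'cake', 'wine', 'beer', 'salad', 'wine', 'cake'],
    'value': [1, 1, 1, 1, 1, 1, 1],
})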

Optional: Corresponding Python Code (Step 1)

# Create the binary list-item matrix
# (fill_value=0 encodes absent items as 0 instead of NaN)
matrix = df.pivot_table(index='id', columns='what', values='value', fill_value=0)

# Matrix dimensions: 126,252 x 179,674
# Number of unique listIDs: 126252
# Number of unique items ("what"): 179674

This brings our data into a suitable format, ready for the second step: the calculation of the similarity between the lists in our database and the list we would like to suggest items for. One of the most commonly applied methods to measure the similarity between two rows (vectors) is the "cosine similarity" (its complement, one minus the similarity, is known as the "cosine distance"). It is calculated as the dot product of the two rows, divided by the product of the magnitudes of both vectors. The graphic below should make this calculation more understandable; the focus here is on an intuitive understanding of the cosine similarity, and I refer you to this article for any mathematical subtleties. Please also consider the code chunks below an optional but supplementary side dish.

Calculation of the cosine similarity between two lists/vectors (image by author)

If, for example, two lists were exactly identical, they would have a cosine similarity of 1; lists that have no words in common have a similarity of 0; and lists with at least some words in common a similarity somewhere between 0 and 1. The higher the cosine similarity, the more similar the list in the database is to our list of interest. With this in mind, we can calculate the similarity of our list with all other lists in the table, getting a similarity score for each row. Next, we sort these rows and extract the most similar ones, either based on a predefined number (e.g. the 50 most similar lists) or on a similarity threshold (e.g. > 0.6). In our case, the similarity scores varied a lot depending on the number of items in the list of interest and its specific use case, so for simplicity we used the hundred most similar lists (a rule of thumb: the more lists we select here, the more stable, but also the more generic, our suggestions will be). We can now create a copy of our original list-word table that contains only the 100 most similar lists.
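
To make the three cases tangible, here is a tiny, self-contained sketch with made-up binary vectors (not our actual lists) that computes the cosine similarity exactly as described above:

import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 1, 0, 0])  # list containing words 1 and 2
b = np.array([1, 1, 0, 0])  # identical list
c = np.array([0, 0, 1, 1])  # no words in common
d = np.array([1, 0, 1, 0])  # one word in common

print(cosine(a, b))  # 1.0 -> identical lists
print(cosine(a, c))  # 0.0 -> nothing in common
print(cosine(a, d))  # 0.5 -> somewhere in between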

Optional: Corresponding Python Code (Step 2)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# The listID "TestList" is the list we would like to get suggestions for
listID = 'TestList'

# Map matrix row positions back to listIDs
listid_dict = dict(enumerate(matrix.index))

# Find the row index of the listID "TestList" in the matrix
list_index = matrix.index.get_loc(listID)

# Extract the row of the listID "TestList" from the matrix
list_row = matrix.iloc[[list_index]]

# Calculate the similarity between the listID "TestList" and all other listIDs
similarities = cosine_similarity(list_row, matrix)[0]

# Alternative: return the indices of all lists with a similarity greater than 0.6
# similar_list_indices = np.where(similarities > 0.6)[0]

# Return the indices of the 100 most similar lists
# (this includes "TestList" itself with a self-similarity of 1; its row is
# needed again in step 3 to identify the items that are already in the list)
similar_list_indices = np.argsort(similarities)[-100:]

# Extract the corresponding similarity scores
similarity_scores = similarities[similar_list_indices]

# Create a list of tuples with the listID and its similarity score
similar_lists = [(listid_dict[i], similarity) for i, similarity in zip(similar_list_indices, similarity_scores)]

# Convert the indices to listIDs
similar_list_ids = [listid_dict[i] for i in similar_list_indices]

# Extract the rows of the similar lists from the matrix
recommendation_matrix = matrix[matrix.index.isin(similar_list_ids)]

Once this is done, it's time to identify the most promising items in these similar lists. First, we check the items already existing in our list of interest and delete the corresponding items (columns) from the table (we assume here that the user wouldn't like to get an already included item suggested). Next, the simplest approach would be to check which items occur most often across these similar lists and suggest those in decreasing order. This, however, would give an item in the 99th most similar list the same weight as one in the most similar list. To adjust for this, each row is multiplied by the similarity score that was calculated before. As a result, the values in the rows (now between 0 and 1) are significantly smaller for the 99th most similar row than for the most similar row. Based on this, the weighted sum of each column (item) can be calculated, and the items with the highest scores can be suggested.

Optional: Corresponding Python Code (Step 3)

# Find columns with a value of 1 in the 'TestList' row (items that are already in the list)
columns_to_remove = recommendation_matrix.columns[recommendation_matrix.loc[listID] == 1]

# Drop the identified columns (do not recommend items already in the list);
# reassigning instead of inplace=True avoids pandas' SettingWithCopyWarning on the sliced matrix
recommendation_matrix = recommendation_matrix.drop(columns=columns_to_remove)

# Create a dictionary that maps listIDs to similarity scores
listid_to_similarity = dict(similar_lists)

# Multiply each row in the recommendation matrix by the corresponding similarity score
weights = recommendation_matrix.index.to_series().map(listid_to_similarity)
recommendation_matrix = recommendation_matrix.mul(weights, axis=0)

# Calculate the weighted sum of each column and sort the values in descending order
recommendations = recommendation_matrix.sum().sort_values(ascending=False)

# Print the items with the highest scores (the most suitable item suggestions)
top_item_recommendations = recommendations.head(10)
print(top_item_recommendations)
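
If you prefer the three steps bundled up, they condense into a single helper function. This is only a sketch based on the snippets above, not the production code, and the function name and signature are mine:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def suggest_items(list_id, matrix, n_similar=100, top_n=10):
    # Step 2: similarity of the list of interest to every list in the matrix
    similarities = cosine_similarity(matrix.loc[[list_id]], matrix)[0]

    # Keep the n_similar most similar lists (the list of interest included)
    indices = np.argsort(similarities)[-n_similar:]
    rec = matrix.iloc[indices]

    # Step 3: drop items the list already contains, then weight each row by its similarity
    rec = rec.drop(columns=matrix.columns[matrix.loc[list_id] == 1])
    rec = rec.mul(similarities[indices], axis=0)

    # The highest weighted column sums are the item suggestions
    return rec.sum().sort_values(ascending=False).head(top_n)

# Example usage:
# print(suggest_items('TestList', matrix))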

That's already it: enough data, combined with a calculation as simple but powerful as the cosine similarity, can enable us to generate suitable personalized suggestions and recommendations. In the following, I'd like to show you the results of three suggestion simulations, each with only a few items in the list so far (items and lists are mostly translated from German, so please excuse any item picks that might feel odd to you).

Simulation of personalized item suggestions for three lists (image by author)

As can be seen, just a few items are enough to generate personalized item suggestions that reflect the underlying theme of the list — suggestions that get even more specific once the most common items for a specific type of event are taken care of.

Some additional notes: To prevent the recommendation of item names that are very specific or might even contain personal information, we only included items (columns) that occurred in at least 20 different lists. We also excluded lists that contained fewer than 3 items. The list suggestion feature is not deployed in a production environment; so far it has only been simulated and tested with the actual data in a Jupyter Notebook, as described in this article.
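
Expressed on the long-format frame from above (and assuming the filtering happens before the matrix is built; the thresholds are the ones just mentioned, the code itself is mine), this pre-filtering could look like:

# Keep only items that occur in at least 20 different lists
item_counts = df.groupby('what')['id'].nunique()
df = df[df['what'].isin(item_counts[item_counts >= 20].index)]

# Keep only lists that contain at least 3 items
list_sizes = df.groupby('id')['what'].nunique()
df = df[df['id'].isin(list_sizes[list_sizes >= 3].index)]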

Thank you for your interest in this article and I highly appreciate all types of feedback — I wish you all the best and stay curious.
