
Creating a Grocery Product Recommender for Instacart

Using K-Means Clustering and Association Rule Mining to Recommend Products to Instacart Customers

Photo by Scott Warman on Unsplash

In the ecommerce shopping experience, product recommendations come in many forms: they may be used to recommend other products on one product’s page (Amazon’s "Frequently bought together" feature, for instance) or they may be used on the checkout page to show customers products they may be interested in based on their total order. In this post I will detail how I made a recommender for the former case.

Further, recommendations may be more helpful if they are targeted towards a specific segment of customers, rather than made uniformly. For instance, if one group of customers tends to buy a lot of nondairy milk substitutes and another group tends to buy traditional milk, it may make sense to make different recommendations to go along with that box of Cheerios. To make tailored recommendations, I first segmented Instacart users based on their purchase history using K-Means clustering and then built recommenders based on the product association rules within those clusters. In this post I will go through that process and give samples of recommendations.

The Data and Instacart Background

Instacart is an online grocery delivery service that allows users to place grocery orders through their website or app, which are then fulfilled and delivered by a personal shopper. It is very similar to Uber Eats but for grocery stores. In 2017 they released a year of their data composed of about 3.3 million orders from about 200,000 customers. The data was released in the form of a relational database, so I combined the separate tables together to perform EDA, leaving me with the following:
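As a rough illustration of that combining step, here is a minimal pandas sketch. The toy tables and column names (`order_id`, `user_id`, `product_id`, `aisle_id`, `aisle`) are assumptions made for the example and are not taken from the article’s actual data.

```python
import pandas as pd

# Tiny stand-in tables; the column names are assumptions modeled on a
# typical orders/products schema, not the article's real tables.
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
order_products = pd.DataFrame(
    {"order_id": [1, 1, 2], "product_id": [100, 101, 100]}
)
products = pd.DataFrame(
    {
        "product_id": [100, 101],
        "product_name": ["Banana", "Organic Whole Milk"],
        "aisle_id": [1, 2],
    }
)
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["fresh fruits", "milk"]})

# One row per (order, product) line item, enriched with user and aisle info.
combined = (
    order_products
    .merge(orders, on="order_id", how="left")
    .merge(products, on="product_id", how="left")
    .merge(aisles, on="aisle_id", how="left")
)
print(combined)
```

Left joins keep every purchased line item even if a lookup table happens to be missing a key, which makes gaps visible as missing values rather than silently dropping rows.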

EDA

When performing EDA there were some key questions I wanted to answer:

  1. What are the most popular products and aisles?
  2. How "top heavy" is the assortment? That is, how much of the total ordering share do the top products make up?
  3. How large are orders?

To get a broad idea of what Instacart tends to sell we can defer to their department sales. Instacart has 21 departments at the top of their product taxonomy. Below are the unit sale shares for each of them:

Instacart seems to be a popular choice for produce. The most popular Instacart "aisles," the next level down in their taxonomy, are their fruit and vegetable aisles. The chart below shows the unit share of total sales for the top 10 aisles:

There are 134 aisles in total with 49,685 products, so the above chart indicates quite a "top heavy" distribution in terms of product sales, with the top 3 aisles accounting for over 25% of units sold. The chart below, showing unit shares for the top 30 products, follows the same trend:

Almost all of the top 30 products are produce and there is a steep drop-off in terms of share after the most popular items. It will be interesting to see if K-Means Clustering may reveal distinct customer groups from this produce-heavy distribution.

Below are the descriptive characteristics of order size:

count    3.346083e+06
mean     8.457334e+01
std      1.355298e+02
min      1.000000e+00
25%      1.500000e+01
50%      3.600000e+01
75%      1.050000e+02
max      1.058500e+04

Here is a distribution chart with a cutoff at 500 units:

The chart and table indicate that Instacart may have room to improve on order size: the right-skewed distribution suggests that most orders may not be fulfilling all the grocery needs of their respective customers, with half of the orders having fewer than 36 items. A product recommendation system would allow customers to more easily find products they want and expose customers to items they have never bought from Instacart.

PCA and K-Means Clustering

The goal for the K-Means clustering is to group customers into segments based on the products they have bought historically. To accomplish this I intended to implement K-Means on the share of units bought from the sum of each customer’s previous orders. My first step was to transform the combined table I displayed earlier into a table where each row represents a customer and the columns represent the share bought from each aisle. A sample from this table is below:
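A minimal sketch of this reshaping step, assuming a line-item table with one row per unit bought. The toy data and the `user_id` and `aisle` column names are illustrative only and are not the article’s actual table; the result is named `aisle_share_pivot` to match the variable used in the PCA code.

```python
import pandas as pd

# Toy line-item table: one row per product unit bought (hypothetical columns).
line_items = pd.DataFrame(
    {
        "user_id": [10, 10, 10, 10, 11, 11],
        "aisle": ["fresh fruits", "fresh fruits", "milk", "yogurt",
                  "fresh vegetables", "milk"],
    }
)

# Count units per (user, aisle); aisles a user never bought from become 0.
aisle_counts = pd.crosstab(line_items["user_id"], line_items["aisle"])

# Divide each row by its total so every customer's shares sum to 1.
aisle_share_pivot = aisle_counts.div(aisle_counts.sum(axis=1), axis=0)
print(aisle_share_pivot.round(2))
```

Working with shares rather than raw counts keeps heavy and light shoppers comparable, since each row describes the mix of aisles rather than the volume bought.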

I then performed PCA to reduce the number of features for the K-Means algorithm. This would allow me to better visualize my clusters as well as help K-Means run more efficiently. In deciding on the number of components to use I referred to the variance % each component represented of the total variance and chose a cutoff after the marginal variance added leveled off:

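from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

# Fit PCA on the customer-by-aisle share table, keeping the first 30 components
pca = PCA(n_components = 30)
principalComponents = pca.fit_transform(aisle_share_pivot)

# Plot the share of total variance explained by each component
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color = 'black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features, rotation = 45)

# Keep the transformed data in a DataFrame for the clustering step
PCA_components = pd.DataFrame(principalComponents)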

According to the chart, this happened at component 5 (the sixth component, since the chart counts from 0). I then fit scikit-learn’s K-Means algorithm on the first 6 PCA components and looked at the resulting SSE curve:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-Means for k = 2..14 on the first 6 PCA components,
# recording the SSE (inertia) and the cluster labels for each k
sse = {}
labels = {}
for k in range(2,15):
    kmeans = KMeans(n_clusters = k).fit(PCA_components[[0,1,2,3,4,5]])
    sse[k] = kmeans.inertia_
    labels[k] = kmeans.labels_

# Plot SSE against k to look for the "elbow"
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

It seems that the curve starts to flatten at cluster 5 so I moved forward with 6 clusters. Here are those clusters plotted on a scatterplot matrix of the 6 PCA components:

Though not perfect, it does seem that I have identified 6 distinct groups that should result in differences in aisle purchase history. The best way to check this of course is to look at each cluster’s aisle share for unit purchases. Below is a heatmap for share of purchases by aisle for the top 20 Instacart aisles:

There are clear differences between the 6 clusters, the most obvious being the relative amounts of fresh fruits and fresh vegetables they buy. This makes sense given that produce makes up over 25% of Instacart unit sales. The differences for each cluster may be better brought out by looking at each one individually, which I do in the charts below:

Cluster 0 are heavy vegetable shoppers, cluster 1 seems to mainly use Instacart for beverages, cluster 2 has a more balanced palate, cluster 3 prefers packaged produce, cluster 4 are heavy fruit shoppers and cluster 5 buys fresh fruits and vegetables almost equally. Differences may also be noticed by looking into the less frequently bought aisles. For instance, "baby food formula" is the 8th most purchased aisle for cluster 5 but does not appear in the top 10 for any other cluster.

Also of interest for Instacart’s business is the size of these clusters in terms of number of users and buying power. The table below shows the percent of users belonging to each cluster in the left column and the percent of unit purchases belonging to each cluster on the right.
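A summary of this kind could be computed along the following lines. This is only a sketch with made-up numbers: the `user_clusters` and `user_units` frames and their column names are hypothetical and do not reproduce the article’s figures.

```python
import pandas as pd

# Hypothetical inputs: a user -> cluster mapping and units bought per user.
user_clusters = pd.DataFrame(
    {"user_id": [1, 2, 3, 4], "cluster": [0, 5, 5, 1]}
)
user_units = pd.DataFrame(
    {"user_id": [1, 2, 3, 4], "units": [20, 50, 25, 5]}
)

# Aggregate users and units per cluster, then convert both to percentages.
merged = user_clusters.merge(user_units, on="user_id")
summary = merged.groupby("cluster").agg(
    users=("user_id", "nunique"), units=("units", "sum")
)
summary["pct_users"] = 100 * summary["users"] / summary["users"].sum()
summary["pct_units"] = 100 * summary["units"] / summary["units"].sum()
print(summary[["pct_users", "pct_units"]])
```

Comparing the two percentage columns side by side is what reveals clusters that buy more (or less) than their headcount alone would suggest.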

Interestingly, cluster 5 represents about 35% of the users but almost 50% of the unit purchases. Recall that this cluster’s most bought aisles were fresh fruits and fresh vegetables, in roughly equal amounts, and that it also featured baby food formula in its top 10 aisles. This suggests that this cluster may contain users shopping on Instacart for families with babies and young children, who appear to be Instacart’s most important customers. An entire project could be devoted to carrying this analysis further to isolate Instacart’s best customers! At this point, however, I move forward to creating the product recommender.

Association Rule Mining

With the 200,000 users broken up into clusters, I was ready to perform basket analysis via association rule mining on orders. This worked by splitting the total orders table into 6 tables for the 6 different clusters and finding association rules between products. Association rules specify the relationships products have with each other in terms of how likely they are to be bought together in the same order.

Three of the common metrics for association rules are support, confidence and lift. Support is simply how frequently an itemset appears in a dataset and is computed by dividing the number of transactions containing the itemset by the size of the dataset (via Wikipedia):
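In the standard notation, which is assumed here rather than taken from the article, T is the set of all transactions (orders) and X is an itemset, so support can be written as:

```latex
\[
\operatorname{supp}(X) = \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert}
\]
```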

Confidence is the proportion of transactions containing one item that also contain another item. It is computed by dividing the support of the two items together by the support of the first item alone (via Wikipedia):
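Using the usual notation (an assumption of this sketch), where supp denotes support and X ⇒ Y is the rule "orders containing X also contain Y", confidence can be written as:

```latex
\[
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}
\]
```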

Lift is the ratio of the observed frequency of two or more items appearing together over the frequency that would be expected if they were bought independently. It indicates whether the items occur together more often than they would by chance, with a value greater than one indicating that they do. Formula below:
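Using the usual notation (an assumption of this sketch), where supp denotes support and X and Y are the two itemsets, lift can be written as:

```latex
\[
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\]
```

The denominator is the support the pair would have if X and Y were independent, which is why a lift of exactly one corresponds to no association.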

I computed these metrics by cluster for all items over a minimum support of 0.01% using a Python script employing generators. This was necessary given the size of the dataset (3.3 million orders containing about 50,000 products). The table below shows the output sorted by lift for cluster 5:
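The article’s script is not reproduced here, but the general idea can be sketched with the standard library alone. The functions `iter_pairs` and `pair_rules`, the toy orders and the output field names are all hypothetical. The sketch only covers pairs of products, and it streams the pairs through a generator so that they are counted without ever being stored as one large list.

```python
from collections import Counter
from itertools import combinations


def iter_pairs(orders):
    """Lazily yield every unordered product pair that shares an order."""
    for basket in orders:
        # Sort and deduplicate so each pair has one canonical form.
        yield from combinations(sorted(set(basket)), 2)


def pair_rules(orders, min_support=0.0001):
    """Return support, confidence and lift for pairs above min_support."""
    orders = [set(basket) for basket in orders]
    n_orders = len(orders)
    item_counts = Counter(item for basket in orders for item in basket)
    pair_counts = Counter(iter_pairs(orders))  # consumes the generator

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n_orders
        if support < min_support:
            continue
        supp_a = item_counts[a] / n_orders
        supp_b = item_counts[b] / n_orders
        rules.append({
            "item_a": a,
            "item_b": b,
            "support": support,
            "confidence_a_to_b": support / supp_a,
            "confidence_b_to_a": support / supp_b,
            "lift": support / (supp_a * supp_b),
        })
    # Highest lift first, mirroring the sorted output described above.
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


toy_orders = [
    ["milk", "yogurt"],
    ["milk", "yogurt", "banana"],
    ["banana", "chips"],
    ["milk", "banana"],
]
for rule in pair_rules(toy_orders, min_support=0.25):
    print(rule)
```

Lift is symmetric in the two items, so one record per pair is enough; only confidence depends on direction, which is why both directions are stored.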

As can be seen, the highest lift values in the entire dataset belong to pairs of products that are similar to each other, as might be expected.

Product Recommender

To generate the product recommendations to be displayed on a product’s page, I wrote a Python script that takes in user_id, product_id, a desired lift cutoff and the num_products to be returned. With these inputs it determines the cluster of the user (stored in a dataframe outputted from the cluster analysis), filters the dataframe containing the product association rules for that cluster and returns the specified number of products with a lift greater than the lift input, prioritizing the items with the greatest lift. If fewer items than num_products meet the criteria, it will return all products that do meet the criteria. The code for the recommender may be found in the GitHub repository, linked at the end of the article.
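The logic described above can be sketched as follows. This is not the code from the repository: the `recommend` function, the `antecedent`, `consequent` and `lift` column names and the toy rule tables are assumptions made for illustration.

```python
import pandas as pd


def recommend(user_id, product_id, min_lift, num_products,
              user_clusters, rules_by_cluster):
    """Return up to num_products products with lift above min_lift."""
    # Look up the user's cluster, then that cluster's association rules.
    cluster = int(user_clusters.loc[user_id, "cluster"])
    rules = rules_by_cluster[cluster]
    # Keep rules for the viewed product that clear the lift cutoff.
    matches = rules[(rules["antecedent"] == product_id)
                    & (rules["lift"] > min_lift)]
    # Highest lift first; head() returns fewer rows if fewer qualify.
    top = matches.sort_values("lift", ascending=False).head(num_products)
    return top["consequent"].tolist()


# Hypothetical lookup tables standing in for the clustering and rule outputs.
user_clusters = pd.DataFrame(
    {"cluster": [0, 1]}, index=pd.Index([10, 11], name="user_id")
)
rules_by_cluster = {
    0: pd.DataFrame({
        "antecedent": ["milk", "milk", "milk"],
        "consequent": ["yogurt", "cheese", "soap"],
        "lift": [3.2, 1.8, 0.7],
    }),
    1: pd.DataFrame({
        "antecedent": ["milk"], "consequent": ["soap"], "lift": [0.9],
    }),
}

print(recommend(10, "milk", 1.0, 5, user_clusters, rules_by_cluster))
print(recommend(11, "milk", 1.0, 5, user_clusters, rules_by_cluster))
```

With these toy tables, the first user gets two recommendations even though five were requested, and the second user gets an empty list because no rule in that cluster clears the lift cutoff.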

Below I show the recommender in action, with the outputs for "organic whole milk" for each of the 6 clusters, limited to 5 items.

cluster 0['Whole Milk Plain Yogurt', 'YoBaby Blueberry Apple Yogurt', 'Organic Muenster Cheese', 'Apples + Strawberries Organic Nibbly Fingers', 'Yo Toddler Organic Strawberry Banana Whole Milk Yogurt']

cluster 1['Beef Tenderloin Steak', 'Original Dairy Free Coconut Milk', 'Pumpkin & Blueberry Cruncy Dog Treats', 'MRS MEYERS   12.5OZ HANDSOAP RHUBAR', 'Organic Sprouted Whole Grain Bread']

cluster 2['Whole Milk Plain Yogurt', 'Organic Skim Milk', "Elmo Mac 'n Cheese With Carrots & Broccoli", 'Kids Sensible Foods Broccoli Littles', 'Apples + Strawberries Organic Nibbly Fingers']

cluster 3['Organic Nonfat Milk', 'Paneer', 'Organic Whole Milk Yogurt', 'Organic Plain Yogurt', 'Extra Light Olive Oil']

cluster 4['Puffed Corn Cereal', 'String Cheese, Mozzarella', 'Cold-Pressed Sweet Greens & Lemon Juice', 'Organic Stage 2 Broccoli Pears & Peas Baby Food', 'Superberry Kombucha']

cluster 5['Whole Milk Plain Yogurt', 'YoTot Organic Raspberry Pear Yogurt', 'Organic Atage 3 Nibbly Fingers Mangoes Carrots', "Elmo Mac 'n Cheese With Carrots & Broccoli", 'Blueberry Whole Milk Yogurt Pouch']

The above lists all contain the products with the highest lift associated with organic whole milk for each cluster. What may stick out is that cluster 1’s recommendations don’t make as much intuitive sense as the other clusters’ recommendations. This is most likely because this cluster makes up less than 1% of unit purchases and less than 2% of users, and seems to leverage Instacart specifically for non-milk beverages. Further work would be required to determine if fewer clusters would be optimal, but generating non-intuitive recommendations isn’t so much of an issue considering users from this group are not likely to view a milk product anyway. For another example, on a less general-use product, here are the results for "Mild Salsa Roja":

cluster 0['Thin & Light Tortilla Chips', 'Red Peppers', 'Organic Lemon', 'Organic Cucumber', 'Organic Grape Tomatoes']
cluster 2['Real Guacamole', 'Thin & Light Tortilla Chips', 'Original Hummus', 'Organic Reduced Fat 2% Milk', 'Thick & Crispy Tortilla Chips']
cluster 4['Organic Raspberries', 'Banana', 'Organic Strawberries', 'Organic Hass Avocado']
cluster 5['Thin & Light Tortilla Chips', 'Organic Large Brown Grade AA Cage Free Eggs', 'Organic Reduced Fat 2% Milk', 'Organic Large Grade AA Brown Eggs', 'Thick & Crispy Tortilla Chips']

Cluster 1 and cluster 3 did not have any items with a lift over 1 for this product.

That’s it for now! The link to the GitHub with the Jupyter Notebooks is here.

