Retailers have access to an unprecedented volume of shopper transaction data. As shopping has become more electronic, records of every purchase are neatly stored in databases, ready to be read and analyzed. With such an arsenal of data at their disposal, retailers can uncover patterns of consumer behavior.
What is Market Basket Analysis?
A market basket analysis is a set of affinity calculations meant to determine which items sell together. For example, a grocery store may use market basket analysis to determine that consumers typically buy hot dogs and hot dog buns together.
If you’ve ever gone onto an online retailer’s website, you’ve probably seen a recommendation on a product’s page phrased as "Customers who bought this item also bought" or "Customers buy these together". More than likely, the online retailer performed some sort of market basket analysis to link the products together.
A savvy retailer can leverage this knowledge to inform decisions on pricing, promotions, and store layouts. The aforementioned grocery store might put hot dogs on sale while raising the margin on hot dog buns. Customers would buy more hot dogs and feel they had found a bargain, while the store would sell more product and raise its revenue.
The Math behind Market Basket Analysis
For every combination of items purchased, three key statistics are calculated: support, confidence, and lift.
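Support is the general popularity of an item relative to all purchases: the fraction of transactions that contain it. In a grocery store, milk would have a high support, because many shoppers buy it every trip. Support is given as a number between 0 and 1. Mathematically:
$$\mathrm{support}(A) = \frac{\text{number of transactions containing } A}{\text{total number of transactions}}$$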
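Confidence is the conditional probability that a customer who bought Product A also bought Product B. There would likely be a high confidence between marshmallows and graham crackers, because they’re often bought together for s’mores. Confidence is given as a number between 0 and 1. Mathematically, writing support(A ∪ B) for the fraction of transactions that contain both products:
$$\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}$$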
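Lift measures how much more likely Product B is to be bought when Product A is bought, compared with how often Product B is bought overall. There might be a high lift between hamburger patties and buns, because as more patties are bought, they drive the sale of buns. Mathematically, writing support(A ∪ B) for the fraction of transactions that contain both products:
$$\mathrm{lift}(A \rightarrow B) = \frac{\mathrm{confidence}(A \rightarrow B)}{\mathrm{support}(B)} = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)}$$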
Lift is a bit unusual compared to the other two measures. Instead of being bounded between 0 and 1, it can be any non-negative number and is interpreted relative to 1:
- lift = 1 suggests no relationship between the products
- lift > 1 suggests a positive relationship between the products
- lift < 1 suggests a negative relationship between the products
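As a minimal sketch of the three measures, the snippet below computes them by hand for a handful of made-up baskets. The transactions and the helper names (support, confidence, lift) are illustrative assumptions and do not come from any library.

```python
# Toy baskets invented for illustration; each set is one transaction.
transactions = [
    {"hot dogs", "buns", "mustard"},
    {"hot dogs", "buns"},
    {"milk", "buns"},
    {"milk"},
    {"hot dogs", "milk"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(a, b, transactions):
    """Estimate of P(b | a): joint support of a and b over support of a."""
    return support(set(a) | set(b), transactions) / support(a, transactions)


def lift(a, b, transactions):
    """Confidence of a -> b relative to the baseline support of b."""
    return confidence(a, b, transactions) / support(b, transactions)


a, b = {"hot dogs"}, {"buns"}
print(f"support(hot dogs)            = {support(a, transactions):.3f}")
print(f"confidence(hot dogs -> buns) = {confidence(a, b, transactions):.3f}")
print(f"lift(hot dogs -> buns)       = {lift(a, b, transactions):.3f}")
```

In this toy data, hot dogs appear in three of five baskets and together with buns in two, so the lift comes out slightly above 1.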
The Apriori Algorithm

By far the most common approach to Market Basket Analysis is the Apriori Algorithm. First proposed in 1994 by Agrawal and Srikant, the algorithm is historically important because it gave retailers a practical way to find associations in transaction data.
While still widely used, the Apriori Algorithm suffers from long computation times on larger data sets. Thankfully, most implementations accept minimum thresholds for support and confidence, and let you cap the number of items considered in a single rule, which reduces processing time.
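To make the role of those thresholds concrete, here is a simplified sketch of the level-wise search at the heart of Apriori. It is not how efficient-apriori is implemented; the function name frequent_itemsets and its max_length parameter are assumptions for illustration. The key idea is that an itemset can only be frequent if all of its subsets are frequent, so candidates are built only from itemsets that already passed the support threshold.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_length=3):
    """Level-wise search returning {itemset: support} for frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def supp(itemset):
        return sum(1 for basket in baskets if itemset <= basket) / n

    # Level 1: single items that meet the support threshold.
    current = {}
    for item in {item for basket in baskets for item in basket}:
        s = supp(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current and k <= max_length:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(cand, k - 1)):
                s = supp(cand)
                if s >= min_support:
                    next_level[cand] = s
        frequent.update(next_level)
        current = next_level
        k += 1
    return frequent


# Made-up baskets for illustration.
baskets = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b",), ("a", "b", "c")]
result = frequent_itemsets(baskets, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

Raising min_support shrinks every level, which is why the thresholds have such a large effect on run time.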
For demonstration, I’ll use the Python implementation called efficient-apriori. Note that this library is rated for Python 3.6 and 3.7. Older versions of Python may use apyori, which supports 2.7 and 3.3–3.5.
Data Processing
To show the application of the Apriori Algorithm, I’ll use a data set of transactions from a bakery available on Kaggle.
import pandas as pd
import numpy as np
# Read the data
df = pd.read_csv("BreadBasket_DMS.csv")
# eliminate lines with "NONE" in items
df = df[df["Item"] != "NONE"]
After the usual imports of pandas and NumPy to help process the data, the previously saved CSV file is read into a DataFrame.
A few lines of the data contain "NONE" in the Item column, which isn’t particularly helpful, so those are filtered out.
# Create an empty list for data processing
transaction_items = []
# Get an array of transaction numbers
transactions = df["Transaction"].unique()
for transaction in transactions:
    # Get an array of unique items for this transaction number
    items = df[df["Transaction"] == transaction]["Item"].unique()
    # Add the items to the list as a tuple
    transaction_items.append(tuple(items))
Unlike a lot of other libraries which support Pandas DataFrames out of the box, efficient-apriori needs the transaction lines as a series of tuples in a list.
To create this data structure, a list of unique Transaction ID numbers is collected. For every Transaction ID, a NumPy array of the unique items associated with that ID is built. Finally, each array is converted into a tuple and appended to a list.
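The loop above filters the whole DataFrame once per transaction, which can be slow on large files. As a hedged alternative sketch, the same list of tuples can be built in a single pass over (transaction, item) pairs. The rows below are made up, and the name rows is a stand-in for the Transaction and Item columns of the CSV.

```python
# Hypothetical rows standing in for the (Transaction, Item) columns of the CSV.
rows = [
    (1, "Bread"),
    (2, "Scandinavian"),
    (2, "Scandinavian"),
    (3, "Hot chocolate"),
    (3, "Jam"),
    (3, "Cookies"),
    (4, "NONE"),
]

# Map each transaction number to its items. A dict used as an ordered set
# drops duplicates while keeping first-seen order.
grouped = {}
for transaction, item in rows:
    if item == "NONE":  # same filter as applied to the DataFrame earlier
        continue
    grouped.setdefault(transaction, {})[item] = None

transaction_items = [tuple(items) for items in grouped.values()]
print(transaction_items)
```

With the DataFrame from earlier, df[["Transaction", "Item"]].itertuples(index=False, name=None) should yield pairs in this shape.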
Performing the Apriori Algorithm
# import apriori algorithm
from efficient_apriori import apriori
# Calculate support, confidence, & lift
itemsets, rules = apriori(transaction_items, min_support=0.05, min_confidence=0.1)
After importing the library, the Apriori Algorithm can be run in a single line. Note the min_support and min_confidence arguments, which set the minimum support and confidence a rule must reach to be returned. Suitable values differ between types of data. Setting them too high won’t produce any results. Setting them too low will give too many results and will take a long time to run.
It’s a Goldilocks problem which requires some trial and error to determine. For particularly large data sets, some preliminary calculations for support may be required to determine a good baseline.
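One way to do that preliminary calculation is to compute the support of each individual item before running the algorithm. The sketch below assumes transactions in the same list-of-tuples shape that efficient-apriori takes; the baskets themselves are made up for illustration.

```python
from collections import Counter

# Made-up baskets in the list-of-tuples shape used for transaction_items.
transaction_items = [
    ("Coffee", "Cake"),
    ("Coffee",),
    ("Bread",),
    ("Coffee", "Bread"),
    ("Tea",),
    ("Coffee", "Cake", "Bread"),
    ("Bread",),
    ("Coffee",),
]

n = len(transaction_items)
# Count each item once per transaction, then convert counts to supports,
# ordered from most to least common.
counts = Counter(item for basket in transaction_items for item in set(basket))
item_support = {item: count / n for item, count in counts.most_common()}

for item, s in item_support.items():
    print(f"{item:<8}{s:.3f}")
```

Because a combination of items can never be more common than its rarest member, these single-item supports give an upper bound on what any rule can reach, and so a ceiling for a sensible min_support.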
For this particular data set, most transactions contain only a single item. While an interesting result in and of itself, it means the minimum values for support and confidence need to be set relatively low.
# print the rules and corresponding values
for rule in sorted(rules, key=lambda rule: rule.lift):
    print(rule)
Finally, each rule in the rules variable is printed, sorted by ascending lift. The results should look something like the below:
{Coffee} -> {Cake} (conf: 0.114, supp: 0.055, lift: 1.102, conv: 1.012)
{Cake} -> {Coffee} (conf: 0.527, supp: 0.055, lift: 1.102, conv: 1.103)
To better understand the output, look at the two lines specifying the rules for coffee and cake. Both lines give the confidence (conf), the support (supp), and the lift, but the direction of the rule is reversed between them. The conv value is conviction, a fourth measure not covered here.
In the first line, the probability is that of cake conditional on coffee, while in the second line it is that of coffee conditional on cake. In other words, of the customers who bought coffee, only about 11% also bought cake. Of the customers who bought cake, however, about 53% also bought coffee. This is why there’s a difference in confidence values.
It’s a subtle, but important difference to understand.
In addition, both lift values are greater than 1 (about 1.10), suggesting a positive, if modest, association: customers who buy cake are somewhat more likely than average to buy coffee, and vice versa.
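The two rules report the same lift because lift is symmetric in the two products, while confidence is not. A short derivation, writing support(A ∪ B) for the fraction of transactions containing both products, followed by a rough check that uses only the rounded values printed above:

```latex
\begin{aligned}
\mathrm{lift}(A \rightarrow B)
  &= \frac{\mathrm{confidence}(A \rightarrow B)}{\mathrm{support}(B)}
   = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)} \\
  &= \frac{\mathrm{confidence}(B \rightarrow A)}{\mathrm{support}(A)}
   = \mathrm{lift}(B \rightarrow A) \\[6pt]
\mathrm{lift}
  &= \frac{\mathrm{confidence}(A \rightarrow B)\,\mathrm{confidence}(B \rightarrow A)}
          {\mathrm{support}(A \cup B)}
   \approx \frac{0.114 \times 0.527}{0.055} \approx 1.09
\end{aligned}
```

The small gap to the reported 1.102 is presumably due to the three-decimal rounding of the printed values.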
With this understanding, the bakery could take advantage of this analysis by:
- Placing coffee and cake closer together on the menu board
- Offering a meal deal with cake and a coffee
- Running a coupon campaign on cake to drive the sale of coffee
Conclusions
Market basket analysis is a set of calculations meant to help businesses understand the underlying patterns in their sales. Certain complementary goods are often bought together, and the Apriori Algorithm can uncover them.
Understanding how products sell can inform everything from promotions to cross-selling to recommendations. While the examples I gave were primarily retail-driven, any industry can benefit from better understanding how its products move.