Background
Retailers typically have a wealth of customer transaction data, consisting of the types of items purchased by a customer, their value and the date they were purchased. Unless the retailer has a loyalty rewards program, it may not have demographic information on its customers such as height, age, gender and address. Thus, suggestions on what a customer might want to buy in the future, i.e. which products to recommend, have to be based on that customer's purchase history and on the purchase histories of other customers.
In collaborative filtering, recommendations are made to customers by finding similarities between the purchase histories of customers. So, if Customers A and B both purchase Product A, but Customer B also purchases Product B, then Customer A may also be interested in Product B. This is a very simple example, and there are various algorithms that can be used to work out how similar customers are in order to make recommendations.
One such algorithm is k-nearest neighbour where the objective is to find k customers that are most similar to the target customer. It involves choosing a k and a similarity metric (with Euclidean distance being most common). The basis of this algorithm is that points that are closest in space to each other are also likely to be most similar to each other.
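To make this concrete, here is a toy sketch in base R (all purchase counts are made up for illustration): each customer's nearest neighbours are the rows with the smallest Euclidean distance.

```r
# Toy purchase-count matrix: rows are customers, columns are products.
# All values are made up for illustration.
purchases <- matrix(
  c(2, 0, 1,   # Customer A
    2, 1, 1,   # Customer B
    0, 5, 0),  # Customer C
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("Candle", "Matches", "Torch"))
)

# Pairwise Euclidean distances: the smallest off-diagonal distance in a
# customer's row identifies their nearest neighbour.
d <- as.matrix(dist(purchases, method = "euclidean"))
d["A", ]  # B is much closer to A than C is
```

Here B would be A's nearest neighbour (k = 1), so B's extra purchases become candidate recommendations for A.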
Another technique is basket analysis, or association rules. In this method, the aim is to find out which items are bought together (put in the same basket) and how frequently. The output of this algorithm is a series of if-then rules, e.g. if a customer buys a candle, then they are also likely to buy matches. Association rules can assist retailers with the following:
- Modifying store layout where associated items are stocked together;
- Sending emails to customers with recommendations on products to purchase based on their previous purchase (i.e. we noticed you bought a candle, perhaps these matches may interest you?); and
- Insights into customer behaviour
Let’s now apply association rules to a dummy dataset.
The dataset
A dataset of 2,178,282 observations/rows and 16 variables/features was provided.
The first thing I did with this dataset was a quick check for missing values or NAs; none were found.
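A sketch of that check, shown here on a tiny made-up data frame (in the post it runs on the full `retail` data):

```r
# Toy data frame (values made up); the real check runs on `retail`.
retail_toy <- data.frame(
  SalesGross = c(25.0, 12.5, 80.0),
  StoreState = c("VIC", "QLD", "VIC")
)

# Count NAs per column; all zeros means no missing values.
colSums(is.na(retail_toy))

# One-line check for the whole data frame.
any(is.na(retail_toy))  # FALSE when the data is complete
```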

The variables were all read in as either numeric or string variables. To interpret categorical variables meaningfully, they need to be converted to factors. As such, the following changes were made.
retail <- retail %>%
  mutate(MerchCategoryName = as.factor(MerchCategoryName)) %>%
  mutate(CategoryName = as.factor(CategoryName)) %>%
  mutate(SubCategoryName = as.factor(SubCategoryName)) %>%
  mutate(StoreState = as.factor(StoreState)) %>%
  mutate(OrderType = as.factor(OrderType)) %>%
  mutate(BasketID = as.numeric(BasketID)) %>%
  mutate(MerchCategoryCode = as.numeric(MerchCategoryCode)) %>%
  mutate(CategoryCode = as.numeric(CategoryCode)) %>%
  mutate(SubCategoryCode = as.numeric(SubCategoryCode)) %>%
  mutate(ProductName = as.factor(ProductName))
Then, all the numeric variables were summarised (min, quartiles, median, mean and max) to identify any outliers within the data. This summary revealed that the features MerchCategoryCode, CategoryCode, and SubCategoryCode contained a large number of NAs. Upon further inspection, it was found that the majority of these code values contained digits; however, the ones that had been converted to NAs contained characters such as "Freight" or the letter "C". As these codes are not related to customer purchases, these observations were removed.
Negative gross sales and negative quantity indicate either erroneous values or customer returns. This may be interesting information; however, it is not related to our objective of analysis and as such these observations were omitted.
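With dplyr (already used above), the clean-up steps just described might look like the following sketch; the toy data frame and the `Quantity` column name are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for the real data (all values made up for illustration).
retail <- data.frame(
  MerchCategoryCode = c(101, NA, 102),  # NA arose from codes like "Freight" or "C"
  CategoryCode      = c(1, NA, 2),
  SubCategoryCode   = c(11, NA, 21),
  SalesGross        = c(25.0, 10.0, -25.0),  # negative value = customer return
  Quantity          = c(1, 2, -1)
)

retail <- retail %>%
  filter(!is.na(MerchCategoryCode),     # drop rows whose codes failed
         !is.na(CategoryCode),          # numeric conversion
         !is.na(SubCategoryCode)) %>%
  filter(SalesGross > 0, Quantity > 0)  # drop returns / erroneous rows

nrow(retail)  # only the first toy row survives
```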
Data Exploration
It is always a good idea to explore the data to see if you can see any trends or patterns within the dataset. Later on, you can use an algorithm/machine learning model to validate these trends.
The graph below shows that the highest number of transactions comes from Victoria, followed by Queensland. If the retailer wants to know where to increase sales, this plot may be useful, as the number of sales is proportionately low in all other states.

The plot below shows that most gross sales values fall between $0 and $40 (the median is $37.60).

We can also break this plot down by state, as below. However, the transactions from Victoria and Queensland obscure the information for the other states; boxplots may be better for visualisation.

The boxplots below (though hard to read, as the outliers stretch the scale) show that most sales across all states are close to the overall median. There is an abnormally high outlier for NT and a couple for VIC. For our purpose, since we are only interested in understanding which products customers buy together in order to make recommendations, we do not need to deal with these outliers.

Now that we have had a look at sales by state, let’s try to get a better understanding of the products purchased by customers.
The plot below is coloured based on the frequency of purchases per item. Lighter shades of blue indicate higher frequencies.
Some key takeaways are:
- No sales for team sports in ACT, NSW, SA, and WA – could be due to these products not being stocked there or perhaps they need to be marketed better
- No sales for ski products in ACT, NSW, SA, and WA. I find this quite surprising as NSW and ACT are quite close to some major ski resorts like Thredbo. It is also odd that there are ski product sales in QLD, which experiences a warm climate throughout the year. Either these products have been mislabelled or they were not stocked in NSW and ACT.
- Paint and panel sales in WA only.
- Bike sales in VIC only.
- Camping and apparel recorded highest sales in VIC, followed by Gas, Fuel and BBQing.

Due to the distribution of sales by product and state, it appears that any association rules we come up with will mainly be based on sales from VIC and QLD. Furthermore, as not all products were stocked/sold in all states, it is expected that the association rules will be limited to a small number of products. However, since I have already embarked on this mode of analysis, let’s continue and see what we get.
We have two years worth of data, 2016 and 2017. So, I decided to compare the gross number of sales for the two years.
Despite the higher number of transactions in 2016 (2.5 times more than in 2017), mean gross sales were higher in 2017 than in 2016. This seems counter-intuitive, so I decided to dig deeper by looking at monthly sales.
| Year | # of Transactions | Mean Gross Sales ($) |
|------|-------------------|----------------------|
| 2016 | 1,481,922         | 69.0                 |
| 2017 | 593,315           | 86.0                 |
In 2016, the highest numbers of sales were recorded in January and March, with steep declines from September to November and then an increase in December. Transactions continued to decline through 2017, again with an increase in December (the Christmas season).
Deduction: As the highest numbers of sales are for camping, apparel, and BBQ & gas products, it makes sense that sales for these products are high during the holiday season.
Recommendation to the retailer: Explore whether stores have sufficient stock of these products in December-January, as they are the most popular.
Deduction: Despite the steady decline in the number of transactions, mean gross sales continued to increase month on month, peaking in December 2017. This indicates that fewer customers made purchases, but those who did bought higher-value products.
Recommendation: Investigate what the retailer can do to encourage a steady stream of purchases throughout the year, rather than a trend that peaks at year end, since the retailer pays overheads and employee salaries, among other running costs, all year round.
Basket Analysis/Association Rules
Let’s go back to our objective.
Aim: To determine which products are customers likely to buy together in order to make recommendations for products
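Before mining, the data needs to be in arules' sparse transactions format. A minimal, self-contained sketch with made-up baskets (the real analysis uses read.transactions on the full dataset):

```r
library(arules)

# Toy baskets keyed by a basket ID (items made up for illustration).
baskets <- data.frame(
  BasketID    = c(1, 1, 2, 2, 3),
  ProductName = c("GAS BOTTLE REFILL 9KG*", "SNAP HOOK ALUMINIUM GRIPWELL",
                  "GAS BOTTLE REFILL 9KG*", "PEG TENT GALV 225X6.3MM P04G",
                  "6 PACK BUTANE - WILD COUNTRY")
)

# Split items by basket and coerce to the sparse transactions format.
tr <- as(split(baskets$ProductName, baskets$BasketID), "transactions")
summary(tr)  # 3 transactions over 4 distinct items
```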
I used the arules package and the read.transactions function to convert the dataset into a transactions object. A summary of this object gives the following output:
## transactions as itemMatrix in sparse format with
## 1019952 rows (elements/itemsets/transactions) and
## 21209 columns (items) and a density of 9.531951e-05
##
## most frequent items:
## GAS BOTTLE REFILL 9KG* GAS BOTTLE REFILL 4KG*
## 30628 11724
## 6 PACK BUTANE - WILD COUNTRY SNAP HOOK ALUMINIUM GRIPWELL
## 9209 7086
## PEG TENT GALV 225X6.3MM P04G (Other)
## 6948 1996372
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 546138 234643 109888 55319 30185 16656 9878 6018 3716 2332
## 11 12 13 14 15 16 17 18 19 20
## 1611 993 751 490 353 237 157 140 99 88
## 21 22 23 24 25 26 27 28 29 30
## 53 48 28 31 20 13 12 15 8 1
## 31 32 33 34 35 36 37 38 39 40
## 4 2 4 3 4 1 4 2 1 4
## 43 46
## 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.022 2.000 46.000
##
## includes extended item information - examples:
## labels
## 1 10
## 2 11
## 3 11/12
Based on the output above, we can conclude the following.
- There are 1,019,952 collections (baskets) of items and 21,209 distinct items.
- Density measures the proportion of non-zero cells in the sparse matrix, i.e. the total number of items purchased divided by the total number of cells in the matrix. You can recover how many items were purchased from the density: 1,019,952 × 21,209 × 0.0000953 ≈ 2,061,545.
- Element (itemset/transaction) length distribution: this tells you how many transactions there are for each basket size (1 item, 2 items, and so on). The first row gives the number of items and the second row the number of transactions.
- Majority of baskets (87%) consist of between 1 to 3 items.
- Minimum number of items in a basket = 1 and maximum = 46 (only one basket)
- The most popular items are gas bottle refills (9kg and 4kg), butane packs, the Gripwell aluminium snap hook, and galvanised tent pegs.
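The density arithmetic above can be checked directly:

```r
# Total items purchased = rows x columns x density of the sparse matrix.
n_rows  <- 1019952   # baskets
n_cols  <- 21209     # distinct items
density <- 0.0000953

round(n_rows * n_cols * density)  # ~2,061,545 items in total
```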
We can look at this information graphically via absolute frequency and relative frequency plots.
Both plots are in descending order of frequency of purchase. The absolute frequency plot tells us that the highest numbers of sales are for gas-related products. The relative frequency plot shows the proportion of transactions in which each item appears, so items that sit close together in the bar chart sell at similar rates. Thus, one recommendation to the retailer is to stock these products together in the store, or to send customers an EDM recommending products that are related in the plot and have not yet been purchased by the customer.
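These plots can be produced with arules' itemFrequencyPlot; a sketch on a tiny made-up transactions object (in the post, the plots use the full transactions object):

```r
library(arules)

# Toy transactions (items made up); the real plots use the full data.
tr <- as(list(c("GAS BOTTLE REFILL 9KG*", "6 PACK BUTANE - WILD COUNTRY"),
              c("GAS BOTTLE REFILL 9KG*"),
              c("PEG TENT GALV 225X6.3MM P04G")),
         "transactions")

# Top items by absolute purchase count.
itemFrequencyPlot(tr, topN = 3, type = "absolute")

# The same items as a share of all transactions.
itemFrequencyPlot(tr, topN = 3, type = "relative")
```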


The next step is to generate rules for our transaction object. The output is as follows.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1019
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21209 item(s), 1019952 transaction(s)] done [2.52s].
## sorting and recoding items ... [317 item(s)] done [0.04s].
## creating transaction tree ... done [0.84s].
## checking subsets of size 1 2 done [0.04s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.25s].
The above output shows us that 7 rules were generated.
Details of these rules are shown below.
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001128 Min. :0.5458 Min. : 26.30 Min. :1150
## 1st Qu.:0.001464 1st Qu.:0.6395 1st Qu.: 80.36 1st Qu.:1493
## Median :0.001650 Median :0.6634 Median :154.58 Median :1683
## Mean :0.001652 Mean :0.6759 Mean :154.48 Mean :1685
## 3rd Qu.:0.001668 3rd Qu.:0.7265 3rd Qu.:245.30 3rd Qu.:1701
## Max. :0.002524 Max. :0.7898 Max. :249.14 Max. :2574
##
## mining info:
## data ntransactions support confidence
## tr 1019952 0.001 0.5
Each of these rules has support, confidence, and lift values.
Let’s start with support, which is the proportion of all transactions used to generate the rules (i.e. 1,019,952) that contain the two items together. For example, a count of 1,150 gives a support of 1150/1019952 ≈ 0.0011, or 0.11%, where count is the number of transactions that contain both items.
Confidence is the proportion of transactions where the two items are bought together out of all transactions where the first item is purchased. In other words, it is the conditional probability of buying item B given that item A was purchased.
Mathematically, this looks like the following:
Confidence(A=>B) = P(A∩B) / P(A) = frequency(A,B) / frequency(A)
In the results above, confidence values range from 54% to 79%.
That is, for each rule, customers who buy item A go on to buy item B between 54% and 79% of the time, and buying item A has a positive effect on buying item B (as the lift values are all greater than 1).
Note: When I ran the algorithm, I experimented with higher support and confidence values, since the more transactions in which two items are bought together, the higher the confidence. However, with confidence set to 80% or more, I obtained zero rules.
This was expected due to the sparsity of the data, where 1-item baskets are most common and the majority of purchased items relate to camping or gas products.
Thus, the algorithm was run with the following parameters.
association.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.5, maxlen = 10))
Lift indicates how strongly two items are correlated with each other. A lift value greater than 1 indicates that buying item A makes a purchase of item B more likely. Mathematically, lift is calculated as follows.
Lift(A=>B) = Supp(A∩B) / (Supp(A) × Supp(B))
All our rules have lift values greater than 1, indicating that buying item A is likely to lead to a purchase of item B.
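These definitions can be sanity-checked against the first mined rule ({GAS BOTTLE 9KG POL CODE 2 DC} => {GAS BOTTLE REFILL 9KG*}) using the counts reported by arules:

```r
# Numbers taken from the arules output for the first rule.
n_trans    <- 1019952
count      <- 1683       # transactions containing both items
confidence <- 0.7897701

# support = P(A and B) = count / total transactions
support <- count / n_trans  # ~0.00165, matching the output

# lift = confidence / P(B); P(B) for the 9kg refill follows from its
# item frequency (30,628 transactions contain it).
p_b  <- 30628 / n_trans
lift <- confidence / p_b    # ~26.3, matching the reported lift
```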
Rules inspection
Let’s now inspect the rules.
## lhs rhs support confidence lift count
## [1] {GAS BOTTLE 9KG POL CODE 2 DC} => {GAS BOTTLE REFILL 9KG*} 0.001650078 0.7897701 26.30036 1683
## [2] {WEBER BABY Q (Q1000) ROASTING TRIVET} => {WEBER BABY Q CONVECTION TRAY} 0.001127504 0.6526674 241.45428 1150
## [3] {GAS BOTTLE 2KG CODE 4 DC} => {GAS BOTTLE REFILL 2KG*} 0.001344181 0.7308102 154.58137 1371
## [4] {GAS BOTTLE 4KG POL CODE 2 DC} => {GAS BOTTLE REFILL 4KG*} 0.001583408 0.7222719 62.83544 1615
## [5] {YTH L J PP THERMAL OE} => {YTH LS TOP PP THERMAL OE} 0.001667726 0.6634165 249.13587 1701
## [6] {YTH LS TOP PP THERMAL OE} => {YTH L J PP THERMAL OE} 0.001667726 0.6262887 249.13587 1701
## [7] {UNI L J PP THERMAL OE} => {UNI L S TOP PP THERMAL OE} 0.002523648 0.5458015 97.88840 2574
Interpretation of the first rule is as follows:
If a customer buys the 9kg gas bottle, there is a 79% chance that the customer will also buy its refill. This rule is supported by 1,683 transactions in the dataset.
Now, let’s look at these rules visually.
All rules have a confidence value greater than 0.5 with lift ranging from 26 to 249.
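Plots like these come from the arulesViz package. A self-contained sketch on made-up transactions (the post's plots use the seven rules mined above):

```r
library(arules)
library(arulesViz)

# Toy transactions (made up) just to have a rule to plot.
tr_toy <- as(list(c("CANDLE", "MATCHES"),
                  c("CANDLE", "MATCHES"),
                  c("CANDLE", "MATCHES"),
                  c("CANDLE")),
             "transactions")
rules_toy <- apriori(tr_toy,
                     parameter = list(supp = 0.5, conf = 0.8, minlen = 2))

# Scatter plot of support vs confidence, shaded by lift.
plot(rules_toy, method = "scatterplot")

# Parallel coordinates plot showing lhs -> rhs item flows.
plot(rules_toy, method = "paracoord")
```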

The parallel coordinates plot for the seven rules shows how the purchase of one product influences the purchase of another. The RHS is the item we propose the customer buy. For the LHS, position 2 is the most recent addition to the basket and position 1 is the item the customer previously added.
Looking at the first arrow, we can see that if a customer has the Weber Baby Q (Q1000) roasting trivet in their basket, then they are likely to purchase the Weber Baby Q convection tray.
These plots would be more useful if we could visualise baskets with more than two items.

Wrapping up
You have now learnt how to make recommendations to customers based on which items are most frequently purchased together, using apriori rules. However, there are some important things to note about this analysis.
- The most popular/frequent items have confounded the analysis to some extent: it appears that we can only make confident recommendations from seven association rules. This is due to the uneven distribution of basket sizes and item frequencies.
- Customer segmentation may be another approach for this dataset, where customers are grouped by spend (SalesGross), product type (e.g. CategoryCode), store state (StoreState), and time of sale (month/year). However, it would be useful to have more features on customers to do this effectively.
Reference: https://www.datacamp.com/community/tutorials/market-basket-analysis-r
Code and dataset: https://github.com/shedoesdatascience/basketanalysis