Basket Analysis

Confirming the Trivial, Uncovering the Mysterious, Exploiting the Useful

What is Basket Analysis?

Do all these items belong in the same customer basket? – Photo by Julius Drost on Unsplash

Basket analysis is a pretty simple concept. Imagine customers shopping at a grocery store. They grab items off the shelf, walk to the register and pay. You then collect all the receipts from your customers and create a list, customer by customer, of what they bought (i.e., what was in their basket or shopping cart). You might end up with something like the following with {…} denoting items in a customer’s basket:

  • Customer A: {peanut butter, jelly, bread, rice, cereal, milk}
  • Customer B: {peanut butter, jelly, herbs, potatoes}
  • Customer C: {peanut butter, bread, cereal, milk}
  • Customer D: {milk, cereal}

Looking at this collection of purchases, there are a few associations that are obvious:

  • {peanut butter} => {jelly} with 67% probability (i.e., 2 of the 3 baskets that had peanut butter in them also had jelly)
  • {peanut butter} => {bread} with 67% probability (i.e., 2 of the 3 baskets that had peanut butter in them also had bread)
  • {bread} => {peanut butter} with 100% probability (i.e., every basket that had bread in it also had peanut butter)
  • {bread, cereal} => {milk} with 100% probability (i.e., every basket that had bread and cereal also had milk)
  • {cereal, milk} => {bread} with 67% probability (i.e., 2 of the 3 baskets that had cereal & milk also had bread)

These are called association rules or simply associations. They tell us which products tend to be selected together and how knowing that one product is in a basket helps predict what else may be in it. This type of information is incredibly useful (more on this soon).

The challenge with discovering associations is the sheer number of possible combinations. Imagine a store with 100 different products. A customer with a single item in their basket can have 100 possible combinations. A customer with two products can have (100 × 99)/2 = 4,950 possible combinations. If the number of items goes up to 3 then there are (100 × 99 × 98)/(3 × 2) = 161,700 possible combinations. With four products it goes up to 3,921,225 possible combinations. This is the classic combinatorial problem where you have a set of N items and choose R from that set; the equation that gives the number of possible combinations is N!/[R!(N-R)!]. Summed over all basket sizes, the number of possible itemsets is exponential in the number of products (2^N) – which makes it impossible to analyze the data to identify associated products unless you use a clever algorithm; trying to exhaustively examine every possible combination is not feasible no matter your computational resources.
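
You can check these numbers yourself with base R’s built-in choose() function; a quick sketch:

# Number of possible R-item baskets drawn from 100 products
choose(100, 2)    # 4,950
choose(100, 3)    # 161,700
choose(100, 4)    # 3,921,225
# Total number of possible non-empty baskets of any size
2^100 - 1         # roughly 1.27e30 – exhaustive search is hopeless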

The Apriori Algorithm

Fortunately there exists a clever algorithm: the "Apriori algorithm", which is designed to rapidly evaluate very large datasets by making a simple a priori assumption. For a set of items to be considered associated, the set must appear with sufficient frequency. And if a set of items appears frequently, so must every subset of those items. This means we can eliminate a very large number of candidate associations simply by discarding any itemset that contains an infrequent subset. In our trivial example, only a single customer purchased rice, but 3 customers purchased peanut butter. This means that rice appeared only 25% of the time versus peanut butter, which appeared 75% of the time. Looking over the list we see the following frequencies for each product in a customer’s basket:

  • Cereal => 75%
  • Milk => 75%
  • Peanut butter => 75%
  • Jelly => 50%
  • Bread => 50%
  • Rice => 25%
  • Herbs => 25%
  • Potatoes => 25%

Since rice, herbs and potatoes appear relatively infrequently, we can eliminate them from consideration because combinations of these items plus other items will happen infrequently as well. This reduces the search space from 8 items to only 5 items, which reduces the number of possible sets from 256 to just 32. The Apriori algorithm applies this assumption to all possible subsets (e.g., {peanut butter, jelly}, {milk, cereal}) to rapidly eliminate entire groups of potential associations, which makes it feasible to process large datasets.
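
To see the pruning step in miniature, here is a hand-rolled R sketch using the four toy baskets from above (just an illustration of the idea – the arules package shown below does this for you at scale):

# The four toy baskets
baskets <- list(
  A = c("peanut butter", "jelly", "bread", "rice", "cereal", "milk"),
  B = c("peanut butter", "jelly", "herbs", "potatoes"),
  C = c("peanut butter", "bread", "cereal", "milk"),
  D = c("milk", "cereal")
)
items <- unique(unlist(baskets))
# Support of each item: the fraction of baskets that contain it
support <- sapply(items, function(i) mean(sapply(baskets, function(b) i %in% b)))
# Apriori pruning: any set containing an infrequent item is itself
# infrequent, so low-support items are discarded before pairs are even formed
names(support[support >= 0.5])   # keeps only the 5 frequent items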

I really recommend the interested reader purchase Machine Learning with R, which has an excellent chapter on the Apriori algorithm. Using this technique is laughably easy in R. As an illustration, let me show you conceptually how many lines of code are needed to apply this algorithm:

library(arules)
data <- read.csv(file = "segment.csv", sep = ",")
# Cross-tabulate accounts against segments and convert counts to logical flags
data_xtabs <- as.data.frame.matrix(xtabs(~ account_id + segment_name, data))
data_xtabs_logical <- as.data.frame(data_xtabs > 0)
transactions <- as(data_xtabs_logical, "transactions")
summary(transactions)
rules <- apriori(transactions, parameter = list(support = 0.01,
                 confidence = 0.5, minlen = 2))
summary(rules)

Although I strongly recommend you first understand the theory behind the Apriori algorithm before using it… yes, it really is that simple to use. Full disclosure: I did eliminate 3–4 lines from my example that aren’t important for our discussion, but the R code above is very representative and only 8 lines long. And of those 8 lines, one loads the arules library, four load and reshape the data, and two print summaries of the transactions and the rules respectively. You can see why I am such a fan of R and highly recommend organizations with a serious interest in developing a robust analytics capability become accustomed to using R (or Python) and move away from simple tools like Excel.
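
Once you have the rules object, arules makes it just as easy to dig into the results; for example, one extra line displays the strongest rules:

# Show the 10 rules with the highest lift
inspect(head(sort(rules, by = "lift"), 10))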

When working with the Apriori algorithm there are three key concepts you must understand: support, confidence and lift. These are pretty simple to understand:

  • Support – This is a measurement of how frequently a set of items appears in the data. For example, if the support of peanut butter is 50% it means that 50% of the customers had peanut butter in their basket. Similarly, if the support of {peanut butter, jelly} is 25% then 1 out of every 4 customers had both peanut butter and jelly in their basket. The higher the required support, the more data the Apriori algorithm will exclude from consideration.
  • Confidence – This is a measurement of how predictive an associated set of products is. For example, if the confidence of {peanut butter} => {jelly} is 75%, a basket that has peanut butter in it has a 75% chance of also containing jelly. Similarly, saying that the confidence of {bread, cereal} => {milk} is 100% means that every basket that had both bread and cereal also had milk. So confidence tells you how much you can trust the association. The higher the required confidence, the more associations you will prune (i.e., discard from consideration).
  • Lift – This is a measurement of how surprised you should be about the associated set of products; surprised not in a common-sense way but in a statistical way. For example, let’s say you had a basket with 1 blue ball and 9 red balls. If you reach in and grab a random ball, you have a 10% chance of getting a blue ball and a 90% chance of getting a red ball. Let’s assume you reach into the basket 3 times, randomly grab a ball, look at the color and drop the ball back in the basket. Statistically, the probability of getting 3 blue balls is just 0.1%, which means that for every 1,000 attempts at doing this, you should get 3 blue balls only once (on average). If you try it and in those 1,000 attempts end up getting that outcome 50 times instead of 1 or 2 times, you should be very surprised: it happened 50x more often than random chance would predict. So lift is the ratio of how often an association actually occurred to how often it would have occurred by pure chance (i.e., if the association weren’t real). A short worked example follows this list.
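
To make the three metrics concrete, here is how the numbers fall out of the four toy baskets from earlier for the rule {peanut butter} => {jelly} – a hand-worked sketch rather than arules output:

# support: the fraction of all 4 baskets containing the itemset
support_pb_jelly <- 2/4   # {peanut butter, jelly} is in baskets A and B
support_pb       <- 3/4   # {peanut butter} is in baskets A, B and C
support_jelly    <- 2/4   # {jelly} is in baskets A and B
# confidence: P(jelly | peanut butter)
confidence <- support_pb_jelly / support_pb   # 0.667
# lift: observed co-occurrence vs. what chance alone would predict
confidence / support_jelly                    # 1.33 – only slightly above chance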

There is no right answer for what support, confidence or lift should be when using the Apriori algorithm to analyze your data. However, I like to use a simple rule of thumb:

  • Support – This should be scaled according to your data. For example, if you are looking at how often customer segments are associated with each other but only 10% of your customers purchase repeatedly, then your support can’t be more than 10% because 90% of your data will have a single item in the basket. So I usually use a simple rule that the support should be at least 1% of how often I expect a subset of items to appear in the data.
  • Confidence – An association that isn’t predictive is useless, so generally speaking I like to see a confidence of at least 50% wherever possible so that you are "right" more often than not when using the association.
  • Lift – Any rule with a lift less than 10x should be used only with caution; I like to see the lift be at least an order of magnitude higher than chance to feel comfortable that it is a real association and not just a random coincidence (see the filtering sketch after this list).
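
If you prefer to mine with loose thresholds and filter afterwards, arules lets you subset the rule set by its quality measures; a minimal sketch, assuming the rules object from earlier:

# Keep only rules that satisfy the rules of thumb above
strong_rules <- subset(rules, subset = confidence >= 0.5 & lift >= 10)
inspect(strong_rules)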

The trivial, the mysterious and the useful

Whenever you use algorithms like the Apriori algorithm to build associations you will generally end up with rules that can be classified into three groups:

  1. The trivial
  2. The mysterious
  3. The useful

Trivial rules are associations that are obvious. For example, you’ll likely get laughed out of the room by business leaders if you use this technique to tell them that customers who buy diapers also buy baby formula. Or that customers who buy peanut butter also buy jelly. These are obvious and do not require gigabytes of data and fancy algorithms to be discovered. Generally speaking you will get back a significant number of "trivial" rules from most business data.

Mysterious rules are associations that appear in the data but that are difficult to explain and not necessarily something you would act on even if the associations are real and not random. For example, let’s say you find that purchasing milk is strongly associated with also purchasing mechanical pencils. This would be a head-scratching rule because there is no reason – at least that I can think of – why customers who put milk in their shopping cart would also grab mechanical pencils. And you probably are not going to reconfigure your store to place the pencils and pens display by the milk section. So I would categorize these types of head-scratching associations as "mysterious" – though that doesn’t mean they are not useful to know. What you do with these depends on your strategy… which will be discussed in a bit.

Finally there are the useful rules. These are rules that are not trivial, not "mysterious" or "weird", and that you can actually put to use immediately. Have you ever wondered why, when you walk through Home Depot or other big-box hardware stores, they have a habit of hanging a small number of samples of an unrelated yet seemingly appropriate product along an aisle? For example, I’ve noticed that my local Home Depot will put a small number of hammers and nails in the lumber section… there is an entire aisle that has more nails than you will ever need and more variations of a hammer than most people have ever seen, but for some reason they still put a few samples in the lumber section. Why? It’s because of associations. Either through manual analysis, anecdote or from using techniques like the Apriori algorithm, someone noticed that people who buy lumber also tend to buy hammers and nails. This isn’t a trivial rule because I would have assumed that most people who buy lumber already have at least one hammer and plenty of nails, because lumber isn’t something most people buy on a whim. And yet people still tend to purchase these things together… maybe you can never have enough hammers and nails if you work with lumber?

Analytics != Strategy… Yes my friend, you do have to think!

Analysts and data scientists sometimes make the mistake of assuming that analytics is the be-all and end-all. That’s a mistake. Analytics without strategy is like buying lots of cement and wood and expecting the house to build itself. Analytics is just a building block… strategy is actually using those building blocks to get something useful done. So you’ve written 8 lines in R and uncovered a series of mysterious and useful associations (hopefully you got my point about ignoring the trivial ones)… congratulations, now what are you going to do with those associations? It’s important, if you are an analyst or a data scientist, to think like a business person and ask yourself: "So what?"

If you’ve ever walked around a grocery store you will have noticed that the milk section is almost always at the back of the store. This is purposeful: it forces you to wander around the store, which increases the chances you will buy more than just milk. That’s strategy! Let’s consider a trivial association like {peanut butter} => {jelly}. There are two basic strategies you can take with this association:

  1. Place the peanut butter in a different aisle far away from the jelly to force shoppers to wander around and increase the chances they will buy something else
  2. Place the peanut butter in the same aisle to increase the chances the shopper will buy jelly whenever they get peanut butter

Which is the better strategy? There is no right answer. Indeed, I have seen some grocery stores use one strategy and others use the other. Placing the peanut butter and jelly far away from each other might increase the chances that a customer grabs another product while looking for both items, but it risks that the customer grabs only one and leaves if he or she isn’t inclined to play "hide and seek". Placing the peanut butter and jelly in the same aisle increases the chances they’ll grab both and easily generates more revenue, but it risks that the customer might have grabbed a third item if they had been forced to wander around. The only way to decide is to run an A/B test and compare the revenue generated by each strategy. If you want to better understand what an A/B test really measures, I’ve written an article about that too.

