Lies, Damned Lies, and Statistics

We all know that correlation does not necessarily imply causation. But sometimes, what appears to be very strong correlation is not actual correlation at all – it is more of a statistical red herring that can lead to incorrect decisions.
Mark Twain famously stated:
There are three kinds of lies: lies, damned lies, and statistics.
Below is the true story of how I almost unwittingly told a great statistical lie in an attempt to analyze customer value. The story goes on to detail how I caught the lie by discovering seemingly paradoxical correlations, the ensuing intellectual struggle, and the revelation of the ultimate truth. It is complete with all the intuitive explanations for the phenomenon and the Python code for the simulations, tables and graphs to help you understand the concept and perhaps use it in your own work. Hang on to your hats.
The Backstory
The manager of the dress category for a women’s clothing brand with stores across the U.S. believed that customers who purchased dresses (a fairly small category for this brand) were more valuable than a typical customer. She believed that by offering a wider variety of dresses and focusing more on dresses in advertisements, store signage, and website imagery, the brand would attract more high-value customers. However, she needed evidence to back up her hunch, so she asked me if we could use our transactional data to determine whether customers who buy dresses are indeed more valuable than other customers.
In order to effectively analyze the data, we needed to define our terms precisely. We defined a "customer" as someone who purchased at least one item from this brand’s stores or website in a given year. We then divided these customers into two groups: "dress buyers" whose purchases that year included at least one dress, and "non-dress buyers" whose purchases that year did not include any dresses. Finally we defined "customer value" as the total amount spent by that customer over the course of that year. It is important to note that the customer value of dress buyers includes the amount they spent on all items – not just dresses.
We then compared the customer value of dress buyers with non-dress buyers in each of the last three years. We saw that the dress buyers consistently spent significantly more money on average (again, across all items, not just dresses) than non-dress buyers. Due to the proprietary nature of this data and the limitations of my memory, I cannot report the specific numbers, but these numbers would have convinced the most anti-dress designer to scrap her pants patterns at once. The dress category manager was psyched to go to the executives with this news. Her category’s customers were highly valuable!
I had a nagging feeling that something wasn’t quite right, so I asked the manager to view those findings as preliminary because I still needed to do more digging. I was surprised by the degree to which dress-buyers outspent non-dress-buyers, and I wondered what the buyer vs. non-buyer analysis would look like for other categories, so I repeated the analysis for other categories as well.
I tried pants-buyers vs. non-pants-buyers, sweater-buyers vs. non-sweater-buyers, scarf-buyers vs. non-scarf-buyers; you get the idea. For each and every category, the "x-category buyers" were consistently and significantly better customers than the "non-x-category buyers"! Something was clearly amiss. How could buyers of every single category be better than average?!
Maybe it was because I was comparing customers who specifically bought something with customers who specifically did not buy something? Although I didn’t see intuitively why that would matter, because the non-x-category buyers bought at least one item by definition, I decided to try one more approach. This time, I compared customers who bought at least one item in category-x with all customers (including both those whose purchases included and didn’t include items in category-x). While this analysis shrank the difference in annual spend a bit, it still showed that buyers of each and every category were significantly more valuable than the truly average customer.

How could this be? It seemed paradoxical! I was flummoxed! So I did what any reasonable person would do and banged my head against the wall until I figured it out.
Finally, through the haze of a throbbing head, a caffeine-induced buzzing, and the impending shame of approaching the dress-category manager with my unsatisfactory findings, the confusion melted away and the truth appeared in all its crystal clear splendor like a beautiful rainbow after a terrible storm.
The Intuition
If you already understand why I was getting the above results, congratulations! You’re one smart cookie; feel free to skip the rest of this and go do some smart stuff. For everyone else, let’s think about the dataset, because after thinking really hard for a really long time, the answer finally occurred to me once I looked closely at the dataset itself.

We can imagine the dataset as a giant receipt (sort of like the receipts from a particular American chain of drug stores whose name consists of three letters), that lists every single item purchased by every single customer at every single store for a full year. Each line on this receipt lists the category of the item purchased, the price paid, and the customer who purchased it.
Imagine we then hang this receipt from a fluffy cloud at the top of the rainbow I mentioned earlier, throw a dart at it (without aiming) and then read the line that is closest to the dart. Do you think the customer listed on that line is most likely to be someone who spends:
- more than average?
- less than average?
- around average?

If it isn’t intuitively clear that the answer is "more than average", imagine the same scenario above except there are only two customers: Customer A, who bought 99% of the items on the receipt, and Customer B, who bought 1% of the items on the receipt. An "average customer" will, in theory, have bought (99% + 1%)/2 = 50% of items on the receipt, but that customer doesn’t actually exist, and chances are my dart will land on one of Customer A’s lines, an above average customer. This concept generalizes to less extreme cases with many different types of customers. Whenever a random line is selected (or hit by a dart), it is more likely to be a line that belongs to a customer with lots of lines than a line that belongs to a customer with just a few lines. So you can expect that the selected line will belong to a better customer.
If we throw lots of darts, and take the average value of all the customers whose lines got hit (some customers may have multiple lines that got hit – that’s OK, we’ll just count them once), we’ll most likely find that this average value is greater than the average value of customers who have no lines that were hit (because those customers tend to take up less space on the receipt).
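To make the two-customer case concrete, here is a minimal sketch using only Python's standard library. The 100-line receipt, the seed, and the number of throws are assumptions chosen for illustration; nothing here comes from real data.

```python
import random

# Hypothetical two-customer receipt: Customer A owns 99 lines, Customer B owns 1.
receipt = ["A"] * 99 + ["B"] * 1

rng = random.Random(0)  # fixed seed so the run is repeatable
throws = 10_000

# Each dart lands on one line chosen uniformly at random (no aiming).
hits = [rng.choice(receipt) for _ in range(throws)]

share_a = hits.count("A") / throws
print(f"Share of darts landing on Customer A: {share_a:.1%}")
```

Both customers are equally likely to be picked if we choose a customer at random, but a random line almost always belongs to Customer A. That gap between "random customer" and "random line" is the whole effect.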
By selecting items that belong to one category (in our case dresses), even if buying from that category were completely meaningless from a customer value standpoint, we would still expect that the average buyer of that category would have a higher customer value than everyone else! The question we need to answer then is how much more valuable than average do we expect these buyers to be? If dress buyers are significantly more valuable than this theoretical value, then we can conclude that they are more valuable customers.
Simulation-Based Demonstration
In case you’re still not convinced of the intuition, let’s demonstrate this phenomenon with a fun little simulation in Python! If you are convinced and/or just not interested in all this coding, feel free to scroll to the Solution section.
Let’s assume we have 10 product categories, and for simplicity’s sake, let’s price every product in every category at $1.
Now let’s assume that there is a pool of four types of customers that exist in the great big world of all possible customers:
- Frugal Fannie: 40% of customers, buys one item a year
- Regular Rachel: 30% of customers, buys three items a year
- Loyal Laurie: 20% of customers, buys 10 items a year
- Shop-aholic Shirley: 10% of customers, buys 50 items a year
Now let’s say that in this particular year, we had 1,000 customers. Each customer is essentially a random draw from the population of customer types. Depending on her type, she will buy 1, 3, 10, or 50 items, but unlike a normal human shopper with tastes and preferences, this simulated shopper randomly draws an item from our infinitely large inventory of items in 10 categories:
- Category 1, Scarves: 0.9% of items
- Category 2, Belts: 1.3% of items
- Category 3, Shoes: 2.0% of items
- Category 4, PJs: 3.0% of items
- Category 5, Dresses: 4.5% of items
- Category 6, Sweaters: 6.7% of items
- Category 7, Skirts: 10.1% of items
- Category 8, Shorts: 15.1% of items
- Category 9, Pants: 22.6% of items
- Category 10, Tops: 33.9% of items
To build our data set we first select a customer from the potential customer population and determine how many purchases she will make based on her type. Then for each of her purchases, we randomly select a product from our inventory and record the purchase on our giant receipt. We repeat this process for all 1,000 customers:
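Here is a minimal sketch of that process using only Python's standard library. The names `CUSTOMER_TYPES`, `CATEGORIES`, and `build_receipt`, along with the seed, are illustrative assumptions; the shares and item counts are the ones listed above.

```python
import random

# Customer types from the list above: name -> (share of customers, items per year).
CUSTOMER_TYPES = {
    "Frugal Fannie": (0.40, 1),
    "Regular Rachel": (0.30, 3),
    "Loyal Laurie": (0.20, 10),
    "Shop-aholic Shirley": (0.10, 50),
}
# Category shares from the list above, in percent. They sum to 100.1 because of
# rounding; random.choices only needs relative weights, so that is harmless.
CATEGORIES = {
    "Scarves": 0.9, "Belts": 1.3, "Shoes": 2.0, "PJs": 3.0, "Dresses": 4.5,
    "Sweaters": 6.7, "Skirts": 10.1, "Shorts": 15.1, "Pants": 22.6, "Tops": 33.9,
}
PRICE = 1  # every item costs $1


def build_receipt(n_customers=1000, seed=42):
    """Return the giant receipt as (customer_id, customer_type, category, price) lines."""
    rng = random.Random(seed)
    type_names = list(CUSTOMER_TYPES)
    type_weights = [share for share, _ in CUSTOMER_TYPES.values()]
    cat_names = list(CATEGORIES)
    cat_weights = list(CATEGORIES.values())
    receipt = []
    for customer_id in range(n_customers):
        # Draw the customer's type, which fixes how many items she buys.
        ctype = rng.choices(type_names, weights=type_weights)[0]
        n_items = CUSTOMER_TYPES[ctype][1]
        # Each purchase is an independent random draw from the inventory.
        for category in rng.choices(cat_names, weights=cat_weights, k=n_items):
            receipt.append((customer_id, ctype, category, PRICE))
    return receipt


receipt = build_receipt()
print(len(receipt), "lines; first three:", receipt[:3])
```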

Ok that looks good, let’s check a few things.
Firstly, does the category mix match what we expected?
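One way to run this check is sketched below. It rebuilds a receipt inline so the snippet runs on its own, and the variable names and seed are assumptions.

```python
import random
from collections import Counter

ITEMS_PER_YEAR = {"Frugal Fannie": 1, "Regular Rachel": 3,
                  "Loyal Laurie": 10, "Shop-aholic Shirley": 50}
TYPE_SHARES = [0.40, 0.30, 0.20, 0.10]
CATEGORIES = {
    "Scarves": 0.9, "Belts": 1.3, "Shoes": 2.0, "PJs": 3.0, "Dresses": 4.5,
    "Sweaters": 6.7, "Skirts": 10.1, "Shorts": 15.1, "Pants": 22.6, "Tops": 33.9,
}

rng = random.Random(42)
receipt = []  # (customer_id, customer_type, category) lines
for customer_id in range(1000):
    ctype = rng.choices(list(ITEMS_PER_YEAR), weights=TYPE_SHARES)[0]
    basket = rng.choices(list(CATEGORIES), weights=list(CATEGORIES.values()),
                         k=ITEMS_PER_YEAR[ctype])
    receipt.extend((customer_id, ctype, category) for category in basket)

# Compare each category's share of receipt lines with the share we asked for.
counts = Counter(category for _, _, category in receipt)
total_lines = len(receipt)
total_weight = sum(CATEGORIES.values())
print(f"{'category':<10}{'expected':>10}{'observed':>10}")
for name, weight in CATEGORIES.items():
    print(f"{name:<10}{weight / total_weight:>10.1%}{counts[name] / total_lines:>10.1%}")
```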

Looks about right, now let’s see what the customer type mix looks like.
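A sketch of the customer type check, again rebuilding the receipt inline so it runs alone (names and seed are assumptions). It also prints each type's share of receipt lines, which is not part of the check itself but shows how lopsided the receipt is.

```python
import random
from collections import Counter

ITEMS_PER_YEAR = {"Frugal Fannie": 1, "Regular Rachel": 3,
                  "Loyal Laurie": 10, "Shop-aholic Shirley": 50}
TYPE_SHARES = [0.40, 0.30, 0.20, 0.10]
CATEGORIES = {
    "Scarves": 0.9, "Belts": 1.3, "Shoes": 2.0, "PJs": 3.0, "Dresses": 4.5,
    "Sweaters": 6.7, "Skirts": 10.1, "Shorts": 15.1, "Pants": 22.6, "Tops": 33.9,
}

rng = random.Random(42)
receipt = []  # (customer_id, customer_type, category) lines
for customer_id in range(1000):
    ctype = rng.choices(list(ITEMS_PER_YEAR), weights=TYPE_SHARES)[0]
    basket = rng.choices(list(CATEGORIES), weights=list(CATEGORIES.values()),
                         k=ITEMS_PER_YEAR[ctype])
    receipt.extend((customer_id, ctype, category) for category in basket)

# Count each customer once, no matter how many lines she has.
type_of = {customer_id: ctype for customer_id, ctype, _ in receipt}
customer_mix = Counter(type_of.values())
# Count every line, so heavy shoppers weigh more.
line_mix = Counter(ctype for _, ctype, _ in receipt)

print(f"{'type':<22}{'customers':>10}{'lines':>10}")
for name in ITEMS_PER_YEAR:
    print(f"{name:<22}{customer_mix[name] / len(type_of):>10.1%}"
          f"{line_mix[name] / len(receipt):>10.1%}")
```

In expectation, Shop-aholic Shirley is 10% of customers but owns roughly 60% of the lines (5,000 of about 8,300), which is exactly why a random line tends to point at a big spender.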

OK, this looks good too. Let’s now analyze the average customer value of buyers vs. non-buyers for each of our 10 categories:
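A minimal sketch of that comparison follows. It is self-contained, so it regenerates the purchases first; `avg_spend`, `table`, and the seed are assumed names and values, and the exact numbers will differ from any other run.

```python
import random
from collections import Counter, defaultdict

ITEMS_PER_YEAR = {"Frugal Fannie": 1, "Regular Rachel": 3,
                  "Loyal Laurie": 10, "Shop-aholic Shirley": 50}
TYPE_SHARES = [0.40, 0.30, 0.20, 0.10]
CATEGORIES = {
    "Scarves": 0.9, "Belts": 1.3, "Shoes": 2.0, "PJs": 3.0, "Dresses": 4.5,
    "Sweaters": 6.7, "Skirts": 10.1, "Shorts": 15.1, "Pants": 22.6, "Tops": 33.9,
}

rng = random.Random(42)
spend = Counter()          # customer_id -> total spend ($1 per item)
buyers = defaultdict(set)  # category -> ids of customers who bought it
for customer_id in range(1000):
    ctype = rng.choices(list(ITEMS_PER_YEAR), weights=TYPE_SHARES)[0]
    basket = rng.choices(list(CATEGORIES), weights=list(CATEGORIES.values()),
                         k=ITEMS_PER_YEAR[ctype])
    for category in basket:
        spend[customer_id] += 1
        buyers[category].add(customer_id)


def avg_spend(customer_ids):
    """Average annual spend of a group of customers (NaN for an empty group)."""
    if not customer_ids:
        return float("nan")
    return sum(spend[cid] for cid in customer_ids) / len(customer_ids)


table = {}  # category -> (avg buyer spend, avg non-buyer spend)
print(f"{'category':<10}{'item prob':>10}{'buyers':>10}{'non-buyers':>12}")
for category, weight in CATEGORIES.items():
    in_group = buyers[category]
    out_group = set(spend) - in_group  # bought something, just not this category
    table[category] = (avg_spend(in_group), avg_spend(out_group))
    print(f"{category:<10}{weight:>9.1f}%{table[category][0]:>10.2f}"
          f"{table[category][1]:>12.2f}")
```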

Let’s point out two things in this table. Firstly, the average spend for the buyers of each category is always greater than the average spend of the non-buyers, as we expected. Secondly, both the average buyer spend and the average non-buyer spend seem to trend downward as the category probability increases, i.e. the average customer value in each group seems to be inversely related to the category probability.
Let’s graph this to investigate further:

Note that as the probability of purchasing a product from a particular category increases, the average spend of that category’s buyers trends toward the average spend of all customers. This intuitively makes sense, because if there were only one product, then the probability of purchasing that product would be 100% and the average spend of its buyers would be exactly the same as the average spend of all customers, because all customers would have purchased that item. If we return to the dart analogy, as we throw more and more darts, we cover more and more of the receipt until eventually, every customer has a line that was hit, so the darted customers are the average customer.
Note also that the average spend of non-buyers trends toward $1 (the minimum spend of any customer in our dataset) as the category probability increases. This is because the greater the probability of purchasing an item, the greater the probability that the customers who did not buy it are the Frugal Fannies, who, as we defined earlier, only buy one product each. Given that we also conveniently set the price of each product to $1, each Frugal Fannie will have spent exactly $1. So as the probability of purchasing a particular product increases, the prevalence of Frugal Fannies among the non-buyers increases, and the average spend of all non-buyers approaches $1.
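Inside this toy simulation only, both trends can be cross-checked with a bit of arithmetic, because we know the customer types and every item is an independent draw: a customer who buys n items misses a category with item probability p with probability (1 - p)^n. The sketch below leans entirely on those two assumptions, and `expected_spend` is a made-up helper name. Real transaction data offers neither known types nor independent draws, so this does not carry over.

```python
# (share of customers, items per year) for the four simulated customer types.
TYPES = [(0.40, 1), (0.30, 3), (0.20, 10), (0.10, 50)]


def expected_spend(p):
    """Expected average spend of (buyers, non-buyers) for item probability p.

    Assumes $1 items and independent draws, as in the toy simulation.
    """
    buyer_weight = buyer_spend = 0.0
    non_weight = non_spend = 0.0
    for share, n_items in TYPES:
        miss = (1 - p) ** n_items  # chance this type never draws the category
        buyer_weight += share * (1 - miss)
        buyer_spend += share * (1 - miss) * n_items
        non_weight += share * miss
        non_spend += share * miss * n_items
    buyers = buyer_spend / buyer_weight if buyer_weight else float("nan")
    non_buyers = non_spend / non_weight if non_weight else float("nan")
    return buyers, non_buyers


for p in (0.009, 0.045, 0.339, 0.9, 1.0):
    buyers, non_buyers = expected_spend(p)
    print(f"p = {p:5.3f}  buyers ${buyers:6.2f}  non-buyers ${non_buyers:5.2f}")
```

At p = 1 every customer is a buyer, so the buyer average equals the overall average of 0.4(1) + 0.3(3) + 0.2(10) + 0.1(50) = $8.30, and there are no non-buyers left to average.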
A Solution
We demonstrated that having a higher than average spend per customer was to be expected for dress-buyers, as it is to be expected for any category's buyers. The question remains, however: how much more should we expect the average dress buyer spend to be if buying a dress is not correlated with customer value? If the actual customer value of dress buyers were to be significantly greater than that expected number, we would conclude that dress buyers are indeed better customers, and if the actual customer value of dress-buyers were to fall significantly below this expectation, we would conclude that they are actually worse customers. If, however, it fell relatively close to the expected average customer value, we would simply assume that dress buyers are just average customers. In other words, how do we correct for the bias introduced when we select a subset of customers, so that we can have an apples-to-apples comparison?
Unfortunately, there is no closed form mathematical formula (that I know of) to compute this. However, just as we demonstrated the phenomenon using simulation, we can also solve it that way.
Let’s assume that there were 5,000 dresses sold over the course of a year. If dress buyers were truly average customers, then we would expect the average customer value of dress buyers to be roughly equivalent to the average customer value of the customers listed on 5,000 randomly selected lines from our giant receipt, or to use our previous analogy, the lines on the giant receipt hit by 5,000 dart throws. It won’t be exactly the same – there will likely be some error – but this random selection should be a good approximation, because if buying a dress is meaningless from a customer value standpoint, we can consider it essentially random.
Unfortunately, that would only give us one point of reference – we would not know if that was a particularly unusual 5,000 dart throws or typical ones. What we can do to solve for that is throw those 5,000 darts, record the average customer value for the lines that got hit, remove the darts and do it again. If we repeat this process many, many times (let’s say 10,000 times), we will now have a very pretty normal (bell-shaped) distribution of average customer values. The average of these averages will be a very good and unbiased estimate of what average customer value to expect if buying a dress were truly random. We can then determine with quantifiable statistical significance how valuable actual dress buyers really are. So for example, if the average value of a dress buyer is greater than the 97.5th percentile of that distribution, we can say with greater than 95% confidence that if buying a dress were unassociated with average customer spend, it would be very unlikely to see the result we have gotten. OK, that was very stati-sticky language – in other words, they are probably better customers than average, and we’re like 95% sure of that.
Let’s code this up like we did before, except this time, let’s only use three product categories: belts, tops, and dresses. Let’s purposely make dresses more likely to be bought by the best customers and belts more likely to be bought by the worst customers.
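Here is one way to sketch that dataset with Python's standard library. The preference weights in `PREFERENCES` are assumptions picked purely for illustration (they are not the values behind the original results), as are the function name and seed.

```python
import random
from collections import Counter

# Customer types as before: name -> (share of customers, items per year).
TYPES = {
    "Frugal Fannie": (0.40, 1),
    "Regular Rachel": (0.30, 3),
    "Loyal Laurie": (0.20, 10),
    "Shop-aholic Shirley": (0.10, 50),
}
CATEGORIES = ["Belts", "Tops", "Dresses"]
# ASSUMED preference weights per type, in the order of CATEGORIES: bigger
# spenders lean toward dresses, smaller spenders toward belts, tops stay
# roughly neutral.
PREFERENCES = {
    "Frugal Fannie": [0.30, 0.65, 0.05],
    "Regular Rachel": [0.20, 0.70, 0.10],
    "Loyal Laurie": [0.10, 0.70, 0.20],
    "Shop-aholic Shirley": [0.05, 0.65, 0.30],
}


def build_biased_receipt(n_customers=1000, seed=7):
    """Return (customer_id, customer_type, category) lines; every item costs $1."""
    rng = random.Random(seed)
    names = list(TYPES)
    shares = [share for share, _ in TYPES.values()]
    receipt = []
    for customer_id in range(n_customers):
        ctype = rng.choices(names, weights=shares)[0]
        n_items = TYPES[ctype][1]
        # Unlike before, the category draw now depends on the customer's type.
        for category in rng.choices(CATEGORIES, weights=PREFERENCES[ctype], k=n_items):
            receipt.append((customer_id, ctype, category))
    return receipt


receipt = build_biased_receipt()
lines_by_type = Counter(ctype for _, ctype, _ in receipt)
mix = Counter((ctype, category) for _, ctype, category in receipt)
for ctype in TYPES:
    row = ", ".join(f"{c} {mix[(ctype, c)] / lines_by_type[ctype]:.0%}" for c in CATEGORIES)
    print(f"{ctype:<20} {row}")
```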

Now let’s simulate the distributions (the repeated dart-throwing at a receipt exercise):
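The dart-throwing exercise can be sketched as follows. Everything here is an assumption made for illustration: the preference weights, the seed, drawing lines without replacement (each sold item is its own line), and using 300 rounds instead of 10,000 to keep the run short.

```python
import random
from collections import Counter, defaultdict
from statistics import quantiles

ITEMS_PER_YEAR = {"Frugal Fannie": 1, "Regular Rachel": 3,
                  "Loyal Laurie": 10, "Shop-aholic Shirley": 50}
TYPE_SHARES = [0.40, 0.30, 0.20, 0.10]
CATEGORIES = ["Belts", "Tops", "Dresses"]
PREFERENCES = {  # ASSUMED weights: dresses skew to big spenders, belts to small
    "Frugal Fannie": [0.30, 0.65, 0.05],
    "Regular Rachel": [0.20, 0.70, 0.10],
    "Loyal Laurie": [0.10, 0.70, 0.20],
    "Shop-aholic Shirley": [0.05, 0.65, 0.30],
}

rng = random.Random(7)
lines = []  # the giant receipt as (customer_id, category) lines, $1 per item
for customer_id in range(1000):
    ctype = rng.choices(list(ITEMS_PER_YEAR), weights=TYPE_SHARES)[0]
    basket = rng.choices(CATEGORIES, weights=PREFERENCES[ctype], k=ITEMS_PER_YEAR[ctype])
    lines.extend((customer_id, category) for category in basket)

spend = Counter(cid for cid, _ in lines)  # $1 items, so spend = number of lines
line_owners = [cid for cid, _ in lines]   # the customer on each line
buyers = defaultdict(set)
for cid, category in lines:
    buyers[category].add(cid)


def avg_spend(customer_ids):
    """Average annual spend of a non-empty group of customers."""
    return sum(spend[cid] for cid in customer_ids) / len(customer_ids)


N_SIMS = 300  # the text suggests 10,000; fewer here to keep the sketch fast
results = {}
for category in CATEGORIES:
    n_sold = sum(1 for _, c in lines if c == category)
    actual = avg_spend(buyers[category])
    # One round of darts: hit n_sold random lines, count each customer once.
    null = [avg_spend(set(rng.sample(line_owners, n_sold))) for _ in range(N_SIMS)]
    cuts = quantiles(null, n=40)  # 39 cut points: first is 2.5th pct, last is 97.5th
    low, high = cuts[0], cuts[-1]
    verdict = "above" if actual > high else "below" if actual < low else "within"
    results[category] = (actual, low, high, verdict)
    print(f"{category:<8} actual ${actual:6.2f}   "
          f"expected range ${low:6.2f} to ${high:6.2f}   {verdict}")
```

Note that the number of darts differs per category: it is set to the number of items sold in that category, because (as shown earlier) the expected buyer value depends on how much of the receipt the category covers.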

As expected, dress buyers are significantly more valuable than their expected value, tops buyers are well within their expected value range, and belt buyers are significantly less valuable than their expected value.
Let’s graph the simulation distributions so we can see this even more clearly!

Now we can clearly see how far away from an expected range belt-buyer and dress-buyer values are. While this does not prove causality, i.e. we still don’t know if purchasing a dress makes a customer better, or it’s just that better customers tend to like dresses (actually we know that it is the latter in this case, because we created the simulation, but in real life we wouldn’t know this), we still have valuable information about the value of customers who are buying products in the various categories. For example, if this company were to consider discontinuing a category, they would probably be better off discontinuing belts than dresses, because the risk of alienating the best customers is the lowest with that category.
But can we determine causality? What if the company really wants to know if there are products that, when purchased, actually cause a customer to become a better customer? Imagine buying a video game console, and that leads to buying many games and accessories. Or in our clothing case, imagine a product (like a dress) that goes perfectly with a belt, shoes, shrug, and a handbag all conveniently sold by this company. How can we use statistics to determine that?
Let’s explore that question more fully in a future post.
p.s. In real life, dress buyers did indeed turn out to be more valuable customers (though it wasn’t as extreme as the simulated graph I show here), and the dress category manager lived happily ever after.
About the Author

Hi, I’m Daniel. I’m a data science student, practitioner, and leader in that order. I’m especially fascinated by how data science can help businesses make better decisions and do things more efficiently. I currently lead the data science team at Red Oak Sourcing, and I have built and led several data science teams in various industries. If you’re interested in connecting with me professionally, check out my LinkedIn profile here: https://www.linkedin.com/in/danielwiesenfeld/, and if you’d like to hear my occasional less-filtered thoughts, follow me on Twitter here: https://twitter.com/DataDan5. You can also find my public repos and gists on GitHub here: https://github.com/dan-s-w. I hope you enjoy my articles.