How to Use Causal Inference In Day-to-Day Analytical Work — Part 1 of 2

Rama Ramakrishnan
Towards Data Science
11 min read · Oct 15, 2019


Analysts and data scientists operating in the business world are awash in observational data: data generated in the course of normal business operations. This is in contrast to experimental data, where subjects are randomly assigned to different treatment groups and outcomes are recorded and analyzed (think randomized clinical trials or A/B tests).

Experimental data can be expensive or, in some cases, impossible/unethical to collect (e.g., assigning people to smoking vs non-smoking groups). Observational data, on the other hand, is very cheap since it is generated as a side effect of business operations.

Given this cheap abundance of observational data, it is no surprise that ‘interrogating’ this data is a staple of everyday analytical work. And one of the most common interrogation techniques is comparing groups of ‘subjects’ — customers, employees, products, … — on important metrics.

Shoppers who used a “free shipping for orders over $50” coupon spent 14% more than shoppers who didn’t use the coupon.

Products in the front of the store were bought 12% more often than products in the back of the store.

Customers who buy from multiple channels spend 30% more annually than customers who buy from a single channel.

Sales reps in the Western region delivered 9% higher bookings-per-rep than reps in the Eastern region.

Source: http://bit.ly/2VtX2FV

Comparisons are very useful and give us insight into how the system (i.e. the business, the organization, the customer base) really works.

And these insights, in turn, suggest things we can do — interventions — to improve outcomes we care about.

Customers who buy from multiple channels spend 30% more annually than customers who buy from a single channel.

30% is a lot! If we could entice single-channel shoppers to buy from a different channel the next time around (perhaps by sending them a coupon that only works for that new channel), maybe they will spend 30% more the following year?

Products in the front of the store were bought 12% more often than products in the back of the store.

Wow! So if we move weakly-selling products from the back of the store to the front, maybe their sales will increase by 12%?

These interventions may have the desired effect if the data on which the original comparison was calculated is experimental (e.g., if a random subset of products had been assigned to the front of the store and we compared their performance to the ones in the back).

But if our data is observational — some products were selected by the retailer to be in the front of the store for business reasons; given a set of channels, some customers self-selected to use a single channel, others used multiple channels — you have to be careful.

Why?

Because comparisons calculated from observational data may not be real. They may NOT be a reflection of how your business really works and acting on them may get you into trouble.

The general question of how to answer ‘interventional’ questions from observational data is studied in the field of causal inference. There are helpful articles (example, example), books (Causal Inference, The Book of Why), and courses (example) that teach key concepts like causation vs correlation, confounding, selection bias, causal diagrams, reverse causality, etc.

While this knowledge is interesting and valuable, there’s a LOT of it. Is there a basic idea within that, to paraphrase Charlie Munger, “carries most of the freight” — something that we can routinely apply on a Monday morning to test if a suggestive comparison is real?

Yes. When a comparison misleads you, confounding is often the reason (we will see examples of this shortly).

Confounding occurs when

  • subjects self-select into groups or get assigned to groups based on non-random factors and
  • these factors influence what the subjects are being compared on.

These factors are called confounders. Think of a confounder as a common cause of both how subjects end up in groups and how the groups fare on the metric of interest.
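To make the idea concrete, here is a tiny simulation (the scenario and numbers are entirely made up for illustration): a single confounder drives both which group a subject ends up in and the outcome, so the groups look very different even though group membership itself does nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: is the customer "loyal"?
loyal = rng.random(n) < 0.5

# The confounder drives BOTH group membership and the outcome metric:
in_group_a = rng.random(n) < np.where(loyal, 0.8, 0.2)       # loyal customers mostly self-select into group A
spend = np.where(loyal, 100.0, 50.0) + rng.normal(0, 10, n)  # loyal customers also spend more

# Group A looks roughly $30 better per customer, even though group membership
# has zero effect on spend in this simulation.
print(spend[in_group_a].mean() - spend[~in_group_a].mean())
```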

Let’s look at a few examples.

  • Comparison: people who meditate regularly are X% less likely to get heart disease than those who don’t. Possible confounding due to exercise and diet: people who exercise and follow healthier diets => more likely to be in the meditation group and less prone to heart disease (source: Adapted from http://bit.ly/2TQjzus)
  • Comparison: people who are single are X% more likely to be active on Facebook than those who are married. Possible confounding due to age: younger people => more likely to be in the single group and more likely to be active on Facebook (source: Adapted from Categorical Data Analysis, Chapter 2.1.8, page 43).
  • Comparison: A hospital looking to replace its ultrasound machines runs a test and discovers that the new device takes X% longer to use than the old one. Possible confounding due to expertise level: newer technicians => more likely to try the new device and may take longer to do an ultrasound on any device (source: Adapted from http://bit.ly/2V0yfsZ).

Learning about confounders is like learning a new word. Once you become aware of their existence, you will see them everywhere.

What can we do about confounders?

We need to control for them. There’s an extensive literature on how to do so (article, article, article) and researchers in many fields — biostatistics and epidemiology, for example — have been doing it for decades. When you read in a newspaper article that “X is associated with a higher risk of Y, after controlling for age, gender, BMI, blood pressure and level of physical activity”, confounders are being controlled for.

A simple and widely-used approach to control for confounders is stratification.

  • We group the subjects into buckets based on the values of the confounders and calculate the comparison for each bucket.
  • Within each bucket, by definition, the values of the confounders don’t change. Therefore, any change to the outcome metric within the bucket cannot be due to changes in the confounders. This is the essence of why stratification works.
  • We then calculate a new overall comparison by calculating a weighted average of the numbers for each bucket. The weights we use here are the key and we will examine them in detail below.
  • If the original comparison and the new comparison are very different from each other, confounding is in play and we shouldn’t trust the original comparison.
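Here is a minimal pandas sketch of this stratify-then-reweight recipe, assuming a tidy DataFrame with one row per subject; the column names (group, stratum, outcome) are mine, chosen just for this sketch.

```python
import pandas as pd

def stratified_comparison(df: pd.DataFrame) -> pd.Series:
    """Adjusted mean outcome per group, reweighted by the overall stratum mix."""
    # Mean outcome for every (group, stratum) cell
    cell_means = df.groupby(["group", "stratum"])["outcome"].mean().unstack("stratum")

    # Weights = overall mix of subjects across the confounder buckets
    # (the SAME weights are applied to every group)
    weights = df["stratum"].value_counts(normalize=True)

    # Weighted average of the bucket-level numbers, one adjusted number per group
    return (cell_means * weights).sum(axis=1)

# For contrast, the naive (unadjusted) comparison is just:
# df.groupby("group")["outcome"].mean()
```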

Let’s assemble these ideas into a checklist to use when you see a suggestive comparison.

1. First, confirm that the underlying data is in fact observational, i.e., that the subjects self-selected or ended up in the groups in some non-random manner. If the data is experimental, you don’t need this checklist :-)

2. Think of potential confounders that influence both (a) which subject ends up in which group and (b) the metric of interest. If the subjects self-selected into the groups, think about what factors may have led them — consciously or unconsciously — to choose one group over another.

3. If data is available on any of the confounders you have identified, run a stratification analysis as outlined above.

Let’s apply the checklist to a few examples.

You introduced product recommendations on one section of your e-commerce website a few months ago and want to know if it is ‘working’. One of the simplest comparisons you can do is compare the spending of visitors who clicked on a recommendation with that of visitors who didn’t.

This is what you find.

Website visitors who clicked on product recommendations spent 18% more per visitor than visitors who didn’t.

Should you believe this comparison?

If you do, you may decide to show recommendations on all sections of the site to increase the chance of a shopper clicking on a recommendation. For shoppers who do that, maybe they will spend 18% more on average?

Let’s apply the checklist.

  1. Observational data? Yes. While A/B tests are ubiquitous in e-commerce and are sources of reliable comparisons due to their use of randomization, a comparison like the one above is unlikely to come from a randomized test since you can’t force visitors to click on product recommendations. The visitors must have self-selected.
  2. Potential confounders. The obvious one is the visitor’s prior history. If they are a loyal customer, they may visit the site often, explore the site more, click on product recommendations, and spend more when they buy. Thus, loyalty can influence both their propensity to click on a recommendation and their spend.
  3. Stratification analysis.

Step 1: Define confounder buckets. If we believe that loyalty is a confounding factor, perhaps the simplest thing we can do is to define two buckets, New Visitors and Returning Visitors, as an approximate measure of a visitor’s loyalty.
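If your session data records, say, how many prior visits each visitor has made, the bucketing is a one-liner (the data and column names below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical session-level data
sessions = pd.DataFrame({
    "prior_visits": [0, 3, 0, 7, 1],
    "clicked_rec":  [True, True, False, True, False],
    "spend":        [0.80, 1.25, 0.60, 1.40, 1.30],
})

# A Returning visitor is simply one we have seen before
sessions["visitor_type"] = np.where(sessions["prior_visits"] > 0, "Returning", "New")
```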

Step 2: Calculate the numbers for each confounder bucket. We calculate the per-visitor spend numbers for New and Returning visitors separately. This expands the original table from this …

Clicked: $1.18 per visitor | Didn’t click: $1.00 per visitor

… to this

Visitor Type | Clicked | Didn’t click
New | $0.70 | $0.70
Returning | $1.30 | $1.30
Overall | $1.18 | $1.00

Wait a second! Clicking on product recommendations has no effect on either Visitor Type — $0.70 stays at $0.70 for New and $1.30 stays at $1.30 for Returning — and yet, the overall impact is 18%! How’s this possible?

To get insight into how this can happen, let’s calculate the overall numbers — $1.18 and $1.00 — bottom up from the bucket level numbers.

We will first calculate the overall percentage of New and Returning visitors in our sample: 27% New and 73% Returning.

We then calculate what percentage of each visitor type clicked on recommendations. While 83% of Returning visitors clicked on product recommendations, only 55% of New visitors did.

With these two sets of numbers, we can calculate the % of New and Returning within the Clicker and Non-Clicker groups (i.e. within each column). For example, the % of Clickers that are New visitors = 27% * 55% / (27% * 55% + 73% * 83%) = 20%, and the % of Clickers that are Returning is 100% - 20% = 80%. Doing the same for Non-Clickers gives 50% New and 50% Returning.

Now, the Overall numbers are just the average of the New and Returning spend numbers, weighted by the mixes calculated above:

Clicked: $0.70 * 20% + $1.30 * 80% = $1.18

Didn’t click: $0.70 * 50% + $1.30 * 50% = $1.00
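If you prefer to check the arithmetic in code, here is a short Python recap using the percentages quoted above (the variable names are mine; small deviations from the rounded 20%/80% and 50%/50% figures are just rounding):

```python
spend       = {"New": 0.70, "Returning": 1.30}  # per-visitor spend in each bucket
visitor_mix = {"New": 0.27, "Returning": 0.73}  # overall share of each visitor type
click_rate  = {"New": 0.55, "Returning": 0.83}  # share of each visitor type that clicked

# Mix of New vs Returning WITHIN each column (Bayes' rule)
clicked_mass     = {v: visitor_mix[v] * click_rate[v]       for v in spend}
not_clicked_mass = {v: visitor_mix[v] * (1 - click_rate[v]) for v in spend}
clicked_mix      = {v: m / sum(clicked_mass.values())     for v, m in clicked_mass.items()}
not_clicked_mix  = {v: m / sum(not_clicked_mass.values()) for v, m in not_clicked_mass.items()}
print(clicked_mix)      # ~20% New, ~80% Returning
print(not_clicked_mix)  # ~50% New, ~50% Returning

# Overall spend in each column = weighted average of the bucket spends
print(sum(spend[v] * clicked_mix[v]     for v in spend))  # ~$1.18
print(sum(spend[v] * not_clicked_mix[v] for v in spend))  # ~$1.00
```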

We can see now how confounding did its mischief.

The mix of New and Returning in each of the groups (i.e. columns) was different from group to group.

If the mixes are not nearly identical, strange things are possible — for example, the overall number may show a decrease while every stratum-level number shows an increase! (Simpson’s Paradox).

And why was the mix not identical in the two columns?

Because Returning visitors were more likely to click on recommendations than New visitors.

It is important to note that this phenomenon has nothing to do with the mix of New and Returning visitors in your data sample. The key factor is the % of New visitors clicking on recommendations vs the % of Returning visitors clicking on recommendations.

If these two percentages are the same, the mixes will be identical. If these two percentages are different, the mixes will be different.

(This, by the way, is how random assignment prevents confounding. If we had somehow randomly assigned visitors to the “Clicked” and “Didn’t Click” groups, the % of New visitors clicking on recommendations would be the same as the % of Returning visitors clicking on recommendations. Therefore, the mix of New and Returning within the “Clicked” and “Didn’t click” groups would have been identical, preventing the distortion we saw above.)
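A quick illustrative simulation (again, entirely made up) shows the mechanism: flip a fair coin to assign visitors to the two groups, and the New/Returning mix comes out essentially identical in both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 27% New / 73% Returning, as in our sample
returning = rng.random(n) < 0.73

# Random assignment: the coin flip ignores visitor type entirely
in_group_a = rng.random(n) < 0.5

# The share of Returning visitors is ~0.73 in both groups,
# so visitor type cannot distort the comparison.
print(returning[in_group_a].mean(), returning[~in_group_a].mean())
```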

Step 3: Calculate an adjusted overall comparison number.

We “deconfound” by using the same weights in both columns. Which weights? The % mix of New vs Returning visitors.

As we saw earlier, the overall mix of New and Returning in the visitor base was 27% and 73%, and we use these percentages as weights to calculate an adjusted comparison.
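Concretely, applying the same 27%/73% weights to both columns gives roughly $1.14 in each; here is the check in Python (variable names are mine):

```python
spend_clicked     = {"New": 0.70, "Returning": 1.30}  # bucket-level spend, Clicked column
spend_not_clicked = {"New": 0.70, "Returning": 1.30}  # bucket-level spend, Didn't-click column
overall_mix       = {"New": 0.27, "Returning": 0.73}  # the SAME weights for both columns

adj_clicked     = sum(spend_clicked[v]     * overall_mix[v] for v in overall_mix)
adj_not_clicked = sum(spend_not_clicked[v] * overall_mix[v] for v in overall_mix)
print(round(adj_clicked, 2), round(adj_not_clicked, 2))  # ~1.14 vs ~1.14: a 0% difference
```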

You can see now that the 0% difference in spend for each visitor type is reflected in the adjusted overall number as well. The discrepancy has disappeared.

Confounding had distorted these weights so that they were different for each column. By using the same weights for each column, we undid the damage.

To recap: using the mix of subjects across the confounder buckets in your data as weights for both groups, calculate a weighted average of the bucket level numbers to produce the adjusted overall numbers.

Step 4: Compare the comparisons :-)

Since the original comparison and adjusted comparison are wildly different, the original comparison cannot be trusted.

(Aside: A better way to assess the effect of product recommendations is to assign website visitors to two groups A and B randomly, with recommendations on for group A and off for group B, and measure revenue-per-session, conversion rate etc for the two groups over a defined time period).

Caveat: This “control for confounders” approach is far from foolproof.

  • There may be other confounders you didn’t think of. There’s a highly developed theory on how to identify a complete set of confounders (e.g., the back door criterion) that I encourage you to read. My hope is that the approach advocated here of attempting to identify confounders informally (based on your knowledge of the business) is enough to get started.
  • There may be other confounders you thought of but you don’t have data on them so you can’t control for them.
  • The confounders you identified may not really be confounders.
  • Your controlling approach may not be good enough. For example, above we stratified visitors into just two buckets; maybe those buckets were so big that confounding continued within each bucket (called residual confounding in the literature).
  • On the other hand, if you create too many buckets, some of those buckets may have too few subjects, and that will lead to unreliable estimates of the metric.

All these caveats notwithstanding, it is worthwhile to apply the checklist when you see a juicy comparison and are tempted to swing into action.

It may not make you a rockstar at work but may save you from jumping to the wrong conclusion at least some of the time.

In Part 2, we look at more examples and address the situation where you have so many confounders that stratification becomes messy.
