How to Use Causal Inference in Day-to-Day Analytical Work — Part 2 of 2

Rama Ramakrishnan
Towards Data Science
9 min read · Jun 18, 2020


Source: https://www.pexels.com/photo/paper-on-gray-laptop-669617/

In Part 1, we looked at how to use Causal Inference to draw the right conclusions — or at least not jump to the wrong conclusions — from observational data.

We saw that confounders are often the reason why we draw the wrong conclusions and learned about a simple technique called stratification that can help us control for confounders.

In this article, we present another example of how to use stratification and then consider what to do when there are so many confounders that stratification becomes messy.

Let’s say you work at a multi-channel retailer. You have been analyzing customer-level sales data and notice the following:

Customers who buy from multiple channels spend 30% more per transaction than customers who buy from a single channel.

Source: Inspired by http://bit.ly/2Q6BwTN

This is a pretty exciting finding because, if true, it is very actionable: you can put together a list of your single-channel shoppers and send them offers to entice them to buy from another channel the next time they shop. If a customer bought in a physical store the first time around, for example, you can email them a 20%-off online-only coupon. If your finding is real, these customers will spend more than they otherwise would, and you can sit back and watch the money roll in :-)

It is tempting to immediately take this finding to the CEO — I mean, a 30% increase in revenue is a LOT of money — but let’s apply the checklist we defined in Part 1.

Random assignment? No. Shoppers self-selected into the single-channel and multi-channel groups.

Any confounders? Any factors that influence how many channels a shopper uses AND how much they spend per transaction?

Well, how about the number of times the shopper has shopped with you in the past year?

  • In the extreme case, if they have shopped with you just once, there’s no way they can be in the multi-channel group. More generally, the more times they have shopped with you in the past year, the more likely that they used more than one channel.
  • The more times they have shopped with you, the more likely it is that they like your products and are familiar with your product catalog, and therefore the more likely they are to spend more per transaction.

Let’s go after this confounder.

We can control for this confounder by splitting the data using the distinct values of the confounder (as the splitting variable) and calculating the average spend per transaction for each bucket (a code sketch of this step follows the notes below):

(Table: average spend per transaction by purchase-frequency stratum for single-channel vs. multi-channel customers, with overall averages of $33.80 and $44.10.)

There are a few things to note in this table:

  • We have excluded those customers who purchased exactly once in the past year since even the notion of multi-channel isn’t applicable to them.
  • We have stratified using three buckets: 2 purchases, 3–5 purchases, and more than 5 purchases. This is a bit of a balancing act: if we stratify into too many buckets, we may not have enough data in some buckets; if we stratify into too few, we will be mixing apples and oranges in the same bucket. My approach is to start the analysis with a handful of buckets and then see how the results change as I add more.
  • The overall numbers ($33.80 and $44.10) are just weighted averages of the numbers below. This is just a check to make sure we haven’t made a mistake when we split the data.
  • Now we come to the heart of the analysis. We look at each stratum (i.e., each confounder bucket) and calculate how the spend number changes as you go from single to multi-channel customers.
  • Note that all these changes are in the 1–3% range, in contrast to that big 30% number we started with. This is an indication that confounding is at work and the 30% number is suspect.
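If you want to reproduce this kind of table on your own data, here is a minimal sketch of the stratification step in Python/pandas. The column names (num_purchases, multi_channel, spend_per_txn) and the toy numbers are hypothetical, purely for illustration — they are not the data behind the table above:

    import pandas as pd

    # Toy customer-level data purely for illustration.
    df = pd.DataFrame({
        "num_purchases": [2, 2, 3, 4, 6, 7, 2, 5, 8, 3],
        "multi_channel": [False, True, False, True, False, True,
                          False, True, False, True],
        "spend_per_txn": [30.0, 31.5, 35.0, 36.0, 42.0, 43.5,
                          29.5, 37.0, 41.0, 34.0],
    })

    # Exclude customers with exactly one purchase (multi-channel isn't
    # applicable to them) and bucket the rest into strata.
    df = df[df["num_purchases"] >= 2].copy()
    df["stratum"] = pd.cut(df["num_purchases"],
                           bins=[1, 2, 5, float("inf")],
                           labels=["2", "3-5", ">5"])

    # Average spend per transaction in each stratum x channel-group cell.
    table = df.pivot_table(values="spend_per_txn",
                           index="stratum",
                           columns="multi_channel",
                           aggfunc="mean",
                           observed=True)
    print(table)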

Finally, as we learned in Part 1, we “de-confound” by calculating adjusted overall numbers. The adjustment is done by weighting the stratum-level numbers with the % of customers in the entire dataset that are in each stratum.

We calculate the adjusted average spend/transaction for single-channel customers as a weighted average over the strata:

adjusted single-channel average = sum over strata of (% of all customers in the stratum) × (average spend/transaction of single-channel customers in that stratum)

… and similarly for multi-channel customers.

As was emphasized in Part 1, it is vital that the same weights — the % of customers in the entire dataset that are in each stratum — are used to adjust both sets of numbers.
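Continuing the sketch above, the adjustment is just this weighted average, with the same stratum weights applied to both groups:

    # Stratum weights: share of ALL customers (both groups) in each stratum.
    weights = df["stratum"].value_counts(normalize=True)

    # Apply the SAME weights to both columns of the stratum-level table.
    adjusted = table.mul(weights, axis="index").sum()
    print(adjusted)  # adjusted average spend/txn: single- vs. multi-channel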

OK, what do we get?

The story has changed drastically!

The 30% difference that originally got us excited has shrunk to 2%.

Among those customers with two purchases in the past year, multi-channel customers spent just 1.5% more than single-channel customers! And among customers with more than five purchases, multi-channel customers spent only a modest 2.8% more than single-channel customers!

Given this dramatic discrepancy between the original 30% number and the adjusted (de-confounded) number of 2%, it is unlikely that there is a significant causal effect due to multiple channels. Do not take this to the CEO :-)

I used just one confounder above to make it easy to explain, but you can use more if you like. For example, you may suspect that shoppers who live in rural areas, far from one of your brick-and-mortar stores, may buy exclusively online (and therefore be part of the single-channel group). Rural shoppers may also have a different spending profile than others, so this may affect the spend metric as well.

Assuming you have the data, you can extend the table above by adding a rural/non-rural split within each purchase-frequency stratum and, as we did above, compare the adjusted number to the overall number:
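In code, the extra confounder simply becomes a second grouping key. Continuing the earlier sketch, with a hypothetical rural flag added to the toy data:

    # Hypothetical rural/non-rural flag for the toy customers above.
    df["rural"] = [True, False, False, True, False,
                   True, False, False, True, False]

    # Stratify on both confounders at once.
    table2 = df.pivot_table(values="spend_per_txn",
                            index=["stratum", "rural"],
                            columns="multi_channel",
                            aggfunc="mean",
                            observed=True)

    # Weights: share of all customers in each joint stratum.
    weights2 = df.groupby(["stratum", "rural"], observed=True).size() / len(df)
    adjusted2 = table2.mul(weights2, axis="index").sum()

Note that with a sample as tiny as this toy one, some joint cells are empty — exactly the "too many buckets, not enough data" problem mentioned earlier. With real data, check that every cell is reasonably populated before trusting the adjusted numbers.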

At this point, you may be wondering: “What if I have more than one or two confounders? What if I have half a dozen? Stratification will get pretty unwieldy, won’t it?”

Good question. Yes, stratification will get messy. You will need to use multivariate methods like Linear Regression or Logistic Regression to get the job done. You can follow up on these references (article, article, article) to learn more, but here’s a quick example from my experience.

Let’s say you work for a retailer and are considering the purchase of a price optimization system. How would you understand and quantify the impact of using the system on revenue?

In an ideal world, you would run an A/B test in which a random 50% of products would be priced using the price optimization system and the other 50% would be priced with your current approach. You would run this test for a while and then compare the two groups on revenue and other metrics.
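(As an aside, the random assignment itself is trivial to do in code; here is a sketch with hypothetical product IDs:)

    import numpy as np

    product_ids = [f"SKU{i:04d}" for i in range(1000)]  # hypothetical IDs
    rng = np.random.default_rng(42)

    # Shuffle and split in half: the first half is priced by the new system.
    shuffled = rng.permutation(product_ids)
    new_system_group = set(shuffled[: len(shuffled) // 2])
    current_group = set(shuffled[len(shuffled) // 2 :])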

Unfortunately, for organizational reasons, this may not be possible (e.g., perhaps you can’t force your merchandising team to use the system for certain products but not others). But what you can do is make the system available for all end-users to use if they want to, i.e., you can allow end-users to self-select which products they price with the new system.

This self-selection clearly implies a situation with non-random assignment and you can’t simply compare the average revenue from products priced using the new system vs the average revenue from products priced using your current approach.

So, what’s the best way to analyze the resulting data to evaluate the system’s revenue impact?

Let’s apply the checklist. What are some potential confounders?

  • The category of the product. Different product categories may have vastly different revenues and if the end-users in charge of pricing for those categories are drawn to/repulsed by the new system in certain ways (e.g., those responsible for fashion-forward product categories may be more likely to believe that no optimization system can beat their gut instinct), it will certainly confound the results.
  • Rate-of-sale of the product. Maybe your team views the new system as risky and unproven and therefore will use it only for products that don’t sell much; if this happens, the new system’s performance will look worse than it actually is.
  • … and so on.

Using these sorts of considerations, several potential confounders can be identified. But since there are too many confounders to do stratification, we can use regression instead.

We assemble a dataset like this, one row for each product:

For illustration, I have included columns for Product Category, the Rate-of-Sale of the product, its Price Tier, its Revenue from the same period in the prior year etc. as potential confounders. In practice, you will have to use your domain knowledge and business judgment to come up with this list.

I also have a column (“System Use Indicator”) that is a 0–1 variable indicating if the new price optimization system was used to price that product or not. Finally, I have a column that shows the revenue for each product during the test period.

With this dataset, we can fit a regression model* to this data with “Test-period Revenue” as the dependent variable and all the other columns as independent variables.

Now, finding the causal effect of using the price optimization system vs the current approach is as simple as reading a number off the regression output.

The coefficient of the “System Use Indicator” dummy variable gives you the incremental causal impact of using the optimization system on Test-period Revenue, controlling for all the other variables.
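Here is a minimal sketch of that regression in Python with statsmodels. Everything in it — the column names and the simulated data alike — is hypothetical and only illustrates the mechanics:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200

    # Toy product-level data purely for illustration.
    products = pd.DataFrame({
        "category": rng.choice(["apparel", "home", "electronics"], n),
        "rate_of_sale": rng.gamma(2.0, 5.0, n),
        "prior_year_revenue": rng.gamma(2.0, 500.0, n),
        "system_used": rng.integers(0, 2, n),  # the 0-1 System Use Indicator
    })
    products["test_period_revenue"] = (
        1.05 * products["prior_year_revenue"]
        + 40 * products["system_used"]          # a built-in "true" effect
        + rng.normal(0, 100, n)
    )

    # Regress test-period revenue on the treatment dummy and the confounders.
    model = smf.ols(
        "test_period_revenue ~ C(category) + rate_of_sale"
        " + prior_year_revenue + system_used",
        data=products,
    ).fit()

    # The coefficient of the treatment dummy is the estimated incremental
    # effect of the system, controlling for the other variables.
    print(model.params["system_used"])
    print(model.conf_int().loc["system_used"])

With real data you would of course skip the simulation step, include Price Tier and whatever other confounders you identified, and inspect the full model.summary().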

This approach is widely used: when a causal effect is desired, analysts will identify confounders and routinely throw them into a regression along with the treatment and outcome variables, and pluck out the coefficient of the treatment variable as the causal treatment effect. When you read an article in the newspaper that “X is associated with a higher risk of Y, after controlling for age, gender, BMI, blood pressure and level of physical activity”, this is what’s going on.

This approach should definitely be in your data science toolbox but please remember that it depends on some very critical assumptions, including:

  • All confounders have been included in the model, i.e., there are no unmeasured confounders (this is probably the most important assumption)
  • The effect of the variables on the outcome is linear

(further reading: pros and cons of using stratification vs. regression)

(* The way the table is laid out implies a linear, additive model. I did so for ease of explanation, but a multiplicative model may be better for this problem if you had to do it for real: express Test-period Revenue as the product of all the factors, take logarithms to make it linear in the parameters, and then fit the resulting linear regression model.)
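Concretely, one plausible multiplicative specification (again with hypothetical names) would be:

    log(Test-period Revenue) = b0 + b1 × (System Use Indicator)
                               + b2 × log(Prior-year Revenue) + … + error

In the statsmodels sketch above, that amounts to regressing np.log(test_period_revenue) on np.log(prior_year_revenue) and the other variables in the formula; exp(b1) − 1 then estimates the fractional (percentage) lift in revenue from using the system.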

This discussion barely scratches the surface of a vast literature on how to make good inferences from observational data. There’s plenty of good material online and many books and courses. If you are just starting out on your learning journey, you are in for a treat :-).

I want to conclude by reiterating an important caveat from Part 1.

Causal inference methods applied to observational data aren’t foolproof. They rest on a number of important assumptions (e.g., no important confounders are missing from the data) and there’s no guarantee that what you have found is a true causal effect; judgment and care are needed to assess what the numbers mean.

Nevertheless, thinking of potential confounders and how you might control for them increases your causal IQ and will keep you from jumping to wrong conclusions often enough that you should make it a habit.

