
In recent times, AB testing has become the gold standard for product development teams. Running experiments on the product helps teams understand the incremental effect each change has on their key metrics, and so gradually improve their products. However, setting up the infrastructure to run experiments efficiently is very costly, so oftentimes, as a Product Analyst, you’ll need to find other ways to quantify the impact a feature change has on a key metric. In this series of posts we’ll explore why experimentation is the best way of understanding causality, as well as other causal inference techniques that can be used in its absence:
- Understanding the importance of experimentation
- Statistical methods for Causal Inference
- Machine Learning for Causal Inference
To understand the importance of experimentation, let’s quickly cover some key terms and their definitions, starting with the concept of confounders.
Confounders
A confounder is a factor which is independent of the change we’re testing but may still have an impact on our key test metrics. These could be external factors, like time and geography, or internal ones, like another product change affecting your experiment.
For example, if your product is built around a user’s hobby, like gaming or exercise, then Day of Week is likely to have a confounding effect on your product. Users will probably engage with it more on weekends than on weekdays, simply because they have more free time on weekends to pursue those hobbies.
Similarly, the country or region of the user could have a big effect on how they engage with your product. If your product is focused on gambling, for example, the very different attitudes towards gambling around the world would have a huge impact on engagement.
Now that we have a better idea of what confounders are, let’s look at how these are handled with experimental data vs observational data.
Experimental Data
Experimental data allows us to control for these confounders, which in turn lets us make strong statements about whether a change has caused an increase or decrease in our key test metrics. In product experiments, users are randomly assigned to different groups, each group is shown a different version of the product, and the experimental data is a summary of how users have engaged with each version. Thanks to this random assignment, as the sample size grows, the distribution of each group under each confounder becomes similar to the distribution of the entire user base under that same confounder. As a result, each group is a good representation of the entire user base and is almost identical to the other groups, so the test metrics can be compared directly between the different versions to understand which one is the most engaging.
Let’s tackle a real-world example to help understand this idea. Assume that the channel through which a user is acquired is a confounding factor for an experiment. This could be because the messaging differs across channels, so users acquired through some channels expect something different from your product than users acquired through others. It could equally be down to the different levels of intent shown by users from different channels: an organic user who has actively sought out your product is more likely to be engaged with it than a user who clicked on a Facebook ad.
The graphs below show the channel distribution of users acquired for a hypothetical mobile app. As we can see, the spread of channels in the overall user base is reflected in both the Control and Treatment groups, thanks to the random assignment described above. The proportion of users who are more or less engaged with the product because of their channel is now equal across the two groups, so we can directly compare the test metrics to understand which version is better, independent of the confounding factor.

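If you want to convince yourself of this balancing effect, below is a minimal simulation sketch. The channel names, channel mix and user count are all made-up assumptions; the point is simply that random assignment makes the channel mix in each group converge to that of the overall user base.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical user base: each user's acquisition channel is drawn from an assumed overall mix.
channels = ["Organic", "Facebook Ads", "Google Ads", "Other"]
base_mix = [0.4, 0.3, 0.2, 0.1]
n_users = 100_000

users = pd.DataFrame({"channel": rng.choice(channels, size=n_users, p=base_mix)})

# Random assignment: each user lands in Control or Treatment with equal probability.
users["group"] = rng.choice(["Control", "Treatment"], size=n_users)

# Channel mix within each group vs the overall user base: the rows come out near identical.
print(pd.crosstab(users["group"], users["channel"], normalize="index").round(3))
print(users["channel"].value_counts(normalize=True).round(3))
```

With 100,000 users the per-group channel shares land within a fraction of a percentage point of the overall mix; with only a few hundred users the gaps would be noticeably wider, which is one reason sample size matters here.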
It’s worth noting at this point that channel is easy to measure: we can count how many users were acquired through each one, so we could easily check that this balance holds if channel were a confounding factor. But what if we couldn’t observe our confounding factor? The beauty of random assignment is that it extends to confounding factors that can’t be observed.
Let’s assume that the hair colour of the user is a confounding factor in our experiment. Our hypothetical app lets users apply filters with their front camera open. Suppose that, unknown to us, the facial recognition algorithms behind these filters work better with some hair colours than others. Some users would therefore have a better experience than others, and hair colour would be a confounding factor we can’t observe. Even though we can’t measure it, random assignment over a large enough sample size would ensure that both user groups have the same spread of hair colours, as demonstrated in the graphs below.

Observational Data
Observational data is data that has been gathered without random assignment to each version. Instead, users end up in each version based on time, geography or some other confounder. For example, we could be segmenting users into cohorts before and after the release of a new feature, or releasing the feature only in some geographies.
In this case, due to the lack of random assignment, the distribution of each group under a confounder is likely to differ from that of the other groups under the same confounder. As a result, we can’t directly compare the test metrics of the different versions based purely on how each group has engaged with them.
Let’s take a closer look at this using our example from earlier. Assume we’ve released our feature on a given date and want to split our Control and Treatment groups by cohort date: all the new users who installed our app during the two weeks before the feature release form the Control group, and all the new users who installed it during the two weeks after the release form the Treatment group.
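As a quick illustration of this kind of cohort-based split, here is a small pandas sketch; the release date, column names and install dates are all invented for the example.

```python
import pandas as pd

# Hypothetical install log and release date (all values are made up).
release_date = pd.Timestamp("2024-03-01")
installs = pd.DataFrame({
    "user_id": range(1, 6),
    "install_date": pd.to_datetime(
        ["2024-02-18", "2024-02-27", "2024-03-02", "2024-03-09", "2024-03-20"]
    ),
})

# Cohort-based (non-random) assignment: two weeks either side of the release date.
window = pd.Timedelta(days=14)
before = installs["install_date"].between(release_date - window, release_date, inclusive="left")
after = installs["install_date"].between(release_date, release_date + window, inclusive="left")

installs.loc[before, "group"] = "Control"
installs.loc[after, "group"] = "Treatment"
print(installs)  # users outside both windows are left unassigned
```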
The graphs below show what the channel distributions of the two groups might look like. The Control group has more users acquired via Facebook Ads, whereas the Treatment group has more users acquired via Google Ads. Assume that users acquired via Facebook are more engaged because the messaging there is consistent with the product, so they have higher engagement metrics than users acquired via Google. Because of this channel imbalance, the Control group’s engagement metrics are higher than they would be if the two groups were balanced, and the Treatment group’s are correspondingly lower. If we were to compare the test metrics of the two groups directly, channel would act as a confounder and skew the results of our experiment.

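To see how this imbalance can mislead us, here is a hedged simulation in the same spirit as the graphs. The channel mixes and engagement numbers are invented, and the feature is deliberately built to have no real effect at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical cohort-based groups with different channel mixes (all numbers invented).
n = 50_000
channels = ["Facebook Ads", "Google Ads"]
control = pd.DataFrame({"group": "Control",
                        "channel": rng.choice(channels, size=n, p=[0.6, 0.4])})
treatment = pd.DataFrame({"group": "Treatment",
                          "channel": rng.choice(channels, size=n, p=[0.4, 0.6])})
users = pd.concat([control, treatment], ignore_index=True)

# Assumed engagement: Facebook-acquired users are more engaged, and the
# feature itself has NO effect on engagement whatsoever.
base_engagement = users["channel"].map({"Facebook Ads": 5.0, "Google Ads": 3.0})
users["sessions"] = rng.poisson(base_engagement)

# A naive comparison makes the Treatment group look worse, purely because of the channel mix.
print(users.groupby("group")["sessions"].mean())
```

Even though the feature does nothing in this simulation, the naive comparison suggests it hurt engagement, simply because the Treatment group contains fewer of the more engaged Facebook-acquired users.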
How do we control for confounders?
The first step in controlling for confounders is being able to measure them. Confounders such as time, geography and channel are easy to measure, so we can control for them. Confounders that are harder to measure, like the user’s hair colour from the example above, are much harder to control for without an experimental set-up.
Analysis of experimental data is a powerful tool for understanding the causal effect your product changes have had on key metrics. But what do we do if experimental data isn’t available? There are a few techniques we can apply to observational data instead: statistical methods, such as different types of weighting, and machine learning methods.
Statistical methods for causal inference
Statistical methods work by weighting the engagement metrics to account for the imbalance in the confounder’s distribution between the groups. Essentially, we use them to work out what the metrics would have been if both groups had the same distribution under the confounders as the underlying user base. Once we’ve done this, we can directly compare the adjusted metrics between the groups to declare a winner.
Let’s revisit the channel example to see this in action. On the left below, we see the distribution of each group and of the overall user base across acquisition channels. The right-hand side of the image shows the groups after the statistical methods have been applied.

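Picking up the simulated `users` DataFrame from the observational sketch earlier, here is a minimal example of the re-weighting idea (a simple direct standardisation, not the full set of methods covered next time): take each group’s per-channel metric and re-weight it by the overall user base’s channel shares.

```python
# Continues from the `users` DataFrame simulated in the observational data sketch above.

# Average engagement per channel within each group.
per_channel = users.groupby(["group", "channel"])["sessions"].mean().unstack()

# Channel shares in the overall user base: the target distribution we re-weight to.
base_shares = users["channel"].value_counts(normalize=True)

# Re-weight each group's per-channel metrics by the user-base shares and sum them up.
adjusted = per_channel.mul(base_shares, axis=1).sum(axis=1)
print(adjusted)  # roughly equal for both groups, as expected with no true effect
```

After re-weighting, both groups land on roughly the same adjusted metric, which is what we’d want given the simulation was built with no real treatment effect.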
Once again it’s worth noting that these statistical methods can only be applied to confounders which are observable (i.e. measurable).
I will be exploring the different statistical methods which can be used for this exercise in the next part of this series. The methods I will be exploring include:
- Stratification
- Inverse Probability Weighting
- Propensity Matching
Machine learning methods for causal inference
Some methods use machine learning to work out the effect your change has had on your key metrics while controlling for confounders.
As before it’s worth noting that these ML methods can only be applied to confounders which are observable (i.e. measurable).
I will be exploring the different ML methods which can be used for this exercise in a later part of this series. The methods I will be exploring include:
- Controlled Regression
- Uplift Modelling
- Meta Learners
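To give a flavour of the last of these, here is a minimal sketch of a T-learner, one of the simplest meta learners, on made-up data where channel confounds both who gets the feature and how engaged users are, and the true effect is 0.5 extra sessions. The data, column names and choice of model are illustrative assumptions rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: channel drives both exposure to the feature and engagement.
df = pd.DataFrame({"channel_facebook": rng.integers(0, 2, size=n)})
df["treated"] = rng.binomial(1, 0.3 + 0.4 * df["channel_facebook"])
df["sessions"] = 3 + 2 * df["channel_facebook"] + 0.5 * df["treated"] + rng.normal(0, 1, size=n)

X = df[["channel_facebook"]]

# T-learner: fit one model per group on the confounder(s), then compare their
# predictions for the same users to estimate the effect of the feature.
model_treated = GradientBoostingRegressor().fit(X[df["treated"] == 1], df.loc[df["treated"] == 1, "sessions"])
model_control = GradientBoostingRegressor().fit(X[df["treated"] == 0], df.loc[df["treated"] == 0, "sessions"])
uplift = model_treated.predict(X) - model_control.predict(X)

naive = df.loc[df["treated"] == 1, "sessions"].mean() - df.loc[df["treated"] == 0, "sessions"].mean()
print(f"Naive difference in means: {naive:.2f}")  # biased upwards by the channel confounder
print(f"T-learner estimate: {uplift.mean():.2f}")  # close to the true effect of 0.5
```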
I hope this blog post helped explain why AB testing lets us make stronger statements about the causal impact of each change than other methods, and that it served as an introduction to using causal inference techniques in product analysis. Watch this space for the next part of the series!