One day a good friend and coworker of mine came to me and asked a question that haunts me to this day. In fact, this post, written years after that question, grew out of that conversation.
He asked:
-Hey man! We are trying to solve an interesting problem, maybe you can help us out! The challenge is, we want to run a marketing campaign on national television and we want a way to analyze the real impact that it had on our sales. How can we do that in a reliable way?
To which I promptly answered: "man…without a control group I have no idea…😕 ".
Well, as we came to learn later on, there’s actually a set of techniques developed to solve exactly this type of problem, known as Causal Impact inference, which is what this post addresses.
With my coworker’s question in mind, here’s what we’ll be covering in this article:
- Brief discussion about causality.
- Bayesian structural time series (bsts).
- Introduction to tfcausalimpact built on top of Python.
- Examples.
By the end we’ll be able to analyze whether a given random variable causes an impact on another one (to a given degree of confidence), which allows us to solve a huge number of recurring problems in the data science field.

Causality Isn’t Simple
The concept of causality is probably one of the most powerful in the field of data science, and it’s also probably one of the most difficult and complex to understand and implement.
Deriving causality is what could bring us new drugs to fight diseases, a deeper understanding of the human brain, better explanations to various observed phenomena and so on.
In fact, it’s by understanding causality that I could have helped my coworker with the television advertisement challenge: once we find the causal relationship between marketing investment in TV networks and sales revenue, we can infer how both are expected to behave, even when manipulating one of them (by increasing investments in marketing, for instance).
Using a Bayesian perspective, here’s a simple depiction of a causality relationship between two random variables X and Y:

Given by the joint distribution:
P(X, Y) = P(X) · P(Y | X)
Working with this type of data should be straightforward. We know that X has a direct impact on Y, so if one increases we already have an idea of how the other will behave. But let’s imagine, as an example, that we have a model relating operational and marketing investments along with the company sales:

Given by the joint distribution:
P(Operation, Marketing, Sales) = P(Operation) · P(Marketing | Operation) · P(Sales | Marketing)
Now, what are the causalities between the variables?
Well, Operation does cause some effect on Sales. But, given this Bayesian graph, if we know the value of Marketing then how much was invested in operations isn’t relevant to Sales anymore. The causal relationship of this graph implies that once we observe Marketing we can fully describe Sales.
We can actually see that in data; suppose we have this dataset with fictional numbers:

Let’s fit a linear regression on top of this data using Keras from TensorFlow. At first, we regress the response variable Sales with just the independent Operation variable (the bias is removed to simplify the discussion):
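The original snippet isn’t embedded in this version of the post; a minimal sketch of such a regression with Keras might look like the following (the stand-in dataset, the 1.2 Operation → Marketing weight and the optimizer settings are my own assumptions):
import numpy as np
import tensorflow as tf

# Stand-in for the fictional dataset above: Marketing depends on Operation and
# Sales depends on Marketing (the 1.2 weight is assumed; the 1.4 one is described below).
rng = np.random.RandomState(42)
operation = rng.normal(10, 2, size=(500, 1)).astype(np.float32)
marketing = 1.2 * operation + rng.normal(size=(500, 1)).astype(np.float32)
sales = 1.4 * marketing + rng.normal(size=(500, 1)).astype(np.float32)

# A single Dense unit with no bias term is just a linear regression through the origin.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, use_bias=False)])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss='mse')
model.fit(operation, sales, epochs=300, verbose=0)
print(model.get_weights())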
Results in:
[array([[1.7188858]], dtype=float32)]
So we see that our linear regression does find a relationship between Operation and Sales as expected. But let’s see other scenarios. Now we regress with [[Operation, Marketing]] using the same code and the result is:
[array([[-0.09608397],
[ 1.4150134 ]], dtype=float32)]
Notice now that Operation is essentially irrelevant to the regression whereas Marketing has a linear weight of 1.41. Let’s see what happens if we remove now the Operation variable from the regression:
[array([[1.3306718]], dtype=float32)]
Results in 1.33. That means that in this graph structure the variable Operation becomes irrelevant to the regression if we also have Marketing. In other words, conditional on Marketing, it adds no further predictive information about Sales.
In fact, here’s the code used to build this simulated data:
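That snippet also isn’t embedded here, but a sketch of how such data could be generated with TensorFlow Probability distributions might look like this (only the 1.4 Marketing → Sales weight comes from the original description; the remaining coefficients, locations and scales are assumptions):
import tensorflow_probability as tfp

tfd = tfp.distributions

# Chain structure: Operation -> Marketing -> Sales.
operation = tfd.Normal(loc=10., scale=2.).sample(500)
marketing = tfd.Normal(loc=1.2 * operation, scale=1.).sample()
# Sales' only parent is Marketing, with a linear 1.4 relationship plus noise.
sales = tfd.Normal(loc=1.4 * marketing, scale=1.).sample()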
Notice that Sales is a distribution whose parent is Marketing and the relationship between them is linear with a weight of 1.4, just as our regression found.
We’ve just seen one example of a graph structure explaining causality. Depending on how the variables are connected and which of them we observe, conclusions can vary wildly.
We can go further and propose a new model to our sales data. Suppose now that we find that the following model is the one that actually generates our observed data:

This is known as a collider; now the causality relationship between our variables has completely changed and what we discussed before is no longer true. Let’s suppose that we obtained this dataset from our Marketing team:

Once again we run the linear regression with each variable to see what happens. First, we use just Operation to explain Sales:
[array([[2.3248656]], dtype=float32)]
Then we use Marketing:
[array([[2.2983878]], dtype=float32)]
And finally both Operation and Marketing:
[array([[1.2202957],
[1.3859276]], dtype=float32)]
Now this is interesting. When we observed just one variable, both led to the same result, that is, a linear weight of roughly 2.3 to explain the data. Only when observing both do we see that Operation has a weight of 1.22 and Marketing has 1.38. This is how this data was built:
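The generating snippet isn’t embedded here either; a sketch of a collider-structured generator with TensorFlow Probability could be (the 1.2 and 1.4 weights are the ones discussed below, the rest are assumptions):
import tensorflow_probability as tfp

tfd = tfp.distributions

# Collider structure: Operation -> Sales <- Marketing, with Operation and Marketing independent.
operation = tfd.Normal(loc=10., scale=2.).sample(500)
marketing = tfd.Normal(loc=10., scale=2.).sample(500)
sales = tfd.Normal(loc=1.2 * operation + 1.4 * marketing, scale=1.).sample()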
As we can see, there is in fact a 1.2 and a 1.4 linear relationship between the two variables and Sales. But when we observed just one of them, the regression almost doubled the linear weight in an attempt to explain the observed data. The causal relationship between the variables exists regardless of which of them are observed, but the predictive capability of the model is only precise when both are observed.
Now we have an interesting problem. Let’s say I told my coworker that we could create a model relating the advertisement investment to the sales of the company and extract some sort of correlation between both, following a structure like this one:

That would make things easy right? We could just run the TV ads and extract some (possibly) linear relationship between costs and sales. Easy!
But what if the real relationship that explains Sales is something like this:

Where "?" is some unknown variable that might be influencing our sales and we are not even aware of. Or it could just as possible be the case that we had a direct chain structure like so:

Again we would find wrong relationships between the variables, which could potentially lead to disastrous decisions (like investing in campaigns that are not really profitable). If we mistakenly conclude that the relationship is twice what it actually is, it might take a huge amount of time and money until the company finally realizes that something is wrong.
Let’s think about what the ideal solution would look like. It would probably involve running an A/B test of some sort and comparing results from the control group against the test group. The reason is that a well-designed A/B test eliminates sources of bias and leads to precise results. Graphically, it’d be something like this:

Where each α represents some variable that might affect the sales of the company, such as investments on operations or marketing for instance.
If we run the A/B test, all those α’s are present in both groups except for αᵢ, which represents the variable we want to investigate. As an example, in our TV advertisement challenge, we could track sales in a control group where no advertisement is running against a test group where customers are being exposed to the television campaign.
Mathematically, we compute the posterior distribution of the difference between the performances of both groups and see how this random variable behaves. If the distribution of B − A has most of its probability mass above 0 (zero), we have an indication that B is performing better than A.
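As a toy illustration of that statement (the Beta-Binomial conversion-rate model and the numbers below are my own assumptions, not part of the original example), we could approximate the posterior of B − A with samples and check how much of its mass lies above zero:
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical A/B results: conversions out of total visitors in each group.
a_conv, a_total = 430, 5000
b_conv, b_total = 470, 5000

# Beta(1, 1) priors updated with the observed counts give the posterior conversion rates.
a_samples = rng.beta(1 + a_conv, 1 + a_total - a_conv, size=100000)
b_samples = rng.beta(1 + b_conv, 1 + b_total - b_conv, size=100000)

# Share of the posterior mass of (B - A) that lies above zero.
print((b_samples - a_samples > 0).mean())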


In theory, this would be the best approach for this challenge, as we can isolate the causal effects of the advertisement variable on Sales. But this is not really feasible. Most of the time we simply can’t separate the audience into two distinct groups, as is the case for the TV commercial: there’s no viable way to show the campaign to one group and not the other, since both will be watching the same TV offerings.
If we only had, somehow, the control group, things would be easier. Well, that’s actually how we begin to solve the Causal Impact challenge. So here’s what we’re going to do: given that we can’t create a control group, let’s simulate one!
Yep, that’s right. We’ll simulate what the control group would look like had no advertisement (or, more broadly speaking, no intervention) taken place. For what’s coming next, we’ll need to learn a few things about modeling time series data.
Bayesian Structural Time Series
Suppose we collected daily data on the company’s sales along with the investments made in operations throughout the month, and we start our television campaign on the 18th day of the month:

If we somehow could come up with a model that explains how Sales evolves over time, we’d be able to project what the revenue of the company would have been without the television ads running:

Having our forecasts and their respective confidence intervals, we’d be able to compare our expectations to what really happened and compute from the difference what impact the Advertisement variable might have caused on Sales. Our challenge then becomes a task of figuring out how to better model time series data so we can build the forecast points.
There are several methods for building models on time series data, ranging from simpler methods that extract various statistics such as auto-correlation or moving averages to more sophisticated algorithms that use state-of-the-art deep learning nets. The one we’ll be using is the Bayesian Structural Time Series, or bsts for short. The motivation is that this algorithm is flexible enough to model a wide range of possibilities while offering a Bayesian framework which allows us to add our own assumptions about the world (known as Bayesian priors).
In essence, it works by using latent variables that help describe the observed data. Such variables evolve through time by following certain rules, all of which tend to be quite simple and have a straightforward implementation. Here is the mathematical formulation: suppose the time series we want to model is given by y_t. It follows from bsts that:
y_t = Z_t^T α_t + ε_t,        ε_t ~ N(0, σ_t²)        (1)
α_{t+1} = T_t α_t + R_t η_t,   η_t ~ N(0, Q_t)         (2)
α_1 ~ N(a_1, P_1)                                       (3)
These equations seem scary but they are actually simple and powerful as they allow for the modeling of an extensive range of possibilities including auto-regressive and moving averages models among many others.
First, the "alphas" represent latent structural components. They work as hidden variables whose time dynamics we define in accordance to the observed data being modeled. We begin with the initial state given by equation 3 and use equation 2 to keep adapting the states for each step t. After the adaptation, a new linear transformation represented by Z is applied and a noise is added (the epsilon value) which models our observed time series yt.

While we won’t get into the details of the mathematics behind finding the posterior of the latent variables’ distribution, this post from Wei Yi does an excellent job at explaining what’s happening behind the scenes in the TensorFlow Probability implementation, which is the one we’ll be using soon.
Here’s a list of the structural models TensorFlow Probability offers us, which we can add or remove as we wish to better explain the data:
- _AutoRegressive_: Each new state space point receives a value that is a linear combination of the previous states.
- _Dynamic Regression_: Adds a component that is a linear combination of covariates that further help explain the observed data. The dynamic part is implemented through the addition of a Gaussian random walk.
- _LocalLevel_: Represents essentially a random walk. This component will be used as the default by tfcausalimpact later on; it works as a trade-off between how well our covariates explain the data and how much the random walk dominates the state space, i.e., how much the observed data behaves like random fluctuation over time.
- _Seasonal_: Models patterns that repeat over time such as weekly or daily seasons.
- _LocalLinearTrend_: Similar to local level but it adds another slope component which models a constantly increasing (or decreasing) pattern from data.
- _SemiLocalLinearTrend_: Similar to local linear trend but the slope model follows an auto regressive component of first order.
- _SmoothSeasonal_: Also models seasonal effects, this time specified in terms of multiples of a base frequency.
Implementing this concept on top of TensorFlow Probability is quite straightforward. Here’s an example that combines a local level with a linear regression to run forecasts on observed simulated data:
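The gist isn’t embedded in this version of the post, so here’s a minimal sketch of that idea (the simulated series, design matrix and forecast horizon are assumptions, and API details may vary slightly across TensorFlow Probability versions):
import numpy as np
import tensorflow_probability as tfp

sts = tfp.sts

# Simulated data: one covariate plus a small random walk (assumed, just for illustration).
# The design matrix also covers the forecast horizon, which TFP requires for regressions.
rng = np.random.RandomState(1)
covariate = rng.normal(size=230).astype(np.float32)
observed = (1.5 * covariate[:200] + np.cumsum(rng.normal(scale=0.1, size=200))).astype(np.float32)

# Build the structural components and combine them through tfp.sts.Sum.
level = sts.LocalLevel(observed_time_series=observed)
regression = sts.LinearRegression(design_matrix=covariate.reshape(-1, 1))
model = sts.Sum([level, regression], observed_time_series=observed)

# Fit with Hamiltonian Monte Carlo and forecast 30 steps ahead using the Markov samples.
samples, _ = tfp.sts.fit_with_hmc(model, observed_time_series=observed)
forecast_dist = tfp.sts.forecast(model, observed_time_series=observed,
                                 parameter_samples=samples, num_steps_forecast=30)
forecast_mean = forecast_dist.mean().numpy()[..., 0]
forecast_scale = forecast_dist.stddev().numpy()[..., 0]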
Results in the following plot:

The overall idea is to build all the structural components that we want to represent our model with and then combine them through the tfp.sts.Sum operation by sending the components inside a list container.
After that, we just fit the model and use the Markov samples to run forecasts and extract confidence intervals.
It’ll be our job to model the observed data with the best set of structural components possible; the better the model is, the more precise we’ll be when building the representation of a control group, i.e., a description of what the data would look like had our variable under investigation not assumed its current value.
All ideas necessary for running this simulated A/B test have been implemented on tfcausalimpact so let’s see now how to use this library.
tfcausalimpact
At this point, the necessary knowledge for helping my coworker solve the TV advertisement challenge has already been introduced. What we need now is a framework to combine all those ideas and compute for us the various statistical analyses that support our decision process, such as estimating the impact of the ads and how confident we are in those estimations.
Google implemented a powerful library for running Causal Inference on top of the R language. For those also working with Python (the main focus of this article), we now have tfcausalimpact, which also fits a Bayesian structural model on past observed data and compares forecasts against the real response. Let’s see some examples of how to use it.
Simple Example
Let’s run a simple simulated dataset to see what happens. We’ll be using the simulated dataset _arma_data.csv_ available in the tests/fixtures folder:

Running causal impact on this dataset uses just a few lines of code:
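The snippet isn’t embedded here; a sketch close to the library’s getting-started example might look like this (the file path, the column selection and the index-based periods are assumptions on my part):
import numpy as np
import pandas as pd
from causalimpact import CausalImpact

data = pd.read_csv('tests/fixtures/arma_data.csv')[['y', 'X']]

# Simulate an intervention: from point 70 up to 77, add a decreasing bump from 7 down to 0.
data.iloc[70:78, 0] += np.arange(7, -1, -1)

pre_period = [0, 69]
post_period = [70, 99]

ci = CausalImpact(data, pre_period, post_period)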
Notice that from point 70 up to 77 we add a decreasing summation from 7 (seven) down to 0 (zero). This is to simulate what running a marketing campaign on TV would probably look like: at first, there’s a peak of traffic on sales and it fades away with time (most interventions follow this pattern, which is why the default model is the local level).
The package gives us a statistical summary of the results; just run:
print(ci.summary())
which gives us:
Posterior Inference {Causal Impact}
Average Cumulative
Actual 121.16 3634.86
Prediction (s.d.) 120.32 (0.35) 3609.45 (10.41)
95% CI [119.65, 121.01] [3589.46, 3630.28]
Absolute effect (s.d.) 0.85 (0.35) 25.41 (10.41)
95% CI [0.15, 1.51] [4.59, 45.4]
Relative effect (s.d.) 0.7% (0.29%) 0.7% (0.29%)
95% CI [0.13%, 1.26%] [0.13%, 1.26%]
Posterior tail-area probability p: 0.01
Posterior prob. of a causal effect: 98.8%
For more details run the command: print(impact.summary('report'))
Let’s focus on the key numbers. First, we have the absolute effect, which is an estimation of how much, on average, the impact added to y in comparison to the counter-factual series. In this case, it predicts that the difference between what really happened and the counter-factual is close to 1. In relative terms (normalized by the counter-factual mean) it’s an estimated increase of 0.7%. Finally, the posterior probability of a causal effect is 98.8%, above the usual 95% threshold (equivalent to a 5% significance level), which means the results are reliable from a statistical point of view.
Here is the plot that helps us visualize it:
ci.plot()

It’s important to keep in mind: the covariates X cannot be affected by the intervention variable. That is, if in our previous example the response variable is Sales of the company and X is investments on the operations area then the intervention variable cannot affect the operations one; if it does, then the linear regression would also be affected and the counter-factual prediction wouldn’t reproduce what would have been observed had no intervention taken place.
Without being careful, one can conclude that signal was observed in the impact inference when in reality it’s only noise being evaluated. Here’s an interesting example that better illustrates this. The following code is used to generate random walk datasets:
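The original generator isn’t reproduced here; a sketch of what it could look like (column names and sizes are assumptions) is:
import numpy as np
import pandas as pd

def random_walk_dataset(n_points=100, n_series=2, seed=None):
    # Each column is an independent random walk: the cumulative sum of Gaussian steps.
    rng = np.random.RandomState(seed)
    walks = np.cumsum(rng.normal(size=(n_points, n_series)), axis=0)
    columns = ['y'] + [f'X{i}' for i in range(1, n_series)]
    return pd.DataFrame(walks, columns=columns)

# y and X1 share no causal link, yet repeated CausalImpact runs on data like this
# will occasionally report a "significant" effect purely by chance.
data = random_walk_dataset(seed=3)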
It’s possible to sample data from this distribution and eventually find these results:


These images reinforce why we need to be careful when running causal impact analyses (or any other statistical study, for that matter); using the wrong models and assumptions can lead to quite costly conclusions.
The default structural model that tfcausalimpact uses is a combination of a local level with a linear regression, or mathematically:
y_t = μ_t + β X_t + ε_t,      ε_t ~ N(0, σ_ε²)
μ_{t+1} = μ_t + η_t,          η_t ~ N(0, σ_μ²)
This model is appropriate for impacts that are expected to last a few data points into the future and then fade away (that’s why it’s called a "local" level). There’s a default Bayesian prior for the standard deviation σ of μ whose value is 0.01, which means that a priori we expect the observed time series under investigation to be well-behaved and well explained by the covariates being used. This is not always the case, and for noisier data (more random fluctuation) one can use a prior of 0.1 like so:
ci = CausalImpact(data, pre_period, post_period,
                  model_args={'prior_level_sd': 0.1})
One can also choose other structural models instead; we’ll see more on that soon. Let’s now use some real data to run causal impact inference and see how it unfolds.
Bitcoin: Money of the Future?
Those into technology and software development probably find Bitcoin quite exciting. Not only does its implementation seem quite sophisticated and complex, but the concept has also faced every sort of criticism one can imagine and still prevailed so far.
Just recently the Bitcoin community received the news that PayPal would be accepting the cryptocurrency on its platform. So let’s use this information and data to evaluate what impact this news had on Bitcoin prices. First, let’s query some data:
! pip install pandas-datareader
Preparing our dataset (data cleaning, removing duplicates, filling empty values, fixing indexes and so on):
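The query and cleaning code isn’t shown in this version of the post; a sketch with pandas-datareader might look like the following (the Yahoo Finance source, the specific tech tickers plus gold as covariates, and the weekly Wednesday resampling are assumptions based on the description and on the seven regression weights printed at the end of the post):
import pandas_datareader.data as web

start, end = '2018-01-03', '2020-11-25'

# Response: Bitcoin prices. Covariates: a few tech companies plus gold (tickers assumed).
tickers = ['BTC-USD', 'AAPL', 'GOOG', 'MSFT', 'AMZN', 'NVDA', 'FB', 'GLD']
raw = web.DataReader(tickers, 'yahoo', start, end)['Close']

data = raw.rename(columns={'BTC-USD': 'y'})
data = data[~data.index.duplicated()]            # remove duplicated index entries
data = data.resample('W-WED').last().ffill()     # weekly frequency, filling empty values
data = data.dropna()
data = data[['y'] + [c for c in data.columns if c != 'y']]   # response first, covariates after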

If we run Causal Impact with this input, here’s what we get:
pre_period=['2018-01-03', '2020-10-14']
post_period=['2020-10-21', '2020-11-25']
ci = CausalImpact(data, pre_period, post_period)
Summary:
Posterior Inference {Causal Impact}
Average Cumulative
Actual 15434.73 92608.35
Prediction (s.d.) 10791.02 (936.1) 64746.11 (5616.59)
95% CI [8975.75, 12645.19][53854.51, 75871.15]
Absolute effect (s.d.) 4643.71 (936.1) 27862.24 (5616.59)
95% CI [2789.53, 6458.97] [16737.2, 38753.84]
Relative effect (s.d.) 43.03% (8.67%) 43.03% (8.67%)
95% CI [25.85%, 59.86%] [25.85%, 59.86%]
Posterior tail-area probability p: 0.0
Posterior prob. of a causal effect: 100.0%
For more details run the command: print(impact.summary('report'))
Plots:

We clearly see that the news did cause a positive (and considerable) impact on Bitcoin prices, with the hypothesis testing being quite confident about the results. Still, notice that the 95% interval is somewhat large (it goes from 25% up to 60%); one way of improving this is finding more covariates that help predict Bitcoin prices.
One interesting technique we may use to confirm everything is working is running some back-testing analyses on the data. For instance, let’s select pre_period and post_period values where we know there was no known source of impact and see what happens:
pre_period=['2018-01-03', '2018-12-05']
post_period=['2018-12-12', '2019-02-06']
ci = CausalImpact(data, pre_period, post_period)
Summary:
Posterior Inference {Causal Impact}
Average Cumulative
Actual 3689.89 33208.98
Prediction (s.d.) 4033.26 (600.35) 36299.36 (5403.19)
95% CI [2882.4, 5235.74] [25941.58, 47121.68]
Absolute effect (s.d.) -343.38 (600.35) -3090.38 (5403.19)
95% CI [-1545.85, 807.49] [-13912.69, 7267.41]
Relative effect (s.d.) -8.51% (14.89%) -8.51% (14.89%)
95% CI [-38.33%, 20.02%] [-38.33%, 20.02%]
Posterior tail-area probability p: 0.27
Posterior prob. of a causal effect: 72.63%
For more details run the command: print(impact.summary('report'))
Notice that the p-value is 27%, which indicates that there was no observable impact in that period, as expected. This is one indication that the covariates are doing a good job at explaining the response. Still, a good practice is to run several back-tests on several pieces of the data (if there are enough points for doing so) and check the proportion of runs in which an impact is wrongly detected (if it stays below 5%, the model is working as expected); a rough sketch of such a loop is shown below.
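Something along these lines (the window choices are arbitrary, and the p_value attribute is assumed to be exposed by the fitted object, consistent with the tail-area probability printed in the summaries above):
from causalimpact import CausalImpact

# Back-test windows chosen over stretches where no intervention is known to have happened
# (the specific dates are arbitrary examples).
windows = [
    (['2018-01-03', '2018-12-05'], ['2018-12-12', '2019-02-06']),
    (['2019-02-13', '2019-12-04'], ['2019-12-11', '2020-02-05']),
]

false_positives = 0
for pre_period, post_period in windows:
    ci = CausalImpact(data, pre_period, post_period)
    if ci.p_value < 0.05:    # assumed attribute holding the posterior tail-area probability
        false_positives += 1

print(f'Proportion of spurious detections: {false_positives / len(windows):.0%}')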
For plotting the decomposition of the contributions for each state component used in the structural model, this helper function does the trick:
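The helper isn’t reproduced in this version of the post; one way of sketching it is with tfp.sts.decompose_by_component, assuming the fitted structural model and the pre-period data are exposed on the ci object (ci.model_samples is shown later in this post, while ci.model and ci.pre_data are assumptions here):
import matplotlib.pyplot as plt
import tensorflow_probability as tfp

def plot_component_contributions(ci):
    # Posterior decomposition of the pre-period series into the fitted structural components.
    observed = ci.pre_data.iloc[:, 0].astype('float32')
    # Assumes ci.model_samples holds posterior samples in the same order as ci.model.parameters.
    component_dists = tfp.sts.decompose_by_component(
        ci.model,
        observed_time_series=observed.values,
        parameter_samples=list(ci.model_samples.values()))
    fig, axes = plt.subplots(len(component_dists), 1, sharex=True, squeeze=False,
                             figsize=(10, 3 * len(component_dists)))
    for (ax,), (component, dist) in zip(axes, component_dists.items()):
        ax.plot(observed.index, dist.mean().numpy())
        ax.set_title(component.name)
    plt.tight_layout()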
For our PayPal example, here’s the result:

It’s interesting to observe that after March 2020 the linear predictor becomes less and less certain in its forecasts, as expected given the effects of the pandemic.
Let’s see now what happens if we try to improve upon the default model. Before doing so, we can use statsmodels decomposition to extract some ideas from data, like so:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(btc_data.loc[:'2020-10-14'].iloc[:, 0])
result.plot();
Gives us the following plots:

The data doesn’t seem to indicate a clear linear trend component (it seems to be better modeled through the random walk of the local level model), and the seasonal component is also not big enough to better explain the data.
Still, let’s play around and use a customized model to test some suppositions. While this won’t lead to very useful results, it still helps us understand how to build our own structural model. Here’s one way of doing so:
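A sketch of such a customized model (the specific components, like the seasonality chosen for the weekly data, are illustrative assumptions; passing the pre-built model through the model argument follows the library’s custom-model usage):
import tensorflow_probability as tfp
from causalimpact import CausalImpact

pre_period = ['2018-01-03', '2020-10-14']
post_period = ['2020-10-21', '2020-11-25']

# The trend and seasonal components are fitted on the pre-intervention response only...
observed = data.loc[:pre_period[1]].iloc[:, 0].astype('float32')

# ...but the regression's design matrix must contain the union of pre and post data.
design_matrix = data.iloc[:, 1:].values.astype('float32')

trend = tfp.sts.LocalLinearTrend(observed_time_series=observed)
seasonal = tfp.sts.Seasonal(num_seasons=52, observed_time_series=observed)   # yearly cycle on weekly points (assumed)
regression = tfp.sts.LinearRegression(design_matrix=design_matrix)
model = tfp.sts.Sum([trend, seasonal, regression], observed_time_series=observed)

ci = CausalImpact(data, pre_period, post_period, model=model)
print(ci.summary())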
Notice that the structural components (such as the linear level and the seasons) keep being added and then sent as input to tfp.sts.Sum(). The linear regression must contain the union of pre and post data, as required by TensorFlow Probability.
Building customized models is certainly a more advanced feature and will only be required when the default model doesn’t do a good job at modeling the observed data. Here are the results for the above customized model:
Posterior Inference {Causal Impact}
Average Cumulative
Actual 15434.72 92608.34
Prediction (s.d.) 11714.66 (1886.01) 70287.98 (11316.03)
95% CI [8216.98, 15609.99][49301.89, 93659.93]
Absolute effect (s.d.) 3720.06 (1886.01) 22320.37 (11316.03)
95% CI [-175.26, 7217.74] [-1051.58, 43306.45]
Relative effect (s.d.) 31.76% (16.1%) 31.76% (16.1%)
95% CI [-1.5%, 61.61%] [-1.5%, 61.61%]
Posterior tail-area probability p: 0.03
Posterior prob. of a causal effect: 97.0%
For more details run the command: print(impact.summary('report'))
And plot:

Thanks to the seasonal effects, the uncertainty of the predictions increases considerably, which leads to those plots with wide uncertainty ranges at the beginning. Notice from the summary that this new model is less certain about the real impact; it still sees that something happened, but its confidence intervals are wider as a consequence of the wrong assumptions used when building the customized model.
Another important approach when trying to improve results is changing the back-end algorithm for finding the posterior of each component. The default algorithm in tfcausalimpact uses Variational Inference methods which are fast but less precise. Another possibility is to use Hamiltonian Monte Carlo algorithm which is considered the State of the Art when it comes to Bayesian adaptation. But the price we pay is that it’s really slow and demands a lot of hardware resources.
An interesting approach is to use the default variational inference algorithm for most of the work and, when precision is highly required, perform a final run with the Hamiltonian technique after all tests have been done. Still, keep in mind that even running on a GPU the method can take in the range of hours to converge, so it’s that heavy. For our Bitcoin example, here’s how to select the algorithm:
from causalimpact import CausalImpact
pre_period=['2018-01-03', '2020-10-14']
post_period=['2020-10-21', '2020-11-25']
ci = CausalImpact(data, pre_period, post_period, model_args={'fit_method': 'hmc'})
Here’s the new summary:
Posterior Inference {Causal Impact}
Average Cumulative
Actual 15434.72 92608.34
Prediction (s.d.) 10464.95 (854.01) 62789.68 (5124.05)
95% CI [8765.46, 12113.11][52592.77, 72678.68]
Absolute effect (s.d.) 4969.78 (854.01) 29818.66 (5124.05)
95% CI [3321.61, 6669.26] [19929.66, 40015.58]
Relative effect (s.d.) 47.49% (8.16%) 47.49% (8.16%)
95% CI [31.74%, 63.73%] [31.74%, 63.73%]
Posterior tail-area probability p: 0.0
Posterior prob. of a causal effect: 100.0%
For more details run the command: print(impact.summary('report'))
As we can see, it’s not a huge difference but the convergence is a bit more precise and certain. Running this on Google Colab with a GPU activated took 20 minutes.
As a final note, all results from the Bayesian processing happening on the back-end are available for later inspection in the ci object, including the samples for each component of the model. For printing the averages of the posterior for each component, just run:
for name, values in ci.model_samples.items():
    print(f'{name}: {values.numpy().mean(axis=0)}')
The default model returns:
observation_noise_scale: 0.38779693841934204
LocalLevel/_level_scale: 0.13807716965675354
LinearRegression/_weights: [-0.06091186 -0.1838356 0.21132585 0.7798438 -0.54009306 0.39839688
0.27327323]
observation_noise_scale is the ε discussed in the bsts section. The local level is the standard deviation that sets how the random walk fluctuates over time; despite the prior level being 0.01, it still converged to 0.13. Finally, the linear regression gives us an idea of the linear weights the optimization obtained relating the tech companies (as well as gold) to Bitcoin prices.
Conclusion
Working with TensorFlow Probability and building this framework for causal inference was quite fun and challenging. Feels like new Machine Learning powers have been acquired along the way.
Analyses that involve finding causal relationships between variables and their effects on each other can potentially be solved by using this simulated A/B test framework, such as the TV commercial challenge proposed by my colleague.
In fact, in some cases, it may be quite expensive or even prohibitive to run an A/B test. As an example, I remember we once had a challenge to optimize price values for each product in the catalog; setting up an A/B test proved to be quite risky (mistakes in setting prices can have disastrous consequences) and so expensive that the only viable way of assessing whether the new algorithms were working was to evaluate the causal impact the new pricing strategy caused over time.
Hopefully the new library will be as helpful for those working in Python as the original is for the R community. If you have any issues while using it, please let us know by opening an issue on its official repository.
Well, that’s it for now. Time for some resting and preparing for a new adventure 😅 !
And, as always, see you next mission!
References
https://github.com/google/CausalImpact
https://www.jmlr.org/papers/volume19/18-009/18-009.pdf
https://www.tensorflow.org/probability
https://towardsdatascience.com/demystifying-tensorflow-time-series-local-linear-trend-9bec0802b24a
http://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html
https://blog.tensorflow.org/2019/03/structural-time-series-modeling-in.html
http://krasserm.github.io/2019/03/14/bayesian-neural-networks/
https://towardsdatascience.com/synthetic-instrumental-variables-968b12f68772
https://towardsdatascience.com/youre-plotting-the-wrong-things-3914402a3653