Inferring causality in time series data

A concise review of the major approaches.

Shay Palachy Affek
Towards Data Science

--

The question of what event caused another, or what brought about a certain change in a phenomenon, is a common one. Examples include whether a drug caused an improvement in some medical condition (versus the placebo effect, additional hospital visits, etc.), tracking down the cause for a malfunction in an assembly line or determining what caused an upsurge in a website’s traffic.

While a naive interpretation of the problem may suggest simple approaches, like equating causality with high correlation, or inferring the degree to which x causes y from how good a predictor x is of y, the problem turns out to be much more complex. As a result, rigorous ways to approach this question were developed in several fields of science.

The task of causal inference divides into two major classes:

  1. Causal inference over random variables, representing different events. The most common example is two variables, each representing one alternative of an A/B test, and each with a set of samples/observations associated with it.
  2. Causal inference over time series data (and thus over stochastic processes). Examples include determining whether (and to what degree) aggregate daily stock prices drive (and are driven by) daily trading volume, or causal relations between volumes of Pacific sardine catches, northern anchovy catches, and sea surface temperature.

This post deals only with the second class of problems.

Scope

This post is meant to provide a concise technical review of the major approaches found in academic literature and online resources for inferring causality in time series data, of the methods derived from them, and of their implementations in code.

It aims to touch upon both (1) classical statistical approaches, created mainly in the econometrics field of research, including modern developments, and (2) adaptations and original approaches coming from various other research communities, such as those dealing with dynamic systems or information theory.

The general subject of causal inference is both too large and not directly applicable enough to cover in this post. The same is true for the intersection between causal inference in general (which in many cases is done on general probability distributions, or their samples, and not on time series data) and machine learning. Nevertheless, I have included a few notable resources I encountered on these topics in the Other Notable Literature section.

Organization & Notation

General remarks are sometimes given in quote format in the body of the text, while optional notes are given in clickable footnotes (each with a link to send you back to its origin). A clickable table of contents is also provided to assist in navigation. I’ve added the 📖 link to the header of each section; click on it to quickly return to the table of contents. Finally, complete references for the literature used accompany the post.

Required background

The post is written in technical language, and while I obviously cannot go as deep as the academic papers referenced throughout this post, I will not shy away from including equations, notation and text that require a technical background to understand.

As such, a background of — or equivalent to — at least a thorough undergraduate theoretical course in probability and statistics, including the required mathematical background, is assumed. If you are not familiar with stochastic process theory, you can find a concise review of it in my post on stationarity in time series data. Familiarity with this theory is required for further reading, as it is the framework upon which statistical notions of causality in time series are built.

Table of Contents

  1. Background: Notions of causality in time series data
    - Granger causality
    - Sims causality
    - Structural causality
    - Intervention causality
  2. Classical methods for causality inference in time series data
    - Non-directional lagged interactions
    - Parametric VAR-based tests for Granger causality
  3. Alternative parametric Granger causality measures for time series data
    - Conditional Granger Causality Index (CGCI)
    - MLP-based F-test for Granger causality
    - RBF models for Granger causality
    - Partial Granger Causality Index (PGCI)
    - Directed coherence measures
  4. Alternative non-parametric causality measures for time series data
    - The Bouezmarni-Taamouti test
  5. Chaos and dynamic system theory approaches for causality inference in time series data
    - The Hiemstra-Jones test
    - The Diks-Panchenko test
    - Extended Granger Causality Index (EGCI)
    - Convergent cross mapping (CCM)
  6. Information theoretic approaches to causality inference in time series data
    - Coarse-grained trans-information rate (CTIR)
    - Transfer entropy based measures
    - Mutual Information from Mixed Embedding (MIME)
  7. Graphical approaches for causality inference in time series data
    - Causal graph search algorithms (SGS, PC and FCI)
    - PCMCI
    - Lasso-Granger
    - Copula-Granger
    - Forward-Backward Lasso Granger (FBLG)
  8. Choosing which approach to use
  9. Researchers to follow
  10. Other notable literature
  11. References
    - Academic Literature: Causality and causality inference
    - Academic Literature: Causality inference in time series data
    - Academic Literature: Other
    - Other Online Sources
  12. Footnotes

Background: Notions of causality in time series data

Throughout the years a number of different notions of causality were suggested by scholars of statistics and economics. I give here an overview of the major ones. This part is based, in part, on [Eichler, 2011] and [Runge, 2014]. 📖

Granger causality

The earliest concept of causality for time series data was suggested by Granger [1969, 1980], building on a notion from [Wiener, 1956]. It is based on contrasting the ability to predict a stochastic process Y using all the information in the universe, denoted with U, with doing the same using all information in U except for some stochastic process X; this is denoted with U\X. The core idea is that if discarding X reduces the predictive power regarding Y, then X contains some unique information regarding Y, and we thus say that X Granger-causes Y.

More formally:

  • Let X and Y be stationary stochastic processes.
  • Denote with 𝒰ᵢ=(Uᵢ₋₁,…,Uᵢ₋∞) all the information in the universe until time i, and with 𝒳ᵢ=(Xᵢ₋₁,…,Xᵢ₋∞) all information in X until time i.
  • Denote with σ²(Y|𝒰ᵢ) the variance of the residual of predicting Yᵢ using 𝒰ᵢ at time i.
  • Denote with σ²(Y|𝒰ᵢ\𝒳ᵢ) the variance of the residual of predicting Yᵢ using all information in 𝒰ᵢ at time i except 𝒳ᵢ.

Definition 1: If σ²(Y|𝒰ᵢ) < σ²(Y|𝒰ᵢ\𝒳ᵢ) then we say that X Granger-causes Y, and write X⇒Y.

Definition 2: If X⇒Y and Y⇒X we say that feedback is occurring, and write X⇔Y.

As noted by Granger himself, the requirement of having access to all the information in the universe is extremely unrealistic. In practice U is replaced by a limited set 𝕏 of observed time series, with X∈𝕏, and the above definition reads X Granger-causes Y with respect to 𝕏.

Finally, this definition does not specify the prediction method used for σ², and thus allows for both linear and non-linear models, but the use of the variance to quantify the closeness of prediction restricts this notion of causality to causality in mean.

This notion is usually referred to as strong Granger causality; other related notions of causality are Granger causality in mean [Granger 1980, 1988] and linear Granger causality [Hosoya 1977, Florens and Mouchart 1985].
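
To make Definition 1 concrete, here is a minimal sketch — not part of Granger’s formulation, and with all series and lag lengths illustrative — contrasting the residual variance of an autoregressive model of Y with and without past values of X (the observed set {X, Y} standing in for U):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system in which X drives Y with a one-step lag.
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def residual_variance(target, predictors):
    """Variance of OLS residuals of regressing target on predictors."""
    X = np.column_stack([np.ones(len(target)), predictors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ beta)

p = 2  # illustrative lag length
target = y[p:]
lags_y = np.column_stack([y[p - i:n - i] for i in range(1, p + 1)])
lags_x = np.column_stack([x[p - i:n - i] for i in range(1, p + 1)])

var_without_x = residual_variance(target, lags_y)                    # proxy for σ²(Y|𝒰\𝒳)
var_with_x = residual_variance(target, np.hstack([lags_y, lags_x]))  # proxy for σ²(Y|𝒰)

# A drop in residual variance when lags of X are included suggests that
# X Granger-causes Y with respect to the observed set {X, Y}.
print(var_without_x, var_with_x)
```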

Instantaneous causality: A related kind of causality, modifying Granger causality slightly, is instantaneous causality [Price, 1979]. We say that X and Y have instantaneous causality between them if, at time i, adding Xᵢ to the information set helps to improve the predicted value of Yᵢ.

More formally:

  • Let X and Y be stationary stochastic processes.
  • Denote with 𝒰ᵢ=(Uᵢ₋₁,…,Uᵢ₋∞) all the information in the universe until time i, and with 𝒳ᵢ=(Xᵢ₋₁,…,Xᵢ₋∞) all information in X until time i (in both cases, not including information from time i itself).

Definition 3: If σ²(Y|𝒰ᵢ∪{Xᵢ}) < σ²(Y|𝒰ᵢ) then we say that there is instantaneous causality between X and Y.

Note that this type of causality is not directional, but rather symmetric: it can be shown that if the above definition holds, then the symmetric statement — that σ²(X|𝒰ᵢ∪{Yᵢ}) < σ²(X|𝒰ᵢ) — also holds (see [Lütkepohl, 2007] for proof). Thus, we do not say that X instantaneously causes Y, but rather that there is instantaneous causality between X and Y.

Multi-step causality: In a bivariate system, if the 1-step-ahead forecasts of one variable cannot be improved by using the information in the other variable, the same holds for all h-step forecasts, h=1,2,…, and so the 1-step-ahead criterion is sufficient to define Granger causality. This result no longer holds if the information set contains additional variables. [Lütkepohl and Müller, 1994]

Thus, in a multivariate system, we say that variable Xᵢ is h-step causal for another variable Yᵢ if the information in Xᵢ helps improve the j-step forecasts of Yᵢ for some j=1, 2, …, h.

Sims causality

In an influential paper, [Sims, 1972] showed — in the context of covariance stationary processes, and restricted to linear predictors — that in the bivariate case the definition of Granger causality is equivalent to parameter restrictions of the moving average or distributed lag representations of the processes x[t], y[t]. When the system is covariance stationary it can be represented as:

x[t] = Σⱼ₌₀^∞ (aⱼ·u[t−j] + bⱼ·v[t−j])
y[t] = Σⱼ₌₀^∞ (cⱼ·u[t−j] + dⱼ·v[t−j])

Equation 8: The Sims representation for covariance stationary processes

where aⱼ, bⱼ, cⱼ and dⱼ are constants and u[t] and v[t] are mutually uncorrelated white noise processes. Sims shows that the condition that x[t] does not Granger-cause y[t+1] is equivalent to cⱼ or bⱼ being chosen identically zero for all j.

In contrast to Granger’s definition, which considers temporal precedence in the form of a link from the past to the present, Sims’ notion considers temporal precedence in the form of a link from the present to the future. As a consequence, the potential causal relationship considered runs from the dependent variable to future values, or ‘leads’, of the regressor.

While at the time of its introduction in [Sims, 1972] it was presented as equivalent to Granger’s definition, it has since been contrasted with it and shown to be inequivalent when the measure of uncorrelatedness of time series used is independence [Florens and Mouchart, 1982]; rather, Granger causality is the stronger condition: Granger causality implies Sims causality, but the inverse is not true.

In spite of this inequivalence, most statistical tests for causal inference in time series data focus on Granger’s definition. However, at least in the case where a vector autoregressive (VAR) model is used, these tests can be modified to test for Sims’ causality (see here for an example highlighting the difference between the tests for the linear case).

Structural Causality

Introduced by White and Lu (2010), structural causality assumes that the data-generating process (DGP) has a recursive dynamic structure in which predecessors structurally determine successors. Specifically, for two processes X — the potential cause — and Y — the response, we assume they are generated by

Equation 9: The structural causality DGP

for all t∈ℤ. Here, the process Z includes all relevant observed variables, while the realizations of U=(Uₓ, Uᵧ) are assumed to be unobserved, and the functions q[x,t] and q[y,t] are assumed to be unknown.

Observe that this dynamic structure is general, in that the structural relations may be nonlinear and non-monotonic in their arguments and non-separable between observables and unobservables. The unobservables may be countably infinite in number. Finally, this system may generate stationary processes, non-stationary processes, or both.

A structural notion of causality can then be defined on top of this DGP; see [White and Lu, 2010] for the formal definition.

The authors then go on to analyze the relations between Granger causality and this notion of structural causality. Additionally, building on classical notions of Granger causality, they introduce two extensions of it: weak Granger causality and retrospective weak Granger causality.

Finally, the authors construct practical tests for their two notions of Granger causality. Specifically, weak Granger causality is shown to be detectable by testing for the conditional independence of the response Y and the potential cause X given the history of the response and the near-histories of the observable covariates of the processes.

Intervention causality

The use of intervention as the basis for a statistical theory of causality inference, as championed by Judea Pearl, can be traced back at least to the early ’90s [Pearl and Verma, 1991] [Pearl, 1993]. Its application to time series data, however, has begun to receive rigorous treatment only recently [White, 2006] [Eichler and Didelez, 2007]. This approach to causality is closely related to the use of impulse response analysis in economics.

Eichler and Didelez define a set of possible intervention regimes corresponding to different possible types of interventions in a multivariate stationary time series X with d components. Interventions are denoted by the intervention indicator σ, which takes values in {∅, s∈𝓢}; I use Xₐ to denote a component of X and Xᶸ to denote a subset of components of X, where a∈V and U⊂V for V={1,…,d}. I also use σₐ to denote an intervention in component Xₐ.

Intervention types are:

  1. Idle regime: When σ(t) = ∅, X(t) arises naturally without intervention. A.k.a. the observational regime.
  2. Atomic interventions: Here 𝓢=𝒳, the domain of Xₐ, and σₐ(t) = x⦁ denotes an intervention forcing Xₐ(t) to assume the value x⦁.
  3. Conditional intervention: Here 𝓢 consists of functions g(xᶸ(t−1))∈𝒳, U⊂V, such that σₐ(t)=g means Xₐ(t) is forced to take on a value that depends on past observations Xᶸ(t−1).
  4. Random intervention: Here 𝓢 consists of distributions over the domain of Xₐ, meaning that Xₐ(t) is forced to arise from such a distribution.

Then, under some assumptions ensuring an intervention is an isolated exogenous change of the system, the average causal effect (ACE) of interventions in X according to strategy s on the response variable Y[t′] is defined to be (assuming w.l.o.g. that 𝔼[Y[t′]]=0):

ACE[s] = 𝔼[s](Y[t′]) − 𝔼[∅](Y[t′]) = 𝔼[s](Y[t′])

Equation 10: The average causal effect (ACE) of interventions according to strategy s

Thus, ACE[s] can be regarded as the average difference between no intervention and intervention strategy s. Additionally, different strategies can be compared by considering ACE[s]−ACE[s′] for two strategies s and s′, or other functionals of the post-intervention distribution ℙ[s](Y[t′]).

Now, a priori there is no reason why data that is not collected under the intervention regime of interest should allow estimation of the ACE. However, the authors then go on to show the possibility of expressing the ACE in terms of quantities that are known or estimable under the observational regime, using what they call the back-door criterion.

I find this a very elegant reconciliation of causality in time series data with the highly influential concept of intervention-based causality in general.

Additionally, a recent paper by Samartsidis et al. provides a thorough review of other methods for assessing the causal effect of interventions from aggregate time-series observational data for the specific case of binary interventions [Samartsidis et al. 2018].

Note: The notion of intervention causality is fundamentally different from the other three notions presented here; while Granger causality, Sims causality and structural causality all assume an observational framework, intervention causality makes the much stronger assumption that interventions can be performed in the studied processes. As such, it is significantly less applicable in many real-life scenarios.

Classical methods for causality inference in time series data 📖

This section covers the two most basic approaches to causality inference, based on classical statistical approaches.

Non-directional lagged interactions

The perhaps most basic approach to inferring causal relationships between two time series X and Y is to use non-directional measures of correspondence between a lagged (back-shifted) version of the potentially-causing time series X and the (non-lagged) potentially-caused time series Y.

If a high degree of correspondence is found between a k-lag of X and the (non-lagged) Y, then a very weak notion of X-causing-Y can be inferred; the direction is inferred from the fact that a lag of X has high correspondence with Y. Various measures of correspondence can be used, among them the Pearson correlation (e.g. [Tsonis and Roebber, 2004]), mutual information (e.g. [Donges et al. 2009]) and phase synchronization (e.g. [Pikovsky et al. 2003]).

When the chosen correspondence measure is the Pearson correlation, this is equivalent to looking at the cross-correlation function of the two time series for different positive and negative lags, and taking the maximum value it attains over the chosen range as the strength of the causal link, with the sign of the lag indicating the causal direction. Naively, if the function achieves positive values over both positive and negative lags then bi-directional causality can be inferred. In any case, the autocorrelation of both series must be taken into account in order to arrive at a valid interpretation.
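
A minimal sketch of this cross-correlation variant (the simulated series, lag range and coupling strength are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.r_[0.0, 0.7 * x[:-1]] + 0.3 * rng.normal(size=500)  # Y echoes X at lag 1

def lagged_corr(x, y, lag):
    """Pearson correlation between x back-shifted by `lag` steps and y."""
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    if lag < 0:
        return np.corrcoef(x[-lag:], y[:lag])[0, 1]
    return np.corrcoef(x, y)[0, 1]

lags = range(-20, 21)  # illustrative lag range
ccf = [lagged_corr(x, y, k) for k in lags]
best_lag, best_corr = max(zip(lags, ccf), key=lambda kv: abs(kv[1]))

# A peak at a positive lag (weakly) suggests X leads Y; the autocorrelation
# of both series must still be accounted for before interpreting this.
print(best_lag, best_corr)
```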

This approach is employed mainly in climate research [Yamasaki et al. 2008] [Malik et al. 2012] [Radebach et al. 2013]. It was shown to have significant problems that might produce misleading conclusions, as discussed in chapter 4.5 of SIFT’s online manual and demonstrated in section 5.2.1 of [Runge, 2014].

Parametric VAR-based tests for Granger causality

A concise breakdown of the classical parametric tests for Granger causality is given in [Greene, 2002]. A substantial number of these tests were constructed over the years to test for Granger causality. I thus give a brief overview of the ones I have encountered, focusing on tests for which I could find an implementation in common data-processing programming languages (i.e. Python and R).

In general, the first phase in these tests is to make sure that all examined series are stationary — stationarizing them, usually through trend removal and/or differencing, if they are not.

Then, in pair-wise tests, for each pair of time series and for each specific direction of causality X⇒Y, a (usually manually chosen) number of negative (past) lags of the potentially-causing series X is generated (including the zero-lag, which is X itself). The maximal lag length is a model selection consideration, and thus should be chosen based on some information criterion (e.g. the Akaike information criterion or the Bayesian information criterion).
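
As a sketch of such criterion-based lag selection using statsmodels’ VAR utilities (the simulated series and the lag bound of 12 are illustrative):

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = np.r_[0.0, 0.8 * x[:-1]] + 0.2 * rng.normal(size=500)

# Columns are the (already stationarized) examined series.
data = np.column_stack([y, x])
selection = VAR(data).select_order(maxlags=12)  # 12 is an illustrative bound
print(selection.summary())  # AIC/BIC/HQIC/FPE for each candidate lag length
p = selection.aic           # e.g. the lag length minimizing AIC
```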

Note: If checking a large number of pairs, you will also have to consider how to deal with problems arising from multiple hypothesis testing.

The model used in all cases below is a vector autoregressive model of the endogenous (potentially caused) time series Y as a stochastic process; two such models are stated.

The first model — called the restricted model — assumes that Y linearly depends only on past values of itself with linear coefficients γ and a time-dependent noise term e[t]:

Y[t] = c₁ + Σᵢ₌₁ᵖ γᵢ·Y[t−i] + e[t]

Equation 11: The restricted model in a VAR-based Granger causality test

Conversely, the second — called the unrestricted model — assumes that Y linearly depends on past values of both X and Y, determined by coefficients αᵢ, βᵢ and a time-dependent noise term u[t]:

Y[t] = c₂ + Σᵢ₌₁ᵖ αᵢ·Y[t−i] + Σᵢ₌₁ᵖ βᵢ·X[t−i] + u[t]

Equation 12: The unrestricted model in a VAR-based Granger causality test

The unformalized null hypothesis is that the second model does not add information — i.e. does not provide a better model of Y — when compared to the first model. This needs to be formalized into a testable null hypothesis; a common approach is to state that the null hypothesis H₀ is that ∀i, βᵢ=0.

Finally, one of the below test procedures is applied to all such pairs of a lag of X and the unlagged Y. To check for causality in both directions, lags of Y are added to the set of examined series.

Note: Granger-causality tests are very sensitive to the choice of lag length and to the methods employed in dealing with any non-stationarity of the time series.

SSR-based F-test for Granger causality: Parameters are estimated for both the restricted and the unrestricted model (usually using ordinary least squares). An F-statistic¹ is then computed using the RSS of the two models, and is given by:

F = [(RSSᵣ − RSSᵤ)/p] / [RSSᵤ/(T − 2p − 1)]

Equation 13: RSS-based F statistic for Granger causality

where T is the time series length, p is the number of lags, and RSSᵣ and RSSᵤ are the residual sums of squares of the restricted and unrestricted models, respectively.

A nice overview of the bivariate case of this test is given here and here. A bivariate version is implemented in the statsmodels Python package [Python], in the MSBVAR package [R], the lmtest package [R]³, the NlinTS package [R] and the vars package [R].
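
For example, the statsmodels implementation runs several of the tests discussed here at once; a minimal usage sketch (the simulated series and lag bound are illustrative):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = np.r_[0.0, 0.8 * x[:-1]] + 0.2 * rng.normal(size=500)

# Column order matters: the function tests whether the SECOND column
# Granger-causes the FIRST one.
results = grangercausalitytests(np.column_stack([y, x]), maxlag=4)

# Each lag's entry holds several of the tests discussed here, e.g. the
# SSR-based F-test ('ssr_ftest'), the SSR-based chi-square test
# ('ssr_chi2test') and the likelihood ratio test ('lrtest').
f_stat, p_value, df_denom, df_num = results[1][0]["ssr_ftest"]
```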

Pearson’s Chi-Square test for Granger causality: First, model parameters are estimated using OLS. A Chi-square statistic² is computed using the SSR of the two models, and Pearson’s Chi-Square test procedure is performed. A bivariate version is implemented in the statsmodels package [Python].

The likelihood ratio Chi-Square test (a.k.a. G-test) for Granger causality: A Chi-square statistic² is computed using the likelihood ratio of the two models and the standard test procedure is followed. A bivariate version is implemented in the statsmodels package [Python].

Heteroskedasticity-robust F-test for Granger causality: Introduced in [Hafner and Herwartz, 2009], this procedure uses bootstrapping for parameter estimation that is robust to heteroskedasticity (and yields a more efficient estimator than OLS in this case, for example), and a custom Wald test statistic. A bivariate version is implemented in the vars package [R] (see the second test implemented in the causality method).

The Toda and Yamamoto procedure: Introduced in [Toda and Yamamoto, 1995], this procedure is meant to deal with testing for Granger causality in cases where the examined series are either integrated or cointegrated of an arbitrary order (or both). In these cases the test statistics of the aforementioned tests do not follow their usual asymptotic distribution under the null; the procedure was developed to address this problem. The authors give a detailed test procedure that uses a standard Wald test as a component, in such a way that a properly distributed (under the null hypothesis) test statistic is achieved. Dave Giles has an outstanding blog post on the procedure.

I did not find a code implementation of this procedure, but I have included it because of its importance. It can be implemented by combining existing implementations of the procedures composing it.
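
For instance, here is a rough sketch of how the procedure could be composed from existing building blocks, following the outline above (the simulated series, lag bounds and significance level are all illustrative, and the careful diagnostics the procedure calls for are omitted):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
n = 400
x = np.cumsum(rng.normal(size=n))                             # an I(1) series
y = np.r_[0.0, 0.5 * x[:-1]] + np.cumsum(rng.normal(size=n))  # driven by lagged x

def integration_order(series, alpha=0.05, max_d=2):
    """Difference until an ADF test rejects a unit root."""
    d = 0
    while d < max_d and adfuller(np.diff(series, n=d) if d else series)[1] > alpha:
        d += 1
    return d

d = max(integration_order(x), integration_order(y))      # maximal integration order
p = VAR(np.column_stack([y, x])).select_order(maxlags=8).aic or 1

# Fit Y's equation of a VAR(p + d) in levels by OLS, then Wald-test only
# the first p lags of x; the d augmenting lags are left unrestricted.
data = pd.DataFrame({"y": y, "x": x})
k = p + d
design = pd.DataFrame({
    f"{c}_l{i}": data[c].shift(i) for c in ("y", "x") for i in range(1, k + 1)
}).dropna()
res = sm.OLS(data["y"].iloc[k:], sm.add_constant(design)).fit()
wald = res.wald_test(", ".join(f"x_l{i} = 0" for i in range(1, p + 1)))
print(wald)  # H₀: x does not Granger-cause y
```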

Other tests for linear Granger causality: Linear Granger causality tests were developed in many directions, e.g., [Hurlin and Venet, 2001] proposed a procedure for causality tests with panel data, while [Ghysels et al. 2016] introduced a test for Granger causality with mixed frequency data.

The above linear methods are appropriate for testing Granger causality in the mean. However, they are not able to detect Granger causality in higher moments, e.g. in the variance. To deal with this challenge, and with additional deficiencies of the classical model for Granger causality, a plethora of methods were suggested; these include nonlinear parametric approaches and various non-parametric approaches.

The following sections aim to concisely cover the numerous alternative approaches to inferring causality in time series data, inspired by various fields of the natural sciences.

Alternative parametric Granger causality measures for time series data 📖

Most of the following causality measures were overviewed and compared in [Papana et al. 2013].

Conditional Granger Causality Index (CGCI)

Introduced in [Geweke, 1984], this was the first attempt to suggest measures of the degree of linear dependence and feedback between multiple time series. The author introduced the decomposition of the linear causal relationship between X and Y into the sum of linear causality from X to Y, linear causality from Y to X, and instantaneous linear feedback between the two series. Furthermore, the introduced measures can (under certain conditions) be additively decomposed by frequency.

Using the same VAR model as the original linear Granger causality measure, the CGCI is similarly defined to be the natural logarithm of the ratio between the residual variance of the restricted model and that of the unrestricted model. The difference is thus just in the inclusion of additional time series besides X₁ and X₂ in both the restricted and unrestricted models; thus, if X₂ is only mediating the influence of some other time series Z on X₁, we can again expect the residual errors of the unrestricted model to be similar to that of the restricted one, in which case the index will be close to zero.

CGCI(X₂⇒X₁|Z) = ln(σ²ᵣ / σ²ᵤ), where σ²ᵣ and σ²ᵤ are the residual variances of the restricted and unrestricted models

Equation 14: The Conditional Granger Causality Index
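
A compact sketch of the index itself (linear OLS estimation, with an illustrative lag length, and Z standing for the conditioning series):

```python
import numpy as np

def lag_matrix(s, p):
    """Columns are lags 1..p of series s, rows aligned at time t."""
    n = len(s)
    return np.column_stack([s[p - i:n - i] for i in range(1, p + 1)])

def residual_variance(target, predictors):
    """Variance of OLS residuals of regressing target on predictors."""
    X = np.column_stack([np.ones(len(target)), predictors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ beta)

def cgci(x1, x2, z, p=2):
    """ln(σ²_restricted / σ²_unrestricted) for X₂ ⇒ X₁, conditioned on Z."""
    target = x1[p:]
    restricted = np.hstack([lag_matrix(x1, p), lag_matrix(z, p)])
    unrestricted = np.hstack([restricted, lag_matrix(x2, p)])
    return np.log(residual_variance(target, restricted)
                  / residual_variance(target, unrestricted))

# If X₂ merely mediates Z's influence on X₁, both models fit similarly
# and the index is close to zero.
```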

Restricted variants of the VAR model CGCI uses were proposed to handle higher-dimensional data, a smaller amount of samples or non-linear causal relations. [Siggiridou and Kugiumtzis, 2016] includes an overview of several such variants (and introduces another).

MLP-based F-test for Granger causality

This approach is very similar to the aforementioned VAR-based approach, but a perceptron replaces the VAR as the explaining model. Two multi-layer perceptron (MLP) neural networks models are trained — one just for the endogenous time series and one for both — and an F-test is performed to test the null hypothesis that the exogenous time series does not improve predictability of the endogenous time series. Implemented in the NlinTS package [R].

RBF models for Granger causality

[Ancona et al. 2004] suggested a non-linear parametric model for Granger causality, replacing the VAR model with the richer family of radial basis functions (RBF), which was shown to be able to approximate any real function to a desired degree of accuracy.

Recently, [Ancona and Stramaglia, 2006] showed that not all nonlinear prediction schemes are suitable for evaluating causality between two time series, since they should be invariant when statistically independent variables are added to the set of input variables. Driven by this finding, [Marinazzo et al. 2006] aimed to find the largest class of RBF models suitable for evaluating causality.

Partial Granger Causality Index (PGCI)

CGCI (and its extensions) still assume the inclusion of all relevant variables. PGCI was introduced in [Guo et al. 2008] as a causality index that can handle the existence of exogenous inputs and latent (i.e. unobservable) variables in the examined system.

To determine the direct causal link from a variable Y to a variable X, given another variable Z (this can be naturally extended to multiple variables) and both exogenous input to the system and unobserved latent variables, the authors suggest the following restricted VAR model, with a noise covariance matrix S:

Equation 15: The Restricted VAR model for PGCI

And the following unrestricted VAR model, with a noise covariance matrix Σ:

Equation 16: The Unrestricted VAR model for PGCI

Like in previous VAR models, matrices A₁,B₁, A₂, E₂ and K₂ model the autoregressive effect in each series, other matrices model the effect of the different lag of each model on the others, and εᵢ are white noise processes. The new components here are εᵢᴱ, which are independent random vectors representing exogenous inputs, and εᵢᴸ, which are independent random vectors representing latent variables.

The authors go on to develop two measures: (1) a measure of the accuracy of the autoregressive prediction of X based on its previous values conditioned on Z, eliminating the influence of εᵢᴱ and εᵢᴸ; (2) a measure of the accuracy of predicting the present value of X based on the previous history of both X and Y conditioned on Z, again eliminating the influence of εᵢᴱ and εᵢᴸ. They then define the PGCI to be the natural logarithm of the ratio between the two. Given in terms of the noise covariance matrices S and Σ of the two models, the index can be written as:

Equation 17: PGCI in terms of the noise covariance matrices of the VAR models

By comparison, the standard Granger causality index can be expressed as GCI = ln(|S₁₁|/|Σ₁₁|).

The authors also extend their measure to the nonlinear case by using nonlinear RBF parametric models, as refined in [Marinazzo et al. 2006]; the index remains as in Equation 17.

Directed Coherence Measures

The bivariate function of coherence is commonly used in signal processing to estimate the power transfer between the input and output of a linear system. [Saito and Harashima, 1981] expanded on the concept by defining directed coherence (DC), which decomposes coherence into two directed components: one representing the feedforward dynamic and the other the feedback dynamic of the examined system. The original paper used a bivariate autoregressive model, which was later generalized in several variations for the multivariate case.

[Baccalá and Sameshima, 2001] extended the concept of directed coherence with the definition of Partial Directed Coherence (PDC), which is based on the partial coherence function, as a coherence-based measure of Granger causality for the multivariate case.

Alternative non-parametric causality measures for time series data 📖

Note that most methods presented in the following sections, dealing with chaos and dynamic system theory approaches and information theoretic approaches to causality inference, are also non-parametric.

The Bouezmarni-Taamouti test

The authors of [Bouezmarni and Taamouti, 2010] give a non-parametric test for conditional independence and Granger causality for the bivariate case. Unlike most tests, which focus on causality in mean, the authors base their test on conditional distributions.

Other non-parametric approaches to testing causality are those suggested in [Bell et al. 1996] and [Su and White, 2003].

Chaos and dynamic system theory approaches for causality inference in time series data 📖

This section covers methods for causality inference based on the two closely related fields of chaos theory and dynamic system analysis. Naturally, these approaches are also related to information theory to some degree, which is covered in the next section.

The Hiemstra-Jones test

[Baek and Brock, 1992] developed a nonlinear Granger causality test which was modified by [Hiemstra and Jones, 1994] later on to study the bivariate nonlinear causal relationship between stock returns and stock trading volume. This test has become common in testing for non-linear Granger causality in the years that followed, and has been extended to the multivariate case by [Bai et al. 2010].

Although not commonly presented as such, their non-parametric dependence estimator is based on the so-called correlation integral, a probability distribution and entropy estimator developed by the physicists Grassberger and Procaccia in the field of nonlinear dynamics and deterministic chaos as a characterization tool for chaotic attractors [Hlaváčková-Schindler et al. 2007]. It is thus also closely related to the CTIR measure examined in the following section, dealing with information theoretic measures of causality.

[Diks and Panchenko, 2005] showed that the relationship tested by this procedure is not implied by the null hypothesis of Granger non-causality, and that the actual rejection rate may tend to one as the sample size increases. As a result, the test was revisited, and new versions overcoming some of the aforementioned problems (namely the growth of the rejection rate with sample size) were suggested for both the bivariate case (in [Bai et al. 2017]) and the multivariate case (in [Bai et al. 2018]).

The Diks-Panchenko test

Building on their investigation into the problems of the Hiemstra-Jones test in [Diks and Panchenko, 2005], the authors suggest in [Diks and Panchenko, 2006] a new bivariate nonparametric test for Granger causality. They show significantly better behavior of the size and power of their test as the sample size increases, when comparing to the Hiemstra-Jones test, while also testing for a relationship equivalent to the desired null hypothesis. [Diks and Wolski, 2015] extend the test to multivariate settings.

Extended Granger Causality Index (EGCI)

Introduced in [Chen et al. 2004], this method extends the classical Granger causality index to the non-linear case by restricting the application to local linear models in reduced neighborhoods and then averaging the resulting statistical quantity over the entire dataset.

This approach makes use of a technique from the field of dynamical system theory: delay coordinate embedding is used to reconstruct a phase space R, and the autoregressive model is then fitted in the reconstructed space R instead of the original sample space. The model is fitted for all points in the neighborhood (determined by a distance parameter δ) of a reference point z₀. The residual variances in the EGCI measure are then estimated by averaging over neighborhoods sampling the entire attractor. Finally, the EGCI is computed as a function of the neighborhood size δ. For linear systems the index should stay roughly the same as δ becomes smaller, while for nonlinear systems it (supposedly) reveals nonlinear causal relations as δ grows smaller.

The authors also suggest a conditional variant of the index, the Conditional Extended Granger Causality Index (CEGCI), to deal with the multivariate case.

Convergent cross mapping (CCM)

Introduced in [Sugihara et al. 2012], CCM is a method for causality inference based on nonlinear state space reconstruction, a mathematical model commonly used in the theory of dynamical systems, and which can be applied to systems where causal variables have synergistic effects (unlike Granger causality tests). The authors demonstrate successful discerning between true coupled variables and the case of external forcing of non-coupled variables.

The method was implemented by some of the authors in the rEDM package [R], the pyEDM package [Python] and cppEDM package [C++], which are accompanied by a comprehensive tutorial.
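
A minimal usage sketch with pyEDM (the parameter names follow the pyEDM documentation but may change across versions; the coupled-logistic-map system, embedding dimension and library sizes are illustrative):

```python
import numpy as np
import pandas as pd
from pyEDM import CCM

# Two coupled logistic maps, with x driving y more strongly than vice versa
# (an illustrative system in the spirit of the paper's examples).
n = 400
x, y = np.empty(n), np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - 0.02 * y[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

df = pd.DataFrame({"time": range(n), "x": x, "y": y})
ccm_out = CCM(dataFrame=df, E=3, columns="x", target="y",
              libSizes="10 350 20", sample=100, showPlot=False)

# Convergence of the cross-map skill ρ as the library size grows is the
# CCM signature of a causal link.
print(ccm_out)
```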

Information theoretic approaches to causality inference in time series data 📖

Most of the following causality measures were overviewed and compared in [Papana et al. 2013].

Coarse-grained trans-information rate (CTIR)

CTIR is a measure introduced in [Paluš et al. 2001] for the detection of the "direction of information flow" between coupled systems in a bivariate time series scenario, based on conditional mutual information.

Defined in information theoretic form, it is not a measure of strength; rather, it measures the average rate of the net amount of information "transferred" from a process Y to a process X — in other words, the average rate of the net information flow by which the process Y influences the process X.

[Hlaváčková-Schindler et al. 2007] provide an extremely thorough overview of both CTIR and conditional mutual information, and the various methods used to estimate them. The same paper also includes a proof that the two measures are equivalent, given proper conditioning.

Transfer entropy based measures

The information theoretic concept of transfer entropy was introduced in [Schreiber, 2000] as a measure quantifying the statistical coherence between systems evolving in time, in a way that can distinguish and exclude information that is actually exchanged from shared information due to common history and input signals. Alternatively, it can be said to quantify the amount of information explained in X₁ at h steps ahead from the state of X₂, accounting for the concurrent state of X₁. Transfer entropy is given by:

TE(X₂⇒X₁) = I(x₁[t+h]; x₂[t] | x₁[t]) = H(x₁[t+h], x₁[t]) + H(x₂[t], x₁[t]) − H(x₁[t]) − H(x₁[t+h], x₂[t], x₁[t])

Equation 18: Transfer Entropy

where I(x₁[t+h]; x₂[t] | x₁[t]) is the conditional mutual information, which gives the expected value of the mutual information of X₁ at h steps ahead and X₂, given the current value of X₁; H(X) is Shannon’s entropy; and H(X,Y) is the joint Shannon entropy. The first equivalence — between TE and the conditional mutual information — was shown in [Paluš and Vejmelka, 2007].
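
As a sketch of how Eq. 18 can be estimated in practice — here with a naive histogram estimator, which is known to be biased (serious applications use k-nearest-neighbor estimators such as Kraskov’s); the bin counts and lag are illustrative:

```python
import numpy as np

def entropy(*cols, bins=8):
    """Shannon entropy (in nats) of a histogram estimate of p(cols)."""
    hist, _ = np.histogramdd(np.column_stack(cols), bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def transfer_entropy(x2, x1, h=1, bins=8):
    """TE(X₂ ⇒ X₁) = I(x₁[t+h]; x₂[t] | x₁[t]), via the entropy identity in Eq. 18."""
    future, source, past = x1[h:], x2[:-h], x1[:-h]
    return (entropy(future, past, bins=bins) + entropy(source, past, bins=bins)
            - entropy(past, bins=bins) - entropy(future, source, past, bins=bins))
```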

Transfer entropy has since been used frequently as a measure of causality in various papers, in fields such as neuroscience (for example, [Vicente, 2011]), and extended to use other entropy measures, such as Rényi’s, in [Jizba et al. 2012]. [Verdes, 2005] suggested a variant measure better suited for causality detection in homogeneous spatially extended systems.

Partial transfer entropy (PTE), presented in [Vakorin et al, 2009], is an extension of transfer entropy designed to measure the direct causality of X₂ on X₁, conditioning on the remaining variables in Z:

PTE(X₂⇒X₁|Z) = I(x₁[t+h]; x₂[t] | x₁[t], z[t])

Equation 19: Partial Transfer Entropy

Symbolic Transfer Entropy (STE): The STE measure amounts to transfer entropy estimated on an embedding space (of dimension d) of rank-points (i.e. symbols) formed by the reconstructed vectors of the variables.

STE(X₂⇒X₁) = I(x̂₁[t+h]; x̂₂[t] | x̂₁[t])

Equation 20: Symbolic Transfer Entropy

where X̂₁[t] is the ordinal pattern of order d at time t of the vector X₁[t] (see [Keller and Sinn, 2005]); given a time delay τ, it is defined to be the permutation (r₀, r₁, ⋯, r_d) of (0, 1, ⋯, d) that sorts the corresponding d+1 delayed samples of X₁ in ascending order.
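
A small sketch of this rank-point (symbol) construction, with the order and delay illustrative:

```python
import numpy as np

def ordinal_patterns(s, d, tau=1):
    """Map each delay vector (s[t], s[t+τ], …, s[t+dτ]) to the permutation
    that sorts it — one rank-point symbol per time point."""
    n = len(s) - d * tau
    windows = np.column_stack([s[i * tau: n + i * tau] for i in range(d + 1)])
    return np.argsort(windows, axis=1)

# STE then amounts to the transfer entropy computed on these symbol
# sequences instead of on the raw sample values.
```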

Partial symbolic transfer entropy (PSTE): STE was extended to multivariate settings in an identical way to PTE:

PSTE(X₂⇒X₁|Z) = I(x̂₁[t+h]; x̂₂[t] | x̂₁[t], ẑ[t])

Equation 21: Partial Symbolic Transfer Entropy

Additional transfer entropy based measures for causality include transfer entropy on rank vectors (TERV), introduced in [Kugiumtzis, 2012], and its multivariate extension partial transfer entropy on ranks (PTERV), introduced in [Kugiumtzis, 2013A].

Mutual Information from Mixed Embedding (MIME)

Introduced in [Vlachos and Kugiumtzis, 2010], MIME is a mutual-information-driven state space reconstruction technique for time series analysis, including causality analysis (and thus could also be placed under the dynamic system theory approaches section).

In the bivariate case the scheme gives a mixed embedding of varying delays from the variables X₁ and X₂ that best explains the future of X₁. The mixed embedding vector W[t] may contain lagged components of both X₁ and X₂, defining two complementary subsets, W[t]=[Wˣ¹[t], Wˣ²[t]]. MIME is then estimated as:

MIME(X₂⇒X₁) = I(x₁[t+1]; Wˣ²[t] | Wˣ¹[t]) / I(x₁[t+1]; W[t])

Equation 22: Mutual Information from Mixed Embedding

The numerator in Eq. 22 is the conditional mutual information, as for the TE in Eq. 18, but for non-uniform embedding vectors of X₁ and X₂. MIME can thus be considered a normalized version of the TE for an optimized non-uniform embedding of X₁ and X₂.

Partial Mutual Information from Mixed Embedding (PMIME) is an extension of MIME for multivariate settings, described in [Kugiumtzis, 2013B], done by additionally conditioning on all environment variables Z, much like in PTE and PSTE. The mixed embedding vector that best describes the future of X₁ is now potentially formed from all K lagged variables, i.e. X₁, X₂ and the other K−2 variables in Z, and it can be decomposed into three respective subsets, W[t]=[Wˣ¹[t], Wˣ²[t], Wᶻ[t]]. PMIME is then estimated as:

PMIME(X₂⇒X₁|Z) = I(x₁[t+1]; Wˣ²[t] | Wˣ¹[t], Wᶻ[t]) / I(x₁[t+1]; W[t])

Equation 23: Partial Mutual Information from Mixed Embedding

The method was implemented by the authors in a Matlab package.

Graphical approaches for causality inference in time series data 📖

A graphical approach is often used to model Granger causality in multivariate settings: each variable (in our case, corresponding to a time series) is considered to be a node in a Granger network, with directed edges denoting causal links, possibly with a delay (see Figure 2).

Causal graph search algorithms (SGS, PC and FCI)

A family of causal search algorithms that use principles of conditional independence and the causal Markov condition to reconstruct the causal graph of the data-generating process, made up of three related algorithms: SGS, PC and FCI. See [Spirtes et al. 2000] for a thorough overview.

The main structure of these algorithms is similar:

  1. Initialization: The full undirected graph over all variables V is initialized (i.e. all causal connections are assumed).
  2. Skeleton construction: Edges are eliminated by testing for conditional independence, with conditioning sets of increasing size (here the algorithms differ: SGS tests every possible conditioning set, while PC conditions only on currently-connected variables); a toy sketch of this phase is given below.
  3. Edge orientation: Finally, a set of statistical and logical rules is applied to determine the direction of the remaining edges (i.e. causality) in the graph.

Of the first two, SGS is considered possibly more robust to nonlinearities, while the complexity of PC — the more commonly used of the two — does not grow exponentially with the number of variables (a result of the difference in the skeleton construction phase). Finally, the PC algorithm cannot handle unobserved confounders, a problem which its extension, FCI, aims to remedy.

[Runge et al, 2017] deem PC inappropriate for use with time series data, claiming — based on numerical experiments — that autocorrelation can lead to high false positive rates.
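
For intuition, here is a toy sketch of the skeleton construction phase under linear-Gaussian assumptions, using partial correlation as the conditional independence test (the variable count, conditioning-set bound and significance level are all illustrative):

```python
import numpy as np
from itertools import combinations
from scipy import stats

def partial_corr_pvalue(data, i, j, cond):
    """p-value for corr(i, j | cond) via residualization and Fisher's z."""
    def resid(k):
        Z = np.column_stack([np.ones(len(data))] + [data[:, c] for c in cond])
        beta, *_ = np.linalg.lstsq(Z, data[:, k], rcond=None)
        return data[:, k] - Z @ beta
    r = np.corrcoef(resid(i), resid(j))[0, 1]
    z = np.arctanh(r) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

def pc_skeleton(data, alpha=0.05, max_cond=2):
    n_vars = data.shape[1]
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}  # start fully connected
    for size in range(max_cond + 1):  # conditioning sets of increasing size
        for i in range(n_vars):
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue
                # PC conditions only on current neighbors of i (minus j);
                # SGS would try every possible conditioning set instead.
                for cond in combinations(adj[i] - {j}, size):
                    if partial_corr_pvalue(data, i, j, cond) > alpha:
                        adj[i].discard(j)
                        adj[j].discard(i)  # independence found: remove the edge
                        break
    return adj
```

Note that this is the i.i.d. skeleton phase only; applying it to time series would require running it over lagged copies of each variable, which is exactly where the autocorrelation problems mentioned above arise.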

PCMCI

PCMCI is a causal discovery method described in [Runge et al, 2017] and implemented in the Tigramite Python package. The authors claim it is suitable for large datasets (~𝒪(100K)) of variables featuring linear and nonlinear time-delayed dependencies, given sample sizes of a few hundred or more, and that it shows consistency and higher detection power with reliable false positive control when compared with methods such as Lasso-Granger and the CI family of algorithms.

The method consists of two stages:

  1. PC₁ — A Markov set discovery algorithm based on the PC algorithm that removes irrelevant conditions for each variable by iterative independence testing.
  2. MCI — The momentary conditional independence test, meant to address false positive control for the highly-interdependent time series case, conditions on the parents of both variables in the potential causal link. To test whether Xⁱ affects Xʲ with lag τ, the following conditional independence statement is tested (where 𝒫(·) denotes the set of parent nodes of a variable):

Xⁱ[t−τ] ⫫ Xʲ[t] | 𝒫(Xʲ[t]) ∖ {Xⁱ[t−τ]}, 𝒫(Xⁱ[t−τ])

The MCI test

Like in the skeleton construction phase of the PC-family of algorithms, both steps of PCMCI can be combined with any conditional independence test. The authors examine linear partial correlation tests for the linear case, and the GPDC and CMI tests for the non-linear case.
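
A minimal usage sketch with Tigramite (API details vary across versions — in particular the import path of ParCorr — and the placeholder data and parameter values are illustrative):

```python
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr  # path differs in newer versions

data = np.random.randn(500, 4)  # placeholder for a T×N array of observations
dataframe = pp.DataFrame(data, var_names=["X0", "X1", "X2", "X3"])

pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())
results = pcmci.run_pcmci(tau_max=8, pc_alpha=None)  # pc_alpha=None lets PC₁ optimize it

# results["p_matrix"] and results["val_matrix"] hold MCI p-values and test
# statistics for every (variable, variable, lag) triple.
```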

Lasso-Granger

This method was introduced in [Arnold et al. 2007] as a way to apply Granger causality models in high-dimensional multivariate settings, by utilizing the variable selection nature of the Lasso method.
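
A sketch of the core idea (the penalty strength and lag length are illustrative, and the actual method involves further refinements):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_granger(series, p=3, alpha=0.05):
    """Regress each variable on lags 1..p of ALL variables with an L1 penalty;
    variables keeping nonzero coefficients are candidate Granger causes."""
    T, n = series.shape
    X = np.hstack([series[p - i:T - i] for i in range(1, p + 1)])  # (T−p) × n·p
    links = {}
    for j in range(n):
        coef = Lasso(alpha=alpha).fit(X, series[p:, j]).coef_.reshape(p, n)
        links[j] = {k for k in range(n) if np.any(coef[:, k] != 0)}  # causes of j
    return links
```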

The method was also adapted to deal with irregular time series (series with samples missing at blocks of sampling points or collected at non-uniformly spaced time points) in [Bahadori and Liu, 2012A] and [Bahadori and Liu, 2012B] with the Generalized Lasso Granger (GLG) and the Weighted Generalized Lasso Granger (W-GLG) variants.

A thorough review of Lasso-Granger methods, although in the specific context of gene expression regulatory networks, is given in [Hlaváčková-Schindler and Pereverzyev, 2015].

Copula-Granger

Copula-Granger is a semi-parametric Granger causality inference algorithm developed and introduced in [Bahadori and Liu, 2012B] and [Bahadori and Liu, 2013]. The copula approach was first suggested for time series analysis in [Embrechts et al. 2002], and later used in [Liu et al, 2009] to learn the dependency graph among time series.

In [Bahadori and Liu, 2013] the authors examine two existing approaches alongside their own algorithm in terms of two main properties: (1) the ability to handle spurious effects of confounders, and (2) consistency. The entire analysis is done under the strong assumption of causal sufficiency — i.e. that no common cause of any two observed variables in the system is left out.

Figure 2: A toy Granger graphical model, with delays τ. When X₄ is unobserved, a spurious edge X₁ ← X₃ is detected by some algorithms.

The authors highlight deficiencies of two major approaches for VAR models — the significance test and Lasso-Granger — in the above terms, and show that their approach is both consistent in high dimensions and able to capture non-linearity in the data (for a simple polynomial case). The two main novelties in the proposed approach are the explicit treatment of delays in causality paths in the graphs, used to prevent identification of spurious effects (see Figure 2), and the projection of observations into copula space alongside the incorporation of non-paranormal (nonparametric normal) distributions into the DGP.

Copula-Granger was generalized — in the same way that elastic net generalizes the lasso — in [Furqan et al. 2016] to use the elastic net regularization method, overcoming natural limitations of the lasso: instability when used on high-dimensional data, and limited variable selection (before saturation) when the number of variables is greater than the number of observation points.

Forward-Backward Lasso Granger (FBLG)

Both Lasso Granger and Copula-Granger were extended in [Cheng et al. 2014] with a bagging-like meta-algorithm called Forward-Backward, which enriches the dataset with a reversal of the input time series.

Choosing which approach to use 📖

In general, the decision on which approach to use to infer or detect causality in your data will be driven mainly by the data itself and its characteristics, and by what assumptions you feel confident you can make about it and the real processes generating it.

Granger causality vs other approaches

The key requirement of Granger causality is separability, meaning that information about a causative factor is contained only in that variable’s time series — not in the time series of the caused effects — and can thus be eliminated by removing that variable from the model.

Generally, separability is characteristic of purely stochastic and linear systems, and Granger causality can be useful for detecting interactions between strongly coupled (synchronized) variables in nonlinear systems. Separability reflects the view that systems can be understood a piece at a time rather than as a whole. [Sugihara et al. 2012]

As such, the first criterion for using classical Granger-causality-based approaches is the ability to separate your data into several mutually exclusive (information-wise) time series, for which the ability to determine that several specific time series cause some other specific time series is valuable.

In complex dynamic systems where these conditions cannot be met, modern approaches meant to infer causality in such systems, like CCM or PCMCI, might be more appropriate.

Parametric vs non-parametric approaches

Model misspecification is always a challenge in causal inference, whether the selected system model is linear or non-linear. If you believe none of the available methods can model the system in question, and the flow of causality in it, to a good degree — a determination usually made using domain knowledge and intuition — then a non-parametric approach might be more appropriate, such as most of those presented in the sections dealing with dynamic system theoretic and information theoretic causality measures.

A notable caveat, often not obvious at first sight, is that the requirements or assumptions some approaches make can render a non-parametric approach parametric in practice. A notable example is PCMCI, which assumes input time series are generated by stationary processes. As non-parametric tests for stationarity (unlike unit-root tests) are few and far between — not to mention that no method or transformation can guarantee the transformation of non-stationary data into stationary data — this assumption will force the user of PCMCI to use parametric approaches for the detection and treatment of non-stationarity in the input data. This situation is made worse by the lack of accepted and well-defined notions of near-stationarity (a few do exist) and of ways to quantify it and determine when it is sufficient for an inference method to function properly.

Causal graph extraction

If the extraction of a causal graph is a goal, then PCMCI and Copula-Granger (and its extension FBLG) stand out among the graphical algorithms. Both methods can handle confounders successfully, with PCMCI also claiming resiliency to high autocorrelation in the data and boasting a convenient Python implementation.

System observability

The observability of the system is also a parameter to be taken into account. If the strong assumption of causal sufficiency cannot be met, then the many methods that presuppose it — including PCMCI and Copula-Granger — cannot be relied upon for correct inference of causal relations. In that case, alternative causality measures aimed at dealing with latent variables in the system, such as PGCI, should be considered.

Deciding between different tests for Granger causality

The takeaway here is pretty simple: unless you can justify the very strong assumption of a linear relationship between the exogenous and the endogenous variables, a non-parametric approach is the proper one, as it makes much weaker assumptions about your stochastic system and the flow of causality within it.

In that case, the Diks-Panchenko test stands out among the non-parametric tests for Granger causality in terms of the power and size of the test. It also resolves the discrepancy between the definition of Granger causality and the actual relationship tested by the Hiemstra-Jones test, which is not solved even by the Bai et al. variants of that test.

If a linear model of the system is sufficient, then the Toda and Yamamoto procedure is the most rigorous method for linear Granger causality inference, dealing with important phenomena such as integrated or cointegrated time series.

Researchers to follow 📖

Prof. Cees Diks consistently publishes papers on non-linear Granger causality and non-linear dynamics in general. This includes, among other topics, building financial networks based on Granger causality, examining the effect of different resampling methods for causality testing, and causality measures for multivariate analysis.

Prof. Dimitris Kugiumtzis does incredible work on time series analysis generally, and causality inference in time series data specifically, driven by information theoretic approaches, notably the MIME and PMIME measures.

Prof. George Sugihara is a theoretical biologist who has worked across a variety of fields, introducing inductive theoretical approaches to understanding complex dynamic systems in nature from observational data. Chief among those is Empirical Dynamic Modeling, a non-parametric approach for the analysis and forecasting of complex dynamical systems rooted in chaos theory, represented in this post by the CCM method.

Dr. Jakob Runge has done substantial work on causality in time series data, mainly in the context of climate research; he is also the creator of Tigramite, a Python library for causal inference in time series data using the PCMCI method.

Youssef Hmamouche is one of the authors and maintainers of the NlinTS R package for neural network-based time series forecasting and causality detection in time series data, and recently wrote about a Causality-Based Feature Selection Approach For Multivariate Time Series Forecasting.

Other notable literature 📖

Learning and causal inference

Judea Pearl, who is a prominent researcher in the field and who has developed the structural approach to causal inference, recently wrote a very interesting piece on causal inference tools and reflections on machine learning [Pearl, 2018]. He also wrote a thorough overview of the topic of causal inference [Pearl, 2009].

David Lopez-Paz, a research scientist at Facebook AI Research, leads very interesting research on causal inference in general, and in the context of learning frameworks and deep learning specifically. Highlights include posing causal inference as a learning problem (specifically, of classifying probability distributions), causal generative neural networks, incorporation of an adversarial framework for causal discovery and discovering causal signals in images.

Uri Shalit, Asst. Prof. at the Technion, heads a lab dedicated to machine learning and causal inference in healthcare, and one of his main research interests is the intersection of machine learning and causal inference, with a focus on using deep-learning methods for causal inference.

Krzysztof Chalupka has done some fascinating research in the intersection of deep learning and causal inference. Highlights include a deep-learning-based conditional independence test, causal feature learning, visual causal feature learning and causal regularization.

Finally, [Dong et al. 2012] have used the Multi-Step Granger Causality Method (MSGCM), a method developed to identify feedback loops embedded in biological networks using time-series experimental measurements, to identify feedback loops in neural networks.

References 📖

Academic Literature: Causality and causality inference

Academic Literature: Causality inference in time series data

Academic Literature: Other

Other Online Sources

Footnotes 📖

  1. For a brief overview of the F-test see here and here.
  2. For an overview of the Chi-squared test see the Wikipedia article on the topic.
  3. In lmtest, the grangertest method calls the waldtest method without assigning a value to its test parameter (which determines whether an F test or a chi-square test is applied), so an F test is used by default.
  4. Geweke et al. performed a comparison of 8 methods for inferring causality in time series data [Geweke et al. 1983] and found that Wald variants of a test attributed to Granger, and a lagged dependent variable version of Sims’ test introduced in that paper, are equivalent in all relevant respects and are preferred to the other tests discussed.
  5. The size of a statistical test is the probability of it making a Type I error; i.e. falsely rejecting the null hypothesis.
  6. A method is consistent if its probability of errors goes to zero as the number of observations increase.
  7. The Causal Markov Condition: A variable in a graph is, conditional on its parents, probabilistically independent of all other variables that are neither its parents nor its descendants.
  8. 𝛹-causality vs 𝛹-non-causality: The same definition of causality can sometimes be referred to in two seemingly contradictory names; e.g. Granger causality and Granger non-causality refer to the same definition of causality. This is the case because in many cases the definitions are given for the inverse condition, and X is said to be 𝛹-causing Y if the given condition does not hold. I use only the first form for consistency.
  9. A note on notation: I try to keep notation as close to the source material as possible, but as Medium does not support inline math expressions, I do the best I can with Unicode characters. Specifically, square brackets are used repeatedly where subscript would have been used but is not available in Unicode; for example, the i-th element of a vector v will be denoted by vᵢ, but the t-th element will be denoted by v[t].

--


Data Science Consultant. Teacher @ Tel Aviv University's business school. CEO @ Datahack nonprofit. www.shaypalachy.com