Getting Started

What do elections to the Great Council of Genoa in the 17th century and stock prices all over the world in the 21st century have in common? Both are random, and people bet on them (see [1]). From simple games of chance to sophisticated derivatives, people have tried (and still try) to make money by modelling randomness.
Breakthroughs in probability theory in the 18th and 19th centuries include Bernoulli’s law of large numbers and the de Moivre–Laplace theorem (an early version of the central limit theorem). Both are based on strong assumptions about the given observations. With advances in modern probability theory, both results have been generalized in many directions and shown to hold under rather weak assumptions. Today, practitioners and data scientists can draw on a variety of tools from time series analysis, i.e. methods that allow for temporal dependence of the data and even for changes in the distribution of the underlying data-generating process.
It is particularly important for data scientists to be able to work with time series since, on a high level, the quality of any machine learning model in production is a time series and should be analyzed accordingly.
In the first part of this mini-series, we introduce the basic concepts that are needed to understand the difference between i.i.d. data and time series.
In the next part, we tackle tests for the two common assumptions of independence and stationarity and see an implementation of them in Python.
Independence
Let’s go to a place where arguably many ideas in probability theory and statistics originated: the casino. If we play roulette, the result of one round does not seem to influence the result of the next. The probability of any given number is 1/37 because there are 37 numbers in total. The probability that the same number comes up again in the next round is still 1/37. If there has been a streak of 10 ‘reds’, the probability of another ‘red’ in the next round is the same as before, namely 18/37, since there are 18 red and 19 non-red numbers. The results of several rounds of roulette are independent.
In contrast, when we play poker with a regular deck of cards and our first card is the ace of spades, the probability of drawing another ace of spades should be zero. Otherwise, something is going on there… In any case, the results of drawing cards from a deck are dependent.
The above examples illustrate the concept of independence. Mathematically, two real-valued random variables X and Y are independent if the joint probability "factorizes", that is, P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y) for any real numbers x and y. If the probability does not factorize for some choice of x and y, the random variables are dependent. What might appear to be mathematical nitpicking is in fact a very important concept, because the individual probabilities P(X ≤ x) and P(Y ≤ y) are easier to handle than the joint probability. This is even more true if we deal not with two but with n random variables.
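To make the factorization tangible, here is a minimal simulation sketch (assuming NumPy; the normal samples, the thresholds x and y, and the sample size are illustrative choices): for independent X and Y, the empirical joint probability should be close to the product of the empirical marginals.

```python
import numpy as np

# Minimal sketch: check P(X <= x, Y <= y) = P(X <= x) * P(Y <= y) empirically
# for two independently simulated standard normal samples (illustrative choice).
rng = np.random.default_rng(42)
n = 100_000
X = rng.standard_normal(n)
Y = rng.standard_normal(n)

x, y = 0.3, -0.5                               # arbitrary thresholds
joint = np.mean((X <= x) & (Y <= y))           # estimates P(X <= x, Y <= y)
product = np.mean(X <= x) * np.mean(Y <= y)    # estimates P(X <= x) * P(Y <= y)
print(f"joint: {joint:.4f}, product: {product:.4f}")  # approximately equal
```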

Identical Distribution
Going back to our new favorite place, the casino, we can observe even more. The probability of a certain event, such as ‘red’ at a roulette table, is the same at every roulette table. Similarly, the probability of a particular card, such as the ace of spades, is the same at any poker table. In both cases the distributions are identical.
Again, there is a mathematical definition that formalizes this concept: two real-valued random variables X and Y are identically distributed if P(X ≤ z) = P(Y ≤ z) for any real number z. And again, this mathematical property drastically simplifies the statistical analysis.
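As a quick illustration (again assuming NumPy; the exponential distribution and the evaluation points z are arbitrary choices), the empirical versions of P(X ≤ z) and P(Y ≤ z) should be close at every z when X and Y are sampled from the same distribution:

```python
import numpy as np

# Minimal sketch: for identically distributed X and Y, the empirical estimates of
# P(X <= z) and P(Y <= z) are close for any z. Here both samples come from the
# same exponential distribution (illustrative choice).
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=50_000)
Y = rng.exponential(scale=1.0, size=50_000)

for z in (0.5, 1.0, 2.0):
    print(f"z={z}: P(X<=z) ~ {np.mean(X <= z):.3f}, P(Y<=z) ~ {np.mean(Y <= z):.3f}")
```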

Time Series
A sequence of random variables X(1), …, X(n) that satisfies the two conditions of independence and identical distribution is called independent and identically distributed or i.i.d. Many important results in statistics, such as the central limit theorem, were formulated for i.i.d. random variables first. In some cases the assumption of i.i.d. data might be reasonable, but in many fields, the data of interest is neither independent nor identically distributed.
For example, the price of a stock today depends on its price yesterday, and the volatility of the stock might change over time, which implies a change in the underlying distribution.

In these cases we want to relax the strong assumption of i.i.d. data and use a methodology that remains valid for data that is neither independent nor identically distributed. This is where time series come into play.
A time series is a collection of random variables indexed by time, for example X(1), …, X(n). In particular, the random variables can be dependent and their distribution might change over time. As this definition does not provide a lot of structure to work with, we need some additional assumptions to deduce meaningful results.
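For intuition, here is a minimal simulated example (assuming NumPy; the random-walk model and its parameters are illustrative choices) of a time series that violates both parts of the i.i.d. assumption, much like the stock-price example above:

```python
import numpy as np

# Minimal sketch of a time series that is neither independent nor identically
# distributed: a random walk whose innovations grow in scale over time
# (both the model and its parameters are illustrative choices).
rng = np.random.default_rng(1)
n = 500
scale = np.linspace(0.5, 2.0, n)       # time-varying volatility -> changing distribution
innovations = scale * rng.standard_normal(n)
X = np.cumsum(innovations)             # X(t) = X(t-1) + innovation(t) -> dependence
print(X[:5])
```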

Stationarity (strong, weak, local)
The definition of time series does not imply any restrictions on the underlying distribution. So theoretically, the distribution could change completely at any time instant. This behavior does not allow much statistical inference and is often restricted by the assumption of stationarity. A time series is called stationary if its distribution does not change over time.
In quantitative finance, for example, the log returns are often assumed to be stationary under the efficient-market hypothesis.
More formally, a time series X(i), indexed by the integers, is (strongly) stationary if for any real numbers x₁, x₂, …, xₖ and any integers t₁, …, tₖ and h it holds that P[ X(t₁) ≤ x₁, X(t₂) ≤ x₂, …, X(tₖ) ≤ xₖ ] = P[ X(t₁+h) ≤ x₁, X(t₂+h) ≤ x₂, …, X(tₖ+h) ≤ xₖ ].
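The shift invariance can be checked numerically by Monte Carlo simulation. The sketch below (assuming NumPy; the AR(1) model, the thresholds and the shift h are illustrative choices) estimates the same joint probability at two different time offsets for a stationary AR(1) process:

```python
import numpy as np

# Minimal Monte Carlo sketch of shift invariance for a stationary AR(1) process
# X(t) = 0.5 * X(t-1) + eps(t); model, thresholds and shift are illustrative.
rng = np.random.default_rng(7)
reps, length, burn, phi = 20_000, 100, 200, 0.5

eps = rng.standard_normal((reps, length + burn))
x = np.zeros((reps, length + burn))
for t in range(1, length + burn):
    x[:, t] = phi * x[:, t - 1] + eps[:, t]
x = x[:, burn:]                        # drop the burn-in so the process is (approximately) stationary

t1, t2, h = 10, 12, 50
x1, x2 = 0.0, 0.5
p = np.mean((x[:, t1] <= x1) & (x[:, t2] <= x2))
p_shifted = np.mean((x[:, t1 + h] <= x1) & (x[:, t2 + h] <= x2))
print(f"P[X(t1) <= x1, X(t2) <= x2]     ~ {p:.3f}")
print(f"P[X(t1+h) <= x1, X(t2+h) <= x2] ~ {p_shifted:.3f}")   # approximately equal
```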
In practice many time series are stationary, for example those generated by certain ARMA models, and a vast amount of literature is based on this assumption. However, in other settings the assumption might be too restrictive. Thus, a first relaxation of stationarity is weak stationarity, which basically means that the expected value and the covariance structure do not change over time. Formally, a time series X(i) with finite variance is weakly stationary if E[X(i)] = E[X(0)] and E[X(i) X(i+h)] = E[X(0) X(h)] for any time instant i and any integer h. Note that (strong) stationarity implies weak stationarity, but not vice versa.
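Both conditions can be eyeballed on simulated data. A minimal sketch (assuming NumPy; the AR(1) model and the lag are illustrative choices): on a stationary path, the sample mean and the lag-1 sample autocovariance of the first half of the series should be close to those of the second half.

```python
import numpy as np

# Minimal sketch of the weak-stationarity conditions: for a stationary AR(1) path,
# the sample mean and lag-h sample autocovariance of the two halves of the series
# should be close (the model and the lag h are illustrative choices).
rng = np.random.default_rng(3)
n, burn, phi = 20_000, 200, 0.5
eps = rng.standard_normal(n + burn)
x = np.zeros(n + burn)
for t in range(1, n + burn):
    x[t] = phi * x[t - 1] + eps[t]
x = x[burn:]

def mean_and_autocov(segment, h=1):
    m = segment.mean()
    return m, np.mean((segment[:-h] - m) * (segment[h:] - m))

print(mean_and_autocov(x[: n // 2]))   # first half
print(mean_and_autocov(x[n // 2 :]))   # second half: similar values
```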
Probably most time series that we encounter in practice are (weakly) stationary or can be transformed into stationary time series through differencing or other transformations. However, there is a third and less well-known notion that I want to mention briefly without going into too much detail. Consider a sequence of i.i.d. errors ε(1), ε(2), …, ε(n) and a continuous function μ. Then X(i,n) = μ(i/n) + ε(i) is clearly not stationary. However, it can be approximated locally by a family of stationary time series: for any u such that i/n is close to u, X(i,n) is close to Y(i|u) = μ(u) + ε(i). This model is part of a larger class of time series called locally stationary, and there are different definitions of local stationarity in the literature that allow for different methods (see [2], [3] or [4]). Locally stationary time series arise naturally whenever the underlying distribution varies smoothly. An important example is the global climate, which changes gradually over time.
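The toy model from the paragraph above is easy to simulate. A minimal sketch (assuming NumPy; the choice μ(u) = sin(2πu), the sample size and the window width are illustrative):

```python
import numpy as np

# Minimal sketch of the locally stationary model X(i, n) = mu(i/n) + eps(i)
# with a continuous mean function mu (mu(u) = sin(2*pi*u) is an illustrative choice).
rng = np.random.default_rng(5)
n = 10_000
mu = lambda u: np.sin(2 * np.pi * u)   # smooth, continuous mean function
eps = rng.standard_normal(n)
i = np.arange(1, n + 1)
X = mu(i / n) + eps                    # globally non-stationary

# Locally, around a fixed u, X(i, n) behaves like the stationary process Y(i|u) = mu(u) + eps(i):
u = 0.5
window = np.abs(i / n - u) < 0.05      # indices with i/n close to u
print(X[window].mean(), mu(u))         # local sample mean is close to mu(u)
```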
Dependence Structure
Not only the time-varying distribution but also the dependence structure poses an additional challenge to statistical analysis. It is difficult to measure the dependence between random variables directly, so we often use their correlation as a proxy. Two random variables X and Y are uncorrelated if their covariance Cov(X,Y) is zero. As a rule of thumb, the larger the covariance is in absolute value, the more dependent the two random variables are. Note that independence of two random variables implies uncorrelatedness, but not vice versa.
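The classic counterexample for the last sentence can be simulated in a few lines (assuming NumPy): with X standard normal and Y = X², the covariance is zero, yet Y is completely determined by X and the joint probability does not factorize.

```python
import numpy as np

# Minimal sketch: uncorrelated does not imply independent. For X standard normal
# and Y = X**2, Cov(X, Y) = E[X**3] = 0, yet Y is fully determined by X.
rng = np.random.default_rng(11)
X = rng.standard_normal(1_000_000)
Y = X ** 2
print(np.cov(X, Y)[0, 1])                      # close to 0: uncorrelated
print(np.mean((X <= 1.0) & (Y <= 1.0)),        # joint probability ...
      np.mean(X <= 1.0) * np.mean(Y <= 1.0))   # ... clearly differs from the product
```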
To measure the dependence of an entire time series, we use a similar approach. For a time series X(i) indexed by the integers, we define the long-run variance as
σ²(i) = Σₕ Cov( X(i), X(i+h) ),
where the sum runs over all integers h.
For stationary time series the (auto-)covariances in the sum do not depend on i and σ²(i)=σ². If X(i) is a sequence of i.i.d. random variables, the long-run variance is simply the variance Var(X(i)). We will see later that the long-run variance plays a role similar to the variance in the central limit theorem. Of course, it is not clear whether the series in the definition of the long-run variance converges. Indeed, we say that a time series is weakly or short range dependent if the series converges and the long-run variance exists. If the long-run variance diverges to infinity, the time series is long range dependent.
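In practice the long-run variance has to be estimated. A common approach is to truncate the sum and weight the sample autocovariances; the sketch below (assuming NumPy; the Bartlett weights, the bandwidth and the AR(1) example are illustrative choices, not a prescription from the text) shows the idea.

```python
import numpy as np

# Minimal sketch of a long-run variance estimate: a truncated, Bartlett-weighted sum
# of sample autocovariances (bandwidth and the AR(1) example are illustrative).
rng = np.random.default_rng(13)
n, burn, phi = 50_000, 200, 0.5
eps = rng.standard_normal(n + burn)
x = np.zeros(n + burn)
for t in range(1, n + burn):
    x[t] = phi * x[t - 1] + eps[t]
x = x[burn:]

def long_run_variance(x, bandwidth):
    x = x - x.mean()
    lrv = np.mean(x * x)                                    # lag-0 sample autocovariance
    for h in range(1, bandwidth + 1):
        gamma_h = np.mean(x[:-h] * x[h:])                   # lag-h sample autocovariance
        lrv += 2.0 * (1.0 - h / (bandwidth + 1)) * gamma_h  # Bartlett weight
    return lrv

print(long_run_variance(x, bandwidth=50))   # close to the true value 1 / (1 - phi)**2 = 4
```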
Intuitively, a time series is weakly dependent if events in the past have only a small influence on the value of the time series at the present moment. For example, the temperature last year only has a small influence on the temperature today, whereas the temperature yesterday has a larger influence.
There are different concepts that imply weak dependence, such as different mixing conditions, weak physical dependence or cumulant conditions. For most time series in applications, these conditions should be satisfied and it is reasonable to use methods based on them.
Wrap Up
By now we have seen the concepts of independence and identical distribution and the generalization from i.i.d. data to time series. In the second part of this mini-series we will use the central limit theorem to explore further differences between i.i.d. data and time series.
Many tools that we use daily as data scientists are based on the assumptions of independence or stationarity. In these cases, it is crucial to make sure that the assumptions are satisfied. In the next part, we will see how to test these assumptions in practice.
[1] D. R. Bellhouse, The Genoese Lottery (1991), Statist. Sci. 6, no. 2, 141–148. doi:10.1214/ss/1177011819. https://projecteuclid.org/euclid.ss/1177011819
[2] R. Dahlhaus, On the Kullback-Leibler information divergence of locally stationary processes (1996), Stochastic Processes and their Applications, 62(1), 139–168.
[3] Z. Zhou and W. B. Wu, Local linear quantile estimation for nonstationary time series (2009), The Annals of Statistics, 37(5B), 2696–2729.
[4] M. Vogt, Nonparametric regression for locally stationary time series (2012), The Annals of Statistics, 40(5), 2601–2633.