See the Future with Stochastic Processes

Introductory guide with R code included

Christopher Kazakis
Towards Data Science

--

Image by joakant from Pixabay

Many of you might not have heard of stochastic processes before and may be wondering how they might be relevant to you. Firstly, statisticians might find stochastic processes a nice way of modeling probabilistic events. Additionally, those interested in reinforcement learning may find that this information solidifies their understanding of RL concepts such as Markov chains. Lastly, this article is short and easy to follow, so if you’re curious about stochastic processes themselves, then this is a good introduction.

Readers of this tutorial should be familiar with matrix operations, such as multiplication, and have a basic understanding of linear algebra. For the sake of brevity, many linear algebra concepts are not explained here. The article includes R code for those who are interested, but it is not necessary to follow the code as you read.

Basics and Definition

The first question many of you may have is what exactly a stochastic process is. The word ‘stochastic’ literally means ‘random’, though stochastic processes are not necessarily random: in fact, they can be entirely deterministic. Simply put, a stochastic process is any mathematical process that can be modeled with a family of random variables. A coin toss is a great example because of its simplicity. We start with a coin heads-up and then flip it exactly once. The probability of the coin landing on heads is .5, and of landing on tails is .5. If the coin lands on heads, the ‘state’ of the world is unchanged, while if the coin lands on tails, the ‘state’ has changed. This can be modeled with the matrix below:

            0 (heads)  1 (tails)
0 (heads)      .5         .5
1 (tails)      .5         .5

A simple transition matrix.

Or, a more visual approach:

The same matrix, this time as a directed graph.

The rows represent the current state, and the probabilities of transitioning to the other states are in the columns. If you’re in state 0 (heads) and you flip the coin, your chance of ending up in tails (state 1) is the (0, 1) entry in the matrix above (.5).
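If you want to follow along in R, here is a minimal sketch of the same coin-flip matrix (keep in mind that R itself indexes from 1):

coin <- matrix(c(.5, .5,
                 .5, .5), nrow = 2, byrow = TRUE)  # row 0 = heads, row 1 = tails
coin[1, 2]  # the article's (0, 1) entry: P(tails | heads) = .5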

As another example, suppose you have 2 printers in a library. When both printers are working, there is a 30 percent chance that one breaks every day, and a 0 percent chance that they both break. If one printer is broken, then there is a 20 percent chance that a repairman comes by on that day and fixes it, and a 40 percent chance that the second printer breaks. When both are broken, then there is a 100 percent chance that a repairman comes by and fixes both printers. The matrix representing this process is shown below:

     0    1    2
0   0.7  0.3  0.0
1   0.2  0.4  0.4
2   1.0  0.0  0.0

A transition matrix. Notice the indexing starts from 0.

Notice that (0, 0) = .7 since (0, 1) = .3 and (0, 2) = 0. Each row must add up to 1 because each row is a discrete probability distribution over the next state. The same reasoning gives entry (1, 1) = .4. Notice also that the final row deterministically moves from state 2 to state 0. To make sure you understand, answer the following questions:

1. What does the entry at (0, 1) represent?

2. What would the new matrix look like if the repairman fixed both printers with P = .7, but sometimes only had time to fix one printer, with P = .1, when both are broken? (A quick way to sanity-check an answer in R follows below.)
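Whatever matrix you come up with, an easy sanity check in base R is that every row sums to exactly 1. Using the original printer matrix as a stand-in for your answer:

candidate <- matrix(c(.7, .2, 1, .3, .4, 0, 0, .4, 0), nrow = 3, ncol = 3)  # stand-in: the original printer chain, filled column by column
rowSums(candidate)  # every entry should be exactly 1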

The Markov Property

Now that we have seen two simple examples, let’s talk about the hallmark property of many random processes: the Markov property. A system or variable is said to be Markov when knowing the current state of the process is just as good as knowing the entire history. In the printer example, if we define the random variable Xn to be the number of broken printers at the nth step of the process (with lowercase xn denoting an actual observed value), then knowing Xn is sufficient to give a distribution for Xn+1. Simply put:

P(Xn+1 = x | Xn = xn, Xn-1 = xn-1, ..., X0 = x0) = P(Xn+1 = x | Xn = xn)

Mathematically, this equation states that Xn+1 depends only on the value of Xn.

Intuitively, knowing that both printers are functioning is enough to know the probabilities of all future states. It does not matter how many printers were broken in the previous state or the state before that. A coin toss is another example: the odds of a flipped coin landing on heads are entirely independent of the results of any previous coin tosses. Markov processes are said to be ‘memoryless’ because the history of their states is irrelevant; only the current state matters. (Sometimes not even that: in the coin toss example, you don’t need to know the current state at all, since P(heads) = .5 and P(tails) = .5 in every state.) In terms of reinforcement learning, where the idea is to specify an optimal move with some probability distribution, many board games can be considered Markov. Chess is one such game, as selecting an optimal move is unaffected by the exact move order preceding the current position. There are a few exceptions, such as the en passant rule and whether either side may still castle, but these can be encoded into the current state, which makes the chain Markov.

Predicting the Future 🔮

Now that we’ve covered the Markov property, let’s return to the printer example. It is trivial to calculate P(Xn = 1 | Xn-1 = 0) (the probability that the chain moves to state 1 given that it is currently in state 0), since it is just the (0, 1) entry of the matrix. But what if we wanted the two-step-ahead probability? To get it, we sum over all possible two-step paths that start at 0 and end at 1:

P(Xn+1 = 1 | Xn-1 = 0) = M(0, 0)M(0, 1) + M(0, 1)M(1, 1) + M(0, 2)M(2, 1)
                       = (.7)(.3) + (.3)(.4) + (0)(0) = .33

Why is this the case? This probability is the sum of the probabilities of every possible sequence of steps that could lead to state 1. Starting in state 0 at step n-1, there are three possible states at step n (given by the three entries in row 0 of the matrix). From whichever state the chain is in at step n, the probability of then moving to state 1 is the corresponding entry in column 1.

You may have noticed that this essentially amounts to taking the dot product of row 0 and column 1. That’s exactly it! In fact, if we do this for every pair of states, we get the two-step-ahead probabilities for every entry (i, j).

Let’s use R to square the matrix:

matrixpower <- function(mat, k) {
  if (k == 0) return(diag(dim(mat)[1]))
  if (k == 1) return(mat)
  if (k > 1) return(mat %*% matrixpower(mat, k - 1))
} # matrix power function from Introduction to Stochastic Processes with R by Robert Dobrow

printers <- matrix(c(.7, .2, 1, .3, .4, 0, 0, .4, 0), nrow = 3, ncol = 3)  # build the matrix (R fills it column by column)
printers
     [,0] [,1] [,2]
[0,]  0.7  0.3  0.0
[1,]  0.2  0.4  0.4
[2,]  1.0  0.0  0.0

matrixpower(printers, 2)
     [,0] [,1] [,2]
[0,] 0.55 0.33 0.12
[1,] 0.62 0.22 0.16
[2,] 0.70 0.30 0.00

# Note that I've manually changed the indexing in the output to start from 0, as that's the typical
# way of indexing matrices for stochastic processes. R indexing, however, starts from 1. I did this
# for the sake of consistency; just keep it in mind should you explore this in your own R Notebook.
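As a quick check of the dot-product observation from earlier, the (0, 1) entry of the squared matrix is just row 0 of the original matrix dotted with column 1 (remember that R indexes from 1):

sum(printers[1, ] * printers[, 2])  # row 0 dotted with column 1, in the article's 0-based labels
# 0.33 -- the same value as matrixpower(printers, 2)[1, 2]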

To get three-step probabilities, we do it again:

matrixpower(printers, 3)
      [,0]  [,1]  [,2]
[0,] 0.571 0.297 0.132
[1,] 0.638 0.274 0.088
[2,] 0.550 0.330 0.120

It works for any n-step ahead probability:

matrixpower(printers, 6)
[,0] [,1] [,2]
[0,] 0.588127 0.294525 0.117348
[1,] 0.587510 0.293602 0.118888
[2,] 0.590590 0.293370 0.116040

So, the probability that we have 2 broken printers after 6 steps in the chain, given that we start with 0 broken printers, is M6(0, 2) ≈ .117.
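If you’d like to double-check that number empirically, here is a small simulation sketch (the helper function and the number of runs are my own choices):

set.seed(1)
step_chain <- function(P, start, nsteps) {
  state <- start
  for (i in seq_len(nsteps)) {
    # the next state is drawn using only the current state's row (states labeled 0, 1, 2)
    state <- sample(0:(nrow(P) - 1), size = 1, prob = P[state + 1, ])
  }
  state
}
runs <- replicate(10000, step_chain(printers, start = 0, nsteps = 6))
mean(runs == 2)  # should land near .117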

Using this method, we can calculate something called the invariant, or stationary, distribution. The invariant distribution describes the long-term probability of being in any given state, which is also the long-run proportion of time spent in that state. Since this probability is ‘long-term’, the start state does not matter, so every row of the limiting matrix is identical (equivalently, each column repeats a single value). This limiting matrix, denoted by Π, takes the following form for a 3 by 3 transition matrix, and generalizes to n by n:

Π = [ π0  π1  π2 ]
    [ π0  π1  π2 ]
    [ π0  π1  π2 ]

The invariant probability distribution.

You can find this easily by raising a transition matrix to a very large power. More formally, Π = lim Mn as n → ∞. In R:

matrixpower(printers, 100)
[,0] [,1] [,2]
[0,] 0.5882353 0.2941176 0.1176471
[1,] 0.5882353 0.2941176 0.1176471
[2,] 0.5882353 0.2941176 0.1176471

Or you can solve for it using a system of equations. The system follows from just two facts: first, the π values must sum to one (which must be true given their interpretation as probabilities); second, each πj must equal the sum of the long-term probabilities of all states multiplied by their transition probabilities into state j. In the printer example:

π0 + π1 + π2 = 1
π0 = .7π0 + .2π1 + π2
π1 = .3π0 + .4π1
π2 = .4π1

Feel free to validate that these equations hold using the calculated invariant distributions.
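If you’d rather let R solve that system, here is one sketch: stack the stationarity equations with the sum-to-one constraint and solve the resulting (overdetermined but consistent) system. The use of qr.solve is just one convenient choice of solver.

A <- t(printers) - diag(3)  # stationarity: row j encodes sum_i pi_i * P(i, j) - pi_j = 0
A <- rbind(A, rep(1, 3))    # add the constraint pi0 + pi1 + pi2 = 1
b <- c(0, 0, 0, 1)
qr.solve(A, b)              # 0.5882353 0.2941176 0.1176471 -- the same values as matrixpower(printers, 100)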

Communication Matters 💬

It is important to note, however, that there are cases where a limiting probability distribution does not exist in the form described. To understand when this is the case, we must define the reducibility of Markov processes. Reducibility is related to how the states of a Markov chain are connected: states A and B communicate if and only if A is reachable from B and B is reachable from A. A communication class is a set of states that all communicate with one another (see the graphic below).

A Markov chain with 2 communication classes.

We say that a Markov chain is irreducible if it consists of a single communication class. If this is not the case, then we say that the chain is reducible, and Π does not exist in the form above. However, there is still a limiting distribution.

reducible <- matrix(c(0, 0, 0, 0, 0, .4, .9, .7, 0, 0, 0, .1, .3, 0, 0, .6, 0, 0, .5, .5, 0, 0, 0, .5, .5), nrow = 5, ncol = 5)
reducible
     [,0] [,1] [,2] [,3] [,4]
[0,] 0 0.4 0.0 0.6 0.0
[1,] 0 0.9 0.1 0.0 0.0
[2,] 0 0.7 0.3 0.0 0.0
[3,] 0 0.0 0.0 0.5 0.5
[4,] 0 0.0 0.0 0.5 0.5

In the scenario above, state 0 leads into one of two loops that do not communicate with each other. So, assuming we start in state 0, we will move to either state 1 or state 3 and then bounce around within that loop for eternity. What does the limiting distribution look like in this scenario?

matrixpower(reducible, 100)
[,0] [,1] [,2] [,3] [,4]
[0,] 0 0.350 0.050 0.3 0.3
[1,] 0 0.875 0.125 0.0 0.0
[2,] 0 0.875 0.125 0.0 0.0
[3,] 0 0.000 0.000 0.5 0.5
[4,] 0 0.000 0.000 0.5 0.5

We see two blocks within the matrix that look like limiting probability distributions, and that is exactly what they are. Try to verify that each of these individual distributions is correct. State 0 is interesting in that it will be visited exactly once, and then there is no way back. One might expect this to mean that π0 = 0, and that’s exactly what we see: the long-term probability of being in state 0 is 0, which means the state is transient. Transient states can also be defined as states with a finite expected number of visits, while recurrent states, in contrast, are expected to be visited infinitely often.

Also note that the probabilities in row 0 do not match the long-term distributions of the individual loops. In the previous example, all the probabilities within a column were the same because the start state was irrelevant to long-term behavior (any state could eventually be reached from any other). Now the start state matters, because the chain has 3 communication classes. If we were to start in state 2, it would be impossible to reach state 4. If we start in state 0, any state can be reached, but states 3 and 4 are collectively more likely, because there is a one-time .6 chance of entering that loop forever, while states 1 and 2 have only a .4 chance of being reached from state 0. (The expected number of visits to every other state, starting from state 0, is still infinite, because any positive fraction of infinity is still infinite.)
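If you want to check the communication structure of a chain programmatically, one rough sketch is to use reachability: j is reachable from i when some power of the transition pattern has a positive (i, j) entry, and two states communicate when each can reach the other. The few lines below are my own helper logic, not a library routine.

adj <- (reducible > 0) * 1                  # 1 wherever a direct transition is possible
reach <- matrixpower(diag(5) + adj, 4) > 0  # reachable in at most 4 steps, enough for 5 states
reach & t(reach)                            # TRUE where two states communicate; the blocks are the classes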

The Random Walk

Image by Bela Balla from Pixabay.

With this knowledge of how transient states look in the long-term future, we can visit a classic stochastic process: the random walk.

In this example, a drunk person is trying to reach home. State 4 is his house. If he reaches his house, then he goes to sleep and does not leave. State 0 is the ocean. If he happens to stumble in the ocean, again, he will not leave (poor guy). Because he is a bit out of it, he moves in either direction with probability .5 when he is in the states in between. Below is a matrix showing his movement:

random_walk <- matrix(c(1, .5, 0, 0, 0, 0, 0, .5, 0, 0, 0, .5, 0, .5, 0, 0, 0, .5, 0, 0, 0, 0, 0, .5, 1), nrow = 5, ncol = 5)
random_walk
     [,0] [,1] [,2] [,3] [,4]
[0,] 1.0 0.0 0.0 0.0 0.0
[1,] 0.5 0.0 0.5 0.0 0.0
[2,] 0.0 0.5 0.0 0.5 0.0
[3,] 0.0 0.0 0.5 0.0 0.5
[4,] 0.0 0.0 0.0 0.0 1.0

The matrix above is a random walk with absorbing boundaries. States 0 and 4 are considered absorbing because they communicate only with themselves, so once you reach them, you never leave. It should be clear that the boundaries are recurrent while the middle states are transient.
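For an absorbing chain like this one, there is also a direct way to get the absorption probabilities rather than raising the matrix to a large power: the fundamental matrix N = (I - Q)^-1, where Q is the block of transitions among the transient states. A sketch, using R’s 1-based indexing to pull out the relevant blocks:

Q <- random_walk[2:4, 2:4]      # transitions among the transient states 1, 2 and 3
B <- random_walk[2:4, c(1, 5)]  # transitions from the transient states into the absorbing states 0 and 4
N <- solve(diag(3) - Q)         # fundamental matrix: expected visits to each transient state
N %*% B                         # absorption probabilities: rows = start state 1-3, columns = end up in 0 or 4

The result matches the first and last columns of rows 1 through 3 in the matrix-power computation below.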

If we raise the matrix to the 100th power, we get the following:

matrixpower(random_walk, 100)
[,0] [,1] [,2] [,3] [,4]
[0,] 1.00 0.000000e+00 0.000000e+00 0.000000e+00 0.00
[1,] 0.75 4.440892e-16 0.000000e+00 4.440892e-16 0.25
[2,] 0.50 0.000000e+00 8.881784e-16 0.000000e+00 0.50
[3,] 0.25 4.440892e-16 0.000000e+00 4.440892e-16 0.75
[4,] 0.00 0.000000e+00 0.000000e+00 0.000000e+00 1.00

Notice that all the states in the middle have P ≈ 0 (a sign of transience). We also see that the long-term probabilities change depending on the row of the matrix. Again, this is because the start state matters, only now there are multiple absorbing states. Starting in state 0 or state 4 means staying in state 0 or 4, respectively, forever. Starting in state 1 leads to a 75 percent chance of falling into the ocean and a 25 percent chance of getting home. As we might expect, the middle state, state 2, leads to either absorbing state with equal probability.

What if the drunkard sobered up a little and now has a .75 chance of walking towards his home, and a .25 chance of going towards the sea?

random_walk_2 <- matrix(c(1, .25, 0, 0, 0, 0, 0, .25, 0, 0, 0, .75, 0, .25, 0, 0, 0, .75, 0, 0, 0, 0, 0, .75, 1), nrow = 5, ncol = 5)
random_walk_2
     [,0] [,1] [,2] [,3] [,4]
[0,] 1.00 0.00 0.00 0.00 0.00
[1,] 0.25 0.00 0.75 0.00 0.00
[2,] 0.00 0.25 0.00 0.75 0.00
[3,] 0.00 0.00 0.25 0.00 0.75
[4,] 0.00 0.00 0.00 0.00 1.00
matrixpower(random_walk_2, 100)
      [,0]         [,1]         [,2]         [,3]  [,4]
[0,] 1.000 0.000000e+00 0.000000e+00 0.000000e+00 0.000
[1,] 0.325 2.514973e-22 0.000000e+00 7.544920e-22 0.675
[2,] 0.100 0.000000e+00 5.029947e-22 0.000000e+00 0.900
[3,] 0.025 8.383245e-23 0.000000e+00 2.514973e-22 0.975
[4,] 0.000 0.000000e+00 0.000000e+00 0.000000e+00 1.000

As we might expect, the odds of making it home greatly improve.

Thank you for reading this article. Please contact me with any questions. There’s certainly much more to cover, but this is a good start for anyone wanting a quick conceptual overview coupled with some easy examples to solidify things. If you enjoyed reading this article, then I would encourage you to check out my second piece on this topic: Magic Tricks with Markov Chains.

Further reading:

Introduction to Stochastic Processes by Greg Lawler, Ch. 1

Introduction to Stochastic Processes with R by Robert Dobrow, Ch. 2 and 3

Reinforcement Learning, An Introduction by Sutton and Barto, Ch. 2 and 3
