
Conditional Random Field Tutorial in PyTorch 🔥

Let’s say I have two identical-looking dice, but one is fair and the other is loaded so that the number 6 comes up with 80% probability, while the numbers 1–5 each appear with 4% probability. If I gave you a sequence of 15 rolls, could you predict which dice I used for each roll?

Photo credit: Brett Jordan on Unsplash

A simple model would be to predict I used the biased dice whenever a 6 comes up and say I used the fair dice for all other numbers. In fact, if I was equally likely to use either dice for any roll, then this simple rule is the best you could do.

But what if, after using the fair dice, I have a 90% chance of using the biased dice on the next roll? If the next roll is a 3, your model will predict I used the fair dice when the biased dice is actually the more probable choice. We can verify this with Bayes’ theorem:

y_i is the random variable of the dice type for roll i; x_i is the value of the dice for roll i.
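Concretely, plugging in the numbers above (the fair dice gives each face probability 1/6, the biased dice gives a 3 probability 0.04, and there is a 90% chance of switching to the biased dice after a fair roll), the comparison works out to:

P(y_i = biased | x_i = 3, y_{i-1} = fair) ∝ P(x_i = 3 | y_i = biased) · P(y_i = biased | y_{i-1} = fair) = 0.04 × 0.9 = 0.036
P(y_i = fair | x_i = 3, y_{i-1} = fair) ∝ P(x_i = 3 | y_i = fair) · P(y_i = fair | y_{i-1} = fair) = (1/6) × 0.1 ≈ 0.017

Since 0.036 > 0.017, the biased dice is the more probable explanation even though the roll wasn’t a 6.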

The takeaway is that making the most likely choice at each time step is only a viable strategy when I’m equally likely to use either dice. In the far more likely scenario that previous choices of dice affect what my future choices will be, you’re going to have to take the interdependence of the rolls into account in order to be successful.

A Conditional Random Field* (CRF) is a standard model for predicting the most likely sequence of labels that correspond to a sequence of inputs. There are plenty of tutorials on CRFs but the ones I’ve seen fall into one of two camps: 1) all theory without showing how to implement or 2) code for a complex Machine Learning problem with little explanation of what’s going on.

I don’t fault those authors for picking either theory or code. CRFs are a deep topic within a broader and deeper subject called probabilistic graphical models, so covering both theory and implementation in depth would take a book, not a blog post. But this split makes learning about CRFs harder than it needs to be.

My goal for this tutorial is to cover just enough theory so that you can dive into the resources in category 1 with an idea of what to expect and to show how to implement a CRF on a simple problem you can replicate on your own laptop. This will hopefully equip you with the intuition needed to adapt this simple toy CRF for a more complicated problem.

Theory

Our theoretical discussion will be divided into three parts: 1) specifying model parameters, 2) how to estimate these parameters, and 3) using these parameters to make predictions. These three broad categories apply to any statistical model, even a simple logistic regression, so in that sense CRFs aren’t anything special. But that doesn’t mean that CRFs are as simple as logistic regression models. We’ll see that things will get a bit more complicated once we tackle the fact we’re making a sequence of predictions as opposed to a single prediction.

Specifying model parameters

In this simple problem, the only parameters we need to worry about are the scores associated with transitioning from one dice to the next in consecutive rolls. There are six such numbers, and we’ll store them in a 2×3 matrix called the transition matrix.

The first column corresponds to transitions from the fair dice on the previous roll to the fair dice (value in row 1) and the biased dice (value in row 2) on the current roll. So the first entry in the first column encodes the score for predicting that I use the fair dice on the current roll given that I used the fair dice on the previous roll. If the data show that I’m unlikely to use the fair dice in consecutive rolls, the model will learn that this score should be low, and vice versa. The same logic applies to the second column.

The first and second columns of the matrix assume we know which dice I used on the previous roll, so we have to treat the first roll as a special case. We’ll store the corresponding scores in the third column.
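As a concrete sketch (the variable names here are my own, not necessarily those in the original notebook), the transition matrix can be stored as a single learnable tensor in PyTorch:

import torch
import torch.nn as nn

# Rows index the dice for the current roll (0 = fair, 1 = biased).
# Columns 0 and 1 index the dice for the previous roll (fair, biased);
# column 2 holds the scores for the first roll, which has no previous dice.
n_states = 2
transition = nn.Parameter(torch.randn(n_states, n_states + 1))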

Parameter Estimation

Let’s say I give you a set of rolls X and their corresponding dice labels Y. We will find the transition matrix T that minimizes the negative log likelihood over the training data. I’ll show you what the likelihood and negative log likelihood look like for a single sequence of dice rolls. To get it for the entire data set, you’d average over all the sequences.
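For concreteness, one way to write a linear-chain likelihood consistent with the terms described below (an emission term P(x_i | y_i) and a transition score T(y_i | y_{i-1}); the exact form in the original equations may differ slightly) is:

P(y | x) = exp( Σ_i [ log P(x_i | y_i) + T(y_i | y_{i-1}) ] ) / Σ_{y'} exp( Σ_i [ log P(x_i | y'_i) + T(y'_i | y'_{i-1}) ] )

-log P(y | x) = -Σ_i [ log P(x_i | y_i) + T(y_i | y_{i-1}) ] + log Σ_{y'} exp( Σ_i [ log P(x_i | y'_i) + T(y'_i | y'_{i-1}) ] )

For the first roll, T(y_1 | y_0) is replaced by the first-roll score stored in the third column of the transition matrix.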

P(x_i | y_i) is the probability of observing a given dice roll given the current dice label; for example, P(x_i | y_i) = 1/6 if y_i = fair. The other term, T(y_i | y_{i-1}), is the score for having transitioned from the previous dice label to the current one. We can read this score directly off the transition matrix.

Notice how in the denominator we’re computing the sum over all possible sequences of labels y′. In a traditional logistic regression for a two-class classification problem, we’d have two terms in the denominator. But now we’re dealing with sequences, and for a sequence of length 15 there are a total of 2¹⁵ possible sequences of labels, so the number of terms in the denominator is huge. The "secret sauce" of the CRF is that it exploits the fact that the current dice label only depends on the previous one to compute that huge sum efficiently.

This secret-sauce algorithm is called the forward-backward algorithm*. Covering it in depth is beyond the scope of this blog post, but I’ll point you to helpful resources below.

Sequence Prediction

Once we’ve estimated our transition matrix, we can use it to find the most likely sequence of dice labels for a given sequence of dice rolls. The naive way to do this is to compute the likelihood of every possible sequence, but that becomes intractable even for sequences of moderate length. Just like we did for parameter estimation, we’ll have to use a special algorithm to find the most likely sequence efficiently. This algorithm is closely related to the forward-backward algorithm and it’s called the Viterbi algorithm.

Code

PyTorch is a Python library built for training deep learning models. Although we’re not doing deep learning, PyTorch’s automatic differentiation will let us train our CRF model via gradient descent without computing any gradients by hand, which saves us a lot of work. Using PyTorch also forces us to implement the forward part of the forward-backward algorithm and the Viterbi algorithm ourselves, which is more instructive than using a specialized CRF Python package.

Let’s start by envisioning what the result needs to look like. We need a method for computing the log likelihood for an arbitrary sequence of rolls, given the dice labels. Here is one way it could look:
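The actual method was embedded as a gist in the original post. Here is a minimal sketch of how it could be structured; the class and helper names are my own assumptions, not the original code:

import torch
import torch.nn as nn

class CRF(nn.Module):
    """Linear-chain CRF for the two-dice problem (illustrative sketch)."""

    def __init__(self, n_states=2):
        super().__init__()
        # Columns 0 and 1: previous dice was fair/biased; column 2: first roll.
        self.transition = nn.Parameter(torch.randn(n_states, n_states + 1))
        self.n_states = n_states

    def log_likelihood(self, rolls, states):
        """Log likelihood of the dice labels `states` given the observed rolls."""
        # 1) map each observed roll to per-dice log-likelihoods
        loglikelihoods = self._data_to_likelihood(rolls)
        # 2) numerator: score of the observed label sequence
        numerator = self._compute_numerator(loglikelihoods, states)
        # 3) denominator: log of the sum over all possible label sequences
        denominator = self._compute_denominator(loglikelihoods)
        return numerator - denominator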

This method does three main things: 1) maps the value on the dice to a likelihood, 2) computes the numerator of the log likelihood term, 3) computes the denominator of the log likelihood term.

Let’s first tackle the _data_to_likelihood method, which will help us do step 1. We’ll create a 6 x 2 matrix where the first column holds the log-likelihood of rolls 1–6 for the fair dice, and the second column holds the log-likelihood of rolls 1–6 for the biased dice. This is what this matrix looks like for our problem:

array([[-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -3.21887582],
       [-1.79175947, -0.22314355]])

Now, if we see a roll of 4, we can just select the fourth row of the matrix. The first entry of that row is the log-likelihood of a four for the fair dice (log(1/6)) and the second entry is the log-likelihood of a four for the biased dice (log(0.04)). This is what the code looks like:
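The original code lived in a gist; here is a sketch of what _data_to_likelihood could look like as a method of the CRF class sketched above, with the constants matching the probabilities of our two dice:

import numpy as np
import torch

def _data_to_likelihood(self, rolls):
    """Map each roll (0-5) to its log-likelihood under the fair and biased dice."""
    fair = np.full((6, 1), 1 / 6)                 # every face: probability 1/6
    biased = np.array([[0.04]] * 5 + [[0.8]])     # faces 1-5: 0.04, face 6: 0.8
    loglikelihoods = np.log(np.hstack([fair, biased]))
    # pick out the row for each observed roll -> shape (sequence length, 2)
    return torch.tensor(loglikelihoods[rolls], dtype=torch.float32)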

Next, we’ll write the methods to compute the numerator and denominator of the log likelihood.
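Again, the original implementations were gists. Below is a hedged sketch of both pieces, written as methods of the CRF class sketched earlier: the numerator just sums the emission and transition scores along the labelled path, while the denominator runs the forward algorithm, a dynamic program that computes the log-sum-exp over all 2¹⁵ label sequences in linear time:

import torch

def _compute_numerator(self, loglikelihoods, states):
    """Score of the observed label sequence: emission plus transition scores."""
    score = self.transition[states[0], 2] + loglikelihoods[0, states[0]]
    for i in range(1, len(states)):
        score = score + self.transition[states[i], states[i - 1]] + loglikelihoods[i, states[i]]
    return score

def _compute_denominator(self, loglikelihoods):
    """Forward algorithm: log of the sum of exp(score) over all label sequences."""
    # alpha[j] = log-sum of scores of all partial sequences ending in state j
    alpha = self.transition[:, 2] + loglikelihoods[0]
    for i in range(1, loglikelihoods.shape[0]):
        # rows = current state, columns = previous state
        scores = self.transition[:, :2] + alpha.unsqueeze(0) + loglikelihoods[i].unsqueeze(1)
        alpha = torch.logsumexp(scores, dim=1)
    return torch.logsumexp(alpha, dim=0)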

That’s it! We have all the code we need to start learning our transition matrix. But if we want to make predictions after training our model, we’ll have to code the Viterbi algorithm:
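The Viterbi code was also a gist in the original post. A minimal sketch consistent with the transition-matrix layout above replaces the forward algorithm’s log-sum-exp with a max and keeps backpointers so we can recover the best path:

import torch

def viterbi_decode(self, rolls):
    """Most likely sequence of dice labels (0 = fair, 1 = biased) for the rolls."""
    loglikelihoods = self._data_to_likelihood(rolls)
    # best score of any path ending in each state, plus backpointers
    score = self.transition[:, 2] + loglikelihoods[0]
    backpointers = []
    for i in range(1, loglikelihoods.shape[0]):
        # rows = current state, columns = previous state
        scores = self.transition[:, :2] + score.unsqueeze(0) + loglikelihoods[i].unsqueeze(1)
        score, best_prev = scores.max(dim=1)
        backpointers.append(best_prev)
    # walk the backpointers from the best final state
    best_state = int(score.argmax())
    path = [best_state]
    for best_prev in reversed(backpointers):
        best_state = int(best_prev[best_state])
        path.append(best_state)
    return list(reversed(path))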

There’s more to our implementation but I’ve only included the big functions we discussed in the theory section.

Evaluating on Data

I evaluated the model on some data I simulated using the following probabilities:

  1. P(first dice in sequence is fair) = 0.5
  2. P(current dice is fair | previous dice is fair) = 0.8
  3. P(current dice is biased | previous dice is biased) = 0.35

Check out the notebook I made to see how I generated the data and trained the CRF.
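For reference, a minimal simulation consistent with those three probabilities might look like this (my own sketch, not the original notebook code):

import numpy as np

def simulate_sequence(length=15, seed=0):
    """Simulate rolls (0-5) and dice labels (0 = fair, 1 = biased)."""
    rng = np.random.default_rng(seed)
    fair_probs = np.full(6, 1 / 6)
    biased_probs = np.array([0.04] * 5 + [0.8])
    rolls, states = [], []
    biased = rng.random() < 0.5                  # P(first dice is fair) = 0.5
    for _ in range(length):
        states.append(int(biased))
        rolls.append(rng.choice(6, p=biased_probs if biased else fair_probs))
        p_stay = 0.35 if biased else 0.8         # P(same dice on the next roll)
        biased = biased if rng.random() < p_stay else not biased
    return np.array(rolls), np.array(states)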

The first thing we’ll do is look at the estimated transition matrix. The model learned that I am more likely to roll the fair dice on the current roll if I used the fair dice on the previous roll (-1.38 < -0.87). The model also learned that I am more likely to use the fair dice after using the biased dice, but not by a lot (-0.59 < -0.41). The model assigns roughly equal scores to both dice for the first roll (-0.52 ≈ -0.55).

array([[-0.86563134, -0.40748784, -0.54984874],
       [-1.3820231 , -0.59524935, -0.516026  ]], dtype=float32)

Next, we’ll see what the predictions look like for a particular sequence of rolls:

# observed dice rolls
array([2, 3, 4, 5, 5, 5, 1, 5, 3, 2, 5, 5, 5, 3, 5])
# corresponding labels. 0 means fair
array([0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1])
# predictions
array([0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0])

The model recognizes long sequences of 6’s (these are the 5’s since we’re starting from 0) as coming from the biased dice, which makes sense. Notice that the model doesn’t assign every 6 to the biased dice, though (eighth roll). This is because prior to that 6, we’re pretty confident we’re at the fair dice (we rolled a 2) and transitioning to the biased dice from the fair dice is less likely. I’m ok with that mistake – I’d say our model is successful!

Conclusion

I’ve shown you a little bit of the theory behind CRFs as well as how one can be implemented for a simple problem. There’s certainly a lot more to them than I’ve been able to cover here, so I encourage you to check out the sources I’ve linked below.

Further Reading:

An Introduction to Conditional Random Fields: Overview of CRFs, Hidden Markov Models, as well as derivation of forward-backward and Viterbi algorithms.

Using CRFs for named entity recognition in PyTorch: Inspiration for this post. Shows how a CRF can be applied to a more complex application in NLP.

Footnotes

*To be precise, we’re covering a linear-chain CRF, which is a special case of the CRF in which the sequences of inputs and outputs are arranged in a linear sequence. Like I said before, this topic is deep.

*Since we’re using PyTorch to compute gradients for us, we technically only need the forward part of the forward-backward algorithm.

