Hands-on Tutorials

Compact Prediction Tree

A Lossless Model for Accurate Sequence Prediction over a finite alphabet

Louis Fruleux
Towards Data Science
7 min read · Oct 29, 2020


Image by Bela Geletneky from Pixabay

What is a Sequence Prediction problem?

The sequence prediction problem consists of finding the next element of an ordered sequence by looking only at its previous items.

This problem has applications in a variety of domains, including product recommendation, forecasting, and web page prefetching.

Many different approaches have been studied for this problem; popular ones include PPM (Prediction by Partial Matching), Markov chains, and, more recently, LSTMs (Long Short-Term Memory networks).

The Compact Prediction Tree (CPT) is an approach published in 2015 that aims to match the accuracy of popular algorithms while outperforming them on training and prediction time, thanks to a lossless compression of the entire training set.
We will now go into the details of how it is trained, how it makes predictions, and what the pros and cons of this method are.

Compact Prediction Tree definition

Before going into the details of how it is used to make predictions, let’s describe the different elements that compose a Compact Prediction Tree (CPT):

  • A trie, for efficient storage of the sequences.
  • An inverted index, for constant-time retrieval of the sequences containing a given element.
  • A lookup table, for retrieving a sequence from its sequence id.

Trie

A trie, commonly called a prefix tree, is an ordered tree-based data structure for storing sequences (such as strings). The elements of the sequences are stored on the edges, so all descendants of a given node share a common prefix.

In this well-known example, we would like to store ["tea", "ten", "inn"].
We first put “tea” in the empty tree; each edge of the tree corresponds to a letter. Then we add “ten”: since “tea” and “ten” share the “te” prefix, we only create a new branch after that prefix. Finally, we add “inn”, which shares no prefix with the two previous sequences, so it starts a new branch from the root.

Tries are commonly used to fetch, with a single tree traversal, every word starting with a given prefix. In our case, we use one to compress and efficiently store our training set.
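To make this concrete, here is a minimal trie sketch in Python, using nested dictionaries (an illustration of the idea only, not the reference implementation):

def trie_insert(root, sequence):
    # Walk down the trie, creating a child only when the prefix is new.
    node = root
    for element in sequence:
        node = node.setdefault(element, {})
    return node  # the last node of the sequence

root = {}
for word in ["tea", "ten", "inn"]:
    trie_insert(root, word)

print(root)
# {'t': {'e': {'a': {}, 'n': {}}}, 'i': {'n': {'n': {}}}}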

Inverted index

An inverted index is an index storing a mapping from elements to their locations. Here we use it to store a mapping from the elements of the sequences to the IDs of the sequences containing them.

Considering our previous example again, we notice that “T” appears in sequences 0 and 1, “E” in sequences 0 and 1, “A” in sequence 0, “N” in sequences 1 and 2, and “I” in sequence 2 only.
Hence we have the following inverted index:

{
"T": [0, 1],
"E": [0, 1],
"A": [0],
"N": [1, 2],
"I": [2]
}

As a matter of efficiency, we will use bitsets to store the inverted index (this also helps in finding “similar sequences” efficiently).
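Here is a minimal sketch of such an inverted index in Python, using plain integers as bitsets (bit i of an element’s mask is set when sequence i contains that element); again, this is only an illustration, not the authors’ implementation:

from collections import defaultdict

sequences = ["tea", "ten", "inn"]

inverted_index = defaultdict(int)  # element -> bitset of sequence ids
for seq_id, sequence in enumerate(sequences):
    for element in set(sequence):
        inverted_index[element] |= 1 << seq_id  # set bit number seq_id

# Sequences containing both "t" and "e": a bitwise AND intersects the two bitsets.
both = inverted_index["t"] & inverted_index["e"]
print([i for i in range(len(sequences)) if both >> i & 1])  # [0, 1]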

Lookup table

The lookup table is a data structure (usually an array) used to store a pointer to the last node (leaf) of each sequence.

This data structure is essential for iterating over the elements of a given sequence.
For instance, to retrieve the sequence with id 0, we can simply walk up through the parents of its leaf. In our example, the leaf of sequence 0 is reached through an “A” edge, its parent through an “E”, and finally a “T”, which gives us our first sequence backward: “TEA”.
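A minimal sketch of this idea in Python (illustrative only): each trie node keeps a pointer to its parent, and the lookup table maps a sequence id to the leaf where that sequence ends.

class Node:
    def __init__(self, element=None, parent=None):
        self.element = element  # element carried by the edge leading to this node
        self.parent = parent    # pointer back to the parent node
        self.children = {}

root = Node()
lookup_table = {}  # sequence id -> last node (leaf) of that sequence

for seq_id, sequence in enumerate(["tea", "ten", "inn"]):
    node = root
    for element in sequence:
        if element not in node.children:
            node.children[element] = Node(element, parent=node)
        node = node.children[element]
    lookup_table[seq_id] = node

# Rebuild sequence 0 by walking from its leaf back up to the root.
node, elements = lookup_table[0], []
while node.parent is not None:
    elements.append(node.element)
    node = node.parent
print("".join(reversed(elements)))  # "tea"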

Compact Prediction Tree use

Training

The training time is linear with the number of sequences in the training set.
For each training sequence, we need to do 3 steps:

  • Insert the sequence in the trie.
  • Add the elements in the inverted index.
  • Add the location of the last node in the lookup table.

A picture is worth a thousand words; here is a fully trained Compact Prediction Tree, built step by step.
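In code, an illustrative training loop performing the three steps at once might look like the following sketch (simplified, reusing the same toy example as before):

from collections import defaultdict

def fit(sequences):
    trie = {}
    inverted_index = defaultdict(set)  # element -> ids of sequences containing it
    lookup_table = {}                  # sequence id -> last trie node of the sequence
    for seq_id, sequence in enumerate(sequences):
        node = trie
        for element in sequence:
            node = node.setdefault(element, {})   # 1. insert into the trie
            inverted_index[element].add(seq_id)   # 2. update the inverted index
        lookup_table[seq_id] = node               # 3. store the last node
    return trie, inverted_index, lookup_table

trie, inverted_index, lookup_table = fit(["tea", "ten", "inn"])
print(inverted_index["t"])  # {0, 1}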

Predictions

To predict the element following a sequence S, we need to do these 3 steps:

  • Find similar sequences (sequences containing every element of S).
  • Compute the consequent sequence of each similar sequence (the subsequence starting after the last item in common with S).
  • Count the occurrences of each element in all the consequent sequences.

For this part, let’s take a slightly more complex example where our model trains on ["ABC", "ABD", "BC", "BAC"]. After training, the model should look like this:

And let’s try to predict what should come after "AB".

First, let’s compute similar sequences.
The sequences similar to S are the sequences containing every item of S in any order, any position.
To compute the similar sequences, we can simply use our inverted index and compute intersections. For our example, we need to intersect [0, 1, 3] (the ids of the sequences where A appears) with [0, 1, 2, 3] (where B appears), which gives [0, 1, 3].
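Here is a sketch of this step in Python, using plain sets in place of bitsets (illustrative only):

sequences = ["ABC", "ABD", "BC", "BAC"]

inverted_index = {}  # element -> set of ids of sequences containing it
for seq_id, sequence in enumerate(sequences):
    for element in set(sequence):
        inverted_index.setdefault(element, set()).add(seq_id)

target = "AB"
# Similar sequences: ids present in the index entry of every element of the target.
similar = set.intersection(*(inverted_index[element] for element in target))
print(sorted(similar))  # [0, 1, 3]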

Then, we need to compute the consequent sequences.
The consequent of a sequence Y with respect to a sequence S is the subsequence of Y starting after the last item in common with S until the end of Y.
The consequent sequences for our example are "C", "D", and "C" (for sequences 0, 1, and 3 with respect to "AB").

Finally, we simply count the occurrences of each element appearing in the consequent sequences and predict the element with the most occurrences, which is "C" (2 occurrences) in our example.
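Continuing the sketch from the previous step, the last two steps (computing the consequents and counting occurrences) could look like this:

from collections import Counter

sequences = ["ABC", "ABD", "BC", "BAC"]
similar = [0, 1, 3]   # sequence ids found in the previous step
target = set("AB")

counts = Counter()
for seq_id in similar:
    sequence = sequences[seq_id]
    # Position of the last element of this sequence that also appears in "AB".
    last_common = max(i for i, element in enumerate(sequence) if element in target)
    consequent = sequence[last_common + 1:]  # everything after that position
    counts.update(consequent)

print(counts)                 # Counter({'C': 2, 'D': 1})
print(counts.most_common(1))  # [('C', 2)] -> predict "C"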

Compact Prediction Tree performance

Test Framework

In a few words, the authors predicted the next element on several public datasets with several algorithms (including CPT, dependency graphs, Markov-like algorithms, and others).
To see the results (to be taken with caution), I suggest you have a look at the original paper, Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction.

Another paper, Predicting train journeys from smart card data: a real-world application of the sequence prediction problem, walks through the steps of tackling a sequence prediction problem, from the construction of the alphabet to the choice of the model.
In this paper, Dr. Hoekstra compares CPT with different approaches such as PPM and AKOM, and shows that, depending on your needs, CPT can be a relevant solution with relatively good accuracy.

Pros and Cons

The pros of Compact Prediction Trees are:

  • CPT stores all the data from the training set losslessly.
  • The algorithm is highly explainable (it is easy to trace back the similar sequences, the consequent sequences, and so on).
  • Fast prediction. Some benchmarks can be found in the original paper or here.

The cons of Compact Prediction Trees are:

  • CPT ignores the order of the elements within the sequences, so it tends to predict the most frequent item. (If you can accept a slightly slower prediction that takes order into account, I suggest you check out the subseq algorithm, developed by one of the same co-authors.)
  • CPT is quite sensitive to noise: if an unseen element appears in the sequence to predict, CPT won’t be able to find any similar sequence. This has been addressed by CPT+, whose details you can find in the “To go further” section.

To go further

Python example

For this small example, we will use this implementation. It is referenced on the official website of one of the co-authors as an implementation of CPT. It also includes some features of CPT+ such as noise reduction.

from cpt.cpt import Cpt

model = Cpt()

# training
model.fit([['hello', 'world'],
           ['hello', 'this', 'is', 'me'],
           ['hello', 'me']])

# predictions
model.predict([['hello'], ['hello', 'this']])
# Output: ['me', 'is']

For more information about the parameters of the predict method or the hyperparameters, you can check out the documentation.

Further reading

If you are interested in the subject, I strongly suggest the following papers:

  • Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction (the original CPT paper)
  • CPT+: Decreasing the Time/Space Complexity of the Compact Prediction Tree
  • Predicting train journeys from smart card data: a real-world application of the sequence prediction problem

Special thanks to Joseph Rocca for all the help.
Thanks to my friends who also proofread this article.
