
Building a simple Auto Encoder via Decision Trees

How to build an Auto Encoder using random Decision Trees: eForest

© by my lovely wife Tinati Kübler

Decision trees are extremely versatile structures. You can use them for classification and regression (CART, Random Forests, …), for anomaly detection (Isolation Forests, …), and, as we will see, also for building auto encoders, among other constructions.

Disclaimer: I did not come up with this idea. This tree-based auto encoder is due to Ji Feng and Zhi-Hua Zhou [1] and is called "eForest". They presented their work at AAAI-18.

However, I think the paper lacks some clarity on how to decompress elements from the latent space, which I will address in this article. I will also give an easy-to-digest introduction to all the ingredients used for creating an eForest.

So, before we start, let’s do a quick recap of what decision trees and auto encoders are.

High-Level Recap of Important Concepts

Decision Trees

A decision tree is a method to cluster a set of samples from an n-dimensional feature space into different bins. This is done by checking for certain constraints on the n features of the samples.

Take, for example, the feature space X = ℝ×ℝ×{0, 1}×{A, B, C} and the samples (1.7, 4.3, 0, A), (2.2, 3.6, 1, B), (3.5, 2.6, 0, C), and (4.1, 1.9, 1, A) living within X. I made up an example decision tree to sort these samples into four bins. Maybe the following picture is self-explanatory, even if you have never heard of decision trees before:

A decision tree. In each node (blue diamonds) the set of samples is split into two disjoint subsets, depending on a feature fulfilling a special requirement or not. For example, in bin 4 we can find all samples with the first coordinate being strictly greater than 4 and the last coordinate being A or C (= not B).

In the end, each of the four samples gets sorted into one of the four bins. Since a single sample cannot take two paths at once (a feature cannot be both less than or equal to 4 and strictly greater than 4), each sample ends up in exactly one bin.
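To make this concrete, here is the toy tree written out as plain Python. This is a sketch I reconstructed from the splits described in this article, not code from [1]:

def tree_bin(x):
    """Toy decision tree from the picture: returns the bin (1 to 4)
    a sample x = (x1, x2, x3, x4) ends up in."""
    x1, x2, x3, x4 = x
    if x1 <= 4:                       # root split on the first feature
        return 1 if x2 >= 3 else 2    # then split on the second feature
    else:
        return 3 if x4 == "B" else 4  # then split on the fourth feature

samples = [(1.7, 4.3, 0, "A"), (2.2, 3.6, 1, "B"),
           (3.5, 2.6, 0, "C"), (4.1, 1.9, 1, "A")]
print([tree_bin(s) for s in samples])  # [1, 1, 2, 4]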

Auto Encoders

Auto encoders are basically lossy compression algorithms. They allow data to be expressed with fewer bits, which in turn saves storage. Working with the compressed samples also speeds up (machine learning) algorithms, since a shorter input usually results in a lower running time.

Auto encoders consist of two algorithms:

  • Encoder: Compresses data by removing redundancies and noise.
  • Decoder: Tries to restore the original data from its compressed form.

To be more precise, the encoder projects a sample from the n-dimensional feature space to its compressed form in the m-dimensional so-called latent space, with m < n.

The decoder takes a sample from the m-dimensional latent space and decompresses it again into the n-dimensional feature space.


A good auto encoder should be able to

  1. take a sample x
  2. compress it to x’ = e(x)
  3. and decompress it again to x” = d(x’) with the property x ≈ x” = d(e(x)).

This means that packing and unpacking a sample x should not alter it too much. Of course, this should hold for any x in the feature space.

Usually, we cannot expect a perfect reconstruction of x. Take n = 1 and m = 0 (i.e. the latent space consists of only a single point x’) for example. Any encoder will send two different samples x₁ and x₂ to the same point x’ in the latent space. So, in the best case, a decoder can restore x’ to only one of x₁ and x₂ and therefore gets at least one decoding wrong. If x’ gets decompressed to some other x, the decoder even gets both samples wrong.

A popular representative of auto encoders is Principal Component Analysis (PCA), which linearly maps data from n to m dimensions and back, although people usually use only the encoder, to reduce the dimension of the data.
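To illustrate this encoder/decoder pair, here is a minimal sketch using scikit-learn’s PCA on made-up data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # 100 samples, n = 4 features

pca = PCA(n_components=2)                     # latent space of dimension m = 2
X_latent = pca.fit_transform(X)               # encoder: n -> m dimensions
X_restored = pca.inverse_transform(X_latent)  # decoder: m -> n dimensions

print(((X - X_restored) ** 2).mean())         # reconstruction error, x ≈ x”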

A Toy Example: eTree

Let us now see how to build an auto encoder out of decision trees. I claim that we have already seen an eForest mapping into a latent space of dimension 1 in this article! Let’s call this an eTree.

Encoding

Scroll back to the decision tree example picture, or let me do it for you:

Shortcut to the picture above. Really, it’s the same decision tree again.

We see that each of the four samples gets sorted into one of four bins.

The core idea is to use the number of the bin the sample gets mapped into as the encoding.

So (1.7, 4.3, 0, A) and (2.2, 3.6, 1, B) get encoded as 1, (3.5, 2.6, 0, C) gets encoded as 2, and (4.1, 1.9, 1, A) as 4. In short: e((1.7, 4.3, 0, A)) = e((2.2, 3.6, 1, B)) = 1, e((3.5, 2.6, 0, C)) = 2, and e((4.1, 1.9, 1, A)) = 4, where e is the eTree encoder we just introduced.
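In code, using the tree_bin sketch from above, the eTree encoder is simply the bin function itself:

e = tree_bin  # the eTree encoder just returns the bin number
print(e((3.5, 2.6, 0, "C")))  # 2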

Easy, right?

The problem begins when we want to decode this number again.

Decoding

In [1], the authors introduce rules and so-called Maximal-Compatible Rules (MCRs) to conduct the decoding. The concept is not difficult, but I found it hard to understand from the paper at first. So, let me explain it differently.

Let us assume that we get the encoding 1, i.e. there is an x from the feature space with e(x)=1 where e is the eTree encoder. Following the unique path leading into bin 1, we pick up what I will call clues about the input x.

Landing in bin 1 requires

  1. Clue 1: X₁ to be less than or equal to 4, and
  2. Clue 2: X₂ to be greater than or equal to 3,

i.e. we know that x looks something like this:

x = (X₁, X₂, X₃, X₄) with X₁ ≤ 4 and X₂ ≥ 3,

or, to be more precise, we can deduce that

x ∈ (−∞, 4] × [3, ∞) × {0, 1} × {A, B, C}.

Then we have to pick any element from this set of potential candidates and use it as the decoding. We can do this in a deterministic or probabilistic way; it does not matter.

One way, which is also used in [1], is the minimum pick: for each set, pick its infimum (its greatest lower bound); if the set is an interval that is unbounded below, pick its right bound instead. Using this deterministic rule, we would decode 1 to (4, 3, 0, A) (assuming A < B < C). Other easy methods include picking the maximum or the mean, or just sampling randomly from the set.
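In code, the minimum pick could look like the following sketch, where each component of the candidate set is represented either as a finite set or as a (low, high) interval (open vs. closed bounds are glossed over):

import math

def minimum_pick(components):
    """Sketch: take the smallest value of each component; if an interval
    is unbounded below, take its right bound instead."""
    decoded = []
    for c in components:
        if isinstance(c, set):
            decoded.append(min(c))
        else:
            low, high = c
            decoded.append(high if low == -math.inf else low)
    return decoded

# candidate set for encoding 1: (-inf, 4] x [3, inf) x {0, 1} x {A, B, C}
print(minimum_pick([(-math.inf, 4), (3, math.inf), {0, 1}, {"A", "B", "C"}]))
# [4, 3, 0, 'A']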


In our example, the set of potential candidates for decoding 1 is quite large, so we can expect the reconstruction to be quite bad. One way to counter this is to make the tree deeper, i.e. use more splits. More splits generate more clues which in turn generate smaller candidate sets. Let us use one more split in our toy example:

Toy eTree with one more level. The encoding can now take values from 1 to 5.

Here, an encoding of 1 picks up an additional clue about the first feature, so the candidate set’s first component shrinks from (−∞, 4] to a bounded interval. This leaves less room for messing up the first feature, giving a lower reconstruction error. Another way to improve the decoding is to embed knowledge about the feature space: e.g. sometimes features are always greater than zero, or bounded above by 1 or 255 (picture data), so we can replace bounds of ±∞ with these known ones.

But making the decision tree deeper and deeper is not a sustainable solution, since we will then run into overfitting problems, just as in regression and classification tasks. Yet, so far, we are still using only one latent dimension! So, what can we do?

eForest

Let us connect more decision trees in parallel for encoding! Then every single tree gives us one dimension of the latent space, and the bin number of each tree is the corresponding feature value. Consider the following example with 3 trees:

An eForest consisting of three decision trees. The bold arrows indicate the paths that x takes.

Each of the three decision trees gives us one coordinate of the latent space, which therefore has dimension 3 in this example. Each latent feature can be 1, 2, 3, or 4 in our case, but in general, it is also fine if each decision tree has a different number of bins.
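As a sketch in code, an eForest encoder simply evaluates every tree and collects the bin numbers (the bin functions tree1, tree2, tree3 below are hypothetical stand-ins, analogous to tree_bin from above):

def eforest_encode(trees, x):
    """Sketch: one latent coordinate per tree, namely the bin x lands in."""
    return tuple(tree(x) for tree in trees)

# with three bin functions analogous to tree_bin from above:
# eforest_encode([tree1, tree2, tree3], x) could give (2, 1, 1)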

So, let us imagine that our input x is encoded to (2, 1, 1) in the latent space, and we wish to decode it now. We only have to collect all the clues again and put them together to create a set of potential candidates.

The clues are:

  • From Tree 1, bin 2: X₁ is less than or equal to 4, and X₂ is less than 3.
  • From Tree 2, bin 1: X₁ is greater than or equal to 2, and X₃ is equal to 0.
  • From Tree 3, bin 1: X₄ is equal to C, and X₂ is greater than 1.

Putting everything together gives us

x ∈ [2, 4] × (1, 3) × {0} × {C}.

So the minimum decoding would give us d(x)=(2, 1, 0, C).
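The combination step is just a per-feature intersection of the clues. A minimal sketch for the interval-valued features (again glossing over open vs. closed bounds):

import math

def intersect_intervals(clues):
    """Sketch: combine several (low, high) interval clues for one feature
    into their intersection."""
    low = max(c[0] for c in clues)
    high = min(c[1] for c in clues)
    return (low, high)

# X1: "less than or equal to 4" (tree 1) and "greater than or equal to 2" (tree 2)
print(intersect_intervals([(-math.inf, 4), (2, math.inf)]))  # (2, 4)
# X2: "less than 3" (tree 1) and "greater than 1" (tree 3)
print(intersect_intervals([(-math.inf, 3), (1, math.inf)]))  # (1, 3)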


That is nearly everything there is to understand about eForests! What is still missing is how to actually train an eForest, but that part is short:

Use random decision trees, i.e. trees that use random features and random cut-off points (from a reasonable range) for splitting.

That is how I also made up my examples. There was no fancy algorithm involved.

Now, let us move to some experiments to see if they are any good in terms of performance.

Experiments

The authors of [1] conducted several experiments, among them image encoding, where CNN auto encoders shine. I will directly paste the results, tables, and images from [1].

You can find the code used for the experiments on GitHub.

The datasets are the classical MNIST with 28x28x1=784 features (1 color channel) and CIFAR-10 with 32x32x3=3072 features (3 color channels).

About the Auto Encoders used in the Experiments

MLP is the "normal" auto encoder using several linear layers. The exact specifications can be found in [1]; MLP₁ uses a latent space of dimension 500 and MLP₂ one of dimension 1000. CNN-AE is a convolutional auto encoder, i.e. it uses convolutions internally; it follows this specification. The SWW-AE is a fancier CNN-AE. eForestˢ₅₀₀ is an eForest that also utilizes the labels of the samples, which we did not cover here; it compresses the input data to 500 features (the subscript 1000 version compresses to 1000 dimensions). Finally, eForestᵘ₅₀₀ is the eForest as discussed in this article; again, the subscript indicates the dimension of the latent space.

Reconstruction Errors

The results look promising! The reconstruction errors for the MNIST and CIFAR-10 datasets are quite low. From [1]:

The original test samples (first rows) and the reconstructed samples.

But one has to conduct further experiments to check whether eForests overfit or not. At least the authors tried model reuse and ended up with good results, which is a good indicator that the eForest generalizes well and did not merely learn the training samples by heart. Model reuse means training a model on one dataset and evaluating it on a different one.

The original test samples (first rows) and the reconstructed samples. All models were trained on one dataset and tested on a different one! The results look good when using eForests.

I just don’t understand why the authors used a latent space of dimension 1000 for the MNIST dataset when the original data only has dimension 28*28=784. Using 500 dimensions would have been more meaningful, in my opinion. But even in this artificial setting, the eForest performs better than the other methods and nearly achieves lossless compression, which is what I would expect.

Computation Efficiency

The training time is incredibly low compared to the other methods. However, the encoding and decoding take more time, which still leaves room for optimization, according to the authors. From [1]:

My Proof-of-Concept Implementation

Since the algorithm is quite easy, I tried to implement it myself. I didn’t even have to start from scratch since scikit-learn kind of already comes with the encoder!

The class RandomTreesEmbedding also creates numerous random decision trees and encodes an input x depending on the bins it lands in. However, it uses a binary (one-hot) encoding, which bloats the latent space dimension. Using my example with three trees and 4 bins per tree, the encoding would be (0, 1, 0, 0 | 1, 0, 0, 0 | 1, 0, 0, 0) instead of (2, 1, 1), i.e. the latent space would have a dimension of twelve instead of just three. But converting the binary representation into a single number per tree is easy once you know the number of leaves per tree. Luckily, scikit-learn has us covered here and provides a method for this.
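A minimal sketch of what this looks like on made-up data:

import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

X = np.random.RandomState(0).normal(size=(10, 4))  # toy data with 4 features

embedder = RandomTreesEmbedding(n_estimators=3, max_depth=2, random_state=0)
one_hot = embedder.fit_transform(X)  # sparse one-hot matrix, one block per tree

print(one_hot.shape)  # (10, total number of leaves over all 3 trees)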

Thus, implementing the encoding part is simple! You can also see it in the short encode method in my code. It is merely

import numpy as np

# one-hot leaf encoding over all trees, shape (n_samples, total number of leaves)
output = self.random_trees_embedding.transform(X)
indices = []
start = 0
for estimator in self.random_trees_embedding.estimators_:
    # slice out this tree's block of columns; the position of the 1
    # within the block is the bin number for this tree
    n_leaves = estimator.get_n_leaves()
    indices.append(output[:, start:start + n_leaves].argmax(axis=1))
    start = start + n_leaves
# one column per tree: the latent representation
return np.array(indices).transpose()

The decoding took me a bit more time, and the way I did it is also inefficient, so please bear with me! 😀 I just needed a quick proof-of-concept.
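To give you an idea of the core step, namely recovering the clues (the per-feature bounds along the path to a given leaf) from a fitted scikit-learn tree, here is a simplified sketch for numerical features. It is not my actual implementation:

import numpy as np

def leaf_bounds(estimator, leaf_id, n_features):
    """Sketch: collect the clues for one leaf of a fitted sklearn tree,
    i.e. per-feature (low, high] bounds along the root-to-leaf path."""
    t = estimator.tree_
    low = np.full(n_features, -np.inf)
    high = np.full(n_features, np.inf)

    def path_to(node):
        """Depth-first search for the path from node down to leaf_id."""
        if node == leaf_id:
            return [node]
        if t.children_left[node] == -1:  # a different leaf, dead end
            return None
        for child in (t.children_left[node], t.children_right[node]):
            path = path_to(child)
            if path is not None:
                return [node] + path
        return None

    path = path_to(0)  # node 0 is the root
    assert path is not None, "leaf_id not found in this tree"
    for parent, child in zip(path, path[1:]):
        f, thr = t.feature[parent], t.threshold[parent]
        if child == t.children_left[parent]:   # went left:  X_f <= threshold
            high[f] = min(high[f], thr)
        else:                                  # went right: X_f >  threshold
            low[f] = max(low[f], thr)
    return low, high

Intersecting these bounds over all trees and then applying the minimum pick yields the decoding.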

You can find my code here. Please, feel free to improve it!

My graphical results:

I think the reconstructions look fine, so my decoding probably works. But please review the code before you use it for anything productive.

And if you want the real deal, go to the official clone of the implementation on GitHub and work with this. But be prepared to create a virtual environment and patch some scikit-learn files manually.

Conclusion

We have seen how to build an auto encoder with decision trees. It is an auto encoder with a unique, yet easy-to-understand design that can be trained blazingly fast. Moreover, the reconstruction accuracy can compete with well-established Convolutional auto encoders.

On the other hand, the encoding and decoding speed is still low and the eForest has to be tested on real data sets in the wild. Maybe you can be the one to write an article about it. 🙂

References

[1] J. Feng and Z.-H. Zhou, AutoEncoder by Forest (2017), Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)


I hope that you learned something new, interesting, and useful today. Thanks for reading!

As the last point, if you

  1. want to support me in writing more about Machine Learning and
  2. plan to get a Medium subscription anyway,

why not do it via this link? This would help me a lot! 😊

To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.

Thanks a lot, if you consider supporting me!

If you have any questions, write me on LinkedIn!

