
Improving Autoencoder Performance with Pretrained RBMs

A PyTorch tutorial


Autoencoders are unsupervised neural networks used for representation learning. They create a low-dimensional representation of the original input data. The learned low-dimensional representation is then used as input to downstream models. While autoencoders are effective, training autoencoders is hard. They often get stuck in local minima and produce representations that are not very useful.

This post will go over a method introduced by Hinton and Salakhutdinov [1] that can dramatically improve autoencoder performance by initializing autoencoders with pretrained Restricted Boltzmann Machines (RBMs). While this technique has been around, it’s an often overlooked method for improving model performance.

What are Autoencoders?

Autoencoders are a combination of two networks: an encoder and a decoder. Raw input is given to the encoder network, which transforms the data to a low-dimensional representation. The low-dimensional representation is then given to the decoder network, which tries to reconstruct the original input. They are trained by trying to make the reconstructed input from the decoder as close to the original input as possible. Deep autoencoders are autoencoders with many layers, like the one in the image above.

What about RBMs?

RBMs are generative neural networks that learn a probability distribution over their input. Structurally, they can be seen as a two-layer network with one input ("visible") layer and one hidden layer. The first layer, the "visible" layer, contains the original input, while the second layer, the "hidden" layer, contains a representation of the original input. Similar to autoencoders, RBMs try to make the reconstructed input from the "hidden" layer as close to the original input as possible. Unlike autoencoders, RBMs use the same weight matrix for "encoding" and "decoding." Trained RBMs can be used as layers in neural networks. This property allows us to stack RBMs to create an autoencoder.

The Theory

The difficulty of training deep autoencoders is that they will often get stuck if they start off in a bad initial state. To address this, Hinton and Salakhutdinov found that they could use pretrained RBMs to create a good initialization state for the deep autoencoders. Let’s say that you wanted to create a 625–2000–1000–500–30 autoencoder. You would first train a 625–2000 RBM, then use the output of the 625–2000 RBM to train a 2000–1000 RBM, and so on. After you’ve trained the 4 RBMs, you would then duplicate and stack them to create the encoder and decoder layers of the autoencoder, as seen in the diagram below. The researchers found that they could fine-tune the resulting autoencoder to perform much better than an autoencoder trained directly with no pretrained RBMs.

The Code

Now that we understand how the technique works, let’s make our own autoencoder! I didn’t find any great PyTorch tutorials implementing this technique, so I created an open-source version of the code in this GitHub repo. The GitHub repo also has GPU-compatible code, which is omitted from the snippets here. The code portion of this tutorial assumes some familiarity with PyTorch.

Data

We’ll run the autoencoder on the MNIST dataset, a dataset of handwritten digits [2]. First, we load the data from PyTorch and flatten each image into a single 784-dimensional vector.
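Below is a minimal sketch of what that loading step might look like with torchvision; the exact transforms and batch sizes in the linked repo may differ.

```python
# A sketch of loading MNIST and flattening each 28x28 image into a 784-dim vector.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                      # (1, 28, 28) in [0, 1]
    transforms.Lambda(lambda x: x.view(-1)),    # flatten to 784
])

train_data = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_data = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=128, shuffle=False)
```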

Training RBMs

We’ll start with the hardest part, training our RBM models. Hinton and Salakhutdinov employ some tricks that most RBM implementations don’t contain. I’ll point out these tricks as they come.

In the constructor, we set up the initial parameters as well as some extra matrices for momentum during training. Note that this class does not extend PyTorch’s nn.Module because we will be implementing our own weight update function.
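A sketch of such a constructor is shown below. The attribute and parameter names (W, v_bias, h_bias, the momentum buffers, and the hyperparameter defaults) are my own choices to illustrate the structure; the repo’s implementation may differ.

```python
import torch

class RBM:
    """Restricted Boltzmann Machine. Not an nn.Module: weights are updated manually."""

    def __init__(self, n_visible, n_hidden, gaussian_hidden=False,
                 learning_rate=0.1, momentum=0.5, weight_decay=2e-4):
        # Single weight matrix shared by the "encoding" and "decoding" directions.
        self.W = 0.1 * torch.randn(n_visible, n_hidden)
        self.v_bias = torch.zeros(n_visible)    # visible-layer bias
        self.h_bias = torch.zeros(n_hidden)     # hidden-layer bias

        # Extra matrices that accumulate momentum between weight updates.
        self.W_momentum = torch.zeros(n_visible, n_hidden)
        self.v_bias_momentum = torch.zeros(n_visible)
        self.h_bias_momentum = torch.zeros(n_hidden)

        self.gaussian_hidden = gaussian_hidden  # continuous (Gaussian) hidden units if True
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.weight_decay = weight_decay
```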

Next, we add methods to convert the visible input to the hidden representation and the hidden representation back to reconstructed visible input. Both methods return the activation probabilities, while the sample_h method also returns the observed hidden state. Of note, we have the option to allow the hidden representation to be modeled by a Gaussian distribution rather than a Bernoulli distribution because the researchers found that allowing the hidden state of the last layer to be continuous allows it to take advantage of more nuanced differences in the data.
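The methods below are meant to live inside the RBM class sketched above. The Gaussian branch treats the linear activation as the mean and adds unit-variance noise when sampling, which is one common way to implement continuous hidden units; the repo’s exact code may differ.

```python
# These two methods belong inside the RBM class sketched above.

def sample_h(self, v):
    """Visible -> hidden. Returns activation probabilities and a sampled hidden state."""
    activation = v @ self.W + self.h_bias
    if self.gaussian_hidden:
        # Continuous hidden units: the linear activation is the mean; sampling adds unit noise.
        prob = activation
        sample = activation + torch.randn_like(activation)
    else:
        prob = torch.sigmoid(activation)
        sample = torch.bernoulli(prob)
    return prob, sample

def sample_v(self, h):
    """Hidden -> visible. Returns reconstruction probabilities using the same (transposed) weights."""
    return torch.sigmoid(h @ self.W.t() + self.v_bias)
```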

Finally, we add a method for updating the weights. This method uses contrastive divergence to update the weights rather than traditional backpropagation. RBMs are usually implemented this way, and we will keep with tradition here. For more details on the theory behind training RBMs, see this great paper [3].
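Here is a sketch of a CD-1 style update that fits the constructor above; it also belongs inside the RBM class. The momentum and weight-decay handling follows common RBM recipes rather than the repo’s exact code, so treat it as illustrative.

```python
# Also part of the RBM class: a CD-1 update applied to a batch of visible vectors.

def contrastive_divergence(self, v0):
    """Update W and the biases from one batch, without backpropagation."""
    # Positive phase: hidden statistics driven by the data.
    h0_prob, h0_sample = self.sample_h(v0)
    # Negative phase: reconstruct the visibles, then re-infer the hiddens.
    v1_prob = self.sample_v(h0_sample)
    h1_prob, _ = self.sample_h(v1_prob)

    batch_size = v0.size(0)
    positive_grad = v0.t() @ h0_prob
    negative_grad = v1_prob.t() @ h1_prob

    # Momentum-smoothed gradients plus a small weight-decay penalty on W.
    self.W_momentum = self.momentum * self.W_momentum + (positive_grad - negative_grad) / batch_size
    self.v_bias_momentum = self.momentum * self.v_bias_momentum + (v0 - v1_prob).mean(dim=0)
    self.h_bias_momentum = self.momentum * self.h_bias_momentum + (h0_prob - h1_prob).mean(dim=0)

    self.W += self.learning_rate * (self.W_momentum - self.weight_decay * self.W)
    self.v_bias += self.learning_rate * self.v_bias_momentum
    self.h_bias += self.learning_rate * self.h_bias_momentum

    # Reconstruction error, handy for monitoring training progress.
    return torch.mean((v0 - v1_prob) ** 2)
```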

Now that we have the RBM class set up, let’s train. For training, we take the input and send it through the RBM to get the reconstructed input. We then use contrastive divergence to update the weights based on how different the original input and reconstructed input are from each other, as mentioned above. After training, we use the RBM model to create new inputs for the next RBM model in the chain.
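A minimal training helper might look like the following. The function name train_rbm is my own; it assumes the RBM class (with the methods above) and returns the hidden representation of the whole dataset so it can feed the next RBM.

```python
# Hypothetical helper for training one RBM; assumes the class and methods sketched above.
def train_rbm(rbm, inputs, epochs=10, batch_size=128):
    """Train a single RBM with contrastive divergence, then return the hidden
    representation of the whole dataset as input for the next RBM in the chain."""
    for epoch in range(epochs):
        for start in range(0, inputs.size(0), batch_size):
            rbm.contrastive_divergence(inputs[start:start + batch_size])
    hidden_prob, _ = rbm.sample_h(inputs)
    return hidden_prob
```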

For the MNIST data, we train 4 RBMs (784–1000, 1000–500, 500–250, and 250–2) and store them in a list called models.
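Putting it together, the stacking loop could look roughly like this. It assumes train_data and the train_rbm helper from the earlier snippets; the smaller learning rate for the Gaussian code layer follows the common practice of using a much lower rate for linear hidden units.

```python
# Stack the four RBMs described above; assumes train_data and train_rbm from earlier snippets.
inputs = torch.stack([img for img, _ in train_data])  # (60000, 784) flattened training images

layer_sizes = [784, 1000, 500, 250, 2]
models = []
for i in range(len(layer_sizes) - 1):
    gaussian = (i == len(layer_sizes) - 2)  # the last (250-2) RBM gets a Gaussian hidden state
    rbm = RBM(layer_sizes[i], layer_sizes[i + 1],
              gaussian_hidden=gaussian,
              learning_rate=0.001 if gaussian else 0.1)  # smaller rate for linear hidden units
    inputs = train_rbm(rbm, inputs)
    models.append(rbm)
```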

Building an Autoencoder

Next, let’s take our pretrained RBMs and create an autoencoder. The following class takes a list of pretrained RBMs and uses them to initialize a deep autoencoder. The autoencoder is a feed-forward network with linear transformations and sigmoid activations. Of note, we don’t use the sigmoid activation in the last encoding layer (250–2) because the RBM initializing this layer has a Gaussian hidden state. We separate the encode and decode portions of the network into their own functions for conceptual clarity.
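A sketch of such a class is below. The way the RBM weights are copied into nn.Linear layers (transposed for the encoder, as-is for the decoder) is my reading of the unrolling step; the class and attribute names are illustrative rather than the repo’s exact API.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Deep autoencoder whose layers are initialized from a list of pretrained RBMs."""

    def __init__(self, rbms):
        super().__init__()
        encoders, decoders = [], []
        for rbm in rbms:
            # Encoder layer: RBM weights (transposed to nn.Linear's layout) and hidden bias.
            enc = nn.Linear(rbm.W.size(0), rbm.W.size(1))
            enc.weight.data = rbm.W.t().contiguous()
            enc.bias.data = rbm.h_bias.clone()
            encoders.append(enc)

            # Decoder layer: the "unrolled" copy, reusing the same weights with the visible bias.
            dec = nn.Linear(rbm.W.size(1), rbm.W.size(0))
            dec.weight.data = rbm.W.clone()
            dec.bias.data = rbm.v_bias.clone()
            decoders.append(dec)

        self.encoders = nn.ModuleList(encoders)
        self.decoders = nn.ModuleList(decoders[::-1])  # decode in reverse order

    def encode(self, x):
        for i, layer in enumerate(self.encoders):
            x = layer(x)
            if i < len(self.encoders) - 1:   # no sigmoid on the last (Gaussian) encoding layer
                x = torch.sigmoid(x)
        return x

    def decode(self, z):
        for layer in self.decoders:
            z = torch.sigmoid(layer(z))
        return z

    def forward(self, x):
        return self.decode(self.encode(x))
```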

We then pass the RBM models we trained to the deep autoencoder for initialization and use a typical PyTorch training loop to fine-tune the autoencoder. We use the mean-squared error (MSE) loss to measure reconstruction loss and the Adam optimizer to update the parameters.
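The fine-tuning loop could look roughly like this, assuming the models list and train_loader from the earlier snippets; epoch count and learning rate are illustrative.

```python
# Fine-tune the stacked autoencoder; assumes models and train_loader from earlier snippets.
autoencoder = DeepAutoencoder(models)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(10):
    total_loss = 0.0
    for batch, _ in train_loader:
        optimizer.zero_grad()
        reconstruction = autoencoder(batch)
        loss = criterion(reconstruction, batch)   # reconstruction vs. original input
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch}: MSE {total_loss / len(train_loader):.4f}")
```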

Results

After fine-tuning, our autoencoder model is able to create a very close reproduction, with an MSE loss of just 0.0303 after reducing the data to just two dimensions. The autoencoder seems to have learned a smoothed-out version of each digit, which is much better than the blurred reconstructed images we saw at the beginning of this article.

As a final test, let’s run the MNIST test dataset through our autoencoder’s encoder and plot the 2D representation. For comparison, we also show the 2D representation from running the commonly used Principal Component Analysis (PCA). Our deep autoencoder is able to separate the digits much more cleanly than PCA. Awesome!
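For reference, a comparison like that could be produced along these lines, using scikit-learn’s PCA and matplotlib; this plotting code is my own sketch, not the repo’s.

```python
# Sketch of the comparison plot; assumes test_data and the fine-tuned autoencoder from above.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

test_images = torch.stack([img for img, _ in test_data])
test_labels = torch.tensor([label for _, label in test_data]).numpy()

with torch.no_grad():
    codes = autoencoder.encode(test_images).numpy()          # 2D codes from the encoder
pca_codes = PCA(n_components=2).fit_transform(test_images.numpy())

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(codes[:, 0], codes[:, 1], c=test_labels, cmap="tab10", s=2)
axes[0].set_title("Deep autoencoder (2D code)")
axes[1].scatter(pca_codes[:, 0], pca_codes[:, 1], c=test_labels, cmap="tab10", s=2)
axes[1].set_title("PCA (2 components)")
plt.show()
```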

Hope you enjoyed learning about this neat technique and seeing examples of code that show how to implement it. You can check out this GitHub repo for the full code and a demo notebook. If you use what you read here to improve your own autoencoders, let me know how it goes!

References

[1] G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks (2006), Science

[2] Y. LeCun, C. Cortes, C. Burges, The MNIST Database (1998)

[3] A. Fischer and C. Igel, Training Restricted Boltzmann Machines: An Introduction (2014), Pattern Recognition

