Introduction To Autoencoders

A Brief Overview

Abhijit Roy
Towards Data Science
Dec 12, 2020


Autoencoders are neural-network-based models used for unsupervised learning to discover underlying correlations among data and represent the data in a smaller dimension. An autoencoder frames the unsupervised learning problem as a supervised one in order to train a neural network: the input itself is passed as the output. The input is squeezed down to a lower-dimensional encoded representation by an encoder network, and a decoder network then decodes the encoding to reconstruct the input.

The encoding produced by the encoder layer is a lower-dimensional representation of the data and captures several interesting, complex relationships among the data.

An Autoencoder has the following parts:

  1. Encoder: The encoder is the part of the network that takes in the input and produces a lower-dimensional encoding.
  2. Bottleneck: The lower-dimensional hidden layer where the encoding is produced. The bottleneck layer has a smaller number of nodes, and the number of nodes in the bottleneck layer gives the dimension of the encoding of the input.
  3. Decoder: The decoder takes in the encoding and reconstructs the input.
Image by author

The bottleneck layer is the lower-dimensional layer. In the diagram, we have the encoder and decoder neural networks; phi and theta are the parameters of the encoder and decoder respectively.

The target of this model is that the reconstructed output be equivalent to the input. To achieve this, we minimize a loss function called the reconstruction loss. Basically, the reconstruction loss is the error between the input and the reconstructed output. It is usually given by the mean squared error or the binary cross-entropy between the input and the reconstructed output; binary cross-entropy is used when the data is binary.
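Written out for an input x and its reconstruction, the two common choices are:

\mathcal{L}_{\text{MSE}}(x, \hat{x}) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{x}_i)^2

\mathcal{L}_{\text{BCE}}(x, \hat{x}) = -\frac{1}{n}\sum_{i=1}^{n}\big[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\big]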

Now that we have a basic understanding of autoencoders, let's look at a basic tradeoff we need to keep in mind while designing one. Remember that the reason for using an autoencoder is that we want to capture only the deep correlations and relationships among the data: we need a generalized lower-dimensional representation. That is why, if the features of the data are not correlated at all, it is hard for an autoencoder to represent the data in a lower dimension. If we use a very large number of nodes in the bottleneck layer, the encoding will have a large dimension, and the network might cheat and overfit by simply memorizing the input data; in that case we will not get the correct relationships in our encodings. On the other hand, if we use a shallow network with very few nodes, it will be very hard to capture all the relationships. So, we must be very careful when designing the network.

Now, a question may arise: why go for an autoencoder when we have methods like PCA for dimensionality reduction?

Well, here goes the explanation. PCA, or principal component analysis, tries to find lower-dimensional orthogonal hyperplanes that describe the original data by capturing the maximum possible variance and, consequently, the important correlations. The key point is that we are talking about finding a hyperplane, so the mapping is linear. But correlations are often non-linear, and those are not captured by PCA.

Source

As we can see in the above diagram, autoencoders can capture non-linear dependencies in the data and are therefore a better choice than PCA for dimensionality reduction when such non-linearities are present.

Let’s look at some of the applications of autoencoders:

  1. Anomaly detection: As we know, autoencoders create encodings that capture the relationships among the data. If we train an autoencoder on a particular dataset, the encoder and decoder parameters are trained to represent the relationships in that dataset as well as possible, so the model can reconstruct data of that kind very well. Data from that dataset therefore produces a small reconstruction error, while data of some other kind produces a large one. If we can choose a suitable cutoff on the error, we can build an anomaly detector (see the sketch after this list).
  2. Noise removal: If we pass noisy data as input and clean data as output and train an autoencoder on such pairs, the trained autoencoder can be highly useful for noise removal. This is because noise points usually have no correlations with the rest of the data. Since the autoencoder has to represent the data in the lowest possible dimension, the encodings keep only the important relationships and reject the random ones, so the decoded output is free of those extra relationships and hence of the noise.
  3. Generative models: Before GANs came into existence, autoencoders were used as generative models. One modified form of the autoencoder, the variational autoencoder, is used for generative purposes.
  4. Collaborative filtering: Collaborative filtering normally uses matrix factorization methods, but autoencoders can learn the dependencies and learn to predict the user-item matrix.
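As a rough illustration of the anomaly-detection idea, a minimal sketch could look like this; here `autoencoder` is assumed to be any trained Keras autoencoder, `x` a NumPy batch of inputs of the shape the model was trained on, and `threshold` a cutoff chosen on validation data.

```python
import numpy as np

def detect_anomalies(autoencoder, x, threshold):
    """Flag samples whose reconstruction error exceeds the chosen cutoff."""
    reconstructions = autoencoder.predict(x)
    # Per-sample mean squared error between input and reconstruction.
    errors = np.mean(np.square(x - reconstructions), axis=tuple(range(1, x.ndim)))
    return errors > threshold  # True marks a likely anomaly
```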

Types of Autoencoders

Several kinds of autoencoders have been developed to address these different tradeoffs. Let's look at some of them.

Undercomplete Autoencoder

The undercomplete autoencoder is the simplest autoencoder architecture. It relies on constraining the number of nodes in the hidden layers and the central bottleneck, which restricts the flow of information through the network. The idea is that if the flow of information is limited and the network still has to learn the encoding as well as possible, it will keep only the most important dependencies and reject the rest. Thus we obtain an encoding suited for the best reconstruction.

The loss function used is the normal reconstruction loss, i.e., MSE or binary cross-entropy. As we are restricting the flow of information through the bottleneck, there is little chance that the model memorizes the input and cheats.

Image by Author

The above diagram shows an undercomplete autoencoder. We can see that the hidden layers have fewer nodes.

Let's see how TensorFlow can be used to create an undercomplete autoencoder.

Simple Undercomplete autoencoder:
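A minimal sketch in TensorFlow/Keras, assuming 784-dimensional flattened inputs and a 32-dimensional bottleneck (the layer sizes and names here are illustrative choices):

```python
from tensorflow.keras import layers, Model

encoding_dim = 32  # size of the bottleneck, i.e. the encoding dimension

# Encoder: input -> bottleneck
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(encoding_dim, activation='relu')(inputs)
encoder = Model(inputs, encoded, name='encoder')

# Decoder: bottleneck -> reconstruction
latent_inputs = layers.Input(shape=(encoding_dim,))
decoded = layers.Dense(784, activation='sigmoid')(latent_inputs)
decoder = Model(latent_inputs, decoded, name='decoder')

# Full autoencoder: input -> bottleneck -> reconstruction
autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder')
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Training uses the same array as input and target, e.g.:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)
```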

The models created by the above code are:

Image by author

The first model is the decoder, the second is the full autoencoder, and the third is the encoder model. The bottleneck layer is where the encoded image is generated.

We train the full autoencoder and obtain the weights, which can then be used by the encoder and the decoder models.

If we send image encodings through the decoder, we will see that the images are reconstructed.

Image by author

The upper row shows the original images and the lower row the images recreated from the encodings by the decoder.

Now, the images have dimensions 28x28, and we have created encodings of dimension 32. If we reshape the encodings to 16x2, they look something like this:

Image by author

The lower row represents the corresponding encodings.

As we can see, we have built a very shallow network here. We can build a deep network instead, since shallow networks may not be able to uncover all the underlying features, but we need to be very careful about restricting the number of hidden nodes.

The same model can also be implemented as a deep undercomplete network.
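A sketch of such a deeper undercomplete autoencoder; the number of hidden layers and their widths (128, 64, 32) are illustrative choices:

```python
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(784,))
x = layers.Dense(128, activation='relu')(inputs)
x = layers.Dense(64, activation='relu')(x)
bottleneck = layers.Dense(32, activation='relu')(x)   # encoding layer
x = layers.Dense(64, activation='relu')(bottleneck)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(784, activation='sigmoid')(x)

deep_autoencoder = Model(inputs, outputs, name='deep_autoencoder')
deep_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
```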

Sparse Autoencoders

When we were talking about undercomplete autoencoders, we said that we restrict the number of nodes in the hidden layer to restrict the data flow. But this approach often creates issues, because the limit on hidden-layer nodes and the shallower network prevent the neural network from uncovering complex relationships among the data items. So, we need deeper networks with more hidden-layer nodes. But with more hidden-layer nodes, the network may just memorize the input and overfit, which would defeat our purpose. To solve this, we use regularizers, which prevent the network from overfitting to the input data and avoid the memorization problem.

During regularization we normally regularize the weights, but in this case we regularize the activations that are passed from one hidden layer to the next. In simpler words, the idea is that we won't let all the nodes in the hidden layers fire. Going back to the basics of neural networks, an activation controls how much information a particular node passes on; it works like a gate. If the activation for a particular node is 0, then the node does not contribute any information. The idea of sparse autoencoders builds on this.

Now, one thing to note is that the activations depend on the input data and will change as the input changes. So, we let the model decide the activations and penalize their values. We usually do this in one of two ways:

L1 Regularization: An L1 regularizer restricts the activations as discussed above. It forces the network to use only the hidden-layer nodes that carry a high amount of information and to block the rest.

It is given by:
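\mathcal{L}_{\text{sparse}} = L(x, \hat{x}) + \lambda \sum_{i} \left| a_i^{(h)} \right|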

The reconstruction loss is given by L, and the second part is the regularizer that penalizes the activations. As we can see, the regularizer part is a summation over the activations of all nodes in the hidden layer h, so when we minimize the loss function we also push the activations down. The tuning parameter lambda controls how much attention we pay to the regularization term.

KL Divergence: The Kullback-Leibler divergence is a way to measure the difference, or similarity, between two probability distributions. It is given by:
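D_{KL}(p \,\|\, q) = \sum_{i} p(i) \log \frac{p(i)}{q(i)}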

So, basically, it tells us how similar p and q are. This method uses a sparsity parameter ρ (rho), which is the desired average activation of a neuron over a set of samples. The idea is to choose a very low value of ρ so that each neuron keeps a low average activation; to achieve that, the node will have activations close to 0 for the samples in the collection for which it is not essential.

Now, the question is: how does the KL divergence help? For this, we need to know what a Bernoulli distribution is.

In probability theory and statistics, the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p.

So, basically, it is a binary probability distribution, and we want something similar for our nodes: we want each node to fire with some probability, so its distribution can be modeled as a Bernoulli distribution. Now, for a particular neuron j, we can calculate its average activation ρ̂_j as:
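\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(h)}\!\left(x^{(i)}\right)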

where m is the number of observations and a_j is the activation of neuron j in the hidden layer h. The loss is given by:
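\mathcal{L}_{\text{sparse}} = L(x, \hat{x}) + \beta \sum_{j} D_{KL}\!\left(\rho \,\|\, \hat{\rho}_j\right) = L(x, \hat{x}) + \beta \sum_{j} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]

where β, like λ above, weighs how much attention we pay to the sparsity penalty.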

A visualization will look like this:

Image by author

The above image shows that the light-red nodes do not fire.

Let's see how TensorFlow can be used to create a sparse autoencoder.
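A minimal sketch, assuming the same 784-dimensional inputs as before; the bottleneck activations are penalized with an L1 activity regularizer, and the regularization strength 1e-5 is an illustrative value:

```python
from tensorflow.keras import layers, regularizers, Model

inputs = layers.Input(shape=(784,))
# Penalizing the activations (not the weights) pushes most bottleneck
# nodes towards zero, so only the informative ones stay active.
encoded = layers.Dense(32, activation='relu',
                       activity_regularizer=regularizers.l1(1e-5))(inputs)
decoded = layers.Dense(784, activation='sigmoid')(encoded)

sparse_autoencoder = Model(inputs, decoded, name='sparse_autoencoder')
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
```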

The above code uses an L1 regularizer.

Image by author

The images represent the full autoencoder, followed by the encoder and the decoder.

Denoising Autoencoders

We have already talked about autoencoders being used as noise removers. The idea is that, in order to capture the underlying relationships and represent them in a small encoding, the autoencoder looks only at the object in the image and not at the noise, which is thereby eliminated. Here we do not need to restrict the number of nodes or use a regularizer, because the input and the output are different, so the memorization problem no longer exists.

Image by author

The above diagram represents a denoising autoencoder.

Code along the following lines can be used to create the autoencoder.
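This is a sketch of a convolutional denoising autoencoder for 28x28 grayscale images; the layer widths, noise level, and training settings are illustrative assumptions, not necessarily the exact configuration behind the results shown below.

```python
import numpy as np
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D(2, padding='same')(x)
x = layers.Conv2D(16, 3, activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D(2, padding='same')(x)          # 7x7x16 bottleneck

x = layers.Conv2D(16, 3, activation='relu', padding='same')(encoded)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
x = layers.UpSampling2D(2)(x)
outputs = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)

denoiser = Model(inputs, outputs, name='denoising_autoencoder')
denoiser.compile(optimizer='adam', loss='binary_crossentropy')

# Training pairs: noisy images as input, clean images as target.
# x_train is assumed to be scaled to [0, 1] with shape (N, 28, 28, 1).
# x_train_noisy = np.clip(x_train + 0.3 * np.random.normal(size=x_train.shape), 0.0, 1.0)
# denoiser.fit(x_train_noisy, x_train, epochs=10, batch_size=128)
```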

The results on datasets are as follows:

Image by author

The above are the results on the Fashion-MNIST dataset.

As you can see, I have used a convolutional network to create the autoencoder.

The model structure is as below:

Image by author

Contractive Autoencoders

The principle behind contractive autoencoders is quite similar to that of denoising autoencoders. The idea is that the encodings produced for similar inputs should be similar: if we change or tweak the inputs just a little, the encodings should remain nearly the same and show very little change. They are used for feature extraction.

Autoencoders can be implemented with any kind of neural network; for image data we can use convolutional neural nets, and for time-series data we can use recurrent neural nets.

There exists another type of autoencoder, a bit different from the ones above, called the variational autoencoder.

Variational Autoencoders

To understand the concept, we need to go back to basics. What do we actually mean by a lower-dimensional encoding? As we have seen above, our input had dimension 784x1, or 28x28; when we encode it to a much smaller 32x1 dimension, we basically mean that we now have 32 features which are the most important ones and capture most of the information in the data, or image.

So, take a face as an example. A face image of, say, 32x32 dimensions holds the full two-dimensional facial image. Now, if we encode it to a 6x1 dimension, i.e., send it through a bottleneck layer of 6 nodes, we basically get 6 features which carry most of the information about the facial image. Say the 6 features are smile, skin tone, gender, beard, wears glasses, and hair color. Our encoding then has a numerical value for each of these features for a particular facial image, and by sending the encoding through a decoder we can reconstruct the image.

The 6 features we talked about in the lower-dimensional encoding are called latent features or attributes, and the set of values a feature can take is its latent space.

Source

Now, different values of the latent attributes represent different images as the feature varies as shown below.

Source

So far, we have treated the values of the latent attributes as single fixed values. This is where variational autoencoders are different: instead of passing discrete values, they represent each latent attribute as a probability distribution, something like the one shown below.

Source

So, our face example now looks as follows:

Source

Now, we can see that each latent attribute is passed as a probability distribution. The decoder samples from each latent distribution and decodes the samples to reconstruct the image. Because the sampling is random, the reconstructed image is similar to the inputs but is not actually present in the input set. This is the reason variational autoencoders are known as generative models.

Source

The above image illustrates the situation. Now, to create a distribution for each latent attribute, the encoder, instead of passing a single value, passes the mean and standard deviation of the distribution, which are used to construct a normal distribution.

Source

The above image shows the structure of a variational autoencoder. The probabilistic encoder is called the recognition model and the decoder is called the generative model.

Now, as the latent values z are sampled randomly, they are unknown and hence called hidden variables. Again, we know the goal is for the reconstructed output to be equivalent to the input. So, our goal is to find the probability of a latent vector z given the input x, that is, P(z|x), because we actually need to reconstruct x from z. In simpler words, we can see x but we need to estimate z.
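p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}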

We obtain the above equation using Bayes' theorem. This method requires finding p(x), which is given by:
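p(x) = \int p(x \mid z)\, p(z)\, dz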

This problem is intractable, i.e., it cannot be solved in a reasonable time, because it is a multiple-integral problem and the number of integrals increases with the number of latent attributes, i.e., the encoding dimension.

To solve this problem, we use another distribution q(z|x), which is an approximation of p(z|x) and is designed to be tractable. Then, to make sure that q(z|x) is similar to p(z|x), we use the KL divergence between the two distributions.

The variational autoencoder uses a loss function of the form:
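\mathcal{L}(x) = -\mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right] + D_{KL}\!\left(q(z \mid x)\,\|\,p(z)\right)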

The first term is the reconstruction error and the second term is the KL divergence between the two distributions; minimizing the loss minimizes this divergence and keeps the distributions similar. To sample latent points from the distributions in a way that does not block backpropagation, we use a trick called the reparameterization trick.
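A minimal sketch of the reparameterization trick as a Keras layer (the class name Sampling is an illustrative choice): instead of sampling z directly, we sample a standard-normal epsilon and compute z = mean + std * epsilon, so gradients can flow through the mean and log-variance produced by the encoder.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Sample z = mean + std * epsilon, keeping the graph differentiable."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Usage (hypothetical names): z_mean and z_log_var would be two Dense outputs
# of the encoder, and the latent point is z = Sampling()([z_mean, z_log_var]).
```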

Conclusion

We have seen the main types of autoencoders and their uses.

The codes are available here.

I hope this helps.
