Build the right Autoencoder — Tune and Optimize using PCA principles. Part I

Here we will learn the desired properties of an Autoencoder derived from its similarity with PCA. From these, we will build custom constraints in Part II for Autoencoder tuning and optimization.

Chitta Ranjan
Towards Data Science

--


The availability of Deep Learning APIs, such as Keras and TensorFlow, has made model building and experimentation extremely easy. However, a lack of clear understanding of the fundamentals may put us in a directionless race to the best model. Reaching the best model in such a race is left to chance.

Here we will develop an understanding of the fundamental properties required in an Autoencoder. This will provide a well-directed approach for Autoencoder tuning and optimization. In Part I, we will focus on learning the properties and their benefits. In Part II, we will develop custom layers and constraints to incorporate the properties.

The primary concept that we will learn here, and that will enable us to construct a right Autoencoder, is that Autoencoders are directly related to Principal Component Analysis (PCA). A “right” Autoencoder mathematically means a well-posed Autoencoder. A well-posed model is easier to tune and optimize.

Autoencoder vis-à-vis PCA,

  • A linearly activated Autoencoder approximates PCA. Mathematically, minimizing the reconstruction error in PCA modeling is the same objective as training a single-layer linear Autoencoder (the two objectives are sketched after this list).
  • An Autoencoder extends PCA to a nonlinear space. In other words, Autoencoders are a nonlinear extension of PCA.
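
In symbols, here is a sketch of the two estimation objectives (ignoring bias terms and assuming centered data; W is the k x p Encoder weight matrix and W' is the p x k Decoder weight matrix):

    Autoencoder:  minimize over W, W' of   Σ_i || x_i − W' g(W x_i) ||²
    PCA:          minimize over W, subject to W W^T = I, of   Σ_i || x_i − W^T W x_i ||²

With a linear activation g and tied, orthonormal weights (W' = W^T, W W^T = I), the two objectives coincide. This is why a linear single-layer Autoencoder recovers the same subspace as the top k Principal Components.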

Therefore, an Autoencoder should ideally have the properties of PCA. These properties are,

  • Tied Weights: equal weights on the Encoder and the corresponding Decoder layer (clarified with Figure 1 in the next section).
  • Orthogonal weights: each weight vector is orthogonal to the others.
  • Uncorrelated features: the outputs of the encoding layer are uncorrelated.
  • Unit Norm: each weight vector on a layer has unit norm.

However, Autoencoders as explained in most tutorials, e.g. Building Autoencoders in Keras [1], do not have these properties, and the lack of them makes the models sub-optimal.

Therefore, it is important to incorporate these properties for a well-posed Autoencoder. By incorporating them, we will also

  • Have regularization. The Orthogonality and Unit Norm constraints act as regularization. Additionally, Tied Weights, as we will see later, reduce the number of network parameters by almost half — another type of regularization.
  • Address exploding and vanishing gradients. The Unit Norm constraint prevents weights from becoming large, and hence resolves the exploding gradient problem. Additionally, due to the Orthogonality constraint, only important/informative weights are non-zero. Therefore, sufficient information flows through these non-zero weights during back-propagation, which avoids vanishing gradients.
  • Have a smaller network. Without orthogonality, the encoder has redundant weights and features. To compensate for the redundancy, the encoder size is increased. On the contrary, orthogonality ensures each encoded feature carries a piece of unique information, independent of the other features. This obviates the redundancy, and we can encode the same amount of information with a smaller encoder (layer).

With a smaller network, we bring Autoencoders closer to Edge Computing.

This article will elucidate the above concept by showing the

  1. architectural similarity between PCA and Autoencoder, and
  2. suboptimality of the conventional Autoencoder.

The article will be continued in Part II with detailed steps to optimize an Autoencoder. In Part II, we find that the optimizations improved the Autoencoder reconstruction error by more than 50%.

This article assumes the reader has a basic understanding of PCA. If unfamiliar, please refer to Understanding PCA [2].

Architectural similarity between PCA and Autoencoder

Figure 1. Single layer Autoencoder vis-à-vis PCA.

For simplicity, we compare a linear single layer Autoencoder with PCA. There are multiple algorithms for PCA modeling. One of them is estimation by minimizing reconstruction error (see [3]). Following this algorithm gives a clearer understanding of the similarities between a PCA and an Autoencoder.

Figure 1 visualizes a single-layer linear Autoencoder. As shown in the bottom of the figure, the Encoding process is similar to the PC transformation: projecting the original data on the Principal Components yields orthogonal features, called Principal Scores. Similarly, the Decoding process is similar to reconstructing the data from the Principal Scores. In both the Autoencoder and PCA, the model weights can be estimated by minimizing the reconstruction error.
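
To make the analogy concrete, below is a minimal numpy sketch of this projection/reconstruction view of PCA. The toy data, the variable names, and the use of sklearn's PCA are illustrative assumptions, not part of the original experiment.

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with p = 5 features, centered as PCA requires.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(Xc)
W = pca.components_            # k x p matrix, rows are eigenvectors (the "Encoder weights")

scores = Xc.dot(W.T)           # Encoding: projection on the PCs (Principal Scores)
X_hat = scores.dot(W)          # Decoding: reconstruction from the Principal Scores

print(np.allclose(scores, pca.transform(Xc)))       # same scores as sklearn's transform
print('Reconstruction MSE:', np.mean((Xc - X_hat) ** 2))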

In the following, we will further elaborate Figure 1 by showcasing the key Autoencoder components and their equivalent in PCA.

Suppose we have data with p features.

Input layer — data sample.

  • In the Autoencoder, the data is inputted using an Input layer of size p.
  • In PCA, the data is inputted as samples.

Encoding — the projection of data on Principal Components.

  • The size of the encoding layer is k. In PCA, k denotes the number of selected Principal Components (PCs).
  • In both cases, we have k < p for dimension reduction. Choosing k ≥ p leads to an over-representative model and, consequently, (close to) zero reconstruction error.
  • A colored cell in the Encoding layer in Figure 1 is a computing node with p weights, denoted as,

    w_i = (w_i1, w_i2, …, w_ip),  i = 1, …, k.  (Eq. 1)

That is, for each Encoding node in 1, …, k we have a p-dimensional weight vector. This is equivalent to an eigenvector in PCA.

  • The output of the Encoding layer in an Autoencoder is,

    z = g(Wx),  (Eq. 2)

where x is the input, W is the k x p weight matrix whose rows are the vectors w_i, and g is the activation function. g(Wx) is the output of the Encoding layer. If the activation is linear, this is equivalent to the Principal Scores in PCA,

    s = Wx.  (Eq. 3)

Decoding — reconstruction of data from the Principal Scores.

  • The size of the decoding layer in Autoencoder and in PCA reconstruction must be the size of the input data, p.
  • In a Decoder, the data is reconstructed from the encodings as,

    x̂ = W'g(Wx),  (Eq. 4)

and similarly, in PCA, it is reconstructed as,

    x̂ = W^T(Wx).  (Eq. 5)

Note that we have W' in Eq. 4 and W in Eq. 5. This is because the weights on the Encoder and Decoder are not the same by default. The Decoder and PCA reconstructions will be the same if the Encoder and Decoder weights are tied, i.e.

    W' = W^T.  (Eq. 6)

The multiple colors in the Decoder cells in Figure 1 indicate that weights from different cells of the Encoder appear in the same cell of the Decoder.

This brings us to the mathematical comparisons between Autoencoder and PCA.

Mathematically a Linear Autoencoder will be similar to PCA if,

  • Tied Weights: in a general multilayer Autoencoder, the weight on layer l in the Encoder module is equal to the transpose of the weight on the l-th layer from the end in the Decoder. For the single-layer model,

    W' = W^T.  (Eq. 7a)

  • Orthogonal weights: the weights on the Encoding layer are orthogonal,

    WW^T = I.  (Eq. 7b)

  The same orthogonality constraint can be enforced on intermediate Encoder layers for regularization.
  • Uncorrelated features: the outputs of PCA, i.e. the Principal Scores, are uncorrelated. Therefore, the covariance of the Encoder output should be diagonal,

    Cov(g(Wx)) = D, where D is a diagonal matrix.  (Eq. 7c)

  • Unit Norm: an eigenvector in PCA is constrained to have unit norm. Without this constraint, we will not get a proper solution because the variance of the projection can become arbitrarily large simply by increasing the norm of the vector. For the same reason, the weights on the Encoding layer should have unit norm,

    ||w_i|| = 1,  i = 1, …, k.  (Eq. 7d)

  This constraint should also be applied on other intermediate layers for regularization. A numeric check of all four conditions is sketched below.
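
The four conditions can be verified numerically for any fitted single-layer Autoencoder. The following is a minimal sketch; the function name check_pca_properties, the argument convention (W is the k x p Encoder weight matrix, W_prime the p x k Decoder weight matrix, Z the n x k encoded features), and the tolerance atol are illustrative assumptions.

import numpy as np

def check_pca_properties(W, W_prime, Z, atol=1e-2):
    """Check Eq. 7a-d numerically for a single-layer linear Autoencoder.
    W       : k x p Encoder weight matrix (rows are the weight vectors w_i)
    W_prime : p x k Decoder weight matrix
    Z       : n x k matrix of Encoder outputs (encoded features)
    """
    cov_z = np.cov(Z.T)
    return {
        # Eq. 7a - Tied weights: W' should equal the transpose of W
        'tied_weights (7a)': np.allclose(W_prime, W.T, atol=atol),
        # Eq. 7b - Orthogonal weights: W W^T should be the identity
        'orthogonal_weights (7b)': np.allclose(np.dot(W, W.T), np.eye(W.shape[0]), atol=atol),
        # Eq. 7c - Uncorrelated features: off-diagonal covariances should be ~0
        'uncorrelated_features (7c)': np.allclose(cov_z - np.diag(np.diag(cov_z)), 0, atol=atol),
        # Eq. 7d - Unit norm: each weight vector should have norm 1
        'unit_norm (7d)': np.allclose(np.sum(W ** 2, axis=1), 1, atol=atol),
    }

PCA passes all four checks by construction, whereas the conventional Autoencoder fitted in the next section fails each of them.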

Suboptimality of a regular unconstrained Autoencoder

Here we will implement PCA and a typical unconstrained Autoencoder on a random dataset. We will show that their outputs differ in every aspect discussed above, which results in a suboptimal Autoencoder. After this discussion, we will show how we can constrain an Autoencoder for proper estimation (Part II).

Complete code is available here.

Load libraries

from numpy.random import seed
seed(123)
from tensorflow import set_random_seed  # TensorFlow 1.x API; in TF 2.x use tf.random.set_seed
set_random_seed(234)

import sklearn
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import decomposition
import scipy

import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense, Layer, InputSpec
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers, activations, initializers, constraints, Sequential
from keras import backend as K
from keras.constraints import UnitNorm, Constraint

Generate random data

We generate multivariate correlated normal data. The steps for data generation are elaborated in the GitHub repository.

n_dim = 5
cov = sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
mu = np.random.normal(0, 0.1, n_dim)
n = 1000
X = np.random.multivariate_normal(mu, cov, n)

X_train, X_test = train_test_split(X, test_size=0.5, random_state=123)

# Data Preprocessing
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

The test dataset will be used in Part II to compare the Autoencoder reconstruction accuracy.

Autoencoder and PCA models

We fit a single layer linear Autoencoder with encoding dimension as two. We also fit PCA with two components.

# Fit Autoencoder
nb_epoch = 100
batch_size = 16
input_dim = X_train_scaled.shape[1] #num of predictor variables,
encoding_dim = 2
learning_rate = 1e-3

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias = True)
decoder = Dense(input_dim, activation="linear", use_bias = True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

# Fit PCA
pca = decomposition.PCA(n_components=2)
pca.fit(X_train_scaled)

Figure 2. Structure of the single-layer Autoencoder.

Visually, the Encoder-Decoder structure developed here is shown in Figure 3 below. The figure helps understand how the Weight matrices are aligned.

Figure 3. A simple linear Autoencoder to encode a 5-dimensional data into 2-dimensional features.

To follow the PCA properties, the Autoencoder in Figure 3 should satisfy the conditions in Eq. 7a-d. Below, we will show that this conventional Autoencoder does not meet any of them.

1. Tied Weights

As we can see below, the weights on Encoder and Decoder are different.

w_encoder = np.round(autoencoder.layers[0].get_weights()[0], 2).T  # W in Figure 3.
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 2) # W' in Figure 3.
print('Encoder weights \n', w_encoder)
print('Decoder weights \n', w_decoder)

2. Weight Orthogonality

As shown below, unlike PCA weights (i.e. the eigenvectors), the weights on Encoder and Decoder are not orthogonal.

w_pca = pca.components_
np.round(np.dot(w_pca, w_pca.T), 3)
np.round(np.dot(w_encoder, w_encoder.T), 3)
np.round(np.dot(w_decoder, w_decoder.T), 3)

3. Features Correlation

In PCA, the features are uncorrelated.

pca_features = pca.fit_transform(X_train_scaled)
np.round(np.cov(pca_features.T), 5)

But the Encoded features are correlated.

encoder_layer = Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[0].output)
encoded_features = np.array(encoder_layer.predict(X_train_scaled))
print('Encoded feature covariance\n', np.cov(encoded_features.T))

Weight non-orthogonality and feature correlations are undesirable because they bring redundancy into the information contained within the Encoded features.

4. Unit Norm

The norm of each PCA weight vector (eigenvector) is 1. This is a constraint applied in PCA estimation to yield a proper estimate. As shown below, the Encoder and Decoder weight norms deviate from 1.

print('PCA weights norm, \n', np.sum(w_pca ** 2, axis = 1))
print('Encoder weights norm, \n', np.sum(w_encoder ** 2, axis = 1))
print('Decoder weights norm, \n', np.sum(w_decoder ** 2, axis = 1))
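
As a compact summary, the sketch below applies the four checks from Eq. 7a-d to the fitted Autoencoder using the raw Keras kernels. The variable names and the tolerance atol are illustrative choices, not part of the original code.

# Numeric summary of Eq. 7a-d for the fitted Autoencoder.
atol = 1e-2
K_enc = autoencoder.layers[0].get_weights()[0]   # p x k Encoder kernel
K_dec = autoencoder.layers[1].get_weights()[0]   # k x p Decoder kernel
W = K_enc.T                                      # k x p, rows are the weight vectors w_i
cov_z = np.cov(encoded_features.T)

print('7a Tied weights:         ', np.allclose(K_dec, K_enc.T, atol=atol))
print('7b Orthogonal weights:   ', np.allclose(np.dot(W, W.T), np.eye(encoding_dim), atol=atol))
print('7c Uncorrelated features:', np.allclose(cov_z - np.diag(np.diag(cov_z)), 0, atol=atol))
print('7d Unit norm:            ', np.allclose(np.sum(W ** 2, axis=1), 1, atol=atol))

For this unconstrained model, all four checks typically come out False, consistent with the comparisons above.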

GitHub Repository

The complete code is available here.

Conclusion

As shown above, a conventional Autoencoder model is ill-posed. An ill-posed model does not have robust estimates. This adversely affects its test accuracy, i.e. the reconstruction error on new data.

Several recent research advancements are building and utilizing orthogonality conditions to improve Deep Learning model performance. Refer to [4] and [5] for some research directions.

In the sequel, Part II, we will implement custom constraints to incorporate the above-mentioned properties derived from PCA into Autoencoders. We will see that adding the constraints improves the test reconstruction error.

Go to the sequel, Part II.
