Build the right Autoencoder — Tune and Optimize using PCA principles. Part II

Continuing from Part I, here we define and implement custom constraints for building a well-posed Autoencoder: a regularized model that improves the test reconstruction error.

Chitta Ranjan
Towards Data Science


Go to the prequel, Part I.

In Part I, we learned that PCA and Autoencoders share architectural similarities, but that despite this, an Autoencoder by itself does not have PCA properties, e.g. orthogonality. We also saw that incorporating the PCA properties brings significant benefits to an Autoencoder, such as mitigating vanishing and exploding gradients and reducing overfitting through regularization.

Based on this, the properties we would like Autoencoders to inherit are:

  1. Tied weights,
  2. Orthogonal weights,
  3. Uncorrelated features, and
  4. Unit Norm.

In this article, we will

  • implement a custom layer and custom constraints to incorporate these properties, and
  • demonstrate how they work and the improvement in reconstruction error that they bring.

These implementations will enable constructing a well-posed Autoencoder and optimizing it. In our example, the optimizations improved the reconstruction error by more than 50%.

Note: regularization techniques, such as dropout, are popularly used. But without a well-posed model, these approaches take longer to optimize.

The following sections show the implementation in detail. The reader can skip to the Key Takeaways section for a brief summary.

A well-posed Autoencoder

Figure 1. Apply constraints for a well-posed Autoencoder.

We will develop an Autoencoder for a randomly generated dataset with five features. We divide the dataset into train and test. As we add constraints, we will evaluate the performance with test data reconstruction error.

This article contains the implementation details to help a practitioner try a variety of choices. The complete code is present here.

Import libraries

from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(234)
import sklearn
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import decomposition
import scipy
import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense, Layer, InputSpec
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers, activations, initializers, constraints, Sequential
from keras import backend as K
from keras.constraints import UnitNorm, Constraint

Generate and prepare Data

n_dim = 5
cov = sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
mu = np.random.normal(0, 0.1, n_dim)

n = 1000
X = np.random.multivariate_normal(mu, cov, n)

X_train, X_test = train_test_split(X, test_size=0.5, random_state=123)

# Scale the data between 0 and 1.
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Estimation Parameters

nb_epoch = 100
batch_size = 16
input_dim = X_train_scaled.shape[1]  # number of predictor variables
encoding_dim = 2
learning_rate = 1e-3

Baseline Model

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)
Figure 2.1. Baseline Model Parameters.

Baseline reconstruction error

train_predictions = autoencoder.predict(X_train_scaled)
print('Train reconstruction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions))
test_predictions = autoencoder.predict(X_test_scaled)
print('Test reconstruction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions))
Figure 2.2. Baseline Autoencoder Reconstruction Error.

Autoencoder Optimization

Keras provides a variety of layers and constraints. A UnitNorm constraint is available out of the box. For the other properties, we will build a custom layer and custom constraints. Since each experiment below repeats the same compile, train, and evaluate cycle, an optional helper is sketched next to reduce repetition.
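The helper below is a hypothetical convenience wrapper, not part of the original notebook; the name fit_and_evaluate and its defaults are our own choices. It simply bundles the compile-train-evaluate steps used in every experiment that follows.

# Hypothetical helper (not in the original notebook) wrapping the repeated
# compile-train-evaluate cycle used in the experiments below.
def fit_and_evaluate(encoder, decoder, X_train_scaled, X_test_scaled,
                     nb_epoch=100, batch_size=16):
    autoencoder = Sequential()
    autoencoder.add(encoder)
    autoencoder.add(decoder)
    autoencoder.compile(metrics=['accuracy'],
                        loss='mean_squared_error',
                        optimizer='sgd')
    autoencoder.fit(X_train_scaled, X_train_scaled,
                    epochs=nb_epoch,
                    batch_size=batch_size,
                    shuffle=True,
                    verbose=0)
    # Reconstruction error = MSE between the input and its reconstruction.
    train_mse = sklearn.metrics.mean_squared_error(
        X_train_scaled, autoencoder.predict(X_train_scaled))
    test_mse = sklearn.metrics.mean_squared_error(
        X_test_scaled, autoencoder.predict(X_test_scaled))
    return autoencoder, train_mse, test_mse

Usage would be, for example, autoencoder, train_mse, test_mse = fit_and_evaluate(encoder, decoder, X_train_scaled, X_test_scaled). In the sections below we keep the original, explicit code for clarity.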

  1. Custom Layer: Tied weights.

With this custom layer, we tie the encoder and decoder weights. Mathematically, the transpose of the decoder weights equals the encoder weights (Eq. 7a in Part I).

class DenseTied(Layer):
    def __init__(self, units,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 tied_to=None,
                 **kwargs):
        self.tied_to = tied_to
        if 'input_shape' not in kwargs and 'input_dim' in kwargs:
            kwargs['input_shape'] = (kwargs.pop('input_dim'),)
        super().__init__(**kwargs)
        self.units = units
        self.activation = activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)
        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)
        self.kernel_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)
        self.input_spec = InputSpec(min_ndim=2)
        self.supports_masking = True

    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        if self.tied_to is not None:
            self.kernel = K.transpose(self.tied_to.kernel)
            self._non_trainable_weights.append(self.kernel)
        else:
            self.kernel = self.add_weight(shape=(input_dim, self.units),
                                          initializer=self.kernel_initializer,
                                          name='kernel',
                                          regularizer=self.kernel_regularizer,
                                          constraint=self.kernel_constraint)
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,),
                                        initializer=self.bias_initializer,
                                        name='bias',
                                        regularizer=self.bias_regularizer,
                                        constraint=self.bias_constraint)
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True

    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)

    def call(self, inputs):
        output = K.dot(inputs, self.kernel)
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        if self.activation is not None:
            output = self.activation(output)
        return output

Autoencoder with Tied Decoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observations

1a. Equal weights.

w_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[0]), 3)
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 3)
print('Encoder weights\n', w_encoder)
print('Decoder weights\n', w_decoder)

1b. Biases are different.

b_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[1]), 3)
b_decoder = np.round(np.transpose(autoencoder.layers[1].get_weights()[0]), 3)
print('Encoder bias\n', b_encoder)
print('Decoder bias\n', b_decoder)

2. Custom Constraint: Weights Orthogonality.

class WeightsOrthogonalityConstraint(Constraint):
    def __init__(self, encoding_dim, weightage=1.0, axis=0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage
        self.axis = axis

    def weights_orthogonality(self, w):
        if self.axis == 1:
            w = K.transpose(w)
        if self.encoding_dim > 1:
            m = K.dot(K.transpose(w), w) - K.eye(self.encoding_dim)
            return self.weightage * K.sqrt(K.sum(K.square(m)))
        else:
            m = K.sum(w ** 2) - 1.
            return m

    def __call__(self, w):
        return self.weights_orthogonality(w)

Applying Orthogonality on both Encoder and Decoder Weights.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

2a. The weights are close to orthogonal for both Encoder and Decoder.

w_encoder = autoencoder.layers[0].get_weights()[0]
print('Encoder weights dot product\n', np.round(np.dot(w_encoder.T, w_encoder), 2))

w_decoder = autoencoder.layers[1].get_weights()[0]
print('Decoder weights dot product\n', np.round(np.dot(w_decoder, w_decoder.T), 2))

3. Custom Constraint: Uncorrelated Encoded features.

For uncorrelated features, we impose a penalty on the sum of the off-diagonal elements of the encoded-feature covariance matrix.

class UncorrelatedFeaturesConstraint(Constraint):

    def __init__(self, encoding_dim, weightage=1.0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage

    def get_covariance(self, x):
        x_centered_list = []

        for i in range(self.encoding_dim):
            x_centered_list.append(x[:, i] - K.mean(x[:, i]))

        x_centered = tf.stack(x_centered_list)
        covariance = K.dot(x_centered, K.transpose(x_centered)) / tf.cast(x_centered.get_shape()[0], tf.float32)

        return covariance

    # Constraint penalty
    def uncorrelated_feature(self, x):
        if self.encoding_dim <= 1:
            return 0.0
        else:
            output = K.sum(K.square(
                self.covariance - tf.math.multiply(self.covariance, K.eye(self.encoding_dim))))
            return output

    def __call__(self, x):
        self.covariance = self.get_covariance(x)
        return self.weightage * self.uncorrelated_feature(x)

Applying the constraint in the Autoencoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                activity_regularizer=UncorrelatedFeaturesConstraint(encoding_dim, weightage=1.))
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

3a. The encoded features are less correlated than in the baseline. This penalty is harder to enforce; a stronger constraint function can be explored.

encoder_layer = Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[0].output)
encoded_features = np.array(encoder_layer.predict(X_train_scaled))
print('Encoded feature covariance\n', np.round(np.cov(encoded_features.T), 3))

4. Constraint: Unit Norm.

The UnitNorm constraint is prebuilt in Keras. We apply it to both the Encoder and Decoder layers. Note that we use axis=0 for the Encoder layer and axis=1 for the Decoder layer.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_constraint=UnitNorm(axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_constraint=UnitNorm(axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

4a. The norms of the weights on the Encoder and Decoder along the encoding axis are:

w_encoder = np.round(autoencoder.layers[0].get_weights()[0], 2).T  # W in Figure 3.
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 2) # W' in Figure 3.
print('Encoder weights norm, \n', np.round(np.sum(w_encoder ** 2, axis = 1),3))
print('Decoder weights norm, \n', np.round(np.sum(w_decoder ** 2, axis = 1),3))

As also mentioned before, the norms are not exactly 1.0 because this is not a hard constraint.

Putting Everything Together

Here we will put together the above properties. Depending on the problem, a certain combination of these properties will work better than others.

Applying all the constraints together can sometimes hurt the estimation. In the dataset used here, combining Tied Weights, Weight Orthogonality, and UnitNorm worked best.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0),
                kernel_constraint=UnitNorm(axis=0))
decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias=False)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

train_predictions = autoencoder.predict(X_train_scaled)
print('Train reconstruction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions))
test_predictions = autoencoder.predict(X_test_scaled)
print('Test reconstruction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions))

GitHub Repository

The complete code, along with the steps and more details on model tuning, is available here.

Key takeaways

Improvement in Reconstruction Error

Table 1. Summary of reconstruction errors.
  • The reconstruction error on test data for the baseline model is 0.027.
  • Adding each property to the Autoencoder reduced the test error. The improvement ranges from 19% (Weight Orthogonality) to 67% (Tied Weights).
  • The improvements will vary from data to data.
  • In our problem, combining Tied Weights, Weight Orthogonality, and UnitNorm yielded the optimal model with the best reconstruction error.
  • Although the error of the optimal model is larger than that of the model with only Tied Weights, the optimal model is more stable and, hence, preferable.

Key Implementation Notes

Tied Weights

  • In the Tied Weights layer, DenseTied, the biases differ between the Encoder and Decoder.
  • To make all the weights exactly equal, set use_bias=False (see the sketch after this list).
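A minimal sketch of the second point, assuming the DenseTied layer defined earlier; dropping the biases is our illustrative choice here, so the only learned parameters are the tied kernels.

# Sketch: with use_bias=False on both layers, the decoder kernel is exactly
# the transpose of the encoder kernel; there are no bias terms to differ.
encoder = Dense(encoding_dim, activation="linear",
                input_shape=(input_dim,), use_bias=False)
decoder = DenseTied(input_dim, activation="linear",
                    tied_to=encoder, use_bias=False)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)
autoencoder.compile(loss='mean_squared_error', optimizer='sgd')
autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch, batch_size=batch_size,
                shuffle=True, verbose=0)

w_encoder = autoencoder.layers[0].get_weights()[0]
w_decoder = autoencoder.layers[1].get_weights()[0]
# With no biases, the tied kernels should match exactly (up to transpose).
print(np.allclose(w_encoder, w_decoder.T))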

Weight Orthogonality

  • kernel_regularizer is used to add constraints or regularization on the weights of a layer.
  • The orthogonality should be applied along rows, axis=0, for the Encoder and along columns, axis=1, for the Decoder.

Uncorrelated Encoded features

  • activity_regularizer is used to apply constraints on the output features of a layer. It is therefore used here to push the off-diagonal covariances of the encoded features toward zero.
  • This constraint is not strong; it does not push the off-diagonal covariance elements extremely close to zero.
  • Other customizations of this constraint can be explored; one possible variant is sketched after this list.
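One possible customization, sketched below under our own assumptions and not taken from the original article, is to penalize the off-diagonal elements of the correlation matrix instead of the covariance matrix, which removes the dependence on feature scale. The class name is hypothetical.

# Hypothetical variant of UncorrelatedFeaturesConstraint: penalize
# off-diagonal correlations (scale-free) instead of off-diagonal covariances.
class UncorrelatedFeaturesCorrelationPenalty(Constraint):
    def __init__(self, encoding_dim, weightage=1.0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage

    def __call__(self, x):
        if self.encoding_dim <= 1:
            return 0.0
        # Standardize each encoded feature to zero mean and unit variance.
        x_centered = x - K.mean(x, axis=0, keepdims=True)
        x_standardized = x_centered / (K.std(x, axis=0, keepdims=True) + K.epsilon())
        n = K.cast(K.shape(x)[0], K.floatx())
        correlation = K.dot(K.transpose(x_standardized), x_standardized) / n
        # Zero out the diagonal and penalize the remaining (off-diagonal) entries.
        off_diagonal = correlation - tf.math.multiply(correlation, K.eye(self.encoding_dim))
        return self.weightage * K.sum(K.square(off_diagonal))

It would be applied the same way as the original, e.g. activity_regularizer=UncorrelatedFeaturesCorrelationPenalty(encoding_dim, weightage=1.).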

Unit Norm

  • UnitNorm should be applied on different axes for the Encoder and Decoder.
  • Similar to Weight Orthogonality, it is applied on rows, axis=0, for the Encoder and on columns, axis=1, for the Decoder.

General

  • There are two base classes, Regularizer and Constraint, for building custom functions. For our application, where the objects are passed as kernel_regularizer and activity_regularizer, both behave the same: Keras simply calls the object to compute a penalty. We used the Constraint class for Weight Orthogonality and Uncorrelated Features; a Regularizer-based version is sketched after this list.
  • All three constraints (Unit Norm, Uncorrelated Encoded Features, and Weight Orthogonality) are soft constraints. Meaning, they bring the model weights and features close to the desired property, but not exactly. For example, the weights end up with nearly, but not exactly, unit norm.
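To illustrate the first point, here is a minimal sketch, under our own naming, of the same orthogonality penalty written against the Regularizer base class. Only the __call__ method matters when the object is passed as kernel_regularizer.

from keras.regularizers import Regularizer

# Sketch: the orthogonality penalty as a Regularizer subclass. When passed as
# kernel_regularizer, Keras only needs a callable returning a scalar penalty.
class WeightsOrthogonalityRegularizer(Regularizer):
    def __init__(self, encoding_dim, weightage=1.0, axis=0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage
        self.axis = axis

    def __call__(self, w):
        if self.axis == 1:
            w = K.transpose(w)
        m = K.dot(K.transpose(w), w) - K.eye(self.encoding_dim)
        return self.weightage * K.sqrt(K.sum(K.square(m)))

# Drop-in usage, equivalent to the Constraint-based version above:
# kernel_regularizer=WeightsOrthogonalityRegularizer(encoding_dim, weightage=1., axis=0)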

In practice,

  • The performance of incorporating these properties will differ across problems.
  • Explore each property individually under different settings, e.g. with and without bias, and different weightages for orthogonality and uncorrelated features constraints on different layers.
  • In addition to these, include popular regularization techniques, such as Dropout layers, in the Autoencoder (a sketch follows this list).
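For instance, a minimal sketch of adding a Dropout layer on the input of the constrained Autoencoder; the rate (0.2) and the placement are illustrative choices of ours, not recommendations from the article.

from keras.layers import Dropout

# Sketch: input Dropout combined with the orthogonality and unit-norm
# constraints. Dropout rate and placement are illustrative assumptions.
autoencoder = Sequential()
autoencoder.add(Dropout(0.2, input_shape=(input_dim,)))
autoencoder.add(Dense(encoding_dim, activation="linear", use_bias=True,
                      kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0),
                      kernel_constraint=UnitNorm(axis=0)))
autoencoder.add(Dense(input_dim, activation="linear", use_bias=True))
autoencoder.compile(loss='mean_squared_error', optimizer='sgd')
autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch, batch_size=batch_size,
                shuffle=True, verbose=0)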

Summary

  • In the prequel, Part I, we learned the important properties that Autoencoders should inherit from PCA.
  • Here, we implemented custom layers and constraints to incorporate those properties.
  • In Table 1, we showed that these properties significantly improve the test reconstruction error.
  • As mentioned in the Key Takeaways, trial-and-error is required to find the best settings. However, these trials move in directions that have interpretable meaning.

Go to the prequel, Part I.
