Build the right Autoencoder — Tune and Optimize using PCA principles. Part II

Continuing from Part I, here we define and implement custom constraints for building a well-posed Autoencoder: a regularized model that improves the test reconstruction error.

Chitta Ranjan
Towards Data Science


Go to the prequel, Part I.

In Part I, we learned that PCA and Autoencoders share architectural similarities, but that despite this, an Autoencoder by itself does not have PCA properties, e.g. orthogonality. We also saw that incorporating the PCA properties brings significant benefits to an Autoencoder, such as mitigating vanishing and exploding gradients and reducing overfitting through regularization.

Based on this, the properties we would like Autoencoders to inherit are:

  1. Tied weights,
  2. Orthogonal weights,
  3. Uncorrelated features, and
  4. Unit Norm.

In this article, we will

  • implement a custom layer and custom constraints to incorporate these properties, and
  • demonstrate how they work and the improvement in reconstruction error that they bring.

These implementations will enable constructing a well-posed Autoencoder and optimizing it. In our example, the optimizations improved the reconstruction error by more than 50%.

Note: regularization techniques, such as dropout, are popularly used. But without a well-posed model, these approaches take longer to optimize.

The following sections show the implementation in detail. The reader can skip to the Key Takeaways section for a brief summary.

A well-posed Autoencoder

Figure 1. Apply constraints for a well-posed Autoencoder.

We will develop an Autoencoder for a randomly generated dataset with five features. We divide the dataset into train and test. As we add constraints, we will evaluate the performance with test data reconstruction error.

This article contains the implementation details to help a practitioner try a variety of choices. The complete code is present here.

Import libraries

from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(234)
import sklearn
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import decomposition
import scipy
import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense, Layer, InputSpec
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers, activations, initializers, constraints, Sequential
from keras import backend as K
from keras.constraints import UnitNorm, Constraint

Generate and prepare Data

n_dim = 5
cov = sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
mu = np.random.normal(0, 0.1, n_dim)

n = 1000
X = np.random.multivariate_normal(mu, cov, n)

X_train, X_test = train_test_split(X, test_size=0.5, random_state=123)

# Scale the data between 0 and 1.
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Estimation Parameters

nb_epoch = 100
batch_size = 16
input_dim = X_train_scaled.shape[1]  # number of predictor variables
encoding_dim = 2
learning_rate = 1e-3

Baseline Model

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)
Figure 2.1. Baseline Model Parameters.

Baseline reconstruction error

train_predictions = autoencoder.predict(X_train_scaled)
print('Train reconstruction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions))
test_predictions = autoencoder.predict(X_test_scaled)
print('Test reconstruction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions))
Figure 2.2. Baseline Autoencoder Reconstruction Error.

Autoencoder Optimization

Keras provides a variety of layers and constraints. A UnitNorm constraint is available out of the box. For the other properties, we will build a custom layer and custom constraints. Since each experiment below repeats the same compile, train, and evaluate cycle, an optional helper is sketched next to reduce repetition.
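The helper below is a hypothetical convenience wrapper, not part of the original notebook; the name fit_and_evaluate and its defaults are our own choices. It simply bundles the compile-train-evaluate steps used in every experiment that follows.

# Hypothetical helper (not in the original notebook) wrapping the repeated
# compile-train-evaluate cycle used in the experiments below.
def fit_and_evaluate(encoder, decoder, X_train_scaled, X_test_scaled,
                     nb_epoch=100, batch_size=16):
    autoencoder = Sequential()
    autoencoder.add(encoder)
    autoencoder.add(decoder)
    autoencoder.compile(metrics=['accuracy'],
                        loss='mean_squared_error',
                        optimizer='sgd')
    autoencoder.fit(X_train_scaled, X_train_scaled,
                    epochs=nb_epoch,
                    batch_size=batch_size,
                    shuffle=True,
                    verbose=0)
    # Reconstruction error = MSE between the input and its reconstruction.
    train_mse = sklearn.metrics.mean_squared_error(
        X_train_scaled, autoencoder.predict(X_train_scaled))
    test_mse = sklearn.metrics.mean_squared_error(
        X_test_scaled, autoencoder.predict(X_test_scaled))
    return autoencoder, train_mse, test_mse

Usage would be, for example, autoencoder, train_mse, test_mse = fit_and_evaluate(encoder, decoder, X_train_scaled, X_test_scaled). In the sections below we keep the original, explicit code for clarity.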

  1. Custom Layer: Tied weights.

With this custom layer, we tie the encoder and decoder weights. Mathematically, the transpose of the decoder weights equals the encoder weights (Eq. 7a in Part I).

class DenseTied(Layer):
    def __init__(self, units,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 tied_to=None,
                 **kwargs):
        self.tied_to = tied_to
        if 'input_shape' not in kwargs and 'input_dim' in kwargs:
            kwargs['input_shape'] = (kwargs.pop('input_dim'),)
        super().__init__(**kwargs)
        self.units = units
        self.activation = activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)
        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)
        self.kernel_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)
        self.input_spec = InputSpec(min_ndim=2)
        self.supports_masking = True

    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        if self.tied_to is not None:
            self.kernel = K.transpose(self.tied_to.kernel)
            self._non_trainable_weights.append(self.kernel)
        else:
            self.kernel = self.add_weight(shape=(input_dim, self.units),
                                          initializer=self.kernel_initializer,
                                          name='kernel',
                                          regularizer=self.kernel_regularizer,
                                          constraint=self.kernel_constraint)
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,),
                                        initializer=self.bias_initializer,
                                        name='bias',
                                        regularizer=self.bias_regularizer,
                                        constraint=self.bias_constraint)
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True

    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)

    def call(self, inputs):
        output = K.dot(inputs, self.kernel)
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        if self.activation is not None:
            output = self.activation(output)
        return output

Autoencoder with Tied Decoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observations

1a. Equal weights.

w_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[0]), 3)
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 3)
print('Encoder weights\n', w_encoder)
print('Decoder weights\n', w_decoder)

1b. Biases are different.

b_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[1]), 3)
b_decoder = np.round(np.transpose(autoencoder.layers[1].get_weights()[0]), 3)
print('Encoder bias\n', b_encoder)
print('Decoder bias\n', b_decoder)

2. Custom Constraint: Weights Orthogonality.

class WeightsOrthogonalityConstraint(Constraint):
    def __init__(self, encoding_dim, weightage=1.0, axis=0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage
        self.axis = axis

    def weights_orthogonality(self, w):
        if self.axis == 1:
            w = K.transpose(w)
        if self.encoding_dim > 1:
            m = K.dot(K.transpose(w), w) - K.eye(self.encoding_dim)
            return self.weightage * K.sqrt(K.sum(K.square(m)))
        else:
            m = K.sum(w ** 2) - 1.
            return m

    def __call__(self, w):
        return self.weights_orthogonality(w)

Applying Orthogonality on both Encoder and Decoder Weights.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

2a. The weights are close to orthogonal for both Encoder and Decoder.

w_encoder = autoencoder.layers[0].get_weights()[0]
print('Encoder weights dot product\n', np.round(np.dot(w_encoder.T, w_encoder), 2))

w_decoder = autoencoder.layers[1].get_weights()[0]
print('Decoder weights dot product\n', np.round(np.dot(w_decoder, w_decoder.T), 2))

3. Custom Constraint: Uncorrelated Encoded features.

For uncorrelated features, we impose a penalty on the sum of the off-diagonal elements of the encoded-feature covariance matrix.

class UncorrelatedFeaturesConstraint(Constraint):

    def __init__(self, encoding_dim, weightage=1.0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage

    def get_covariance(self, x):
        x_centered_list = []

        for i in range(self.encoding_dim):
            x_centered_list.append(x[:, i] - K.mean(x[:, i]))

        x_centered = tf.stack(x_centered_list)
        covariance = K.dot(x_centered, K.transpose(x_centered)) / tf.cast(x_centered.get_shape()[0], tf.float32)

        return covariance

    # Constraint penalty
    def uncorrelated_feature(self, x):
        if self.encoding_dim <= 1:
            return 0.0
        else:
            output = K.sum(K.square(
                self.covariance - tf.math.multiply(self.covariance, K.eye(self.encoding_dim))))
            return output

    def __call__(self, x):
        self.covariance = self.get_covariance(x)
        return self.weightage * self.uncorrelated_feature(x)

Applying the constraint in the Autoencoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                activity_regularizer=UncorrelatedFeaturesConstraint(encoding_dim, weightage=1.))
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

3a. The encoded features are less correlated than in the baseline. This penalty is harder to enforce; a stronger constraint function can be explored.

encoder_layer = Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[0].output)
encoded_features = np.array(encoder_layer.predict(X_train_scaled))
print('Encoded feature covariance\n', np.round(np.cov(encoded_features.T), 3))

4. Constraint: Unit Norm.

The UnitNorm constraint is prebuilt in Keras. We apply it to both the Encoder and Decoder layers. Note that we use axis=0 for the Encoder layer and axis=1 for the Decoder layer.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_constraint=UnitNorm(axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_constraint=UnitNorm(axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

4a. The norms of the weights on the Encoder and Decoder along the encoding axis are:

w_encoder = np.round(autoencoder.layers[0].get_weights()[0], 2).T  # W in Figure 3.
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 2) # W' in Figure 3.
print('Encoder weights norm, \n', np.round(np.sum(w_encoder ** 2, axis = 1),3))
print('Decoder weights norm, \n', np.round(np.sum(w_decoder ** 2, axis = 1),3))

As also mentioned before, the norms are not exactly 1.0 because this is not a hard constraint.

Putting Everything Together

Here we will put together the above properties. Depending on the problem, a certain combination of these properties will work better than others.

Applying all the constraints together can sometimes hurt the estimation. In the dataset used here, combining Tied Weights, Weight Orthogonality, and UnitNorm worked best.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0),
                kernel_constraint=UnitNorm(axis=0))
decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias=False)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

train_predictions = autoencoder.predict(X_train_scaled)
print('Train reconstruction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions))
test_predictions = autoencoder.predict(X_test_scaled)
print('Test reconstruction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions))

GitHub Repository

The complete code, along with the steps and more details on model tuning, is available here.

Key takeaways

Improvement in Reconstruction Error

Table 1. Summary of reconstruction errors.
  • The reconstruction error on test data for the baseline model is 0.027.
  • Adding each property to the Autoencoder reduced the test error. The improvement ranges from 19% (Weight Orthogonality) to 67% (Tied Weights).
  • The improvements will vary from data to data.
  • In our problem, combining Tied Weights, Weight Orthogonality, and UnitNorm yielded the optimal model with the best reconstruction error.
  • Although the error of the optimal model is larger than that of the model with only Tied Weights, the optimal model is more stable and, hence, preferable.

Key Implementation Notes

Tied Weights

  • In the Tied Weights layer, DenseTied, the biases differ between the Encoder and Decoder.
  • To make all the weights exactly equal, set use_bias=False (see the sketch after this list).
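A minimal sketch of the second point, assuming the DenseTied layer defined earlier; dropping the biases is our illustrative choice here, so the only learned parameters are the tied kernels.

# Sketch: with use_bias=False on both layers, the decoder kernel is exactly
# the transpose of the encoder kernel; there are no bias terms to differ.
encoder = Dense(encoding_dim, activation="linear",
                input_shape=(input_dim,), use_bias=False)
decoder = DenseTied(input_dim, activation="linear",
                    tied_to=encoder, use_bias=False)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)
autoencoder.compile(loss='mean_squared_error', optimizer='sgd')
autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch, batch_size=batch_size,
                shuffle=True, verbose=0)

w_encoder = autoencoder.layers[0].get_weights()[0]
w_decoder = autoencoder.layers[1].get_weights()[0]
# With no biases, the tied kernels should match exactly (up to transpose).
print(np.allclose(w_encoder, w_decoder.T))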

Weight Orthogonality

  • kernel_regularizer is used to add constraints or regularization on the weights of a layer.
  • The orthogonality should be applied along rows, axis=0, for the Encoder and along columns, axis=1, for the Decoder.

Uncorrelated Encoded features

  • activity_regularizer is used to apply constraints on the output features of a layer. It is therefore used here to push the off-diagonal covariances of the encoded features toward zero.
  • This constraint is not strong; it does not push the off-diagonal covariance elements extremely close to zero.
  • Other customizations of this constraint can be explored; one possible variant is sketched after this list.
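One possible customization, sketched below under our own assumptions and not taken from the original article, is to penalize the off-diagonal elements of the correlation matrix instead of the covariance matrix, which removes the dependence on feature scale. The class name is hypothetical.

# Hypothetical variant of UncorrelatedFeaturesConstraint: penalize
# off-diagonal correlations (scale-free) instead of off-diagonal covariances.
class UncorrelatedFeaturesCorrelationPenalty(Constraint):
    def __init__(self, encoding_dim, weightage=1.0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage

    def __call__(self, x):
        if self.encoding_dim <= 1:
            return 0.0
        # Standardize each encoded feature to zero mean and unit variance.
        x_centered = x - K.mean(x, axis=0, keepdims=True)
        x_standardized = x_centered / (K.std(x, axis=0, keepdims=True) + K.epsilon())
        n = K.cast(K.shape(x)[0], K.floatx())
        correlation = K.dot(K.transpose(x_standardized), x_standardized) / n
        # Zero out the diagonal and penalize the remaining (off-diagonal) entries.
        off_diagonal = correlation - tf.math.multiply(correlation, K.eye(self.encoding_dim))
        return self.weightage * K.sum(K.square(off_diagonal))

It would be applied the same way as the original, e.g. activity_regularizer=UncorrelatedFeaturesCorrelationPenalty(encoding_dim, weightage=1.).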

Unit Norm

  • UnitNorm should be applied on different axes for the Encoder and Decoder.
  • Similar to Weight Orthogonality, it is applied on rows, axis=0, for the Encoder and on columns, axis=1, for the Decoder.

General

  • There are two base classes, Regularizer and Constraint, for building custom functions. For our application, where the objects are passed as kernel_regularizer and activity_regularizer, both behave the same: Keras simply calls the object to compute a penalty. We used the Constraint class for Weight Orthogonality and Uncorrelated Features; a Regularizer-based version is sketched after this list.
  • All three constraints (Unit Norm, Uncorrelated Encoded Features, and Weight Orthogonality) are soft constraints. Meaning, they bring the model weights and features close to the desired property, but not exactly. For example, the weights end up with nearly, but not exactly, unit norm.
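To illustrate the first point, here is a minimal sketch, under our own naming, of the same orthogonality penalty written against the Regularizer base class. Only the __call__ method matters when the object is passed as kernel_regularizer.

from keras.regularizers import Regularizer

# Sketch: the orthogonality penalty as a Regularizer subclass. When passed as
# kernel_regularizer, Keras only needs a callable returning a scalar penalty.
class WeightsOrthogonalityRegularizer(Regularizer):
    def __init__(self, encoding_dim, weightage=1.0, axis=0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage
        self.axis = axis

    def __call__(self, w):
        if self.axis == 1:
            w = K.transpose(w)
        m = K.dot(K.transpose(w), w) - K.eye(self.encoding_dim)
        return self.weightage * K.sqrt(K.sum(K.square(m)))

# Drop-in usage, equivalent to the Constraint-based version above:
# kernel_regularizer=WeightsOrthogonalityRegularizer(encoding_dim, weightage=1., axis=0)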

In practice,

  • The performance of incorporating these properties will differ across problems.
  • Explore each property individually under different settings, e.g. with and without bias, and different weightages for orthogonality and uncorrelated features constraints on different layers.
  • In addition to these, include popular regularization techniques, such as Dropout layers, in the Autoencoder (a sketch follows this list).
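For instance, a minimal sketch of adding a Dropout layer on the input of the constrained Autoencoder; the rate (0.2) and the placement are illustrative choices of ours, not recommendations from the article.

from keras.layers import Dropout

# Sketch: input Dropout combined with the orthogonality and unit-norm
# constraints. Dropout rate and placement are illustrative assumptions.
autoencoder = Sequential()
autoencoder.add(Dropout(0.2, input_shape=(input_dim,)))
autoencoder.add(Dense(encoding_dim, activation="linear", use_bias=True,
                      kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0),
                      kernel_constraint=UnitNorm(axis=0)))
autoencoder.add(Dense(input_dim, activation="linear", use_bias=True))
autoencoder.compile(loss='mean_squared_error', optimizer='sgd')
autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch, batch_size=batch_size,
                shuffle=True, verbose=0)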

Summary

  • In the prequel, Part I, we learned the important properties that Autoencoders should inherit from PCA.
  • Here, we implemented custom layers and constraints to incorporate those properties.
  • In Table 1, we showed that these properties significantly improve the test reconstruction error.
  • As mentioned in the Key Takeaways, trial-and-error is required to find the best settings. However, these trials move in directions that have interpretable meaning.

Go to the prequel, Part I.
