Stacked Autoencoders.

Extract important features from data using deep learning.

Rajas Bakshi
Towards Data Science


Photo by Mika Baumeister on Unsplash

Dimensionality reduction

While solving a data science problem, have you ever come across a dataset with hundreds of features, or perhaps a thousand? If not, you may not know how challenging it can be to develop an efficient model on such data. Dimensionality reduction, for those unfamiliar with it, is an approach for filtering the essential features out of the data.

Having more input features makes the task of predicting the dependent feature more challenging. A large number of features can sometimes cause the model to perform poorly. One reason is that the model may try to fit a relationship between the feature vector and the output vector that is very weak or nonexistent. There are various methods for reducing the dimensions of the data, and a comprehensive guide to them can be found at the link below.

Principal Component Analysis (PCA)

PCA is one of the most popular approaches to dimensionality reduction. PCA helps you find a smaller set of new features that carries most of the information in the data. These new features are called principal components. The first principal component is extracted so that it explains the most variation in the dataset. The second principal component, which is uncorrelated with the first, attempts to explain the remaining variation in the dataset. The third principal component tries to explain the variation that the previous two principal components can't explain, and so on. Although this approach helps us reduce the dimensions, PCA is effective only when the relationships in the data are approximately linear, because each principal component is a linear combination of the original features. For a deeper understanding of PCA, visit the link below.
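To make the idea of ordered principal components concrete, here is a minimal sketch of PCA using NumPy's singular value decomposition. The data and the helper name `pca` are made up for illustration and are not taken from the article's dataset.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    return X_centered @ components.T, components, explained_ratio[:n_components]

rng = np.random.default_rng(0)
# Hypothetical data: the second feature is almost a multiple of the first,
# and the third feature is low-amplitude noise.
x0 = rng.normal(size=500)
X = np.column_stack([
    x0,
    2.0 * x0 + 0.1 * rng.normal(size=500),
    0.1 * rng.normal(size=500),
])

Z, components, ratio = pca(X, n_components=2)
print("reduced shape:", Z.shape)
print("explained variance ratio:", ratio)
```

Because the first two features are nearly collinear, almost all of the variance should land on the first component, which is the situation in which PCA works well.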


Autoencoders are used to reduce the dimensions of data when a nonlinear function describes the relationship between dependent and independent features. An autoencoder is a type of unsupervised artificial neural network used for automatic feature extraction from data. Autoencoders are among the most promising feature extraction tools and are used in applications such as speech recognition, self-driving cars, face alignment, and human gesture detection. The architecture of an autoencoder is shown in the figure below.

Autoencoder. Source: Introduction to autoencoders.

As seen in the figure above, an autoencoder architecture is divided into three parts: the encoder, the bottleneck, and the decoder. The encoder picks the crucial features from the data, while the decoder attempts to recreate the original data from those features. By retaining just the characteristics needed to reconstruct the data, the autoencoder decreases the data dimension. Autoencoders are a type of feed-forward network and can be trained using the same procedures as other feed-forward networks. The output of the autoencoder is the same as the input, with some loss. For this reason, autoencoders are also described as a lossy compression technique. Moreover, an autoencoder behaves much like PCA if the encoder and the decoder each consist of a single dense layer with a linear activation function.
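The encoder, bottleneck, and decoder can be sketched in a few lines of NumPy. The example below is a toy linear autoencoder (one linear layer on each side, the PCA-like case mentioned above) trained with plain gradient descent on made-up data. The sizes, learning rate, and number of steps are illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 observed features generated from 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 8))
X = X - X.mean(axis=0)

n_samples, n_features = X.shape
code_size = 2  # width of the bottleneck

W_enc = 0.1 * rng.normal(size=(n_features, code_size))
W_dec = 0.1 * rng.normal(size=(code_size, n_features))

learning_rate = 0.02
losses = []
for step in range(5000):
    codes = X @ W_enc        # encoder: 8 features -> 2 codes (bottleneck)
    X_hat = codes @ W_dec    # decoder: 2 codes -> 8 features
    error = X_hat - X
    losses.append(np.mean(error ** 2))

    # Gradients of the mean squared error with respect to both weight matrices.
    grad_out = 2.0 * error / error.size
    grad_dec = codes.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
print("code shape:", codes.shape)
```

The target of the training loop is the input itself, which is what makes the network an autoencoder; the reconstruction error that remains at the end is the "loss" in lossy compression.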

Stacked Autoencoder

Some datasets have complex relationships among their features, and a single autoencoder may not be enough to reduce the dimensionality of the input. For such use cases, we use stacked autoencoders. A stacked autoencoder is, as the name suggests, multiple autoencoders stacked on top of one another. A stacked autoencoder made of three autoencoders is shown in the following figure.

Image by author

According to the architecture shown in the figure above, the input data is first given to autoencoder 1. The output of autoencoder 1 and the input of autoencoder 1 are then combined and given as input to autoencoder 2. Similarly, the output of autoencoder 2 and the input of autoencoder 2 are combined and given as input to autoencoder 3. Thus, the input to autoencoder 3 is twice as large as the input to autoencoder 2, which in turn is twice as large as the original input. Because each stage sees more training data than the previous one, this technique also helps, to some extent, with the problem of insufficient data.
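The data flow between the stages can be sketched without any neural network at all. The snippet below assumes that outputs and inputs are stacked along the sample axis (the default behavior of `np.concatenate`), and it uses a hypothetical `noisy_identity` function as a stand-in for a trained autoencoder's `predict` method.

```python
import numpy as np

def build_stage_inputs(x_train, reconstruct_fns):
    """Return the training set seen by each autoencoder in the stack.

    reconstruct_fns[i] stands in for the predict() of the i-th trained autoencoder.
    """
    stage_inputs = [x_train]
    for reconstruct in reconstruct_fns:
        current = stage_inputs[-1]
        reconstructed = reconstruct(current)
        # Stack reconstructions and originals along the sample axis (axis 0).
        stage_inputs.append(np.concatenate((reconstructed, current), axis=0))
    return stage_inputs

rng = np.random.default_rng(0)

def noisy_identity(x):
    # Hypothetical stand-in for a trained autoencoder: input plus small noise.
    return np.clip(x + 0.01 * rng.normal(size=x.shape), 0.0, 1.0)

x_train = rng.random((10, 4000))
inputs = build_stage_inputs(x_train, [noisy_identity, noisy_identity])
for i, stage in enumerate(inputs, start=1):
    print(f"autoencoder {i} input shape: {stage.shape}")
```

Under this assumption the number of features stays at 4000 while the number of samples doubles at every stage, so the same network architecture can be used for all three autoencoders.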

Implementing stacked autoencoders using Python

To demonstrate a stacked autoencoder, we use the Fast Fourier Transform (FFT) of a vibration signal. The FFT of a vibration signal is used for fault diagnostics and many other applications. The data has very complex patterns, and a single autoencoder struggles to reduce its dimensions. The figure below is a plot of the FFT waveform. The amplitude of the FFT is scaled to lie between 0 and 1.
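The article's vibration data is not available here, so the following sketch uses a synthetic signal to show one way to obtain an FFT amplitude spectrum scaled to the range 0 to 1. The sampling rate, the tone frequencies, and the choice to drop the last bin so that 4000 points remain are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sampling_rate = 8000                   # Hz, hypothetical
t = np.arange(8000) / sampling_rate    # one second of samples

# Synthetic "vibration" signal: two tones plus measurement noise.
signal = (
    np.sin(2 * np.pi * 50 * t)
    + 0.5 * np.sin(2 * np.pi * 120 * t)
    + 0.1 * rng.normal(size=t.size)
)

# One-sided amplitude spectrum; drop the final (Nyquist) bin to keep 4000 points.
amplitude = np.abs(np.fft.rfft(signal))[:4000]
freqs = np.fft.rfftfreq(signal.size, d=1 / sampling_rate)[:4000]

# Min-max scaling so the amplitudes lie between 0 and 1.
scaled = (amplitude - amplitude.min()) / (amplitude.max() - amplitude.min())

print("number of points:", scaled.shape[0])
print("dominant frequency (Hz):", freqs[np.argmax(scaled)])
```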

Image by author

To get a better visual understanding, we reshape the signal into a 63*63 matrix and plot it. Because this is a vibration signal converted to an image, take it with a grain of salt. The figure below is the image representation of the vibration signal.

Image by author

I know it isn't easy to see a lot in this image. However, we can still see a few features in the picture. The bright white spot at approximately (0,15) is the peak seen in the FFT of the vibration signal.

Now we start creating our autoencoder.
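The reshape itself is a one-liner, with one caveat: a 63*63 image holds 3969 values, fewer than the 4000 points of the FFT. The article does not say how this was handled, so the sketch below simply assumes that the first 3969 points are kept and laid out row by row. The helper name `signal_to_image` and the toy spectrum are hypothetical.

```python
import numpy as np

def signal_to_image(values, side=63):
    """Keep the first side*side points and reshape them row by row."""
    values = np.asarray(values)
    needed = side * side
    if values.size < needed:
        raise ValueError(f"need at least {needed} points, got {values.size}")
    return values[:needed].reshape(side, side)

# Toy spectrum with a single peak at index 15 (assumed for illustration).
fft_amplitude = np.zeros(4000)
fft_amplitude[15] = 1.0

image = signal_to_image(fft_amplitude)
row, col = np.unravel_index(np.argmax(image), image.shape)
print(image.shape, (int(row), int(col)))
```

With a row-major layout, a peak near the start of the spectrum appears in the first row of the image, which matches a bright spot near the top edge.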

# Imports assumed here; the original listing omits them
import numpy as np
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

batch_size = 32
input_dim = x_train[0].shape[0]  # number of predictor variables
learning_rate = 1e-5  # used below as the L1 activity regularization strength
# Input layer
input_layer = Input(shape=(input_dim,), name="input")
# Encoder's first dense layer
encoder = Dense(2000, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(input_layer)
# Encoder's second dense layer
encoder = Dense(1000, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(encoder)
# Encoder's third dense layer
encoder = Dense(500, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(encoder)
# Code layer
encoder = Dense(200, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(encoder)
# Decoder's first dense layer
decoder = Dense(500, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(encoder)
# Decoder's second dense layer
decoder = Dense(1000, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(decoder)
# Decoder's third dense layer
decoder = Dense(2000, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(decoder)
# Output layer
decoder = Dense(input_dim, activation="sigmoid", activity_regularizer=regularizers.l1(learning_rate))(decoder)

The autoencoder designed above has three dense layers on each side of the code layer: three in the encoder and three in the decoder. Notice that the encoder and the decoder use the same layer sizes, so the decoder is a mirror image of the encoder.

The FFT signal has 4000 data points, so our input and output layers have 4000 neurons. As we go deeper into the encoder, the number of neurons decreases layer by layer. Finally, at the code layer, we have only 200 neurons. This autoencoder therefore tries to reduce the number of features from 4000 to 200.

Now, we build the model, compile it, and fit it on our training data. As the autoencoder's target output is the same as its input, we pass x_train as both the input and the target.
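As a quick sanity check on the size of this network, the arithmetic below counts the trainable parameters implied by the layer widths in the listing, assuming each dense layer has one weight per input-output pair plus one bias per output neuron.

```python
# Layer widths from input to output, as in the listing above.
layer_sizes = [4000, 2000, 1000, 500, 200, 500, 1000, 2000, 4000]

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
compression = layer_sizes[0] / min(layer_sizes)

for (a, b), count in zip(zip(layer_sizes, layer_sizes[1:]), per_layer):
    print(f"{a:>4} -> {b:<4}: {count:>9,} parameters")
print(f"total: {total:,} parameters")
print(f"compression at the code layer: {compression:.0f}x")
```

The two 4000-to-2000 layers at the ends dominate the count, which is worth keeping in mind when the training set is small.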

autoencoder_1 = Model(inputs=input_layer, outputs=decoder)
autoencoder_1.compile(metrics=["accuracy"], loss="mean_squared_error", optimizer="adam")
stack_1 = autoencoder_1.fit(x_train, x_train, epochs=200, batch_size=batch_size)

Once we have trained our first autoencoder, we concatenate the output and input of the first autoencoder.

autoencoder_2_input = autoencoder_1.predict(x_train)
autoencoder_2_input = np.concatenate((autoencoder_2_input, x_train))

Now the input for autoencoder 2 is ready. We build, compile, and train autoencoder 2 on this new dataset.

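# Built from the same input_layer and decoder tensors, so it reuses the layers defined above
autoencoder_2 = Model(inputs=input_layer, outputs=decoder)
autoencoder_2.compile(metrics=["accuracy"], loss="mean_squared_error", optimizer="adam")
stack_2 = autoencoder_2.fit(autoencoder_2_input, autoencoder_2_input, epochs=100, batch_size=batch_size)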

Once we have trained autoencoder 2, we move on to training our third autoencoder. As with the second autoencoder, the input to the third autoencoder is a concatenation of the output and the input of the second autoencoder.

autoencoder_3_input = autoencoder_2.predict(autoencoder_2_input)
autoencoder_3_input = np.concatenate((autoencoder_3_input, autoencoder_2_input))

Lastly, we train our third autoencoder. As we did for the previous two autoencoders, we build, compile, and train on our new data.

autoencoder_3 = Model(inputs=input_layer, outputs=decoder)
autoencoder_3.compile(metrics=["accuracy"], loss="mean_squared_error", optimizer="adam")
stack_3 = autoencoder_3.fit(autoencoder_3_input, autoencoder_3_input, epochs=50, batch_size=16)

After training our stacked autoencoder, we achieve an accuracy of approximately 90%. This means that our stacked autoencoder can recreate the original input signal with about 90% accuracy.
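Reconstruction quality can also be summarized with error-based measures computed directly from the original and recreated signals. The sketch below is an illustration on made-up numbers, not the metric reported above; the helper name `reconstruction_report` is hypothetical.

```python
import numpy as np

def reconstruction_report(original, reconstructed):
    """Return the mean squared error and R^2 between two signals."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    residual = original - reconstructed
    mse = np.mean(residual ** 2)
    # R^2: share of the original signal's variance captured by the reconstruction.
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((original - original.mean()) ** 2)
    return {"mse": mse, "r2": 1.0 - ss_res / ss_tot}

# Toy example: a four-point signal and a slightly imperfect reconstruction.
original = np.array([0.0, 0.5, 1.0, 0.5])
reconstructed = np.array([0.1, 0.5, 0.9, 0.5])
report = reconstruction_report(original, reconstructed)
print(report)
```

In practice, `original` would be a row of x_train and `reconstructed` the corresponding row returned by the trained model's `predict` call.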

The image of the original and recreated signal is shown below.

Image by author

