Unsupervised Learning of Gaussian Mixture Models on a SELU auto-encoder (Not another MNIST)

Gonçalo Abreu
Towards Data Science
5 min read · Aug 20, 2017


MNIST is a classical Machine Learning dataset. Imagine the following scenario:

Figure 1 - Alien who cannot understand human handwritten digits images

You are an alien who doesn’t understand human culture and who, for some reason, managed to acquire all the images from the handwritten digit dataset (MNIST). Since you are an alien and don’t know what the concept of a human number is, or what its corresponding symbols look like, you have to guess, just by analyzing the image data, how many different symbols (numbers) there are (and, if possible, generate new representations instead of just copying the ones you already have). In Machine Learning this problem can be phrased as: “How do you select the correct number of components in a Finite Mixture Model?”

A Finite Mixture Model is a clustering algorithm whose purpose is to determine the inner structure of data when no information other than the observed values is available.

Figure 2 — Number 5

The first problem with clustering MNIST is that each image is 28x28 pixels, which means each digit lives in a 784-dimensional space.

Most clustering methods suffer from the curse of dimensionality, so a dimensionality reduction method is needed before the unsupervised learning can be performed. Also, as the title suggests, a Gaussian Mixture Model will be used to learn the clusters (which means that the more Gaussian the behavior of the compressed space, the better the fit will be).

Figure 3 — SELU auto encoder with regularization to enforce space constraints (promote structure?)

SELU is a new activation function for Neural Networks which has the property of self-normalization. If we build a classical autoencoder and switch the activation functions to SELUs, the compressed space converges to a well-defined mean and variance. With luck, some of the distributions that make up the compressed space will be approximately Gaussian and can be captured nicely by a Gaussian Mixture Model. L2 regularization is used on the encoded layer to constrain its weight space. Since it is on this layer that the Gaussian fit will be made, this constraint, together with the fact that only 6 neurons are used, creates an information bottleneck which favors a structured representation of the data.
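A minimal sketch of such an architecture, assuming Keras: only the SELU activations, the 6-neuron encoded layer, and the L2 penalty on its weights come from the description above; the remaining layer sizes and hyperparameters are illustrative guesses.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Encoder: 784 -> 6, all SELU with lecun_normal init (the recommended init for SELU).
# L2 is applied to the 6-neuron bottleneck's weights, as described in the text.
inputs = keras.Input(shape=(784,))
h = layers.Dense(256, activation='selu', kernel_initializer='lecun_normal')(inputs)
h = layers.Dense(64, activation='selu', kernel_initializer='lecun_normal')(h)
encoded = layers.Dense(6, activation='selu', kernel_initializer='lecun_normal',
                       kernel_regularizer=regularizers.l2(1e-4))(h)

# Decoder layers are kept as named objects so the same (shared) weights can later
# decode samples drawn from the Gaussian Mixture Model.
dec1 = layers.Dense(64, activation='selu', kernel_initializer='lecun_normal')
dec2 = layers.Dense(256, activation='selu', kernel_initializer='lecun_normal')
dec_out = layers.Dense(784, activation='sigmoid')

autoencoder = keras.Model(inputs, dec_out(dec2(dec1(encoded))))
encoder = keras.Model(inputs, encoded)

latent = keras.Input(shape=(6,))
decoder = keras.Model(latent, dec_out(dec2(dec1(latent))))

autoencoder.compile(optimizer='adam', loss='mse')
```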

First, let’s check the distribution of the compressed space by doing a pairplot (a 6-dimensional one):

Figure 4 — Compressed Space Pairplot

This space is produced by compressing the MNIST dataset from images of 784 dimensions (28x28 pixels) down to 6 dimensions. It looks like there is some Gaussian behavior after all! Remember that the autoencoder has no access to the labels; the only reason they are included in the pairplot is to give us a reference for possible clusters.
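A sketch of how such a pairplot can be produced with seaborn, reusing the hypothetical `autoencoder` and `encoder` models from the previous snippet; the training hyperparameters are assumptions.

```python
import pandas as pd
import seaborn as sns
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

# Train the autoencoder (reconstruction target = input), then encode to 6-D codes.
autoencoder.fit(x_train, x_train, epochs=30, batch_size=256, shuffle=True)
codes = encoder.predict(x_train)

# Labels are used only to color the pairplot; the autoencoder never sees them.
df = pd.DataFrame(codes, columns=[f'z{i}' for i in range(6)])
df['label'] = y_train

sample = df.sample(5000, random_state=0)   # subsample just to keep the plot fast
sns.pairplot(sample, hue='label', plot_kws={'s': 5})
```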

But now a problem arises: how to choose the correct number of components for the Gaussian Mixture Model? Although there are many cost functions one can use to choose the appropriate number of components, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), there are algorithms that can do this selection for us (without having to perform a different fit for each number of components). One of them was suggested in the paper Unsupervised Learning of Finite Mixture Models.
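For reference, this is what the brute-force alternative looks like with sklearn’s stock GaussianMixture and BIC. It is not the algorithm from the paper, just the per-component-count refitting that such algorithms avoid (`codes` is the encoded training set from the previous snippet).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Baseline: fit one GaussianMixture per candidate k and keep the lowest BIC.
def select_by_bic(codes, k_range=range(1, 31)):
    fits = [GaussianMixture(n_components=k, covariance_type='full',
                            random_state=0).fit(codes) for k in k_range]
    bics = [m.bic(codes) for m in fits]
    return fits[int(np.argmin(bics))]

best_gmm = select_by_bic(codes)
print(best_gmm.n_components)
```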

To fit the Gaussian Mixture Model with automatic selection of the number of components, I’m going to use an implementation (still a WIP) that I made using that paper’s insights (I essentially converted my Professor’s Matlab code to Python): Github package

This code was built following the sklearn paradigm and can be used with that library, since it implements ModelTransformer.

Figure 5 — Training the autoencoder on the MNIST dataset and fitting the Gaussian Mixture Model

It is possible to use the code in Figure 5 to fit the Gaussian Mixture Model to the compressed space while picking the appropriate number of components using the Minimum Message Length criterion.

After the model is fitted, it is possible to sample new images (in the compressed space) from the Gaussian Mixture Model, with each component representing a different concept of those images (numbers!). The sampled compressed-space points go through the SELU decoder trained before, and the output is newly generated numbers sampled from the Gaussian Mixture Model.
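A sketch of that sampling step, assuming the fitted mixture exposes sklearn-style `means_` and `covariances_` and reusing the hypothetical `decoder` from the earlier snippet; the author’s package would provide the fitted mixture in place of `best_gmm`.

```python
import numpy as np

# Draw latent codes from one fitted Gaussian component, then decode them into images.
component = 0                      # each component corresponds to one learned "concept"
mean = best_gmm.means_[component]
cov = best_gmm.covariances_[component]
sampled_codes = np.random.multivariate_normal(mean, cov, size=16)

generated = decoder.predict(sampled_codes).reshape(-1, 28, 28)
```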

Congratulations, you have successfully used your alienish algorithms and can now reproduce and create human digits!

Each of the following figures was sampled from a different Gaussian component.

You can create your alien 0's:

Figure 6 — Alien generated 0’s by learning what human digit 0 is

You can create your alien 1's:

Figure 7 — Alien generated 1’s by learning what human digit 1 is

You can create your alien 2's:

Figure 8 — Alien generated 2’s by learning what human digit 2 is

You can create your alien 9's:

Figure 9 — Alien generated 9’s by learning what human digit 9 is

It makes sense that 10 clusters would be the perfect choice to describe all the digit images. But if you think about it, it’s easy to understand why a clustering algorithm would identify similarities between some 9’s and 4’s and create a specific cluster for them. Remember that since this is a Gaussian Mixture, this cluster could be completely contained inside the clusters for 4’s and 9’s!

The clustering algorithm actually found 21 clusters! You can check them in the jupyter notebook used for this Medium post.

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” — Jain and Dubes (1988)

Btw, here is a cool visualization of the clustering algorithm converging in a toy example:

Figure 10 — Unsupervised Learning of Gaussian Mixture Models

The code used to generate this Medium post is here.
