Learning from Multimodal Target

Mixture Density Neural Network: the violation of assumptions, implementation, analysis, and applications using TensorFlow.

Dwipam Katariya
Towards Data Science



Introduction

Making a prediction and measuring the uncertainty around it is very important for model evaluation, for statistical as well as business reasons. Fundamentally, in a supervised model we have predictors (x) and a target (y), and we try to predict the target from the predictors and quantify how well we predicted. This is usually done by minimizing a sum-of-squares or cross-entropy error function, which yields approximately the conditional mean of the target given the predictors for a continuous target, or posterior class probabilities for a categorical target.

Business Case example

Let’s consider an e-commerce company (ABC) that sells wristwatches and has a typical Amazon-type business model, where sellers list products and buyers visit to shop for them. ABC wants to model product listing prices so it can recommend a price when another seller tries to list a similar product. Below is the distribution of listing prices for a digital-type watch.

Digital Type Watch Listing Price Distribution | Source: Author

Looking at the distribution, a digital watch can sell anywhere from $20 to $2,500. Sellers are selling watches in 3 different price ranges: $20–$500, $700–$1,500, and $1,600–$2,500. This type of data violates the unimodal normality assumption of linear regression. However, in a practical scenario we don’t know the underlying distribution, as the empirical distribution can still look like a Normal distribution.

Fitting a model and violation of assumptions

If we fit a linear regression model with the type of the watch as the only predictor, it would predict $1,133 for a digital-type watch, overpredicting for watches in the 1st price range and underpredicting for those in the 3rd. The standard deviation of the prediction error would be $709, making this model a complete failure. If we were to generate random samples from this predicted mean (mu) and standard deviation (sigma), it would generate prices in the empty regions as well, and even negative prices! [Note: we would normally log-transform price before fitting a model, as price is not normally distributed anyway, but for simplicity let’s assume it is.] Even fitting a neural network to this type of data fails in the same way, as it still minimizes the sum-of-squared error function. What if the standard deviations are different for different types of watches? Consider the violin plot below:

Bi-modal distribution with increasing variance | Source: Author

Every watch type has a bimodal price distribution with different variances at different levels, and the variance changes from one watch type to the next. What happens to the prediction and to the standard deviation of the error if we fit a neural network or a linear regression to this data with only watch type as a predictor? Does this data satisfy all the assumptions of regression?

Normal Distribution equation | Source: Wikipedia

Fundamentally, when we minimize the sum of squares we minimize the squared error term (μ(x, Θ) − y)², where μ is the output of the model given x and its parameters Θ. Only μ depends on x; the remaining parameter, the standard deviation, is discarded, i.e. assumed constant across x, an assumption known as homoscedasticity.
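To make this concrete, here is the standard derivation in symbols (my own notation, consistent with the Normal density in the figure above): assuming a Gaussian likelihood with a single fixed sigma, only the squared-error term depends on Θ, so maximizing the likelihood is the same as minimizing the sum of squares.

p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y - \mu(x,\Theta))^2}{2\sigma^2}\right)

-\log L(\Theta) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big(y_i - \mu(x_i,\Theta)\big)^2 + N\log\sigma + \frac{N}{2}\log 2\pi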

Can we learn the multimodal distribution parameters by watch type? Yes, with an MDN.

Gaussian Mixture Model and Mixture Density Network(MDN)

I know many people are well acquainted with GMMs, yet I would like to describe them here to contrast with MDNs, as the two are very similar. A GMM is an unsupervised learning algorithm fitted with expectation-maximization; like K-means it assigns points to clusters, except it learns the parameters of an assumed distribution. K-means does not work well with overlapping clusters, while a GMM can segment overlapping clusters by learning the parameters of the underlying distribution. For the example above, we can recover the parameters of the underlying tri-modal distribution by applying a GMM. Concretely, a GMM computes n components (clusters) and the associated mu, sigma, and cluster membership probabilities by maximizing the likelihood given by:

GMM Likelihood Function | Source: https://en.wikipedia.org/wiki/Mixture_model

where K is the number of clusters and the phi_k are the latent cluster (mixing) probabilities. Using the EM algorithm we can learn phi, mu, and sigma.
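For readability, the likelihood in the figure can be written out as follows (standard GMM notation, with K components and mixing weights \phi_k that sum to 1):

p(y) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}\!\left(y \mid \mu_k, \sigma_k^2\right)

\log L(\phi, \mu, \sigma) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \phi_k \, \mathcal{N}\!\left(y_i \mid \mu_k, \sigma_k^2\right)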

For our ABC example, we compute 3 mus and 3 sigmas, along with a 1500 × 3 [number of samples × number of clusters] matrix of phi. As a sanity check, I trained a GMM on the above dataset and verified that the cluster mus and sigmas are close to the true mus and sigmas.
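A minimal sketch of such a check, assuming scikit-learn and a hypothetical one-dimensional array of listing prices (the actual script is the linked gmm.py, which may differ):

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical tri-modal listing prices: 500 samples per price band,
# roughly matching the $20-$500, $700-$1,500, $1,600-$2,500 ranges above.
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(260, 80, 500),
    rng.normal(1100, 130, 500),
    rng.normal(2050, 150, 500),
])

# Fit a 3-component GMM and inspect the learned parameters.
gmm = GaussianMixture(n_components=3, random_state=0).fit(prices.reshape(-1, 1))
print("means:", gmm.means_.ravel())
print("sigmas:", np.sqrt(gmm.covariances_).ravel())
print("mixing weights:", gmm.weights_)

# Per-sample cluster membership probabilities (the 1500 x 3 phi matrix).
phi = gmm.predict_proba(prices.reshape(-1, 1))
print("phi shape:", phi.shape)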

GMM Output [https://github.com/dwipam/MDN/blob/master/gmm.py] | Source: Author

This builds the base for the Mixture Density Network (MDN). With an MDN we can learn mixture kernel parameters that depend on the predictors. For this purpose, we will restrict our kernel to the Gaussian kernel. An MDN optimizes the same likelihood function as a GMM and outputs mu, sigma, and phi for each sample in our dataset.

3 Cluster MDN Architecture | Source: Author

Without going into further mathematical details, let’s jump to training a model. Christopher M. Bishop has explained the math better than I can!

Building and training MDN

Synthetic data set equally distributed across 2 clusters and x dependent parameters(mu, sigma) | Source: Author

Generating Data

We will try to generate data with the below x and y relationship:

x_ = sin(0.5 * x) * 3.0 + x * 0.5
f1(x) = N(x_, square(x_)/15) + N(0, 1)        # Cluster 1 y
f2(x) = N(f1(x) + 12, square(x)/100) + N(0, 1) # Cluster 2 y

These equations generate data with 2 distinct clusters and x-dependent variance.
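A minimal NumPy sketch of this generator, based on my reading of the equations above (the sample count, the x range, and treating the second argument of N(·, ·) as a standard deviation are assumptions; the exact code lives in the training notebook):

import numpy as np

rng = np.random.default_rng(42)
n = 1000                                 # assumed samples per cluster

x = rng.uniform(-10, 10, n)              # assumed input range
x_ = np.sin(0.5 * x) * 3.0 + x * 0.5     # shared mean curve

# Cluster 1: mean x_, spread growing with x_
y1 = rng.normal(x_, np.square(x_) / 15) + rng.normal(0, 1, n)
# Cluster 2: shifted up by 12, spread growing with x
y2 = rng.normal(y1 + 12, np.square(x) / 100) + rng.normal(0, 1, n)

X = np.concatenate([x, x])
Y = np.concatenate([y1, y2])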

Fitting MLP

A fully connected neural network was fit to the data with the sum-of-squares loss function, and random samples were generated from the predictions. As stated earlier, the predictions are the means, and the standard deviation is calculated with the formula below,

Standard Deviation(Sigma) from the model | Source: Author

where the error is (y-model_predictions)

Normal random samples were then generated from this mean and standard deviation.
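As a rough sketch of this baseline, assuming the X and Y arrays from the generator above and layer sizes of my own choosing (the notebook's architecture may differ):

import numpy as np
import tensorflow as tf

# Simple fully connected regressor trained with mean squared error.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(1),
])
mlp.compile(optimizer="rmsprop", loss="mse")
mlp.fit(X.reshape(-1, 1), Y, epochs=200, verbose=0)

# One predicted mean per x, plus a single global sigma from the residuals.
y_hat = mlp.predict(X.reshape(-1, 1)).ravel()
sigma = np.sqrt(np.mean((Y - y_hat) ** 2))

# Draw "predicted" samples: one Normal per x, all sharing the global sigma.
samples = np.random.normal(y_hat, sigma)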

MLP estimates | Source: Author

This is a problem: the model makes predictions in the empty region, underpredicting for 50% of the data and overpredicting for the remaining 50%. Since the means and sigmas are data-dependent, let’s try using an MDN.

Constructing MDN

Fully Connected 4 Layer Neural Network

The first part is the same as building a neural network for any other problem. Here we create a 4-hidden-layer NN with (50, 20, 20, 20) nodes respectively. I am using tanh activation, but any other activation is fine here, as the function we are trying to learn is simple and won’t be a problem for other activation functions.
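The embedded gist builds this with low-level TensorFlow; below is a sketch of an equivalent trunk using the Keras functional API, with the same sizes and activation (the variable names are mine):

import tensorflow as tf

# Shared trunk of the MDN: 4 hidden layers, (50, 20, 20, 20) units, tanh.
inputs = tf.keras.Input(shape=(1,))
h = tf.keras.layers.Dense(50, activation="tanh")(inputs)
h = tf.keras.layers.Dense(20, activation="tanh")(h)
h = tf.keras.layers.Dense(20, activation="tanh")(h)
h = tf.keras.layers.Dense(20, activation="tanh")(h)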

Defining mu, sigma, phi

Then we create a `K`-sized vector for each of mu, sigma, and phi (K is 2 in our case); see the sketch after the list below.

  • Our targets range from negative to positive values, so we don’t use an activation function when estimating the mean (mu).
  • The standard deviation (sigma) cannot be negative, so we can use softplus, ELU, ReLU, or another variant, as long as it outputs values > 0 and does not cap the output the way a sigmoid does.
  • The posterior cluster assignment probabilities (phi) are computed over all K components. As the components are mutually exclusive events, we use the softmax activation function.
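Continuing the sketch above (it reuses `inputs` and `h` from the trunk; K and the head names are my own):

K = 2  # number of mixture components

# Mean: unbounded target, so no activation.
mu = tf.keras.layers.Dense(K, name="mu")(h)
# Standard deviation: must be positive, so softplus (ELU/ReLU variants also work).
sigma = tf.keras.layers.Dense(K, activation="softplus", name="sigma")(h)
# Mixing probabilities: mutually exclusive, so softmax over the K components.
phi = tf.keras.layers.Dense(K, activation="softmax", name="phi")(h)

mdn = tf.keras.Model(inputs, [mu, sigma, phi])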

The loss function can be defined using the TensorFlow Probability built-in distribution tfp.distributions.MixtureSameFamily. However, if you really want to compute the loss yourself, you can still do it with the formula below.

MDN Negative Log Loss | Source: Author
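Written out, the loss in the figure is the negative log-likelihood of the same mixture, with every parameter now a function of x predicted by the network:

\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \log \sum_{k=1}^{K} \phi_k(x_i)\, \mathcal{N}\!\left(y_i \mid \mu_k(x_i), \sigma_k^2(x_i)\right)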
Tensorflow MDN LOSS without tfp.MixtureSameFamily

The loss is computed using the same GMM likelihood equation mentioned above. First, compute the per-component Gaussian density of each sample from its mu and sigma. Then multiply by the associated component phi and sum over the components to get the likelihood. Taking the log of the likelihood and summing across all samples gives the log-likelihood. The mean loss is then the mean negative log-likelihood, which goes into the optimizer. If you want to save a step and let TensorFlow take care of this, you can use tfp.distributions.MixtureSameFamily. TensorFlow Probability is widely used by data scientists, ML researchers, and statisticians for probabilistic modeling.
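A sketch of that computation in TensorFlow, assuming the (batch, K)-shaped outputs from the model sketch above (the embedded gist is the authoritative version; the names here are mine):

import math
import tensorflow as tf

def mdn_neg_log_likelihood(y, mu, sigma, phi):
    # y:     (batch, 1) targets
    # mu:    (batch, K) component means
    # sigma: (batch, K) component standard deviations (> 0)
    # phi:   (batch, K) mixing probabilities (rows sum to 1)

    # Per-component Gaussian density of y.
    norm = 1.0 / (sigma * tf.sqrt(2.0 * math.pi))
    component_pdf = norm * tf.exp(-0.5 * tf.square((y - mu) / sigma))
    # Weight by phi and sum over components to get the mixture likelihood.
    likelihood = tf.reduce_sum(phi * component_pdf, axis=1)
    # Mean negative log-likelihood over the batch (epsilon guards log(0)).
    return -tf.reduce_mean(tf.math.log(likelihood + 1e-10))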

Tensorflow MDN LOSS with tfp.MixtureSameFamily

tfp.distributions.Categorical defines the categorical mixing distribution over the components. As we limit ourselves to the Normal distribution, we can define the components with tfp.distributions.Normal; we could have written out the Normal density ourselves instead! MixtureSameFamily then combines the vectors of mus and sigmas into a single mixture distribution. We use log_prob to return the log-likelihood and finally minimize the negative log-likelihood.
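An equivalent sketch with TensorFlow Probability (again a minimal version of what the gist does; the shapes match the model sketch above):

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def mdn_neg_log_likelihood_tfp(y, mu, sigma, phi):
    # Mixture of K univariate Normals per sample, with mixing weights phi.
    mixture = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=phi),
        components_distribution=tfd.Normal(loc=mu, scale=sigma),
    )
    # log_prob expects targets of shape (batch,), hence the squeeze.
    return -tf.reduce_mean(mixture.log_prob(tf.squeeze(y, axis=-1)))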

Define optimizer

One can use any optimizer, but I decided to use RMSProp as it still performs better than the plain stochastic gradient descent optimizer.

Finally, we train our network, generate random samples from the mixture distribution, and look at the predicted means (mu), standard deviations (sigma), and cluster probabilities (phi).
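Putting the pieces together, a minimal training and sampling loop might look like the following; it assumes the `mdn` model, the loss function, and the X, Y arrays from the earlier sketches, and the step count and learning rate are arbitrary:

import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
X_t = tf.constant(X.reshape(-1, 1), dtype=tf.float32)
Y_t = tf.constant(Y.reshape(-1, 1), dtype=tf.float32)

for step in range(3000):
    with tf.GradientTape() as tape:
        mu, sigma, phi = mdn(X_t, training=True)
        loss = mdn_neg_log_likelihood(Y_t, mu, sigma, phi)
    grads = tape.gradient(loss, mdn.trainable_variables)
    optimizer.apply_gradients(zip(grads, mdn.trainable_variables))

# Sample from the fitted mixture: pick a component per x using phi,
# then draw from the corresponding Normal(mu_k, sigma_k).
mu, sigma, phi = [t.numpy() for t in mdn(X_t)]
k = np.array([np.random.choice(len(p), p=p / p.sum()) for p in phi])
samples = np.random.normal(mu[np.arange(len(k)), k],
                           sigma[np.arange(len(k)), k])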

MDN Estimated Mean | Source: Author

After training the MDN with 2 components, we get exactly what we hoped for. The estimated means no longer fall in the empty regions, and the model correctly estimates the bi-modal distribution means per x.

In order to look at the variance, we can generate random samples and make sure that they fall within their respective clusters.

Random Samples from MDN estimated means and sigmas | Source: Author

Compared to the MLP model predictions, we can see two distinct clusters overlapping the underlying truth, which means the model accurately estimates the x-dependent bi-modal distribution parameters.

Estimated Cluster probabilities distribution | Source: Author

The cluster probabilities (phi) are very close to 50% (the 46%–54% spread comes from variance in the model parameters and can differ between training runs), because I generated the same number of samples for each cluster at each x, i.e. P(cluster=1|x) = P(cluster=2|x). If this were not the case, we would see a different distribution of cluster assignment probabilities.

Conclusion

With our earlier ABC example, the company can fit an MDN and gain a true understanding of the price distribution. If we don’t know whether our target is multi-modal, the cluster probabilities (phi) can give a good indication of that. In practice, it can be extremely time-consuming and hard to train an MDN, as the distributional assumptions might not hold. An MDN can be used to learn a mixture distribution from different families, but your model is only as right as your assumptions!

With this simple blog post, I demonstrated the disadvantage of the sum-of-squares loss and the advantage of applying an MDN to measure the uncertainty around the parameter estimates, and how important they are for business problems. Given its simple architecture and ease of implementation (thanks to TF), I would love to hear thoughts from readers!

Training Notebook:

Cite:

@article{dkatariya2020MDN,
title={Learning from Multimodal Target},
author={Katariya, Dwipam},
journal={Medium, Towards Data Science},
volume={4},
year={2020}
}

References:

MDN Paper: https://github.com/dwipam/MDN/blob/master/MDN.pdf
TF MixtureSameFamily: https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/MixtureSameFamily
Shaked’s Blog: https://engineering.taboola.com/predicting-probability-distributions/
GMM: https://en.wikipedia.org/wiki/GMM
