Bayesian Gaussian mixture models (without the math) using Infer.NET

A quick practical guide to coding Gaussian mixture models in Infer.NET

Jaco du Toit
Towards Data Science


This post provides a brief introduction to Bayesian Gaussian mixture models and shares my experience of building these types of models in Microsoft’s Infer.NET probabilistic graphical model framework. Being comfortable and familiar with k-means clustering and Python, I found it challenging to learn C#, Infer.NET and some of the underlying Bayesian principles used in probabilistic inference. That said, there is a way for Python programmers to integrate .NET components and services using pythonnet, which I will cover in a follow-up post. My hope is that the content of this post will save you time, remove any intimidation that the theory may bring and demonstrate some of the advantages of what is known as the model-based machine learning (MBML) approach. Please follow the guidelines provided in the Infer.NET documentation to get set up with the Infer.NET framework.

Bayesian Gaussian mixture models constitute a form of unsupervised learning and are useful for fitting multi-modal data in tasks such as clustering, data compression, outlier detection, or generative classification. Each Gaussian component is usually a multivariate Gaussian with a mean vector and covariance matrix, but for the sake of demonstration we will consider the less complicated univariate case.

We begin by sampling data from a univariate Gaussian distribution and storing it in a .csv file using Python code:
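
The snippet below is a minimal sketch of such a script rather than the exact code from the post; the file name data.csv and the means, precisions and mixture-weight vector p are assumptions chosen to match the figures that follow.

```python
import numpy as np
import pandas as pd

# Minimal sketch: sample N points from a mixture of univariate Gaussians given
# per-component means, precisions and mixture weights p, then write them to a .csv file.
np.random.seed(0)

N = 100
means = np.array([1.0, 3.0, 5.0])
precisions = np.array([10.0, 10.0, 10.0])
p = np.array([0.0, 0.0, 1.0])  # only the third component -> a single Gaussian with mean=5, precision=10

components = np.random.choice(len(means), size=N, p=p)   # which component generates each point
x = np.random.normal(means[components], 1.0 / np.sqrt(precisions[components]))

pd.DataFrame({"x": x}).to_csv("data.csv", index=False)
```

Setting p=[0.4, 0.2, 0.4] later in the post produces the three-component data set.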

This is what our data looks like:

Left: a plot of 100 data points sampled from a Gaussian distribution with mean=5 and precision=10. Right: a histogram of the data. (Images by author)

Let us pretend for a moment that we did not know the distribution that generated our data set. We visualise the data and make the assumption that it was generated by a Gaussian distribution. In other words, we hope that a Gaussian distribution can sufficiently describe our data set. However, we do not know the location or the spread of this Gaussian distribution. A Gaussian distribution can be parameterised by a mean and a variance parameter. Sometimes it is mathematically easier to use a mean and a precision, where precision is simply the inverse of the variance. We will stick with precision, for which the intuition is that the higher the precision, the narrower (or more “certain”) the spread of the Gaussian distribution.

Firstly, we are interested in finding the mean parameter of this Gaussian distribution, and will pretend that we know the value of its precision (we set the precision=1). In other words, we think our data is Gaussian distributed and we are unsure what its mean parameter is, but we feel confident that it has a precision=1. Can we learn its mean parameter from the data? It turns out we need a second Gaussian distribution to depict the mean of our first Gaussian distribution. This is known as a conjugate prior. Here is a graphical representation of learning the unknown mean (using a Gaussian prior with parameters mean=0, precision=1):

Bayes network for learning the mean of data with known precision. (Image by author)

Notice the difference between the mean random variable and the known precision in the graph. Here is the code in Infer.NET:
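
The snippet below is a minimal sketch of this model rather than the author’s exact listing; it assumes the sampled data has been loaded from data.csv into a double[] and uses illustrative variable names.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML.Probabilistic.Algorithms;
using Microsoft.ML.Probabilistic.Models;

// Load the 100 sampled points written out by the Python script (header row skipped)
double[] data = File.ReadAllLines("data.csv").Skip(1).Select(double.Parse).ToArray();

// Gaussian prior over the unknown mean: mean=0, precision=1
Variable<double> mean = Variable.GaussianFromMeanAndPrecision(0, 1).Named("mean");

// Each observation is Gaussian with the unknown mean and a known precision of 1
Range n = new Range(data.Length).Named("n");
VariableArray<double> x = Variable.Array<double>(n).Named("x");
x[n] = Variable.GaussianFromMeanAndPrecision(mean, 1).ForEach(n);
x.ObservedValue = data;

// Infer the posterior over the mean using variational message passing
var engine = new InferenceEngine(new VariationalMessagePassing());
Console.WriteLine("Posterior Gaussian (Gaussian mean): " + engine.Infer(mean));
```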

Posterior Gaussian (Gaussian mean): Gaussian(4.928, 0.009901)

After observing only 100 data points we now have a posterior Gaussian distribution, which depicts the mean of our data x. We have learned something useful from the data! But wait… we can also learn something about its precision, without having to pretend it is fixed at 1. How? We do the same thing we did to the mean and place a distribution over the precision (effectively removing our “infinitely confident” knowledge that it was equal to 1 by replacing it with something resembling our “uncertainty”). The conjugate prior for precision is the Gamma distribution. We update our graphical representation of the model by including the Gamma distribution (with prior parameters shape=2, rate=1) over a new precision random variable:

Bayes network for learning the mean and precision of data. (Image by author)

Here is the code in Infer.NET:
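
Again as a minimal sketch (the using directives and data loading from the previous snippet are assumed), the only change is that the fixed precision becomes a random variable with a Gamma(shape=2, rate=1) prior.

```csharp
// Priors: Gaussian over the unknown mean, Gamma(shape=2, rate=1) over the unknown precision
Variable<double> mean = Variable.GaussianFromMeanAndPrecision(0, 1).Named("mean");
Variable<double> precision = Variable.GammaFromShapeAndRate(2, 1).Named("precision");

// Each observation now depends on both random variables
Range n = new Range(data.Length).Named("n");
VariableArray<double> x = Variable.Array<double>(n).Named("x");
x[n] = Variable.GaussianFromMeanAndPrecision(mean, precision).ForEach(n);
x.ObservedValue = data;

var engine = new InferenceEngine(new VariationalMessagePassing());
Console.WriteLine("Posterior Gaussian (Gaussian mean): " + engine.Infer(mean));
Console.WriteLine("Posterior Gamma (Gaussian precision): " + engine.Infer(precision));
```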

Posterior Gaussian (Gaussian mean): Gaussian(4.971, 0.001281)
Posterior Gamma (Gaussian precision): Gamma(52, 0.1499)[mean=7.797]

A recap of our assumptions up to this point (referring to the figures below):

  1. the data x is Gaussian distributed;
  2. we pretended to have full knowledge of its precision (precision=1) and learned its mean by using a Gaussian prior;
  3. we then stopped pretending to “know” the precision and learned it by using a Gamma prior. Notice the difference this makes in the figure on the left below: the first model could not learn the precision because of the restriction we imposed (shown in green);
  4. the parameters for an unknown mean and unknown precision are Gauss-Gamma distributed. After learning from the 100 data points, the prior distribution over these parameters (shown in red) updated to the posterior distribution (shown in blue in the figure on the right below).

Left: a Gaussian distribution with known precision (green) and unknown precision (blue). Right: a prior Gauss-Gamma distribution over the mean and precision parameters (red) and posterior distribution after learning from 100 data points (blue). (Images by author)

Infer.NET can produce a factor graph of our model when setting ShowFactorGraph = true on the inference engine (see the snippet after the figure below). Factor nodes are shown as black boxes and variable nodes as white boxes. This graph shows our data x (the observed variable array at the bottom), which depends on a Gaussian factor. The Gaussian factor depends on a random variable called mean and a random variable called precision. These random variables depend on a Gaussian prior and a Gamma prior respectively. The parameter values of both prior distributions are shown at the top of the graph.

Infer.NET produced factor graph of our model. (Image by author generated by Infer.NET)
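
For example (a minimal sketch reusing the engine and mean variables from the snippets above), the flag is simply enabled before calling Infer:

```csharp
var engine = new InferenceEngine(new VariationalMessagePassing());
engine.ShowFactorGraph = true;           // show the compiled factor graph when inference runs
Console.WriteLine(engine.Infer(mean));   // triggers model compilation, graph output and inference
```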

We made certain assumptions in order to learn the mean and precision of the Gaussian distribution. In MBML, learning and inference are essentially the same. You can read more on the supported Infer.NET inference techniques and their differences here. The examples in this article make use of Variational message passing (VMP). In order to learn more complex distributions (e.g., multi-modal densities) this model will not be expressive enough and should be extended by introducing more assumptions. Ready to mix things up?

As with many things in life, if one Gaussian is good, more should be better, right? First, we need a new data set, generated with the same Python code introduced at the start of the post. The only difference is that we set p=[0.4, 0.2, 0.4]. This means that 80% of the data is sampled from the first and third Gaussian distributions (40% each), while 20% is sampled from the second. The data can be visualised:

Left: a plot of 100 data points sampled from three different Gaussian distributions with means=[1, 3, 5] and precisions=[10, 10, 10]. Right: a histogram of the data. (Images by author)

To create the model we will use k=3 Gaussian distributions, also known as components, to fit our data set. In other words, we have three mean random variables and three precision random variables that we need to learn, but we also need a latent random variable z. This random variable has a discrete distribution and is responsible for selecting the component that best describes its associated observed x value. For example, more weight should be assigned to state one of z₀ if the observed x₀ is best explained by Gaussian component one. In this example, we will pretend to know the mixture weights responsible for all data points and use a uniform assignment (w₀=1/3, w₁=1/3, w₂=1/3) as shown in the graph below.

Bayes network for learning a mixture of Gaussian distributions with known mixture weights. (Image by author)

Here is the code in Infer.NET:
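
A minimal sketch of the mixture model with fixed, uniform weights follows (again assuming the data array from earlier; the random initialisation of z is one common way of breaking the symmetry between the otherwise interchangeable components).

```csharp
using Microsoft.ML.Probabilistic.Distributions;
using Microsoft.ML.Probabilistic.Math;

int K = 3;
Range k = new Range(K).Named("k");

// One mean and one precision random variable per component
VariableArray<double> means = Variable.Array<double>(k).Named("means");
means[k] = Variable.GaussianFromMeanAndPrecision(0, 1).ForEach(k);
VariableArray<double> precisions = Variable.Array<double>(k).Named("precisions");
precisions[k] = Variable.GammaFromShapeAndRate(2, 1).ForEach(k);

// Known, uniform mixture weights and a latent assignment z for every data point
Range n = new Range(data.Length).Named("n");
VariableArray<double> x = Variable.Array<double>(n).Named("x");
VariableArray<int> z = Variable.Array<int>(n).Named("z");
using (Variable.ForEach(n))
{
    z[n] = Variable.Discrete(k, new double[] { 1.0 / 3, 1.0 / 3, 1.0 / 3 });
    using (Variable.Switch(z[n]))
    {
        x[n] = Variable.GaussianFromMeanAndPrecision(means[z[n]], precisions[z[n]]);
    }
}
x.ObservedValue = data;

// Random initialisation of the assignments breaks symmetry between the components
Discrete[] zInit = new Discrete[data.Length];
for (int i = 0; i < zInit.Length; i++) zInit[i] = Discrete.PointMass(Rand.Int(K), K);
z.InitialiseTo(Distribution<int>.Array(zInit));

var engine = new InferenceEngine(new VariationalMessagePassing());
Gaussian[] postMeans = engine.Infer<Gaussian[]>(means);
Gamma[] postPrecisions = engine.Infer<Gamma[]>(precisions);
for (int i = 0; i < K; i++)
{
    Console.WriteLine("Posterior Gaussian (Gaussian mean): " + postMeans[i]);
    Console.WriteLine("Posterior Gamma (Gaussian precision): " + postPrecisions[i]);
}
```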

Posterior Gaussian (Gaussian mean): Gaussian(1.008, 0.003284)
Posterior Gamma (Gaussian precision): Gamma(25.43, 0.2547)[mean=6.477]
Posterior Gaussian (Gaussian mean): Gaussian(5.045, 0.004061)
Posterior Gamma (Gaussian precision): Gamma(17, 0.4812)[mean=8.178]
Posterior Gaussian (Gaussian mean): Gaussian(2.889, 0.007502)
Posterior Gamma (Gaussian precision): Gamma(13.58, 0.4209)[mean=5.715]

The three Gaussian distributions/components learned from the data are plotted below:

The three learned Gaussian distributions and their sum with known weights set to 1/3. (Image by author)

Hold on… we claimed to have full knowledge of the component weights, but what if we do not? Can we also learn the weights from the data? Indeed, but we need a prior! A Dirichlet distribution is the conjugate prior for the discrete/categorical distribution. The graph below is updated to show a random variable for the unknown weights with its accompanying Dirichlet prior.

Bayes network for learning a mixture of Gaussian distributions with unknown mixture weights. (Image by author)

Here is the code in Infer.NET:
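
As a sketch of only the parts that change relative to the previous snippet: the fixed weights are replaced by a random variable with a symmetric Dirichlet prior, and its posterior is inferred alongside the means and precisions; everything else, including the symmetry-breaking initialisation, carries over.

```csharp
// Unknown mixture weights with a symmetric Dirichlet prior over the k components
Variable<Vector> weights = Variable.Dirichlet(k, new double[] { 1, 1, 1 }).Named("weights");

using (Variable.ForEach(n))
{
    z[n] = Variable.Discrete(weights);   // assignments now draw from the random weights
    using (Variable.Switch(z[n]))
    {
        x[n] = Variable.GaussianFromMeanAndPrecision(means[z[n]], precisions[z[n]]);
    }
}

// The posterior over the weights is a Dirichlet; its mean gives the learned weights
Dirichlet postWeights = engine.Infer<Dirichlet>(weights);
Console.WriteLine("Posterior weight distribution: " + postWeights.GetMean());
```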

Posterior Gaussian (Gaussian mean): Gaussian(0.9955, 0.003208)
Posterior Gamma (Gaussian precision): Gamma(25.04, 0.2663)[mean=6.667]
Posterior Gaussian (Gaussian mean): Gaussian(2.719, 0.02028)
Posterior Gamma (Gaussian precision): Gamma(10.1, 0.3655)[mean=3.693]
Posterior Gaussian (Gaussian mean): Gaussian(4.513, 0.02233)
Posterior Gamma (Gaussian precision): Gamma(20.86, 0.06266)[mean=1.307]
Posterior weight distribution: 0.4563 0.168 0.3757

The three Gaussian distributions/components learned from the data and their learned weights are illustrated below:

The three learned Gaussian distributions and their sum with unknown weights = {0.45, 0.16, 0.37}. (Image by author)

In summary, we started this journey by assuming that our first data set could be sufficiently described by a Gaussian distribution. We were able to learn the mean and precision parameters of the Gaussian distribution using the observed data and VMP inference in Infer.NET. Our second data set used a more complex generating mechanism, which required a more expressive model. We then introduced a latent variable z and a Dirichlet prior, which allowed us to learn mixtures of Gaussian distributions and their mixture weights. All steps are provided as C# code using Infer.NET and can be accessed here.

For a more formal treatment, the following books and links come recommended:

  1. https://dotnet.github.io/infer/InferNet101.pdf
  2. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
  3. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  4. http://mbmlbook.com/index.html
  5. http://www.jmlr.org/papers/volume6/winn05a/winn05a.pdf
  6. https://en.wikipedia.org/wiki/Exponential_family

Important concepts that were not mentioned in this post:

  1. appropriate prior parameters,
  2. identifiability,
  3. the predictive distribution,
  4. breaking symmetry,
  5. message-passing (variational message passing (VMP) & expectation propagation (EP)),
  6. the Wishart conjugate prior for the precision (inverse covariance) matrix,
  7. Occam’s razor and the Dirichlet prior.
