Getting Started

Probabilistic Linear Discriminant Analysis (PLDA) Explained

Prachi Singh
Towards Data Science
9 min read · Nov 1, 2020


Source: Image by author. Contours of Gaussian distributions (ellipses) with different means (centers), shown in different colours. Each contour corresponds to a different class.

Explaining concepts and applications of Probabilistic Linear Discriminant Analysis (PLDA) in a simplified manner.

Introduction

As the name suggests, Probabilistic Linear Discriminant Analysis (PLDA) is a probabilistic version of Linear Discriminant Analysis (LDA), with the ability to handle more complexity in data. Although PLDA has a wide variety of applications in many areas of research, including computer vision, speech processing and Natural Language Processing (NLP), it is still not explained in a way that reaches a wide audience. PLDA has been used for recognition, verification, generating similarity scores for clustering, and class-specific feature extraction.

My goal here is to discuss the research papers which introduced and applied PLDA. I am going to explain the concepts using equations, graphs and easy-to-implement code so that it is understandable to everyone working in the field of data science.

The derivations presented here require prior knowledge of basic probability and linear algebra. You can refer to the sources mentioned in the references.

Why PLDA?

Before proceeding further, I would like to give the motivation for using PLDA over LDA.

LDA is a supervised dimensionality reduction technique. It projects the data onto a lower-dimensional subspace such that, in the projected subspace, points belonging to different classes are more spread out (maximizing the between-class covariance Sb) compared to the spread within each class (minimizing the within-class covariance Sw). This is demonstrated in the figure below.

Source: "Pattern Recognition and Machine Learning" by Bishop. Each colour represents a class; m1 and m2 are the means of class 1 and class 2 respectively. The left plot shows projections of the data onto the line joining m1 and m2, where there is a lot of overlap between samples of the two classes. The right plot shows the projection obtained with LDA, which minimises the overlap between the two classes.

This works well for classification when the data comes from classes seen during training. But how do we perform similar tasks when the observed data comes from an unseen class? For example, consider the task of face recognition: we train a model on different face images such that each unique face represents a class.

Now, given two images, we want to find out whether they belong to the same person or not, even though the model has not seen any image of that person before. The common approach is to project both images into a lower-dimensional space and compute the distance between them; if the distance is small, they are from the same class. LDA will project the images into a subspace obtained from the training data, which will not be optimal for the unseen class. Thus we need a model which is more flexible in finding the optimal directions of projection. One way to address this problem is to use a probabilistic approach, unlike LDA, which is deterministic. This is called Probabilistic LDA.

Advantages of PLDA

  • We can generate a class center using continuous non-linear functions, even from a single example of an unseen class.
  • In hypothesis testing, we can compare two examples from previously unseen class(es) to determine whether they belong to the same class.
  • We can cluster samples from unseen classes.

What is Probabilistic LDA?

Let x = {x₁, x₂, …, xₙ} be D-dimensional observations or data samples. Probabilistic LDA (PLDA) is a generative model which assumes that the given data samples are generated from a distribution; we need to find the parameters of the model which best describe the training data. The choice of distribution from which the data is assumed to be generated is based on two factors: (1) it should be able to represent different types of data, and (2) computation of its parameters should be simple and fast. The most popular distribution which satisfies these conditions is the Gaussian. The figure below shows the probability density function (pdf) of a Gaussian, its contours and samples generated from it.

Source: Plots by author

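Here is a minimal sketch of how such a plot can be produced with NumPy, SciPy and Matplotlib; the mean and covariance below are arbitrary values chosen only for illustration, not the values used in the figure above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Arbitrary 2-D Gaussian parameters chosen for illustration
mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

rv = multivariate_normal(mean, cov)

# Evaluate the pdf on a grid for the contour plot
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
pdf = rv.pdf(np.dstack((xx, yy)))

# Draw samples from the same distribution
samples = rv.rvs(size=500, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].contour(xx, yy, pdf)                         # contours of the pdf
axes[0].set_title("Gaussian pdf contours")
axes[1].scatter(samples[:, 0], samples[:, 1], s=5)   # generated samples
axes[1].set_title("Samples from the Gaussian")
plt.show()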

In order to cluster data into classes, we need to represent each class with a separate Gaussian distribution; hence we can use a Gaussian Mixture Model (GMM). A GMM is a weighted mixture of Gaussians, with a different mean and covariance for each component, where each component can represent one class.

The probability density function (pdf) of a GMM is given as

p(x) = Σₖ πₖ N(x | μₖ, Φₖ)

where the weights πₖ sum to one.

Source: Plots by author. pdf of a GMM, where πₖ, μₖ, Φₖ are the weight, mean and covariance of the k-th Gaussian.

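A minimal sketch of evaluating and plotting a GMM pdf; the weights, means and covariances below are arbitrary values of my choosing.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM: weights pi_k, means mu_k, covariances Phi_k
weights = [0.4, 0.6]
means = [np.array([-2.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[1.5, 0.5], [0.5, 1.0]])]

def gmm_pdf(points):
    """p(x) = sum_k pi_k * N(x | mu_k, Phi_k)."""
    return sum(w * multivariate_normal(m, c).pdf(points)
               for w, m, c in zip(weights, means, covs))

xx, yy = np.meshgrid(np.linspace(-6, 6, 200), np.linspace(-4, 5, 200))
pdf = gmm_pdf(np.dstack((xx, yy)))

plt.contourf(xx, yy, pdf, levels=30)
plt.title("pdf of a 2-component GMM")
plt.colorbar(label="p(x)")
plt.show()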

Let y be a latent (hidden) class variable which represents the mean of a class/mixture in the GMM. Given this class variable y, the probability of generating a data sample x is

P(x | y) = N(x | y, Φw)          … (1)

where Φw represents the within-class covariance of the class. This states that once we know the class parameters of the Gaussian, we can generate samples of that class. The class variable y is itself assumed to be generated from a separate distribution; the probability of generating a particular instance of y, which represents a class, from this assumed distribution is called the prior probability.

LDA can also be modeled as a GMM, where the mean of each Gaussian component is the sample mean of the training data belonging to the respective class, and the prior probability of y is discrete, placing mass only on the class means:

P(y) = Σₖ πₖ δ(y − μₖ)

This generates the GMM discussed above. Maximising the likelihood of this model with respect to the parameters {πₖ, μₖ, Φw} recovers the standard LDA projections. But in order to handle classes not seen during training, we need to modify the prior and make it continuous, so that y can take any real value generated from the distribution

P(y) = N(y | m, Φb)          … (2)

This shows that the latent variable y for each class can be generated from a Gaussian distribution with mean m and between-class covariance Φb; hence the name probabilistic LDA. This is better explained using the figures shown below:

Source: Image by author. Small ellipses represent contours of the Gaussian distribution of each class, with the class mean at the center. Data points are shown as cross marks. The large grey ellipse represents the Gaussian distribution from which the mean of each class is generated.
Source: Images of people are from the LFW dataset; representation by author. Plots are for illustration.

Samples y1 and y2, which represent the person identities, are generated from the Gaussian distribution in eq (2). We then sample examples x1 and x2 of each class (person), representing different orientations of the person, using eq (1) with y1 and y2 as the means.
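As a minimal sketch of this generative process, with made-up values for m, Φb and Φw:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D model parameters (not learned from data)
m = np.array([0.0, 0.0])                    # global mean
Phi_b = np.array([[4.0, 0.0], [0.0, 2.0]])  # between-class covariance
Phi_w = np.array([[0.3, 0.1], [0.1, 0.2]])  # within-class covariance

# eq (2): draw a class mean for each class
y1 = rng.multivariate_normal(m, Phi_b)
y2 = rng.multivariate_normal(m, Phi_b)

# eq (1): draw examples of each class around its mean
x1 = rng.multivariate_normal(y1, Phi_w, size=5)   # 5 examples of class 1
x2 = rng.multivariate_normal(y2, Phi_w, size=5)   # 5 examples of class 2

print("class means:", y1, y2)
print("class-1 examples:\n", x1)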

Latent Space

The goal of PLDA is to project data samples to a latent space such that samples from the same class are modeled using the same distribution. These projections are represented using latent variables, which are discussed in this section.

As discussed earlier, Φb is the between-class covariance, a positive semi-definite matrix, and Φw is the within-class covariance, a positive definite matrix. We can obtain a transformation matrix V which converts Φw and Φb to diagonal matrices simultaneously:

VᵀΦwV = I,   VᵀΦbV = Ψ

where I is the identity matrix and Ψ is a diagonal matrix. Defining A = V⁻ᵀ, the covariances can be written as Φw = AAᵀ and Φb = AΨAᵀ. Thus we have decorrelated each dimension of the data samples, and the parameters of the PLDA model are {m, A, Ψ}.

The derivation of the above equations requires knowledge of linear algebra concepts such as eigenvalues, eigenvectors and the eigenvalue decomposition of a matrix; you can refer to [1] and [5] for the details.

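As a minimal sketch of how this simultaneous diagonalisation can be computed in practice, SciPy's generalised eigenvalue solver does the job; the random covariances below are only there to check the stated properties.

import numpy as np
from scipy.linalg import eigh

def plda_transform(Phi_w, Phi_b):
    """Simultaneously diagonalise Phi_w and Phi_b.

    Solves the generalised eigenvalue problem Phi_b v = psi * Phi_w v,
    so that V.T @ Phi_w @ V = I and V.T @ Phi_b @ V = diag(psi).
    """
    psi, V = eigh(Phi_b, Phi_w)       # eigenvectors are Phi_w-orthonormal
    A = np.linalg.inv(V.T)            # then Phi_w = A A^T and Phi_b = A diag(psi) A^T
    return V, A, psi

# Quick check with random covariances (illustrative only)
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Phi_w = B @ B.T + 4 * np.eye(4)       # positive definite
C = rng.standard_normal((4, 2))
Phi_b = C @ C.T                       # positive semi-definite

V, A, psi = plda_transform(Phi_w, Phi_b)
print(np.allclose(V.T @ Phi_w @ V, np.eye(4)))      # True
print(np.allclose(V.T @ Phi_b @ V, np.diag(psi)))   # True
print(np.allclose(A @ A.T, Phi_w))                  # True

# Projecting a data point x into the latent space: u = A^{-1} (x - m)
m = np.zeros(4)
x = rng.standard_normal(4)
u = np.linalg.solve(A, x - m)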

Let u and v be Gaussian random variables in the latent space, defined as

u ∼ N(u | v, I)          … (3)
v ∼ N(v | 0, Ψ)          … (4)

We can relate the data sample x and the class variable y to these latent variables as follows:

y = m + Av          … (5)
x = m + Au          … (6)

Thus u represents an example of a class and v represents the class variable in the projected space. The relationships in eq (5) and eq (6) are represented in the form of a flow chart, as shown below:

Source: Image by author. PLDA models the class center v and examples u1, u2 in the latent space, where the variables are independent. Examples x1, x2 in the original feature space are related to their latent representations u via an invertible transformation A.

If interested, you can refer to the derivation in [1].

Applications

PLDA allows us to make inferences about classes not present during training. One example is speaker recognition: the model parameters are learned from training data, but the model should handle examples from speakers not present during training. Some of the tasks in which PLDA can be used are discussed here.

Classification:

We have a set of gallery examples xᵍ ∈ {x₁, x₂, …, xM}, one from each of the M classes. Given a probe example xᵖ, the task is to find which of the classes it belongs to. This is determined by maximizing the likelihood. First, project the examples into the latent space using eq (6):

uᵍ = A⁻¹(xᵍ − m),   uᵖ = A⁻¹(xᵖ − m)

This decorrelates the data, since within each class the covariance of u is I. P(uᵖ|uᵍ) gives the probability of the probe example coming from the same class as a known gallery example.

Therefore, the class C assigned to the probe example is given as

C = arg maxₖ P(uᵖ | uᵍₖ),   k = 1, …, M

Computation of P(uᵖ|uᵍ): since the prior of v is N(0, Ψ) and uᵍ ∼ N(v, I), the posterior of v given uᵍ is Gaussian with mean Ψ(Ψ + I)⁻¹uᵍ, and marginalising over v gives

P(uᵖ | uᵍ) = N(uᵖ | Ψ(Ψ + I)⁻¹uᵍ, I + Ψ(Ψ + I)⁻¹)
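As a minimal sketch of this classification rule, assuming Ψ is diagonal and stored as the vector psi, and that the gallery and probe examples are already projected to the latent space (the toy values below are mine):

import numpy as np
from scipy.stats import multivariate_normal

def log_p_probe_given_gallery(u_p, u_g, psi):
    """log P(u_p | u_g) under the PLDA model, psi = diagonal of Psi."""
    shrink = psi / (psi + 1.0)                 # Psi (Psi + I)^{-1}, element-wise
    mean = shrink * u_g                        # predictive mean
    cov = np.diag(1.0 + shrink)                # predictive covariance
    return multivariate_normal(mean, cov).logpdf(u_p)

def classify(u_p, gallery, psi):
    """Index of the gallery example most likely to share the probe's class."""
    scores = [log_p_probe_given_gallery(u_p, u_g, psi) for u_g in gallery]
    return int(np.argmax(scores))

# Toy usage with made-up latent vectors and Psi
psi = np.array([5.0, 2.0, 0.5])
gallery = [np.array([1.0, -0.5, 0.2]), np.array([-2.0, 1.5, 0.0])]
probe = np.array([0.8, -0.4, 0.1])
print(classify(probe, gallery, psi))           # 0: closest to the first class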

Class Inference:

One of the advantages of PLDA is that we can estimate the class variable y even from a single example of the class. For an example x, we compute the posterior probability of y given x, denoted p(y|x), which is again a Gaussian. The estimate of y is obtained by maximising p(y|x) with respect to y, which is simply the mean of the Gaussian p(y|x). In the latent space the posterior mean of v given u is Ψ(Ψ + I)⁻¹u, so the estimate can be written as

ŷ = m + A Ψ(Ψ + I)⁻¹ A⁻¹(x − m)
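A minimal sketch of this estimate, with psi again holding the diagonal of Ψ and made-up parameter values:

import numpy as np

def infer_class_variable(x, m, A, psi):
    """MAP estimate of the class variable y from a single example x.

    psi is the diagonal of Psi; A is assumed to be the invertible PLDA transform.
    """
    u = np.linalg.solve(A, x - m)          # project to the latent space
    v_hat = (psi / (psi + 1.0)) * u        # posterior mean of v given u
    return m + A @ v_hat                   # map back to the original space

# Toy usage with made-up parameters
m = np.zeros(3)
A = np.array([[2.0, 0.0, 0.0], [0.3, 1.0, 0.0], [0.0, 0.2, 0.5]])
psi = np.array([5.0, 2.0, 0.5])
x = np.array([1.0, -0.3, 0.4])
print(infer_class_variable(x, m, A, psi))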

Hypothesis testing:

Given two examples (u1 and u2 in the latent space) from unseen class(es), if we need to find whether they belong to the same class or not, we compute a likelihood ratio R based on two hypotheses: Hs, that the two examples share the same class variable v, and Hd, that they come from different classes.

R = P(u1, u2 | Hs) / [ P(u1 | Hd) P(u2 | Hd) ]

Under Hs the pair (u1, u2) is jointly Gaussian with covariance Ψ + I for each example and Ψ between them, while under Hd the two examples are independent, each with covariance Ψ + I. The log of R is used as the PLDA score.
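A minimal sketch of the log-likelihood ratio, built directly from the two Gaussians above (psi holds the diagonal of Ψ; the toy vectors are mine):

import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(u1, u2, psi):
    """Log-likelihood ratio log R for two latent examples (psi = diag of Psi)."""
    d = len(psi)
    Psi = np.diag(psi)
    # Same-class hypothesis: joint Gaussian over [u1, u2]
    cov_same = np.block([[Psi + np.eye(d), Psi],
                         [Psi, Psi + np.eye(d)]])
    log_same = multivariate_normal(np.zeros(2 * d), cov_same).logpdf(np.concatenate([u1, u2]))
    # Different-class hypothesis: independent marginals
    marg = multivariate_normal(np.zeros(d), Psi + np.eye(d))
    log_diff = marg.logpdf(u1) + marg.logpdf(u2)
    return log_same - log_diff

# Toy usage
psi = np.array([5.0, 2.0, 0.5])
u1 = np.array([1.0, -0.5, 0.2])
u2 = np.array([0.9, -0.6, 0.1])
print(plda_llr(u1, u2, psi))   # a large positive value suggests the same class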

Clustering:

PLDA is also used to cluster examples into groups. Based on the log-likelihood ratio R, i.e. the PLDA scores, we compare each example with all the other examples. This creates a PLDA score matrix, analogous to a similarity score matrix, which can then be fed to standard clustering algorithms such as k-means or agglomerative clustering; a sketch is given below.
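A minimal sketch of this step, converting a symmetric PLDA score matrix into distances and applying agglomerative clustering with SciPy; the helper name cluster_from_scores and the toy matrix are mine.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_scores(S, num_clusters):
    """Agglomerative clustering from a symmetric PLDA score matrix S."""
    D = S.max() - S                 # convert similarity scores to distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=num_clusters, criterion="maxclust")

# Toy usage: score matrix for 4 segments forming two obvious groups
S = np.array([[10.0,  8.0, -3.0, -4.0],
              [ 8.0, 10.0, -2.0, -3.0],
              [-3.0, -2.0, 10.0,  7.0],
              [-4.0, -3.0,  7.0, 10.0]])
print(cluster_from_scores(S, num_clusters=2))   # e.g. [1 1 2 2]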

Example

Speaker diarization: the task of partitioning an input audio stream into segments based on the speaker sources.

The procedure is as follows:

  1. Divide audio into short segments such that each segment contains only one speaker.
  2. Extract features for each segment
  3. Compute PLDA scores matrix using pre-trained PLDA
  4. Perform clustering using scores matrix
Source: Image by author. Speaker Diarization pipeline.

I used the speech recognition toolkit Kaldi to complete steps 1 and 2 and to train the PLDA model.

The following code involves: read features -> apply PCA -> project into the PLDA latent space -> compute the PLDA score matrix. The complete code, along with the features and the pre-trained model, can be found on GitHub.

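Below is a minimal sketch of these steps. The file names, and the assumption that the Kaldi-trained PLDA parameters {m, A, Ψ} have been exported as NumPy arrays, are placeholders of mine and not from the linked repository; plda_llr is the log-likelihood ratio defined in the hypothesis-testing section.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.decomposition import PCA

# Assumed inputs (placeholder file names): segment embeddings X and
# PLDA parameters {m, A, psi} exported as NumPy arrays.
X = np.load("segment_embeddings.npy")     # shape: (num_segments, feat_dim)
m = np.load("plda_mean.npy")
A = np.load("plda_transform_A.npy")
psi = np.load("plda_psi.npy")

# Apply PCA so the dimensionality matches the PLDA model
X_pca = PCA(n_components=len(psi)).fit_transform(X)

# Project into the PLDA latent space: u = A^{-1} (x - m)
U = np.linalg.solve(A, (X_pca - m).T).T

def plda_llr(u1, u2, psi):
    """Log-likelihood ratio (PLDA score) for a pair of latent vectors."""
    d = len(psi)
    Psi = np.diag(psi)
    cov_same = np.block([[Psi + np.eye(d), Psi], [Psi, Psi + np.eye(d)]])
    log_same = multivariate_normal(np.zeros(2 * d), cov_same).logpdf(np.concatenate([u1, u2]))
    marg = multivariate_normal(np.zeros(d), Psi + np.eye(d))
    return log_same - marg.logpdf(u1) - marg.logpdf(u2)

# Pairwise PLDA score matrix
n = len(U)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        S[i, j] = plda_llr(U[i], U[j], psi)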

Analysis

  1. PLDA latent representations (u): The plot below shows the effect of projecting data into the PLDA latent space. We can see that when the data is projected into the PLDA latent space, it becomes separable into speakers, even though the PLDA model has never seen audio from these speakers.
2-D projections of 128-d features. The left plot shows PCA-transformed embeddings; the right plot shows PLDA latent representations. Each colour represents one speaker. Plot by author.

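A minimal sketch of such a plot, assuming X_pca and U from the previous sketch and per-segment speaker labels stored in a placeholder file; only the first two dimensions of each representation are plotted here.

import numpy as np
import matplotlib.pyplot as plt

# Assumed input: reference speaker label per segment (placeholder file name)
labels = np.load("segment_speaker_labels.npy")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, data, title in [(axes[0], X_pca, "PCA embeddings"),
                        (axes[1], U, "PLDA latent representations")]:
    # Scatter the first two dimensions, coloured by speaker
    for spk in np.unique(labels):
        pts = data[labels == spk]
        ax.scatter(pts[:, 0], pts[:, 1], s=8, label=str(spk))
    ax.set_title(title)
axes[0].legend(title="speaker")
plt.show()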

2. Log-likelihood ratio (PLDA score) matrix: We can compute a similarity score matrix S by finding the score S(i, j) between segments xᵢ and xⱼ for all segments i and j extracted from the audio. The plot below shows the cosine score matrix and the PLDA score matrix, each normalized by dividing by the highest score in the matrix. The PLDA scores show higher contrast, indicating higher confidence in deciding between same and different speakers. Lighter colours (shades of yellow) indicate high scores, whereas darker colours (shades of blue) indicate low scores. We can see blocks of light and dark colours, which help to identify same-speaker and different-speaker regions easily.

Comparison of Cosine affinity scores matrix and PLDA affinity scores matrix. Plot by author.

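A minimal sketch, assuming X_pca and the PLDA score matrix S from the earlier sketches:

import numpy as np
import matplotlib.pyplot as plt

# Cosine similarity matrix from the PCA embeddings
X_norm = X_pca / np.linalg.norm(X_pca, axis=1, keepdims=True)
S_cos = X_norm @ X_norm.T

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, M, title in [(axes[0], S_cos, "Cosine scores"),
                     (axes[1], S, "PLDA scores")]:
    im = ax.imshow(M / M.max(), cmap="viridis")   # normalize by the largest score
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.show()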

3. Histogram: The plot below shows the distribution of PLDA scores. Higher counts are present at the extremes, which helps in better clustering.

x-axis: Normalized PLDA scores, y-axis: Count. Plot by author.

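A minimal sketch, again assuming the PLDA score matrix S from above:

import numpy as np
import matplotlib.pyplot as plt

mask = ~np.eye(len(S), dtype=bool)      # drop self-comparisons on the diagonal
scores = (S / S.max())[mask]
plt.hist(scores, bins=50)
plt.xlabel("Normalized PLDA score")
plt.ylabel("Count")
plt.show()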

Summary

  1. PLDA is a generative model in which we assume that the data samples x of a class are generated from a Gaussian distribution whose mean, the class variable y, is itself generated from another Gaussian distribution called the prior.
  2. For the task of recognition, we can compare two examples from unseen classes using PLDA scores, i.e. by comparing the likelihood that the examples come from the same class against the likelihood that they come from different classes.
  3. We can cluster examples into classes using PLDA scores between all pairs of examples from the entire set.

References

  1. Ioffe, Sergey. “Probabilistic linear discriminant analysis.” In European Conference on Computer Vision, pp. 531–542. Springer, Berlin, Heidelberg, 2006.
  2. Prince, Simon JD, and James H. Elder. “Probabilistic linear discriminant analysis for inferences about identity.” In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. IEEE, 2007.
  3. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
  4. Duda, Jarek. “Gaussian AutoEncoder.” arXiv preprint arXiv:1811.04751, 2018
  5. Strang, Gilbert. Introduction to Linear Algebra. 5th ed. Wellesley-Cambridge Press, 2016. ISBN: 9780980232776
