
Feature Transformations: A Tutorial on PCA and LDA

Reducing the dimension of a dataset using methods such as PCA

Photo by Nicole Cagnina on Unsplash

Introduction

When dealing with high-dimensional data, it is common to use methods such as Principal Component Analysis (PCA) to reduce the dimension of the data. This converts the data to a different (lower dimension) set of features. This contrasts with feature subset selection, which selects a subset of the original features (see [1] for a tutorial on feature selection).

PCA is a linear transformation of the data to a lower dimension space. In this article we start off by explaining what a linear transformation is. Then we show with Python examples how PCA works. The article concludes with a description of Linear Discriminant Analysis (LDA), a supervised linear transformation method. Python code for the methods presented in this article is available on GitHub.

Linear Transformations

Imagine that after a holiday Bill owes Mary £5 and $15 that needs to be paid in euro (€). The rates of exchange are: £1 = €1.15 and $1 = €0.93. So the debt in € is:

(5 × 1.15) + (15 × 0.93) = 5.75 + 13.95 = €19.70

Here we are converting a debt in two dimensions (£, $) to one dimension (€). Three examples of this are illustrated in Figure 1: the original (£5, $15) debt and two other debts of (£15, $20) and (£20, $35). The green dots are the original debts and the red dots are the debts projected into a single dimension. The red line is this new dimension.

Figure 1. An illustration of how converting £,$ debts to € is a linear transformation. Image by author.

On the left in the figure we can see how this can be represented as matrix multiplication. The original dataset is a 3 × 2 matrix (3 samples, 2 features), the rates of exchange form a vector of two components and the output is a vector of 3 components. The exchange rate matrix is the transformation; if the exchange rates are changed then the transformation changes.

We can perform this matrix multiplication in Python using the code below. The matrices are represented as numpy arrays; the final line calls the dot method to perform the matrix multiplication (dot product) with the cur exchange-rate matrix. This returns the matrix [19.7, 35.85, 55.55].
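A minimal sketch of this calculation (the name of the debts array, deb, is an assumption; the exchange-rate array is called cur as in the text):

```python
import numpy as np

# Debts in (£, $): one row per debt
deb = np.array([[5, 15],
                [15, 20],
                [20, 35]])

# Exchange rates to €: £1 = €1.15 and $1 = €0.93
cur = np.array([1.15, 0.93])

# The matrix multiplication projects each (£, $) debt onto a single € value
print(deb.dot(cur))   # [19.7  35.85 55.55]
```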

The general format for such data transformations is shown in Figure 2. Y is the original dataset (n samples, d features); this is reduced to k features in X′ by multiplying by the transformation matrix P, which has dimension d × k.

Figure 2. If we have a dataset Y of n samples described by d features, this can be reduced to k features (X’) by multiplying by the transformation matrix P of dimension d x k. Image by author.
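As a quick check of these shapes, here is a minimal numpy sketch using random placeholder matrices:

```python
import numpy as np

n, d, k = 100, 5, 2
Y = np.random.rand(n, d)   # dataset: n samples, d features (placeholder values)
P = np.random.rand(d, k)   # transformation matrix of dimension d x k (placeholder values)

X_dash = Y.dot(P)          # the transformed data X'
print(X_dash.shape)        # (100, 2)
```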

Principal Components Analysis

In general the transformation matrix P determines the transformation. In the example in Figure 1, the transformation is determined by the rates of exchange, and these are given. If we wish to use PCA to reduce the dimension of a dataset, how do we decide on the nature of the transformation? Well, PCA is driven by three principles:

  1. Select a transformation that preserves the spread in the data, i.e. prefer dimensions that preserve distances between data points.
  2. Select dimensions that are orthogonal to each other (no redundancy).
  3. Select k dimensions that capture most of the variance in the data (say 90%).

These principles are illustrated in Figure 3. We have a dataset of individuals described by two features, waist measurement and weight. These features are correlated with each other so the objective is to project the data into different, uncorrelated dimensions without losing the ‘spread’ in the data. These new dimensions are the principal components that give PCA its name. As an alternative to thinking about this as a projection, you can think of it as rotating the data cloud to align with the red axes in Figure 3. Either way, the new axes are PC1, the first principal component and PC2 which is perpendicular to PC1. If it were felt that PC1 captured enough of the variation in the data then PC2 might be dropped.

Figure 3. A 2D dataset where the features are weight and waist measurement. The first principal component (PC1) should be in the direction that captures most of the variance in the data. PC2 should be orthogonal to PC1 so that they are independent. Image by author.

The steps to perform PCA on a data matrix Y as shown in Figure 2 are as follows:

  1. Calculate the means and standard deviations of the columns of Y.
  2. Subtract the column means from each row of Y and divide by the standard deviation to create the normalised centred matrix Z.
  3. Calculate the covariance matrix C = 1/(n-1) Zᵀ Z where Zᵀ is the transpose of Z.
  4. Calculate the eigenvectors and eigenvalues of the covariance matrix C.
  5. Examine the eigenvalues in descending order to determine the number of dimensions k to be retained – this is the number of principal components.
  6. The top k eigenvectors make up the columns of the transformation matrix P which has dimension (d × k).
  7. The data is transformed by X′ = ZP where X′ has dimension (n × k).

In the following examples we will work with the Harry Potter TT dataset that is shared on GitHub. The format of the data is shown below. There are 22 rows and five descriptive features. We will use PCA to compress this to two dimensions.

Python code to do this is shown below. Y_df is a Pandas dataframe holding the dataset. The eigenvalues and eigenvectors (ev, evec) are calculated from the covariance matrix. If we examine the eigenvalues they tell us the amount of variance captured in each PC: [49%, 31%, 11%, 5%, 4%]. The first two PCs retain 80% (49% + 31%) of the variance in the data so we decide to go with two PCs. X_dash contains the data projected into two dimensions; this projection is shown in Figure 4. It could be argued that the PC1 dimension represents competence/incompetence and the PC2 dimension represents goodness. Fred & George Weasley (twins) are plotted at the same point as they have exactly the same feature values in the original dataset.
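A sketch of the steps just described, assuming the dataset has been saved as a CSV with the character names in the first column (the file name is a placeholder):

```python
import numpy as np
import pandas as pd

# Load the Harry Potter TT dataset (placeholder file name)
Y_df = pd.read_csv('HarryPotterTT.csv', index_col=0)

# Steps 1 & 2: centre each column and divide by its standard deviation
Z = (Y_df - Y_df.mean()) / Y_df.std()

# Step 3: the covariance matrix of the normalised data
C = np.cov(Z, rowvar=False)

# Step 4: eigenvalues and eigenvectors of the covariance matrix
ev, evec = np.linalg.eig(C)

# Step 5: proportion of variance captured by each component
print(ev / ev.sum())

# Steps 6 & 7: keep the top two eigenvectors and project the data
top2 = np.argsort(ev)[::-1][:2]
P = evec[:, top2]
X_dash = Z.dot(P)
```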

Figure 4. The Harry Potter dataset projected into two dimensions (2 PCs). Image by author.

If we use the PCA implementation in scikit-learn we can do this in three lines of code:
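A sketch of what those three lines might look like, again using Y_df for the dataframe of descriptive features:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Z = StandardScaler().fit_transform(Y_df)   # normalise the data
pca = PCA(n_components=2)                  # set up a PCA object for 2 components
X_dash = pca.fit_transform(Z)              # project the data onto the first 2 PCs
```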

First the data is normalised, then the PCA object is set up and finally the data transformation is applied. Again, the code for this is available in the notebook on GitHub.

Linear Discriminant Analysis

It should be clear that PCA is inherently an unsupervised ML technique in that it does not take any class labels into account. In fact, PCA will not necessarily be effective in a supervised learning context – this should not be surprising given the focus on maintaining the spread in the data without consideration for class labels. In Figure 5 we see how PCA does on the penguins dataset [2]. This is a three-class dataset described by four features (also available on GitHub).

Figure 5. These scatter plots compare the performance of PCA and LDA on the penguins dataset. PCA does not do so well as it does not take class label information into account. Image by author.

On the right in Figure 5 we see how Linear Discriminant Analysis (LDA) performs on the same dataset. LDA takes the class labels into account and seeks a projection that maximises the separation between the classes. The objective is to uncover a transformation that will maximise between-class separation and minimise within-class separation. These can be captured in two matrices, S𝒷 for between-class separation and S𝓌 for within-class separation:

S𝒷 = Σ𝒸 n𝒸 (μ𝒸 − μ)(μ𝒸 − μ)ᵀ

S𝓌 = Σ𝒸 Σ_{xⱼ ∈ c} (xⱼ − μ𝒸)(xⱼ − μ𝒸)ᵀ

where n𝒸 is the number of objects in class c, μ is the mean of all examples and μ𝒸 is the mean of all examples in class c:

μ = (1/n) Σⱼ xⱼ        μ𝒸 = (1/n𝒸) Σ_{xⱼ ∈ c} xⱼ

The components within these summations (μ, μ𝒸, xⱼ) are vectors of dimension p, so S𝒷 and S𝓌 are matrices of dimension p × p. The objectives of maximising between-class separation and minimising within-class separation can be combined into a single maximisation called the Fisher criterion:

J(W) = |Wᵀ S𝒷 W| / |Wᵀ S𝓌 W|

The LDA transformation Wˡᵈᵃ is the W that maximises this criterion.

This formulation and the task of finding the best Wˡᵈᵃ matrix is discussed in more detail in [3]. For now we just need to recognise that Wˡᵈᵃ has the same role as the P matrix in PCA. It has dimension p × k and will project the data into a k dimension space that maximises between-class separation and minimises within-class separation. These objectives are represented by the two S matrices. We see on the right in Figure 5 that it does a pretty good job.

While the inner workings of LDA might seem complicated, it is very straightforward in scikit-learn as can be seen in the code block below. This is very similar to the PCA code above; the main difference is that the y target variable is considered when the LDA is fitted to the data; this is not the case with PCA.
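A sketch of this, assuming X holds the four penguin features and y the species labels (both variable names are assumptions):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: the four penguin measurements, y: the species labels (assumed names)
Z = StandardScaler().fit_transform(X)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(Z, y)   # unlike PCA, the y labels are used in the fit
```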

Conclusion

The objective for this article was to explain the general principles underlying data transformations, to show how PCA and LDA work in scikit-learn and to present some examples of these methods in operation.

The code and data for these examples are available on GitHub. A more in-depth treatment of these methods is presented in this arXiv report [3].


References

[1] P. Cunningham, "Feature Subset Selection", Towards Data Science, 2022, [Online], https://towardsdatascience.com/feature-subset-selection-6de1f05822b0

[2] A.M. Horst, A.P. Hill, K.B. Gorman, palmerpenguins: Palmer Archipelago (Antarctica) penguin data, 2020, doi:10.5281/zenodo.3960218, R package version 0.1.0, https://allisonhorst.github.io/palmerpenguins/.

[3] P. Cunningham, B. Kathirgamanathan, & S.J. Delany, Feature selection tutorial with python examples, 2021, arXiv preprint arXiv:2106.06437.

