Written by Chandan Durgia and Prasun Biswas

Truth be told, with the increasing democratization of the AI/ML world, many practitioners, both novice and experienced, have jumped the gun and missed some of the nuances of the underlying mathematics.
This is very understandable as well. The AI/ML world can be overwhelming for anyone, for multiple reasons:
a. One has to learn an ever-growing coding language (Python/R), tons of statistical techniques, and finally understand the domain as well.
b. The pace at which AI/ML techniques are growing is incredible. We recently read that roughly 100 AI/ML research papers are published every day.
c. The underlying math can be difficult if you are not from a quantitative background. The unfortunate part is that this is not limited to complex topics like neural networks; it is true even for basic concepts like regression, classification problems, dimensionality reduction, etc.
Through this article, we intend to tick off at least two widely used topics once and for all:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
Both these topics are "dimensionality reduction techniques" and have somewhat similar underlying math. We have covered t-SNE in a separate article earlier (link).
When one thinks of dimensionality reduction techniques, quite a few questions pop up:
A) Why dimensionality reduction? What does it mean to reduce dimensionality?
B) How is linear algebra related to dimensionality reduction?
C) Why do we need to do linear transformation?
D) How are eigenvalues and eigenvectors related to dimensionality reduction?
E) Could there be multiple eigenvectors depending on the level of transformation?
F) How are the objectives of LDA and PCA different, and how do they lead to different sets of eigenvectors?
G) Is there more to PCA than what we have discussed? Should we choose all the principal components?
H) Is the calculation similar for LDA other than using the scatter matrix?
I) PCA vs LDA: key areas of difference, and when should we use which?
and many more…
We have tried to answer most of these questions in the simplest way possible. Feel free to respond to the article if you feel any particular concept needs to be further simplified.
Let’s kick it off.
A) Why dimensionality reduction? What does it mean to reduce dimensionality?
When a data scientist deals with a data set having a lot of variables/features, there are a few issues to tackle:
a) With too many features, the code becomes slow to execute, especially for techniques like SVMs and neural networks which take a long time to train.
b) Many of the variables often do not add much value.
Dimensionality reduction is a way to reduce the number of independent variables or features.
The figure below depicts the goal of the exercise, wherein X1 and X2 encapsulate the characteristics of Xa, Xb, Xc, etc.

B) How is Linear Algebra related to dimensionality reduction?
One can think of the features as the dimensions of the coordinate system.

*The 4-feature diagram is illustrative
Though the objective is to reduce the number of features, it shouldn’t come at the cost of the model’s explainability, i.e. how much of the dependent variable can be explained by the independent variables. Therefore, dimensionality should be reduced under the following constraint: "the relationships of the various variables in the dataset should not be significantly impacted."
And this is where linear algebra pitches in (take a deep breath). In simple words, linear algebra is a way to look at any data point/vector (or set of data points) in a coordinate system from various lenses.
Let’s decode this.
Consider a coordinate system with points A and B at (0,1) and (1,0). Now, to visualize these data points through a different lens (coordinate system), we make the following amendments to our coordinate system:
a) Rotate
b) Stretch or Squish

As you can see above, the new coordinate system is rotated by a certain angle and stretched. Note that these are still the same data points, but we have changed the coordinate system, and in the new system they sit at (1,2) and (3,0).
If you analyze closely, both coordinate systems have the following characteristics:
a) All lines remain lines, i.e. lines do not turn into curves.
b) Origin (0,0) remains the same.
c) Stretching/Squishing still keeps grid lines parallel and evenly spaced.
This is the essence of linear algebra or linear transformation.
Whenever a linear transformation is applied, it simply moves a vector from one coordinate system into a new one that is stretched/squished and/or rotated.
In fact, the above three characteristics are the defining properties of a linear transformation. It is important to note that because of these three characteristics, even though we are moving to a new coordinate system, the relationship of some special vectors will not change, and that is the part we will leverage.
Interesting fact: multiplying a vector by a matrix has the combined effect of rotating and stretching/squishing it.
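To make this concrete, here is a minimal NumPy sketch of such a transformation. The 2×2 matrix below is not given in the text; it is inferred from the example mapping (0,1)→(1,2) and (1,0)→(3,0), so treat it as illustrative:

```python
import numpy as np

# A 2x2 transformation matrix inferred from the example above:
# it maps (1, 0) -> (3, 0) and (0, 1) -> (1, 2), i.e. it rotates/stretches the grid.
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

# Points A and B in the original coordinate system.
points = np.array([[0.0, 1.0],   # point A
                   [1.0, 0.0]])  # point B

# Applying the linear transformation = multiplying each point (vector) by the matrix.
transformed = points @ A.T

print(transformed)  # [[1. 2.]
                    #  [3. 0.]]
```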
C) Why do we need to do linear transformation?
Linear transformation helps us achieve the following 2 things:
a) Seeing the world through different lenses, which could give us different insights.
b) In these two different worlds, there could be certain data points (vectors) whose characteristics and relative positions won’t change.
For #b above, consider the picture below with 4 vectors A, B, C, D, and let’s analyze closely what changes the transformation has brought to these 4 vectors.


So, something interesting happened with vectors C and D. Even with the new coordinates, the direction of these vectors remained the same and only their length changed.
Get ready for the magical part !!
If we can manage to align all (or most of) the vectors (features) in this 2-dimensional space with one of these vectors (C or D), we would be able to move from a 2-dimensional space to a straight line, which is a 1-dimensional space.
Voila – Dimensionality reduction achieved !!
This process can be thought of from a higher-dimensional perspective as well, i.e. if our data has 3 dimensions, we can reduce it to a "plane" in 2 dimensions (or a line in 1 dimension), and to generalize, if we have data in n dimensions, we can reduce it to n-1 or fewer dimensions.
Note that in the real world it is impossible for all the vectors to lie on the same line. Therefore, for the points which are not on the line, their "projections" onto the line are taken (details below). By projecting these vectors we lose some explainability, but that is the cost we need to pay for reducing dimensionality. Again, explainability is the extent to which the independent variables can explain the dependent variable.
D) How are eigenvalues and eigenvectors related to dimensionality reduction?
These vectors (C and D), whose direction does not change under the transformation, are called eigenvectors, and the amounts by which they get scaled are called eigenvalues. As you would have gauged from the description above, these are fundamental to dimensionality reduction and will be used extensively in the rest of this article.
Eigenvalue for C = 3 (the vector is scaled to 3 times its original size)
Eigenvalue for D = 2 (the vector is scaled to 2 times its original size)
The crux is, if we can define a way to find eigenvectors and then project our data elements onto these vectors, we would be able to reduce the dimensionality.
The key characteristic of an eigenvector is that it remains on its span (line) and does not rotate; it only changes in magnitude. i.e. for any eigenvector v1, if we apply a transformation A (rotation and stretching), the vector v1 only gets scaled by a factor lambda1, i.e. A·v1 = lambda1·v1. Here lambda1 is called the eigenvalue.
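A quick sketch of what "staying on its span" means numerically, reusing the same illustrative matrix as before (whose eigenvalues conveniently come out as 3 and 2, matching C and D above):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])  # same illustrative transformation as above

# Eigen decomposition: the columns of eigvecs are the eigenvectors.
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)  # 3. and 2. (order may vary) -> matching C and D above

# For each eigenvector v, A @ v is just v scaled by its eigenvalue: A @ v = lambda * v.
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))  # True, True
```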

E) Could there be multiple eigenvectors depending on the level of transformation?
Yes, depending on the level of transformation (rotation and stretching/squishing) there could be different Eigenvectors. So, depending on our "objective" of analyzing data we can define the transformation and the corresponding Eigenvectors. Note that the "objective" of the exercise is important, and this is the reason for the difference in LDA and PCA.
F) How are the objectives of LDA and PCA different, and how do they lead to different sets of eigenvectors?
For PCA, the objective is to ensure that we capture the variability of our independent variables to the extent possible. In the following figure we can see the variability of the data in a certain direction. Note that our original data has 6 dimensions; this is just an illustrative figure in a two-dimensional space.

The measure of "variability" of multiple variables together is captured using the covariance matrix. In our case, the input dataset has 6 dimensions (features a through f), and covariance matrices are always of shape (d × d), where d is the number of features.

Here, the covariance between two features x and y is calculated as:
cov(x, y) = (1/N) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
Where N is the number of observations and x̄, ȳ are the means of the respective features.
So, this would be the matrix on which we would calculate our Eigen vectors.
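As a minimal sketch (with randomly generated numbers standing in for the 6 features a–f), the covariance matrix could be computed like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # 100 observations x 6 features (a..f), illustrative data

X_centered = X - X.mean(axis=0)        # subtract the mean of each feature
cov = (X_centered.T @ X_centered) / X.shape[0]   # (1/N) * sum of (x - x_bar)(y - y_bar)

print(cov.shape)                                              # (6, 6) -> always (d x d)
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))   # True: matches NumPy's built-in
```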
On the other hand, Linear Discriminant Analysis (LDA) tries to solve a "supervised" classification problem, wherein the objective is NOT to understand the variability of the data, but to maximize the separation of known categories. In LDA the covariance matrix is "substituted" by a scatter matrix which in essence captures the characteristics of a "between class" and "within class" scatter.
G) Is there more to PCA than what we have discussed? Should we choose all the principal components?
There are some additional details. So, in this section we would build on the basics we have discussed till now and drill down further.
So, let’s break it down end-to-end:
a. Assume a dataset with 6 features. As mentioned earlier, this means that the dataset can be visualized (if that were possible) in 6-dimensional space. Just for illustration, let’s say this space looks like:

b. To reduce the dimensionality, we have to find the eigenvectors onto which these points can be projected. Since the objective here is to capture the "variation" of these features, we can calculate the covariance matrix as depicted above in #F.
c. Now, we can solve the eigenvector equation C·v = λ·v (equivalently, det(C − λI) = 0) to calculate the eigenvectors (EV1 and EV2) for this matrix. Note that PCA is built in such a way that the first principal component accounts for the largest possible variance in the data; since the components are all orthogonal, everything follows iteratively.


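Continuing the illustrative sketch from above, the eigenvectors of the covariance matrix can be obtained with an eigen-solver for symmetric matrices and sorted so that EV1 corresponds to the largest variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                      # illustrative 6-feature data, as before
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / X.shape[0]

# eigh is meant for symmetric matrices: it returns real eigenvalues and orthogonal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort in descending order so the first eigenvector (EV1) captures the most variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)           # variance captured along each principal direction
print(eigvecs[:, :2])    # EV1 and EV2: the first two principal directions
```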
Why are eigenvectors perpendicular?
If the matrix used (covariance matrix or scatter matrix) is symmetric, then its eigenvalues are real numbers and its eigenvectors are perpendicular (orthogonal). If not, the eigenvalues and eigenvectors can be complex numbers.
The way to convert any matrix into a symmetric one is to multiply it by its transpose. Later, in the scatter matrix calculation, we will use this to convert a matrix into a symmetric one before deriving its eigenvectors.
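A small sketch of that point: multiplying an arbitrary (generally non-symmetric) matrix by its transpose yields a symmetric matrix, whose eigenvalues are real and whose eigenvectors are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))          # an arbitrary, generally non-symmetric matrix

S = M.T @ M                          # multiplying by the transpose makes it symmetric
print(np.allclose(S, S.T))           # True

eigvals, eigvecs = np.linalg.eigh(S) # real eigenvalues, orthogonal eigenvectors
print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))  # True: eigenvectors are orthonormal
```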
d. Once we have the eigenvectors from the above equation, we can project the data points onto these vectors. For simplicity’s sake, we are assuming 2-dimensional eigenvectors.

Note that, as expected, projecting a vector onto a line loses some explainability, i.e. for the vector a1 in the figure above, its projection onto EV2 is 0.8·a1. This is the reason principal components are written as weighted proportions of the individual vectors/features.
EV1 = PC1 = 0.8·a1 + 0.9·a2 + 0.78·a3 + …
One interesting point to note is that one of the eigenvectors calculated will effectively be the line of best fit of the data, and the other vector will be perpendicular (orthogonal) to it.
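As a sketch of the projection step (again on illustrative random data), each principal-component score is simply the centred data projected onto the corresponding eigenvector, i.e. a weighted combination of the original features, with the weights playing the role of the 0.8, 0.9, 0.78, … coefficients above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Project onto the first two eigenvectors (EV1, EV2): each column of `scores`
# is a weighted combination of the original 6 features.
W = eigvecs[:, :2]            # (6, 2) weight matrix; its entries are the PC loadings
scores = X_centered @ W       # (100, 2): the data expressed in the reduced 2-D space

print(W[:, 0])                # loadings of PC1 on features a..f
print(scores.shape)           # (100, 2)
```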
e. In the above example, 2 principal components (EV1 and EV2) were chosen for simplicity’s sake. In general, as many eigenvectors as there are features can be derived, and the user may choose how many principal components to keep depending on the purpose of the exercise. This is driven by how much explainability one would like to capture, and the number is usually derived using a scree plot.
A scree plot is used to determine how many principal components provide "real value" in explaining the data. "Real value" means whether adding another principal component would improve explainability meaningfully. On a scree plot, the point where the slope of the curve levels off (the "elbow") indicates the number of components that should be used in the analysis.
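A minimal sketch of the numbers behind a scree plot, using the illustrative data from above: the explained-variance ratio of each component is its eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / X.shape[0]

eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, largest first

explained = eigvals / eigvals.sum()                # proportion of variance per component
print(np.round(explained, 3))
print(np.round(np.cumsum(explained), 3))           # cumulative explainability

# A scree plot is simply `explained` (or `eigvals`) plotted against the component index;
# the "elbow" of that curve suggests how many components to keep.
```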

H) Is the calculation similar for LDA other than using the scatter matrix?
Unlike PCA, LDA is a supervised learning algorithm, wherein the purpose is to classify a set of data in a lower-dimensional space. In other words, the objective is to create a new linear axis and project the data points onto that axis such that separability "between classes" is maximized while variance "within class" is minimized (see the figure below).
This can be mathematically represented as:
a) Maximize the class separability, i.e. maximize the distance between the class means, i.e. maximize the square of the difference of the means of the two classes: (Mean(a) − Mean(b))^2
b) Minimize the variation within each category, i.e. minimize the spread of the data: (Spread(a)^2 + Spread(b)^2)

Note that for LDA, the rest of the process from #b to #e is the same as for PCA, with the only difference being that for #b a scatter matrix is used instead of the covariance matrix. The formulas for both scatter matrices are quite intuitive:
Between-class scatter is given by:
S_B = Σᵢ Nᵢ (mᵢ − m)(mᵢ − m)ᵀ
Where m is the combined mean of the complete data, mᵢ is the mean of the respective class, and Nᵢ is the number of observations in that class.
Within-class scatter is given by:
S_W = Σᵢ Σ_{x ∈ class i} (x − mᵢ)(x − mᵢ)ᵀ
Where x denotes the individual data points and mᵢ is the mean of the respective class.
Intuitively, these capture the spread within each class and the distance between the classes, which are exactly what we need in order to maximize class separability.
Please note that in both cases each difference vector is multiplied by its own transpose (an outer product), which makes the scatter matrices symmetric. As discussed, this is done so that the eigenvalues are real and the eigenvectors are perpendicular.
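To make this concrete, here is a hedged sketch of the LDA computation on two illustrative classes (the data, class sizes, and means below are assumptions for illustration only): build the within-class and between-class scatter matrices as above, then take the leading eigenvector of S_W⁻¹·S_B as the discriminant axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative classes with different means, 2 features each.
Xa = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
Xb = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))
X = np.vstack([Xa, Xb])
m = X.mean(axis=0)                       # combined (overall) mean

S_B = np.zeros((2, 2))                   # between-class scatter
S_W = np.zeros((2, 2))                   # within-class scatter
for Xc in (Xa, Xb):
    mc = Xc.mean(axis=0)
    diff = (mc - m).reshape(-1, 1)
    S_B += Xc.shape[0] * diff @ diff.T   # N_i * (m_i - m)(m_i - m)^T
    S_W += (Xc - mc).T @ (Xc - mc)       # sum of (x - m_i)(x - m_i)^T within the class

# LDA direction: leading eigenvector of S_W^-1 S_B
# (maximizes between-class separation relative to within-class spread).
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real

projection = X @ w                       # 1-D projection that best separates the two classes
print(w, projection.shape)
```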
I) PCA vs LDA: key areas of difference, and when should we use which?


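For completeness, a short sketch of how both techniques are typically applied in practice with scikit-learn, using the Iris dataset purely as an illustration (assumes scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores y, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses y, keeps the directions that maximize class separability.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)   # (150, 2) (150, 2)
```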
We hope this has clarified some of the basics of the topics discussed and that you now have a different perspective on matrices and linear algebra going forward.
As they say, the great thing about anything elementary is that it is not limited to the context it is being read in. It is foundational in the real sense upon which one can take leaps and bounds.
Happy learning!!
Disclaimer: The views expressed in this article are the opinions of the authors in their personal capacity and not of their respective employers.