
Implementing PCA From Scratch

Brushing up your linear algebra skills with just Python and NumPy


Photo by Pawel Czerwinski on Unsplash

I have always been amazed by the different ways one can simplify data, making it more accessible and easier to analyze. It sometimes feels like voodoo-magic – simply apply an algorithm and it somehow does the job.

Principal Component Analysis, or PCA, is one of those algorithms. It's an unsupervised learning algorithm for dimensionality reduction: it transforms a large set of features into a smaller one while preserving as much information as possible.

In this article, we are going to demystify some of the voodoo-magic by implementing PCA from scratch. We will do this in a step-by-step fashion using just Python and NumPy.


General Overview

Before diving straight into the implementation details, let’s get a high-level overview of how the algorithm actually works.

Principal Component Analysis consists mainly of three steps:

  1. First of all, we need to compute the covariance matrix of the data.
  2. Once we obtain this matrix, we decompose it using eigendecomposition, which yields eigenvalues and eigenvectors.
  3. Finally, we select the most important eigenvectors based on their eigenvalues and project the original matrix onto this reduced set of dimensions.

In the following section, we will take a closer look at each step, implementing PCA in just a single class. Below we can already take a look at the skeleton class, which provides us with some kind of blueprint.
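One possible skeleton, with illustrative method names that simply mirror the three steps above (a sketch, not necessarily the original code), could look like this:

```python
import numpy as np

class PCA:
    """A minimal PCA sketch using only NumPy."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None  # the selected eigenvectors
        self.mean = None
        self.std = None

    def fit(self, X):
        # 1. standardize the data
        # 2. compute the covariance matrix
        # 3. eigendecomposition and sorting
        # 4. select the most important eigenvectors
        pass

    def transform(self, X):
        # project the data onto the selected eigenvectors
        pass
```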


Implementation From Scratch

Compute the covariance matrix

As described in the general overview, our first step mainly involves the calculation of the covariance matrix. So why do we need to do that?

The covariance matrix, loosely speaking, tells us how much two random variables vary together. If the covariance is positive, the two variables are correlated and therefore move (increase or decrease) in the same direction. If the covariance is negative, the variables are inversely correlated, meaning they move in opposite directions (e.g. one increases while the other decreases).

Let’s assume we have a dataset with three features. If we calculate the covariance matrix, we will get a 3×3 matrix, containing the covariance of each column with all the other columns and itself.
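Schematically, for three features x₁, x₂ and x₃, it has the form:

C = | cov(x₁, x₁)  cov(x₁, x₂)  cov(x₁, x₃) |
    | cov(x₂, x₁)  cov(x₂, x₂)  cov(x₂, x₃) |
    | cov(x₃, x₁)  cov(x₃, x₂)  cov(x₃, x₃) |

with the variances of the individual features on the diagonal.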

An example of a covariance matrix [Image by Author]

The covariance matrix will be our centerpiece when applying eigendecomposition, allowing us to choose the main vectors, or main directions, that explain most of the variance within our dataset.

Now that we know what a covariance matrix is, we still need to know how to compute it. The formula for each entry of the matrix, i.e. the covariance between two features x and y, can be stated as follows:
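cov(x, y) = (1/n) · Σᵢ (xᵢ − x̄)(yᵢ − ȳ)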

If we mean-center our data beforehand, we can omit the subtraction of x̄ and ȳ. Simplifying the equation and expressing it in linear algebra notation, we get the following:
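C = (1/n) · XᵀX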

This is simply the dot product of the mean-centered data matrix X with itself, divided by the number of samples n.

Equipped with this new knowledge, we can implement our first two class methods.
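A sketch of these two methods, added to the class from above (the method names are illustrative), might look like this:

```python
import numpy as np

class PCA:
    # ... __init__ as in the skeleton above ...

    def standardize_data(self, X):
        # mean-center each feature and scale it to unit variance
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return (X - self.mean) / self.std

    def get_covariance_matrix(self, X_std):
        # C = (1/n) * X^T X for the standardized (and therefore mean-centered) data
        n_samples = X_std.shape[0]
        return np.dot(X_std.T, X_std) / n_samples
```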

Note: We standardize the data by subtracting the mean and dividing by the standard deviation. This transforms all features to the same scale, allowing for equal contribution to the analysis. The covariance matrix can also be computed with the NumPy function numpy.cov (for a samples-by-features matrix, pass rowvar=False).

Decomposing the covariance matrix

The next step in our implementation mainly concerns the decomposition of the covariance matrix – or more specifically, the eigendecomposition.

Eigendecomposition describes the factorization of a matrix into eigenvectors and eigenvalues. The eigenvectors provide us with information about the direction of the data, whereas the eigenvalues can be interpreted as coefficients, telling us how much variance is carried in each eigenvector.

So long story short, if we decompose our covariance matrix, we obtain the eigenvectors, which explain the most variance within our dataset. We can use those vectors to project our original matrix into a lower dimension.

So how do we decompose our covariance matrix?

Luckily, we can simply rely on NumPy's built-in np.linalg.eig function, since the ‘manual’ computation of the eigenvalues and eigenvectors is quite tedious. The only thing we have to do is sort the eigenvalues and the eigenvectors accordingly, allowing us to select the most important eigenvectors based on the number of components we specify.
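One way to sketch this step as a method of our class (again, names are illustrative) is the following:

```python
import numpy as np

class PCA:
    # ... previous methods as above ...

    def decompose(self, cov_matrix):
        # eigendecomposition of the (symmetric) covariance matrix
        eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
        # sort eigenvalues and their eigenvectors in descending order
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        # keep the leading n_components eigenvectors (stored as columns)
        self.components = eigenvectors[:, : self.n_components]
        return eigenvalues, eigenvectors
```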

Projecting and putting it all together

We have already completed most of the work by now.

By selecting the most important eigenvectors, we are now able to project the original matrix into the lower-dimensional space. This is achieved by taking the dot product of the standardized data matrix and the selected eigenvectors, which can also be interpreted as a simple linear transformation.

After implementing the two remaining methods, our final class looks like the following:
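One possible complete version, assembled from the sketches above (a sketch under the stated assumptions, not necessarily identical to the original code):

```python
import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None  # the selected eigenvectors
        self.mean = None
        self.std = None

    def standardize_data(self, X):
        # mean-center each feature and scale it to unit variance
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return (X - self.mean) / self.std

    def get_covariance_matrix(self, X_std):
        # C = (1/n) * X^T X for the standardized data
        n_samples = X_std.shape[0]
        return np.dot(X_std.T, X_std) / n_samples

    def decompose(self, cov_matrix):
        # eigendecomposition, sorted by decreasing eigenvalue
        eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
        idx = np.argsort(eigenvalues)[::-1]
        self.components = eigenvectors[:, idx][:, : self.n_components]
        return eigenvalues[idx]

    def fit(self, X):
        X_std = self.standardize_data(X)
        cov_matrix = self.get_covariance_matrix(X_std)
        self.decompose(cov_matrix)
        return self

    def transform(self, X):
        # project the standardized data onto the selected eigenvectors
        X_std = (X - self.mean) / self.std
        return np.dot(X_std, self.components)

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)
```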


Testing our Implementation

Now that we have finished our implementation, there is just one thing left to do: we need to test it.

For testing purposes, we will use the iris dataset, which consists of 150 samples with 4 different features (Sepal Length, Sepal Width, Petal Length, Petal Width). Let’s take a look at the original data by plotting the first two features.
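For example, assuming we load the data with scikit-learn's load_iris helper and plot it with Matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# scatter plot of the first two original features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```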

An Overview of the iris dataset by plotting the first two features [Image by Author]

We can now instantiate our own PCA class and fit it on the dataset. When running the code below, we get the following result.
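A minimal version of that test, reusing the PCA class and the X and y arrays from above, might look like this:

```python
# project the data onto the first two principal components
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)

plt.scatter(X_projected[:, 0], X_projected[:, 1], c=y)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
```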

Iris dataset projected onto the first two principal components [Image by Author]

By applying PCA, we untangled some of the class relations and separated the classes more clearly. This lower-dimensional data structure should make a classification task a lot easier.

Conclusion

In this article, we implemented Principal Component Analysis from scratch, following a manual, step-by-step approach. We learned about the main computational steps, which involve some very important linear algebra concepts.

PCA is a simple yet effective way to reduce, compress and untangle high-dimensional data. Understanding and implementing the algorithm from scratch provides a great way to build strong fundamentals and revise our linear algebra knowledge, since it ties many important concepts together.

You can find the full code here on my GitHub.


Thank you for reading! Make sure to stay connected & follow me here on Medium, Kaggle, or just say ‘Hi’ on LinkedIn.



