
The Definitive Guide to Principal Components Analysis

A tutorial that strips down low-level code you can edit and run in your browser, to understand PCA once and for all, down to the last detail…

Despite the hundreds of web pages dedicated to Principal Components Analysis (PCA), I haven’t found a single one that is complete and detailed enough about how it is actually computed. Some tutorials are good at explaining how you get the covariance matrix, others are good at describing the singular value decomposition involved, and others simply feature ready-made programs or pieces of code that work but leave you with no clue as to why and how. On the other hand, some resources go very deep but use technical terms that end up confusing you even more. Yet others specialize in what the inputs and outputs of a principal components analysis are but totally disregard what happens inside the black box.

That’s why I wrote this new tutorial, which strips down the low-level code of a web app I recently built to run PCA online:

A free online tool for Principal Components Analysis with full graphical output

By "low-level" I mean that except for one function that is taken from a library, the rest of the code is all simple operations and loops.

Since this is taken from a web app, the code involved is JavaScript. But it is quite simple, as you will see, so you should be able to translate it into other programming languages easily. The upside of it being JavaScript is that you can easily edit and test it in your browser, exploring for yourself how PCA works until you understand it once and for all, down to the last detail.


Principal Components Analysis in a nutshell

PCA is a technique used to reduce the number of dimensions in a dataset while preserving the most important information in it. PCA achieves this by projecting the high-dimensional data linearly onto its main components of variation, called the principal components (PCs).

PCA can be used to identify the underlying structure of a dataset or to reduce its dimensionality. In practice, you use PCA to simplify the display of complex data by spreading it over only 1, 2, or 3 dimensions (most commonly 2, in the form of a two-dimensional plot); to find out which of the original variables contribute most to the variability in the dataset; to remove noise or compress data (such as an image) by reconstructing the dataset from its main PCs and leaving out those that contain mostly noise; and for a few other related applications. You can see some specific examples in the article where I present the web app.

The maths, kept simple but complete

Some resources out there are complete but too complex to follow, while others are simple only because they don’t show the full picture. For example, some tutorials focus on calculating the covariance matrix but then leave you rather alone regarding how you go from that matrix to the actual results of a PCA. Other tutorials delve more deeply into that part but take the covariance matrix for granted, and often use unnecessarily complicated jargon along the way.

Here I will start from the raw dataset, which is a collection of nobjects objects, each described by nvariables variables. For example, each object could be a spectrum collected for one sample over a wide range of wavelengths; the reading at each wavelength would be one variable. Note that throughout the text I will refer to variables with the same names used in the code, in italics.

Step 1: From starting dataset to covariance matrix

Now that the input is clear, it’s time for code. In the code that follows, the starting dataset is contained in the variable called X, which has nvariables rows and nobjects columns.
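For concreteness, here is a hypothetical toy dataset you can paste in to follow along: 3 variables measured on 4 objects, with numbers made up purely for illustration:

//A hypothetical toy dataset: nvariables=3 rows, nobjects=4 columns
var nvariables = 3
var nobjects = 4
var X = [
  [2.5, 0.5, 2.2, 1.9],   //variable 1 measured on the 4 objects
  [2.4, 0.7, 2.9, 2.2],   //variable 2
  [1.1, 0.9, 2.3, 1.5]    //variable 3
]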

In its most common form, which is what we will run here, PCA is a linear decomposition of the covariance matrix. So let’s calculate that. We first need to mean-center all the inputs by subtracting the average value of each variable. For this we first compute the averages of all variables (by summing across all objects and then dividing by their number), and we then subtract them. Notice that at the end we don’t create a new variable; rather, the elements of the starting matrix X get replaced by the mean-centered values:

//This first block computes each variable's average
var averages=[]
for (var i=0;i<nvariables;i++) {
  var tmp2=0
  for (var j=0;j<nobjects;j++) {
    tmp2=tmp2+X[i][j]
  }
  averages.push(tmp2/nobjects)
}
//This block subtracts from each reading the average of the corresponding variable
for (var i=0;i<nvariables;i++) {
  for (var j=0;j<nobjects;j++) {
    X[i][j] = X[i][j] - averages[i]
  }
}
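If you want to verify that the centering worked, you can recompute the averages at this point; they should all come out as (essentially) zero, up to floating-point rounding:

//Optional sanity check: each variable's average should now be ~0
for (var i=0;i<nvariables;i++) {
  var check=0
  for (var j=0;j<nobjects;j++) {
    check=check+X[i][j]
  }
  console.log("Variable "+i+" average after centering: "+check/nobjects)
}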

Now that we have subtracted the averages, we can compute the covariance matrix. Up to this point, including the computation of the covariance matrix, everything is explained quite clearly in this external blog entry:

Covariance Matrix

Here’s the low-level code:

//Here we just set up a covariance matrix full of zeros
var covar = []
for (var i=0;i<nvariables;i++) {
  var tmp3 = []
  for (var j=0;j<nvariables;j++) {
    tmp3.push(0)
  }
  covar.push(tmp3)
}
//This block fills in the covariance matrix by accumulating the sums of products
for (var i=0;i<nvariables;i++) {
  for (var j=0;j<nvariables;j++) {
    for (var k=0;k<nobjects;k++) {
      covar[i][j] = covar[i][j] + X[i][k]*X[j][k]
    }
    covar[i][j] = covar[i][j] / (nobjects-1)   //Here -1 can be -0, see note below
  }
}

Note that we need three indices, because we need to calculate the covariance between every pair of variables (including each variable with itself, which yields the variances), and for each of these calculations we need to iterate through all the objects in the dataset.

Also note the comment inside the covariance calculation ("Here -1 can be -0"). Dividing the sums of products by nobjects-1, as shown here, gives the unbiased sample covariance, which is the most correct choice; however, some examples out there divide by just nobjects, which corresponds to the population covariance. For PCA the choice hardly matters: it rescales all the variances by the same factor, so the principal components and the percentages of variation explained come out identical.

Step 2: Diagonalizing the covariance matrix

We now need to diagonalize the covariance matrix, which we will do through Singular Value Decomposition (SVD). Since the covariance matrix is symmetric and positive semi-definite, its SVD is equivalent to its eigendecomposition, so this single step yields both the directions of maximal variation (the eigenvectors) and how much variation each one carries (the eigenvalues).

This is the only point of the procedure where, rather than coding everything at a low level, I prefer to use a robust, well-tested, properly coded, and efficient matrix algebra library. In this case I use a procedure from the Lalolib package for in-browser maths, which I presented in a recent article:

Websites for Statistics and Data Analysis on Every Device

Notice that to use Lalolib you need to source it in your HTML, like this:

<script type="application/x-javascript" src="lalolib.js"> </script>

Once that’s done, all of Lalolib’s functions get exposed to your JavaScript.

Now, before we run SVD on the covariance matrix we need to convert this matrix into a type of object that Lalolib can handle. That’s because the library works with its own data types, and not raw JavaScript arrays like our covar matrix.

So in the next piece of code you’ll see that the covar matrix gets cast into a matrix that Lalolib can handle, called matrixforsvd, followed by the actual call to the SVD routine on this matrix. As shown here, you can log the four main outputs of the procedure to the console:

var matrixforsvd = array2mat(covar);

var svdcalc = svd(matrixforsvd, "full");
console.log(svdcalc.U)
console.log(svdcalc.S)
console.log(svdcalc.V)
console.log(svdcalc.s)

What are these outputs? The first three are matrices. You will see that U and V are identical when you run SVD on a covariance matrix, because a covariance matrix is symmetric (you can log covar to check this yourself). U and V contain the loadings: the coefficients that define how the principal components space is built so as to optimally spread out the variation in the dataset.

s (lowercase, as opposed to the uppercase S matrix) is the diagonal extracted from the S matrix. Since all the off-diagonal elements of the S matrix are zero, the s vector is simpler yet contains the same information: how much each principal component contributes to the total variation in the dataset. These values always come out sorted in decreasing order, as you can verify by running examples on my web app. If you normalize them to their sum, you get the percentage of variation explained by each principal component.
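For example, here is one way to turn s into those percentages. I am assuming here that svdcalc.s can be indexed like a regular array; if your version of the library returns a different type, cast it first:

//Percentage of the total variation explained by each principal component
var s = svdcalc.s
var totalvariation = 0
for (var i=0;i<nvariables;i++) {
  totalvariation = totalvariation + s[i]
}
for (var i=0;i<nvariables;i++) {
  console.log("PC"+(i+1)+" explains "+(100*s[i]/totalvariation).toFixed(2)+"% of the variation")
}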

Step 3: Project the data onto the principal components

This is the last part of the PCA procedure, and also the one that most tutorials rush through. Essentially, we need to project the starting data (mean-centered, so that’s our X) onto the new space, which amounts to multiplying the transpose of X by the U matrix obtained from the SVD of the covariance matrix. Here we keep only the first two principal components, that is, the first two columns of U.

Since I coded this multiplication at a low level, I first had to cast svdcalc.U into a JavaScript array. You’ll see this done with .toArray(). Right after this, two nested for loops carry out the multiplication.

var projection = []
var pc1 = 0, pc2 = 0
var U = svdcalc.U.toArray()

for (var i=0; i<nobjects; i++) {
  pc1=0
  pc2=0
  for (var j=0;j<nvariables;j++) {
    pc1 = pc1 + U[j][0] * X[j][i]   //projection onto the first PC
    pc2 = pc2 + U[j][1] * X[j][i]   //projection onto the second PC
  }
  var tmp4 = []
  tmp4.push(pc1)
  tmp4.push(pc2)
  projection.push(tmp4)
}

The final variable, projection, now contains nobjects rows and just 2 columns, holding the first two principal components. The PC plot is simply a scatter plot of all these PC1, PC2 pairs.
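If you don’t have a plotting routine at hand, a quick trick is to dump projection as comma-separated text that you can paste into any spreadsheet or plotting tool:

//Dump the PC1, PC2 pairs as CSV text for quick plotting elsewhere
var csvtext = "PC1,PC2\n"
for (var i=0;i<nobjects;i++) {
  csvtext = csvtext + projection[i][0] + "," + projection[i][1] + "\n"
}
console.log(csvtext)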

Closing words

Running PCA online to understand it through examples

For an actual web app to run PCA online including graphical output, plus explanations on how to interpret the results, check out this other story:

A free online tool for Principal Components Analysis with full graphical output


Trying it out

Copy the pieces of code into an HTML page (don’t forget to source the Lalolib library), create an X matrix with some data, and try out the code, using console.log() or document.write() to display your intermediate and final results.
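As a starting point, here is a minimal page skeleton, assuming lalolib.js sits in the same folder as the HTML file; just paste the code blocks from Steps 1 to 3 where indicated:

<!DOCTYPE html>
<html>
<head>
  <script type="application/x-javascript" src="lalolib.js"> </script>
</head>
<body>
<script>
//A hypothetical toy dataset, as at the beginning of the tutorial
var nvariables = 3, nobjects = 4
var X = [
  [2.5, 0.5, 2.2, 1.9],
  [2.4, 0.7, 2.9, 2.2],
  [1.1, 0.9, 2.3, 1.5]
]

//...paste here the code blocks from Steps 1, 2 and 3...

document.write(JSON.stringify(projection))  //the final PC1, PC2 pairs
</script>
</body>
</html>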

I hope this tutorial has provided you with a clearer picture of how PCA runs and the maths behind it. If some parts of this text are still unclear then indicate this by highlighting and commenting, and I will see what I can do!

Happy PCAing!


www.lucianoabriata.com. I write and photoshoot about everything that lies within my broad sphere of interests: nature, science, technology, programming, etc. Become a Medium member to access all of its stories (affiliate links of the platform, from which I get small revenues at no cost to you) and subscribe to get my new stories by email. To consult about small jobs, check my services page here. You can contact me here.

