Principal Component Analysis: Your Tutorial and Code
Your data is the life-giving fuel to your Machine Learning model. There are always many ML techniques to choose from and apply to a particular problem, but without a lot of good data you won’t get very far. Data is often the driver behind most of your performance gains in a Machine Learning application.
Sometimes that data can be complicated. You have so much of it that it may be challenging to understand what it all means and which parts are actually important. Dimensionality reduction is a technique which, in a nutshell, helps us gain a better macro-level understanding of our data. It reduces the number of features of our dataset such that we’re left with only the most important parts.
Principal Component Analysis (PCA) is a simple yet powerful technique used for dimensionality reduction. Through it, we can directly decrease the number of feature variables, thereby narrowing down the important features and saving on computations. From a high-level view PCA has three main steps:
(1) Compute the covariance matrix of the data
(2) Compute the eigenvalues and eigenvectors of this covariance matrix
(3) Use the eigenvalues and eigenvectors to select only the most important feature vectors, then project your data onto those vectors for reduced dimensionality!
The entire process is illustrated in the figure above, where our data has been converted from a 3-dimensional space of 1000 points to a 2-dimensional space of 100 points. That’s a 10X saving on computation!
(1) Computing the covariance matrix
PCA yields a feature subspace that maximizes the variance along the feature vectors. Therefore, in order to properly measure the variance of those feature vectors, they must be properly balanced. To accomplish this, we first normalise our data to have zero-mean and unit-variance such that each feature will be weighted equally in our calculations. Assuming that our dataset is called X:
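A minimal sketch of this normalisation step in plain numpy. The small array used for X here is made-up illustration data (rows are samples, columns are features), not a dataset from this tutorial:

```python
import numpy as np

# Hypothetical toy dataset: 6 samples (rows) x 3 features (columns).
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.8],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 0.9],
])

# Subtract each feature's mean and divide by its standard deviation,
# so every column ends up with zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this, every column of X_std has mean 0 and standard deviation 1, so no single feature dominates the variance just because of its units.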
The covariance of two variables measures how “correlated” they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables will change in opposite directions. The covariance matrix is then just an array where each value specifies the covariance between two feature variables, based on its row-column position in the matrix. The formula is:
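For n data points stored as the rows of X, a standard way to write the covariance matrix is shown below. The symbol Σ and the n − 1 (sample covariance) normalisation are assumptions here, chosen to match the usual convention:

```latex
\Sigma = \frac{1}{n-1}\,\left(X - \bar{x}\right)^{T}\left(X - \bar{x}\right)
```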
Here the x with the line on top is a vector of mean values, one for each feature of X. Notice that when we multiply the transposed matrix by the original one, we end up multiplying every pair of features together for each data point and summing over the data points! In numpy code it looks like this:
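A minimal sketch of that computation, assuming the n − 1 sample normalisation. The random X is placeholder data, and the names mean_vec and cov_mat are our own choices:

```python
import numpy as np

# Hypothetical data: 100 samples x 3 features, standardised as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Mean of every feature (close to zero already, since X_std is standardised).
mean_vec = X_std.mean(axis=0)
n_samples = X_std.shape[0]

# (X - mean)^T (X - mean) / (n - 1) gives a features-by-features matrix.
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (n_samples - 1)
```

The same matrix can be obtained with np.cov(X_std, rowvar=False), which is a handy cross-check.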
(2) Computing eigenvalues and eigenvectors
The eigenvectors (principal components) of our covariance matrix represent the vector directions of the new feature space, and the eigenvalues represent the magnitudes associated with those vectors. Since we are looking at our covariance matrix, each eigenvalue quantifies the variance contributed by its eigenvector.
If an eigenvector has a corresponding eigenvalue of high magnitude, it means that our data has high variance along that vector in feature space. Thus, this vector holds a lot of information about our data, since any movement along that vector causes large “variance”. On the other hand, vectors with small eigenvalues have low variance, and thus our data does not vary greatly when moving along that vector. Since little changes when moving along that particular feature vector, i.e. changing the value of that feature vector does not greatly affect our data, we can say that this feature isn’t very important and we can afford to remove it.
That’s the whole essence of eigenvalues and eigenvectors within PCA. Find the vectors that are the most important in representing our data and discard the rest. Computing the eigenvectors and eigenvalues of our covariance matrix is an easy one-liner in numpy. After that, we’ll sort the eigenvectors in descending order based on their eigenvalues.
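A sketch of both steps on placeholder data. We use np.linalg.eigh here, which is intended for symmetric matrices such as a covariance matrix; np.linalg.eig would also work. The mixing matrix only exists to make the made-up features correlated:

```python
import numpy as np

# Hypothetical correlated data: 200 samples x 3 features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)).dot(mixing)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_mat = np.cov(X_std, rowvar=False)

# The one-liner: eigenvalues and eigenvectors (one eigenvector per column).
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# eigh returns eigenvalues in ascending order, so reverse the ordering
# and reorder the eigenvector columns to match.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]
```

Note that the eigenvectors are the columns of eig_vecs, so the sort has to reorder columns, not rows.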
(3) Projection onto new vectors
At this point we have a list of eigenvectors sorted in order of “importance” to our dataset based on their eigenvalues. Now what we want to do is select the most important feature vectors that we need and discard the rest. We can do this in a clever way by looking at the explained variance percentage of the vectors. This percentage quantifies how much information (variance) can be attributed to each of the principal components out of the total 100%.
Let’s take an example to illustrate. Say we have a dataset which originally has 10 feature vectors. After computing the covariance matrix, we discover that the eigenvalues are:
[12, 10, 8, 7, 5, 1, 0.1, 0.03, 0.005, 0.0009]
The total sum of this array is 43.1359. But the first 6 values sum to 43, which represents:
43 / 43.1359 = 99.68% of the total! That means that our first 6 eigenvectors effectively hold 99.68% of the variance or information about our dataset. We can thus discard the last 4 feature vectors as they only contain 0.32% of the information, a worthy sacrifice for saving on 40% of the computations!
Therefore, we can simply define a threshold upon which we can decide whether to keep or discard each feature vector. In the code below, we simply count the number of feature vectors we would like to keep based on a selected threshold of 97%.
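A small sketch of that counting step, using the example eigenvalues listed earlier. The variable names and the use of np.searchsorted are our own choices, not a fixed recipe:

```python
import numpy as np

# Example eigenvalues from the text, already sorted in descending order.
eig_vals = np.array([12, 10, 8, 7, 5, 1, 0.1, 0.03, 0.005, 0.0009])

# Fraction of the total variance explained by each component,
# and the running total as we keep more and more components.
explained_variance = eig_vals / eig_vals.sum()
cumulative_variance = np.cumsum(explained_variance)

# Keep the smallest number of components whose cumulative
# explained variance reaches the threshold (capped at all of them).
threshold = 0.97
n_keep = int(np.searchsorted(cumulative_variance, threshold)) + 1
n_keep = min(n_keep, len(eig_vals))
```

With these eigenvalues, the first five components cover about 97.37% of the variance, so a 97% threshold keeps 5 of them; the first six cover about 99.68%.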
The final step is to actually project our data onto the vectors we decided to keep. We do this by building a projection matrix: that’s just a fancy name for a matrix we will multiply by to project our data onto the new vectors. To create it, we simply concatenate all of the eigenvectors we decided to keep, one per column. We then take the dot product between our standardised data and our projection matrix.
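Putting the last step together as a sketch on placeholder data. Here n_keep = 2 is simply assumed for illustration; in practice it would come from the explained-variance threshold:

```python
import numpy as np

# Hypothetical correlated data: 200 samples x 3 features, standardised.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)).dot(mixing)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix, sorted by descending eigenvalue.
cov_mat = np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Assumed number of components to keep (illustrative only).
n_keep = 2

# Projection matrix: the kept eigenvectors side by side, shape (3, 2).
projection_matrix = eig_vecs[:, :n_keep]

# Project the standardised data onto the kept eigenvectors, shape (200, 2).
X_pca = X_std.dot(projection_matrix)
```

A useful sanity check is that the variance of each column of X_pca matches the corresponding eigenvalue, and that the new columns are uncorrelated with each other.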
Voila! Dimensions reduced!