
PCA: Beyond Dimensionality Reduction

Learn how to use the PCA algorithm to find variables that vary together

Photo by Pritesh Sudra on Unsplash

Principal Component Analysis

Principal Component Analysis, or PCA for short, is a mathematical transformation based on covariance calculations.

Many beginner Data Scientists have their first contact with the algorithm when learning that it is good for dimensionality reduction: when we have a wide dataset with many variables, we can use PCA to transform the data into as many components as we want, reducing it before making predictions.

That is true, and it is indeed a good technique. But in this post I want to show you another good use of PCA: checking how the features vary together.

Covariance measures how two variables move with respect to each other. It indicates the direction of the linear relationship between variables.

Knowing what covariance is and what it does, we can tell whether variables move together, in opposite directions, or independently of each other.
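
To make that concrete, here is a minimal sketch – the toy variables x, y and z are made up purely for illustration – showing how the sign of the covariance reveals the direction of the relationship:

import numpy as np
# Toy variables: y moves with x, z moves against x
x = np.arange(10, dtype=float)
y = 2 * x + np.random.normal(0, 0.5, size=10)
z = -3 * x + np.random.normal(0, 0.5, size=10)
print(np.cov(x, y)[0, 1])  # positive: x and y move together
print(np.cov(x, z)[0, 1])  # negative: x and z move in opposite directions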

What PCA does

Things are much more complex than this explanation, but let’s keep it simple.

Suppose we have a 3-dimensional dataset. Well, PCA will take your dataset and look for the direction, across those three dimensions, along which it can draw the longest line – meaning, looking point by point, where the largest spread is. Once that calculation is done, it will draw that line and call it Principal Component 1.

The first principal component captures the most variation of the data. The second PC captures the second most variation.

After that, it will draw another line, which must be perpendicular to PC1 and hold the largest variance possible. And finally the same is done for the third dimension, always following the rules: perpendicular to the prior PCs and holding the largest variance possible. In this way, if we have n dimensions, that process is repeated n times.

A good way I found to explain this is to think about a bottle. Imagine it is full of points from your dataset. Certainly the largest variation will be from the cap to the bottom – so that’s PC1. Then PC2 needs to be perpendicular to it, which leaves us with the arrow from side to side.

Picture 1: "Bottle" dataset. How the computer will "see" your data. Image by the author.

After that, we could keep drawing other lines to show how our data spreads in every possible direction. These vectors are "drawing" the dataset in a mathematical way, so the computer can understand it.
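
To see those rules in action, here is a small sketch – assuming scikit-learn and a made-up "bottle-like" point cloud stretched along one axis – confirming that the components come out ordered by variance and perpendicular to each other:

import numpy as np
from sklearn.decomposition import PCA
# Made-up 3D cloud: most spread along the third axis (cap to bottom), less along the others
rng = np.random.default_rng(42)
points = rng.normal(0, [1.0, 2.0, 10.0], size=(500, 3))
pca_toy = PCA().fit(points)
print(pca_toy.explained_variance_)  # decreasing: PC1 holds the most variance
print(np.round(pca_toy.components_ @ pca_toy.components_.T, 3))  # identity matrix: the PCs are perpendicular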

Running PCA

Let’s code a little and run a PCA.

import pandas as pd
import random as rd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
# Create a dataset: 100 observations, 5 "A" samples and 5 "B" samples
observations = ['obs' + str(i) for i in range(1, 101)]
sA = ['sampleA' + str(i) for i in range(1, 6)]
sB = ['sampleB' + str(i) for i in range(1, 6)]
data = pd.DataFrame(columns=[*sA, *sB], index=observations)
for observation in data.index:
  data.loc[observation, 'sampleA1':'sampleA5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)
  data.loc[observation, 'sampleB1':'sampleB5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)
Table 1: Sample data. Image by the author.

Now we should scale the data. If some variables have a large variance and others a small one, PCA will draw the longest line for PC1 along the large-scale variables, distorting your results and making the other PCs look very small. Yes, PCA is affected by outliers.

Outlier dominating the variance on PC1. Image by the author.

Thus, standardizing the variables will minimize that effect. On the other hand, if the specific scale of your variables matters (that is, you want the PCA to be expressed in that scale), maybe you don’t want to standardize – but be aware of the problem.
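
Here is a quick sketch of that effect on made-up data: one toy feature measured on a much larger scale grabs almost all of PC1 until we standardize:

import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
# Two independent toy features, the second on a much larger scale
rng = np.random.default_rng(0)
raw = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
print(PCA().fit(raw).explained_variance_ratio_)                       # PC1 close to 1.0: scale dominates
print(PCA().fit(preprocessing.scale(raw)).explained_variance_ratio_)  # roughly [0.5, 0.5] after scaling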

Next, keep in mind that PCA computes the components for the rows of the matrix – the rows are what come out with component values. In our example, we want to see how the samples vary together, so I will transpose the data to make the samples the rows, then scale it and run PCA.

# Transpose and Scale
scaled_data = preprocessing.scale(data.T)
# PCA instance
pca = PCA()
# fit (learn the parameters of the data)
pca.fit(scaled_data)
# transform (apply PCA)
pca_data = pca.transform(scaled_data)
# Creating the Scree Plot to check PCs variance explanation
per_var = np.round(pca.explained_variance_ratio_*100, 1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
plt.figure(figsize=(12,6))
plt.bar(x=labels, height=per_var)
plt.ylabel('Variance explained')
plt.show()
Scree Plot: how much variance each PC explains. Image by the author.

Eigenvectors and Eigenvalues

After running PCA, you will receive a bunch of numbers like Table 2 shows: the coordinates of each sample along each Principal Component. They come from the eigenvectors – the numbers that "create" the arrows the computer drew to understand your data.

The eigenvalues represent the amount of variance explained by each component: pca.explained_variance_

# Table of sample coordinates (scores) on each PC
pca_df = pd.DataFrame(pca_data, index=[*sA, *sB], columns=labels)
Table 2: Eigenvectors. Image by the author.
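
If you want to inspect the eigenvectors and eigenvalues directly, the fitted PCA object exposes both – a quick look, using the pca object fitted above:

# Eigenvectors: the directions of the PCs, one row per component
print(pca.components_.shape)
# Eigenvalues: the variance carried by each component
print(pca.explained_variance_)
# The ratio version is the same information divided by the total variance
print(pca.explained_variance_ratio_)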

In case you want to reduce the dimensionality of your dataset, it is quite simple: just use the parameter n_components.

# PCA instance with 3 dimensions
pca = PCA(n_components=3)
# fit (learn the parameters of the data)
pca.fit(scaled_data)
# transform (apply PCA)
pca_data = pca.transform(scaled_data)
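
A quick sanity check after reducing dimensions – a small sketch using the refitted pca object above – is to confirm how much of the total variance the three retained components keep:

# Proportion of the total variance kept by the 3 retained components
print(pca.explained_variance_ratio_.sum())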

Variable Relationships

Now we get to the final part of this post. Let’s go beyond dimensionality reduction and learn how to tell whether the variables move together or not.

  • Look at each Principal Component column and notice that there are positive and negative signs.
  • A positive sign means the sample varies in the same direction as the PC, while a negative sign means it varies in the opposite direction.
  • The magnitude indicates the strength: the higher it is, the more that sample varies along that PC.
  • Look at the percentage of the variance each PC explains – here, PC1 = 92%. So we can look only at PC1 to understand how the samples are related, since almost all of the variance is explained by PC1.
  • We can see that the A samples go in one direction (+) while the B samples go in the other (-), as the short sketch after this list also shows.
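
Here is a short sketch of that reading: it splits the samples of pca_df by the sign of their PC1 coordinate (note that the sign of a PC is arbitrary, so the two groups could come out flipped):

# Split samples by the sign of their PC1 coordinate
same_direction = pca_df.index[pca_df.PC1 > 0].tolist()
opposite_direction = pca_df.index[pca_df.PC1 < 0].tolist()
print('Positive on PC1:', same_direction)      # expected: the A samples (up to a sign flip)
print('Negative on PC1:', opposite_direction)  # expected: the B samples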

Let’s plot a chart to better illustrate the idea.

# PC1 x PC2 scatter plot
plt.figure(figsize=(12,6))
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('PCA graph')
plt.xlabel(f'PC1 - {per_var[0]}%')
plt.ylabel(f'PC2 - {per_var[1]}%')
for sample in pca_df.index:
  plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]) )
plt.show()
Amazing how the samples A and B are lined up together. Image by the author.

Before You Go

In this post, we learned another good use of the PCA algorithm: understanding which variables are related and "float" together, which can be an interesting tool for feature selection or can be combined with clustering.

You can learn a lot more by watching this video from StatQuest, the best I’ve ever found about PCA and the one I usually come back to for reference.

Other considerations:

  • PCA is based on covariance analysis and is affected by outliers.
  • It can be used to reduce dimensionality.
  • It can be used to check how the features of a dataset are related.
  • If the scree plot (the one showing the percentages) gives you a low variance for PC1, like 20%, you may want to take enough PCs to sum at least 80% of the explained variance before performing your analysis, like the example below.
# Weight each PC by the proportion of variance it explains
pca_df2 = pca_df.mul(pca.explained_variance_ratio_)
# Sum the weighted components per sample
pca_df2 = pd.DataFrame({'fabricante': pca_df2.index, 'tt_component': pca_df2.sum(axis=1)}).sort_values(by='tt_component')
pca_df2.plot.scatter(x='tt_component', y='fabricante', figsize=(8, 8))
Sum PCs for variance analysis. Image by the author.

The code is available on GitHub, here.

If this content is interesting to you, follow my blog for more.

gustavorsantos

