PCA: Principal Component Analysis – How to Get Superior Results with Fewer Dimensions?

One of The Best Techniques for Dimensionality Reduction

Principal Component Analysis. Image by author.

Intro

Principal Component Analysis (PCA) is a technique commonly used by data scientists to make model training more efficient and to visualize data in lower dimensions.

In this article, I explain how PCA works and give you an example of how to perform such analysis in Python.

Contents

  • The category of Machine Learning techniques PCA belongs to
  • Visual explanation of how PCA works
  • Python example of performing PCA on real-life data
  • Conclusions

What category of Machine Learning techniques does Principal Component Analysis (PCA) belong to?

While PCA is often referred to as a dimensionality reduction technique, it is actually a data transformation.

Nevertheless, PCA makes it very easy to use the resulting principal components to reduce the number of dimensions as it ranks them from "most useful" (captures a lot of the data variance) to "least useful" (captures very little of the data variance).

Hence, there is really no harm in putting it under the dimensionality reduction group of algorithms within the unsupervised learning branch of ML.

How does Principal Component Analysis (PCA) work?

Preparing the data

In simple terms, PCA helps us find new axes (principal components) for our data that better capture its variance.

For example, you may have two attributes – house size (sq.ft.) and the number of rooms in that house. Not surprisingly, these two attributes are highly correlated as bigger houses tend to have more rooms.

Let’s visualize our made-up example by plotting the two attributes on a 2D scatterplot:

Relationship between house size and the number of rooms. Image by author.
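
If you would like to recreate a similar toy dataset yourself, here is a minimal sketch; the numbers are invented purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# Made-up example: bigger houses tend to have more rooms
rng = np.random.default_rng(42)
house_size = rng.normal(loc=2000, scale=600, size=100)              # sq. ft.
rooms = np.round(house_size / 500 + rng.normal(0, 0.7, size=100))   # correlated with size

plt.scatter(house_size, rooms)
plt.xlabel('House size (sq.ft.)')
plt.ylabel('Number of rooms')
plt.show()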

We can see that there is a positive correlation between house size and the number of rooms. Also, there seems to be more variance in house sizes when compared to the number of rooms.

However, before we make any premature conclusions, let’s pause for a minute and take a look at the scale of x and y. We can see that these two attributes use two different scales; hence the above picture is not really representative.

To make a more objective interpretation, let’s standardize our data.

Note, standardization is a data transformation technique that rescales the data so that each attribute has a mean of 0 and a standard deviation of 1. This transformation can be described with the formula below:

Standardization. Image by author.
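
In symbols, each value x of an attribute is replaced by z = (x − μ) / σ, where μ is the attribute’s mean and σ is its standard deviation.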

So, this is what the same data looks like after we standardized it:

Relationship between house size and the number of rooms (standardized data). Image by author.

We can see that the data is now centered around the origin because both standardized attributes have a mean of 0. At the same time, it is now easier to visually compare the spread (variance) of the two attributes. As previously speculated, there is indeed a wider spread in house sizes than in the number of rooms.

Finding principal components – PC1

I will intentionally avoid going into the mathematics behind finding principal components because my goal is to give you an intuitive/visual understanding of how PCA works.

If you do enjoy a bit of algebra and, in particular, matrix multiplications, you can check out this story from Zichen Wang on PCA and SVD explanation using NumPy.

Now that we have the data standardized and centered around the origin, we can look for the "best-fit" line. This line must go through the origin, and we can find it by minimizing the distances from data points to their projections on the line. It looks like this:

"Best-fit" line of the data, which is also axis for PC1. Image by author.
"Best-fit" line of the data, which is also axis for PC1. Image by author.

Interestingly, this "best-fit" line also happens to be the axis for Principal Component 1 (PC1). This is because minimizing the distances from the line also maximizes the spread of data point projections on that same line.

In other words, we have found a new axis that captures the maximum amount of variance of the data in that dimension.

If you decide to study the mathematics of this process separately, you will come across terms like eigenvectors and eigenvalues. They, in a sense, describe this "best-fit" line and are found mathematically through eigendecomposition of the covariance matrix or through Singular Value Decomposition (SVD for short).
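
If you would like a taste of what that looks like in code, here is a minimal NumPy sketch of the eigendecomposition route, assuming X_std holds the standardized two-column data from our example (scikit-learn's PCA, which we use later in this article, relies on SVD under the hood):

import numpy as np

# Covariance matrix of the standardized data (columns = attributes)
cov = np.cov(X_std, rowvar=False)

# Eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# np.linalg.eigh returns eigenvalues in ascending order, so reverse them
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal component axes

# Share of the total variance captured by each principal component
explained_variance_ratio = eigenvalues / eigenvalues.sum()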

Finding principal components – PC2 and others

Remember, Principal Component Analysis is not just about Dimensionality Reduction. Hence, we can find as many principal components as there are attributes (dimensions) in our data. That said, let’s find PC2.

Since we only have two attributes in this example, it is straightforward to find PC2 once we know PC1. It is simply a line orthogonal (at a 90-degree angle) to PC1 that goes through the origin.

Here it is on the graph:

Principal Component 2 (PC2). Image by author.

Note, if we had three attributes, then PC2 would be the line that goes through the origin, is orthogonal to PC1, and minimizes distances from data points to their projections on the PC2 line. Then PC3 would be the line that goes through the origin and is orthogonal to both PC1 and PC2.
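
You can check this orthogonality numerically: the dot product of two perpendicular axes is zero. Continuing the NumPy sketch from earlier (pca.components_ from scikit-learn, which we use later in this article, would work the same way):

# The principal component axes are orthogonal, so their dot product is (numerically) zero
print(np.dot(eigenvectors[:, 0], eigenvectors[:, 1]))   # ~0.0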

Summarizing principal components

If we were to list each Principal Component, PC1 would be the dimension that captures the highest proportion of the data variance, with PC2 being the dimension that captures the highest proportion of the remaining variance that PC1 could not capture. Similarly, PC3 would be the dimension capturing the highest proportion of the remaining variance that PC1 and PC2 could not capture, etc.

The goal of PCA is to transform the data in a way that enables us to capture the maximum amount of variance in each subsequent dimension.

Now that we have found PC1 and PC2 in our two-dimensional example, we can rotate the graph and make PC1 and PC2 our new axes:

PC1 – PC2 plot. Image by author.
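
Continuing the earlier NumPy sketch, these new coordinates are simply the standardized data projected onto the principal component axes:

# Coordinates of each data point in the new PC1–PC2 system
scores = X_std @ eigenvectors   # column 0 = PC1 scores, column 1 = PC2 scores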

Also, if we wish, we can perform dimensionality reduction by projecting points onto PC1 and getting rid of PC2.

Projecting data points onto PC1. Image by author.

Finally, if for whatever reason we did not like PC1, we could instead reduce dimensions by projecting points onto PC2 and discarding PC1. However, we would lose a lot of the data variance if we did that.

Projecting data points onto PC1 and PC2. Image by author.

Principal Component Analysis in Python using real-life data

Let’s now get our hands dirty and perform PCA on real-life data.

Setup

We will use Kaggle’s Australian weather data along with the following Python libraries: Pandas, Seaborn, Matplotlib, and Scikit-learn.

Let’s import all the libraries:

import pandas as pd # for data manipulation
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # for data visualization
from sklearn.preprocessing import StandardScaler # for data standardization
from sklearn.decomposition import PCA # for PCA analysis

Then, we download and ingest Australian weather data from Kaggle. We also derive a couple of new features at the same time.

# Set Pandas options to display more columns
pd.options.display.max_columns=50

# Read in the weather data csv
df=pd.read_csv('weatherAUS.csv', encoding='utf-8')

# Drop records where target RainTomorrow=NaN
df=df[pd.isnull(df['RainTomorrow'])==False]

# For other numeric columns with missing values, fill them in with the column mean
df=df.fillna(df.mean(numeric_only=True))

# Create a flag for RainToday and RainTomorrow
# Note, RainTomorrowFlag would be used as the target variable for a prediction model
df['RainTodayFlag']=df['RainToday'].apply(lambda x: 1 if x=='Yes' else 0)
df['RainTomorrowFlag']=df['RainTomorrow'].apply(lambda x: 1 if x=='Yes' else 0)

Here’s a snippet of what the data looks like:

A snippet of Kaggle’s Australian weather data. Image by author.

Data Correlation

Before performing PCA analysis, let’s better understand our data by looking at the correlation plot.

# Create a correlation matrix (numeric columns only)
corrMatrix = df.corr(numeric_only=True)

# Plot the correlation matrix
plt.figure(figsize=(16,9), dpi=500)
sns.heatmap(corrMatrix, annot=True) 
plt.show()
Correlation matrix of the weather data. Image by author.

As we can see, we have plenty of highly correlated variables. For example, MaxTemp is highly negatively correlated to Humidity at 9 am and 3 pm. At the same time, WindGustSpeed is highly positively correlated to WindSpeed at 9 am and 3 pm.

A strong correlation between many variables indicates that PCA will help us capture a large amount of variance within a potentially much smaller number of dimensions.

Data Standardization

Let’s now standardize the data, which gives us an array of features, each with a mean of 0 and a standard deviation of 1.

# Select all 17 numerical features, excluding the target variable (RainTomorrowFlag)
X=df[['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
      'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 
      'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 
      'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainTodayFlag'
     ]]

# Get scaler
scaler=StandardScaler()

# Perform standard scaling 
X_std=scaler.fit_transform(X)

Performing PCA

Finally, let’s use scikit-learn’s PCA to perform Principal Component Analysis.

# Select the model and its parameters
pca = PCA(n_components=17)

# Fit the model and transform the data
X_trans=pca.fit_transform(X_std)

# Print the results
print('*************** PCA Summary ***************')
print('No. of features: ', pca.n_features_in_)
print('No. of samples: ', pca.n_samples_)
print('No. of components: ', pca.n_components_)
print('Explained variance ratio: ', pca.explained_variance_ratio_)

This gives us the following results:

PCA results. Image by author.

The explained variance ratio tells us how much of the variance has been captured by each subsequent principal component. Let’s plot it on a bar chart so we can inspect it more easily.

# Plot the explained variance on a bar chart
# Set x and y axis
x_ax=['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 
     'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17']
y_ax=pca.explained_variance_ratio_.round(3)*100

# Create a plot
plt.figure(figsize=(10,8), dpi=300)
plt.bar(x=x_ax, height=y_ax, color='black')

# Annotate chart by adding values on top of the bars
for i in range(len(x_ax)):
    plt.text(i,y_ax[i]+0.2,str(y_ax[i].round(3))+'%', ha = 'center')

# Set title for chart and axis        
plt.title(label='PCA Variance Explained', loc='center')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance (%)')

plt.show()
PCA explained variance plot. Image by author.

The plot enables us to easily see that the majority of the variance is captured by just the first few PCs (the cumulative figures below can be verified with the short snippet that follows this list):

  • PC1 alone captures 31% of the variance
  • Top 2 PCs capture 50% of the variance
  • Top 6 PCs capture 80% of the variance
  • Top 11 PCs capture 95% of the variance
  • Top 13 PCs capture 99% of the variance
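
One way to check these cumulative figures directly from the fitted model is with NumPy’s cumsum:

import numpy as np

# Cumulative share of variance captured by the top k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(2))   # e.g., the 6th value should be roughly 0.80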

So, if you are building a prediction model, you can definitely make the training more efficient with little or no loss to its performance. You can achieve that by reducing the number of dimensions, i.e., by keeping only a few of the top principal components.

Note, you can select the top few components by simply taking a subset of the X_trans array, or by rerunning the PCA analysis with n_components=6 (or however many dimensions you want to keep).
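
Here is what both options might look like in code:

# Option 1: keep only the first 6 columns of the already-transformed array
X_top6 = X_trans[:, :6]

# Option 2: rerun PCA, asking for 6 components up front
pca_6 = PCA(n_components=6)
X_top6 = pca_6.fit_transform(X_std)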

Conclusions

We have learned how the Principal Component Analysis works and how to apply it to our data. I sincerely hope you can use this new knowledge and the code I’ve provided to build better, more efficient models.

Before you go, here is one last interesting idea to keep in mind. Although reducing the number of dimensions discards some of the data variance, doing so often results in better model performance.

While this may sound counterintuitive, it is worth remembering that some prediction algorithms suffer from the curse of dimensionality, meaning that too many dimensions make the data sparse and harder to model. Hence, reducing the number of dimensions helps such algorithms identify the connections between the attributes, leading to improved performance.

If you found this article useful or have any questions or suggestions, please do not hesitate to reach out.

Cheers! 👏 Saul Dobilas


A couple of related articles you may find interesting:

UMAP Dimensionality Reduction – An Incredibly Robust Machine Learning Algorithm

DBSCAN Clustering Algorithm – How to Build Powerful Density-Based Models

