Principal Component Analysis from the ground up with Python

0. Introduction

Say you have a bunch of data points and you want to find patterns in them. Principal component analysis is a tool that can help you do that. It finds the most important features in your data and reduces its dimensionality: instead of describing each point with many features, it describes each point with a smaller number of derived features that are easier to work with.

In statistics, principal component analysis (PCA) is a technique used to reduce the dimensionality of data. It is a linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies along the first axis, the second greatest variance along the second axis, and so on. The transformation is defined by the eigenvectors of the covariance matrix of the data, which are known as the principal components. In other words, PCA is a way of finding the directions in which the data varies the most and projecting the data onto those directions.

PCA is a powerful tool for data analysis and is used in a variety of fields, such as machine learning, image analysis, and signal processing. In this article, we will give a gentle introduction to PCA, including a brief overview of the math behind it, and some applications of PCA.

1. Useful Libraries

NumPy is the fundamental library for scientific computing in Python. It is used for:

  1. Arrays
  2. Matrices
  3. Linear algebra
  4. Random number generation

And more!

Numpy showcase. Image by author created by @carbon_app
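
If you have never used it, a few lines go a long way. The snippet below is a small, illustrative sketch of this kind of usage (the showcase image above may contain a different example):

```python
import numpy as np

# Arrays and matrices
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

# Linear algebra: solve A x = v and decompose the symmetric matrix A
x = np.linalg.solve(A, v)
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Random number generation
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=(5, 2))

print(x, eigenvalues, samples.shape)
```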

Scikit-learn is a free machine learning library for Python. It is used for:

  1. Classification
  2. Regression
  3. Clustering
  4. Dimensionality Reduction
  5. Model selection
  6. Preprocessing

And more!

scikit-learn showcase. Image by author created by @carbon_app
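
As an illustrative sketch (not the exact snippet from the showcase image), a typical scikit-learn workflow chains preprocessing, model fitting, and evaluation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a toy dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and classification chained in a single pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```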

2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique for determining the underlying structure of a dataset. It accomplishes this by identifying a set of new axes that are orthogonal to each other (i.e., perpendicular) and best explain the variance in the data. The first axis explains the largest variance, the second axis accounts for the second-most variance, and so forth.

PCA is often used for dimensionality reduction, which is the process of reducing the number of variables in a dataset while preserving as much variation as possible. This is done by projecting each data point onto only the first few principal components.

Mathematically, PCA amounts to computing the eigenvectors of the data covariance matrix. The covariance matrix describes the variance of each variable in the dataset and the covariance between every pair of variables. Its eigenvectors are the directions in the data along which the variance is largest, where the variance measures how much the data spreads around its mean along a given direction.

There are a number of different ways to compute the principal components of a dataset. One popular method uses singular value decomposition (SVD). SVD factorizes the (centered) data matrix X into three matrices, X = UΣVᵀ:

  • The columns of the right singular matrix V are the eigenvectors of the covariance matrix, i.e., the principal directions.
  • The columns of the left singular matrix U give, up to scaling, the coordinates of each data point along those directions.
  • The diagonal matrix Σ contains the singular values; their squares, divided by (n − 1) where n is the number of data points, are the eigenvalues of the covariance matrix.

SVD is often used to compute the principal components of a dataset because it is computationally efficient and numerically more stable than forming the covariance matrix explicitly.
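
A quick sketch with synthetic data can verify this relationship numerically (the dataset and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.0], [2.0, 2.0]], size=500)

# Center the data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Eigendecomposition of the covariance matrix (eigh returns ascending eigenvalues)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# SVD of the centered data matrix: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# The squared singular values, divided by (n - 1), match the covariance eigenvalues
print(np.allclose(S**2 / (n - 1), eigenvalues[::-1]))  # True
```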

The PCA procedure can be summarized as follows:

  1. Center the data (i.e., subtract the mean of each variable from each data point). This is necessary because PCA is a variance-maximizing procedure, and centering the data ensures that the first principal component explains the maximum possible variance.
  2. Compute the covariance matrix of the data.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Sort the eigenvectors by descending order of eigenvalue.
  5. Choose the first k eigenvectors, where k is the number of desired principal components.
  6. Compute the projection of the data onto the chosen eigenvectors.

3. Python implementation

You can copy/paste these lines into your favorite IDE. If you don’t know which one to choose, I recommend PyCharm.
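
If you want something runnable to follow along with, the snippet below is a minimal sketch of the steps explained underneath; the data-generation parameters are illustrative, so the exact numbers will differ slightly from the values quoted further down.

```python
import numpy as np

# 1. Generate some zero-mean, correlated 2D data
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.0], [2.0, 2.0]], size=1000)

# (No explicit centering step: the data is generated with mean 0.)

# 2. Compute the covariance matrix of the data
cov = np.cov(X, rowvar=False)

# 3. Compute the eigenvectors and eigenvalues of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort the eigenvectors by descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Choose the first k eigenvectors (here k = 2, i.e., both of them)
k = 2
components = eigenvectors[:, :k]

# 6. Project the data onto the chosen eigenvectors
X_projected = X @ components

# Explained variance ratio of each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```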

Here is a step-by-step explanation of what we do in the code above:

  1. We generate some data.

Note that we do not center the data, because the mean is already 0.

  2. We calculate the actual covariance matrix of the data.
  3. We calculate the eigenvectors and eigenvalues of the covariance matrix.
  4. We sort the eigenvectors by descending order of eigenvalue.
  5. We choose the first 2 eigenvectors and compute the projection of the data onto the chosen eigenvectors.

Look at how we compute the explained variance ratio: it is equal to [0.93134786, 0.06865214]. This means that the first component alone explains about 93% of the variance in our data, and the second adds only about 6.9%. The cumulative sum is equal to 100%, which means that the two principal components together explain all the variance in our data.

In some cases, you might want to keep only a few components that explain most of the variance in your data, because in real-world applications, you often have too many features that are noisy and that add little predictive power to your model.
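
A common heuristic, sketched below with illustrative eigenvalues, is to keep the smallest number of components whose cumulative explained variance ratio reaches a threshold such as 95%:

```python
import numpy as np

# Illustrative eigenvalues, sorted in descending order
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])
explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components explaining at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)  # 4 components needed in this example
```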

  6. We compare our solution with scikit-learn.
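
A minimal sketch of such a comparison, assuming the same kind of synthetic data as above, uses scikit-learn's PCA class:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.0], [2.0, 2.0]], size=1000)

pca = PCA(n_components=2)
pca.fit(X)

# The rows of components_ are the eigenvectors (possibly with flipped signs),
# and explained_variance_ratio_ should match the ratio computed by hand.
print(pca.components_)
print(pca.explained_variance_ratio_)
```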

  7. We plot the data.
Figure 1. Image by author

In Figure 1, the eigenvectors (principal components) are visualized as lines, each labeled by its number and color-coded by the type of component. The components are ordered by the amount of variance they explain in the data, which is reflected in the length of each line. As explained earlier, the two are orthogonal to each other.

  8. We remove the most informative component from the data and see what happens to the data.
Figure 2. Image by author

It is quite self-explanatory. All data points are then projected onto the second axis.
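
In code, removing a component amounts to subtracting the data's projection onto it. A sketch, again with illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.0], [2.0, 2.0]], size=1000)

# Principal directions, sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
eigenvectors = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# Remove the first (most informative) component: subtract the projection onto it.
pc1 = eigenvectors[:, [0]]                 # shape (2, 1)
X_without_pc1 = X - (X @ pc1) @ pc1.T      # what remains lies along the second axis
```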

  9. We remove the second most informative component from the data and see what happens to the data.
Figure 3. Image by author

As a result, all data points are projected onto the first axis, the one that explains the most variance.

  10. Just for kicks, we compare to linear regression.

There are some similarities between Principal Component Analysis (PCA) and linear regression. Both methods find linear relationships in data. However, there are some important differences. The most important one is that PCA finds the directions that maximize the variance of the data, while linear regression finds the coefficients that minimize the prediction error of the model, i.e., the vertical distances between the observed responses and the fitted line (Figure 4).

It is important to remember that the purpose of PCA is not to find the best predictor variables for a linear regression model, but rather to find the underlying structure of the data in order to reduce its dimensionality. In contrast, linear regression is a technique for predicting a quantitative response variable from a linear combination of predictor variables. The coefficients of the linear equation are the regression coefficients that represent the effect of each predictor variable on the response variable.
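
To make the contrast concrete, the sketch below fits both to the same illustrative 2D data; the slope of the first principal direction generally differs from the least-squares slope, because the two methods minimize different quantities.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)
X = np.column_stack([x, y])

# First principal direction: maximizes variance (minimizes perpendicular distances)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigenvectors[:, np.argmax(eigenvalues)]
pca_slope = pc1[1] / pc1[0]

# Ordinary least squares: minimizes vertical distances from y to the fitted line
ols_slope = np.polyfit(x, y, deg=1)[0]

print(f"PCA slope: {pca_slope:.3f}, OLS slope: {ols_slope:.3f}")
```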

Figure 4. Image by author

4. Conclusion

As we can see from the code, PCA is a powerful tool that can be used to find the underlying structure of a dataset. It is also a computationally efficient method for dimensionality reduction.

In this article, we’ve seen how Principal Component Analysis can be used to find the underlying structure of a dataset. We’ve also seen how to use PCA for dimensionality reduction and how to choose the number of components to keep. Finally, we’ve seen how PCA is related to linear regression.

