
A Comprehensive Guide to Principal Component Analysis

The theory and practice of Principal Component Analysis, with a Python implementation

Table of Contents
1. Introduction
2. Principal Component Analysis (PCA)
3. Theory
3.1. Calculating PCA
3.1.1. Rescaling (Standardization)
3.1.2. Covariance Matrix
3.1.3. Eigenvalues and Eigenvectors
3.1.4. Sorting in Descending Order
3.2. Is PCA one of the feature selection & feature extraction methods?
4. Implementation
4.1. Traditional Machine Learning Approaches
4.2. Deep Learning Approaches
5. PCA Types
5.1. Kernel PCA
5.2. Sparse PCA
5.3. Randomized PCA
5.4. Incremental PCA

1. Introduction

This article covers the definition of PCA, a Python implementation of the theoretical part without the Sklearn library, the relationship between PCA and feature selection & feature extraction, its use in machine learning and deep learning, and the different PCA types, each explained with an example.

Photo by Nathan Dumlao on Unsplash

2. Principal Component Analysis (PCA)

Principal Component Analysis is a very useful method, grounded in mathematics and statistics, that performs Dimensionality Reduction by evaluating the dataset from different angles. In machine learning its task is to reduce the dimensionality of the inputs, helping a supervised algorithm learn or, in the unsupervised setting, grouping the dataset according to its features. This dimensionality reduction is the result of a series of mathematical operations. As an example, consider a 2D dataset (x, y) with 2 features in the coordinate plane: classifying it becomes much easier once PCA transforms it into 1D. Now let’s implement and visualize dimensionality reduction with PCA:
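
The article’s original code listing is not reproduced here, so the snippet below is a minimal sketch of this workflow: load the breast cancer dataset bundled with sklearn, standardize it, fit PCA, and plot the first two principal components colored by target.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset: 569 samples, 30 numeric features
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target  # target 0 = malignant, 1 = benign

# Standardize every feature to mean 0 and std 1
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 30 features to 8 principal components
pca = PCA(n_components=8)
X_pca = pca.fit_transform(X_scaled)

# Plot the first two principal components, colored by target
plt.scatter(X_pca[y == 0, 0], X_pca[y == 0, 1], label="malignant", alpha=0.6)
plt.scatter(X_pca[y == 1, 0], X_pca[y == 1, 1], label="benign", alpha=0.6)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.legend()
plt.show()
```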

The cancer dataset (defined as cancer_data in the code) consists of 569 samples and 30 features. These numeric features are first scaled with StandardScaler, then the dataset is reduced to 8 components with the PCA class imported from the Sklearn library, and the two targets, ‘malignant’ and ‘benign’, are colored as in Figure 1. The x-axis represents the first of the 8 components and the y-axis the second.

Figure 1. Graph of first principal component and second principal component, Image by author

As can be seen in Figure 1, after the PCA process the two classes are almost separable by eye, without using any algorithm. With the original numerical dataset of 30 features, this would not be possible for a human at all.

Looking at the explained variance ratio of each component, the values are [0.44272026, 0.18971182, 0.09393163, 0.06602135, 0.05495768, 0.04024522, 0.02250734, 0.01588724]. The first and second components together account for about 63% of the variance in the dataset. The cumulative variance of the 8 components is shown in Figure 2.
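
A short sketch, reusing the pca object fitted above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Explained variance ratio of the 8 components
print(pca.explained_variance_ratio_)

# Cumulative explained variance, as plotted in Figure 2
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```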

Figure 2. Cumulative variance with the number of components, Image by author

Positioning the data in the new dimensions created while the dataset is transformed is called projection. Figure 3 shows this separation along the newly created dimensions and their variance, using the PCA illustration from the mglearn library.

Figure 3. PCA visualization with mglearn library, Image by author

So what exactly lies behind this miraculous process that transforms 30 dimensions into 2 dimensions?

3. Theory

PCA rotates the components so that they capture the maximum variance, and in this way it reduces the dimensionality of the dataset.

Variance gives information about the spread of the dataset. For example, suppose we fill bottles with 5 cl of liquid each. In the first case the bottles hold 4 cl, 5 cl, 5 cl, 5 cl, 6 cl, and in the second case 2 cl, 3 cl, 5 cl, 7 cl, 8 cl. Although both average 5 cl, the fillings in the first case are more uniform, because the variance of the first case is lower than that of the second. This indicates that the filling process is more consistent.
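
A quick numerical check of this example:

```python
import numpy as np

first_case = [4, 5, 5, 5, 6]   # cl per bottle
second_case = [2, 3, 5, 7, 8]  # cl per bottle

print(np.mean(first_case), np.mean(second_case))  # both 5.0
print(np.var(first_case), np.var(second_case))    # 0.4 vs 5.2
```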

3.1. Calculating PCA

The flowchart of how dimensionality reduction is done with PCA is shown in Figure 4.

Figure 4. Flowchart of dimensionality reduction, Image by author

Below, PCA is built step by step with plain mathematical operations, without using the sklearn library, and the resulting components are compared with those obtained from sklearn.

The outputs of each step are shown in tables.

3.1.1. Rescaling (Standardization)

In the first stage, the numeric dataset is scaled. For this, the mean and standard deviation of each feature are calculated, and the new dataset is built with the formula x_new = (x − mean(column of x)) / std(column of x). After this operation every feature has mean = 0 and std = 1 (the same standardization StandardScaler performs).
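
A minimal sketch of this step, reusing the feature matrix X loaded earlier:

```python
import numpy as np

# Manual standardization: x_new = (x - mean(column)) / std(column),
# the same operation StandardScaler performs
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # ~0 for every feature
print(X_scaled.std(axis=0))            # 1 for every feature
```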

Figure 5. Dataset (left) and Scaled dataset (right), Image by author

3.1.2. Covariance Matrix

The covariance matrix is built according to the following formula; it captures how every pair of features in the scaled dataset varies together:

Figure 6. Covariance matrix formula, source

After all covariance values are calculated with this equation, a matrix of shape (n_features, n_features) is obtained. The main goal is to re-express the dataset so that its variance is maximized, and the covariance matrix is needed to find those directions.

Covariance is a measure related to correlation. Covariance tells us in which direction two variables change together (the same direction if it is positive, opposite directions if it is negative); correlation then tells us the extent of this joint change on a unitless scale, whereas covariance is expressed in the units of the variables. In data science, covariance describes the relationship between two variables or datasets.
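
The formula in Figure 6 is the usual sample covariance, cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1), evaluated for every pair of features. A minimal sketch with NumPy on the scaled dataset from the previous step:

```python
import numpy as np

# Covariance matrix of the scaled dataset; rowvar=False means rows are
# samples and columns are features, so the result is (n_features, n_features)
cov_matrix = np.cov(X_scaled, rowvar=False)
print(cov_matrix.shape)  # (30, 30) for the cancer dataset
```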

Figure 7. Covariance Matrix, Image by author

3.1.3. Eigenvalues and Eigenvectors

The eigenvalues are calculated from the covariance matrix of the dataset, and one eigenvalue with its corresponding eigenvector is obtained for each feature.

Figure 8. Eigenvalues (left) and corresponding eigenvectors (right), Image by author
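
A minimal sketch, continuing from the covariance matrix above:

```python
import numpy as np

# The covariance matrix is symmetric, so eigh is the appropriate solver.
# It returns one eigenvalue and one eigenvector per feature,
# with the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(eigenvalues.shape)   # (30,)
print(eigenvectors.shape)  # (30, 30), columns are the eigenvectors
```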

3.1.4. Sorting in Descending Order

The eigenvalues are sorted from highest to lowest. For however many components are desired, the eigenvectors corresponding to that many of the largest eigenvalues are selected and the dataset is projected onto them, reducing its dimensionality. A minimal end-to-end sketch is shown after the hints below.

Hints about eigenvalues:

The trace of matrix x is equal to the sum of its eigenvalues.

The determinant of matrix x is equal to the product of its eigenvalues.

The rank of matrix x is equal to the number of nonzero eigenvalues of matrix x.
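
The sketch below sorts the eigenvalues from the previous step, projects the scaled dataset onto the top two eigenvectors, and compares the result with sklearn’s PCA (the exact code in the article may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

n_components = 2

# Sort eigenvalues (and the matching eigenvectors) in descending order
order = np.argsort(eigenvalues)[::-1]
top_vectors = eigenvectors[:, order[:n_components]]

# Project the scaled dataset onto the selected eigenvectors
X_manual = X_scaled @ top_vectors

# Compare with sklearn; eigenvectors are only defined up to sign,
# so individual components may be flipped
X_sklearn = PCA(n_components=n_components).fit_transform(X_scaled)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))  # True
```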

Figure 9. PCA results without Sklearn (left), PCA results with Sklearn (right), Image by author

The first and second principal components of the dataset, calculated with these mathematical operations, are seen to be the same as the results obtained by importing the Sklearn library.

3.2. Is PCA one of the feature selection & feature extraction methods?

Both yes and no. Since principal component analysis reduces the dimensionality of the feature space, it can be perceived as extracting features or selecting the most effective features that affect the result. But understanding the theory above makes the distinction clear: beyond its machine learning application, PCA is about expressing the dataset in another coordinate system. We can think of it as converting a signal from the time axis to the frequency axis with the Fourier Transform.

Numeric and continuous variables are re-evaluated through their variance values, and the dataset is viewed from a different window with PCA. Although it is technically possible, applying PCA to categorical variables would not yield reliable results. Likewise, once the theory above is understood, performing feature selection with PCA only makes sense if the features that most affect the result are also the ones with the most variance. It remains technically possible, and the choice is at the developer’s discretion.

4. Implementation

4.1. Traditional Machine Learning Approaches

It was mentioned that PCA, despite some loss of information, is a very useful method for reducing the dimensionality of a dataset and the number of feature values. In image datasets, each pixel is considered a feature. In other words, a 128×128 RGB (3-channel) image has 128 × 128 × 3 = 49,152 features. This number is quite high for supervised learning models. In this section, XGBoost is applied to a kitchenware image dataset consisting of 81 cups, 74 dishes, and 78 plates, after the dataset is expanded with image augmentation and its dimensionality is reduced with PCA:

After the dataset is imported from the local folder, it is replicated 15 times with the defined ImageDataGenerator, giving 3495 samples. In the code, x represents the dataset and y represents the labels. Then, to measure the generalization performance of the model on data from a different source, 5 cups, 5 dishes, and 5 plates are downloaded and imported from a local folder as well. After the necessary preprocessing, these images are appended to x and their labels to y. They are combined with the training and test data so that exactly the same PCA transformation is applied to all of them.

Once the dataset is combined, PCA is imported from the sklearn library and the 49,152 pixels (features) are reduced to 300. At this point the 15-sample external dataset is split off again with NumPy, and the remaining 3495 samples are divided into a train set and a test set. The point here is not whether PCA must be used; it is shown only as one option, and feature selection with SelectPercentile would also have worked. Finally, after the labels are adapted to the XGBoost model, the model is trained on the training set, evaluated on the test set, and its predictions are examined on the separated 15-sample external dataset.
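
The original notebook is not reproduced here; the sketch below condenses the core of the pipeline under stated assumptions: the image loading and ImageDataGenerator augmentation are replaced by a random placeholder array, and x and y stand in for the flattened 128×128×3 kitchenware images and their cup/dish/plate labels.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder data standing in for the augmented kitchenware images:
# x would be the flattened 128x128x3 images, y the labels 0/1/2 (cup/dish/plate)
x = np.random.rand(400, 128 * 128 * 3)
y = np.random.randint(0, 3, size=400)

# Scale the pixel features, then reduce 49,152 features to 300 with PCA
x_scaled = StandardScaler().fit_transform(x)
x_pca = PCA(n_components=300).fit_transform(x_scaled)

# Split into train and test sets and fit an XGBoost classifier
x_train, x_test, y_train, y_test = train_test_split(
    x_pca, y, test_size=0.2, stratify=y, random_state=0)

model = XGBClassifier()
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
```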

Results are shown in Figure 10.

Figure 10. Confusion Matrix of the external dataset (left) and Confusion Matrix of the test dataset (right), Image by author

4.2. Deep Learning Approaches

Encoder and decoder structures are usually preferred for this kind of processing in Deep Learning. However, it is technically possible to apply PCA as well. Let’s classify the dataset prepared with the data import, preprocessing, and PCA dimensionality reduction steps above, using Dense layers.
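
A minimal sketch, assuming the PCA-reduced features and labels (x_train, y_train, x_test, y_test) from the pipeline above; the layer sizes and training settings are illustrative choices, not the article’s exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small fully connected classifier on the 300 PCA features
model = keras.Sequential([
    keras.Input(shape=(300,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # cup / dish / plate
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=20, batch_size=32,
          validation_data=(x_test, y_test))
```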

Results are shown in Figure 11:

Figure 11. Confusion Matrix of the external dataset (left) and Confusion Matrix of the test dataset (right), Image by author

5. PCA Types

5.1. Kernel PCA

Although PCA is a linear method, it may not give successful results in non-linear situations. Kernel PCA is a method that uses the kernel trick to separate data non-linearly.
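
A minimal sketch on a toy dataset of concentric circles (the dataset, kernel, and gamma value are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two classes that cannot be separated by a straight line
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05,
                                    random_state=0)

# Linear PCA leaves the classes entangled
X_lin = PCA(n_components=2).fit_transform(X_circles)

# Kernel PCA with an RBF kernel makes them linearly separable
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_circles)
```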

Figure 12. Dataset (left), Dataset with PCA (middle), Dataset with Kernel PCA (right), Image by author

5.2. Sparse PCA

Sparse PCA aims to make the models easier to interpret. While each principal component in standard PCA is a linear combination of all the features in the dataset, in Sparse PCA each principal component is a linear combination of only a subset of them.
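
A minimal sketch on the scaled cancer dataset from earlier; the alpha value controlling sparsity is an illustrative choice:

```python
from sklearn.decomposition import SparsePCA

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
X_spca = spca.fit_transform(X_scaled)

# Many loadings are exactly zero, so each component uses only a subset of features
print((spca.components_ == 0).sum(), "of", spca.components_.size, "loadings are zero")
```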

5.3. Randomized PCA

Randomized PCA uses a stochastic algorithm (randomized SVD) that quickly finds an approximation of the first principal components, which makes the PCA process faster.
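
In scikit-learn this is selected through the svd_solver argument of the regular PCA class:

```python
from sklearn.decomposition import PCA

# Randomized solver: approximates the leading components quickly
rpca = PCA(n_components=2, svd_solver="randomized", random_state=0)
X_rpca = rpca.fit_transform(X_scaled)
```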

5.4. Incremental PCA

Incremental PCA performs PCA on large datasets by processing them in mini-batches, so the whole dataset does not have to be held in memory at once.
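
A minimal sketch, feeding the scaled cancer dataset in mini-batches via partial_fit (with a truly large dataset the batches would be read from disk one at a time):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)

# Fit on 10 mini-batches instead of the whole dataset at once
for batch in np.array_split(X_scaled, 10):
    ipca.partial_fit(batch)

X_ipca = ipca.transform(X_scaled)
```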

The methods above are all available in the Sklearn library and can easily be applied to a dataset.

