This article is a continuation of the story Variable Reduction with Principal Component Analysis. In the previous post, I talked about one of the best-known and most widely used methods, Principal Component Analysis. It employs an efficient linear transformation that reduces the dimensionality of a high-dimensional dataset while capturing the maximum information content. It generates the principal components, which are linear combinations of the original features in the dataset. In addition, I showed step by step how to implement this technique with Python.
At first I thought that the post was enough to explain PCA, but I felt that something was missing. I implemented PCA with separate lines of code, which are inconvenient to reuse every time a different problem comes up. A better way is to create a class, which is effective when you want to encapsulate data structures and procedures in one place. Moreover, it's much easier to modify, since all the code lives in that one class.
Table of Contents:
1. Dataset
2. Implementation of PCA
3. PCA without standardization
4. PCA with standardization
5. PCA with Sklearn
1. Dataset
Before implementing the PCA algorithm, we are going to import the breast cancer Wisconsin dataset, which contains data on the breast cancer diagnoses of 569 patients [1].
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
import plotly.express as px

# load the dataset as pandas objects
data = load_breast_cancer(as_frame=True)
X, y, df_bre = data.data, data.target, data.frame

# map the numeric target to readable labels
diz_target = {0: 'malignant', 1: 'benign'}
y = np.array([diz_target[y1] for y1 in y])
df_bre['target'] = df_bre['target'].apply(lambda x: diz_target[x])
df_bre.head()

We can notice that there are 30 numerical features and a target variable, which specifies whether the tumour is benign (target=1) or malignant (target=0). I convert the target variable to a string, since it isn't used by PCA and we only need it for the visualizations later.
In this case, we want to understand how the variability of the features differs between benign and malignant tumours. This is really hard to show with a simple exploratory analysis, since we have more than two covariates. For example, we can try to visualize a scatter matrix with only the first five features, coloured by the target variable.
fig = px.scatter_matrix(df_bre,dimensions=list(df_bre.columns)[:5], color="target")
fig.show()

Certainly, we can observe two different clusters in all these scatter plots, but it’s messy if we plot all the features at the same time. Consequently, we need a compact representation of this multivariate dataset, which can be provided by the Principal Component Analysis.
2. Implementation of PCA

The steps to obtain the principal components (or k-dimensional feature vectors) can be summarized as follows: standardize the data if needed, compute the covariance matrix, extract its eigenvalues and eigenvectors, sort them in decreasing order of eigenvalue, and project the data onto the top k eigenvectors. The same logic will be applied to build the class.
We define the PCA_impl class, which has three attributes initialized at the beginning. The most important one is the number of components we want to extract. In addition, we can reproduce the same results every time by setting random_state equal to True, and we can choose to standardize the dataset only if we need it.
This class also includes two methods, fit and fit_transform, similarly to scikit-learn's PCA. While the first method provides most of the procedure to calculate the principal components, the fit_transform method also applies the transformation to the original feature matrix X. In addition to these two methods, I also wanted to visualize the principal components without specifying the Plotly Express functions every time. This can be really useful to speed up the analysis of the latent variables generated by PCA.
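The gist with the full class definition isn't reproduced in this text. Below is a minimal sketch of what PCA_impl could look like, based on the description above (covariance-matrix PCA with optional standardization and Plotly Express plots); the attribute and method names follow the article, while the signatures and internal details are my assumptions.

import numpy as np
import plotly.express as px

class PCA_impl:
    # Covariance-based PCA; a sketch, not the author's original code.
    def __init__(self, n_components, random_state=False, standardize=False):
        self.n_components = n_components
        self.random_state = random_state
        self.standardize = standardize

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if self.random_state:
            np.random.seed(42)  # seed value is an assumption; the eigendecomposition itself is deterministic
        if self.standardize:
            X = (X - X.mean(axis=0)) / X.std(axis=0)
        else:
            X = X - X.mean(axis=0)  # centre the data in any case
        # eigendecomposition of the covariance matrix
        eig_vals, eig_vecs = np.linalg.eigh(np.cov(X.T))
        # sort the eigenpairs by decreasing eigenvalue
        idx = np.argsort(eig_vals)[::-1]
        eig_vals, eig_vecs = eig_vals[idx], eig_vecs[:, idx]
        # variance explained by each component and its cumulative sum
        self.var_explained = eig_vals / eig_vals.sum()
        self.cum_var_explained = np.cumsum(self.var_explained)
        # projection matrix built from the top k eigenvectors
        self.W = eig_vecs[:, :self.n_components]
        self._X, self._y = X, y  # y is kept only to colour the plots
        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        self.X_proj = self._X @ self.W  # project the data onto the principal components
        return self.X_proj

    def pca_plot2d(self):
        fig = px.scatter(x=self.X_proj[:, 0], y=self.X_proj[:, 1], color=self._y,
                         labels={'x': 'PC 1', 'y': 'PC 2'})
        fig.show()

    def pca_plot3d(self):
        fig = px.scatter_3d(x=self.X_proj[:, 0], y=self.X_proj[:, 1], z=self.X_proj[:, 2],
                            color=self._y, labels={'x': 'PC 1', 'y': 'PC 2', 'z': 'PC 3'})
        fig.show()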
3. PCA without standardization
Finally, the PCA_impl class is defined. We only need to call the class and the corresponding methods, without any effort.
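The original cell isn't shown in this text; assuming the sketch above, the call could look like the following (the exact signature in the author's gist may differ):

# extract the first three components without standardization
pca1 = PCA_impl(n_components=3, random_state=True, standardize=False)
X_pca = pca1.fit_transform(X, y)

# variance explained by each component and its cumulative sum
print(pca1.var_explained[:3])
print(pca1.cum_var_explained[:3])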

We can access the var_explained and cum_var_explained attributes, which were calculated within the fit and fit_transform methods. It's worth noticing that we capture 98% of the variance with just one component. Let's also visualize the 2D and 3D scatterplots using the methods defined previously:
pca1.pca_plot2d()

From the visualization, we can observe that two clusters emerge: one, marked in blue, representing the patients with malignant cancer, and the other representing the patients with benign cancer. Moreover, it seems that the blue cluster contains much more variability than the other one. In addition, we see a slight overlap between the two groups.
pca1.pca_plot3d()

Now, we look at the 3D scatterplot with the first three components. It's less clear than the previous scatterplot, but a similar behaviour emerges: there are surely two distinct groups based on the target variable. This three-dimensional representation also reveals new information: two patients with malignant cancer appear to have completely different values from all the other patients. This aspect was barely noticeable in the 2D plot or in the scatter matrix we displayed previously.
4. PCA with standardization
Let’s replicate the same procedure of the previous section. We only add the standardization at the beginning to check if there are any differences in the results.

Unlike the previous case, we can notice that the range of values of the principal components is more restricted and that 80% of the explained variance is captured with three components. In particular, the contribution of the first component drops from 0.99 to 0.44. This can be justified by the fact that, after standardization, all variables are on the same scale and, consequently, PCA gives equal weight to each feature.
pca1.pca_plot2d()

These observations are confirmed by the scatterplot of the first two components: the clusters are much more distinct, and the component values span a smaller range.
pca1.pca_plot3d()

The 3D representation is easier to read and comprehend. Finally, we can conclude that the two groups of patients show different feature variability. Moreover, the two data points that lie apart from the rest of the data are still visible.
5. PCA with Sklearn
At this point, we can apply the PCA implemented by Sklearn to compare it with my implementation. I should point out that there are some differences to take into account in this comparison. While my implementation is based on the covariance matrix, scikit-learn's PCA centers the input data and employs Singular Value Decomposition to project the data to a lower-dimensional space.
We saw earlier that standardization is a very important step before applying PCA. Since Sklearn's algorithm already subtracts the mean from each feature column, we only need to divide each numerical variable by its own standard deviation.
X_copy = X.copy().astype('float32')
X_copy /= np.std(X_copy, axis=0)
Now, we pass the number of components and the random_state to the PCA class and call the fit_transform method to obtain the principal components.
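The gist isn't included here; a minimal equivalent, assuming three components and an arbitrary random_state value, would be:

from sklearn.decomposition import PCA

# random_state=42 is an assumption; it only matters for the randomized solvers
pca = PCA(n_components=3, random_state=42)
components = pca.fit_transform(X_copy)

print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))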

We obtain the same results as with the standardized version of the implemented PCA.
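As a quick sanity check (not part of the original article), we can compare the two projections component by component, allowing for a sign flip and assuming the X_pca_std variable from the hypothetical standardized run above:

# principal components are defined only up to a sign, so compare each column up to a flip
for k in range(3):
    a, b = X_pca_std[:, k], components[:, k]
    sign = 1.0 if np.dot(a, b) >= 0 else -1.0
    print(f"PC {k + 1} matches:", np.allclose(a, sign * b, atol=1e-3))

We can also reproduce the 2D and 3D scatterplots directly with Plotly Express: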
fig = px.scatter(components, x=0, y=1, color=df_bre['target'], labels={'0': 'PC 1', '1': 'PC 2'})
fig.show()

fig = px.scatter_3d(components, x=0, y=1, z=2, color=df_bre['target'], labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'})
fig.show()

In the same way, the scatterplots replicate what we have seen in the previous section.
Final thoughts:
I hope you found this post useful. The intention of this article was to provide a more compact implementation of Principal Component Analysis. In this case, my implementation and Sklearn's PCA produced the same results, but they can sometimes differ slightly on other datasets. The GitHub code is here. Thanks for reading. Have a nice day!
References:
[1] Breast Cancer Wisconsin (Diagnostic) dataset