Too Many Features? Let’s Look at Principal Component Analysis

Home-brewed machine learning model series

Hector Andres Mejia Vallejo
Towards Data Science


The companion repo is available here!

The curse of dimensionality is one major problem in machine learning. As the number of features increases, so does the complexity of the model. Moreover, if there is not enough training data, this complexity results in overfitting.

In this entry, I will introduce Principal Component Analysis (PCA). First, I will explain why too many features are a problem. Then, I will cover the math behind PCA and why it works. Additionally, PCA will be broken down into steps, accompanied by visual examples and code snippets. Moreover, the advantages and disadvantages of PCA will be explained. Lastly, these steps will be encapsulated in a Python class for later use.

Projecting a vector into a lower space is like casting a shadow in real life. PCA finds the direction from where to cast these shadows. Photo by inbal marilli on Unsplash

Note for the reader: If you are not interested in the math explanation and just want to see the practical examples and how PCA works, jump to the section PCA in practice. If you are only interested in the Python class, head to Home-brewed PCA implementation.

What is the problem with too many features?

Take a look at the feature space in Fig 1. There are too few examples to fill the whole space, so a model of this data might not generalize well to new, unseen examples.

Fig 1. An example of 2-dimensional feature space. From my own authorship, inspired by [1].

What happens if we add another feature? Let’s have a look at the new feature space in Fig. 2. You can see that there is even more empty space than in the previous example. As the number of features increases, the model becomes more and more likely to overfit the current data. That is why there are techniques to reduce the dimensionality of the data and alleviate this problem [1].

Fig 2. The same example with an additional feature. From my own authorship, inspired by [1].

What is the goal of PCA?

In a few words, the purpose of PCA is to extract new, uncorrelated features of lower dimensions that maximize the amount of information kept from the original data. The measure of information in this context is the variance. Let’s see why:

This technique is based on the assumption that our d-dimensional data point x can be represented by a linear combination of vectors of an orthonormal basis [1]:
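
x = \sum_{i=1}^{d} y_i v_i

where the v_i are the orthonormal basis vectors and the y_i are the coefficients (importance values) of x in that basis.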

Do not worry, I will explain where we get the vectors of said basis later. Moreover, we can extract a representation using m out of the d vectors in the combination (m < d):
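
\hat{x} = \sum_{i=1}^{m} y_i v_i

Here \hat{x} denotes the reduced, m-dimensional representation of x.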

Of course, we are not getting an exact representation since there are fewer features, but at least we try to minimize the loss of information. Let us define the Mean Squared Error (MSE) between the original example x and the approximation:
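
MSE = E\left[ \left\| x - \hat{x} \right\|^2 \right]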

As both summations use the same terms with different cutoffs, the difference is just the leftover part of the sum:
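
MSE = E\left[ \left\| \sum_{i=m+1}^{d} y_i v_i \right\|^2 \right]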

We know by our starting hypothesis that x is a combination of orthonormal vectors. Hence, the dot product of two distinct basis vectors is zero, and each of their Euclidean norms is one. Thus:
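
MSE = \sum_{i=m+1}^{d} E\left[ y_i^2 \right]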

Solving for the importance value y_i by taking the dot product of both sides of the first equation with v_i:
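
y_i = v_i^T x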

Plugging that result into its expectation:
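
E\left[ y_i^2 \right] = E\left[ (v_i^T x)^2 \right] = E\left[ v_i^T x x^T v_i \right] = v_i^T E\left[ x x^T \right] v_i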

We can see that if x is centered (mean equal to zero), then E[xx^T] is the covariance matrix of the whole data, and this result is nothing more than the variance of the data projected onto v_i. By choosing the vectors v_i that maximize this variance, we effectively minimize the representation error.

Where does this orthonormal basis come from?

As previously stated, we want to get m vectors that maximize the variance:
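
\max_{v_1, \ldots, v_m} \sum_{i=1}^{m} v_i^T \Sigma v_i \quad \text{subject to} \quad v_i^T v_i = 1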

If we take the whole data matrix, it can be seen that each v_i is a projection direction: the data is going to be projected onto a space of lower dimensionality.

If we diagonalize the covariance matrix Σ using spectral decomposition we get:
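
\Sigma = U \Lambda U^T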

Where U is a matrix with the normalized eigenvectors of Σ, and Λ is a diagonal matrix that contains the eigenvalues of Σ in descending order. This is possible since Σ is a real, symmetric matrix.

Furthermore, as Λ only contains non-zero values on the diagonal, we can rewrite the above equation as:
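
v^T \Sigma v = v^T U \Lambda U^T v = \sum_{i=1}^{d} \lambda_i w_i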

with:
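
w_i = \left( u_i^T v \right)^2

where u_i denotes the i-th column of U and \lambda_i its associated eigenvalue.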

Notice that the columns of U and the vector v are normalized. Thus, each squared dot product (u_i^T v)^2 is a value in [0, 1], and the weights w_i sum to one, so w is itself a normalized vector of weights:
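
\sum_{i=1}^{d} w_i = \sum_{i=1}^{d} \left( u_i^T v \right)^2 = \left\| U^T v \right\|^2 = \left\| v \right\|^2 = 1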

From here, interesting properties arise.

The first principal component

Recall the optimization problem. Since the eigenvalues are sorted in descending order and w is a normalized vector of weights, our best option is to put all the weight on the first eigenvalue, that is, w = (1, 0, 0, …). As a consequence, the upper bound is attained when:
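
v = u_1, \quad \text{so that} \quad v^T \Sigma v = \sum_{i=1}^{d} \lambda_i w_i = \lambda_1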

The projection direction that maximizes the variance turns out to be the eigenvector associated with the largest eigenvalue!

The rest of the components

Once the first principal component is set, a new restriction to the optimization problem is added:
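
v_2^T u_1 = 0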

It means that the new component v_2 must be orthogonal to the previous component, the eigenvector u_1, so that the information is not redundant. It can be proved that all d components correspond to the d normalized eigenvectors of Σ associated with the eigenvalues, in descending order. Take a look at these notes for a formal proof of this claim [2].

PCA in practice

From the theoretical description above, we can now lay out the steps needed to compute the principal components of a dataset. Let the initial dataset be a random sample from the following 2D normal distribution:

import numpy as np

mean = [3, 3]
var = [[6, 3],
       [3, 3.5]]
n = 100
data_raw = np.random.multivariate_normal(mean, var, n)
Fig 3. Original point cloud. Non-centered. From my own authorship.

1. Centering the data

The first step is to move the cloud to the origin of the coordinate system so the data has zero mean. This step is performed by subtracting the sample mean from every point in the dataset.

import numpy as np
data_centered = data_raw - np.mean(data_raw, axis=0)
Fig 4. Centered point cloud. From my own authorship.

2. Computing the covariance matrix

The covariance matrix in the derivation above is the population covariance matrix Σ. In practice, that information is not available to us, as we only have one sample. Therefore, we can approximate that parameter with the sample covariance matrix S.

Recall that the data is already centered. Thus:
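
S = \frac{1}{n-1} \sum_{i=1}^{n} x_i x_i^T

where n is the number of examples and x_i is the i-th (centered) data point.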

We can write this compactly using matrix multiplications. This can also help us vectorize computations:
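
S = \frac{1}{n-1} X X^T

with X being the data matrix that has the features in its rows, as discussed below.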

cov_mat = np.matmul(data_centered.T, data_centered)/(len(data_centered) - 1)
# > array([[5.62390186, 2.47275007],
# > [2.47275007, 3.19395349]])

The transposed matrix is passed as the first argument because, in the mathematical formulation, the data matrix has the features in its rows and the subjects in its columns. In the implementation, the opposite happens, since in almost every system the events, subjects, logs, and so forth are stored in rows.

3. Perform the eigendecomposition of the covariance matrix

The eigenvalues and eigenvectors are computed using eigh() from scipy:

from scipy.linalg import eigh
eigvals, eigvecs = eigh(cov_mat)

# Sorting the eigenvalues and eigenvectors
indices = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[indices], eigvecs[:,indices]

eigvecs
# > array([[-0.82348021, 0.56734499],
# > [-0.56734499, -0.82348021]])

As explained earlier, the eigenvalues represent the variance of the principal components, and the eigenvectors are the projection directions:

Fig 5. Plot of the projection directions and the point cloud. From my own authorship.

You can see that a new coordinate system is created using the directions of the principal components. Furthermore, the eigenvalues and eigenvectors must be stored to transform new data afterward.

4. Enforce determinism

The coefficients of the eigenvectors are determined only up to their sign: PCA can have multiple valid orientations. Thus, we enforce a deterministic result by taking the eigenvector matrix and, for each of its columns, applying the sign of the entry with the largest absolute value within that column.

max_abs_cols = np.argmax(np.abs(eigvecs), axis=0)
signs = np.sign(eigvecs[max_abs_cols, range(eigvecs.shape[1])])
eigvecs = eigvecs*signs
eigvecs
# > array([[ 0.82348021, -0.56734499],
# > [ 0.56734499, 0.82348021]])

5. Extract the new features

Each new feature (principal component) is extracted by performing the dot product between each point in the original feature space and each eigenvector:

new_features = np.dot(data_centered, eigvecs)

For this particular example, after computing the components, the new points in the space are depicted as follows:

Fig 6. Plot of the uncorrelated point cloud, using the features from the PCs. Notice that it is just a rotation of the original point cloud. From my own authorship.

Notice that this result is basically a rotation of the original cloud of points in a way that the attributes are uncorrelated.

6. Reduce dimensionality

So far, all the principal components have been computed in full so that we could understand them visually. What is left is to choose how many components to keep. We resort to the eigenvalues for this task, since they represent the variance of each principal component.

The ratio of the variance held by component i is given by:
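
\frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}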

And the ratio of the variance preserved by choosing m components is given by:
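
\frac{\sum_{i=1}^{m} \lambda_i}{\sum_{j=1}^{d} \lambda_j}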

If we visualize the variance for each component in our example, we arrive at the following:

import matplotlib.pyplot as plt

# Variance of each individual component as bars
plt.bar(
    [f"PC_{i}" for i in range(1, len(eigvals) + 1)],
    eigvals / sum(eigvals)
)

# Percentage held by m components as the line
plt.plot(
    [f"PC_{i}" for i in range(1, len(eigvals) + 1)],
    np.cumsum(eigvals) / sum(eigvals),
    color='red'
)
plt.scatter(
    [f"PC_{i}" for i in range(1, len(eigvals) + 1)],
    np.cumsum(eigvals) / sum(eigvals),
    color='red'
)
plt.show()
Fig 7. Variance explained by each principal component. From my own authorship.

In this case, PC1 represents 80% of the variance of the original data, with the remaining 20% belonging to PC2. Furthermore, we can choose to use only the first principal component, in which case the data would look like this:

Fig 8. Projection of the data in the direction of the first eigenvector. From my own authorship.

This is the projection of the data in the direction of the first eigenvector. It does not look very useful right now. What if we instead choose data that belong to three classes? How would PCA look?

PCA on multiclass data

Let us create a dataset with three classes that are linearly separable:

from sklearn.datasets import make_blobs

# make_blobs generates three Gaussian clusters by default
X, y = make_blobs()

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Fig 9. Multi-class data. From my own authorship.
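
As a quick sketch, the components shown next can be reproduced by reusing the code from the steps above on this new data (the imports and variable names from the previous snippets are assumed to still be in scope; X_pcs is just the name I use for the projected data):

# Steps 1-5 applied to the multi-class data
X_centered = X - np.mean(X, axis=0)
cov_mat = np.matmul(X_centered.T, X_centered) / (len(X_centered) - 1)

eigvals, eigvecs = eigh(cov_mat)
indices = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[indices], eigvecs[:, indices]

max_abs_cols = np.argmax(np.abs(eigvecs), axis=0)
signs = np.sign(eigvecs[max_abs_cols, range(eigvecs.shape[1])])
eigvecs = eigvecs * signs

X_pcs = np.dot(X_centered, eigvecs)
plt.scatter(X_pcs[:, 0], X_pcs[:, 1], c=y)
plt.show()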

If we apply PCA to the data above, this would be the plot of the principal components:

Fig 10. Multi-class uncorrelated data. From my own authorship.

And this would be the plot of the first component (the projection of the data in the direction of the eigenvector corresponding to the largest eigenvalue):

Fig 11. Projection of the data in the direction of the first eigenvector. This is the dimensionality reduction. From my own authorship.

It works! The data still looks easily separable by a linear model.

Advantages and disadvantages

Like everything in science, there is no silver bullet. Here is a list of advantages and disadvantages that you should take into account before using PCA with real-world data.

Advantages of PCA

  1. Dimensionality Reduction: PCA allows for the reduction of high-dimensional data into a lower-dimensional space while preserving most of the important information. This can be useful for data visualization, computational efficiency, and dealing with the curse of dimensionality.
  2. Decorrelation: PCA transforms the original variables into a new set of uncorrelated variables called principal components. This decorrelation simplifies the analysis and can improve the performance of downstream machine learning algorithms that assume independence of features.
  3. Noise Reduction: The lower-dimensional representation obtained through PCA tends to filter out noise and focus on the most significant variations in the data. This can enhance the signal-to-noise ratio and improve the robustness of subsequent analyses.

Disadvantages of PCA

  1. Linearity Assumption: PCA assumes that the underlying data relationships are linear. If the data has complex non-linear relationships, PCA may not capture the most meaningful variations and could provide suboptimal results.
  2. Interpretability: The principal components obtained from PCA are linear combinations of the original features. It can be difficult to relate the principal components back to the original variables and understand their exact meaning.
  3. Sensitivity to Scaling: PCA is sensitive to the scaling of the input variables. If the variables have different scales, those with larger variances can dominate the analysis, potentially leading to biased results. Proper feature scaling is crucial for obtaining reliable results with PCA (a short standardization sketch follows this list).
  4. Outliers: PCA is sensitive to outliers since it focuses on capturing the variance in the data. Outliers can significantly influence the principal components and distort the results.
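
To illustrate the scaling point, here is a minimal sketch (not part of the original walkthrough) that standardizes each feature to zero mean and unit variance before applying PCA; X_std is just the name I use for the scaled data:

# Standardize each feature so that no variable dominates the principal
# components purely because of its scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA (the steps above or the class below) can then be fitted on X_std.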

Home-brewed PCA implementation

Now that we have covered the details of Principal Component Analysis, all that remains is to create a class that encapsulates the aforementioned behavior and that could be reused in future problems.

For this implementation, the scikit-learn interface will be used, which has the following methods:

  • fit()
  • transform()
  • fit_transform()

Constructor

No complex logic is needed. The constructor will just define the number of components (features) that the transformed data will have.

import numpy as np
from scipy.linalg import eigh


class PCA:
    """Principal Component Analysis."""

    def __init__(self, n_components):
        """Constructor of the PCA class.

        Parameters:
        ===========
        n_components: int
            The number of dimensions for the transformed data.
            Must be less than or equal to n_features.
        """
        self.n_components = n_components
        self._fit_instance = False

The fit method

The fit method will apply steps 1–4 from the previous section.

  • Centering the data
  • Computing the covariance matrix
  • Computing eigenvalues, eigenvectors and sorting them
  • Enforcing determinism by flipping the signs of the eigenvectors

It will also store the eigenvalues and vectors, as well as the sample mean, as object attributes to transform new data later.

    def fit(self, X):
        """Compute eigenvectors to transform data later.

        Parameters:
        ===========
        X: np.array of shape [n_examples, n_features]
            The data matrix

        Returns:
        ===========
        None
        """
        # Fit the mean of the data and center it
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean

        # Compute covariance matrix
        cov_mat = np.matmul(X_centered.T, X_centered)/(len(X_centered) - 1)

        # Compute eigenvalues, eigenvectors and sort them
        eigenvalues, eigenvectors = eigh(cov_mat)
        self.eigenvalues, self.eigenvectors = self._sort_eigen(eigenvalues, eigenvectors)

        # Get the explained variance ratios
        self.explained_variance_ratio = self.eigenvalues/np.sum(self.eigenvalues)

        # Enforce determinism by flipping the eigenvectors
        self.eigenvectors = self._flip_eigenvectors(self.eigenvectors)[:, :self.n_components]

        self._fit_instance = True

The transform method

It will apply steps 1, 5, and 6:

  • Centering new data using the stored sample mean
  • Extracting the new PC features
  • Reducing the dimensionality by picking n_components dimensions.

    def transform(self, X):
        """Project the data in the directions of the eigenvectors.

        Parameters:
        ===========
        X: np.array of shape [n_examples, n_features]
            The data matrix

        Returns:
        ===========
        pcs: np.array[n_examples, n_components]
            The new, uncorrelated features from PCA.
        """
        if not self._fit_instance:
            raise Exception("PCA must be fitted to the data first! Call fit()")

        X_centered = X - self.mean
        return np.dot(X_centered, self.eigenvectors)

The fit_transform method

For simplicity of implementation, this method applies the fit() function first and transform() afterward. I am sure you can figure out a more clever definition.

    def fit_transform(self, X):
        """Fits PCA and transforms the data."""
        self.fit(X)
        return self.transform(X)

Helper functions

These methods were defined as separate components, instead of applying all the steps inside the fit() function, to make the code more readable and maintainable.

    def _flip_eigenvectors(self, eigenvectors):
        """Enforce determinism by changing the signs of the eigenvectors."""
        max_abs_cols = np.argmax(np.abs(eigenvectors), axis=0)
        signs = np.sign(eigenvectors[max_abs_cols, range(eigenvectors.shape[1])])
        return eigenvectors*signs

    def _sort_eigen(self, eigenvalues, eigenvectors):
        """Sort eigenvalues in descending order and their corresponding eigenvectors."""
        indices = eigenvalues.argsort()[::-1]
        return eigenvalues[indices], eigenvectors[:, indices]

Testing the class

Let’s use the previous example with our PCA class:

from pca import PCA

# Using our PCA implementation
pca = PCA(n_components=1)
X_transformed = pca.fit_transform(X)

# Plotting the first PC
plt.scatter(X_transformed[:, 0], [0] * len(X_transformed), c=y)
plt.show()
Fig 12. Results using the PCA class. From my own authorship.

Conclusion

Having many features and little data can be harmful and will most likely result in overfitting. Principal Component Analysis is a tool that can help alleviate this problem. It is a dimensionality reduction technique that works by finding projection directions for the data such that as much of the original variability as possible is preserved and the resulting features are uncorrelated. Moreover, the variance explained by each new feature, or principal component, can be measured. Then, the user can choose how many principal components, and how much variance, are enough for the task. Finally, be sure to know your data first, as PCA captures only linear relationships and can be sensitive to outliers.

References

[1] Fernández, A. Dimensionality Reduction. Universidad Autónoma de Madrid. Madrid, Spain. 2022.

[2] Berrendero, J. R. Regresión lineal con datos de alta dimensión. Universidad Autónoma de Madrid. Madrid, Spain. 2022.

Connect with me on LinkedIn!
