
Principal Component Analysis (PCA)

Intuition, math, and stonks

This is the first post in a two-part series on Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Although they have similarities (such as their names), they each achieve different tasks. In this post, I will describe what PCA is, how it works, and, as an example, use it to define an S&P 500 index fund. Example code and other related resources can be found in the last section of this post.


PCA

Imagine a big rock band with about 20 members. It features guitarists, background singers, pianists, keyboardists, a horn section, drummers, percussionists, etc. A big band needs a big stage. This is not a problem for venues like Madison Square Garden or Wembley Stadium, but what if they are starting out and can only play coffee shops?

Well, instead of 3 guitars, there could be one. Instead of 2 drummers and a percussionist, one could play the bongos. And instead of a piano, electric piano, and synthesizer on stage, one member could play a keyboard. You wouldn’t get the full details of each song, but the songs could still be played in an MTV-unplugged way.

Visual analogy of PCA. Image by author.

This is exactly what Principal Component Analysis (PCA) does, but instead of a band, we have a dataset. Instead of players, we have variables. And instead of a song, we have what the dataset represents. Data often have redundant players, like having three background vocalists. Sometimes you need them all, but mostly, they stand around and look pretty (here comes the vocalist hate mail). PCA rewrites the music so that fewer performers can play the same song.

Visualization of PCA. Image by author.

In other words, PCA reduces dimensionality and redundancy by combining the original variables in a way that maximizes variance [1]. This defines a new axis in the input space along which variance is maximized, as in the figure above. When using PCA, it is critical to autoscale the input variables. By autoscale, I mean subtracting the mean of each variable and dividing by its standard deviation. This ensures that the center of each new axis, i.e. the 0, corresponds to the mean.

Although it may not be obvious at first, mathematically this is equivalent to performing an eigenvector decomposition of the covariance matrix defined by the original variables. I will attempt to give a clear derivation in the following section. If you don’t care how it works, feel free to skip ahead to the example at the end of the post.

How does it work?

As stated earlier, PCA creates new variables by combining the original variables in such a way that the variance is maximized. Mathematically we can write this as,

Single component PCA problem written in terms of an optimization problem.
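In symbols, writing Var(t) for the variance of the new variable t, this is:

$$ \max_{\mathbf{w}} \; \mathrm{Var}(\mathbf{t}) \quad \text{subject to} \quad \lVert \mathbf{w} \rVert^2 = 1 $$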

This is just a mathematical way of writing the following question: what value of w maximizes the variance of t, subject to the constraint that the norm squared of w is equal to 1? Here, t is a new variable we are defining in terms of the original data, X, and an optimal vector of weights, w. The specific form of t is given by,

Functional form of score vector or what I will call a principal component.
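Writing x_j for the jth column of X and p for the number of original variables:

$$ \mathbf{t} = X\mathbf{w} = \sum_{j=1}^{p} \mathbf{x}_j w_j $$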

The far right-hand side of the above expression rewrites the matrix multiplication as a summation. The jth column of X (a vector) is multiplied by the jth element of w (a number). This summation leaves us with a vector, more specifically, the score vector, which I will call the principal component.

Let’s return to the problem. What value of w maximizes the variance of t subject to the constraint that the norm squared of w is equal to 1? The first thing we can ask is: what is the variance of t? It is defined in the usual way,

Variance of t.
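For a score vector t with n elements and mean t̄, this is the familiar formula:

$$ \mathrm{Var}(\mathbf{t}) = \frac{1}{n-1} \sum_{i=1}^{n} \left( t_i - \bar{t} \right)^2 $$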

Notice that when the mean of t is zero, its variance is proportional to its norm squared, i.e. t · t, where "·" indicates the dot product. This is why it is important to autoscale your data! It then follows that the vector w which maximizes the variance of t is the same vector that maximizes the norm squared of t. Thus, we can reformulate the PCA optimization problem as,

Reformulation of PCA optimization problem.
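That is:

$$ \max_{\mathbf{w}} \; \mathbf{t} \cdot \mathbf{t} \quad \text{subject to} \quad \lVert \mathbf{w} \rVert^2 = 1 $$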

Using the definition of t above, this becomes:

Reformulation of PCA optimization problem in terms of X and w.
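Substituting t = Xw:

$$ \max_{\mathbf{w}} \; \mathbf{w}^{\mathsf{T}} X^{\mathsf{T}} X \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^{\mathsf{T}} \mathbf{w} = 1 $$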

Believe it or not, this is a pretty easy problem to solve. Don’t let the vectors and matrices distract you; this is like solving an introductory calculus problem. We just need to employ the Method of Lagrange Multipliers, a technique for removing constraints from an optimization problem. It is helpful because it defines a new objective function with the constraints built in, allowing us to take a derivative and obtain the optimal solution. If all that was gibberish, don’t worry about it. We need the following relevant equations:

Lagrangian and associated equations for single variable optimization problem and N constraints.
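For an objective f(x) subject to N constraints g_i(x) = 0, one common convention is:

$$ \mathcal{L}(x, \boldsymbol{\lambda}) = f(x) - \sum_{i=1}^{N} \lambda_i \, g_i(x), \qquad \frac{\partial \mathcal{L}}{\partial x} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_i} = 0 $$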

Here, L is the Lagrangian, which defines an equivalent (but hopefully easier) optimization problem. Now, constructing the Lagrangian for the PCA problem, we get,

Lagrangian for PCA optimization problem.
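With the conventions above:

$$ \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^{\mathsf{T}} X^{\mathsf{T}} X \mathbf{w} - \lambda \left( \mathbf{w}^{\mathsf{T}} \mathbf{w} - 1 \right) $$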

This gives us two equations for two unknowns (i.e. w and λ),

Resulting equations from Method of Lagrange Multipliers.
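Setting the partial derivatives of the Lagrangian to zero:

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2 X^{\mathsf{T}} X \mathbf{w} - 2 \lambda \mathbf{w} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \mathbf{w}^{\mathsf{T}} \mathbf{w} = 0 $$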

The second equation is just a restatement of our constraint, i.e. the norm squared of w is 1. The first equation is the interesting one. Rearranging terms, we get,

Rearranging first equation from Method of Lagrange Multipliers gives Eigenvalue problem.
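Namely:

$$ X^{\mathsf{T}} X \mathbf{w} = \lambda \mathbf{w} $$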

It turns out this first equation is an eigenvalue problem, which is a standard problem in linear algebra. A natural question is: what is the matrix on the left-hand side? Well, since we autoscaled our data, it is proportional to the covariance matrix of X, which has two nice properties: 1) it is symmetric and 2) it is positive semi-definite. This means we can always solve this eigenvalue problem! The eigenvector corresponding to the largest eigenvalue then gives the optimal weights for defining our single principal component.

This naturally extends to multiple principal components. Instead of stopping at the largest eigenvalue, we can sort the eigenvalues from largest to smallest. The eigenvectors corresponding to these sorted eigenvalues define the principal components, such that those associated with larger eigenvalues capture more of the variance in X. This yields an ordered set of new variables in which each subsequent variable contains less information about our original dataset.

Although this may have been more math than you have an appetite for when reading an internet blog, I hope the what and how of PCA are now clearer. A few take-home points are given under Key Points. Next, I turn to the why of PCA by way of a concrete example. The example below uses PCA to create an S&P 500 index fund.

Key Points

  • New variables are defined by a linear combination of original variables
  • Each subsequent new variable contains less information
  • Applications: dimensionality reduction, clustering, outlier identification


Example: S&P 500 Index Fund

At the outset, I want to disclose I am not a financial advisor. I have never taken a finance class, and this is not a recommendation of how to invest your money. This is just a fun example of what PCA can do—an example Jupyter Notebook can be found in the GitHub repo.

The goal here is to create an S&P 500 index fund. An index fund is an investment portfolio constructed to match or track a specific market [2]. For example, if you think the oil market is a great investment, investing in every oil company is not feasible. That is where index funds are beneficial because they (in theory) will mimic the fluctuations of a market without the cost of investing in each and every company. If we think about it, this is what PCA does. It attempts to capture variations with the fewest variables possible.

The first lines of code get actual S&P 500 data. The yfinance Python module is used to get updated close prices [3]. S&P 500 ticker names are grabbed from Wikipedia [4].

Importing modules and ticker names. Image by author.
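A minimal sketch of this step (the exact variable names in the notebook may differ):

```python
import pandas as pd
import yfinance as yf

# Grab the current list of S&P 500 tickers from Wikipedia [4]
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tickers = pd.read_html(url)[0]["Symbol"].tolist()

# yfinance expects dashes rather than dots in ticker names (e.g. BRK.B -> BRK-B)
tickers = [t.replace(".", "-") for t in tickers]
```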

Next, we pull stock data for 2020 using yfinance.

Pulling ticker data using yfinance. Image by author.
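Something along these lines, assuming daily close prices over 2020:

```python
# Download daily price data for every ticker over 2020
data = yf.download(tickers, start="2020-01-01", end="2020-12-31")

# Keep close prices: rows are dates, columns are ticker names;
# drop tickers with missing values so PCA gets a complete matrix
close = data["Close"].dropna(axis=1)
```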
Preview of Pandas dataframe with close prices. Rows are dates. Columns are stock ticker names. Image by author.

Now, we import sklearn and use the built-in PCA functionality. Note that before we run PCA we need to autoscale the data, as defined earlier. We also print the explained variance of each principal component, which is a measure of how much information it contains about the dataset.

Import sklearn. Autoscale data. Apply PCA. Print explained variance. Image by author.
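A sketch of this step using scikit-learn:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Autoscale: zero mean and unit standard deviation for each stock
X = StandardScaler().fit_transform(close.values)

# Fit PCA and print the fraction of variance each component explains
pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)
```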

Next, we sum the weights of the first 3 principal components and define an index fund using the top 61 variables based on relative weight.

Define index fund. Image by author.
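One plausible reading of this step; summing absolute weights is my assumption here, since raw component weights can be negative:

```python
import numpy as np

# Combine the weights of the first 3 principal components
# (taking absolute values is an assumption on my part)
combined = np.abs(pca.components_[:3]).sum(axis=0)

# Keep the 61 stocks with the largest combined weight and renormalize
top = np.argsort(combined)[::-1][:61]
portfolio = pd.Series(combined[top] / combined[top].sum(),
                      index=close.columns[top])
```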

We can visualize the portfolio using a bar plot.

The relative weighting of stocks in portfolio. Image by author.
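For example:

```python
import matplotlib.pyplot as plt

# Bar plot of each stock's relative weight in the portfolio
portfolio.sort_values(ascending=False).plot(kind="bar", figsize=(14, 4))
plt.ylabel("Relative weight")
plt.tight_layout()
plt.show()
```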

Finally, we compare the actual S&P 500 fluctuations to our index fund.

Comparing close prices of the S&P 500 to the index fund over time. Image by author.
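A sketch of the comparison, normalizing both series to their first trading day (using the ^GSPC ticker as the S&P 500 benchmark is my choice here):

```python
# The S&P 500 index itself, via its Yahoo Finance ticker
sp500 = yf.download("^GSPC", start="2020-01-01", end="2020-12-31")["Close"].squeeze()

# Daily value of the weighted portfolio
fund = (close[portfolio.index] * portfolio).sum(axis=1)

# Normalize both to 1 on the first day and plot
plt.plot(sp500 / sp500.iloc[0], label="S&P 500")
plt.plot(fund / fund.iloc[0], label="Index fund")
plt.legend()
plt.show()

# Percent return over the year
print(f"S&P 500:    {100 * (sp500.iloc[-1] / sp500.iloc[0] - 1):.1f}%")
print(f"Index fund: {100 * (fund.iloc[-1] / fund.iloc[0] - 1):.1f}%")
```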

Computing the percent return of each, we find the S&P 500 had an actual return of about 20% over the course of 2020, while the index fund returned about 25%.

Conclusion

Principal Component Analysis (PCA) is a popular and powerful tool in Data Science. It provides a way to reduce redundancy in a set of variables. We’ve seen that this is equivalent to an eigenvector decomposition of the data’s covariance matrix. Applications for PCA include dimensionality reduction, clustering, and outlier detection. In my next post, I will discuss a similar but completely different technique, Independent Component Analysis (ICA).

👉 More in this series: Independent Component Analysis | GitHub repo



Resources

Connect: My website | Book a call

Socials: YouTube 🎥 | LinkedIn | Twitter

Support: Buy me a coffee ☕️



References

[1] R. Bro and A. K. Smilde, Anal. Methods, 2014, 6, 2812–2831.

[2] https://www.investopedia.com/terms/i/indexfund.asp

[3] https://pypi.org/project/yfinance/

[4] https://medium.com/wealthy-bytes/5-lines-of-python-to-automate-getting-the-s-p-500-95a632e5e567

[5] Golden, R. (2020). Statistical Machine Learning: A Unified Framework. Boca Raton: CRC Press.

