This is the first post in a two-part series on Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Although they have similarities (such as their names), they each achieve different tasks. In this post, I will describe what PCA is, how it works, and, as an example, use it to define an S&P 500 index fund. Example code and other related resources can be found in the last section of this post.
PCA
Imagine a big rock band with about 20 members. It features guitarists, background singers, pianists, keyboardists, a horn section, drummers, percussionists, etc. A big band needs a big stage. This is not a problem for venues like Madison Square Garden or Wembley Stadium, but what if they are starting out and can only play coffee shops?
Well, instead of 3 guitars, there could be one. Instead of 2 drummers and a percussionist, one could play the bongos. And instead of a piano, electric piano, and synthesizer on stage, one member could play a keyboard. You wouldn’t get the full details of each song, but the songs could still be played in an MTV-unplugged way.

This is exactly what Principal Component Analysis (PCA) does, but instead of a band, we have a dataset. Instead of players, we have variables. And instead of a song, we have what the dataset represents. Data often have redundant players, like having three background vocalists. Sometimes you need them all, but mostly, they stand around and look pretty (here comes the vocalist hate mail). PCA rewrites the music so that fewer performers can play the same song.

In other words, PCA reduces dimensionality and redundancy by combining the original variables in a way that maximizes variance [1]. Each principal component defines a new axis in the input space along which the variance is maximized, as illustrated in the figure above. When using PCA, it is critical to autoscale the input variables, i.e. subtract the mean of each variable and divide by its standard deviation. This ensures that the origin of each new axis (the 0) corresponds to the mean of the data.
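As a quick illustration, a minimal autoscaling step might look like the following (the array names here are my own, not taken from the post's code):

```python
import numpy as np

# X is an (n_samples, n_variables) data matrix
X = np.random.default_rng(0).normal(size=(100, 5))

# autoscale: subtract each column's mean and divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # ~0 for every column
print(X_scaled.std(axis=0).round(6))   # 1 for every column
```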
Although it may not be obvious at first, mathematically this is equivalent to performing an eigenvector decomposition of the covariance matrix defined by the original variables. I will attempt to give a clear derivation in the following section. If you don’t care how it works, feel free to skip ahead to the example at the end of the post.
How does it work?
As stated earlier, PCA creates new variables by combining the original variables in such a way that the variance is maximized. Mathematically we can write this as,

$$\max_{w}\ \mathrm{Var}(t) \quad \text{subject to} \quad \|w\|^2 = 1$$
This is just a mathematical way of writing the following question: what value of w maximizes the variance of t, subject to the constraint that the norm squared of w is equal to 1? Here, t is a new variable we are defining in terms of the original data, X, and an optimal vector of weights, w. The specific form of t is given by,

$$t = Xw = \sum_{j} x_j w_j$$
The far right-hand side of the above expression rewrites the matrix multiplication as a summation. The jth column of X (a vector) is multiplied by the jth element of w (a number). This summation leaves us with a vector, more specifically, the score vector, which I will call the principal component.
Let’s return to the problem: what value of w maximizes the variance of t subject to the constraint that the norm squared of w is equal to 1? The first thing we can ask is: what is the variance of t? It is defined in the usual way,

$$\mathrm{Var}(t) = \frac{1}{n-1}\sum_{i=1}^{n}\left(t_i - \bar{t}\,\right)^2$$
Notice that when the mean of t is zero, its variance is proportional to its norm squared, i.e. t • t, where "•" denotes the dot product. This is why it is important to autoscale your data! It then follows that the vector w which maximizes the variance of t is the same vector that maximizes the norm squared of t. Thus, we can reformulate the PCA optimization problem as,

$$\max_{w}\ t \cdot t \quad \text{subject to} \quad \|w\|^2 = 1$$
Using the definition of t above, this becomes:

$$\max_{w}\ w^{T}X^{T}Xw \quad \text{subject to} \quad w^{T}w = 1$$
Believe it or not, this is a pretty easy problem to solve. Don’t let the vectors and matrices distract you; this is like solving an introductory calculus problem. We just need to employ the Method of Lagrange Multipliers, a technique for turning a constrained optimization problem into an unconstrained one. It does this by defining a new objective function with the constraint built in, which we can then differentiate to find the optimal solution. If all that was gibberish, don’t worry about it. The relevant equations, for maximizing f(x) subject to the constraint g(x) = 0, are:

$$\mathcal{L}(x) = f(x) - \lambda\, g(x), \qquad \frac{\partial \mathcal{L}}{\partial x} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 0$$
L(x) is a Lagrangian function, defining an equivalent (but hopefully easier) optimization problem. Now, constructing the Lagrangian for the PCA problem, we get,

$$\mathcal{L}(w) = w^{T}X^{T}Xw - \lambda\left(w^{T}w - 1\right)$$
Taking derivatives with respect to the 2 unknowns (i.e. w and λ) and setting them to zero gives us 2 equations,

$$\frac{\partial \mathcal{L}}{\partial w} = 2X^{T}Xw - 2\lambda w = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = -\left(w^{T}w - 1\right) = 0$$
The 2nd equation is just a restatement of our constraint, i.e. the norm squared of w is 1. The 1st equation is the interesting one. Rearranging terms, we get,

$$X^{T}Xw = \lambda w$$
It turns out this first equation is an eigenvalue problem, which is a standard problem in Linear Algebra. A natural question is: what is the matrix on the left-hand side? Since we autoscaled our data, XᵀX is proportional to the covariance matrix of X, which has two nice properties: 1) it is symmetric and 2) it is positive semi-definite. This means we can always solve this eigenvalue problem! The eigenvector corresponding to the largest eigenvalue then gives the optimal weights for defining our single principal component.
This naturally extends to multiple principal components. Instead of stopping at the largest eigenvalue, we can sort the eigenvalues from largest to smallest. The eigenvectors of these sorted eigenvalues define principal components, such that those associated with larger eigenvalues contain more information about X. This yields an ordered set of new variables such that subsequent variables contain less information about our original dataset.
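To tie the math together, here is a minimal NumPy sketch of PCA as an eigendecomposition of the covariance matrix. It is an illustration of the derivation above, not the exact code used in the example below:

```python
import numpy as np

def pca_via_eig(X, n_components=2):
    """Toy PCA via eigendecomposition of the covariance matrix of autoscaled data."""
    # autoscale the data (zero mean, unit standard deviation per variable)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # covariance matrix: symmetric and positive semi-definite
    C = np.cov(Z, rowvar=False)

    # eigendecomposition; eigh is the routine for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(C)

    # sort eigenvalues (and matching eigenvectors) from largest to smallest
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # score vectors (principal components) are projections onto the top eigenvectors
    scores = Z @ eigvecs[:, :n_components]
    return scores, eigvecs[:, :n_components], eigvals
```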
Although this may have been more math than you have an appetite for when reading an internet blog, I hope the what and how of PCA are now clearer. A few take-home points are given under Key Points. Next, I turn to the why of PCA by way of a concrete example: using PCA to create an S&P 500 index fund.
Key Points
- New variables are defined by a linear combination of original variables
- Each subsequent new variable contains less information
- Applications: dimensionality reduction, clustering, outlier identification
Example: S&P 500 Index Fund
At the outset, I want to disclose I am not a financial advisor. I have never taken a finance class, and this is not a recommendation of how to invest your money. This is just a fun example of what PCA can do—an example Jupyter Notebook can be found in the GitHub repo.
The goal here is to create an S&P 500 index fund. An index fund is an investment portfolio constructed to match or track a specific market [2]. For example, if you think the oil market is a great investment, investing in every single oil company is not feasible. That is where index funds are beneficial: they (in theory) mimic the fluctuations of a market without the cost of investing in each and every company. If we think about it, this is exactly what PCA does: it attempts to capture the variation in a dataset with the fewest variables possible.
The first lines of code get actual S&P 500 data. The yfinance python module is used to get updated close prices [3]. S&P 500 ticker names are grabbed from Wikipedia [4].
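A sketch of what grabbing the tickers might look like (the table index and the "Symbol" column name are assumptions about the current layout of the Wikipedia page):

```python
import pandas as pd

# pull the S&P 500 constituents table from Wikipedia
sp500_url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(sp500_url)

# the first table lists the constituents; its "Symbol" column holds the tickers
tickers = tables[0]["Symbol"].tolist()
print(len(tickers))  # ~500 ticker symbols
```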

Next, we pull stock data for 2020 using yfinance.
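A minimal sketch of that download, assuming the tickers list from the previous step (the exact arguments in the notebook may differ):

```python
import yfinance as yf

# download daily prices for every ticker over 2020 and keep the close prices
data = yf.download(tickers, start="2020-01-01", end="2020-12-31")["Close"]

# drop tickers with missing values so every column spans the full year
data = data.dropna(axis=1)
```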


Now, we import sklearn and use its built-in PCA functionality. Note that before we run PCA we need to autoscale the data, as defined earlier. We also print the explained variance of each principal component, which is a measure of how much information it contains about the dataset.
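A rough sketch of this step, assuming data is the DataFrame of close prices built above:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# autoscale each stock's price series (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(data.values)

# fit PCA and inspect how much variance each principal component explains
pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_[:5])
```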

Next, we sum the weights of the first 3 principal components for each stock and define an index fund from the top 61 stocks based on relative weight.
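One way this selection might look (taking absolute values of the weights before summing is my own assumption; pca.components_ stores one weight vector per component):

```python
import numpy as np

# combine the (absolute) weights of the first 3 principal components for each stock
weights = np.abs(pca.components_[:3]).sum(axis=0)

# keep the 61 stocks with the largest combined weight as the index fund
top_idx = np.argsort(weights)[::-1][:61]
fund_tickers = data.columns[top_idx]
```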

We can visualize the portfolio using a bar plot.
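A minimal plotting sketch, assuming fund_tickers, weights, and top_idx from the previous step:

```python
import matplotlib.pyplot as plt

# bar plot of the relative weight assigned to each stock in the fund
plt.figure(figsize=(12, 4))
plt.bar(fund_tickers, weights[top_idx])
plt.xticks(rotation=90, fontsize=6)
plt.ylabel("Relative weight")
plt.tight_layout()
plt.show()
```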

Finally, we compare the actual S&P 500 fluctuations to those of our index fund.
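One way to make that comparison, assuming an equal-weighted fund with both series normalized to their starting values (the notebook may weight the holdings differently):

```python
import matplotlib.pyplot as plt
import yfinance as yf

# normalize each holding to its first price, then average across the fund's stocks
fund_value = (data[fund_tickers] / data[fund_tickers].iloc[0]).mean(axis=1)

# pull the actual S&P 500 index ("^GSPC") and normalize it the same way
sp500 = yf.download("^GSPC", start="2020-01-01", end="2020-12-31")["Close"].squeeze()
sp500_value = sp500 / sp500.iloc[0]

plt.plot(sp500_value, label="S&P 500")
plt.plot(fund_value, label="PCA index fund")
plt.legend()
plt.show()
```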

Computing the percent return of each, we find the S&P 500 returned about 20% over the course of 2020, while the index fund returned about 25%.
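For reference, the return computation is just the percent change from the first to the last value of each normalized series from the sketch above:

```python
# percent return over the year: (end / start - 1) * 100, with start normalized to 1
print(f"S&P 500 return:    {(sp500_value.iloc[-1] - 1) * 100:.1f}%")
print(f"Index fund return: {(fund_value.iloc[-1] - 1) * 100:.1f}%")
```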
Conclusion
Principal Component Analysis (PCA) is a popular and powerful tool in Data Science. It provides a way to reduce redundancy in a set of variables. We’ve seen that this is equivalent to an eigenvector decomposition of the data’s covariance matrix. Applications for PCA include dimensionality reduction, clustering, and outlier detection. In my next post, I will discuss a similar but completely different technique, Independent Component Analysis (ICA).
👉 More in this series: Independent Component Analysis | GitHub repo
Resources
Connect: My website | Book a call
Socials: YouTube 🎥 | LinkedIn | Twitter
Support: Buy me a coffee ☕️
References
[1] R. Bro, A. K. Smilde, Anal. Methods, 2014, 6, 2812–2831
[2] https://www.investopedia.com/terms/i/indexfund.asp
[3] https://pypi.org/project/yfinance/
[4] https://medium.com/wealthy-bytes/5-lines-of-python-to-automate-getting-the-s-p-500-95a632e5e567
[5] Golden, R. (2020). Statistical Machine Learning: A Unified Framework. Boca Raton: CRC Press.