PCA & Autoencoders: Algorithms Everyone Can Understand

Thomas Ciha
Towards Data Science
8 min read · Sep 12, 2018


The primary focus of this article is to provide intuition for the Principal Components Analysis (PCA) and Autoencoder data transformation techniques. I’m not going to delve deep into the mathematical theory underpinning these models as there are a plethora of resources already available.

Introduction

Autoencoders map the data they are fed to a lower-dimensional space by combining the data’s most important features. They encode the original data into a more compact representation and learn on their own how the data should be combined, hence the “auto” in Autoencoder. These encoded features are often referred to as latent variables.

There are a few reasons doing this can be useful:

1. Dimensionality reduction can decrease training time

2. Using latent feature representations can enhance model performance

Like many concepts in machine learning, Autoencoders can seem esoteric at first. If you’re not familiar with latent variables, a latent variable is essentially an implicit feature of the data: a variable that isn’t observed or measured directly. Happiness, for example, is a latent variable; we must use a method like a questionnaire to infer the magnitude of an individual’s happiness.

Like the Autoencoder model, Principal Components Analysis (PCA) is also widely used as a dimensionality reduction technique. However, the PCA algorithm maps the input data differently than the Autoencoder does.

Intuition:

Suppose you have an awesome sports car Lego set that you want to send to your friend for their birthday, but the box you have isn’t big enough to fit all the Lego pieces. Instead of not sending it at all, you decide to pack the most important Lego pieces, the ones that contribute most to making the car. So, you throw away some trivial pieces like the door handles and the windshield wipers and pack pieces like the wheels and the frame. Then, you ship the box off to your friend. Upon receipt of the package, your friend is perplexed by the miscellaneous Lego pieces without instructions. Nonetheless, they assemble the set and are able to recognize that it is a drivable vehicle. It might be a dune buggy, a race car or a sedan; they don’t know.

The analogy above is an example of a lossy data compression algorithm: the quality of the data is not perfectly retained, because some of the original data (i.e. Lego pieces) has been lost. Although using PCA and Autoencoders for dimensionality reduction is also lossy, this example does not exactly describe those algorithms; it describes a feature selection algorithm. Feature selection algorithms discard some features of the data and retain the salient ones. The features they keep are typically chosen for statistical reasons, such as the correlation between an attribute and the target label.
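To make that distinction concrete, here is a quick sketch of what a feature selection step can look like in practice. The dataset and the number of features kept are arbitrary choices for illustration only:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# A small regression dataset, chosen only for illustration.
X, y = load_diabetes(return_X_y=True)

# Keep the 4 features that score highest on a univariate F-test against the
# target. The remaining features are simply discarded -- this is what makes
# feature selection different from PCA, which combines features instead.
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (442, 10) -> (442, 4)
```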

Principal Component Analysis

Suppose a year passes by and your friend’s birthday is approaching again. You decide to get them another Lego car set because they told you last year how much they loved their present. You also blunder again by purchasing a box that’s too small. This time, you think you can make better use of the Legos by cutting them systematically into smaller pieces. The finer granularity of the Legos allows you to fill the box more than last time. Before, the radio antenna was too tall to fit in the box, but now you cut it into thirds and include two of the three pieces. When your friend receives the gift in the mail, they assemble the car by gluing certain pieces back together. They’re able to glue together a spoiler and some hub caps and the car is more recognizable as a result. Next, we’ll explore the mathematical concepts behind this analogy.

LEGO bricks © 2015 LEGO / Palle Peter Skov

Elaboration

PCA works by projecting the input data onto the eigenvectors of the data’s covariance matrix. The covariance matrix quantifies how much each variable varies on its own and how much each pair of variables varies together. Eigenvectors are simply vectors that retain their span through a linear transformation; that is, they still lie along the same line before and after the transformation. For the covariance matrix, these eigenvectors point along the directions in which the data varies the most, so they let us re-frame the data and view it from a different angle without actually distorting it. When we project the data onto these vectors, we are extracting the combination of the original variables that carries the most variance. We can then select the dominant axes using the eigenvalues of the covariance matrix, because each eigenvalue reflects the magnitude of the variance along its corresponding eigenvector.

Original Data (left) 1st Principal Component & Data (right)
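Here is a minimal NumPy sketch of that procedure: center the data, compute the covariance matrix, take its eigendecomposition, and project onto the leading eigenvector. The toy data is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data with correlated features (made up for illustration).
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.8], [0.8, 0.5]])

# 1. Center the data so the covariance describes variation around the mean.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition (eigh because the covariance matrix is symmetric).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue so the direction of greatest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the first eigenvector (the 1st principal component).
X_projected = X_centered @ eigenvectors[:, :1]

print("fraction of variance along the 1st component:",
      eigenvalues[0] / eigenvalues.sum())
```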

These projections result in a new space in which the basis vectors are ordered by how much variance they capture: the projections onto the eigenvector with the largest eigenvalue have the most variance, the projections onto the second eigenvector have the second most, and so on. These new basis vectors are referred to as the principal components. We want principal components to be oriented in the direction of maximum variance because greater variance in attribute values can lead to better forecasting ability. For example, say you’re trying to predict the price of a car given two attributes: color and brand. Suppose all the cars have the same color, but there are many brands among them. Guessing a car’s price based on its color, a feature with zero variance, would be impossible in this example. However, if we considered a feature with more variance, the brand, we could come up with better price estimates, because Audis and Ferraris tend to be priced higher than Hondas and Toyotas. The principal components resulting from PCA are linear combinations of the input variables, just like the glued Lego pieces are combinations of the original pieces. This linearity is also what keeps the transformed data interpretable.

Data projected onto 1st principal component (Source: Author)
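The same variance ordering shows up directly in scikit-learn’s PCA, whose explained_variance_ratio_ attribute reports the fraction of variance each principal component captures. The iris dataset here is just a convenient stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Components are ordered by the variance they explain, largest first.
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05] for iris
print(pca.components_)                # each row gives the weights of a linear combination of the inputs
```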

PCA Pros:

  • Reduces dimensionality
  • Interpretable
  • Fast run time

PCA Cons:

  • Incapable of learning non-linear feature representations

Autoencoder

Autoencoder Architecture

Things get a little weird with Autoencoders. Instead of just cutting the pieces, you begin melting, elongating and bending the Legos entirely such that the resulting pieces represent the most important features of the car, yet fit within the constraints of the box. Doing this not only allows you to fit even more Lego pieces into the box, but also allows you to create custom pieces. This is great, but your buddy has no idea what to do with the package when it arrives. To them, it just looks like a bunch of randomly manipulated Legos. In fact, the pieces are so different that you would need to repeat this process myriad times with several cars to converge on a systematic way of transforming the original pieces into pieces that can be assembled by your friend into the car.

Elaboration

Hopefully the analogies above help illustrate how Autoencoders are similar to PCA. In the context of Autoencoders, you are the encoder and your friend is the decoder. Your job is to transform the data in a way the decoder can then interpret and reconstruct with minimal error.

Autoencoders are just repurposed feed-forward neural networks. I’m not going to delve into the nitty-gritty details here, but feel free to check out Piotr Skalski’s great article or the deep learning book to gain a more comprehensive understanding of neural nets.
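To make that concrete, here is a rough Keras sketch of a basic Autoencoder: the input is squeezed through a smaller latent layer and the network is trained to reproduce its own input. The layer sizes, activations and training settings are arbitrary choices for illustration, not values from this article:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 64, 8  # arbitrary sizes chosen for illustration

# Encoder: compress the input into a lower-dimensional latent representation.
encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

# Decoder: reconstruct the original input from the latent representation.
decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

# The model is trained to map x back to x, so the loss is the reconstruction error.
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

latent_features = encoder.predict(X)  # the compressed (latent) representation
```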

Although they are capable of learning complex feature representations, the largest pitfall of Autoencoders lies in their interpretability. Just like your friend was clueless when they received the distorted Legos, it is impossible for us to visualize and understand the latent features of non-visual data. Next we’ll examine sparse Autoencoders.

Autoencoder Pros

  • Able to learn non-linear feature representations
  • Reduce dimensionality

Autoencoder Cons

  • Computationally expensive to train
  • Uninterpretable
  • More complex
  • Prone to overfitting, though this can be mitigated via regularization

Sparse Autoencoders

Sparse Autoencoder Loss Function (Source: Andrew Ng)

The notion that humans underutilize the power of the brain is a misconception that likely stems from neuroscience research suggesting at most 1–4% of all neurons in the brain fire concurrently. There are probably several good evolutionary reasons for the sparse firing of neurons in the human brain. If all neurons fired simultaneously and we were able to “unlock the brain’s true potential”, it might look something like this. I hope you enjoyed the digression. Back to neural networks. The sparseness of neural firing in the brain may have served as inspiration for the sparse Autoencoder. The hidden neurons throughout a neural net learn a hierarchical feature representation of the input data, and we can think of a neuron “firing” when it sees the feature of the input data it is looking for. Vanilla Autoencoders force latent features to be learned by virtue of their undercomplete architecture (undercomplete meaning the hidden layers contain fewer units than the input layer). The idea behind sparse Autoencoders is that we can force the model to learn latent feature representations via a constraint unrelated to the architecture: the sparsity constraint.

The sparsity constraint is the value we want the average hidden-layer activations to take, typically a floating-point value close to zero. This hyperparameter is represented by the Greek letter rho (ρ) in the function above, while rho hat with subscript j (ρ̂_j) denotes the observed average activation of hidden unit j.

We impose this constraint on the model using KL divergence and weight its contribution to the loss by β. In short, KL divergence measures the dissimilarity between two distributions. Adding this term to the loss function incentivizes the model to optimize its parameters so that the KL divergence between a Bernoulli distribution with mean ρ̂_j (the observed average activations) and a Bernoulli distribution with mean ρ (the desired sparsity) is minimized.

Constraining activations close to zero means neurons will only fire when it is most critical to the model’s accuracy, and the KL divergence term means neurons will also be penalized for firing too frequently. I highly recommend reading this if you’re interested in learning more about sparse Autoencoders. These lectures (lecture1, lecture2) by Andrew Ng are also a great resource that helped me better understand the theory underpinning Autoencoders.
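As a rough sketch, the sparsity penalty from the loss function above can be written as a standalone NumPy function. The default values of rho and beta below are placeholder assumptions chosen only for illustration:

```python
import numpy as np

def sparsity_penalty(rho_hat, rho=0.05, beta=3.0):
    """KL-divergence sparsity penalty summed over the hidden units.

    rho     -- desired average activation (the sparsity parameter, close to 0)
    rho_hat -- observed average activation of each hidden unit over a batch
    beta    -- weight of the penalty in the overall loss
    """
    # KL divergence between Bernoulli distributions with means rho and rho_hat.
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * np.sum(kl)

# Hidden units whose average activation strays far from rho are penalized more.
activations = np.array([[0.02, 0.40, 0.05],
                        [0.08, 0.55, 0.04]])  # batch of 2 samples, 3 hidden units
rho_hat = activations.mean(axis=0)            # average activation per hidden unit
print(sparsity_penalty(rho_hat))              # dominated by the frequently firing 2nd unit
```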

Conclusion

We’ve delved into the concepts behind PCA and Autoencoders throughout this article. Unfortunately, there is no elixir. The decision between the PCA and Autoencoder models is circumstantial. In many cases, PCA is superior — it’s faster, more interpretable and can reduce the dimensionality of your data just as much as an Autoencoder can. If you can employ PCA, you should. However, if you’re working with data that necessitates a highly non-linear feature representation for adequate performance or visualization, PCA may fall short. In this case, it may be worth the effort to train Autoencoders. Then again, even if the latent features produced by an Autoencoder increase model performance, the obscureness of those features poses a barrier to knowledge discovery.

Thanks for reading! I hope you enjoyed the article and gained some useful insights. If you did, feel free to leave a clap! Constructive feedback is appreciated.

References

Neuroscience research: https://www.sciencedirect.com/science/article/pii/S0960982203001350?via%3Dihub

From PCA to Autoencoders (Coursera): https://www.coursera.org/learn/neural-networks/lecture/JiT1i/from-pca-to-autoencoders-5-mins

arXiv:1801.01586: https://arxiv.org/pdf/1801.01586.pdf

Other Image Sources

Autoencoder Architecture:

Another article worth checking out: https://neptune.ai/blog/understanding-representation-learning-with-autoencoder-everything-you-need-to-know-about-representation-and-feature-learning
