
Dimensionality Reduction (PCA) Explained

Explaining and Implementing PCA in Python

Image by Héctor J. Rivas from Unsplash

Table of Contents

  • Introduction
  • What is Dimensionality Reduction?
  • Linear Combination
  • Feature Engineering & Feature Selection
  • Data
  • PCA
  • Intuition
  • Mathematical Breakdown
  • Advantages
  • Disadvantages
  • Implementation
  • Summary
  • Resources

Introduction

Dimensionality reduction is a method commonly used by data scientists in machine learning. This article focuses on a very popular unsupervised learning approach to dimensionality reduction, principal component analysis (PCA). Consider this an introductory article aimed at providing an intuitive understanding of what dimensionality reduction is, how PCA works and how to implement it in Python.

What is Dimensionality Reduction?

Before jumping into dimensionality reduction, let’s first define what a dimension is. Given a matrix A, the dimension of the matrix is the number of rows by the number of columns. If A has 3 rows and 5 columns, A would be a 3×5 matrix.

A = [1, 2, 3]   --> 1 row, 3 columns
The dimension of A is 1x3

Now, in the simplest of terms, dimensionality reduction is exactly what it sounds like: you’re reducing the dimension of a matrix to something smaller than it currently is. Given a square (n by n) matrix A, the goal would be to reduce the dimension of this matrix to something smaller than n x n.

Current Dimension of A : n
Reduced Dimension of A : n - x, where x is some positive integer

You might ask why someone would want to do this; the most common application is for data visualization purposes. It’s quite difficult to graphically visualize something which lies in a dimension space greater than 3. Through dimensionality reduction, you’ll be able to transform a dataset with thousands of rows and columns into one small enough to visualize in 3 / 2 / 1 dimensions.

Linear Combination

Without doing a massive deep dive into the mathematics behind linear algebra, the simplest way to reduce the dimension of a matrix A is to multiply it by a vector / matrix x such that the product equals B. The formula would be Ax = B, and this formula has a solution if and only if B is a linear combination of the columns of A.

Example

The goal is to reduce the 4x4 matrix A to a matrix with 4x2 dimensions. The values below are from a random example.
A = [[1,2,3,4],      x = [[1,2],
     [3,4,5,6],           [3,4],
     [4,5,6,7],           [0,1],
     [5,6,7,8]]           [4,0]]
Ax = [[23,13],
      [39,27],
      [47,34],
      [55,41]]
The output B now has a dimension of 4x2.
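
The same reduction can be reproduced with NumPy; the block below is a minimal sketch using the matrices from the example above.

import numpy as np

# 4x4 matrix A from the example above
A = np.array([[1, 2, 3, 4],
              [3, 4, 5, 6],
              [4, 5, 6, 7],
              [5, 6, 7, 8]])

# 4x2 matrix x used to reduce the number of columns from 4 to 2
x = np.array([[1, 2],
              [3, 4],
              [0, 1],
              [4, 0]])

# matrix product B = Ax has dimension 4x2
B = A @ x
print(B.shape)  # (4, 2)
print(B)
# [[23 13]
#  [39 27]
#  [47 34]
#  [55 41]]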

Feature Engineering & Feature Selection

Given a dataset of various features, you can reduce the number of features through feature engineering and feature selection. This in itself intuitively reduces the dimensions of the original dataset you were working with. Assume you have a DataFrame with the following columns:

Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5

Suppose you found that you can combine a pair of features, through some arithmetic, without losing the information those features carry. The new feature would then replace the pair it was produced from. The process of creating new features through some means of preprocessing is known as feature engineering, and the process of selecting specific features for the purpose of training a model is known as feature selection.

Example
Feature 1 + Feature 2 = Feature 6
Now your new features are : 
Feature 3 | Feature 4 | Feature 5 | Feature 6
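
Here is a minimal sketch of what this looks like in pandas; the column names and values are made up purely for illustration.

import pandas as pd

# toy DataFrame with five hypothetical features
df = pd.DataFrame({
    'Feature 1': [1, 2, 3],
    'Feature 2': [4, 5, 6],
    'Feature 3': [7, 8, 9],
    'Feature 4': [1, 0, 1],
    'Feature 5': [2, 3, 4],
})

# feature engineering: combine two features into a new one
df['Feature 6'] = df['Feature 1'] + df['Feature 2']

# feature selection: keep only the features you want to train on
df = df[['Feature 3', 'Feature 4', 'Feature 5', 'Feature 6']]
print(df.columns.tolist())
# ['Feature 3', 'Feature 4', 'Feature 5', 'Feature 6']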

Data

I’ll show you how to implement PCA through sklearn using NBA data. You can download the data directly from my GitHub repository or from its original source on Kaggle [here](https://www.kaggle.com/justinas/nba-players-data).

Principal Component Analysis (PCA)

PCA is a widely used unsupervised learning technique for reducing the dimension of a large dataset. It transforms the large set of variables into a smaller set of components which still contains the majority of the information in the large one. Reducing the size of the dataset naturally results in a loss of information and impacts the accuracy of a model; this downside is offset by the ease of use for exploration, visualization and analysis purposes.

Intuition

Suppose we are looking at a dataset associated with a collection of athletes and all of their associated stats. Each of those stats measures various characteristics associated with the athlete, things like height, weight, wingspan, efficiency, points, rebounds, blocks, turnovers, etc. This would essentially yield a large dataset of various characteristics associated with each of our players; however, many of the characteristics might be related to one another and thus carry redundant information.

PCA aims to create new characteristics which summarize the initial characteristics we had. Through finding linear combinations of the old characteristics, PCA can construct new characteristics whilst trying to minimize information loss. For example, a new characteristic might be computed as a player’s average number of points per game minus their average number of turnovers per game. PCA does this by identifying the characteristics which differ most strongly across players.

Mathematical Breakdown

Mathematically PCA can be broken down into 4 simple steps.

  1. Identify the centre of the data & reposition the data and centre to the origin

    • This can be accomplished by taking the average of each column and subtracting that average from the original data
  2. Calculate the covariance matrix of the centred data
  3. Calculate the eigenvectors and eigenvalues of the covariance matrix
  4. Project the centred data onto the eigenvectors through the dot product

Mathematically, it is a projection of a higher-dimensional object into a lower-dimensional vector space.
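
To make these steps concrete, here is a minimal from-scratch sketch in NumPy; the random data and the choice to keep two components are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features (illustrative data)

# 1. centre the data at the origin by subtracting each column's mean
X_centred = X - X.mean(axis=0)

# 2. covariance matrix of the centred data (features as variables)
cov = np.cov(X_centred, rowvar=False)

# 3. eigenvalues / eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# sort components by explained variance, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. project the centred data onto the top k eigenvectors via the dot product
k = 2
X_reduced = X_centred @ eigenvectors[:, :k]
print(X_reduced.shape)  # (100, 2)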

Advantages

  • Removes correlated features
  • Allows for easier visualization
  • Helps reduce overfitting

Disadvantages

  • Variables become less interpretable
  • Information loss
  • Requires data standardization before it can be applied

Implementation

Note: You may need to change the file path in the code below to point to where you saved the data.
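
The snippet below is a minimal sketch of the implementation; the file name, the use of select_dtypes to pick the numerical columns, and the plotting choices are assumptions, so adjust them to match your copy of the data.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# load the NBA data; change this path to wherever you saved the file
df = pd.read_csv('nba_players_data.csv')

# keep only the numerical features and drop rows with missing values
numeric_df = df.select_dtypes(include='number').dropna()

# standardize the features so each one contributes equally to the components
scaled = StandardScaler().fit_transform(numeric_df)

# reduce the data to 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)

# visualize the two components
plt.scatter(components[:, 0], components[:, 1], s=10, alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2 Dimensional PCA Visualization of Numerical NBA Features')
plt.show()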

2 Dimensional PCA Visualization of Numerical NBA Features (Image provided by author)

Summary

Dimensionality reduction is a commonly used method in machine learning, and there are many ways to approach reducing the dimensions of your data, from feature engineering and feature selection to the implementation of unsupervised learning algorithms like PCA. PCA aims to create new characteristics which summarize the initial characteristics of a dataset by identifying linear combinations of them.


Resources


Thank you for reading through my article. If you liked this article, here are some others I’ve written which you might be interested in.

Word2Vec Explained

Recommendation Systems Explained

Bayesian A/B Testing Explained

