Understanding Dimensionality Reduction for Machine Learning

Rajat S
Towards Data Science
7 min read · Nov 30, 2019


Recently, I was asked to work on a dataset that was a bit on the heavy side. It was so large that my Excel program would stop responding for a few minutes while loading, and manually exploring the data started to become a real pain.

I am still a newbie in the Machine Learning world. So, I had to go back to Excel multiple times to decide which features should be used in the Machine Learning model.

But after an excruciatingly long and mind-wracking stretch of work, I was able to build a model that gave me some decent results! I started to feel really good about myself and to believe that I could make it as a Machine Learning Engineer!

But then I came across my next obstacle: How do I show my results to ordinary people?

I could make two columns where one would show the real-world results and the other column would show the results given to me by the ML model. But this is not the most visually appealing way to go. While machines are better than us at understanding large sets of numeric data, the human mind finds it easier to understand visual data like graphs and plots.

So a plot is the best way to go forward. But our screen is two-dimensional, and our data can have more than two features, where each feature can be considered a dimension. One of the two dimensions of the plot is going to be the output. So the question we should ask ourselves is: how do we represent all of our features, which could number anywhere from hundreds to thousands, in a single dimension?

This is where dimensionality reduction comes into play!

What is Dimensionality Reduction?

Dimensionality Reduction is a set of techniques in Machine Learning that reduce the number of features in your dataset. The great thing about dimensionality reduction is that, done carefully, it often has little negative effect on your machine learning model’s performance. In some cases it can even increase the model’s accuracy, for example by discarding noisy or redundant features.

By reducing the number of features in our dataset, we also reduce the storage space required for the data, and our Python code will need less time to process it.
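To make this concrete, here is a minimal sketch of one common dimensionality reduction technique, Principal Component Analysis (PCA), implemented with NumPy's SVD. The data here is randomly generated purely for illustration; the point is just to show a 50-feature dataset being compressed down to 2 features:

```python
import numpy as np

# Hypothetical dataset: 200 samples, each with 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# PCA via SVD: center the data, then project it onto the
# top-2 principal directions (the first two right singular vectors).
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T

print(X.shape)          # original: (200, 50)
print(X_reduced.shape)  # reduced:  (200, 2)
```

With only two columns left, the reduced data can be plotted directly on a two-dimensional screen, which is exactly the visualization problem described above.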
