The Curse of Dimensionality

Why not to just put all variables in your model

Julio Marco A. Silva
Towards Data Science

--

As cool as it would be, this post is not about pirates hopping through parallel realities in their dimensional sailing ships, but about those moments when you feel tempted to just shove all your variables into your model and see if it can digest them. Trust me: you will be cursed if you do so and I am about to tell you why and how.

Four sailing ships on a wavy ocean under a gloomy night sky.
Unnamed author, http://www.pikist.com/free-photo-srces

The Curse itself

I have heard it a lot of times and, at first, it even makes sense: “The more information in a model, the better, so I’ll put all these variables in and get perfection!” That is how people, or rather modelling efforts, acquire the curse: by placing lots of variables in their feature space, drastically increasing its dimensionality.

Well, in the great majority of cases, as you keep increasing the number of variables in your dataset (or, equivalently, dimensions in your vector space), you will, after some point, see your models having a harder and harder time producing robust results. That counterintuitive effect is the curse of dimensionality.

So, why is adding variables problematic? Why is there a “curse” of dimensionality?

Sparser Vector Spaces

The first reason for the curse is topological: the more dimensions (variables) you have in a space, the more “space” you have between data points, as distances are progressively “stretched” with every dimension you add.

Let A and B be data points in a Euclidean space with N dimensions:

d(A, B) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}

If we add one more dimension, the effect will be:

d'(A, B) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2 + (a_{N+1} - b_{N+1})^2}

Since the new term added to the sum is squared, its contribution can never be negative, so, unless all the data points share the same coordinate in the new dimension (i.e.: we are adding a constant variable), overall distances will always increase as we add variables to a dataset. In other words: vector spaces with more dimensions are also sparser.
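
To make that concrete, here is a minimal sketch (assuming NumPy and standard-normal data, purely for illustration) of how the distance between two fixed points can only grow as we reveal more of their coordinates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two fixed points described by up to 100 variables. Looking at the first
# n coordinates only is the same as having a dataset with n variables:
# every extra variable can only add to the distance, never subtract from it.
a = rng.normal(size=100)
b = rng.normal(size=100)

for n_dims in (1, 2, 5, 10, 50, 100):
    d = np.linalg.norm(a[:n_dims] - b[:n_dims])
    print(f"{n_dims:>3} variables -> distance {d:.2f}")
```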

Even more: let ε be the fractional increase in the volume our data occupies when a new variable is added, and let’s say every variable we add contributes this same relative increase to our vector space.

If we add one more variable to our dataset, the volume of the vector space will be multiplied by (1 + ε); if we then insert V more variables, the volume grows by a factor of (1 + ε)^V. This means that, as we add more variables to our dataset, its vector space becomes exponentially sparser.
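
A quick back-of-the-envelope check (the 5% figure below is an arbitrary assumption, used only to show how the factor compounds):

```python
# If every new variable stretches the occupied volume by eps = 5%,
# the factor compounds exponentially with the number of added variables V.
eps = 0.05
for V in (1, 5, 10, 50, 100):
    print(f"{V:>3} extra variables -> volume factor {(1 + eps) ** V:,.1f}x")
```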

So, again, what is the problem with sparser vector spaces?

The reason sparser vector spaces are a problem is related to the reason why we use squared measures for detecting differences, such as squaring errors in the MSE (Mean Squared Error) metric or squaring deviations from the mean to compute the variance. Aside from making every number positive (something we could achieve by simply taking the absolute value), squaring emphasizes every bit of difference between data points, so we see them as if through a magnifying glass.

That “magnification effect” is perfect for detecting deviations, but it can be problematic for clustering and classification when it is too strong: very tiny differences between points in the same cluster or class get amplified, making very similar data points look so separated from the rest that other similar (but not quite as similar) data points appear to belong to a different group. That creates the illusion of a multitude of clusters in a dataset that truly has only a few, and it can make elements rightfully belonging to a class look like outliers to our models. Even in the best scenario, it increases the variance of our predictions (please, tell me you do estimate the expected error of your models… you do that, right? Always healthy to ask).
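
One common way to see this loss of contrast (a sketch assuming uniformly distributed random data, not any particular real dataset) is to compare the nearest and farthest neighbors of a point as dimensions grow:

```python
import numpy as np

rng = np.random.default_rng(7)

# For a cloud of random points, compare the nearest and farthest
# neighbor of one query point. As dimensions grow, the contrast
# between "close" and "far" shrinks, which is what makes clusters
# and classes harder to tell apart.
for n_dims in (2, 10, 100, 1000):
    points = rng.uniform(size=(1000, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"{n_dims:>4} dims: nearest/farthest = {dists.min() / dists.max():.3f}")
```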

Since we were talking about pirates before, a skinny man with a funny hat and few remaining teeth raises his hand and, in traditional pirate speak, tells us he has found a way to lift the curse. Since we have been relying on Euclidean distances to describe the curse so far, all we have to do is discretize our variables: by sailing in a different type of space, we would sink the pesky curse to the dark depths and leave it there! Everyone cheers, but it’s time for bad news…

Unfortunately, things are not that simple. When you increase the volume of a discrete space (say, one with a Manhattan distance), its number of possible discrete positions increases exponentially and it starts to behave more and more like a Euclidean space. Therefore, while the effects we discussed will be somewhat dampened for discrete variables, adding too many of them will take you down the same path. And this is just considering sparsity, which, as we will see ahead, is only one of the problems.
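
A small sketch of the same effect with discrete variables (binary values and the Manhattan distance, chosen only for illustration): the relative spread of distances shrinks as dimensions grow, so “close” and “far” blur together just as in the Euclidean case.

```python
import numpy as np

rng = np.random.default_rng(3)

# Discrete (binary) variables with the Manhattan (L1) distance.
# The relative spread of distances to a query point shrinks as
# dimensions grow, mirroring the Euclidean behavior.
for n_dims in (5, 50, 500):
    points = rng.integers(0, 2, size=(500, n_dims))
    query = rng.integers(0, 2, size=n_dims)
    dists = np.abs(points - query).sum(axis=1)
    print(f"{n_dims:>3} dims: std/mean of L1 distances = {dists.std() / dists.mean():.3f}")
```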

So nice try, rookie… but you’re still cursed.

The Combinatorics Conundrum

Most models we use in data science are nothing but mathematical approximators. We use them to approximate generator functions in regression problems and we use them to approximate separator functions in classification problems. That’s what we do in most of our day-to-day activities: we approximate functions.

Because of the nature of mathematical approximation, limit cases (those data points at the leftmost and rightmost part of every dimension in your domain) are very relevant for generalization. If you don’t properly handle limit cases, they will haunt you like the crew of a ghost ship when your model goes to production.

That is part of the curse: being haunted by ghastly limit cases.

With one single variable, your limit cases sit only at the two ends of your number line (well, this is not entirely true, as you may have to deal with limit cases for your clusters too, but let’s ignore that for a moment…), so, every time you insert a new variable into your dataset, you will have to deal not only with its two new limit cases, but with the combinations of the limit cases of both the old and the new variables.

Combinatorics tells us that, for this kind of problem, the number of limit cases tends to be two to the power of the number of variables in your dataset (2^V). That means: every time you insert a new variable, you will be at least doubling the number of limit cases you have to deal with… and that is not even counting the limit cases of the clusters!
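
A tiny illustration of that count, treating the limit cases simply as the “corners” of the domain (every combination of each variable’s minimum and maximum):

```python
from itertools import product

# With 2 variables there are 4 limit-case combinations (the "corners"):
print(list(product(("min", "max"), repeat=2)))

# The count doubles with every variable added: 2 ** n_variables.
for n_variables in (1, 2, 5, 10, 20, 30):
    print(f"{n_variables:>2} variables -> {2 ** n_variables:,} limit-case combinations")
```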

As dealing with limit cases demands data on them, I hope it is clear at this point that, as you insert new variables into your dataset, the volume of data you need in order to learn generalizable patterns not only increases, but increases exponentially.

Of course, with very well-behaved and predictable variables whose limit cases fit nicely into the overall shape of the rest of the data, this will simply not be a problem… but, let’s be frank: if real-life variables behaved like that frequently, data scientists would be facing much weaker demand in the job market.

If that did not convince you yet, it will convince your manager, as more data means more memory usage, which means beefier machines to run your model… so the ghastly limit cases also pick your pockets by inflating the cloud provider’s bill at the end of the year. And that is not to mention the cost of your fellow data engineers examining, cleaning, shaping, storing and otherwise wrangling all these extra data rows, and the possible cost of people having to label them.

This is the “combinatorics conundrum” part of the curse: when you insert new variables in your dataset, you will be haunted by the ghosts of your limit cases… and they will have your gold.

The Hyperspherical Cage

You may have heard of mathematical attractors, which are emergent behaviors that force some mathematical relations into a certain pattern and are seen more frequently in iterative processes, like the process of inserting one more variable, then another, then another into a dataset. This is precisely the kind of monster we are dealing with here — and it also catches you when inserting all the variables at the same time, in case you are wondering…

Let’s imagine for a moment that all our variables have symmetrical distributions centered around zero. This is not a requirement for the curse, but it helps us understand the monster (the attractor) that takes part in it. We will lift this assumption a bit later.

For any one of our variables taken individually, the most likely outcome is that its value in a randomly chosen data point will be close to zero. We will also assume our variables are only weakly correlated, as two strongly correlated variables should not be in the same dataset anyway: the second one brings little extra information (but its full noise) to the model.

Since our variables are weakly correlated, the probability of finding a fringe value of a variable in a data point (say, a value in the largest 5% or smallest 5% of the distribution’s tails) is independent of the values of all the other variables for that same data point. That means: for every new variable inserted in our dataset, there is a 10% probability that its value will be either small (5% lower tail) or large (5% upper tail) for each data point in our dataset.

As we increase the number of variables, the probability that a data point stays outside the 5% upper and lower regions of every variable drops quickly (about 35% with 10 variables, about 12% with 20). With enough variables, it is practically certain that almost every data point will have a fringe value in at least one of its variables.
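
The arithmetic behind those percentages, assuming (as above) a 5% tail on each side and independent variables:

```python
# Probability that a data point avoids the 5% lower and 5% upper tails
# of every one of its variables, assuming independence: 0.9 ** n.
for n_variables in (1, 10, 20, 50, 100):
    print(f"{n_variables:>3} variables -> "
          f"{0.9 ** n_variables:.1%} chance of having no fringe value")
```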

Adding to that, as we saw in the discussion about sparsity, when we add another variable to a dataset the overall distance between points never decreases: each new variable adds more and more distance between any point and the center of the vector space, no matter the values any variable or group of variables may take.

Usually, when we picture a data cloud with symmetrical distributions around zero, we are led to visualize it as a fuzzy cloud that fills the volume of a hypersphere around the center of the vector space. That is definitely not what happens in high-dimensional spaces.

As new variables are added, the probability of any point not being at the fringe of some distribution decreases and, one by one, all points end up pushed toward the surface of the hypersphere, all of them very distant from the center.

Even more: for each new variable, this process is independent of the values of all the other variables. It is effectively random and, as such, becomes more subject to the central limit theorem the more variables there are, making the distance between every single data point and the center of the vector space approximately normally distributed around a certain value: the radius of your hyperspherical cage.
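
A short simulation of the cage, assuming standard-normal variables as in the rest of this section: the distances from the points to the center pile up around a single radius, and their relative spread shrinks as dimensions grow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Standard-normal data centered at zero. As dimensions grow, the distance
# from every point to the center concentrates around a single radius
# (roughly sqrt(n_dims)): the "hyperspherical cage".
for n_dims in (2, 10, 100, 1000):
    points = rng.normal(size=(2000, n_dims))
    radii = np.linalg.norm(points, axis=1)
    print(f"{n_dims:>4} dims: mean radius {radii.mean():6.2f}, "
          f"std/mean {radii.std() / radii.mean():.3f}")
```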

Notice that we did not mention or care about the information contained in any of the variables… you may see that it really does not matter.

Now you can see both the monster (the attractor that places all data points into one single geometric shape in the vector space of your dataset, no matter the information your variables may carry) and the cage (the hypersphere itself).

If your data is asymmetric and/or centered around any point other than zero, it will not matter for our scary monster: it will only translate the center of the hypersphere to the mean vector of your data and distort the hypersphere according to the distribution of each of your variables.

OK, scary, but what is the problem with all our data points sitting on the surface of a hypersphere?

Remember I said our data was placed at the surface of a hypersphere no matter the information our variables may carry? Well, the surface of a hypersphere in a vector space of D dimensions is itself a space of D-1 dimensions (a curved surface, but still one dimension lower).

When these winds are strong enough (that is: there are already too many variables in the dataset), our data will be projected into a space of smaller dimensionality and, unless the projection process carefully takes the information in the dataset into consideration, a lot of information can be lost just by doing so.

It happens that our monster cares absolutely nothing for the information contained in the data, so expect heavy losses. Even more so if the variables carry a lot of noise (i.e.: they have a small signal-to-noise ratio).

With noisy variables we face a tragedy, as this whole process empowers noise to simply take over the very information we were hoping to gain by adding variables in the first place.

How? As we project our data onto the surface of the hypersphere with each new variable, the position of each data point in that new space carries its (reduced) information, but is disturbed by its full noise at every step of the way, because destroying information is much easier than preserving it.

In the end, the position of each data point on the surface of the hypersphere gets more and more random as more and more noise is added.
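
Here is a sketch of that takeover under a toy assumption: a single informative variable plus a growing number of pure-noise variables. We check how often the nearest neighbor found with all variables agrees with the nearest neighbor according to the signal alone; as noise variables pile up, the agreement collapses.

```python
import numpy as np

rng = np.random.default_rng(5)

def pairwise_dists(X):
    # Squared-distance trick to avoid building a huge 3-D array.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.clip(d2, 0.0, None))

def nearest_neighbour(dists):
    dists = dists.copy()
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return dists.argmin(axis=1)

# One informative variable; its nearest-neighbor structure is the
# "information" we would like the full dataset to preserve.
n_points = 300
signal = rng.normal(size=(n_points, 1))
signal_nn = nearest_neighbour(pairwise_dists(signal))

for n_noise in (0, 5, 50, 500):
    data = np.hstack([signal, rng.normal(size=(n_points, n_noise))])
    full_nn = nearest_neighbour(pairwise_dists(data))
    agreement = (full_nn == signal_nn).mean()
    print(f"{n_noise:>3} noise variables -> nearest-neighbor agreement {agreement:.0%}")
```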

So, if it wasn’t enough to be cast into an ever-expanding space while being chased by ghastly limit cases that happen to be gold-grabbing pickpockets, you are now locked by a mathematical monster inside a distorted hypersphere that also happens to flatten you! When it comes to curses, this is one that even the most rugged dimensional sea wolves should not take lightly.

The Blessing of Dimensionality

We have seen pretty dangerous things so far, but not everything is dark and scary in high-dimensional spaces. You may find there is gold to be plundered!

Inserting new variables into our datasets can, in some cases, be beneficial to our efforts. This is, indeed, the very principle behind one of our most beloved models: the support vector machine (SVM).

The SVM works by implicitly adding at least one new variable, created through a mathematical transformation (a kernel function), that distorts our data in just the right way to allow linear separators to slice it into meaningful parts. That is what is called the “kernel trick”.
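
A small sketch with scikit-learn (the dataset and parameters here are illustrative, not prescriptive): two concentric rings cannot be separated by a line in their original two dimensions, but the RBF kernel’s implicit extra dimensions make them easy to split.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:>6} kernel: test accuracy {clf.score(X_test, y_test):.2f}")
```

With this setup, the linear kernel should hover around chance while the RBF kernel should separate the rings almost perfectly; the extra dimensions are never materialized, since the kernel only computes what inner products in that richer space would look like.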

Many other forms of informational gain from higher-dimensional datasets exist, but they usually depend on a very large number of data points to be taken advantage of (as we saw in “the combinatorics conundrum” section above).

Essentially, increasing the number of variables in your dataset can, under very specific conditions, be of use, but you will need a lot of data and very careful feature engineering in order to use any form of “blessing of dimensionality”.

Final Thoughts

We have seen at least the most important elements that constitute what we know as “the curse of dimensionality” and discussed the way they work and their implications for data science.

I hope it became clear here that:

  • “More variables” does not necessarily mean “more information”;
  • Nothing, absolutely nothing substitutes for good feature engineering when it comes to choosing what we intend to feed our models with;
  • The signal-to-noise ratio (or the “informational density”) of our variables matters… a lot;
  • High-dimensional spaces are very dangerous to sail through…

If those four points are clear, I will consider it a job done.
