Why and how to get rid of the curse of dimensionality the right way (with breast cancer dataset visualization)

Halyna Oliinyk
Towards Data Science
11 min read · Mar 20, 2018


The topic of this article has been floating around in my thoughts for quite a long time, but, as my blog activity shows, I couldn't find the inspiration to put what I know into structured words for more than a few months. I hope my blog-writing skills haven't deteriorated too much since my last article and that you will enjoy the topic described here; any type of feedback is welcome, as always.

Informal definition and intuition

The curse of dimensionality and dimensionality reduction itself are topics that are talked about rather rarely; however, for almost every data scientist (whether they work with machine learning or natural language processing) there comes a moment when they meet this problem face to face.

It is quite difficult to formulate a rigorous definition of the curse of dimensionality, but in simple words, it is the

poor performance and high complexity of algorithms associated with a dataframe having a large number of dimensions/features, which frequently makes the target function quite complex and may lead to model overfitting, since the dataset often actually lies on a lower-dimensional manifold.

Since the majority of machine learning/natural language processing algorithms operate on a numerical dataset (coming from real life or generated by some data preprocessing step), it is quite common for each observation to be a real-valued vector of size N×1, where N may be 200, 500 or more than 10,000. However, as mentioned previously, a lot of ML/NLP methodologies can't deal with that number of features properly (especially classic machine learning models).

In some real-life cases, a huge number of variables leads to a substantial decrease in model accuracy and recall; reducing the dimensionality of the dataset properly helps to achieve better model generalization and nice cross-validation results.

The choice of a suitable dimensionality reduction algorithm depends heavily on the type of dataset we're working with and on the target function we're preparing the data for; for instance, in some cases it is important to preserve distances in the lower-dimensional space, while in other cases the conditional probability distributions of the observations carry the vast majority of the information encoded in the high-dimensional feature space.

Manual feature selection

If the data scientist knows the dataset and the meanings of its numerical features well, then it is quite easy to determine whether we need some of the variables or not. For instance, when predicting the temperature in some region from data about humidity, cloudiness, time of the sunrise, etc., data about the population of that region and its traffic won't be useful, since the target variable doesn't depend on them. However, such an approach is only possible when the dataset is well interpretable and the dependencies between the variables are easily determined.
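As a tiny illustration of this idea (the column names and values here are made up for the example), dropping such irrelevant variables with pandas could look like this:

```python
import pandas as pd

# Hypothetical weather dataframe: the last two feature columns are unrelated to the target
df = pd.DataFrame({
    "humidity":      [0.71, 0.45, 0.88, 0.52],
    "cloudiness":    [0.80, 0.10, 0.90, 0.30],
    "sunrise_hour":  [6.5, 6.6, 6.4, 6.7],
    "population":    [120_000, 120_000, 120_000, 120_000],
    "traffic_index": [3.2, 2.9, 4.1, 3.8],
    "temperature":   [18.2, 24.5, 15.1, 22.0],
})

# Manual feature selection: keep only the variables the target plausibly depends on
X = df.drop(columns=["population", "traffic_index", "temperature"])
y = df["temperature"]
print(list(X.columns))  # ['humidity', 'cloudiness', 'sunrise_hour']
```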

Another nice idea for getting rid of useless variables is to look at their correlation (Pearson correlation coefficient) and remove one of any two features that are highly correlated, since both of them carry very similar information.

The Pearson correlation coefficient for a population is calculated as:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)

The numerator of this formula is the covariance of two variables and its denominator is the product of their standard deviations.
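A minimal sketch of this correlation-based filtering with pandas, applied to scikit-learn's breast cancer data (the 0.95 threshold is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature out of every pair whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so that each pair of features is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

features = load_breast_cancer(as_frame=True).data
reduced = drop_highly_correlated(features, threshold=0.95)
print(features.shape[1], "->", reduced.shape[1])
```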

Principal component analysis and singular value decomposition (Principal Component Analysis)

I consider principal component analysis to be the most popular dimensionality reduction algorithm in machine learning; some data scientists even use it as a synonym for dimensionality reduction itself (which is wrong, of course) because of how commonly it appears in papers, studies, etc.

Principal component analysis
The idea behind PCA is a matrix factorization approach: we represent our input dataset as the product of two matrices, T and P transposed (plus a residual matrix and a mean vector, as we will see later).

Projecting the dataset onto the matrix T is the same as assigning a scalar to every row; its columns represent dominant object patterns (these columns are also called 'score vectors'). Projecting the dataset onto the matrix P' is the same as assigning a scalar to every column; its rows represent dominant variable patterns (these rows are also called 'loading vectors'). The vectors in T are mutually orthogonal, and so are the vectors in P'.

In fact, PCA can be formulated as a least squares model:

X = 1·mean' + T·P' + E, where 1 denotes a column vector of ones.

Epsilon (E) represents the residuals, which are the deviations between the original coordinates and their projections; the mean vector may be included in the model formulation or determined mathematically.
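A small NumPy sketch of this formulation on the breast cancer data, just to make the pieces concrete (loadings P from the covariance matrix, scores T, residual E; keeping k=2 components is my choice here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data        # shape (n_samples, n_features)
mean = X.mean(axis=0)                # the mean vector from the model formulation
Xc = X - mean                        # centered data

# eigendecomposition of the covariance matrix gives the loading vectors
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]    # sort components by decreasing variance

k = 2
P = eigvecs[:, order[:k]]            # loading vectors are the columns of P
T = Xc @ P                           # score vectors are the columns of T

X_hat = mean + T @ P.T               # reconstruction from k components
E = X - X_hat                        # the residual matrix (epsilon in the model)
print("norm of the residual:", np.linalg.norm(E))
```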

Here is the breast cancer dataset (classification task) with data projected into 2 dimensions.
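A sketch of how such a projection can be produced with scikit-learn and matplotlib (standardizing the features first is my assumption here, since PCA is sensitive to feature scales):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # put all features on the same scale

X_2d = PCA(n_components=2).fit_transform(X)     # project onto the first two components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.title("Breast cancer dataset, PCA projection")
plt.show()
```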

Singular value decomposition
Informally speaking, singular value decomposition is essentially the same factorization as in principal component analysis, except that now the columns of our matrix T are normalized to length 1 and we have one more matrix D, which stores the actual lengths of these vectors:

X = U·D·V'

The matrix U is the same as T but with normalized vectors, as mentioned earlier; D is a diagonal matrix holding the square roots of the eigenvalues of the covariance matrix XX' as its values; and V' is exactly the same as P'.

SVD is also the basis for other popular algorithms, for instance latent semantic indexing in natural language processing tasks. The values of the matrix D are closely related to the 'explained variance' of the components; they represent the importance of each component.
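A quick NumPy check of this relationship between X, U, D and V', with the explained variance computed from the diagonal of D (a sketch on the mean-centered breast cancer data):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
Xc = X - X.mean(axis=0)

# Xc = U @ D @ Vt, with the columns of U normalized to length 1
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
D = np.diag(d)
assert np.allclose(Xc, U @ D @ Vt)

T = U @ D                                        # un-normalized score vectors, as in PCA

# squared singular values divided by (n - 1) give the variance explained by each component
explained_variance = d ** 2 / (Xc.shape[0] - 1)
print(explained_variance / explained_variance.sum())
```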

Multidimensional scaling (Mapping a manifold of perceptual observations)

The intuition behind the multidimensional scaling algorithm is all about preserving the intrinsic metric structure of the objects: we calculate geodesic (locally shortest) distances between objects and map these paths into straight lines in the reduced feature space.

The central building block of this approach is the Isomap procedure, which is formulated in a quite simple fashion:

Isomap assumes that the distance between points in observation space is an accurate measure of manifold distance only locally, and that it must be integrated over paths on the manifold to obtain global distances. The two main properties of this non-linear mapping are:

  • intrinsically similar observations should map to nearby points in feature space;
  • geodesic paths in a manifold should map into straight lines in the obtained feature space.

The isomap procedure consists of 3 crucial steps:

  1. discrete representation of the manifold. We randomly select r points from the dataset to serve as nodes of a topology-preserving network and connect two nodes only if there exists at least one observation in the whole dataset whose two closest nodes are exactly those two. The resulting graph G respects the topology of the manifold;
  2. manifold distance measure. We assign a weight to each link in the graph G, equal to the Euclidean distance between the corresponding data points:

w(i, j) = ||x_i - x_j||

The length of a path in G is equal to the sum of the link weights along that path. As a next step, we move from Euclidean distances to geodesic distances, initialized as:

d_G(i, j) = w(i, j) if nodes i and j are connected by a link, and d_G(i, j) = ∞ otherwise.

After that, for each node k we update the geodesic distances to be:

d_G(i, j) = min{ d_G(i, j), d_G(i, k) + d_G(k, j) }

The computations are performed with a dynamic programming (all-pairs shortest path) algorithm;

  3. isometric Euclidean embedding. Ordinal multidimensional scaling ('nonmetric MDS') is used to find a configuration of k-dimensional feature vectors Y, corresponding to the high-dimensional observations, such that the graph distances are preserved as closely as possible; ordinal MDS is also less sensitive to noise and outliers in the dataset. The cost being minimized is a stress function of the form:

S(Y) = sqrt( Σ_{i<j} ( d̂_G(i, j) - ||y_i - y_j|| )² / Σ_{i<j} ||y_i - y_j||² )

We minimize this cost function with respect to the Euclidean distances in the embedding; a graph distance with a hat sign means a graph distance with some monotonic transformation applied to it. Overall, the main idea of multidimensional scaling is to preserve the existing distances of the high-dimensional space as well as possible by turning geodesic distances into straight lines in the low-dimensional space, so that the positions of the data points in the resulting manifold still relate correctly to each other.

The same dataset projected into 2 dimensions: it is clear that the clusters are now denser, and the trend in the data is clearly visible; it could almost be summarized by a linear regression line.
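A sketch of producing such a projection with scikit-learn's Isomap (note that scikit-learn builds the neighborhood graph from k nearest neighbors rather than from randomly selected nodes as described above; n_neighbors=10 and the feature standardization are my assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import Isomap
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# build the neighborhood graph, compute geodesic distances and embed into 2 dimensions
X_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.title("Breast cancer dataset, Isomap projection")
plt.show()
```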

Locally linear embedding (Nonlinear Dimensionality Reduction by Locally Linear Embedding)

The previously described dimensionality reduction technique was based on working with distances between observations and preserving them appropriately in the low-dimensional space; the idea of locally linear embedding is based on the fact that each data point and its neighbors lie on (or close to) a locally linear patch of the manifold. These patches are characterized by linear coefficients that reconstruct each data point from its neighbors.

Reconstruction errors are measured with:

E(W) = Σ_i || x_i - Σ_j W_ij · x_j ||²

The reconstructions are indexed by j; the cost function we're minimizing has two main constraints: first, we reconstruct each data point only from its neighbors (W_ij = 0 if x_j is not a neighbor of x_i); second, the rows of the weight matrix sum to one. We need these constraints for the weights to obtain the necessary symmetry (invariance to rotations, rescalings and translations). The optimal weights are found by solving a least-squares problem; in fact, the optimal weight W_ij is the contribution of the j-th data point to the i-th reconstruction.

The core concept behind LLE is that the same weights that reconstruct the data points in the high-dimensional space should also reconstruct their embedded coordinates on the manifold in the low-dimensional space. The reconstruction weights reflect intrinsic geometric properties of the data that are invariant to transformations such as translation, rotation, rescaling, etc.

The global internal coordinates Y are found with a neighborhood-preserving mapping of the high-dimensional observations, obtained by minimizing the following cost function:

Φ(Y) = Σ_i || y_i - Σ_j W_ij · y_j ||²

Here the previously calculated weights are kept fixed, and we optimize the cost function over the embedded coordinates Y.

As we can see on the plot, distances between data points and the relative positions of the observations are not preserved well, but the neighborhood relations between samples are demonstrated clearly.
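A sketch of the corresponding LLE projection with scikit-learn (n_neighbors=12 and the feature standardization are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# reconstruct each point from its neighbors, then find 2D coordinates preserving those weights
X_2d = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.title("Breast cancer dataset, LLE projection")
plt.show()
```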

SNE and t-SNE (Visualizing Data using t-SNE)

Stochastic neighbor embedding and t-distributed stochastic neighbor embedding are the last methodologies described in this post, but they are still quite important for dimensionality reduction tasks, since they use probabilistic similarities between observations as the core of the model. The main idea behind both SNE and t-SNE is to use these similarities to map high-dimensional data points into a low-dimensional space while capturing much of the local structure of the high-dimensional data.

Stochastic neighbor embedding
As mentioned earlier, preserving neighborhood identities is the main task we're trying to solve when embedding observations into a low-dimensional space.

An asymmetric probability, for each object i and each potential neighbor j, that i would pick j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i, is defined as:

p_{j|i} = exp(-d_ij²) / Σ_{k≠i} exp(-d_ik²)

The scaled squared Euclidean distance (affinity) between two high-dimensional points is formulated with the use of the variance of this probability distribution in the following way:

d_ij² = ||x_i - x_j||² / (2σ_i²)

Sigma squared (σ_i²) is the variance of the Gaussian that is centered on the datapoint x_i.

The affinity has the following properties (a small numerical sketch follows below):

  • k is the effective number of local neighbors, the 'perplexity', chosen by hand;
  • σ_i² is set to the value that makes the entropy of the distribution over neighbors equal to log k.
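Here is a small NumPy sketch of these quantities: it computes p_{j|i} for one point with a fixed σ_i and reports the resulting perplexity (in a real implementation σ_i would be tuned, e.g. by binary search, until the perplexity matches a user-chosen value):

```python
import numpy as np

def conditional_probabilities(X, i, sigma_i):
    """p_{j|i}: probability that point i picks point j as a neighbor, under a
    Gaussian centered at x_i with standard deviation sigma_i."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances from x_i to every point
    logits = -d2 / (2.0 * sigma_i ** 2)
    logits[i] = -np.inf                    # a point never picks itself as a neighbor
    p = np.exp(logits - logits.max())      # subtract the max for numerical stability
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2^H(p): the effective number of neighbors k."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # toy data
p = conditional_probabilities(X, i=0, sigma_i=1.0)
print("effective number of neighbors:", perplexity(p))
```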

The induced probability in the low-dimensional space that point i picks point j is defined in a similar way, but with a fixed variance (so no per-point σ is needed), as:

q_{j|i} = exp(-||y_i - y_j||²) / Σ_{k≠i} exp(-||y_i - y_k||²)

Since we would like the low-dimensional distribution to model the high-dimensional one as well as possible, the cost function we construct is the sum of Kullback-Leibler divergences over all datapoints, minimized with a gradient descent method:

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} · log( p_{j|i} / q_{j|i} )

Because the Kullback-Leibler divergence is not symmetric, different types of error in the pairwise distances in the low-dimensional map are not weighted equally. Minimizing the previously defined cost function leads to an embedding in which the conditional probabilities calculated in the low-dimensional space are equal, or almost equal, to the conditional probabilities calculated in the high-dimensional space, so that the mismatch between the two probability distributions is minimized.

t-distributed stochastic neighbor embedding
t-distributed stochastic neighbor embedding differs from plain stochastic neighbor embedding in that we use symmetric joint probabilities instead of asymmetric conditional probabilities to calculate similarities in the high-dimensional space. Also, a Student t-distribution is employed in the low-dimensional space to alleviate both the crowding problem and the optimization problems of SNE.

Symmetrizing the pairwise affinities ensures that every datapoint makes a significant contribution to the cost function, even when the distances from a high-dimensional datapoint to all other points are big (as for outliers):

p_ij = (p_{j|i} + p_{i|j}) / (2n)

The crowding problem is the fact that 'the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints'. As a result, distances between datapoints are not preserved adequately and the natural clusters of the dataset are not formed correctly because of the high variance of the projection into the low-dimensional space. In order to solve the crowding problem, we use a Student t-distribution with one degree of freedom, which makes the map's representation of joint probabilities (almost) invariant to changes in the scale of the map for map points that are far apart:

q_ij = (1 + ||y_i - y_j||²)^(-1) / Σ_{k≠l} (1 + ||y_k - y_l||²)^(-1)

A roughly spherical arrangement of the data in the low-dimensional space is quite a common result after performing t-SNE:
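A sketch of the corresponding t-SNE projection with scikit-learn (perplexity=30, PCA initialization and feature standardization are my choices here, not prescribed by the method):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# perplexity is the "effective number of neighbors" discussed above
X_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.title("Breast cancer dataset, t-SNE projection")
plt.show()
```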

Conclusions

This article doesn't discuss the pros and cons of the described approaches, the methodologies they're most suitable for, the most common use cases, etc., because all of this is highly dependent on the input data, the noise present in it, the number of dimensions, and so on. In practice, dimensionality reduction techniques can't be measured in isolation for the same reason as word embedding models: their effectiveness depends heavily on the algorithm that is applied afterwards and on the type of data passed into the methodology.

The visualizations presented here give a nice insight into the type of information a dimensionality reduction technique preserves, for instance whether it reflects neighborhood relations, trends or cluster positions well.
