
tSNE simplified

Reducing dimensions with tSNE

Guys, to tell you the truth, when I heard the name tSNE, and the full form being t-distributed stochastic neighbor embedding, I was scared. Gradually, I found my way through the papers and various blogs, and realized it is not as ominous as the name suggests. I thought I would try to share the intuitions I gathered. Let me put it down in a question-and-answer format, the way it came to my mind.

What is dimension reduction?

Fewer attributes are simpler to manage. Models are easier to train and store, and the data is easier to visualize. Technically, you can eliminate some of the original features or attributes; this is called feature selection. There is another family of techniques where new features are created from the original set of features. Take a typical example where we have a group of customers and their height and weight have been collected. If we want to create one feature that represents both these attributes, we can use the Body Mass Index (BMI), which is computed as weight in kilograms divided by the square of height in meters. This is the simple motivation behind dimension reduction.
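Just to make this concrete, here is a tiny sketch (the column names and numbers are my own, purely illustrative):

```python
import pandas as pd

# Hypothetical customer data: weight in kg, height in metres
customers = pd.DataFrame({
    "weight_kg": [60, 75, 90],
    "height_m": [1.65, 1.80, 1.72],
})

# BMI = weight (kg) / height (m)^2 -- two raw features collapsed into one
customers["bmi"] = customers["weight_kg"] / customers["height_m"] ** 2
print(customers)
```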

From a technical standpoint, this second set of methods, where we construct new features, is called dimension reduction.

Can this be done arbitrarily, or will it always give meaningful real-world features?

Of course not arbitrarily. There has to be an objective function, as in any ML algorithm. Let's call the original dataset D and the dataset with the reduced or new feature set R. Let there be a function f that measures the information contained in any dataset. At an abstract level, the loss/objective function will be f(D) − f(R); let's call this J(D, R). We should try to minimize J. If there are several possible reductions, say R1, R2, …, Rn, we should pick the version of the newly constructed features that minimizes J.
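Here is a toy sketch of this idea (the information measure f below is just a stand-in I picked, total variance, and the candidate reductions are arbitrary column subsets):

```python
import numpy as np

def f(X):
    # A stand-in "information" measure: total variance across features
    return X.var(axis=0).sum()

def J(D, R):
    # Abstract loss: information in the original data minus information retained
    return f(D) - f(R)

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5))             # original dataset with 5 features
candidates = {"R1": D[:, :2], "R2": D[:, 2:4], "R3": D[:, [0, 4]]}

# Pick the candidate reduction with the smallest J
best = min(candidates, key=lambda name: J(D, candidates[name]))
print({name: round(J(D, R), 3) for name, R in candidates.items()}, "-> best:", best)
```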

Most of the time, no real-world (i.e., domain) meaning like BMI can be attached to the newly constructed features, which is a weakness of this family of methods.

Why is dimension reduction needed?

  • So that we can visualize the data (for that we need to reduce it to two or three dimensions) and interpret it
  • Models become lighter and more efficient, and faster to train and apply

What is tSNE now?

tSNE stands for t-distributed stochastic neighbor embedding.

Let's start with stochastic neighbor embedding. Well, a city like Agra can be represented by its latitude and longitude; alternatively, it can be represented by its distance from New Delhi.

It’s similar to the concept of word embedding, where a word is represented by the surrounding words. A vector is known by the company it keeps.

Fig 1: Representing by Neighbors (Source: Author)

In the figure above, there are 9 points and two attributes, x1 and x2. The point marked in green can be represented by these two values of x1 and x2.

Alternatively, it can be represented by its distance from, or similarity to, its k nearest neighbors (k = 4), the ones marked in black. The similarity is inversely proportional to the distance. To standardize, we can scale these similarities to numbers between 0 and 1.

If we look at this in a slightly different manner, we can think of it as the probability of selecting a neighbor, given that the green point has already been chosen. Let's understand this with the simple example below.

Fig 2: Neighborhood of the point (Source: Author)

The intuition is simple:

  • The closer the neighbor, the higher the probability
  • The sum of the probabilities of all four neighbors is one.

Any time we are dealing with probability, we are dealing with randomness; in more formal terms, randomness is referred to as "stochastic".

So instead of representing the green point by its coordinates, we can represent it as (0.91, 0.045, 0.045, 0.01). This is called a stochastic neighbor embedding.
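Here is a minimal sketch of this conversion from distances to neighbor probabilities (the distances and the Gaussian bandwidth sigma are made up; in the real algorithm, sigma is tuned per point through perplexity):

```python
import numpy as np

# Made-up distances from the green point to its 4 nearest neighbours
distances = np.array([1.0, 2.5, 2.5, 3.5])
sigma = 1.0  # assumed Gaussian bandwidth; SNE tunes this per point

# Similarity falls off with distance (Gaussian kernel), then normalise
# so the four values sum to 1 -- a probability over the neighbourhood
weights = np.exp(-distances**2 / (2 * sigma**2))
probs = weights / weights.sum()
print(probs.round(3))  # closer neighbours get higher probability
```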

We are still not done; your question will, of course, be: how is the dimension reduced? Now let's call the probability distribution in the original space p, and let's also assume that the original vectors do not have two features but a total of fifty.

Let's further assume that, by some magic, we have transformed these vectors into two-dimensional vectors, and then again represented each point by its stochastic neighbor embedding, which is nothing but a probability distribution q. When we calculate these probabilities, a standard probability density function such as the normal distribution is assumed.

Let's recap: the same point is represented by a probability distribution p in the original space and by another probability distribution q in the reduced space. Without applying much thought, we can conclude that p and q should be as similar as possible.

How do we measure the similarity between two probability distributions? We do that by using the KL divergence.

Equation 1: KL Divergence for a single point
Equation 2: KL Divergence for all the points

The J that we were talking about is nothing but the equation above.
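For reference, the per-point divergence is KL(p || q) = sum over j of p_j * log(p_j / q_j), and the overall objective sums this over all the points. A tiny sketch with made-up distributions:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_j p_j * log(p_j / q_j)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

# Made-up neighbourhood distributions for one point:
# p from the original space, q1/q2 from two candidate low-dimensional embeddings
p  = [0.91, 0.045, 0.045, 0.01]
q1 = [0.25, 0.25, 0.25, 0.25]   # poor embedding: neighbourhood structure lost
q2 = [0.85, 0.07, 0.07, 0.01]   # better embedding: structure mostly preserved

print(kl_divergence(p, q1))  # larger divergence
print(kl_divergence(p, q2))  # smaller divergence -> preferred embedding
```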

The intuition of KL Divergence

Let's say the original distribution is given as follows:

Fig 3: Embedding in Original Feature Space (p) (Source: Author)

Now let's use the concept of KL divergence, which is a function of the ratio of p to q. Let's evaluate the following two sets of distributions in the reduced space.

Fig 4: Two target probability distributions q1 and q2

Visually, we can see that the second one is more similar to the original distribution in Figure 3, and when we calculate this ratio of p to q for each point and sum it up, we do see that the value is lower for the second alternative (4.08 instead of 5).

What about t-distribution?

Again, we will not deal with the maths here. The idea is that we try to preserve the neighborhood (often called the local structure) when we embed, or convert, to a lower dimension. In the first paper, which introduced stochastic neighbor embedding, this neighborhood preservation was good enough. In the improvement, the authors wanted points that are far apart in the original space to move even further apart in the reduced space.

Let's say we know that the probability of a particular neighbor is 0.05. Using the standard normal table, we can say it is a distance of 1.65 away from the mean (the reference point). If we convert the same probability of 0.05 using a t-distribution (with 20 degrees of freedom), it is 1.725. The bottom line: for the same probability, the t-distribution pushes the neighbor in the lower dimension further away from the reference point.
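You can verify these two numbers with scipy (they are the one-tailed 5% quantiles of the standard normal and of the t-distribution with 20 degrees of freedom):

```python
from scipy.stats import norm, t

# Distance from the mean that leaves 5% probability in the upper tail
print(norm.ppf(0.95))        # ~1.645 for the standard normal
print(t.ppf(0.95, df=20))    # ~1.725 for the t-distribution with 20 df
# Heavier tails: the same tail probability sits further from the centre
```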

That's the story of t-distributed stochastic neighbor embedding.

What are the parameters of tSNE?

The simplest parameter is how many features we want in the lower-dimensional space. This is typically two.

Another important parameter is perplexity, which determines how many neighbors we consider when doing this stochastic embedding. If we keep a high value of perplexity, it considers all the points (global); if we keep a small value, it considers only a few neighbors (local). This is demonstrated with the concentric-circle example below.

Fig 5: Concentric Circles in Original Feature Space Scatter plot (Source: Author's notebook)
Fig 6: Scatter plot in tSNE Space with different perplexity of concentric circles (Source: Author's notebook)

Figure 5 shows the original data. Here we are not reducing the dimension, but illustrating the effect of perplexity.

It's evident that with lower perplexity the red points are still close to each other, but the global structure is lost. In the paper, the authors suggested a value between 5 and 50.
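If you want to recreate something like Figures 5 and 6, a minimal sketch (my own recreation, not the exact code from the author's notebook) would be:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.manifold import TSNE

# Two concentric circles in 2D, as in Figure 5; y marks inner vs. outer circle
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

# Embed with tSNE at a low and a high perplexity, as in Figure 6
for perplexity in (5, 50):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X)
    print(f"perplexity={perplexity}: embedded shape {emb.shape}")
```

Scatter-plotting each embedding coloured by y shows the effect described above: at low perplexity the local neighborhoods survive but the overall ring structure falls apart, while higher perplexity keeps more of the global picture.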

Why not PCA?

When we talk about dimension reduction, principal component analysis (PCA) has the most recall. PCA works by preserving the variability of the data. The points that contribute most to the variability are the ones furthest from the center, hence PCA is more influenced by such points, which gives it a kind of global view of the data's structure. Also, PCA has a linearity assumption, which tSNE does not.

Fig 7: 2D Representation of MNIST Dataset for both PCA and tSNE (Source: Author's notebook)

I am sure you know about the MNIST dataset, which has 10 classes and 784 features. In the scatter plots above, two new features are constructed using tSNE and PCA respectively, and each class is shown in a different color in Fig 7. It is very evident that in the tSNE-reduced space the classes are much better separated. Please remember that both are unsupervised methods and hence do not use the class information.
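A minimal sketch of such a comparison (I use scikit-learn's smaller 8×8 digits dataset as a lighter stand-in; the author's notebook uses the full 784-feature MNIST):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 10 classes, 64 features -- a lighter stand-in for MNIST's 784 features
X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Side-by-side scatter plots, coloured by class (colour is for display only;
# neither method uses the labels)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(axes[0], X_pca, "PCA"), (axes[1], X_tsne, "tSNE")]:
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(title)
plt.show()
```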

A PCA, tSNE face-off

  • PCA focuses on global structure rather than local; tSNE focuses on local structure
  • PCA is linear; tSNE is non-linear
  • PCA's intuition is simpler and it has fewer parameters than tSNE; as a result, it has wider applicability
  • tSNE is computationally more expensive than PCA

Please look at the notebook and the videos for further details.

Our Resources from Calcutta University (CU) Data Science Lab (A. K. Choudhury School of IT and Statistics Department)

Hands-on: https://www.youtube.com/watch?v=su6amJTXnto

Theory: https://youtu.be/KvbuJ0daXd4

Kaggle Notebook: https://www.kaggle.com/saptarsi/tsne-with-python

Conclusion:

If the original features themselves are important, dimensionality reduction cannot be used. PCA can be a good starting point to see the structure; if we do not get a good result, we can go for tSNE. If it's costly to run on the entire dataset, a sampling approach can be adopted.

References:

[1] Hinton GE, Roweis ST. Stochastic neighbor embedding. In Advances in neural information processing systems 2003 (pp. 857–864).

[2] Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media; 2019 Sep 5.

[3] https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1

[4] YouTube. (2013, November 6). Visualizing Data Using t-SNE [Video File]. Retrieved from https://www.youtube.com/watch?v=RJVL80Gg3lA

