In this blog post we will look into the inner workings of the t-SNE algorithm, to understand clearly how it works, what it can be used for, and what its limitations are. I will take a top-to-bottom approach, first explaining on a high level how the algorithm works in general, and then diving deeper into the mathematics behind it. The article contains code blocks in most places where there are equations or calculations, since some people find it easier to understand concepts through code than through math. The entire post is also a notebook and can be downloaded.
Introduction
t-SNE is an algorithm used to visualize high-dimensional data. Because we can’t visualize anything that has more than two – perhaps three – dimensions, t-SNE reduces the number of dimensions in the data, while preserving the structure of the data as much as possible. If we compare t-SNE to PCA, we see that PCA tries to preserve as much of the information (variance) in the data as possible, whereas t-SNE tries to preserve the relative distances and the clustering structure, with less regard for the total amount of information preserved.
General Procedure
As the aim is to reduce the dimensionality of the data, let’s start with a simple example and see what reducing two-dimensional data to one dimension looks like.

As you can see, the output of the algorithm is not just a projection onto the x axis: it transforms the data so that the relationships between similar and dissimilar points are kept as much as possible.
If you take a look at the GIF below you can see how this procedure happens. At each iteration of the algorithm, similar data-points are pulled together and dissimilar points are pushed apart. This means that, given enough iterations, the algorithm should cluster the points in a way that reflects their natural structure.

The Math
Now let’s look at how the algorithm actually works and the math behind it. First of all, we need a way to determine whether a pair of data points is similar. To do this, we first calculate the Euclidean distance between every pair of points.
Then, we go through the following procedure to calculate the similarities: for each data-point, we place it at the center of a Gaussian curve and arrange the rest of the data along the curve according to their distances. This means that points close to the center are similar to the chosen one, while the ones at the edges are very different.
This is reflected in the following formula, which is essentially a normalized Gaussian distribution:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
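As a rough sketch of how this could look in code (the function and variable names here are my own, not taken from any particular library; I assume the data is a NumPy array with one row per point):

```python
import numpy as np

def squared_euclidean_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    sum_sq = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 - 2 * x_i . x_j + ||x_j||^2
    return sum_sq[:, None] - 2.0 * X @ X.T + sum_sq[None, :]

def conditional_probabilities(sq_distances, sigmas):
    """p_{j|i}: Gaussian similarity of every point j to point i, normalized per row."""
    # Each row i is scaled by its own bandwidth 2 * sigma_i^2
    exp_term = np.exp(-sq_distances / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(exp_term, 0.0)  # a point is not its own neighbor
    return exp_term / exp_term.sum(axis=1, keepdims=True)
```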
You may be wondering how we choose the standard deviation (sigma). We do not choose it directly, but through something called perplexity, which is defined as

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

where $H(P_i)$ is the Shannon entropy of the distribution around point $i$.
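In code, the perplexity of one row of conditional probabilities could look like this (again just a sketch with my own naming):

```python
def perplexity(prob_row):
    """Perplexity of one row of p_{j|i}: 2 raised to the Shannon entropy of the row."""
    p = prob_row[prob_row > 1e-12]          # drop zeros to avoid log(0)
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy
```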
Perplexity roughly reflects our estimate of the density of the data, i.e. the effective number of neighbors each point has. The best value for perplexity cannot really be derived analytically, so we usually need to try out different values and find the one that gives the best results; typical values lie between 5 and 50.
With the perplexity set by us, the corresponding sigma for each point is found by the algorithm via binary search.
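A minimal sketch of that search, reusing the perplexity helper above (the tolerance, iteration count and search bounds are my own assumptions, not values prescribed by the algorithm):

```python
def find_sigma(sq_dist_row, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the sigma of one point until its perplexity matches the target.

    sq_dist_row holds the squared distances from this point to all *other* points.
    """
    low, high = 1e-10, 1e3                   # assumed search interval
    sigma = 1.0
    for _ in range(max_iter):
        sigma = (low + high) / 2.0
        p = np.exp(-sq_dist_row / (2.0 * sigma ** 2))
        p /= p.sum()
        current = perplexity(p)
        if abs(current - target_perplexity) < tol:
            break
        if current > target_perplexity:
            high = sigma                     # too many effective neighbors: shrink sigma
        else:
            low = sigma                      # too few: grow sigma
    return sigma
```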
Now that we know how to find the similarities between the data points in their original, high-dimensional state, we need to do the same for the low-dimensional representation. This way we can make sure that data that is close together in its original structure is also close together in the projection.
This is a very similar procedure to the previous one, the only difference being that instead of a Gaussian distribution we use a Student’s t-distribution with one degree of freedom:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

where the $y$’s are the low-dimensional counterparts of the points $x$.
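A sketch of the low-dimensional similarities, reusing the distance helper from before (Y is the current low-dimensional map, one row per point):

```python
def low_dim_similarities(Y):
    """q_{ij}: Student-t (one degree of freedom) similarities in the low-dimensional map."""
    inv = 1.0 / (1.0 + squared_euclidean_distances(Y))  # heavy-tailed kernel instead of a Gaussian
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()                               # normalized over all pairs, not per row
```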
This distribution is very similar to the Gaussian; it just has heavier tails, which makes the peak in the middle a bit lower.

We now know how to calculate the similarities in both the high- and the low-dimensional space. Next comes the iterative part: using gradient descent, we step towards better and better low-dimensional representations of the data.
But in order to calculate a gradient, we need a cost function – a way to check how accurately the low-dimensional structure represents the high-dimensional one. Because we are essentially comparing two probability distributions, we use the Kullback-Leibler divergence for this.
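Written out, the cost is $C = \mathrm{KL}(P\,\|\,Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$, where, following the original paper, $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$ is the symmetrized version of the conditional similarities from before. A sketch in code (the small epsilon is my own addition, just to avoid division by zero):

```python
def joint_probabilities(cond_P):
    """Symmetrize the conditional similarities: p_ij = (p_{j|i} + p_{i|j}) / 2n."""
    n = cond_P.shape[0]
    return (cond_P + cond_P.T) / (2.0 * n)

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence between the joint similarity matrices P and Q."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```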
I will not show the derivation of the gradient here; if you are interested, I suggest checking the appendix of the original paper. Going through that derivation, the resulting gradient is

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$
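A direct, unoptimized translation of that formula into code could look like this (a real implementation would vectorize the loop):

```python
def gradient(P, Q, Y):
    """dC/dy_i for every point i, following the gradient formula above."""
    inv = 1.0 / (1.0 + squared_euclidean_distances(Y))
    pq_diff = P - Q
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        # 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^-1
        weights = pq_diff[i] * inv[i]
        grad[i] = 4.0 * np.sum(weights[:, None] * (Y[i] - Y), axis=0)
    return grad
```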
With that, the gradient descent procedure is essentially the usual one. We do, however, add momentum to make the learning a bit quicker at the beginning; the momentum coefficient is just a simple step function of the iteration number.
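Putting the pieces together, a minimal version of the optimization loop could look like the sketch below. The learning rate, iteration count and momentum schedule (0.5 for the first 250 iterations, 0.8 afterwards, as suggested in the original paper) are typical choices rather than the only possible ones, and refinements such as early exaggeration are left out:

```python
def run_tsne_descent(P, n_points, n_iter=1000, learning_rate=200.0, seed=0):
    """Iteratively improve a 2-D map Y with gradient descent plus momentum."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(0.0, 1e-4, size=(n_points, 2))   # small random initial map
    velocity = np.zeros_like(Y)
    for t in range(n_iter):
        Q = low_dim_similarities(Y)
        momentum = 0.5 if t < 250 else 0.8          # the simple step function mentioned above
        velocity = momentum * velocity - learning_rate * gradient(P, Q, Y)
        Y += velocity
    return Y
```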
There is only one thing left that I haven’t told you yet: the so-called crowding problem. The gist of it is that as we decrease the number of dimensions, the amount of structure we can faithfully represent in them also decreases. To see this, imagine a square whose four corners are data-points, and try to represent those four points in a one-dimensional space. Whichever arrangement you pick, none of them will feel right, because in one dimension you cannot represent the fact that each point has equal distances to two of the others. This is the crowding problem. To try to alleviate it, we set

Now we know everything we need to implement the t-SNE algorithm.
Results
We can quickly run it on the MNIST digits dataset and see that it separates the different digits into distinct clusters, which is exactly what we would like to see.
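If you want to reproduce a similar picture without the hand-rolled implementation, scikit-learn’s TSNE gives a comparable result (here on its small bundled digits dataset, a convenient stand-in for MNIST; this is not necessarily how the figure in this post was produced):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()                                   # 8x8 digit images, 64 dimensions
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```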

But let’s now generate some data to check how it behaves on different structures. I will generate data with only 2 dimensions and run t-SNE on it to obtain a representation that also has two dimensions. Although this wouldn’t make sense in practice, I do it here so we are able to visualize the results and have an easier time examining them.
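The exact datasets used for the figures below are not reproduced here, but the experiments are easy to mimic; for example, something along these lines generates a few 2-D clusters and embeds them at several perplexities:

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Three 2-D clusters with different spreads (illustrative values, not the ones used for the figures)
X, labels = make_blobs(n_samples=600, centers=[(0, 0), (10, 0), (0, 10)],
                       cluster_std=[0.5, 1.5, 3.0], random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, s=5)
    ax.set_title(f"perplexity = {perp}")
plt.show()
```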
First, clusters at different distances. If you look at the results below, you can see that the distances between clusters are not preserved in all the results. This shows the importance of choosing the correct perplexity to keep as much of the original structure as possible.

Next, we try clusters with different densities. Here we see that the densities of the clusters created by t-SNE become roughly the same. The reason is that when we calculate the similarities we normalize the results, which causes the clusters to end up with similar sizes even though that is not how the original data looks.

We can also see that the algorithm can distinguish a denser region inside a cluster as a separate cluster, which is pretty impressive. Do note, however, that the chosen perplexity will change the results, so we do need to encode our knowledge of the data through the perplexity, or try out different values and compare the results.

Lastly, one more example, mostly because it looks really cool visually. Still, if you look at the result obtained with a perplexity of 50, you can see that the middle of the clusters is pushed outward more than the rest of the data-points. This is because these regions are denser, so the push-pull effects become stronger.
If you are interested in seeing more examples of t-SNE’s results on different structures of data, and its limits, I suggest checking out distill.pub.

To close, we can compare the results obtained by t-SNE to those of PCA, to clearly see that t-SNE aims to cluster the data, which makes its visualizations much more useful when the goal is to find clusters.
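A side-by-side comparison is straightforward to reproduce with scikit-learn (again on its bundled digits dataset; the figure in this post may have been produced differently):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()
pca_emb = PCA(n_components=2).fit_transform(digits.data)
tsne_emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(ax1, pca_emb, "PCA"), (ax2, tsne_emb, "t-SNE")]:
    ax.scatter(emb[:, 0], emb[:, 1], c=digits.target, cmap="tab10", s=5)
    ax.set_title(title)
plt.show()
```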

Conclusion
We saw how t-SNE works, both through math and through code. If you’d like to download the post as a notebook, you can get it here. We also saw that t-SNE is useful for different purposes than PCA: although they both reduce the dimensionality of the data, they do it in very different ways. For this reason, the drawbacks we saw should be carefully considered before using t-SNE.