t-Distributed Stochastic Neighbor Embedding
The t-SNE algorithm is a good tool for looking at high-dimensional data. Think about it: visualizing data with more than two dimensions can become really challenging. And I mean more than two because, even with the great 3D graphics available these days, it is still very difficult for our brains to interpret 3D images. After all, our monitors and screens are still good old flat 2D, right?
So we came up with this nice tool: t-Distributed Stochastic Neighbor Embedding. What does this fancy name mean?
It means that the algorithm looks at the dataset, checks for similarities between data points, and converts them to joint probabilities (the likelihood of two events occurring together). Then it tries to minimize the Kullback-Leibler divergence (a measure of the difference between two probability distributions) between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
In short: check the data points, calculate the joint probabilities for the high-dimensional data, and find a low-dimensional embedding whose joint probabilities match them as closely as possible.
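In scikit-learn, that whole pipeline is a single estimator. Here is a minimal sketch (the dataset is made up purely for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

# Fake high-dimensional data: 300 points in 20 dimensions (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))

# TSNE builds joint probabilities P from pairwise similarities in X, then
# finds a 2D embedding whose joint probabilities Q minimize KL(P || Q)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (300, 2) -> ready to plot
```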
Perplexity
If we look at the documentation, perplexity is "related to the number of nearest neighbors that is used in other manifold learning algorithms". It also says that "larger datasets usually require a larger perplexity".
This is a tunable parameter. It can be understood as an estimate of the number of close neighbors each point has. It is also one of the inputs used to calculate the standard deviations of the Gaussian distributions in the high-dimensional space. As per the documentation, consider selecting a value between 5 and 50.
Also consider testing different values, since they can produce significantly different results.
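One way to run such a test is to loop over a few perplexity values and plot each embedding side by side. A minimal sketch (the dataset and values are illustrative stand-ins, not the exact setup used below):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Illustrative clustered dataset standing in for the fake dataset below
X, y = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(16, 4))
for ax, perp in zip(axes, perplexities):
    emb = TSNE(n_components=2, perplexity=perp, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=10)
    ax.set_title(f"perplexity={perp}")
plt.show()
```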
Look at this test with a fake dataset I created for learning purposes:

If we force the number of iterations to n_iter=250, the result below occurs. Notice that we don't see a clear separation at high perplexity in this case, because the algorithm needs a different number of iterations to converge to its best solution.
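For reference, capping the iterations is just another constructor argument. A sketch, reusing X from the loop above (recent scikit-learn versions rename n_iter to max_iter):

```python
from sklearn.manifold import TSNE

# Stop the optimization after 250 iterations instead of the default 1000;
# on recent scikit-learn versions pass max_iter=250, as n_iter is deprecated
emb = TSNE(n_components=2, perplexity=30, n_iter=250, random_state=42).fit_transform(X)
```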

As my dataset is small, you can see that the smaller perplexity values drive better results, given that each point really does have a smaller number of close neighbors.
If I create another dataset with 50 uniformly distributed variables and present it to t-SNE, look at the results at different perplexity values.
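Generating that kind of structureless dataset takes one line. A sketch of the setup (sizes are illustrative):

```python
import numpy as np

# 300 observations of 50 uniformly distributed variables: no real clusters
rng = np.random.default_rng(0)
X_uniform = rng.uniform(size=(300, 50))
# Feed X_uniform to the same perplexity loop as above
```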

In this case, the perplexity parameter starts to behave very randomly above 30. As we see in the documentation, it should not be set above 50, as the results won't be as precise.
Other Considerations
Other points worth considering, taken from the good article How to Use t-SNE Effectively, are:
- Cluster sizes in a t-SNE plot mean nothing: t-SNE does not preserve the sizes of the clusters.
- Distances between clusters might not mean anything: the distances between clusters will not always be reflected by t-SNE. Keep that in mind.
- Random noise doesn't always look random: if you create a random dataset and present it to t-SNE, you can still see some pattern at low perplexity values. That does not mean there are clusters (see the sketch after this list).
- You can see some shapes, sometimes: the perplexity dials in local variance at small values and global variance at high values, which leads to the appearance of shapes.
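The random-noise point is easy to verify yourself. A minimal sketch (purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Pure Gaussian noise: there are no clusters here by construction
rng = np.random.default_rng(1)
noise = rng.normal(size=(200, 30))

# A very low perplexity exaggerates local variance and can make clumps appear
emb = TSNE(n_components=2, perplexity=2, random_state=1).fit_transform(noise)
plt.scatter(emb[:, 0], emb[:, 1], s=10)
plt.title("t-SNE of pure noise, perplexity=2")
plt.show()
```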
The results of t-SNE can perhaps be used for clustering, since the resulting matrix is a low-dimensional version of the data, so the results should be similar. You can see in the test in the code provided at the end of this article that KMeans produced the same results using the t-SNE matrix as it did using the one-hot encoded high-dimensional data.
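A sketch of that kind of comparison (the dataset and the agreement metric below are illustrative stand-ins, not the exact code from the end of the article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

# Numeric stand-in for the one-hot encoded data used in the article
X, y = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Cluster the original data and the embedding, then compare the labelings
labels_hd = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_2d = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(emb)
print(adjusted_rand_score(labels_hd, labels_2d))  # 1.0 means identical clusterings
```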

Before You Go
This article presented t-SNE. Using t-distributions, this nice algorithm helps us visualize high-dimensional data in 2D.
You must fine-tune the perplexity and n_iter hyperparameters to get the best separation and similarities for your data. Try looping over values and plotting each result.
The dataset used for this exercise can be found in my GitHub.
References
Wattenberg, M., Viégas, F., & Johnson, I. (2016). How to Use t-SNE Effectively. Distill. https://distill.pub/2016/misread-tsne/
Code: