An Extensive Introduction to Latent Spaces

Introduction to Embedding, Clustering, and Similarity

A close look at key elements of ML and Latent Spaces

Mathias Grønne
Towards Data Science
12 min read · Sep 15, 2022


Photo by Javier Allegue Barros on Unsplash

In this post, we give a general introduction to embedding, similarity, and clustering, which are the basics of most ML and essential to understanding the Latent Space. The process of representing the real world as data in a computer is called embedding and is necessary before the real world can be analyzed and used in applications. Similarity measures how similar real-world embeddings are to each other and enables applications such as product recommendation. Clustering identifies groups within real-world embeddings and enables applications such as identifying which books are about the same topic.

Table of Contents:

  • 1.1 Embedding
  • 1.2 Similarity
  • 1.3 Clustering
  • 1.4 K-Means Clustering
  • 1.5 Conclusion

1.1 Embedding

Computers can represent and analyze real-world occurrences. Examples include using text to represent a book so it can be read later, or using an image to represent a drawing so it can be shared with others online.

The process of representing something in a computer is called embedding. A computer only knows numbers and must therefore represent the embedding as numbers. Embedding starts with asking questions about what and how we wish to represent something. Take a book as an example: we can ask about its colors by taking an image of its pages with a digital camera. The camera has many tiny sensors that each ask what color they see and convert that color into a number. These sensors divide what the camera sees into small squares, where each square corresponds to a single sensor and color; more sensors equal higher resolution.

Figure 1.1 — Process of turning an image into numbers. Book photo by Jess Bailey on Unsplash

A lot of numbers are required to embed a book. A ‘G’ requires 5x5=25 numbers, a sentence with 48 characters needs 25x48=1,200, a page with 28 lines requires 1,200x28=33,600, and an entire book with 250 pages needs 33,600x250=8,400,000 numbers. That is a lot of numbers!

It is not only the book’s content that we are asking for when taking a picture; we are also asking how it looks, the state of the book, the lighting conditions, and how far it is from the camera. We do not often need this amount of detail about a book; the good thing is that we can trade some of it for fewer numbers. We can instead ask only for the content of the book. Digits and letters are often stored in a format called ASCII, with numbers between 48 and 122, where the computer knows that 48 is ‘0’ and 71 is a capital ‘G’. Using ASCII means that a character is 1 number, a sentence is 1x48=48 numbers, a page is 48x28=1,344 numbers, and a book is 1,344x250=336,000 numbers.

Figure 1.2 — A sentence from a book converted to ASCII. Book photo by Jess Bailey on Unsplash
Figure 1.3 — ASCII table of how a computer often stores characters (Digits and Letters)
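
To make the conversion concrete, here is a minimal sketch (not from the original article) of turning a word into ASCII codes in Python:

```python
# Convert each character of a word to its ASCII code and back again.
word = "Goodnight"
codes = [ord(character) for character in word]
print(codes)  # [71, 111, 111, 100, ...] -> 'G' is 71, matching figure 1.3
print("".join(chr(code) for code in codes))  # back to "Goodnight"
```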

We can represent the book with even fewer numbers. Maybe it is not the book’s content that is relevant but what it is about. What a book is about can be represented with genres such as fantasy, biography, horror, and non-fiction. Let’s set the number of genres generously high and say there are 256 different ones. We can then represent the book with a 1 (true) if it belongs to a given genre and a 0 (false) if it does not. The book can now be represented with only 256 bits, or 32 numbers (8 bits per number), no matter its size! It is a trade-off between embedded information and size.

Figure 1.4 — A book represented with genres. 1 (true) means that the book is a given genre and 0 (false) means that it is not.
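
A minimal sketch of this kind of genre embedding (the genre list and the book’s genres are made up for illustration):

```python
import numpy as np

# A fixed, ordered list of genres; a real system might use 256 of them.
genres = ["fantasy", "biography", "horror", "non-fiction"]

# A book is embedded as 1/0 flags: 1 if it belongs to the genre, 0 if not.
book_genres = {"fantasy"}
embedding = np.array([1 if genre in book_genres else 0 for genre in genres])
print(embedding)  # [1 0 0 0]
```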

Different applications require different embeddings. Genres are easier to use when recommending new books since we can look for ones with the same genres as the book we liked. But we require the content to read the books and an image of the front page to see how it looks.

A good embedding requires us to know what information the application needs and a way to convert the information to numbers. Like the example with books, different parts of an application require different representations.

1.2 Similarity and Vectorisation

The goal of similarity is to find similar embeddings. An example can be to use similarity to recommend books similar to the ones a user has already read.

But what is similarity? Similarity is based on embeddings (also called measurements, samples, or points) that can be plotted in a coordinate system, also called a dimensional space (or space for short). An embedding placed in a coordinate system is called a point. Below is an example of four books placed in a 2D coordinate system.

Figure 1.5 — Four books placed in a 2D coordinate system based on how much Fiction and Life-journey they consist of

Points are more similar if they are closer together in space. It is the same with books; two books that are both non-fiction (points A and B) are more similar than one that is fiction and one that is not (points A and C). Similarity is the distance between the two points. Figure 1.6 shows that L1 < L2, which means that A and B are more similar than A and C.

Figure 1.6 — Distance between A and B (i.e. L1) and between A and C (i.e. L2). A and B are closer than A and C as L1 < L2.

In practice, vectorizing the points allows us to calculate the distance (i.e., similarity) between them. Vectorization simply treats each point as a vector originating from (0, 0); e.g., the point (2, 3) becomes the vector [2, 3].

Figure 1.7 — Points can be converted to vectors by assuming they start in (0, 0)

The distance is the length of the vector traveling between the two points, called the Euclidean distance. Finding the vector between two points only requires subtracting their vectors.

Figure 1.8 — Euclidean similarity is the distance between two points
Equation 1.1 — Formula for calculating the euclidean similarity score

In Python, the Euclidean similarity is calculated by creating two vectors with NumPy, subtracting them, and taking the norm of the result.

Code 1.1 — Calculating the euclidean similarity between two books by using equation 1.1
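
A minimal sketch of equation 1.1, using made-up book vectors:

```python
import numpy as np

# Two books embedded as (Fiction, Life-journey) values; the numbers are made up.
book_a = np.array([1.0, 8.0])
book_b = np.array([2.0, 7.0])

# Equation 1.1: the Euclidean distance is the norm of the difference vector.
euclidean_distance = np.linalg.norm(book_a - book_b)
print(euclidean_distance)  # ~1.41; a smaller distance means more similar books
```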

Another way to determine similarity is cosine similarity, which looks at the angle between vectors rather than the distance between their ends. The idea is that the ratio between concepts/features matters more than how strongly those features are prioritized. Point A focuses more heavily on Fiction and point B on Life-journey; their concepts/features are vastly different even if they are close to each other in Euclidean space.

Figure 1.9 — Cosine similarity is the angle between two vectors
Equation 1.2 — Formula for calculating the cosine similarity score

In Python, the cosine similarity is calculated by creating two vectors with NumPy and using SciPy to find the cosine between the two.

Code 1.2 — Calculating the cosine similarity between two books by using equation 1.2
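
A minimal sketch of equation 1.2 with made-up vectors; note that SciPy’s cosine function returns the cosine distance, so the similarity is 1 minus that value:

```python
import numpy as np
from scipy.spatial.distance import cosine

book_a = np.array([1.0, 8.0])
book_c = np.array([7.0, 2.0])

# Equation 1.2: SciPy returns the cosine distance, so similarity = 1 - distance.
cosine_similarity = 1 - cosine(book_a, book_c)
print(cosine_similarity)  # close to 1 = similar direction, close to 0 = very different
```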

We can use both Euclidean and Cosine Similarity; they enforce different behavior, and it is the designer’s task to figure out which fits the application best.

1.3 Clustering

Clustering is a technique that helps us find groups in data. An example may be recommending new books based on groups/genres we like to read. Intuitively, a group consists of similar objects and becomes more cohesive the closer its objects are to each other. The task is to determine which points are most similar and how to combine them into an appropriate number of groups. Clustering methods use similarity formulas to assess the similarity between points; in the figure below, we create two groups by using the Euclidean similarity between the points and combining the closest ones. Analyzing the two groups shows that our primary reading consists of school books and fantasy (with no indication of whether we like to read school books or only read them because we need to).

Figure 1.10 — Two groups can be created by finding closely related pairs of points.

The number of groups is a hyperparameter, which means that it is not the computer that defines the number of groups but rather a human beforehand. Choosing an appropriate number of clusters (i.e., groups) is vital. But how is an appropriate number of clusters even chosen? Figure 1.11 analyzes the data with one, two, and four clusters to show the effect of varying the number.

Figure 1.11 — The information provided by using one, two, and four clusters on the data. One and four clusters only provide the information we already have, whereas two clusters help us understand what kind of books the user reads.

Choosing one cluster tells us that we like books, which may not be that helpful as we are making an application for analyzing book preferences. Four clusters show us that we like books A, B, C, and D but contribute little to recommending new books in the same group. Two clusters are more appropriate for this scenario as they provide the information that school and fantasy books are the ones we read the most. There is no perfect number of clusters for all applications. What if we had read more books, as in the example below? Would two or four groups be appropriate?

Figure 1.12 — In the previous example, four clusters were too many to extract useful information. In this example, four clusters are required to properly understand the data. There is no perfect number of clusters; it is all application specific.

Two methods often used for clustering are K-means clustering¹ and hierarchical clustering². In K-means clustering, ‘k’ clusters are defined and found within the data, as in the examples above. In hierarchical clustering, we define a threshold and use it to find the number of clusters by determining how distinct they should be. Hierarchical clustering is often done either by combining the points closest together into larger and larger clusters (bottom-up) or by making a single cluster and splitting it up until the parts are distinct enough (top-down).

Figure 1.13 — (Left) Points plotted in a coordinate system. (Right) The points closest together are combined into a cluster, creating a new point that can be used for future combinations. The process continues until the distance between groups is above a predefined threshold.
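
As a rough sketch of the bottom-up (agglomerative) idea described above, SciPy can merge the closest points step by step and cut the result at a distance threshold; the points and the threshold here are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.5], [1.5, 2.0], [8.0, 8.5], [9.0, 8.0]])

# Bottom-up: repeatedly merge the closest points/clusters (single linkage).
merges = linkage(points, method="single")

# Cut the hierarchy at a distance threshold to obtain the final clusters.
labels = fcluster(merges, t=3.0, criterion="distance")
print(labels)  # e.g., [1, 1, 2, 2]: two clusters
```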

There are two primary ways to use clustering. The first is to see what groups exist in the existing data, and the second is to find which group new points belong to. We can use the first way to find groups of books and mark each with a genre. Afterward, we can take a new book and use the second way to classify which genre the book belongs to by looking at which group it is closest to. Determining the group of a new point is the basis for classification tasks where we use existing data to estimate classes/groups.

Figure 1.14 — Which group a new point belongs to is determined by which group it is closest to
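
A minimal sketch of this classification step, assuming the groups are summarized by their center points (all values are made up):

```python
import numpy as np

# Hypothetical group centers (e.g., "school books" and "fantasy") and a new book.
group_centers = np.array([[1.5, 1.5], [8.5, 8.5]])
new_book = np.array([2.0, 2.5])

# The new point belongs to the group whose center is closest (Euclidean distance).
closest_group = np.linalg.norm(group_centers - new_book, axis=1).argmin()
print(closest_group)  # 0 -> the first group in this made-up example
```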

The remainder of this blog post explains K-means clustering in more detail and provides a practical implementation. Hierarchical clustering is explained in more detail in [2].

1.4 K-means clustering

The ‘K’ in K-means clustering defines the number of clusters: K=2 means that two clusters must be found in the data, and K=4 means that four clusters must be found. The ‘means’ refers to each cluster being the mean of its group, placing it in the center of that group.

A cluster represents a part of the data by placing a new point into the same space; that new point can then stand in for the part of the data it represents.

Figure 1.15 — Placing clusters into the space as new data points that represent parts of the data

The goal of K-means is to find the best placement for each cluster. Placing a cluster is not as simple as above, where each cluster is in the center of a colored circle. There are no such circles in reality — only points as shown below. Code 1.3 defines the points from figure 1.16 in Python and is used in examples to illustrate how to implement K-means.

Figure 1.16 — Groups are not represented by colors or circles. They are rather seen as colorless points placed in a coordinate system (for humans) or as points in a table (for computers).
Code 1.3 — Define data points from the table in figure 1.16
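
A sketch of what such a definition could look like (the coordinates are stand-ins for the table in figure 1.16):

```python
import numpy as np

# Each row is one data point (book) with two features, standing in for figure 1.16.
points = np.array([
    [1.0, 1.5],
    [2.0, 1.0],
    [1.5, 2.5],
    [8.0, 8.5],
    [9.0, 8.0],
    [8.5, 9.5],
])
```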

The first step in finding K clusters is randomly placing K points into the space. Using random points may seem counterintuitive: why not place them equally spaced across the space? The main reason is that all data looks different, which often means that equally spaced points do not land in the centers of the actual groups, i.e., the groups we would find in the data if everything went perfectly. K-means clustering does not ensure that we find the best possible clusters, only the best clusters close to where our points were placed. Random points do not ensure a good representation either, but if the process is repeated multiple times, there is a high chance that one of the random placements will yield a good result. Equally spaced points would give the same clusters each time, and searching placements systematically would be too computationally heavy.

Figure 1.17 — Different start positions for clusters yield different results

So, we place random points into the space. There is no reason to place them outside the minimum and maximum values of the existing data points, so we place them somewhere in between.

Figure 1.18 — Randomly generated cluster points are generated within the range of existing points
Code 1.4 — Generate random starting positions for n clusters. The principle from figure 1.18 is used to generate the points within a range defined by the data points.
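
One possible implementation, sketched with my own helper name and a seed chosen only for reproducibility:

```python
import numpy as np

def initialize_clusters(points, k, seed=None):
    """Place k cluster points uniformly at random within the range of the data."""
    rng = np.random.default_rng(seed)
    low, high = points.min(axis=0), points.max(axis=0)
    return rng.uniform(low, high, size=(k, points.shape[1]))

clusters = initialize_clusters(points, k=2, seed=42)
```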

The next step is to distribute the original data points between the clusters. A data point belongs to the cluster that represents it best. The best representation is the cluster most similar to the data point, measured with a similarity heuristic; we will use the Euclidean distance in this example.

Figure 1.19 — Determine which cluster each point is most similar to by calculating the distance to each of them and choosing the closest one. The Euclidean distance is used as the heuristic.
Code 1.5 — Calculate which cluster each of the data points is closest to.
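
A sketch of the assignment step, continuing from the points and clusters defined above:

```python
import numpy as np

def assign_points(points, clusters):
    """For each data point, return the index of the closest cluster (Euclidean distance)."""
    # Shape (n_points, n_clusters): distance from every point to every cluster.
    distances = np.linalg.norm(points[:, None, :] - clusters[None, :, :], axis=2)
    return distances.argmin(axis=1)

assignments = assign_points(points, clusters)
```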

After distributing the data points, the next step is the “mean” part of K-means clustering. The “mean” refers to how a cluster should be the mean of the data points it represents. The clusters are likely not at the mean after the distribution, since their positions were chosen randomly and they merely happened to be closest to those data points. Therefore, it is necessary to move each cluster, changing its position to the mean of its data points.

Figure 1.20 — Each cluster is updated to be in the “mean” of the data points that were closest to it.
Equation 1.3 — Formula for calculating the mean for cluster j with the points that lie closest to it.
Code 1.6 — Calculate a new position for each cluster as the mean of the data points closest to it. Equation 1.3 is used to calculate the mean for a single cluster.
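
A sketch of the update step, using the assignments computed above:

```python
import numpy as np

def update_clusters(points, assignments, clusters):
    """Move each cluster to the mean of the points assigned to it (equation 1.3)."""
    new_clusters = clusters.copy()
    for j in range(len(clusters)):
        members = points[assignments == j]
        if len(members) > 0:  # keep the old position if no points were assigned to cluster j
            new_clusters[j] = members.mean(axis=0)
    return new_clusters

clusters = update_clusters(points, assignments, clusters)
```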

A cluster may be closer to other data points in its new position. Calculating the distribution again is necessary to ensure that each cluster represents the correct data points.

Figure 1.21 — Recalculate which point belongs to which cluster and move the clusters to their new means.
Code 1.7 — The full implementation, with the functions used, can be found in the git repository.
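
Putting the sketches above together into an iterative loop could look like this; the stopping threshold and iteration limit are illustrative choices, not necessarily the author’s:

```python
import numpy as np

def k_means(points, k, max_iterations=100, tolerance=1e-4, seed=None):
    """Minimal K-means: assign points, move clusters to their means, repeat until stable."""
    clusters = initialize_clusters(points, k, seed)
    for _ in range(max_iterations):
        assignments = assign_points(points, clusters)
        new_clusters = update_clusters(points, assignments, clusters)
        movement = np.linalg.norm(new_clusters - clusters)
        clusters = new_clusters
        if movement < tolerance:  # the clusters barely moved, so we stop iterating
            break
    return clusters, assign_points(points, clusters)

clusters, assignments = k_means(points, k=2, seed=42)
```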

K-means clustering is an iterative process. This means we cannot find the perfect place for each cluster in a single calculation; it needs to be calculated multiple times, each time getting closer to the right place. This iterative process continues until the clusters stop moving or until their movement falls below a threshold. The threshold ensures that the process does not continue indefinitely.

It is easy to make new clustering algorithms by changing the heuristic (e.g., cosine similarity rather than euclidean similarity) or changing how a cluster represents its group of data points (e.g., medoids rather than means).

To make things easy, scikit-learn provides a robust implementation of K-means clustering:

Code 1.8 — Scikit-learn has its own implementation of K-means clustering, which is better and faster than the simple method introduced in Chapter 1.
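
A minimal usage sketch (the parameter values are illustrative):

```python
from sklearn.cluster import KMeans

# n_init runs the algorithm with several random initializations and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)
print(kmeans.cluster_centers_)  # the position of each cluster
print(kmeans.labels_)           # which cluster each data point belongs to
```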

1.5 Conclusion

We are now done with Chapter 1 and the introduction to embedding, similarity, and clustering.

The first section was about embedding and explained how to represent the real world in a computer and why asking the right question can be critical for the application. Some applications even require multiple representations of the same real-world occurrence (e.g., book). The task is to determine which representations provide the best format for the application.

The second section was about similarity and showed how to calculate the similarity between two real-world occurrences by using the distance between them. There exist different ways to calculate similarity (e.g., euclidean distance and cosine similarity) that each result in different behavior. Choosing the one that provides the correct behavior for the application is vital. One of the use cases for similarity is recommending books similar to the ones a user likes.

The last section was about clustering and showed how similar points often group together and that these groups represent underlying structures. Books form natural groups based on genres or themes (e.g., sci-fi and biography). Finding the correct number of groups is crucial; too few or too many groups will provide misleading knowledge for the application.

Chapter 2 looks into more complex applications where similarity and clustering determine what animal an image is portraying. Chapter 2 establishes a way to measure how well ML performs and discovers that the methods from Chapter 1 are ineffective on image embeddings. Chapter 3 takes the weaknesses and strengths discovered in Chapter 2 and explores ways to improve ML.

References

[1] Josh Starmer, StatQuest: K-means clustering (2018), Youtube.com

[2] Alexander Ihler, Clustering (2): Hierarchical Agglomerative Clustering (2015), Youtube.com

All images and code, unless otherwise noted, are by the author.

Thank you for reading this book about the Latent Space! We learn best when sharing our thoughts, so please share a comment, whether it is a question, new insight, or maybe a disagreement. Any suggestions and improvements are greatly appreciated!

If you enjoyed this book and are interested in new insights into machine learning and data science, sign up for a Medium Membership for full access to my content. Follow me to receive an e-mail when I publish a new chapter or post.


A Machine Learning specialist with a passion for sharing knowledge and learning new stuff. Contact: mathias0909@gmail.com