An Extensive Introduction to Latent Spaces

Introduction to Image Embedding and Accuracy

With Clustering, Linear Discriminant Analysis, and Performance Evaluation

Mathias Grønne
Towards Data Science
12 min read · Sep 29, 2022


Photo by Matt Howard on Unsplash

The previous chapter was a general introduction to Embedding, Similarity, and Clustering. This chapter builds upon these fundamentals by expanding the concept of embedding to include images as well. We will explore how K-Means clustering, introduced in the previous chapter, performs on image embeddings and introduce ways to measure performance through Precision and Recall. A simple Latent Space Embedding (LSE) in the form of Linear Discriminant Analysis is introduced to help us understand how clustering and performance work with images. LSE itself is not explained in depth here; that is the topic of the next chapter.

Table of Contents:

  • 2.1 Image Embedding
  • 2.2 Cluster Performance
  • 2.3 Discussion
  • 2.4 Conclusion

2.1 Image Embedding

Chapter 1 explained how different embeddings of the same object are suitable for different applications. An embedding of a book in the form of images is not suitable for book recommendations; that application requires an embedding in the form of genres. However, we cannot read a book using a genre embedding, as that application requires the actual text. In general, when using embeddings, the goal is to find an embedding method that provides similarities suitable for the application at hand.

Images are a tricky form of embedding when looking for similarities. One example is a mobile app that helps us identify the animal we are looking at. Here, we need to find similarities between an image and an animal species.

Figure 2.1 — Images of animals as examples of categories from the dataset: Animal-10. Cat photo by Amber Kipp on Unsplash, Dog photo by Valeria Boltneva from Pexel, Chicken photo by Erik Karits from Pexel, and Cow photo by Gabriel Porras on Unsplash.

An image is an embedding of what we see in the real world. An image comprises pixels, each pixel being a single color. The images above are created by combining tens of thousands of these pixels (100,000 per image, to be exact). Each pixel provides a small amount of unique information, and we only get the complete picture with all of the pixels together. Since each pixel represents a piece of knowledge, it should have its own dimension. Chapter 1 showed how a single book could be represented by two genres, giving it a total of two dimensions. A grayscale image with 250x400 pixels, however, has a total of 100,000 dimensions!

Figure 2.2 — Process of turning an image into numbers. Taken from Chapter 1, figure 1.1. Book photo by Jess Bailey on Unsplash

The question then becomes: can pixel embeddings be used to categorize which animal is in an image? Figure 2.1 shows four animal groups (i.e., Cat, Dog, Chicken, and Cow). Sixteen images from each group are extracted from the Animal-10 dataset, hosted on Kaggle [1] and free to download, and placed in a coordinate system in figure 2.3 (each animal group has a unique color).

Figure 2.3 — The four animals from figure 2.1 are represented with 16 images each in a 2D space. Linear Discriminant Analysis (LDA) is used to transform the images to 2D as it focuses on separating classes.

The attentive reader may notice that the figure only shows two dimensions, not all 100,000. We start with two dimensions for simplicity and later expand the techniques to all 100,000 dimensions. The key takeaway is that the colored points do not form clear single-color groups; they are all mingled together.

Principal Component Analysis [2] (PCA) and Linear Discriminant Analysis [3] (LDA) can both transform an image into 2D. PCA transforms data without class labels, while LDA takes class labels into account. LDA is used to transform the images to 2D because the animal in each image is known beforehand.

The remainder of the Image Embedding section explains how to generate the graph in figure 2.3. You can skip ahead to the next section, 2.2 Cluster Performance, if you do not need the detailed explanation.

The detailed explanation of how to generate the graph from figure 2.3:
The points in figure 2.3 represent animals from the Animal-10 dataset hosted on Kaggle. The dataset can be downloaded for free from the website after logging in. The download button is highlighted in figure 2.4.

Figure 2.4 — Animal-10 dataset on Kaggle with the download button highlighted.

Kaggle downloads an “archive” file consisting of a folder called raw-img and a file called translate.py. Place the raw-img folder in a known location; see figure 2.5. We define the dataset location in Code 2.1 so it can be accessed later.

Figure 2.5 — Move the “raw-img” folder into a known location. In this example, it is moved into the “Datasets/Animals/” folder. The location is referred to in Code 2.1.
Code 2.1 — Define the location of the animal-10 folder
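The embedded gist is not reproduced here. A minimal sketch of what Code 2.1 could look like, assuming the raw-img folder was placed under Datasets/Animals/ as in figure 2.5 (the variable name DATASET_PATH is my own):

    import os

    # Assumed location of the Animal-10 images; adjust to wherever you placed raw-img
    DATASET_PATH = os.path.join("Datasets", "Animals", "raw-img")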

Now we can extract the sixteen images for each animal group (i.e., Cat, Dog, Chicken, and Cow) by defining their names and how many images to extract. Code 2.2 uses this information to go through each animal folder, loading 16 images and resizing them to the same dimensions. Images of the same size are easier to compare, and a uniform size is required by functions such as LDA.

Code 2.2 — Load 16 images of Cats, Dogs, Chickens, and Cows.
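The gist itself is not shown. A sketch of the loading step, assuming the raw-img subfolders keep the dataset's original Italian names (gatto, cane, gallina, mucca) and that every image is converted to 250x400 grayscale, matching the 100,000 dimensions mentioned earlier; the helper name load_images is my own:

    import os
    import numpy as np
    from PIL import Image

    # Assumed mapping from English labels to the Italian folder names in raw-img
    ANIMALS = {"cat": "gatto", "dog": "cane", "chicken": "gallina", "cow": "mucca"}
    IMAGE_SIZE = (400, 250)  # (width, height): 250x400 grayscale = 100,000 pixels

    def load_images(dataset_path, animals, n_images):
        """Load n_images per animal, convert to grayscale, resize, and flatten to 1D vectors."""
        X, y = [], []
        for label, folder in animals.items():
            folder_path = os.path.join(dataset_path, folder)
            for file_name in sorted(os.listdir(folder_path))[:n_images]:
                img = Image.open(os.path.join(folder_path, file_name)).convert("L")
                img = img.resize(IMAGE_SIZE)  # same size for every image, as LDA requires
                X.append(np.asarray(img, dtype=np.float32).flatten())
                y.append(label)
        return np.array(X), np.array(y)

    X_train, y_train = load_images(DATASET_PATH, ANIMALS, n_images=16)
    print(X_train.shape)  # (64, 100000): 16 images x 4 animals, 100,000 pixels each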

We can now show a single image per animal to check that the data is loaded correctly. The code for showing one image per animal can be found in Code 2.3.

Code 2.3 — Show one image per Animal
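A sketch of this sanity check, continuing from the loading sketch above:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(ANIMALS), figsize=(12, 3))
    for ax, label in zip(axes, ANIMALS):
        # Take the first image of each animal and reshape the flat vector back to 2D
        img = X_train[y_train == label][0].reshape(IMAGE_SIZE[1], IMAGE_SIZE[0])
        ax.imshow(img, cmap="gray")
        ax.set_title(label)
        ax.axis("off")
    plt.show()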

Afterward, the images can be transformed to 2D using Linear Discriminant Analysis (LDA), as shown in Code 2.4. The LDA must be “fitted” before it can transform images to 2D. Fitting is the process of teaching the LDA how to transform data: first we define how many dimensions the images should have after the transformation, and then we give the LDA example images together with their classes (i.e., we hand it images and tell it which animal is in each). It uses this information to figure out how to transform these and future images. The last step is to use the fitted LDA to transform our 16x4 images to 2D.

Code 2.4 — Transform animal images to 2D
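A sketch of the fit-and-transform step with scikit-learn's LDA (the actual gist may differ):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Fit the LDA on the 64 labelled images and project them down to 2 dimensions
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_train_2d = lda.fit_transform(X_train, y_train)
    print(X_train_2d.shape)  # (64, 2)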

The 2D points can be plotted in a graph like in figure 2.3 using Code 2.5.

Code 2.5 — Plot 2D points from multiple classes, each with unique colors. The graph should match figure 2.3.
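A sketch of the plot, using the color scheme from figure 2.6 (blue: cat, orange: dog, green: chicken, red: cow):

    import matplotlib.pyplot as plt

    COLORS = {"cat": "tab:blue", "dog": "tab:orange", "chicken": "tab:green", "cow": "tab:red"}

    plt.figure(figsize=(6, 6))
    for label, color in COLORS.items():
        points = X_train_2d[y_train == label]
        plt.scatter(points[:, 0], points[:, 1], color=color, label=label)
    plt.xlabel("LDA dimension 1")
    plt.ylabel("LDA dimension 2")
    plt.legend()
    plt.show()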

2.2 Cluster Performance

Chapter 1 showed how to apply K-Means to unlabeled data (data without a category, i.e., without a predefined animal). K-Means finds K groups in the data and places a point at the center of each group. The center represents its group best because it has the shortest total distance to all of the group's points. Our case differs slightly from K-Means because we already know the groups (images of the same animal species belong to the same group). However, we can reuse the K-Means idea by placing a point at the center of each group/animal to represent it. Figure 2.6 adds these center points using Equation 2.1, and Code 2.6 shows how to compute the centers and plot the new graph. Note that the output of the code will not highlight the centers with a black border.

Figure 2.6 — The center of each group/animal is marked with a new point highlighted with a black border. Orange: Dog, Red: Cow, Green: Chicken, Blue: Cat
Equation 2.1 — Calculating the center of a group
Code 2.6 — Find the centers of each animal group and plot them together with the remaining groups
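The equation and gist images are not reproduced here. Equation 2.1 describes the center of a group as the mean of its points; in LaTeX, with x_1, ..., x_N the points of one group,

    c = \frac{1}{N} \sum_{i=1}^{N} x_i

A sketch of Code 2.6, continuing from the earlier sketches (per the note above, the black border in figure 2.6 is not produced by the code):

    # Center of each group = the mean of its points (Equation 2.1)
    centers = {label: X_train_2d[y_train == label].mean(axis=0) for label in COLORS}

    plt.figure(figsize=(6, 6))
    for label, color in COLORS.items():
        points = X_train_2d[y_train == label]
        plt.scatter(points[:, 0], points[:, 1], color=color, label=label)
        plt.scatter(centers[label][0], centers[label][1], color=color, s=200, marker="X")
    plt.legend()
    plt.show()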

The tricky part is what happens when we get a new image and want to figure out which animal is in it. Figure 2.7 shows the approach: we calculate the distance from the new, transformed image to each cluster center, since each cluster represents an animal group. A smaller distance means a higher similarity, and the closest cluster is chosen as the best fit using the Euclidean similarity. Equation 2.2 shows how to calculate this similarity between two points.

Figure 2.7 — Measure the distance to each center and choose the closest one as its group
Equation 2.2 — Formula for calculating the euclidean similarity score between two points
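The equation image is not reproduced here; the Euclidean distance the caption refers to is, in LaTeX,

    d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}

where a smaller d(p, q) means the two points are more similar.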

So, how do we calculate how well our clusters can recognize animals? A simple way is to first forget which group each point belongs to and then use Equation 2.2 to calculate which cluster each point is closest/most similar to. Figure 2.8 and Code 2.7 use this approach to change each point’s color to match its nearest cluster (keep reading, it will make sense soon!).

Figure 2.8 — The points are redistributed so they now belong to the group closest to them
Code 2.7 — Distribute points by calculating which class they are closest to
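A sketch of the redistribution step, continuing from the earlier sketches (the helper name closest_center is my own):

    import numpy as np

    def closest_center(point, centers):
        """Return the label of the center with the smallest Euclidean distance (Equation 2.2)."""
        distances = {label: np.linalg.norm(point - center) for label, center in centers.items()}
        return min(distances, key=distances.get)

    # Forget the true labels and re-color every point by the cluster it is closest to
    y_train_pred = np.array([closest_center(p, centers) for p in X_train_2d])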

Four things can happen when we change the color of a point to match the closest cluster. To explain them, let us use a simpler example: a Corona test rather than a cluster. A Corona test can be either positive or negative. When the test is positive, two things can happen: it can truly be positive, called a true-positive, or it can be false, called a false-positive. The same goes for a negative test: it can either be a true-negative or a false-negative. We trust a test more when it produces more true-positives and true-negatives.

Our case is more complex because we have four outcomes rather than two (i.e., four clusters/animal types instead of positive and negative). We look at one cluster at a time to determine how well it performs, and the same four things (true-positive, false-positive, true-negative, and false-negative) can happen. Let us focus on the blue cluster: points that were blue both before and after are called True-Positives (TP). Points that were another color before but are blue now are called False-Positives (FP). Points that were blue before but are now another color are called False-Negatives (FN). Lastly, points that were another color both before and after are called True-Negatives (TN). Figure 2.9 illustrates each scenario for the blue cluster.

We must continue this process for each cluster to determine how well each performs.

Figure 2.9 — How well a cluster performs can be measured by looking at how many true & false positives and true & false negatives it has after redistributing the points

The method we used in figure 2.8 is problematic because we evaluate performance only on existing points, the very points that were used to create/fit the LDA, and the LDA is naturally better at getting these points correct. Fortunately, the Animal-10 dataset has more images we can test on! Code 2.8 shows how to load 128 images from each animal class and transform them to 2D.

Code 2.8 — Load 128 images for each animal class and transform them to 2D with the functions defined in Code 2.2 and 2.4.
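A sketch of the larger test set, reusing the load_images helper and the fitted LDA from the earlier sketches:

    # Load 128 images per animal and project them with the already-fitted 2D LDA
    X_test, y_test = load_images(DATASET_PATH, ANIMALS, n_images=128)
    X_test_2d = lda.transform(X_test)
    print(X_test_2d.shape)  # (512, 2)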

How well a cluster performs can be measured by Precision and Recall [4]. Precision relates true-positives to false-positives: how good the cluster is at claiming only its own points. Recall relates true-positives to false-negatives: how good the cluster is at including all of its points. The equation for Precision is shown in equation 2.3 and the equation for Recall in equation 2.4. Code 2.9 calculates the positives and negatives as well as precision and recall for each cluster using all 512 images (128 images per class).

Equation 2.3 — Formula for calculating the precision of a cluster. It is the ratio between true-positives and false-positives
Equation 2.4 — Formula for calculating the recall of a cluster. It is the ratio between true-positives and false-negatives
Code 2.9 — Calculate the performance of each cluster with all 512 (128*4 classes) images in 2D
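The equation images are not reproduced here; the standard definitions the captions describe are, in LaTeX,

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}

A sketch of the per-cluster calculation on the 512 test images, continuing from the earlier sketches:

    y_test_pred = np.array([closest_center(p, centers) for p in X_test_2d])

    for label in ANIMALS:
        tp = np.sum((y_test == label) & (y_test_pred == label))
        fp = np.sum((y_test != label) & (y_test_pred == label))
        fn = np.sum((y_test == label) & (y_test_pred != label))
        print(f"{label}: precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")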

The results for Precision and Recall on the 2D images can be found in Table 2.1. The average precision is 34% and the average recall is 33%, which is not great at all! But wait, we only used two dimensions. Let us try the same calculation with all 100,000 dimensions to see if it performs better.

Table 2.1 — Precision and Recall when the points are transformed to 2D with an LDA

Table 2.2 shows the results for all 100,000 dimensions, and Code 2.10 shows how to get them. The average precision is 31% and the average recall is 30%, which is even worse than before! The LDA performs better because it focuses on separating the groups and making them more distinct, even if only a little.

Table 2.2 — Precision and Recall when the points are used without any transformation
Code 2.10 — Code for calculating the cluster accuracy when using all dimensions without transforming the data
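A sketch of the same procedure in the raw 100,000-dimensional pixel space (no LDA), continuing from the earlier sketches:

    # Centers and predictions computed directly on the flattened pixel vectors
    centers_raw = {label: X_train[y_train == label].mean(axis=0) for label in ANIMALS}
    y_pred_raw = np.array([closest_center(p, centers_raw) for p in X_test])

    for label in ANIMALS:
        tp = np.sum((y_test == label) & (y_pred_raw == label))
        fp = np.sum((y_test != label) & (y_pred_raw == label))
        fn = np.sum((y_test == label) & (y_pred_raw != label))
        print(f"{label}: precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")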

But wait… the LDA only used two dimensions. What if we both transform the data and let it keep more dimensions? The maximum number of dimensions for an LDA is the number of classes minus one, which in our case is 4 - 1 = 3. Table 2.3 shows the results when we use all three dimensions, and Code 2.11 shows how to get them. The average precision is 38% and the average recall is 38%, still not great but better than before!

Table 2.3 — Precision and Recall when the points are transformed to 3D with an LDA
Code 2.11 — Calculate the cluster performance when the images are transformed to 3D with an LDA
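A sketch of the 3D variant; the precision/recall loop from the previous sketch can be rerun on y_pred_3d:

    # Four classes allow at most 4 - 1 = 3 LDA dimensions
    lda_3d = LinearDiscriminantAnalysis(n_components=3)
    X_train_3d = lda_3d.fit_transform(X_train, y_train)
    X_test_3d = lda_3d.transform(X_test)

    centers_3d = {label: X_train_3d[y_train == label].mean(axis=0) for label in ANIMALS}
    y_pred_3d = np.array([closest_center(p, centers_3d) for p in X_test_3d])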

2.3 Discussion

The 3D LDA only scores 38% because it was trained on just 16 images per group and because, like the non-transformed images, it only looks at the pixel colors.

More images give the LDA more context during training, as it sees more examples of where images from each group land on the graph. It is easier to tell where future points from a group will fall with 16 images than with a single image, and likewise, 128 images give us more information than 16. Figure 2.10 plots 128 images per group; the left graph uses an LDA trained on 64 (16*4) images, and the right graph uses an LDA trained on 512 (128*4) images. The figure illustrates how an LDA separates the points better when trained on more images. Code 2.12 shows how to plot these two graphs.

An LDA only looks at the pixel colors when transforming an image. The method is limited because it prioritizes the pixels' colors over the shape of what is in the image. This means that a brown cat and a brown dog will appear more similar than a brown dog and a white dog!

Figure 2.10 — 512 animal images plotted with an LDA trained on 16 images per group (left) and 128 images per group (right)
Code 2.12 — Create a new LDA that transforms the images to 2D and is trained with 128 images per group.
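A sketch of the comparison in figure 2.10, continuing from the earlier sketches: one 2D LDA fitted on 16 images per group and one fitted on 128 images per group, both used to plot the 512 test images:

    lda_small = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)  # 16 per group
    lda_large = LinearDiscriminantAnalysis(n_components=2).fit(X_test, y_test)    # 128 per group

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    for ax, model, title in [(axes[0], lda_small, "LDA trained on 16 images per group"),
                             (axes[1], lda_large, "LDA trained on 128 images per group")]:
        points = model.transform(X_test)
        for label, color in COLORS.items():
            ax.scatter(points[y_test == label, 0], points[y_test == label, 1],
                       color=color, label=label, s=10)
        ax.set_title(title)
    axes[0].legend()
    plt.show()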

In Chapter 3, Introduction to Image Latent Spaces Embedding, we will use what we have learned about the strengths and weaknesses of Image Embedding and LDAs to improve the accuracy of our animal identification application.

2.4 Conclusion

We are now done with Chapter 2 and the introduction to Image Embedding and Accuracy.

The first section, Image Embedding, explained that an image consists of pixels and that each pixel represents a color. Each pixel is its own dimension, just like “life-journey” and “Fiction” were dimensions for books in chapter 1. We used 64 images from four animal groups (i.e., Cat, Dog, Chicken, and Cow) to show how to work with image embeddings in Python. Each image was transformed to 2D using Linear Discriminant Analysis (LDA), making it possible to plot the images in a graph and show their similarities.

The second section, Cluster Performance, looked at how we can represent each animal group by placing a point at its center. These center points can help identify the animal in an image by calculating which center the image is closest to. Precision and Recall are two ways to determine how well a method identifies the correct animal: Precision checks how many of the images assigned to an animal actually show that animal, and Recall checks how good each center was at including all images of its species. Performance was tested both on images without a transformation and on images transformed to 2D and 3D with LDAs.

Precision and Recall showed that images transformed with LDA are better at identifying the correct species of animal than images that were not transformed (38% vs. 31%). The reason is that the LDA transforms the images to make each group more distinct. However, both plain images and LDAs only look at colors and not the shape of what is in the image. This means that both methods believe a brown cat and a brown dog are more similar than a brown dog and a white dog.

Chapter 3, Introduction to Image Latent Spaces Embedding, uses what we have learned about the strengths and weaknesses of image embeddings and LDAs to improve the accuracy of our animal identification application. It achieves better accuracy by keeping the strength of the transformation while solving the problem of only looking at colors.

References

[1] Corrado Alessio, Animals-10, Kaggle.com

[2] Casey Cheng, Principal Component Analysis (PCA) Explained Visually with Zero Math (2022), Towardsdatascience.com

[3] YANG Xiaozhou, Linear Discriminant Analysis, Explained (2020), Towardsdatascience.com

[4] Google Developers, Classification: Precision and Recall (2022), developers.google.com

All images and code, unless otherwise noted, are by the author.

Thank you for reading this book about the Latent Space! We learn best when sharing our thoughts, so please share a comment, whether it is a question, new insight, or maybe a disagreement. Any suggestions and improvements are greatly appreciated!

If you enjoyed this book and are interested in new insights into machine learning and data science, sign up for a Medium Membership for full access to my content. Follow me to receive an e-mail when I publish a new chapter or post.


A Machine Learning specialist with a passion for sharing knowledge and learning new stuff. Contact: mathias0909@gmail.com