
Unsupervised image mapping

A simple and efficient way to explore a large quantity of images

Photo by Jon Tyson on Unsplash

As a data scientist, I often work on anti-fraud investigations, where exploration is an essential first step: it lets you become familiar with the subject of the analysis. Here I detail a simple, fast, efficient and reproducible way to get an overall picture of the images you have. This is my first article, so do not hesitate to ask questions and leave comments. Enjoy 😉

Contents

  1. Prerequisite
  2. Data
  3. Describing images
  4. Projection
  5. Visualisation

Prerequisite

This algorithm uses Python 3.6.8 and the libraries keras (version 2.2.4), pandas (version 0.24.1), scikit-learn (0.22.2.post1), numpy (version 1.18.2) and matplotlib (version 3.0.3).

To install Python on Windows or Mac, you can download the installer here. For Linux users, the following bash command will install Python 3.6: sudo apt-get install python3.6

For libraries, the following commands in your terminal (bash, powershell, etc.) will install them: pip install keras==2.2.4 pandas==0.24.1 scikit-learn==0.22.2.post1 matplotlib==3.0.3 numpy==1.18.2

Finally, the results are visualised with Tableau Software. It is paid software, but you can try it for free here.

We can now import the libraries and specify the path to the images:

## Imports
from datetime import datetime
from keras.applications.resnet50 import preprocess_input, ResNet50
from keras.models import Model
from keras.preprocessing import image
from pandas import DataFrame
from random import sample
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
import os
import time
## Parameters
INPUT_DIRECTORY = 'UIC_dataset/'
assert os.path.exists(INPUT_DIRECTORY), 'Directory does not exist'

Data

For this example, I chose to use the EuroSAT (RGB) dataset available here.

The EuroSAT dataset is based on Sentinel-2 satellite images covering 13 spectral bands and consisting of 10 classes with 27,000 labelled and georeferenced samples - Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification by Helber, Patrick and Bischke, Benjamin and Dengel, Andreas and Borth, Damian in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019

Describing images

The first step is to find a way to richly describe images. A quick and simple method is to use a neural network already trained on a general classification task as an encoder. Our goal will not be to classify but rather to group similar images together using the features extracted by the network. The ResNet50 residual neural network trained on the ImageNet dataset is a very good start.

Want to learn more about residual neural networks? The first part of this article explains it very well.

Why a ResNet50 trained on the ImageNet dataset? The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured, and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. Thus, a neural network that classifies ImageNet images with a good score has learned to differentiate the shapes and characteristics of images. This ability will be valuable to us.

The problem is that our ResNet50, as it stands, only knows how to predict one class from the list of ImageNet classes. To overcome this, we remove the last layer of the network, the one used for classification. The output of the network is then a feature vector of dimension 2048.

## Model
# Retrieve the base model trained on ImageNet
base_model = ResNet50(weights='imagenet')
# Remove the final classification layer by taking the output of the penultimate layer ('avg_pool')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
# Display layers (summary() prints directly and returns None)
model.summary()
## Get image paths
image_file_names = list()
image_file_paths = list()
for root_path, directory_names, file_names in os.walk(INPUT_DIRECTORY):
    for file_name in file_names:
        if file_name.lower().endswith(('.jpg', '.jpeg', '.png')):
            image_file_names.append(file_name)
            image_file_paths.append(os.path.join(root_path, file_name))

print('{} images found'.format(len(image_file_paths)))
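## Sampling
# Draw at most 1000 images; sample() raises ValueError if asked for more items than exist
sample_size = min(1000, len(image_file_paths))
image_file_names, image_file_paths = zip(*sample(list(zip(image_file_names, image_file_paths)), sample_size))
image_file_names = list(image_file_names)
image_file_paths = list(image_file_paths)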
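## Get image features
image_features = list()
loaded_file_names = list()
loaded_file_paths = list()
start_time = time.time()
for image_file_name, image_file_path in zip(image_file_names, image_file_paths):
    try:
        img = image.load_img(image_file_path, target_size=(224, 224))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)
        image_features.append(model.predict(x)[0])
    except OSError:
        # Skip unreadable files; they are left out of the kept lists below
        print("ERROR: Can't load image {}".format(os.path.basename(image_file_path)))
        continue
    # Keep names and paths aligned with the rows of image_features
    loaded_file_names.append(image_file_name)
    loaded_file_paths.append(image_file_path)
image_file_names = loaded_file_names
image_file_paths = loaded_file_paths
image_features = np.array(image_features)
print('Done ({:.0f} min)'.format((time.time() - start_time) / 60))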
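
Before projecting anything, it can be worth checking that the extracted features really place similar images close together. The sketch below is one possible sanity check, not part of the original notebook: the helper nearest_neighbours is a hypothetical name, and the small toy_features array stands in for the real image_features matrix. It ranks images by cosine similarity to a chosen query image.

```python
import numpy as np


def nearest_neighbours(features, query_index, k=5):
    """Return the indices of the k rows most similar to features[query_index].

    Similarity is the cosine of the angle between two feature vectors.
    (Hypothetical helper, for illustration only.)
    """
    features = np.asarray(features, dtype=float)
    # Normalise each row to unit length so a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    similarities = unit @ unit[query_index]
    # Exclude the query itself before ranking
    similarities[query_index] = -np.inf
    return np.argsort(-similarities)[:k]


# Toy stand-in for image_features: 4 vectors in 3 dimensions
toy_features = np.array([[1.0, 0.0, 0.0],
                         [0.9, 0.1, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.1, 0.9, 0.0]])
print(nearest_neighbours(toy_features, query_index=0, k=2))
```

With the real data, you would pass image_features and then open the files at the returned positions of image_file_paths to see whether the neighbours look alike.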

Projection

We humans find it very difficult to grasp more than 3 dimensions. So imagine 2048…

One of my preferred methods for visualizing large vectors is the t-SNE algorithm. t-SNE stands for t-distributed Stochastic Neighbor Embedding. It is a machine learning algorithm for visualization based on Stochastic Neighbor Embedding. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map.
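
The two stages described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not what scikit-learn runs internally: it uses a single fixed sigma for every point (the real algorithm tunes one sigma per point to match the requested perplexity), it only evaluates the objective without optimising it, and all function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(points):
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)


def high_dim_affinities(X, sigma=1.0):
    """Stage 1: Gaussian affinities P over pairs of high-dimensional points.

    Simplification: one fixed sigma instead of a per-point sigma.
    """
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # a point is not its own neighbour
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)    # row i holds p(j | i)
    return (cond + cond.T) / (2.0 * len(X))    # symmetrise into a joint P


def low_dim_affinities(Y):
    """Stage 2: Student-t (one degree of freedom) affinities Q over map points."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity minimised over the positions of the map points."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


# Two tight pairs of points that are far apart from each other
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
              [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]])
good_map = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
bad_map = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]])

P = high_dim_affinities(X)
kl_good = kl_divergence(P, low_dim_affinities(good_map))
kl_bad = kl_divergence(P, low_dim_affinities(bad_map))
print(kl_good < kl_bad)
```

The map that keeps each pair together yields the lower divergence, which is the behaviour the optimisation exploits: it moves the map points until neighbours in the original space are also neighbours on the map.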

As with many algorithms, the parameters must be chosen carefully. This article explains the influence of the different t-SNE parameters very well. In our case, the following parameter values gave a satisfactory result.
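
Of these parameters, perplexity is probably the least intuitive. In the t-SNE formulation it is defined as 2 raised to the Shannon entropy of a point's neighbour distribution, so it can be read as an effective number of neighbours. The short sketch below illustrates that reading on two hand-made distributions; the function name is hypothetical and the distributions are invented for the example.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H(P), where H is the Shannon entropy measured in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


# Equal weight on 20 neighbours: the perplexity equals the neighbour count
uniform = perplexity([1 / 20] * 20)
# Almost all weight on one neighbour: far fewer effective neighbours
peaked = perplexity([0.9] + [0.1 / 19] * 19)
print(uniform, peaked)
```

Seen this way, perplexity=20 asks t-SNE to size each point's neighbourhood so that it behaves roughly as if it had 20 equally weighted neighbours.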

## Compute t-SNE
start_time = time.time()
t_SNE = TSNE(n_components=2,
             perplexity=20,
             n_iter=2000,
             metric='euclidean', 
             random_state=8,
             n_jobs=-1).fit_transform(image_features)
print('Done ({:.3f} s)'.format(time.time() - start_time))

Visualisation

Once the projection has been made, we can visualise the result.

Matplotlib

A simple scatter plot can already be used to identify some groupings.

## Plot results
# Increase plot size
plt.figure(figsize=(12, 12))
# Set title
plt.title('t-SNE plot')
# Plot and show
plt.scatter(t_SNE[:, 0], t_SNE[:, 1], marker='.')
plt.show()
We can already distinguish one group that stands out from the others - Image by Author

The disadvantage of this simple graph is that it does not let you view the images that the dots represent. For this, a more advanced visualisation tool is required, such as Power BI or Tableau Software; the latter is detailed below.
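
If your images happen to be organised with one sub-folder per class, the scatter plot can still be made more informative without leaving matplotlib by colouring each dot by its folder name. The sketch below assumes such a layout; the example paths and both helper names are hypothetical, so adapt them to your own directory structure.

```python
import os


def labels_from_paths(image_file_paths):
    """Use the name of each image's parent folder as its class label."""
    return [os.path.basename(os.path.dirname(path)) for path in image_file_paths]


def label_codes(labels):
    """Map each distinct label to an integer usable as a colour index."""
    lookup = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [lookup[label] for label in labels], lookup


# Hypothetical paths, for illustration only
example_paths = ['UIC_dataset/Forest/Forest_1.jpg',
                 'UIC_dataset/SeaLake/SeaLake_7.jpg',
                 'UIC_dataset/Forest/Forest_2.jpg']
codes, lookup = label_codes(labels_from_paths(example_paths))
print(codes, lookup)
```

With the real data, calling label_codes(labels_from_paths(image_file_paths)) and passing the resulting codes to the scatter call, for instance plt.scatter(t_SNE[:, 0], t_SNE[:, 1], c=codes, cmap='tab10', marker='.'), gives one colour per folder. The labels are only used for display here; the projection itself remains unsupervised.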

Tableau Software

Tableau is a paid data visualisation software that is rather simple to use and allows a large number of different types of visualisation, including interactive visualisations.

We must first save the results of the t-SNE.

## Save results
data_frame = DataFrame(data={'Image_file_names': image_file_names,
                             'Image_file_paths': image_file_paths,
                             'Image_features': [str(vector).replace('\n', '') for vector in image_features], 
                             'X': t_SNE[:, 0], 
                             'Y': t_SNE[:, 1]})
data_frame = data_frame.set_index('Image_file_names')
data_frame.to_csv(path_or_buf='Unsupervised_image_cluterization_run_{}.csv'.format(datetime.now().strftime("%Y_%m_%d-%H_%M")),
                  sep=';',
                  encoding='utf-8')
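
One caveat about the Image_features column: with NumPy's default print options, str() abbreviates arrays of more than 1000 elements with an ellipsis, so a 2048-dimensional vector is likely to be stored in shortened form. If you want the full vectors in the CSV, a possible alternative is to serialise them explicitly. The two helpers below are hypothetical and only a sketch of that idea.

```python
def vector_to_text(vector, precision=6):
    """Serialise every element of a feature vector on a single line."""
    return ' '.join('{:.{}g}'.format(float(value), precision) for value in vector)


def text_to_vector(text):
    """Inverse of vector_to_text: parse the space-separated floats back."""
    return [float(token) for token in text.split()]


example = [0.0, 1.25, 3.5e-05]
encoded = vector_to_text(example)
print(encoded)
```

In the DataFrame above, this would mean building the column with [vector_to_text(vector) for vector in image_features]. Keep in mind that values are rounded to the chosen number of significant digits.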

Then, in Tableau Software, click "New Data Source" > "Text file" and select your CSV file to open.

Tableau loading page - Image by Author

Once the data has been loaded, go to "Sheet 1".

Tableau Data page - Image by Author

Here, right-click on "X" > "Convert to Dimension". Do the same with "Y". Add "X" and "Y" to the "columns" and "rows" fields respectively by dragging and dropping them. Click on the small arrow that appears when you move the mouse over "X" in the "columns" field, and then click "Continuous". Do the same with "Y". You should get a visual similar to the scatter plot. In anticipation of later, drag and drop "image file paths" into the "detail" box in the "Marks" pane.

Tableau Sheet page - Image by Author

To make the visual more aesthetic, you can make the grid, the axes and the name of the sheet disappear. To do this, right-click on the grid > "Format", then in the "Format Font" pane on the right, click on the "Lines" icon and deactivate the different lines by selecting "None" for each one. Then, with a right click on the axes, you can deactivate the "Show header" option. Right-click on the title to activate the "Hide title" option. Finally, in the "Marks" pane, you can modify the shape of the points, their size, colour, etc. Personally, I like to make them appear as small squares with an opacity of 75%. This reminds me of small polaroid images that we would have sorted and put together on a table. You can also make the name of the image appear by dragging and dropping "image file names" into "tooltips" and then, by double-clicking "Tooltip", you can edit the content of the tooltip and modify it to keep only "<ATTR(Image file names)>".

Sheet embellished - Image by Author

We can now switch to the "Dashboard" tab by clicking on the "New Dashboard" icon at the bottom next to Sheet 1. Let’s increase the size in the "Size" pane by changing "Width" and "Height" to 1600px and 900px respectively. You can then drag and drop "Sheet 1" from the "Sheets" tab to the dashboard. Right click on the title to make it disappear. In the "Objects" pane, drag and drop "Web Page" to the right of "Sheet 1" on the dashboard. Then click "OK" leaving the URL blank.

Tableau Dashboard page - Image by Author

This web page object is what will display the images: the goal is to show an image in it whenever you hover over the corresponding square. To do this, we will launch a local web server. In your terminal, move to the directory to be served by typing cd <path where UIC_dataset folder is>, then launch a Python web server by typing python -m http.server --bind 127.0.0.1. Your server is now running. Back in Tableau, open the "Dashboard" menu at the top > "Actions…" > "Add Action >" > "Go to URL…". In "Run action on:" select "Hover". In "URL", enter the URL "http://127.0.0.1:8000/" literally. In "URL Target", select "Web Page Object". Finally, click "OK".

Action window - Image by Author
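
To check from Python which address the local server exposes for a given image, the relative path stored in the CSV can be appended to the server root. The sketch below is only an illustration: image_url is a hypothetical helper, the example paths are invented, and it assumes the server was started in the folder that contains UIC_dataset, on the default port 8000.

```python
from urllib.parse import quote

# Root of the local server started with python -m http.server (default port 8000)
SERVER_ROOT = 'http://127.0.0.1:8000/'


def image_url(relative_path, server_root=SERVER_ROOT):
    """Build the URL under which the local server exposes an image."""
    # Percent-encode spaces and other special characters, but keep the slashes
    return server_root + quote(relative_path.lstrip('/'), safe='/')


plain_url = image_url('UIC_dataset/Forest/Forest_1.jpg')
spaced_url = image_url('UIC_dataset/My images/photo 1.jpg')
print(plain_url)
print(spaced_url)
```

Pasting one of these URLs into a browser while the server is running is a quick way to confirm that the images are reachable before wiring up the Tableau action.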

Note: If you are working on Windows, you may need to change the "image file paths" by replacing "\" with "/". To do this, in "Sheet 1", right-click at the bottom of the "Data" pane, then "Create Calculated Field…". There, enter the name "path" and the calculation formula as below. Then in "Dashboard" > "Actions" > double-click on "Hyperlink1", change the URL with "http://127.0.0.1:8000/".

Calculated field window - Image by Author
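
As an alternative to the calculated field, the separators could also be normalised in Python before the CSV is written. This is a sketch of that option rather than the approach used above; to_url_path is a hypothetical helper and the example paths are invented.

```python
from pathlib import PureWindowsPath


def to_url_path(windows_path):
    """Convert a Windows-style path to the forward-slash form used in URLs."""
    return PureWindowsPath(windows_path).as_posix()


converted = to_url_path('UIC_dataset\\Forest\\Forest_1.jpg')
mixed = to_url_path('UIC_dataset/Forest\\Forest_1.jpg')
print(converted)
print(mixed)
```

Applying it as [to_url_path(path) for path in image_file_paths] when building the DataFrame would store forward-slash paths directly. PureWindowsPath is used so that the conversion behaves the same on any operating system.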

And voilà! Your images will appear as you hover over them.

Results - Image by Author

So we can see that similar images have been grouped together. Some clusters can be seen, such as the "Residential" cluster at the top right or the clearly detached "SeaLake" cluster at the bottom.


Conclusion

We have seen a simple and effective way to get a good overview of an image dataset. This overview will be valuable in your further analysis work. I hope you enjoyed this article and that it made you want to learn more about data analysis. Feel free to ask questions and leave comments.

The code is available as a Jupyter notebook on my gitlab to which you can contribute if you wish.

More articles will come, including one on cleaning up plain text emails. Stay tuned and see you soon 😉

