Image Processing with Python — Extracting Image Data for Clustering

How to derive more features from an image to improve clustering results

Published in Towards Data Science, Feb 1, 2021 (5 min read)

In a previous article, we explored the idea of applying the K-Means algorithm to automatically segment an image. However, we focused only on the RGB color space. RGB is the native format for most images, but in this article we shall go beyond it and see how using different color spaces affects the resulting clusters.

Let’s get started!

As usual, let us begin by importing the required Python libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import colors
from skimage.color import rgb2gray, rgb2hsv, hsv2rgb
from skimage.io import imread, imshow
from sklearn.cluster import KMeans

Great, now let us load the image we will be working with.

island = imread('island_life.png')
plt.figure(num=None, figsize=(8, 6), dpi=80)
imshow(island);

Shot of an island in the eastern Philippines (Image by Author)

As before, we can try segmenting this image with the K-Means algorithm.

def image_to_pandas(image):
    # One row per pixel, one column per color channel
    df = pd.DataFrame([image[:,:,0].flatten(),
                       image[:,:,1].flatten(),
                       image[:,:,2].flatten()]).T
    df.columns = ['Red_Channel','Green_Channel','Blue_Channel']
    return df

df_island = image_to_pandas(island)
kmeans = KMeans(n_clusters=5, random_state=42).fit(df_island)
# Reshape the per-pixel labels back into the image's height and width
result = kmeans.labels_.reshape(island.shape[0], island.shape[1])
plt.figure(num=None, figsize=(8, 6), dpi=80)
imshow(result, cmap='twilight');

Setting the cluster count to 5, the algorithm clustered the image into these distinct clusters. To get a better idea of what each cluster represents, let us apply this mask to our original image.

We can see that the K-Means algorithm divides the image into the above parts. One obvious thing we notice is that the algorithm seems to split the light portions of the plants from the darker portions, and it does something similar to the sky. Taking this into account, let us attempt to collapse these clusters by decreasing the number of clusters.

kmeans = KMeans(n_clusters=3, random_state=42).fit(df_island)
result = kmeans.labels_.reshape(island.shape[0], island.shape[1])
plt.figure(num=None, figsize=(8, 6), dpi=80)
imshow(result, cmap='magma');

Let us now apply these results to the original image.

def masker(image, masks):
    # One panel per cluster; assumes exactly three clusters (labels 0, 1, 2)
    fig, axes = plt.subplots(1, 3, figsize=(12, 10))
    image_copy = image.copy()
    for n, ax in enumerate(axes.flatten()):
        # Zero out every pixel that does not belong to cluster n
        masked_image = np.dstack((image_copy[:, :, 0]*(masks == n),
                                  image_copy[:, :, 1]*(masks == n),
                                  image_copy[:, :, 2]*(masks == n)))
        ax.imshow(masked_image)
        ax.set_title(f'Cluster : {n+1}', fontsize = 20)
        ax.set_axis_off()
    fig.tight_layout()

masker(island, result)
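The masker function above hard-codes three panels, so it only fits a three-cluster result. As a minimal sketch (not the author's code), the masking step can be separated from the plotting so that it follows however many clusters the labels contain. The helper name split_by_cluster and the tiny 2x2 test image are assumptions made for illustration.

```python
import numpy as np

def split_by_cluster(image, labels):
    """Return {cluster_id: copy of image with all other clusters blacked out}.

    `split_by_cluster` is a hypothetical helper, not part of the article's code.
    image  : (H, W, 3) array
    labels : (H, W) integer array, e.g. kmeans.labels_ reshaped to the image
    """
    masked = {}
    for cluster_id in np.unique(labels):
        # Boolean (H, W) mask, expanded to (H, W, 1) so it broadcasts over channels
        mask = (labels == cluster_id)[..., np.newaxis]
        masked[int(cluster_id)] = image * mask
    return masked

# Tiny stand-in for the island image and its K-Means labels
toy_image = np.full((2, 2, 3), 200, dtype=np.uint8)
toy_labels = np.array([[0, 1],
                       [1, 2]])
parts = split_by_cluster(toy_image, toy_labels)
print(sorted(parts))        # cluster ids found in the labels
print(parts[1][:, :, 0])    # red channel with only cluster 1 kept
```

Each entry of the returned dictionary can then be passed to ax.imshow, with the number of subplots set to len(parts).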

We can see that the algorithm now treats the sky as a single cluster. The other two clusters seem to be split based on the brightness of their green. Though the algorithm seems to do its job well, it might pay off to try clustering the image in a different color space.
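A quick numeric check on what each cluster represents is to look at its center, which for RGB features is simply the average color of the pixels assigned to it. The sketch below runs on a small synthetic two-color image (an assumption, standing in for island_life.png) and repaints every pixel with the center color of its cluster. cluster_centers_ and labels_ are standard attributes of scikit-learn's KMeans; everything else here is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the photo: top half "sky", bottom half "plants"
rng = np.random.default_rng(0)
image = np.zeros((20, 30, 3), dtype=np.uint8)
image[:10] = (90, 160, 230)    # bluish
image[10:] = (40, 120, 50)     # greenish
noise = rng.integers(-10, 11, size=image.shape)
image = np.clip(image.astype(int) + noise, 0, 255).astype(np.uint8)

# One row per pixel, one column per channel
pixels = image.reshape(-1, 3).astype(float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pixels)

centers = kmeans.cluster_centers_                  # shape (2, 3): mean RGB per cluster
labels = kmeans.labels_.reshape(image.shape[:2])   # shape (20, 30)

# Replace each pixel with the mean color of its cluster
quantized = centers[labels].round().astype(np.uint8)

for cluster_id, center in enumerate(centers):
    print(f"Cluster {cluster_id + 1}: mean RGB = {center.round(1)}")
```

Displaying quantized with imshow gives a flat-colored version of the image in which each region is painted with its cluster's average color.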

HSV Color Space

Another way to represent our image is in the HSV color space. In a nutshell, HSV stands for Hue, Saturation, and Value. It is often considered more intuitive for people than RGB because it separates the color itself (hue) from how vivid it is (saturation) and how bright it is (value). In the future we shall go over the different color spaces and their applications. For now, let us stick to applying it to the K-Means clustering algorithm.
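To make the three HSV components concrete, the short sketch below converts a few hand-picked greens using Python's standard-library colorsys module. It is used here only as a dependency-free stand-in for rgb2hsv; both return hue, saturation, and value as floats between 0 and 1. The sample colors are illustrative and are not taken from the island image.

```python
import colorsys

# RGB triples as floats in the 0-1 range (hypothetical sample colors)
samples = {
    "bright green": (0.0, 1.0, 0.0),
    "dark green": (0.0, 0.4, 0.0),
    "pale green": (0.6, 1.0, 0.6),
}

hsv = {name: colorsys.rgb_to_hsv(*rgb) for name, rgb in samples.items()}

for name, (h, s, v) in hsv.items():
    print(f"{name:>12}: hue={h:.2f} saturation={s:.2f} value={v:.2f}")
```

All three greens share one hue and differ only in saturation or value, which is a separation that the raw RGB channels do not give directly.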

def image_to_pandas_hsv(image):
    # Convert to HSV first, then store one row per pixel
    hsv_image = rgb2hsv(image)
    df = pd.DataFrame([hsv_image[:,:,0].flatten(),
                       hsv_image[:,:,1].flatten(),
                       hsv_image[:,:,2].flatten()]).T
    df.columns = ['Hue','Saturation','Value']
    return df

df_hsv = image_to_pandas_hsv(island)
df_hsv.head(5)

Excellent, let us now apply the K Means algorithm to it and check out the results.

kmeans = KMeans(n_clusters=3, random_state=42).fit(df_hsv)
result_hsv = kmeans.labels_.reshape(island.shape[0], island.shape[1])
plt.figure(num=None, figsize=(8, 6), dpi=80)
imshow(result_hsv, cmap='magma');

Unfortunately, the HSV color space seems to generate a clustering that is quite similar to that of the RGB color space. Applying the mask to the original image confirms this.

So what can we do to generate a different kind of cluster? Well, we can concatenate the two dataframes. Giving the K-Means algorithm both sets of features for each pixel may help it better differentiate the clusters.

Concatenating the Data Frames

To concatenate the dataframes, we need only call the concat function in Pandas.

stacked_features = pd.concat([df_island, df_hsv], axis=1)
stacked_features.head(5)

Let us now apply the K Means clustering algorithm and see if there is any significant difference.

plt.figure(num=None, figsize=(8, 6), dpi=80)
kmeans = KMeans(n_clusters=3, random_state=42).fit(stacked_features)
result_stacked = kmeans.labels_.reshape(island.shape[0], island.shape[1])
imshow(result_stacked, cmap='magma')
plt.show()

Unfortunately, the result seems to be the same; I was hoping that it would at least be a little different. Perhaps in the future we shall discuss different clustering algorithms and finally get some differentiation.
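One plausible reason the stacked features change so little is scale. If the image was loaded as 8-bit integers, the RGB columns span 0 to 255, while rgb2hsv returns values between 0 and 1, so the RGB columns would dominate the Euclidean distances that K-Means relies on. The sketch below illustrates this on random stand-in data (not the island image) and shows plain standardization as one common remedy. Whether it changes the island clustering is not tested here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pixels = 1000

# Stand-in features: RGB as 8-bit-style values, HSV as floats in [0, 1)
rgb = rng.integers(0, 256, size=(n_pixels, 3)).astype(float)
hsv = rng.random((n_pixels, 3))
stacked = np.hstack([rgb, hsv])   # same column order as stacked_features

def rgb_share_of_variance(features):
    """Fraction of the total variance contributed by the first three (RGB) columns."""
    variances = features.var(axis=0)
    return variances[:3].sum() / variances.sum()

# Standardize each column to zero mean and unit variance
standardized = (stacked - stacked.mean(axis=0)) / stacked.std(axis=0)

print(f"RGB share before scaling: {rgb_share_of_variance(stacked):.4f}")
print(f"RGB share after scaling:  {rgb_share_of_variance(standardized):.4f}")
```

With the real data, the same step could be applied to stacked_features before fitting, for example with scikit-learn's StandardScaler.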

In Conclusion

Though the findings were lackluster, they do illustrate the limits of the K-Means algorithm. Though we could theoretically append more data to the data frame, we may be better off making use of other clustering algorithms such as Spectral Clustering and Agglomerative Clustering. For now, I hope you have a better idea of how to play with the parameters of your K-Means algorithm, and hopefully you can be more creative with how you represent your image.

Written by Tonichi Edeza