Semantic Segmentation of Remote Sensing Imagery using k-Means

From scratch in Python🐍

Aleksei Rozanov
Towards Data Science


Image by author.

One of the simplest and, in my opinion, most ingenious ML models is k-Means clustering. It belongs to the group of unsupervised learning algorithms and is capable of finding patterns inside an unlabeled dataset. Its most pleasant feature is that it requires no complicated math, and basically any high school student can successfully implement and use this method. So in this article I want to share how you can build the k-Means algorithm from scratch in Python using only the numpy and pandas libraries and apply it to a real-world problem: semantic segmentation of satellite imagery.

Firstly, let’s talk about the data we have.

In one of my previous articles, I talked about the problem of the Aral Sea shrinkage. Back then we managed to get remote sensing imagery from MODIS using Google Earth Engine, which strongly indicates that the sea is drying out. So I wondered: how can we estimate the change in the water surface between 2000 and 2023 using ML semantic segmentation? The answer is k-Means!

Before diving into coding, let’s have a look at the data we are going to use in this tutorial. These are two RGB images of the same area taken 23 years apart; however, it’s clear that the land surface properties and atmospheric conditions (clouds, aerosols etc.) are different. That’s why I decided to train two separate k-Means models, one for each image.

Image by author.

To follow along with the tutorial, you can download and run the notebook here.

First of all, let’s import the necessary libraries and upload the data to the notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread('MOD_01.jpg')
img2 = mpimg.imread('MOD_24.jpg')

You can see that the area covered by the images is quite large, so I suggest zooming in a little:

img = img[140:600,110:500,:]
img2 = img2[140:600,110:500,:]

fig, ax = plt.subplots(ncols=2, figsize=(16,9))
ax[0].imshow(img)
ax[1].imshow(img2)
for i in range(2):
    ax[i].set_facecolor('black')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
ax[0].set_title('2000-08-01', fontsize=26)
ax[1].set_title('2023-08-01', fontsize=26)
plt.show()
Image by author.

As the last step before the ML phase, let’s convert our images to pandas DataFrames (one column for each image channel). I do that for the sake of clarity in my explanations. If you want an optimized version, it’s better to use numpy arrays instead.

df = pd.DataFrame({'R': img[:,:, 0].flatten(), 'G': img[:,:, 1].flatten(), 'B':img[:,:, 2].flatten()})
df2 = pd.DataFrame({'R': img2[:,:, 0].flatten(), 'G': img2[:,:, 1].flatten(), 'B':img2[:,:, 2].flatten()})
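As a quick sanity check, each DataFrame should now contain one row per pixel of the cropped image (the expected shapes below follow from the 460 × 390 crop we made earlier):

print(img.shape)  # (460, 390, 3) after cropping
print(df.shape)   # (179400, 3): 460 * 390 pixels, one column per RGB channel
print(df.head())  # the first few pixels with their R, G, B values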

k-Means

So what is the idea behind the algorithm?

Imagine that you judge the taste of food using two criteria: sweetness and price. Keeping this in mind, I’ll give you a set of possible options to eat:

Image by author.

I bet your brain has already split the options into three clusters: fruits, drinks and bakery. Basically, you unconsciously clustered the 2-dimensional data, where each point is defined by a pair of values (sweetness; price).

Image by author.

In the case of k-Means, the goal of the algorithm is quite similar: to find a pre-set number of clusters, k, in n-dimensional space (e.g. besides sweetness and price you might want to account for nutrition, healthiness and the presence of the food in your fridge, in which case n = 5).

The algorithm includes the following stages:

I. Define the number of clusters.

As I mentioned beforehand, k in k-Means is the number of clusters you want to get in the end, and you’re supposed to set this value before training the model.

II. Randomly initialize centroids.

Centroids are an integral part of k-Means. Basically, a centroid is a point defined by a set of coordinates, and each centroid represents a cluster. For instance, in our previous example there are 3 centroids.

III. Calculate distances and assign clusters.

Now we need to find how far each point is from each centroid. Based on these calculations, we assign each point to its nearest centroid, i.e. its cluster.

IV. Calculate new centroids.

Now each of our clusters contains at least one point, so it’s time to re-calculate the centroids, simply by taking the mean coordinates across all points in each cluster.

And that’s it! We repeat steps II–IV until the centroids stop changing.

Image by author.
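Before we move on to the image data, here is a minimal numpy sketch of steps II–IV on a toy 2D dataset (variable names are purely illustrative; empty-cluster handling is omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100, 2))   # toy 2D data, e.g. (sweetness, price)
k = 3

# II. Randomly initialize centroids by sampling k data points
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):
    # III. Distance from every point to every centroid, then cluster assignment
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # IV. New centroid = mean coordinates of the points in each cluster
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Stop when the centroids are (almost) not changing anymore
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids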

Time To Code.

Now let’s wrap this really simple idea behind k-Means into python code.

Reminder: in this task we have a 3D problem, i.e. our X, Y and Z are the Red, Green and Blue image channels!

def kmeans(data, K, kind):
    L = list()
    new_centroids = data.sample(K).values

    data = distance(data.copy(), new_centroids, kind)
    old_centroids = new_centroids.copy()
    new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].mean().values for Class in data.loc[:, 'C1':f'C{K}'].columns])
    i = 1
    print(f'Iteration: {i}\tDistance: {abs(new_centroids.mean()-old_centroids.mean())}')
    while abs(new_centroids.mean()-old_centroids.mean()) > 0.001:
        L.append(abs(new_centroids.mean()-old_centroids.mean()))
        data = distance(data, new_centroids, kind)
        old_centroids = new_centroids.copy()
        new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].mean().values for Class in data.loc[:, 'C1':f'C{K}'].columns])
        i += 1
        print(f'Iteration: {i}\tDistance: {abs(new_centroids.mean()-old_centroids.mean())}')
    print(f"k-Means has ended after {i} iterations")
    return data, L

At the first stage we create a list L to collect the centroid shifts at each iteration (so we can visualize the convergence afterwards) and randomly sample K points from the dataset to make them our initial centroids (alternatively, you could assign random values to the centroids).

L = list()
new_centroids = data.sample(K).values

Now we need to calculate the distances between the centroids and the data points. There are lots of different distance metrics in Data Science, but let’s focus on the following ones: Euclidean, Manhattan and Chebyshev.

For Euclidean distance:

Image by author.

For Manhattan:

Image by author.

For Chebyshev:

Image by author.
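Written out explicitly, for a point p and a centroid c in n dimensions the three distances are:

d_{Euclidean}(p, c) = \sqrt{\sum_{i=1}^{n} (p_i - c_i)^2}

d_{Manhattan}(p, c) = \sum_{i=1}^{n} |p_i - c_i|

d_{Chebyshev}(p, c) = \max_{i} |p_i - c_i|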

To use these formulas, let’s write a function that handles any number of clusters (the dimensions here are fixed to our three RGB channels):

def distance(data, centroids, kind):
    #kind = euclidean, manhattan or chebyshev
    #Here we add to the dataframe as many cluster-distance columns C1..Ck as there are centroids
    cols = list()
    for i in range(1, len(centroids)+1):
        if kind == 'euclidean':
            data[f'C{i}'] = ((centroids[i-1][0]-data.R)**2+(centroids[i-1][1]-data.G)**2+(centroids[i-1][2]-data.B)**2)**0.5
        elif kind == 'manhattan':
            data[f'C{i}'] = abs(centroids[i-1][0]-data.R)+abs(centroids[i-1][1]-data.G)+abs(centroids[i-1][2]-data.B)
        elif kind == 'chebyshev':
            merged = pd.concat([centroids[i-1][0]-data.R, centroids[i-1][1]-data.G, centroids[i-1][2]-data.B], axis=1)
            data[f'C{i}'] = merged.abs().max(axis=1) #the largest absolute difference across the channels
        cols.append(f'C{i}')
    data['Class'] = data[cols].idxmin(axis=1) #assigning each point to its nearest centroid (cluster)
    return data #returning the dataframe with k cluster columns and one Class column with the final cluster

So now we can simply calculate the distances and assign a cluster to each data point. After that, our new centroids become the old ones, so we store them in another variable and recalculate the new ones. To do that we iterate over each cluster and take the mean across all of its points’ coordinates (in our case, across the RGB channels). Therefore, the variable new_centroids has a shape of (k, 3).

data = distance(data.copy(), new_centroids, kind)
old_centroids = new_centroids.copy()
new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].mean().values for Class in data.loc[:,'C1':f'C{K}'].columns])

Finally, we repeat all these steps until the centroids’ coordinates stop changing. I expressed this condition as follows: the difference between the average centroid coordinates should be less than 0.001. But you can play around with other thresholds here.

while abs(new_centroids.mean()-old_centroids.mean()) > 0.001:
    L.append(abs(new_centroids.mean()-old_centroids.mean()))
    data = distance(data, new_centroids, kind)
    old_centroids = new_centroids.copy()
    new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].mean().values for Class in data.loc[:, 'C1':f'C{K}'].columns])

And that’s it. The algorithm is ready to be trained! So let’s set k = 3 and store the results in dictionaries.

k = 3
segmented_1, segmented_2, distances_1, distances_2 = {}, {}, {}, {}
segmented_1['euclidean'], distances_1['euclidean'] = kmeans(df, k, 'euclidean')
segmented_2['euclidean'], distances_2['euclidean'] = kmeans(df2, k, 'euclidean')
segmented_1['manhattan'], distances_1['manhattan'] = kmeans(df, k, 'manhattan')
segmented_2['manhattan'], distances_2['manhattan'] = kmeans(df2, k, 'manhattan')
segmented_1['chebyshev'], distances_1['chebyshev'] = kmeans(df, k, 'chebyshev')
segmented_2['chebyshev'], distances_2['chebyshev'] = kmeans(df2, k, 'chebyshev')

As you can see, I decided to compare all three distance metrics for this particular task, and it’s evident that the Manhattan distance converged the fastest here.

Image by author.
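If you want to reproduce a convergence plot like the one above from the stored lists, a minimal sketch (the styling choices are mine) could look like this:

fig, ax = plt.subplots(figsize=(10, 5))
for key in distances_1.keys():
    ax.plot(distances_1[key], label=f'{key} (2000)')
    ax.plot(distances_2[key], label=f'{key} (2023)', linestyle='--')
ax.set_xlabel('Iteration')
ax.set_ylabel('Centroid shift')
ax.legend()
plt.show()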

Before visualizing the clusters, let’s convert the cluster names into the int type:

d = {'C1':0, 'C2': 1, 'C3':2}
for key in segmented_1.keys():
    segmented_1[key].Class = segmented_1[key].Class.apply(lambda x: d[x])
    segmented_2[key].Class = segmented_2[key].Class.apply(lambda x: d[x])

Time to make the final plots!

for key in segmented_1.keys():
    fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10,10))
    ax[0, 0].imshow(img)
    ax[0, 1].imshow(segmented_1[key].Class.values.reshape(460,390))
    ax[0, 0].set_title('MOD09GA RGB', fontsize=18)
    ax[0, 1].set_title(f'kMeans\n{key[0].upper()+key[1:]} Distance', fontsize=18)

    ax[1, 0].imshow(img2)
    ax[1, 1].imshow(segmented_2[key].Class.values.reshape(460,390))
    ax[1, 0].set_title('MOD09GA RGB', fontsize=18)
    ax[1, 1].set_title(f'kMeans\n{key[0].upper()+key[1:]} Distance', fontsize=18)

    for i in range(2):
        for j in range(2):
            ax[i, j].set_facecolor('black')
            ax[i, j].set_xticks([])
            ax[i, j].set_yticks([])

    plt.tight_layout()
    plt.savefig(f'{key}.png')
    plt.show()
Image by author.
Image by author.
Image by author.

It’s not hard to see that the Euclidean and Manhattan distances turned out to be the most suitable for this particular task. But to make sure of it, let’s evaluate the k-Means clustering results using the Silhouette Coefficient. This metric is perfect for assessing training results when there are no labeled true values for the clustered points.

To calculate it, we will use the silhouette_score function from sklearn [1].

Image by sklearn.
  • a — the mean distance between a sample and all other points in the same class.
  • b — the mean distance between a sample and all other points in the next nearest cluster.
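For each sample the coefficient is then s = (b - a) / max(a, b), as defined in the sklearn documentation [1], and the overall score is the mean over all samples.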

The range of values of the Silhouette Coefficient is [-1, 1]. And yep, it’s computationally expensive, as you need to calculate distances between thousands of points several times, so be ready to wait.

from sklearn.metrics import silhouette_score

scores_1, scores_2 = {}, {}
for key in segmented_1.keys(): #key is the distance metric used for clustering
    scores_1[key] = round(silhouette_score(segmented_1[key].loc[:, :'C3'], segmented_1[key].Class, metric=key), 2)
    scores_2[key] = round(silhouette_score(segmented_2[key].loc[:, :'C3'], segmented_2[key].Class, metric=key), 2)
    print(f'Distance: {key}\t Img 1: {scores_1[key]}\t Img 2: {scores_2[key]}')
Image by author.

Now you can see that we proved it: the Euclidean and Manhattan distances perform similarly well, so let’s estimate the loss of water surface area using both of them.

for metric, Class in zip(['euclidean', 'manhattan'], [2, 1]):
    #each MODIS pixel is 500 x 500 m, so its area is 500*500 m2; multiplying by 1e-6 converts m2 to km2
    img1_water = np.count_nonzero(segmented_1[metric].Class.values == Class)*500*500*1e-6
    img2_water = np.count_nonzero(segmented_2[metric].Class.values == Class)*500*500*1e-6

    print(f'Distance: {metric}\tWater Area Before: {round(img1_water)}km\u00b2\tWater Area After: {round(img2_water)}km\u00b2\tChange: -{100-round(img2_water/img1_water*100)}%')
Image by author.

— — — —

Distance: euclidean
Water Area Before: 17125 km²
Water Area After: 1960 km²
Change: -89%

— — — — —

Distance: manhattan
Water Area Before: 16244 km²
Water Area After: 2003 km²
Change: -88%

As you can see, according to our clustering results, the water surface area has shrunk by almost 90% (!!!) over the last 23 years, which is real proof of the fact that the Aral Sea shrinkage is a planetary tragedy…
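One practical note if you re-run the notebook: the hard-coded class indices above (2 for Euclidean, 1 for Manhattan) depend on the random centroid initialization, so they may differ between runs. A simple heuristic, assuming that open water is the darkest surface in these RGB composites, is to pick the cluster with the lowest mean brightness:

#Hypothetical helper: pick the cluster whose mean R+G+B brightness is the lowest
def water_class(segmented):
    means = segmented.groupby('Class')[['R', 'G', 'B']].mean()
    return means.sum(axis=1).idxmin()

print(water_class(segmented_1['euclidean']), water_class(segmented_2['euclidean']))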

===========================================

Reference:

[1] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score

===========================================

All my publications on Medium are free and open-access, so I’d really appreciate it if you followed me here!

P.S. I’m extremely passionate about (Geo)Data Science, ML/AI and Climate Change. So if you want to work together on some project, please contact me on LinkedIn.

🛰️Follow for more🛰️

