Compression, search, interpolation, and clustering of images using machine learning

How to use image embeddings for compression, search, interpolation and clustering

Lak Lakshmanan
Towards Data Science


Embeddings in machine learning provide a way to create a concise, lower-dimensional representation of complex, unstructured data. Embeddings are commonly employed in natural language processing to represent words or sentences as numbers.

In an earlier article, I showed how to create a concise representation (50 numbers) of 1059x1799 HRRR images. In this article, I will show you that the embedding has some nice properties, and you can take advantage of these properties to implement use cases like compression, image search, interpolation, and clustering of large image datasets.

Compression

First of all, does the embedding capture the important information in the image? Can we take an embedding and decode it back into the original image?

Well, we won’t be able to get back the original image, since we took 2 million pixels’ values and shoved them into a vector of length=50. Still, does the embedding capture the important information in the weather forecast image?
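As a quick back-of-the-envelope check on how aggressive this compression is (a sketch; the actual storage saving depends on data types and file format):

# 1059 x 1799 pixels squeezed into 50 numbers
n_pixels = 1059 * 1799         # 1,905,141 values per image
n_embedding = 50
print(n_pixels / n_embedding)  # roughly 38,000x fewer numbers per image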

Here’s the original HRRR forecast on Sep 20, 2019 for 05:00 UTC:

Sep 20, 2019 weather

We can obtain the embedding for the timestamp and decode it as follows (full code is on GitHub). First, we create a decoder by loading the SavedModel, finding the embedding layer and reconstructing all the subsequent layers:

import tensorflow as tf

def create_decoder(model_dir):
    # Load the full autoencoder and rebuild everything after the embedding layer.
    model = tf.keras.models.load_model(model_dir)
    decoder_input = tf.keras.Input([50], name='embed_input')
    embed_seen = False
    x = decoder_input
    for layer in model.layers:
        if embed_seen:
            x = layer(x)
        elif layer.name == 'refc_embedding':
            embed_seen = True
    decoder = tf.keras.Model(decoder_input, x, name='decoder')
    print(decoder.summary())
    return decoder

decoder = create_decoder('gs://ai-analytics-solutions-kfpdemo/wxsearch/trained/savedmodel')

Once we have the decoder, we can pull the embedding for the timestamp from BigQuery:

SELECT *
FROM advdata.wxembed

This gives us a table like this:
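To work with that table from Python, one option (a sketch, assuming the google-cloud-bigquery client library and a notebook that is already authenticated against your project) is to read the query results into a pandas DataFrame:

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials
df = client.query("SELECT * FROM advdata.wxembed").to_dataframe()
# Each row now has a 'time' column and a 'ref' column holding the 50-number embedding.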

We can then pass the “ref” values from the table above to the decoder:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# The 'ref' column holds the 50-number embedding; reshape it into a batch of one.
embed = tf.reshape( tf.convert_to_tensor(df['ref'].values[0], dtype=tf.float32),
                    [-1, 50])
# Decode, drop the batch dimension, and rescale before display.
outimg = decoder.predict(embed).squeeze() * 60
plt.imshow(outimg, origin='lower');

Note that TensorFlow expects to see a batch of inputs, and since we are passing in only one, I have to reshape it to be [1, 50]. Similarly, TensorFlow returns a batch of images. I squeeze it (remove the dummy dimension) before displaying it. The result?

As you can see, the decoded image is a blurry version of the original HRRR. The embedding does retain key information. It functions as a compression algorithm.
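If you want to put a number on how lossy the compression is, one simple check is the pixel-wise RMSE between the original image and the decoded one. A sketch, where refc is assumed to be the original reflectivity image as a numpy array on the same scale as outimg:

import numpy as np

# refc: hypothetical numpy array holding the original image, scaled the same way as outimg
rmse = np.sqrt(np.mean((refc - outimg) ** 2))
print(rmse)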

Search

If the embeddings are a compressed representation, will the degree of separation in embedding space translate to the degree of separation in terms of the actual forecast images?

If this is the case, it becomes easy to search for “similar” weather situations in the past to some scenario in the present. Finding analogs on the 2-million-pixel representation can be difficult because storms could be slightly offset from each other, or somewhat vary in size.

Since we have the embeddings in BigQuery, let’s use SQL to search for images that are similar to what happened on Sep 20, 2019 at 05:00 UTC:

WITH ref1 AS (
  SELECT time AS ref1_time, ref1_value, ref1_offset
  FROM `ai-analytics-solutions.advdata.wxembed`,
       UNNEST(ref) AS ref1_value WITH OFFSET AS ref1_offset
  WHERE time = '2019-09-20 05:00:00 UTC'
)
SELECT
  time,
  SUM( (ref1_value - ref[OFFSET(ref1_offset)]) * (ref1_value - ref[OFFSET(ref1_offset)]) ) AS sqdist
FROM ref1, `ai-analytics-solutions.advdata.wxembed`
GROUP BY 1
ORDER BY sqdist ASC
LIMIT 5

Basically, we are computing the (squared) Euclidean distance between the embedding at the specified timestamp (ref1) and every other embedding, and displaying the closest matches. The result:

This makes a lot of sense. The image from the previous/next hour is the most similar. Then, images from +/- 2 hours and so on.
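If the embeddings fit comfortably in memory, the same nearest-neighbor search can be sketched in a few lines of numpy (assuming df is the DataFrame we pulled from BigQuery earlier, with a 'time' column and 50-element 'ref' arrays):

import numpy as np

# Stack the 50-element embeddings into an (N, 50) matrix.
emb = np.stack(df['ref'].values).astype(np.float32)

# Row index of the query timestamp (string comparison assumes 'time' is stored as text).
qidx = np.flatnonzero(df['time'] == '2019-09-20 05:00:00 UTC')[0]

# Squared Euclidean distance from the query embedding to every embedding.
sqdist = np.sum((emb - emb[qidx]) ** 2, axis=1)

# The five closest timestamps; the query itself comes first with distance 0.
print(df['time'].values[np.argsort(sqdist)[:5]])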

What if we want to find the most similar image that is not within +/- 1 day? Since we have only one year of data, we are not going to find great analogs, but let’s see what we get:

WITH ref1 AS (
  SELECT time AS ref1_time, ref1_value, ref1_offset
  FROM `ai-analytics-solutions.advdata.wxembed`,
       UNNEST(ref) AS ref1_value WITH OFFSET AS ref1_offset
  WHERE time = '2019-09-20 05:00:00 UTC'
)
SELECT
  time,
  SUM( (ref1_value - ref[OFFSET(ref1_offset)]) * (ref1_value - ref[OFFSET(ref1_offset)]) ) AS sqdist
FROM ref1, `ai-analytics-solutions.advdata.wxembed`
WHERE time NOT BETWEEN '2019-09-19' AND '2019-09-21'
GROUP BY 1
ORDER BY sqdist ASC
LIMIT 5

The result is a bit surprising: Jan. 2 and July 1 are the days with the most similar weather:

Well, let’s take a look at the two timestamps:

We see that the Sep 20 image does fall somewhere between these two images: there is weather in the Gulf Coast and the upper Midwest in both of them, just as there is in the Sep 20 image.

We would probably get more meaningful search results if we had (a) more than just one year of data, (b) loaded HRRR forecast images at multiple time-steps instead of just the analysis fields, and (c) used smaller tiles so as to capture mesoscale phenomena. This is left as an exercise for interested meteorology students reading this :)

Interpolation

Recall that when we looked for the images that were most similar to the image at 05:00, we got the images at 06:00 and 04:00 and then the images at 07:00 and 03:00. The distance to the next hour was on the order of sqrt(0.5) in embedding space.

Given this behavior in the search use case, a natural question to ask is whether we can use the embeddings for interpolating between weather forecasts. Can we average the embeddings at t-1 and t+1 to get the one at t=0? What’s the error?

WITH refl1 AS (
  SELECT ref1_value, idx
  FROM `ai-analytics-solutions.advdata.wxembed`,
       UNNEST(ref) AS ref1_value WITH OFFSET AS idx
  WHERE time = '2019-09-20 05:00:00 UTC'
),
...
SELECT SUM( (ref2_value - (ref1_value + ref3_value)/2) * (ref2_value - (ref1_value + ref3_value)/2) ) AS sqdist
FROM refl1
JOIN refl2 USING (idx)
JOIN refl3 USING (idx)

The result? sqrt(0.1), which is much less than sqrt(0.5). In other words, the embeddings do function as a handy interpolation algorithm.
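The same check can be written directly in numpy. A sketch; embedding_at is a hypothetical helper that looks up the 50-element embedding for a timestamp in the DataFrame we queried earlier:

import numpy as np

def embedding_at(df, timestamp):
    # Hypothetical helper: find the row for this timestamp and return its 50-number embedding.
    idx = np.flatnonzero(df['time'] == timestamp)[0]
    return np.asarray(df['ref'].values[idx], dtype=np.float32)

e_prev = embedding_at(df, '2019-09-20 04:00:00 UTC')
e_mid = embedding_at(df, '2019-09-20 05:00:00 UTC')
e_next = embedding_at(df, '2019-09-20 06:00:00 UTC')

# Distance between the actual 05:00 embedding and the average of its neighbors.
print(np.sqrt(np.sum((e_mid - (e_prev + e_next) / 2) ** 2)))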

In order to use the embeddings as a useful interpolation algorithm, though, we would need to represent the images by much more than 50 numbers; the information loss at this size is simply too high. Again, this is left as an exercise for interested meteorologists.

Clustering

Given that the embeddings seem to behave so well under distances and averaging, we should expect to be able to cluster them.

Let’s use the K-Means algorithm and ask for five clusters:

CREATE OR REPLACE MODEL advdata.hrrr_clusters
OPTIONS(model_type='kmeans', num_clusters=5, KMEANS_INIT_METHOD='KMEANS++')
AS
SELECT arr_to_input(ref) AS ref
FROM `ai-analytics-solutions.advdata.wxembed`
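If you prefer to cluster outside of BigQuery, the equivalent with scikit-learn takes only a few lines; a sketch, assuming the embeddings have been stacked into the (N, 50) numpy matrix from the search example:

from sklearn.cluster import KMeans
import numpy as np

emb = np.stack(df['ref'].values).astype(np.float32)   # (N, 50) embedding matrix
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(emb)
centroids = kmeans.cluster_centers_                    # five 50-element centroid vectors

Either way, each centroid is itself a point in the 50-dimensional embedding space, so it can be decoded just like any other embedding.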

Back in BigQuery, each of the resulting centroids is a 50-element array:
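To get those centroid values into a DataFrame for plotting, one option (a sketch; ML.CENTROIDS is the BigQuery ML function that exposes a k-means model’s centroids as rows of centroid_id, feature and numerical_value) is:

from google.cloud import bigquery

client = bigquery.Client()
df = client.query("""
SELECT centroid_id, feature, numerical_value
FROM ML.CENTROIDS(MODEL `advdata.hrrr_clusters`)
ORDER BY centroid_id, feature
""").to_dataframe()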

We can then go ahead and plot the decoded versions of the five centroids:

# The (cid-1)//2, (cid-1)%2 indexing below implies a 3x2 grid of subplots.
fig, axarr = plt.subplots(3, 2)
for cid in range(1, 6):
    # 50 numerical values for this centroid, in feature order
    embed = df[df['centroid_id'] == cid].sort_values(by='feature')['numerical_value'].values
    embed = tf.reshape( tf.convert_to_tensor(embed, dtype=tf.float32), [-1, 50])
    outimg = decoder.predict(embed).squeeze() * 60
    axarr[(cid-1)//2, (cid-1)%2].imshow(outimg, origin='lower');

Here are the resulting centroids of the 5 clusters:

When the embeddings are clustered into 5 clusters, these are the centroids

The first one seems to be your classic midwestern storm. The second one consists of widespread weather in the Chicago-Cleveland corridor and the Southeast. The third one is a strong variant of the second. The fourth is a squall line marching across the Appalachians. The fifth is clear skies in the interior, but weather on the coasts.

In order to use the clusters as a useful forecasting aid, though, you probably will want to cluster much smaller tiles, perhaps 500km x 500km tiles, not the entire CONUS.

In all five clusters, it is raining in Seattle and sunny in California.

Next steps:

  1. Peruse the full code on GitHub
  2. Read the two earlier articles. One is on how to convert HRRR files into TensorFlow records, and the other is on how to use autoencoders to create embeddings of the HRRR analysis images.
  3. I gave a talk on this topic at the eScience Institute at the University of Washington. See the talk on YouTube:
