How to create a concise image representation using machine learning

Designing and training an autoencoder on HRRR images in Keras

Lak Lakshmanan
Towards Data Science


Autoencoder examples on the internet seem either to use toy datasets (MNIST, 28x28 images) or to take advantage of transfer learning from ImageNet bottleneck layers. I will show you how to train an autoencoder from scratch, something you will want to do if you have enough data, and data that is completely unlike the photographs that ImageNet consists of.

In an earlier article, I showed how to take weather forecast images and create TensorFlow records out of them to make them ready for machine learning.

In this article, I will show how to do one machine learning task on these images: create concise representations of the radar reflectivity “analysis” fields from the High Resolution Rapid Refresh (HRRR) model using autoencoders. The HRRR images are 1059x1799 and look like this:

Designing an autoencoder in Keras

An autoencoder consists of an architecture that typically looks like this:

Autoencoder architecture. Source: Chervinskii on Wikipedia (https://commons.wikimedia.org/wiki/File:Autoencoder_structure.png)

The encoder tries to represent the input image as a much smaller array of numbers. The decoder reconstructs the image from that code. The entire model is trained to minimize the difference between the input and the output.

There are a couple of gotchas to watch out for:

  1. The code (or embedding) needs to be much smaller than the input. Otherwise, the ML model can simply copy input pixels to the code. There will be no generalization.
  2. The number of parameters can get out of hand with an input that has nearly 2 million pixels. Traditionally, the way to reduce the number of parameters is to use convolutional layers (see the parameter-count sketch just after this list).
  3. However, convolutional layers preserve locality information, so essentially all the layers are doing is blurring the image more and more. To get beyond mere blurring, you should increase the number of “channels” in the convolution filters at each stage. This way, you get to preserve information between layers.
  4. Also, you need at least one dense connection that brings in “teleconnections” between different parts of the image. I chose to do this when creating the embedding.
  5. Consider carefully whether you will need short-circuit layers, attention, etc. This depends on the type of image you are processing, and the number of layers. In my case, these are geographic images with a strong locality constraint. So, I decided to not use these tricks.
  6. The way to reduce the spatial size of the convolutional layers’ output is to use pooling layers. In the decoder, the inverse operation is upsampling.
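To make the second gotcha concrete, here is a quick back-of-the-envelope comparison, on the cropped 1024x1792 input, of a fully connected layer versus a small convolution. This is illustrative arithmetic only (the 64-unit and 4x4-filter sizes are hypothetical, not from the article):

h, w = 1024, 1792
dense_units = 64
dense_params = h * w * dense_units + dense_units        # ~117 million weights for one modest dense layer
conv_filters, kernel = 64, 4
conv_params = kernel * kernel * 1 * conv_filters + conv_filters   # about a thousand weights for a 4x4 conv
print(dense_params, conv_params)                         # 117,440,576 vs. 1,088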

This was my final architecture:

Let’s walk through it in code (the full code is on GitHub). The first layer is the input layer:

input_img = tf.keras.Input(shape=(1059, 1799, 1), name='refc_input')

I’d prefer that the size be a power of 2, to make it easy to keep shrinking the layers. I could pad 1059 all the way to 2048, but it seems more rational to crop it to 1024 (and the width to 1792), since the edges of the HRRR image are poorly resolved in any case. So, the second layer is a cropping layer:

x = tf.keras.layers.Cropping2D(cropping=((17, 18),(4, 3)), name='cropped')(input_img)
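As a quick sanity check of the crop arithmetic (my own sketch, not code from the article):

# 1024 is a power of two, and both dimensions divide evenly by 4 four times
# (1024 / 4**4 = 4, 1792 / 4**4 = 7), which matches the pool size of 4 chosen later.
print(1059 - (17 + 18), 1799 - (4 + 3))   # -> 1024 1792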

We don’t know how many convolutional layers we need, so let’s make the number of layers (nlayers) a hyperparameter. In addition, we will make the number of filters (or number of channels) a hyperparameter since this controls how much information is passed from one layer to the next. Finally, the poolsize will also be a hyperparameter since this controls how much reduction in image size happens from one layer to the next:

last_pool_layer = None
for layerno in range(nlayers):
    x = tf.keras.layers.Conv2D(2**(layerno + numfilters), poolsize, activation='relu',
                               padding='same', name='encoder_conv_{}'.format(layerno))(x)
    last_pool_layer = tf.keras.layers.MaxPooling2D(poolsize, padding='same',
                                                   name='encoder_pool_{}'.format(layerno))
    x = last_pool_layer(x)
output_shape = last_pool_layer.output_shape[1:]

I also make sure to capture the last layer so as to get the output shape before the embedding layer (for reasons that will become evident).
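To see what that loop produces, here is a quick sketch of the shape arithmetic, using the hyperparameter values that the search below ends up choosing (nlayers=4, poolsize=4, numfilters=4):

nlayers, poolsize, numfilters = 4, 4, 4    # the values the hyperparameter search settles on
h, w = 1024, 1792                          # spatial size after cropping
for layerno in range(nlayers):
    h, w = h // poolsize, w // poolsize    # each MaxPooling2D shrinks the image by poolsize
    channels = 2**(layerno + numfilters)   # 16, 32, 64, 128 filters
print((h, w, channels))                    # -> (4, 7, 128), i.e. output_shape
# Flatten will therefore produce 4 * 7 * 128 = 3584 values going into the embedding.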

Once the convolutional layers are done, we can pass it through a dense layer. I made the number of dense nodes (essentially the length of the embedding) a hyperparameter as well:

# flatten, send through dense layer to create the embedding
x = tf.keras.layers.Flatten(name='encoder_flatten')(x)
x = tf.keras.layers.Dense(num_dense, name='refc_embedding')(x)
x = tf.keras.layers.Dense(output_shape[0] * output_shape[1] * output_shape[2], name='decoder_dense')(x)
embed_size = num_dense

Note how I’m using the output shape pre-embedding to get back the original flattened length in the first layer of the decoder.

Then, we reverse the operations one-by-one. First is to reshape the decoder to get to the last conv-pool block’s output shape:

x = tf.keras.layers.Reshape(output_shape, name='decoder_reshape')(x)

Then, create a decoder block that consists of a convolution (with the filter counts in reverse order) followed by upsampling:

for layerno in range(nlayers):
    x = tf.keras.layers.Conv2D(2**(nlayers-layerno-1 + numfilters), poolsize, activation='relu',
                               padding='same', name='decoder_conv_{}'.format(layerno))(x)
    x = tf.keras.layers.UpSampling2D(poolsize, name='decoder_upsamp_{}'.format(layerno))(x)

Where we did the cropping in the encoder, we now need to do zero-padding:

before_padding_layer = tf.keras.layers.Conv2D(1, 3, activation='relu', padding='same', name='before_padding')
x = before_padding_layer(x)
htdiff = 1059 - before_padding_layer.output_shape[1]
wddiff = 1799 - before_padding_layer.output_shape[2]
decoded = tf.keras.layers.ZeroPadding2D(padding=((htdiff//2, htdiff - htdiff//2),
                                                 (wddiff//2, wddiff - wddiff//2)),
                                        name='refc_reconstructed')(x)
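As a sanity check of that arithmetic (my own sketch, assuming the decoder has upsampled back to 1024x1792):

htdiff, wddiff = 1059 - 1024, 1799 - 1792    # 35 and 7
print((htdiff//2, htdiff - htdiff//2))       # (17, 18)
print((wddiff//2, wddiff - wddiff//2))       # (3, 4) -- so the output is 1059x1799 again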

And now, create the autoencoder:

autoencoder = tf.keras.Model(input_img, decoded, name='autoencoder')
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.LogCosh())

Why LogCosh? Because we are doing regression, and LogCosh is more tolerant of outliers than Mean Squared Error.
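Here is a quick illustration of that behavior (my own sketch, not code from the article): log-cosh grows roughly like x**2/2 for small errors but only linearly for large ones, whereas the squared error keeps growing quadratically.

import numpy as np

for err in [0.1, 1.0, 10.0]:
    print(err, np.log(np.cosh(err)), err**2)   # log-cosh: 0.005, 0.434, 9.31; squared: 0.01, 1, 100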

Writing the trainer

Once the model is written, the trainer (full code: check it out) is mostly boilerplate. All our images are TFRecords, so we create a TFRecord Dataset, create the model, and call model.fit().

To create a TFRecord Dataset:

def parse_tfrecord(example_data):
    parsed = tf.io.parse_single_example(example_data, {
        'size': tf.io.VarLenFeature(tf.int64),
        'ref': tf.io.VarLenFeature(tf.float32),
        'time': tf.io.FixedLenFeature([], tf.string),
        'valid_time': tf.io.FixedLenFeature([], tf.string)
    })
    parsed['size'] = tf.sparse.to_dense(parsed['size'])
    parsed['ref'] = tf.reshape(tf.sparse.to_dense(parsed['ref']), (1059, 1799))/60.  # 0 to 1
    return parsed

def read_dataset(pattern):
    filenames = tf.io.gfile.glob(pattern)
    ds = tf.data.TFRecordDataset(filenames, compression_type=None, buffer_size=None, num_parallel_reads=None)
    return ds.prefetch(tf.data.experimental.AUTOTUNE).map(parse_tfrecord)
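To check that the parsing works, you can eyeball a single record like this (a quick sketch of my own; the file pattern is a placeholder for wherever your TFRecords live):

ds = read_dataset('gs://BUCKET/wxsearch/data/2019/tfrecord-00000-*')   # placeholder pattern
for rec in ds.take(1):
    print(rec['ref'].shape, rec['time'].numpy())   # expect (1059, 1799) and the model run time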

To create the model and train it:

def run_job(opts):
    def input_and_label(rec):
        return rec['ref'], rec['ref']
    ds = read_dataset(opts['input']).map(input_and_label).batch(opts['batch_size']).repeat()

    checkpoint = tf.keras.callbacks.ModelCheckpoint(os.path.join(opts['job_dir'], 'checkpoints'))

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        autoencoder, error = create_model(opts['num_layers'], opts['pool_size'],
                                          opts['num_filters'], opts['num_dense'])

    history = autoencoder.fit(ds, steps_per_epoch=opts['num_steps']//opts['num_checkpoints'],
                              epochs=opts['num_checkpoints'], shuffle=True,
                              callbacks=[checkpoint, HptCallback()])

    autoencoder.save(os.path.join(opts['job_dir'], 'savedmodel'))

A couple of points here:

  1. I take the TFRecord and return the same reflectivity image data as both the input and the label. This is because we are doing autoencoding.
  2. The MirroredStrategy will allow me to create a machine with multiple GPUs and get fast distributed training.
  3. I create a model checkpoint callback (Checkpoints is one of the design patterns in the Machine Learning Design Patterns book; it helps make distributed training more resilient).
  4. In the fit() method, I play with steps_per_epoch and the number of epochs. This is the Virtual Epochs variant of the Checkpoints pattern (the arithmetic is sketched just after this list). Again, read the book for why this is important.
  5. Because we are doing autoencoding, I won’t bother with a validation dataset. The more closely we can represent the input, the better.
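To make the Virtual Epochs point concrete, here is the arithmetic with the flags used for the full training job below (num_steps=50000, num_checkpoints=10); this is a sketch, not code from the trainer:

num_steps, num_checkpoints = 50000, 10
steps_per_epoch = num_steps // num_checkpoints   # 5000 batches per "virtual epoch"
epochs = num_checkpoints                         # a checkpoint is written at the end of each one
print(steps_per_epoch, epochs)                   # -> 5000 10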

Overfitting a batch

A best practice when designing a model architecture like I did is to make sure that the resulting architecture is powerful enough to learn what we need it to learn. The way to do this is Useful Overfitting (another pattern in our book). Basically, take a really small dataset (I used 4 images) and overfit the model on this dataset. If you can get the error to go really small, then the model is powerful enough.
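One way to build such a tiny dataset from the read_dataset() helper above is sketched here; the take() and cache() calls and the file pattern are my additions (the actual run simply points the trainer at a single TFRecord shard with a small num_steps, as you will see below):

small_ds = (read_dataset('gs://BUCKET/wxsearch/data/2019/tfrecord-00000-*')   # placeholder pattern
            .map(lambda rec: (rec['ref'], rec['ref']))   # autoencoder: input == label
            .take(4)                                     # just 4 images
            .cache()
            .repeat()
            .batch(4))
# autoencoder.fit(small_ds, ...) should be able to drive the loss very close to zero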

So, here’s what I did. I wrote my ML model and then I did a grid search over the hyperparameter space, and chose the smallest embedding size that allowed me to learn the 4 images as perfectly as possible (it won’t be perfect because the image representations will be blurry).

Because of this, there is a Hyperparameter callback as well:

class HptCallback(tf.keras.callbacks.Callback):
    def __init__(self):
        self.hpt = hypertune.HyperTune()

    def on_epoch_end(self, epoch, logs):
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='final_loss',
            metric_value=logs['loss'],  # history.history['loss'][-1]
            global_step=epoch
        )

I can now do the training on Google Cloud AI Platform:

gcloud ai-platform jobs submit training $JOB_NAME \
--package-path $PACKAGE_PATH \
--module-name $MODULE_NAME \
--job-dir gs://${BUCKET}/wxsearch/trained \
--region $REGION \
--config hyperparam.yaml -- \
--input gs://${BUCKET}/wxsearch/data/2019/tfrecord-00000-* \
--project ${PROJECT} \
--batch_size 4 --num_steps 1000 --num_checkpoints 4

And as a result, I found that this architecture was enough:

--num_layers 4 --pool_size 4 --num_filters 4 --num_dense 50

Just 50 numbers to represent nearly 2 million pixels!
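For scale, a quick back-of-the-envelope calculation:

pixels = 1059 * 1799                 # ~1.9 million reflectivity values per HRRR field
embedding = 50                       # num_dense
print(pixels, pixels // embedding)   # -> 1905141, roughly a 38,000x reduction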

Train the autoencoder

Once we have decided on the hyperparameters on the small, overfit batch, we can take the model and train it on the entire HRRR 2019 dataset:

gcloud ai-platform jobs submit training $JOB_NAME \
--package-path $PACKAGE_PATH \
--module-name $MODULE_NAME \
--job-dir gs://${BUCKET}/wxsearch/trained \
--region $REGION \
--config train.yaml -- \
--input gs://${BUCKET}/wxsearch/data/2019/tfrecord-* \
--project ${PROJECT} \
--batch_size 4 --num_steps 50000 --num_checkpoints 10

where the train.yaml is:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highmem-2
  masterConfig:
    acceleratorConfig:
      count: 2
      type: NVIDIA_TESLA_K80
  runtimeVersion: '2.2'
  pythonVersion: '3.7'
  scheduling:
    maxWaitTime: 3600s

This ended with a final loss of 0.0023, actually even better than the overfit loss on the tiny dataset.

Next steps:

Try it out (all links to GitHub)

  1. Create TensorFlow Records from HRRR images: hrrr_to_tfrecord.sh
  2. [Optional] Overfit batch to find best set of hyperparameters: overfit_batch.sh
  3. Train the autoencoder on the 2019 HRRR dataset: train_autoencoder.sh
  4. In my next article in this series, I show how to take the trained model and create embeddings for the entire 2019 dataset, and then how to reconstruct an HRRR image from an embedding.
