Hands-on Tutorials

Annotating a large image dataset is labor-intensive and, as a consequence, expensive. In some cases, only large corporations have the resources to build the datasets that will give their convolutional neural networks (CNNs) an edge for a given application.
A possible approach to work around this problem is image synthesis. When the object class is so simple that we can programmatically create realistic examples, we can generate images with corresponding target heatmaps. These pairs will serve as training observations for a fully convolutional neural network (FCNN) that takes as input an image and predicts the corresponding heatmap.
Stopping is mandatory when you see a red octagon
As a simple example, we’ll create a generator for monochrome images of stop signs, one of the many signals an autonomous vehicle must handle. Ignoring the color information will pose an additional difficulty for training, but if we succeed, the resulting FCNN will be less likely to be confused by a round red spot, for example.
You can find the code used in this article here.
The generator starts from a random background image. It will draw an octagon with the word ‘STOP’ inside. Adding noise, blurring, and applying an affine transformation will complete the process of rendering a reasonably realistic image of a stop sign on a random background. In some cases, the generator will draw an additional octagon filled with a random uniform shade; that octagon simulates the backside of a stop sign, which the FCNN will have to learn to ignore. The generator will also sometimes write the word ‘STOP’ outside of an octagon. The surface of the octagon, morphed by the affine transformation, will be the target heatmap that the FCNN will have to learn to generate.
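Below is a minimal sketch of what such a generator could look like, using OpenCV and NumPy on grayscale images. The helper names (octagon_vertices, generate_pair) and the specific sizes, shades, noise level, and transform ranges are illustrative assumptions rather than the exact values from the linked code; the backside octagon and the stray ‘STOP’ text would be drawn with the same primitives.

```python
import numpy as np
import cv2

IMG_SIZE = 256  # the training images are 256 x 256, grayscale

def octagon_vertices(center, radius):
    """Eight vertices of a regular octagon, as an OpenCV polygon."""
    angles = np.pi / 8 + np.arange(8) * np.pi / 4
    points = np.stack([center[0] + radius * np.cos(angles),
                       center[1] + radius * np.sin(angles)], axis=1)
    return points.astype(np.int32)

def generate_pair(background):
    """Draws a stop sign on a background and returns (image, target heatmap)."""
    image = cv2.resize(background, (IMG_SIZE, IMG_SIZE)).astype(np.float32)
    heatmap = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.float32)

    # Random location and scale of the sign
    radius = np.random.randint(20, 60)
    center = np.random.randint(radius, IMG_SIZE - radius, size=2)
    vertices = octagon_vertices(center, radius)

    # Dark octagon with a light 'STOP' inside (monochrome: no color cue)
    cv2.fillPoly(image, [vertices], color=60.0)
    cv2.putText(image, "STOP",
                (int(center[0] - radius // 2), int(center[1] + radius // 4)),
                cv2.FONT_HERSHEY_SIMPLEX, radius / 60.0, color=220.0, thickness=2)

    # The octagon surface is the target the FCNN must learn to highlight
    cv2.fillPoly(heatmap, [vertices], color=1.0)

    # Random affine transformation, applied identically to image and heatmap
    matrix = cv2.getRotationMatrix2D((IMG_SIZE / 2, IMG_SIZE / 2),
                                     np.random.uniform(-20, 20),
                                     np.random.uniform(0.8, 1.2))
    image = cv2.warpAffine(image, matrix, (IMG_SIZE, IMG_SIZE))
    heatmap = cv2.warpAffine(heatmap, matrix, (IMG_SIZE, IMG_SIZE))

    # Noise and blur to make the rendering look less synthetic
    image = cv2.blur(image + np.random.normal(0.0, 5.0, image.shape), (3, 3))
    return np.clip(image, 0.0, 255.0), heatmap
```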

One of the main advantages of using synthetic images is a practically infinite training dataset. Even if the number of available background images is finite, the various transformations, the random location, and the scale of the drawn object assure us that the neural network will never see the same image twice during training.
The training images are synthesized on demand, but the validation images are generated once and retained to provide a fair comparison of losses from one epoch to the next.
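Here is one way this on-demand synthesis could be wired into a PyTorch Dataset. This is a sketch only: the class name, the nominal epoch length, and the normalization are assumptions, and generate_pair refers to the generator sketched above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SyntheticStopSigns(Dataset):
    """Synthesizes a fresh (image, heatmap) pair every time an item is requested."""
    def __init__(self, backgrounds, epoch_length=2048):
        self.backgrounds = backgrounds    # list of grayscale numpy arrays
        self.epoch_length = epoch_length  # nominal "size" of one training epoch

    def __len__(self):
        return self.epoch_length

    def __getitem__(self, index):
        background = self.backgrounds[np.random.randint(len(self.backgrounds))]
        image, heatmap = generate_pair(background)
        image = torch.from_numpy(image).float().unsqueeze(0) / 255.0  # (1, H, W)
        heatmap = torch.from_numpy(heatmap).float().unsqueeze(0)      # (1, H, W)
        return image, heatmap

# The validation pairs, in contrast, are generated once and reused at every epoch:
# validation_pairs = [generate_pair(bg) for bg in validation_backgrounds]
```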

FCNN architecture
The architecture that I chose (after an embarrassing number of failed iterations) is a stack of three convolution layers, followed by a series of three transposed convolutions.
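As a rough sketch, such a model could look like the following in PyTorch; the channel counts, kernel sizes, and strides are my own guesses rather than the values used in the linked code.

```python
import torch.nn as nn

class StopSignFCNN(nn.Module):
    """Three strided convolutions that downsample the image, followed by three
    transposed convolutions that upsample back to the input resolution."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward_single(self, x):
        """Heatmap for one (possibly blurred) version of the input image."""
        return self.layers(x)
```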

The implementation of this FCNN architecture accepts a list of input tensors and returns, as its final output, the pixel-wise maximum over the outputs computed for each tensor. This feature was motivated by the observation that trained FCNNs tend to be very sensitive to the level of blur [1] applied to the input image. In some cases, when an input image was not blurred, the output heatmap did not highlight the stop sign, but when the same image was slightly blurred, the heatmap highlighted it. In other cases, the opposite was true: the stop sign would be detected when the input image was not blurred, but would not be detected with a blurred input. This observation led to submitting increasingly blurred versions of the image and retaining the pixel-wise maximum of each output. This mechanism is integrated into the definition of the neural network’s forward() function.
The FCNN was trained by supplying a list of two images: the unblurred image and an image blurred with a 3×3 kernel.
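A sketch of this mechanism, applied to the model above: in the linked code the logic lives directly in forward(), whereas here it is written as standalone helpers with hypothetical names (box_blur, predict_heatmap).

```python
import torch
import torch.nn.functional as F

def box_blur(image, kernel_size):
    """Uniform (box) blur of a batch of shape (N, 1, H, W); size 1 means no blurring."""
    if kernel_size == 1:
        return image
    kernel = torch.ones(1, 1, kernel_size, kernel_size) / kernel_size ** 2
    return F.conv2d(image, kernel, padding=kernel_size // 2)

def predict_heatmap(model, image, kernel_sizes=(1, 3)):
    """Runs the FCNN on each blurred version of the image and keeps, for every
    pixel, the maximum value across the resulting heatmaps."""
    heatmaps = [model.forward_single(box_blur(image, k)) for k in kernel_sizes]
    return torch.amax(torch.stack(heatmaps), dim=0)
```

Because the maximum is taken pixel by pixel, a stop sign only needs to be detected at one of the blurring levels to appear in the final heatmap.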
Training
Using the binary cross-entropy loss seemed to work well for this type of task.
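A minimal training loop sketch, assuming the dataset, model, and helpers from the earlier sketches; the batch size, learning rate, number of epochs, and the backgrounds variable (the list of background images) are placeholders.

```python
import torch
from torch.utils.data import DataLoader

model = StopSignFCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()  # binary cross-entropy between predicted and target heatmaps
loader = DataLoader(SyntheticStopSigns(backgrounds), batch_size=16)

for epoch in range(30):
    for images, heatmaps in loader:
        predictions = predict_heatmap(model, images, kernel_sizes=(1, 3))
        loss = criterion(predictions, heatmaps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Since the sketched model already ends in a sigmoid, plain BCELoss is used here; with raw logits, BCEWithLogitsLoss would be the numerically safer choice.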


Testing the trained FCNN
After training with synthetic images, we can test on real scene images. Some of these images include a stop sign, and others do not. Some scenes feature the backside of a stop sign, which should not activate the heatmap.

The synthetic images used for training were 256 × 256, but at inference time the results are best when the real scene images are resized to 512 × 512 before being passed to the FCNN.
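Inference on a real scene could then look like the following sketch, which reuses the predict_heatmap helper from above; the loading and normalization details are assumptions.

```python
import cv2
import numpy as np
import torch

def detect_stop_sign(model, scene_path):
    """Loads a real scene in grayscale, resizes it to 512 x 512, and returns the heatmap."""
    scene = cv2.imread(scene_path, cv2.IMREAD_GRAYSCALE)
    scene = cv2.resize(scene, (512, 512)).astype(np.float32) / 255.0
    batch = torch.from_numpy(scene).unsqueeze(0).unsqueeze(0)  # (1, 1, 512, 512)
    with torch.no_grad():
        heatmap = predict_heatmap(model, batch, kernel_sizes=(1, 3))
    return heatmap.squeeze().numpy()
```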

We observe that we obtain no false positives (i.e., when the scene doesn’t include a stop sign, the heatmap doesn’t display an area of high intensity), but some false negatives occur (i.e., some stop signs are missed). This phenomenon is not observed with synthetic images.

Study of the missed cases
With neural networks, identifying the cause of a malfunction is often difficult. One thing we can do is alter the problematic input images and observe whether the outputs improve.
Multiple blurring
Two cases that failed to generate a good heatmap were submitted with multiple blurring sizes, from 1×1 (i.e. no blurring) up to 9×9.
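The probing itself is a small loop; this sketch reuses box_blur, the model, and the preprocessed batch tensor from the earlier sketches.

```python
import torch

# Probe a problematic scene with increasing blurring sizes, one heatmap per size.
heatmaps_by_blur = {}
for k in (1, 3, 5, 7, 9):
    with torch.no_grad():
        heatmaps_by_blur[k] = model.forward_single(box_blur(batch, k)).squeeze().numpy()
# Inspecting heatmaps_by_blur shows which blurring level makes the stop sign appear.
```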


Resizing
Resizing the input image to various resolutions is possible with fully convolutional neural networks because their architecture doesn’t include fully connected layers, whose input and output dimensions are fixed. This flexibility allows us to choose the dimensions to which the original image is resized before being passed through the FCNN. The same problematic images were resized to several different sizes before being passed through the network.
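A similar probing loop over resolutions, again a sketch: it assumes the scene is already loaded as a grayscale array, reuses the predict_heatmap helper, and the list of sizes is illustrative.

```python
import cv2
import numpy as np
import torch

heatmaps_by_size = {}
for size in (256, 384, 512, 768):
    resized = cv2.resize(scene, (size, size)).astype(np.float32) / 255.0
    batch = torch.from_numpy(resized).unsqueeze(0).unsqueeze(0)  # (1, 1, size, size)
    with torch.no_grad():
        heatmaps_by_size[size] = predict_heatmap(model, batch).squeeze().numpy()
# Comparing the heatmaps across resolutions reveals whether a missed sign reappears.
```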


Importance of the printed word
For the training images, the word ‘STOP’ appeared in the center of the octagon. We can test if the presence of this word is a necessary feature. The following image shows the result of submitting a synthetic image of an octagon without any text.

The FCNN didn’t activate for most of the octagon surface. We can conclude that the presence of text is necessary.
The following image shows the heatmap when the printed text is a randomly chosen word from the Oxford English Dictionary.

Conclusion
We generated synthetic images of a stop sign on a random background and used these images to train a fully convolutional neural network to generate a heatmap of the stop sign octagon.
The initial observation that the FCNN is very sensitive to the level of blurring in the input images was reinforced by the analysis of the problematic cases. The FCNN was trained with blurring sizes of 1 and 3, which is not enough for some real scene inputs. I would recommend training with blurring sizes of [1, 3, 5, 7].
![Left: Result when blurring sizes are [1, 3]. Right: Result when blurring sizes are [1, 3, 5, 7]. Image by the author.](https://towardsdatascience.com/wp-content/uploads/2021/11/1Xb9eE_TSICc-EugdF6NcQ.png)
As for resizing, only one of the two studied cases of a missed object showed an improvement at a specific dimension, while the other showed no improvement. For this reason, I would recommend keeping the resizing dimensions at 512 × 512, which gives good results overall.
Finally, the trained FCNN seems to be insensitive to the word that is written inside the octagon, as long as there is at least one light letter on a dark background. Although this feature was not explicitly engineered, it is useful, especially if, like me, you live in a part of the world where stop signs can display "STOP", "ARRÊT", or "ARRÊT STOP".
This project was both challenging and fun! There is a lot of satisfaction in having the power to compensate for the lack of an expensive manually annotated image dataset with a simple programming trick. Let me know what you think!
[1] Blurring is the process of replacing each pixel value in an image with a weighted average of its neighbor values. In a typical scenario, the neighborhood will be a small square around the central pixel (e.g., 5×5), and the weighting will be uniform. In the remainder of this article, a "blurring size of N" means a uniform average over a neighborhood of N×N pixels around the central pixel.
[2] The training and validation images were synthesized by drawing on random background images that were downloaded from https://picsum.photos. According to the Picsum website, the images were obtained from https://unsplash.com/