Learning from Simulated Data

Connor Shorten
Towards Data Science
5 min read · Feb 21, 2019


One of the biggest problems plaguing Deep Neural Networks is that training them requires massive labeled datasets. Some researchers are trying to overcome this limitation with research into areas such as Meta-Learning and few-shot learning. However, for the time being, massive datasets are needed for success in tasks such as object detection.

This article discusses combining Graphics Engines such as Unity with GANs to build big labeled datasets.

Graphics engines used to power video games, such as Unity, are more than capable of producing massive labeled datasets. Deep Learning agents could be trained in video game environments to perform tasks such as autonomous navigation. However, there is still a gap between simulated and real data which Deep Convolutional Neural Networks cannot overcome on their own.

Generative Adversarial Networks come to the rescue of this problem. GANs can use the loss function of ‘real’ versus ‘fake’ to push simulated data closer to real images. This enables Deep Learning models to take advantage of graphics engines that can produce enormous labeled datasets. This article will explain the details of the SimGAN model presented by researchers at Apple. The paper focuses on moving simulated eye images from the Unity engine towards real images from the MPIIGaze dataset for the task of eye gaze estimation.

High-level overview of the SimGAN model

Following is a link to the paper, Learning from Simulated and Unsupervised Images through Adversarial Training (Shrivastava et al., 2017): https://arxiv.org/abs/1612.07828

Introduction and Results

Before diving into the details of how this model was implemented, here are some results from the experiment to motivate your interest. One evaluation was a Visual Turing Test, in which human labelers are asked to decide whether images are ‘real’ or ‘fake’. The original simulated data was labeled correctly by human labelers at a rate of 162/200 (81%). After applying SimGAN, the human labelers performed no better than random guessing, at a rate of 517/1000 (51.7%).

Additionally, a model trained on the synthetic data alone classified eye gaze on the real dataset at a rate of 64.9%. Once the data was refined with the SimGAN, a model trained on the refined data classified the real dataset at a rate of 87.2%. Improving the data with the GAN resulted in a 22.3-point gain in classification performance!

Following is a description of the major techniques used to make the SimGAN work so well:

Adversarial Loss with Self-Regularization

In this GAN formulation, the generator refines the simulated eye images so that the discriminator cannot tell the difference between refined and real images. However, GAN training can be unstable, and without additional regularization there is no guarantee that the generator will refine an image in a way that preserves its original label. Note that the entire point of this approach is that the simulated images come pre-labeled, so any enhancement of the images must preserve those labels. Directly quoting the paper,

“The learned transformation should not change the gaze direction.”

The self-regularization loss used in the SimGAN penalizes the generator for producing an image that is massively different from the original image. This is done with the following loss:

Regularization Loss Term
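(Since the loss is shown as an image, here is a LaTeX reconstruction based on the description below; the notation is mine: x is the original simulated image, R_theta(x) is the refined image, and lambda weights the term.)

```latex
\ell_{\text{reg}}(\theta; x) = \lambda \, \lVert R_\theta(x) - x \rVert_1
```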

The loss above adds a term equal to the L1 distance between the refined image and the original image. Note that this is a per-pixel loss, meaning that each individual pixel in the refined image is compared with the corresponding pixel in the original image. Adding this to the adversarial loss results in the following objective for the generator:

Generator loss for the SimGAN model. Top component: adversarial loss with respect to the discriminator. Bottom component: self-regularization term, which ensures the generator does not dramatically alter the original image.

The phi(x) term (sorry, not sure how to add Greek symbols to a Medium article) refers to a feature mapping applied to both the refined image and the original image. This could be the identity mapping phi(x) = x, image derivatives, the mean of the color channels, or a learned transformation. This paper simply uses the identity mapping phi(x) = x.
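Putting the two pieces together, the full generator objective described in the caption above can be reconstructed in LaTeX as follows (again my notation: D is the discriminator, R_theta the refiner/generator, and x_i the i-th simulated image):

```latex
\mathcal{L}_R(\theta) = -\sum_i \log\!\big(1 - D(R_\theta(x_i))\big)
    + \lambda \,\big\lVert \phi(R_\theta(x_i)) - \phi(x_i) \big\rVert_1
```

The first term rewards the generator for fooling the discriminator, while the second penalizes it for straying too far from the original simulated image.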

Local Adversarial Loss

One of the biggest problems with GANs in general is that the generator tends to produce artifacts, such as a dog with three legs or a frog with two heads. One solution to this is to have the discriminator criticize local image patches rather than the entire image. The implementation of this in the SimGAN model is similar to the PatchGAN discriminator but differs in the output dimensions of the discriminator.

The objective of this loss is for the discriminator to classify each local patch as real or fake. The SimGAN implements this by having the discriminator output a probability map of size w × h, where w × h is the number of local patches to be classified. This signal is propagated back to the generator by taking the cross-entropy loss over each local patch's probability output.
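To make the mechanics concrete, here is a minimal PyTorch sketch of this per-patch loss. This is my own illustration, not the authors' code, and it assumes the discriminator's final layer applies a sigmoid so that each patch output is a probability:

```python
import torch
import torch.nn.functional as F

def local_adversarial_loss(patch_probs, label_as_real):
    """Cross-entropy over a (batch, 1, w, h) map of per-patch
    'real' probabilities, averaged over all local patches.

    patch_probs: discriminator output, assumed already passed
                 through a sigmoid so each value lies in (0, 1).
    label_as_real: True when the inputs were real images,
                   False when they were refined (fake) images.
    """
    target = (torch.ones_like(patch_probs) if label_as_real
              else torch.zeros_like(patch_probs))
    # Per-patch binary cross-entropy, averaged over patches and batch,
    # so every local region contributes a gradient signal.
    return F.binary_cross_entropy(patch_probs, target)
```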

Updating the Discriminator Using a History of Refined Images

Another big problem with GANs is that Deep Neural Networks struggle with online/continual learning. This refers to the task of learning from a distribution which is constantly changing. In this case, the distribution of images created by the generator is constantly changing and thus the discriminator is subject to continual learning issues such as ‘catastrophic forgetting’.

The SimGAN model combats the challenge of learning from a non-stationary distribution by keeping a buffer of refined images from previous iterations. If b is the batch size of images created by the generator, the discriminator is then shown b/2 of the currently generated images as well as b/2 images sampled from the buffer.
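Here is a minimal Python sketch of such a history buffer; the class and method names are my own invention for illustration, not from the paper:

```python
import random

class ImageHistoryBuffer:
    """Holds refined images from previous iterations (illustrative sketch)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.images = []

    def sample_and_update(self, refined_batch):
        """Mix half of the current refined batch with half drawn from
        history, then refresh the buffer with new refinements."""
        b = len(refined_batch)
        half = b // 2
        if len(self.images) >= half:
            # Half the discriminator's fake batch comes from history...
            mixed = random.sample(self.images, half) + list(refined_batch[:b - half])
        else:
            # ...unless the buffer has not filled up yet.
            mixed = list(refined_batch)
        # Randomly replace buffer entries with fresh refinements so the
        # history tracks the generator's recent output distribution.
        for img in refined_batch[:half]:
            if len(self.images) < self.max_size:
                self.images.append(img)
            else:
                self.images[random.randrange(self.max_size)] = img
        return mixed
```

At each training step, the discriminator would then be trained on the mixed batch rather than on the freshly refined images alone, which blunts the effect of the generator's shifting output distribution.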

Discussion

One of the interesting components of the SimGAN is that the generator starts from a simulated image rather than a random vector. I wasn't able to figure out from the paper whether the discriminator is additionally conditioned on the prior image, although given the details of the local adversarial loss, I don't think this is the case. It would be interesting to see if that extra conditioning could improve the model.

The local adversarial loss is another very interesting concept in the SimGAN model. I wonder what the authors would think of using a self-attention block instead. The self-attention block seems well suited to combating artifacts introduced by the generator, and it has the added benefit of modeling long-range dependencies, which the local loss will not capture. However, in this case, as in applications such as super-resolution, long-range dependencies aren't really the primary challenge for the GAN model; it is the local statistics that really differentiate simulated from real images.

This was a very interesting paper to read, and I am excited about the potential of using GANs to build Deep Learning models trained on simulated data from graphics engines such as Unity! Thank you for reading!
