
How To Generate Synthetic Images For Object Detection Tasks

A step-by-step tutorial using Blender, Python, and 3D Assets

Image created by the author

Not having enough training data is one of the biggest problems in deep learning today.

A promising solution for computer vision tasks is the automatic generation of synthetic images with annotations.

In this article, I will first give an overview of some image generation techniques for synthetic image data.

Then, we generate a training dataset with zero manual annotations required and use it to train a Faster R-CNN object detection model.

Finally, we test our trained model on real images.


Image Generation Techniques

In theory, synthetic images are perfect. You can generate an almost infinite number of images with zero manual annotation effort.

Training datasets with real images and manual annotations can contain a significant number of human labeling errors, and they are often imbalanced and biased (for example, images of cars are most likely taken from the side or front and on a road).

However, synthetic images suffer from a problem called the sim-to-real domain gap.

The sim-to-real domain gap arises from the fact that we are using synthetic training images, but we want to use our model on real-world images during deployment.

There are several different image generation techniques that attempt to reduce the domain gap.

Cut-And-Paste

One of the simplest ways to create synthetic training images is the cut-and-paste approach.

As shown below, this technique requires some real images from which the objects to be recognized are cut out. These objects can then be pasted onto random background images to generate a large number of new training images.

To generate additional synthetic training images, cut out a few real examples of your objects and then paste them on background images. Image from Dwibedi, Misra, and Hebert [1]

While Georgakis et al. [2] argue that the position of these objects should be realistic for better results (for example, an object should stand on a supporting surface such as a countertop, table, or the floor), Dwibedi et al. achieved good results with objects flying around in the scene.

A disadvantage of this approach is that we are limited by the segmented images of the real objects. For diverse data, we need images from many different angles, and we can’t change the object’s illumination afterward.
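
To make the idea concrete, here is a minimal Python sketch of such a compositing step using Pillow. It assumes you already have an RGBA cut-out of your object with a transparent background; the file names are placeholders.

import random
from PIL import Image

background = Image.open("backgrounds/000000.jpg").convert("RGB")  # placeholder background image
cutout = Image.open("cutouts/object.png").convert("RGBA")         # RGBA cut-out of the object

# Randomly scale the object and pick a paste position
scale = random.uniform(0.3, 1.0)
w, h = int(cutout.width * scale), int(cutout.height * scale)
cutout = cutout.resize((w, h))
x = random.randint(0, background.width - w)
y = random.randint(0, background.height - h)

# Paste using the alpha channel as mask; the bounding box is known by construction
background.paste(cutout, (x, y), mask=cutout)
bbox = [x, y, w, h]  # COCO-style [x, y, width, height]
background.save("synthetic_000000.jpg")

Because we place the object ourselves, the bounding box annotation comes for free, which is exactly what makes this approach so cheap.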

Photorealism

Another approach is to render images that are as realistic as possible.

This means using high-quality 3D models and textures, computationally intensive rendering engines to simulate realistic lighting, and physics simulations for object placement.

The figure below from Hodaň et al. [3] shows two photorealistic 3D environments that were used to generate synthetic training images.

Photorealistic 3D environments are used to render synthetic training images. Image from Hodaň et al. [3]

The limitation of this approach is the time, effort, and skill required to create these high-quality 3D environments and 3D objects.

Domain Randomization

The idea behind domain randomization is to make the virtual training environment as random as possible. Because each training image generated is different, real-world images look like just another variation of our training data.

Tobin et al. [4] were one of the first to use this concept to generate synthetic training images. They randomized the following parameters in their simulation environment to generate diverse data:

  • Number and shape of distractor objects
  • Position and texture of all objects
  • Position, orientation, and field of view of the virtual camera
  • Number of lights in the scene & position, orientation, and specular characteristics of the lights
  • Type and amount of random noise added to images
Domain randomization randomizes a lot of parameters to generate diverse training images. Image from Tobin et al. [4]

Because object texture and color are usually randomized, a computer vision model must learn to rely mostly on the shape of objects. This means that a lot of useful information can be lost due to strong randomization.
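
In practice, domain randomization boils down to sampling every scene parameter from a predefined range before each render. Here is a hypothetical Python sketch of such a sampling step (the parameter names and ranges are made up for illustration, not taken from a specific paper or tool):

import random

def sample_scene_parameters():
    """Randomly sample one configuration of the virtual scene."""
    return {
        "num_distractors": random.randint(0, 10),
        "camera_distance": random.uniform(0.2, 1.0),       # meters
        "camera_azimuth": random.uniform(0.0, 360.0),      # degrees
        "camera_inclination": random.uniform(0.0, 90.0),   # degrees
        "num_lights": random.randint(1, 4),
        "light_energy": random.uniform(100.0, 1000.0),     # watts
        "texture_hue_shift": random.uniform(-0.5, 0.5),
        "image_noise_std": random.uniform(0.0, 0.05),
    }

# A new configuration is sampled for every rendered image
params = sample_scene_parameters()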


Creating a Synthetic Object Detection Dataset Using Blender And Python

Now that we have a general understanding of the different approaches, let’s generate images using Blender.

Blender is a very popular free and open-source software. It has a render engine called Cycles that we can use to generate photorealistic renderings of 3D models.

It also has a Python API that we can use to automatically generate lots of images, including annotations for each image.

Install Blender

First, download Blender version 2.93.18 (LTS) from download.blender.org/release.

On Linux, download and unpack blender-2.93.18-linux-x64.tar.xz into a folder of your choice, for example ~/synth-data/blender-2.93.18-linux-x64. For Windows, use blender-2.93.18-windows-x64.zip.

Blender 3.3.16 also seems to work fine, but newer versions do not seem to work on my computer for some reason.

You can open Blender in a terminal by typing ~/synth-data/blender-2.93.18-linux-x64/blender. This should open the Blender start screen. The idea is to render images as seen by the virtual camera.

Blender’s default start screen. In the middle is the active workspace, at the top you can switch to different modes, at the top right are our scene objects, and at the bottom right are the properties. Image by author

blender-gen

To automatically generate training images and bounding box annotations with Blender using the Python API, I’m using the open-source tool blender-gen. I created blender-gen specifically for research purposes to try out different features [5].

Clone the repository into a folder of your choice, for example ~/synth-data/blender-gen.

To see if everything works, let’s test it. The following command will start Blender in background mode and run the Python script main.py.

cd ~/synth-data/blender-gen
~/synth-data/blender-2.93.18-linux-x64/blender --background --python main.py

After a few seconds, you should have rendered some sample JPG images in the folder ~/synth-data/blender-gen/DATASET/Suzanne/images/ using Blender’s monkey object named Suzanne and some distractor cubes.

In this tutorial, we will generate training images for an apple detection model.

3D Models

First, we need our main 3D model of the object we want to detect.

You can download assets from ambientCG.com and polyhaven.com that are free to use and licensed under the Creative Commons CC0 1.0 Universal License.

As our object of interest, download, for example, 3DApple001_LQ-1K-PNG.zip from ambientCG and unpack all files into the folder ~/synth-data/blender-gen/models/Apple/.

3D Apple 001 model from ambientCG.com (CC0 license)

As additional distractor objects, I use the Avocado, Lemon, and Pear models shown below, and unzip these files into ~/synth-data/blender-gen/distractors/Avocado, ~/.../Lemon, and ~/.../Pear, respectively.

Distractor objects are used to confuse the deep learning model so that it doesn’t just learn to recognize rendered foreground objects. They also add occlusion when placed in front of our main 3D model.

Distractor objects: 3D Lemon 001, 3D Avocado 001, and 3D Pear 002 from ambientCG.com (CC0 license)

Background Images

Random images are used as backgrounds for our rendered 3D models. You can use any images you like. I always use the COCO image dataset. Specifically, I use the 2017 validation images, which consist of 5000 images.

Put all your background images into the folder ~/synth-data/blender-gen/bg/.
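
If you want to use the same images, one way is to download and unpack the COCO 2017 validation set and then move the JPG files into the bg folder, as described above:

cd ~/synth-data/blender-gen
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip
mv val2017/*.jpg bg/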

HDR Image Lighting

Next is lighting. Lighting can be set up manually by creating point lights and specifying their number, location, intensity, and color.

A much simpler approach is to use image-based lighting with High Dynamic Range (HDR) images. With image-based lighting, the light comes from a 3D image of a real scene.

Go to ambientCG.com or polyhaven.com, download a few .exr or .hdr files, and put them into the folder ~/synth-data/blender-gen/environment/.

An HDR image can be used for photorealistic image-based lighting. This one is "Day Sky HDRI 041 A" from ambientCG.com (CC0 license)

An HDR image provides realistic lighting and reflections, resulting in photorealistic renderings of our 3D models.

By changing the HDRI emission strength, we can control the strength of the lighting and thus the brightness of our 3D models.

Renderings with different HDRI emission strengths. Image by author
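
For reference, setting up image-based lighting via Blender's Python API looks roughly like this (a minimal sketch assuming Blender's default world node tree; blender-gen already does this for you, and the file path is a placeholder):

import bpy

world = bpy.context.scene.world
world.use_nodes = True
nodes = world.node_tree.nodes
links = world.node_tree.links

# Load an .hdr/.exr file as the environment texture
env = nodes.new(type="ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("//environment/sky.exr")  # placeholder path

# Connect it to the Background node and set the emission strength
background = nodes["Background"]
links.new(env.outputs["Color"], background.inputs["Color"])
background.inputs["Strength"].default_value = 1.0  # higher value = brighter scene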

Configuration

Configuration is done by simply editing the config.py file.

Starting with our initial config.py, open it and edit the following lines:

self.out_folder = 'Apple'  # output folder name under DATASET/
self.model_paths = ['./models/Apple/3DApple001_LQ-1K-PNG.obj']  # main 3D model of the object to detect
self.distractor_paths = ['./distractors/Lemon', './distractors/Pear', './distractors/Avocado']  # folders with distractor 3D models
self.object_texture_path = '' # use default texture
self.distractor_texture_path = '' # use default textures
self.cam_rmin = 0.15 # minimum camera distance
self.cam_rmax = 0.8  # maximum camera distance
self.resolution_x = 640  # output image width
self.resolution_y = 360  # output image height
self.number_of_renders = 1  # number of rendered images

And then run:

cd ~/synth-data/blender-gen
~/synth-data/blender-2.93.18-linux-x64/blender --python main.py

This will cause Blender to create a 3D scene with our 3D model of an apple, some distractor objects, a virtual camera, HDRI lighting, and a random background image.

Blender 3D scene (left) and the resulting rendered image with a background image (right). Image by author

The camera is placed in the 3D scene using spherical coordinates: radial distance, inclination angle, and azimuthal angle.

Each parameter is defined by a minimum and a maximum value. For each rendering, the parameters are randomly sampled within the given interval. The camera distances cam_rmin and cam_rmax define how large our objects appear in the image.
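
To make this concrete, here is a small Python sketch of that sampling step (not the exact blender-gen code): sample the three spherical coordinates within their intervals and convert them to a Cartesian camera position.

import math
import random

def sample_camera_position(r_min, r_max, inc_min, inc_max, azi_min, azi_max):
    """Sample spherical coordinates and convert them to x, y, z."""
    r = random.uniform(r_min, r_max)                # radial distance
    inclination = random.uniform(inc_min, inc_max)  # angle from the z-axis, in radians
    azimuth = random.uniform(azi_min, azi_max)      # angle around the z-axis, in radians
    x = r * math.sin(inclination) * math.cos(azimuth)
    y = r * math.sin(inclination) * math.sin(azimuth)
    z = r * math.cos(inclination)
    return x, y, z

# Example: camera distances from config.py, angles sampled over the full sphere
print(sample_camera_position(0.15, 0.8, 0.0, math.pi, 0.0, 2 * math.pi))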

Generated Synthetic Images and Annotations

Running the following command will quickly visualize the generated bounding boxes:

python show_annotations.py
Visualization of a rendered image with a bounding box annotation as a green rectangle. Image by author

The generated images are stored in the folder ~/synth-data/blender-gen/DATASET/Apple/images.

The annotations are stored in the file ~/synth-data/blender-gen/DATASET/Apple/annotations/instances_default.json.

If we open the annotation file, we can see the bounding box (bbox) annotations for each image in Microsoft COCO data format:

{
  "images": [
    {
      "id": 0,
      "file_name": "000000.jpg",
      "height": 360,
      "width": 640
    }
  ],
  "annotations": [
    {
      "id": 0,
      "image_id": 0,
      "bbox": [
        470.81,
        201.34,
        79.22,
        68.45
      ],
      "category_id": 1,
      "segmentation": [],
      "iscrowd": 0,
      "area": 5423.04,
      "keypoints": [],
      "num_keypoints": 0
    }
  ],
  "categories": [
    {
      "supercategory": "",
      "id": 1,
      "name": "Apple",
      "skeleton": [],
      "keypoints": []
    }
  ]
}
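
As an example of how to consume this file, here is a minimal sketch that loads the JSON with the standard library and draws the first bounding box with Pillow (run from the blender-gen folder, so the relative paths match the dataset layout above):

import json
from PIL import Image, ImageDraw

with open("DATASET/Apple/annotations/instances_default.json") as f:
    coco = json.load(f)

ann = coco["annotations"][0]
image_info = next(img for img in coco["images"] if img["id"] == ann["image_id"])

x, y, w, h = ann["bbox"]  # COCO format: [x, y, width, height]
image = Image.open(f"DATASET/Apple/images/{image_info['file_name']}")
ImageDraw.Draw(image).rectangle([x, y, x + w, y + h], outline="green", width=2)
image.save("bbox_preview.jpg")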

Now it’s time to generate a larger data set. In the config.py file, change the following line to the number of training examples:

self.number_of_renders = 2000

And then run Blender:

cd ~/synth-data/blender-gen
~/synth-data/blender-2.93.18-linux-x64/blender --background --python main.py

Using this synthetic training dataset, consisting of 2,000 JPG images at 640×360 pixels, we can now train an object detection model of our choice.


Training and Testing an Object Detection Model on Real-World Data

Now that we have our synthetic dataset consisting of images and bounding box annotations in the COCO data format, we can train an object detection model.

For validation purposes, I generated an additional 100 images with a different random seed.

Model Training

I have already written an article on how to train a Faster R-CNN model in PyTorch using a synthetic toy dataset created with blender-gen.

How to Train a Custom Faster RCNN Model In PyTorch

A higher-level alternative is using MMDetection, which is a framework based on PyTorch.

Learn How to Train Object Detection Models With MMDetection

For this tutorial, I re-used my code from How to Train a Custom Faster RCNN Model In PyTorch and trained my model with 2000 synthetic training examples for 12 epochs using Stochastic Gradient Descent (SGD) with a learning rate of 0.00001.
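
In essence, the training setup looks like this (a condensed sketch rather than my full training code; it assumes torchvision and a DataLoader named train_loader that yields lists of image tensors and COCO-style target dicts):

import torch
import torchvision

# Faster R-CNN with a MobileNetV3 backbone; 2 classes = background + apple
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(
    weights=None, weights_backbone="DEFAULT", num_classes=2
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.00001)

model.train()
for epoch in range(12):
    for images, targets in train_loader:  # train_loader: DataLoader over the synthetic dataset (not shown)
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()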

Training a Faster R-CNN apple object detection model on 100% synthetic training and validation images. Image by author

Training my apple object detection model (Faster R-CNN with MobileNet backbone) for 12 epochs took about 20 minutes on a GeForce RTX 3060 GPU.

After that, the training and validation losses plateaued, which means that model training is done 🚀.

Let’s see the inference result on a random synthetic validation image.

Model inference (red rectangle) on a generated validation image. Image by author

Model Testing on Real-World Images

Finally, let’s test our model on unseen test images taken with my smartphone.
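
Inference follows the standard torchvision pattern (a minimal sketch; test.jpg stands in for one of the smartphone photos, and model is the trained Faster R-CNN from above):

import torch
from PIL import Image
from torchvision.transforms import functional as F

image = Image.open("test.jpg").convert("RGB")  # placeholder: a real photo of an apple
tensor = F.to_tensor(image)

model.eval()
with torch.no_grad():
    prediction = model([tensor])[0]

# Keep only confident detections
keep = prediction["scores"] > 0.5
print(prediction["boxes"][keep], prediction["labels"][keep], prediction["scores"][keep])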

Model inference (red rectangle) on three test images taken with my smartphone. Image by author

Even though the model has not seen any real images during training, which means that there is a sim-to-real domain gap, the model can still accurately detect the apple.


Conclusion

In this article, we covered three different techniques for generating synthetic training images: cut-and-paste, photorealism, and domain randomization.

In the step-by-step tutorial, we generated synthetic training images using Blender and blender-gen. Using this data, we can train an object detection model without any manual annotation effort.

For this tutorial, I used a free 3D model of an apple as an example. However, you can use any 3D model that you like to generate synthetic images and then train another object detection model.

If the sim-to-real domain gap is too large and performance on real test images is not good enough, I have found that fine-tuning a base synthetic model on a small training set of real images works quite well.


References

[1] D. Dwibedi, I. Misra, M. Hebert, Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection (2017), IEEE International Conference on Computer Vision (ICCV)

[2] G. Georgakis, A. Mousavian, A. C. Berg, J. Kosecka, Synthesizing Training Data for Object Detection in Indoor Scenes (2017), Robotics: Science and Systems XIII

[3] T. Hodaň et al., Photorealistic Image Synthesis for Object Instance Detection (2019), IEEE International Conference on Image Processing (ICIP)

[4] J. Tobin et al., Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (2017), IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

[5] L. Eversberg, J. Lambrecht, Generating Images with Physics-Based Rendering for an Industrial Object Detection Task: Realism versus Domain Randomization (2021), MDPI Sensors 21 (23)
