The Startling Power of Synthetic Data

Prince Canuma
6 min read · Nov 28, 2018

In this article, we are going to talk about a little-known cousin of anonymized data that is helping machine learning in data-scarce environments.

This topic is personally very exciting to me because of its direct applicability in environments where real data is scarce, such as developing countries.

“Optimism is the faith that leads to achievement. Nothing can be done without hope and confidence.” — Helen Keller

We all know that deep learning has become the leading subfield of machine learning. It has been deployed across a broad spectrum of disciplines, demonstrating that by combining big data with supervised learning we can train systems to perform artificial intelligence (AI)-centric tasks previously considered impossible with traditional approaches. One of the biggest drivers of this trend has been the availability of large amounts of data with which these algorithms are calibrated, or trained.

So, before we continue, let's answer some important questions.

What is synthetic data?

According to Wikipedia (that's right, this one is straight from my buddy Wiki! For more information, you can explore the full article: Synthetic data):

from Horizon Zero Dawn’s game world inspired by Planet Earth

Synthetic data are generated to meet specific needs or certain conditions that may not be found in the original, real data. This can be useful when designing any type of system, because the synthetic data serve as a simulation or as a theoretical value, situation, etc. Synthetic data are often generated to represent the authentic data and allow a baseline to be set. Another use of synthetic data is to protect the privacy and confidentiality of authentic data. As stated previously, synthetic data are used in testing and in creating many different types of systems.
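To make that definition concrete, here is a minimal sketch (my own toy illustration, not from the Wikipedia article): we fit simple summary statistics to a "real" dataset and sample fresh synthetic values from them, so the baseline behaviour is preserved while no real record is ever exposed.

```python
# A minimal sketch: synthesize records that mimic the statistics of a
# small "real" dataset, so the real values never need to be shared.
import numpy as np

rng = np.random.default_rng(seed=42)

# Pretend these are sensitive, real measurements we cannot publish.
real_heights_cm = rng.normal(loc=170, scale=8, size=200)

# Fit simple summary statistics to the real data...
mu, sigma = real_heights_cm.mean(), real_heights_cm.std()

# ...and sample brand-new synthetic values from the fitted distribution.
synthetic_heights_cm = rng.normal(loc=mu, scale=sigma, size=1000)

print(f"real:      mean={real_heights_cm.mean():.1f}, std={real_heights_cm.std():.1f}")
print(f"synthetic: mean={synthetic_heights_cm.mean():.1f}, std={synthetic_heights_cm.std():.1f}")
```

The synthetic values track the real ones statistically, which is exactly what lets them stand in as a baseline or a privacy-preserving substitute.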

History

The generation of synthetic data dates back to 1993, when Rubin proposed the idea of fully synthetic data. Little then proposed partially synthetic data, using the idea to synthesize only the sensitive values in a public-use file.

Applications

Accelerating AI development and solutions using synthetic data

Continuing…

What happens if the problem at hand doesn't come with a treasure trove of readily available raw data? What if the company or entity trying to solve a problem can't gather the raw data, or does not have the right tools to do so? (This is a case often found in developing countries.) What if, for reasons such as privacy or scarcity, researchers are prevented from obtaining large enough samples of real-world data to train these complex artificial neural networks? Some researchers have solved this problem creatively by employing what are known as synthetic datasets: virtually constructed datasets designed to be used in the absence of real-world data in the machine learning process.
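Here is a hedged sketch of that idea (my own example, not anyone's production pipeline): when no real labelled data exists yet, a virtually constructed dataset can stand in so that model development can start at all.

```python
# A toy end-to-end run on a fully synthetic, virtually constructed dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Virtually constructed dataset: 2,000 labelled samples, 20 features,
# generated from scratch instead of collected in the real world.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy on held-out synthetic data: {model.score(X_test, y_test):.2f}")
```

Once real data eventually becomes available, the same pipeline can be re-run against it to check how well the synthetic stand-in transferred.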

“The best way to predict the future is to create it.” — Peter Drucker

Synthetic data will definitely drive the next wave of deployment and application of custom deep learning solutions in the real world, across a variety of problems involving image classification, object recognition and more.

All industries, companies and countries trying to implement a custom deep learning solution will benefit directly from this approach, since synthetic data can recreate conditions through simulation instead of requiring an authentic situation.

“Engineers like to solve problems. If there are no problems handily available, they will create their own problems.” — Scott Adams

Virtual worlds let you avoid the cost of damage, spare people from injury, and sidestep other real-world risks, while giving you an unparalleled ability to test products and interactions in any environment.

It does not end there!

Since this approach has shown some very interesting benefits in certain applications, researchers have started experimenting with it, and some startups have been born thanks to this technology. Here are some examples:

Structured Domain Randomization (SDR)

Bridging the Reality Gap by Context-Aware Synthetic Data

(For more information on this work, you can explore the full publication: https://arxiv.org/abs/1810.10093)

A sample image of this amazing work

This is a paper by Aayush Prakash et al.

In this paper, the authors present a variation of Domain Randomization (DR) that takes into account the structure and context of the scene. SDR places objects and distractors randomly, but according to probability distributions that arise from the specific problem at hand. In this manner, SDR-generated imagery enables the neural network to take the context around an object into consideration during object detection.
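To give a feel for the difference, here is an illustrative toy of my own (not the authors' code): plain DR scatters cars uniformly over the image, while a structured, SDR-style sampler first draws a road lane and then places cars along it, so object placement respects scene context.

```python
# Toy contrast between uniform DR and structured, SDR-style sampling.
import numpy as np

rng = np.random.default_rng(0)
IMG_W, IMG_H = 640, 480

def plain_dr_car_positions(n):
    # Uniform randomization: cars can land anywhere, even in the sky.
    return rng.uniform(low=[0, 0], high=[IMG_W, IMG_H], size=(n, 2))

def sdr_style_car_positions(n):
    # Structured randomization: sample a lane (a horizontal band near the
    # bottom of the image), then place cars along that lane with jitter.
    lane_y = rng.uniform(0.6 * IMG_H, 0.9 * IMG_H)
    xs = rng.uniform(0, IMG_W, size=n)
    ys = rng.normal(loc=lane_y, scale=5.0, size=n)
    return np.column_stack([xs, ys])

print(plain_dr_car_positions(3))   # cars anywhere in the frame
print(sdr_style_car_positions(3))  # cars clustered along a plausible lane
```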

According to the paper, this approach achieves competitive results on real data after training only on synthetic data for the 2D bounding-box car detection problem. Compared with previous approaches, SDR provides more variability in the geometry of the scene, while most previous approaches are static; this variability is key to successful object detection.

They also state that synthetic data has been used in a myriad of vision tasks where labelling images, besides being expensive, ranges from tedious to borderline impossible. Applications like semantic segmentation, classification, object pose estimation and 3D reconstruction have all benefited from the use of synthetic training data.

Imagine!

There is no need to label or create segmentation masks for each object; it's all taken care of for us.
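Here is a tiny sketch of those "free" labels (my own illustration): because we draw the scene ourselves, the pixel-perfect segmentation mask falls out of the very same loop that renders the image, with no human annotation needed.

```python
# Render a synthetic image and its ground-truth mask in one pass.
import numpy as np

H, W = 64, 64
image = np.zeros((H, W, 3), dtype=np.uint8)
mask = np.zeros((H, W), dtype=np.uint8)   # 0 = background, 1 = object

# "Render" a synthetic square object at a random location...
rng = np.random.default_rng(1)
y0, x0 = rng.integers(0, H - 16), rng.integers(0, W - 16)
image[y0:y0 + 16, x0:x0 + 16] = (200, 50, 50)

# ...and label it in the mask at the same time, for free.
mask[y0:y0 + 16, x0:x0 + 16] = 1

print(f"object pixels labelled automatically: {int(mask.sum())}")
```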

Training Stereo Depth Algorithms from Synthetic Image Pairs

(For more information on this work, you can explore the full publication: Evaluation of Synthetic Data for Deep Learning Stereo Depth Algorithms on Embedded Platforms by authors Kevin Lee and David Moloney.)

Sample

A very common application of computer vision is stereo depth. Put simply, stereo depth refers to a method of recovering three-dimensional depth data by comparing the disparities between the inputs of two slightly offset image sensors. Much like our own pair of eyes, the small distance separating the two points of view enables an algorithm to determine geometric information such as depth. Stereo depth is an important component of many modern applications in fields such as robotics, drones, and even consumer-centric categories such as virtual reality and smart shopping.
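For readers who have never touched stereo, here is a minimal sketch of the classical version of this idea (my own example using OpenCV's block matcher, not the paper's deep-learning approach); the file names and calibration values are placeholders, and the image pair is assumed to be rectified.

```python
# Classical stereo depth: match pixels between two offset views, then
# convert the disparity into metric depth via the camera geometry.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder files
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Match each pixel in the left image to its counterpart in the right one.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point

# Depth falls out of the geometry: depth = focal_length * baseline / disparity.
FOCAL_LENGTH_PX = 700.0   # assumed calibration value
BASELINE_M = 0.06         # assumed 6 cm between the two sensors
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = FOCAL_LENGTH_PX * BASELINE_M / disparity[valid]
```

With synthetic image pairs, the exact ground-truth disparity is known for every pixel, which is what makes them so attractive for training and evaluating these algorithms.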

Much like in other fields, researchers have begun to attack the stereo depth problem with new machine learning approaches. While visual elements such as occlusion (when objects are partially hidden by others) and reflections cause problems for conventional stereo matching algorithms, deep neural networks appear to be less prone to these challenges.

There is a lot of potential to be explored here; this can benefit many businesses and customers, as well as create new businesses. For startups, imagination is the only limit on how they can use this technology.

I personally see this technology, combined with other advances in AI, revolutionizing developing countries and speeding up their development.

“Everything comes to us that belongs to us if we create the capacity to receive it.” — Rabindranath Tagore

So, this is an interesting development, one that can bring real benefits to such countries.

Conclusion

Stay tuned for more amazing articles.

Thank you for reading. If you have any thoughts, comments or critiques, please leave them down below.

If this article was useful or insightful in some way for you, please give me a round of applause 👏👏👏 (+50) and share it with your friends.

Follow me if you want to join me on this adventure through the AI jungle. :D
