Deep learning with synthetic data will make AI accessible to the masses

Practicus AI
Towards Data Science
3 min read · Jul 11, 2018


In the world of AI, data is king. It’s what powers the deep learning machines that have become the go-to method for solving many challenging real-world AI problems. The more high quality data we have, the better our deep learning models perform.

Tech’s big five (Google, Amazon, Microsoft, Apple, and Facebook) are all in an amazing position to capitalize on this. They can collect data more efficiently and at a larger scale than anyone else, simply due to their abundant resources and powerful infrastructure. These tech behemoths train their AI on data collected from you and almost everyone you know who uses their services. The rich keep getting richer!

Amazon.com Inc. employees shop at the Amazon Go store in Seattle

The massive data sets of images and videos amassed by these companies have become a strong competitive advantage, a moat that keeps smaller businesses from breaking into their market. It’s hard for a startup or individual, with significantly fewer resources, to get enough data to compete even if their product is great. High quality data is expensive to acquire in both time and money, two resources that smaller organizations can’t afford to spend liberally.

This advantage will be overturned by the advent of synthetic data. Anyone can now create and use synthetic data to train computers across many use cases, including retail, robotics, autonomous vehicles, commerce, and much more.

Synthetic data is computer-generated data that mimics real data; in other words, data that is created by a computer, not a human. Software algorithms can be designed to create realistic simulated, or “synthetic,” data. You may have seen Unity or Unreal Engine before, game engines which make it easy to create video games and virtual simulations. These game engines can be used to create large synthetic data sets. The synthetic data can then be used to train our AI models in the same way we normally do with real-world data.

A recent paper from Nvidia showcases how to do this. Their general procedure is shown in the figure below: they generate synthetic images by randomising many aspects of the scene, including the objects, lighting position and intensity, textures, shapes, and scale.

Illustration of how to generate high quality and realistic synthetic data in a game engine, from the research paper: Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization
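To make the idea of randomising scene variables concrete, here is a minimal Python sketch. It is not the paper's code or any game engine's API: the `SceneConfig` fields, the value ranges, and the function names are all assumptions chosen for illustration. Each sampled configuration would be handed to a renderer (Unity, Unreal Engine, or similar) to produce one synthetic image.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    """One randomized scene description. All fields are illustrative."""
    num_objects: int
    light_position: tuple  # (x, y, z) in arbitrary scene units
    light_intensity: float
    texture_id: int        # index into a hypothetical texture library
    object_scale: float


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw every scene variable independently from an assumed range."""
    return SceneConfig(
        num_objects=rng.randint(1, 10),
        light_position=(
            rng.uniform(-5.0, 5.0),
            rng.uniform(-5.0, 5.0),
            rng.uniform(1.0, 10.0),
        ),
        light_intensity=rng.uniform(0.2, 2.0),
        texture_id=rng.randrange(100),
        object_scale=rng.uniform(0.5, 1.5),
    )


def sample_scenes(n: int, seed: int = 0) -> list:
    """Sample n scene configurations; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


# A renderer would turn each configuration into one training image.
scenes = sample_scenes(1000)
```

Widening the ranges exposes the model to more variety, which is the intuition behind domain randomization: with enough variation in simulation, the real world may look like just another variation.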

Being able to create high quality data so quickly and easily puts the little guys back in the game. Many early-stage startups can now solve their cold start problem (i.e., starting out with little or no data) by creating data simulators to generate contextually relevant data with quality labels in order to train their algorithms.

The flexibility and versatility of simulation make it especially valuable for autonomous vehicles, which must handle highly variable conditions; training and testing them in simulation is also much safer than doing so on real roads. Simulated data is also easier to label: because a computer creates it, the labels are known at generation time, which saves a lot of time. It’s inexpensive, and it even allows one to explore niche applications where data would normally be extremely challenging to acquire, such as the health or satellite imaging fields.
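The point about labels can be illustrated with a toy generator. This is a sketch under simple assumptions, not a realistic simulator: the "image" is a grayscale grid with one bright rectangle on a darker background, and `make_labeled_image` is a hypothetical helper. Because the program places the rectangle itself, the bounding-box label is known exactly and nobody has to annotate anything.

```python
import random


def make_labeled_image(rng: random.Random, width: int = 64, height: int = 64):
    """Create a toy grayscale image with one rectangle and its exact label."""
    background = rng.randint(0, 100)    # darker background intensity
    foreground = rng.randint(155, 255)  # brighter object intensity

    # Randomize the object's size and position inside the image bounds.
    w = rng.randint(8, width // 2)
    h = rng.randint(8, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)

    # Draw the scene: fill the background, then paint the rectangle.
    image = [[background] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = foreground

    # The label comes from the generator itself, not from a human annotator.
    label = {"x_min": x0, "y_min": y0, "x_max": x0 + w - 1, "y_max": y0 + h - 1}
    return image, label


rng = random.Random(42)
dataset = [make_labeled_image(rng) for _ in range(100)]
```

A game engine does the same thing at much higher fidelity: it knows where every object is in the scene, so it can export bounding boxes or segmentation masks alongside each rendered frame.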

The challenge and opportunity for startups competing against incumbents with an inherent data advantage is to leverage the best visual data with correct labels to train computers accurately for diverse use cases. Simulating data will level the playing field between large technology companies and startups. Over time, large companies will probably also create synthetic data to augment their real data, and one day this may tilt the playing field again. In either case, technology is advancing more rapidly than ever before and the future of AI is bright.

