Data Science by Hazy

The Many Use Cases for Synthetic Data

How privacy-protecting synthetic data can help your business stay ahead of the competition.

Ravi Malde
Towards Data Science
6 min read · Aug 25, 2020

A 2016 study found that, after just 15 minutes of monitoring a driver’s braking patterns, researchers were able to identify that driver with an accuracy of 87 percent. Turns out, the way that you press the brake pedal is almost entirely unique to you.

This sensitivity of data extends into every aspect of our lives. That fancy hipster coffee that you buy at your favourite cafe also leaves a data trail of behaviour. And companies are champing at the bit to get hold of this data so that they can formulate new business strategies that aim to attract your business. This is why privacy protection laws, like Europe’s GDPR, are rapidly changing the data landscape by prioritising consumer protection, giving you the right to be forgotten, and controlling who has the right to own and access your data.

This is where the magic of synthetic data comes in. Synthetic data is generated using machine learning algorithms that ingest the real data, train on the patterns of behaviour, and then expel entirely artificial data that retains the statistical characteristics of the original dataset. This should be distinguished from the more traditional anonymised datasets that are actually quite vulnerable to re-identification techniques. Since synthetic data is inherently artificial, this vulnerability does not apply.
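To make the "ingest, train, expel" loop concrete, here is a minimal sketch. It is not Hazy’s generator: it assumes a toy dataset with two made-up numeric columns (age and monthly spend) and swaps the machine learning model for a plain two-dimensional Gaussian fit. The function names `fit_gaussian_2d` and `sample_synthetic` are hypothetical, and a fit this simple carries no formal privacy guarantee. It only illustrates the idea of learning statistical patterns and then sampling brand new rows from them.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Learn means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample_synthetic(params, n, rng):
    """Draw n artificial rows that follow the fitted distribution."""
    mean_x, mean_y, std_x, std_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mix z1 into y so the two columns keep the learned correlation.
        y = mean_y + std_y * (rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" data: (age, monthly spend), invented for this example.
rng = random.Random(0)
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 5 + 0.5 * age + rng.gauss(0, 3)))

synthetic = sample_synthetic(fit_gaussian_2d(real), 2000, rng)

print("real params:     ", fit_gaussian_2d(real))
print("synthetic params:", fit_gaussian_2d(synthetic))
```

The two printed summaries should be close to each other, even though no synthetic row is a copy of a real one. A production generator would need to handle many columns, categorical fields and non-Gaussian shapes, which is where the machine learning comes in.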

Due to the privacy-preserving nature of synthetic data, it is not governed by the same data protection laws. Machine learning engineers and data scientists can confidently use this synthetic data for their analyses and modelling, knowing that it will behave in the same manner as the real data. This simultaneously protects customer privacy and mitigates risk for the companies that leverage it — all while unblocking data that is otherwise frozen behind compliance barriers… often for many months or even years.

Since the end of June, I have been a data science intern at Hazy, a synthetic data company. The Hazy team has built a sophisticated synthetic data generator and enterprise platform that helps customers unlock their data’s full potential, increasing the speed at which they are able to innovate, while minimising risk exposure.

Synthetic data use cases.

Now that you’ve been introduced to synthetic data and the high-level problems that it can help solve, let’s get into some more detailed synthetic data use cases.

Vendor evaluations.

Picture this. You work at an organisation that is looking to outsource some work, such as app development, testing, data science, analytics or business intelligence. Like with any big purchase, you want to test drive before you buy. Often this means handing real, and highly sensitive, data to third parties, which is not only a security risk, but can also take anywhere from six to 18 months to clear legal and procurement hurdles. This is a lot of hassle considering that it’s all just to determine whether or not you want to partner with this vendor.

Because synthetic data is no longer sensitive, using it eliminates the lag in this process. AI-generated synthetic data can be representative enough that, if you choose to work with that vendor, you could also eliminate the risk of security compromises down the line by continuing to build only on artificial data.

Sharing data with third-party services.

In a similar vein to vendor evaluation, using third-party services such as online applications or cloud compute resources would require handing over sensitive data to that service. The same goes for sharing data with third parties for external analytics. Due to hardware limitations, a business may not be able to keep all of its data on-premises, and therefore needs to use an online storage platform or a faster cloud provider. However, compliance rules often dictate that this data must remain on-premises. Along with the usual headache of compliance, this can (and should) be a significant worry for companies, as a security breach can leave both your customers and your reputation vulnerable. With synthetic data it’s all Hakuna Matata.

Data monetisation.

Many business models these days are entirely based around monetising the data that they collect from their user base. If you’re not paying for the product then it’s more than likely that this is the case. Companies can collect data, conduct analyses, and sell any of the insights on to external businesses that have a vested interest. Some organisations sell the raw data so that the external companies can conduct their own nuanced analyses, but this comes with many more regulatory compliance issues, and often the data is deemed too sensitive to do so.

With synthetic data, compliance and risk are no longer issues, so the value of that data and the speed at which value can be generated from it are drastically increased. Companies may even be able to generate entirely new revenue streams. After all, the value of most data isn’t the personal information, but the insights gained from it. Plus, synthetic data is more flexible than real data, as it can be infinitely automated, amplified and enriched, opening up even more monetisation opportunities.

Cross-organisational data portability.

Restrictions on the transfer of data are not limited to dealings with external companies. Within one organisation, there can be many compliance criteria that must be met before data can be passed between departments, and this can often take weeks. It can take even longer if it involves sharing across geographical boundaries and regulations.

Being able to create a safe and synthetic dataset means that organisations can have centralised data repositories (often called data pools) that can be managed by simple role-based access control. For example, banks have a particular wealth of data in their customers’ transaction histories. By pooling synthetic twins of this data, banks can share it safely among data scientists from multiple departments and across borders.

This unprecedented level of collaboration can be used for training on much larger datasets that unearth more patterns for better money laundering and fraud detection algorithms. With the freedom to share information internally, enterprises can innovate and act on new data much faster, whether for personalised marketing or for detecting international crime. This gives businesses a significant edge over competitors that have more traditional data lifecycles and artificial barriers to innovation.

Data retention.

Regulations are also in place that limit the amount of time a company is able to keep hold of personal data, making it very difficult to conduct longer-term analyses, such as when trying to detect seasonality over several years. Remember, synthetic data is not governed by the same privacy protection laws: while it retains the customer usage patterns, it’s utterly artificial. With no risk of re-identification, companies are able to hold onto their synthetic data for as long as they wish, and can come back to it any time in the future to conduct analyses that were not being carried out, or were not even technologically feasible, at the time of data collection.

Simulating unforeseen events.

Preparation is usually better than a knee-jerk reaction. More and more companies are looking to use data to prepare for unforeseen circumstances, never more so than now in these unprecedented times. This kind of preparedness is now possible thanks to conditional synthetic data generation. It’s possible to take a ‘normal’ or precedent dataset, add conditions to the generator, and output a synthetic dataset that is representative of events that have never occurred before, which allows you to analyse, model and subsequently prepare for such circumstances.

Conditional synthetic data use cases can range from predicting customer behaviour if there’s a second wave of this pandemic, to estimating the probability that a type of cancer will metastasise, to modelling the effects of global heating. More generally, it could combine customer behaviour in one country with open public data sources to predict how a product or service would perform in a completely new location.
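As a rough illustration of "adding conditions to the generator", here is a small sketch. Everything in it is an assumption made for the example: the data (weekly store footfall and revenue) is invented, the helper names `fit` and `sample_given_x` are hypothetical, and the "generator" is just a two-column Gaussian rather than a real synthetic data model. The condition fixes footfall at a level far below anything in the ‘normal’ data, and the generator fills in what revenue might plausibly look like.

```python
import math
import random
import statistics


def fit(rows):
    """Learn means, standard deviations and correlation of two columns."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_given_x(params, x_condition, n, rng):
    """Sample y from the fitted Gaussian, conditioned on x == x_condition."""
    mx, my, sx, sy, rho = params
    cond_mean = my + rho * (sy / sx) * (x_condition - mx)
    cond_std = sy * math.sqrt(max(0.0, 1 - rho ** 2))
    return [(x_condition, rng.gauss(cond_mean, cond_std)) for _ in range(n)]


# Hypothetical 'normal' weeks: (footfall, revenue), invented for this example.
rng = random.Random(1)
normal_weeks = []
for _ in range(1000):
    footfall = rng.gauss(1000, 100)
    normal_weeks.append((footfall, 20 * footfall + rng.gauss(0, 500)))

# Condition on a footfall level that never appears in the normal data.
scenario = sample_given_x(fit(normal_weeks), 400, 500, rng)

print("lowest footfall seen:", min(x for x, _ in normal_weeks))
print("mean scenario revenue:", statistics.fmean(y for _, y in scenario))
```

The catch is that this extrapolates: it assumes the relationship learned from normal weeks still holds in the unseen scenario. Any conditional generator rests on some assumption of that kind, so the output is best treated as a planning aid rather than a forecast.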

Don’t be left behind.

Ninety percent of the world’s data was created in the last two years, with 2.5 quintillion bytes of new data being captured every day. The data economy is already a highly regulated space, but with data’s current trajectory, it is likely to become even more so as governments and regulatory organisations rush to catch up with the unfathomable amount being collected.

Businesses that use synthetic data will be one step ahead of the competition. It will increase the speed at which you are able to develop new products, create fresh partnerships with third parties, and even generate entirely new revenue streams, all while substantially reducing your risk exposure.
