
VirtualDataLab: A Python library for measuring the quality of your synthetic sequential dataset

Including built-in open-source datasets, a synthetic data generator interface, and accuracy/privacy metrics.

Photo by American Heritage Chocolate on Unsplash

Gartner estimates that by 2022, 40% of AI/ML models will be trained on synthetic data. [1]

Indeed, synthetic data is more and more popular – I see this every day, working at a synthetic data company. If you feel like it’s time for you to pick up some synthetic data skills, please read on. I will tell you about the basics and introduce you to a cool open-source tool you will find handy going forward. Let’s start!

So, the basics. Synthetic data is created with a synthetic data generator. A synthetic data generator, or synthesizer, takes input data (also called target data) and returns output data (the synthetic data) with the same schema and statistical relationships as the target data.

Synthetic data generators can range from trivial to complex. A trivial example would be returning a shuffled sample of the target data. This is highly accurate but violates privacy as it is a direct copy of the target data. A complex example would be a deep neural network. In this case, we trade off some accuracy for privacy protection.
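To make that concrete, here is a minimal sketch (not part of VDL) of the trivial shuffle baseline in pandas:

```python
import pandas as pd

def shuffle_synthesizer(target: pd.DataFrame) -> pd.DataFrame:
    """A trivial 'synthesizer': return a shuffled copy of the target data.

    Highly accurate by construction, but it offers no privacy protection,
    since every output row is a real record from the target data.
    """
    return target.sample(frac=1.0, replace=False).reset_index(drop=True)
```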

Making synthetic data is easier than ever with research institutions continually developing new algorithms, but how do we decide which generator is the best?

Best practices for picking a synthetic data generator

Typical ways to measure quality range from looking at summary statistics to using the synthetic data in a downstream machine learning task. Unfortunately, these methods only measure how well the synthetic data mirrors the target data. But accuracy that is too high comes with serious consequences of its own: the synthetic data may copy large parts of the original data. A key benefit of synthetic data is the privacy protection of all individuals in the original data, so we want our quality measurement to also capture how well a generator preserves privacy.

There is no shortage of tools or how-to guides on creating synthetic data, but very little on how exactly to measure the utility/privacy aspects of the synthetic data. One such resource is SDGym, from the MIT Data to AI Lab. However, we wanted more features than the library offered, so we created the Virtual Data Lab.

Introducing the Virtual Data Lab

Virtual Data Lab (VDL) is a Python framework to benchmark sequential synthetic data generators in terms of accuracy and privacy.

We wanted it to:

  • Work with pandas (Python’s dataframe library)
  • Use any possible target data source for generation
  • Provide a common interface to add new synthesizers
  • Create dummy data filled with randomly generated numeric types and categorical types
  • Work with sequential data
  • Have intuitive privacy metrics
  • Provide a set of real-world datasets to benchmark on

My team at MOSTLY AI frequently uses Virtual Data Lab for rapid evaluation of modifications to our synthetic data generator. It has given us a standardized testing workflow, saving us countless hours of crafting ad-hoc, single-use reports. Along the same lines, a use case for Virtual Data Lab could be testing various synthetic data generators for a downstream machine learning task. Rather than deploying several expensive-to-train machine learning models on several different synthetic datasets, VDL can be used as a proxy for selecting the dataset closest to the original without being too close.

What’s inside?

In the current release, there are three different synthesizers included. Two of the three synthesizers are trivial, intended to be a baseline for the metrics. One synthesizer is a simple variational autoencoder implemented in PyTorch. You can write your own synthesizer by following these instructions.
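As a rough illustration of what a custom synthesizer can look like, here is a hypothetical sketch that follows a train/generate pattern; the actual base class and method signatures are spelled out in the instructions linked above, so treat the names here as assumptions:

```python
import pandas as pd

class ColumnSamplerSynthesizer:
    """Hypothetical custom synthesizer sketch.

    The real base class and exact method signatures are described in the
    VDL instructions; the train/generate names used here are assumptions.
    """

    def train(self, target_data: pd.DataFrame) -> None:
        # A real synthesizer would fit a generative model here;
        # this baseline simply stores the target data.
        self._data = target_data.copy()

    def generate(self, number_of_subjects: int) -> pd.DataFrame:
        # Sample each column independently: per-column distributions are
        # preserved, but cross-column correlations are broken.
        return pd.DataFrame({
            col: self._data[col]
                 .sample(n=number_of_subjects, replace=True)
                 .to_numpy()
            for col in self._data.columns
        })
```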

Next, we move on to the accuracy and privacy metrics. For a deeper dive, please check the readme.md.

Accuracy is measured by taking the difference in empirical distributions across columns. We summarize the errors as either the MAX Error or the L1 Distance/Sum (L1D). MAX Error reports the largest error we saw, whereas L1D provides an overall score of how much error we saw. Looking at empirical distributions gives us confidence in how well the synthetic data captured statistical trends from the target data. Distribution measurement is also used by NIST in their synthetic data challenge.
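For intuition, a simplified, single-numeric-column version of these two summaries could look like this (VDL's real implementation also covers categorical columns and sequences):

```python
import numpy as np
import pandas as pd

def distribution_errors(target: pd.Series, synthetic: pd.Series, bins: int = 10):
    """Illustrative sketch: compare empirical distributions of one numeric column."""
    # Bin both columns on a shared grid and compute relative frequencies.
    edges = np.histogram_bin_edges(pd.concat([target, synthetic]), bins=bins)
    counts_target, _ = np.histogram(target, bins=edges)
    counts_synth, _ = np.histogram(synthetic, bins=edges)
    p_target = counts_target / counts_target.sum()
    p_synth = counts_synth / counts_synth.sum()

    abs_diff = np.abs(p_target - p_synth)
    # MAX = worst single-bin error, L1D = total error across all bins.
    return {"MAX": abs_diff.max(), "L1D": abs_diff.sum()}
```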

Privacy is measured by splitting the target data into a reference set and a holdout set.

A sample of the same size as the holdout set is taken without replacement from the synthetic data. Both the synthetic sample and the holdout set are then compared to the reference set with two nearest-neighbor-based calculations.

The Distance to Closest Record (DCR) measures the distance between a point in the holdout or synthetic set and its closest point in the reference set. One common kind of bad synthetic data is the original target data perturbed by noise; the DCR is intended to capture exactly such scenarios.
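A minimal sketch of the DCR idea, using scikit-learn's nearest-neighbor search (VDL's actual implementation also handles the encoding of mixed-type sequential data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distance_to_closest_record(reference: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Distance from each query point (synthetic or holdout) to its
    closest record in the reference set."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    distances, _ = nn.kneighbors(query)
    return distances[:, 0]

# Privacy check idea: the synthetic sample should not sit systematically
# closer to the reference records than a true holdout does, e.g.
#   dcr_synthetic = distance_to_closest_record(reference, synthetic_sample)
#   dcr_holdout   = distance_to_closest_record(reference, holdout)
```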

NNDR Visualized: Image by Author

The Nearest Neighbor Distance Ratio (NNDR) is the ratio of the distance from a point to its first (closest) neighbor to the distance to its second-closest neighbor. The NNDR of a point ranges from 0 to 1. A value close to 1 implies that the point likely lies in a cluster. A value close to 0 implies that the point is likely close to an outlier. We calculate the NNDR for every point pair between the target and holdout sets to obtain an NNDR distribution. Ideally, we want the synthetic NNDR distribution to match the target NNDR distribution. By testing for discrepancies between them, we can ensure that the synthetic data is not closer to the training points than we would expect based on a holdout.
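And a matching sketch for the NNDR, again using scikit-learn and assuming the data is already numerically encoded:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(reference: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Nearest Neighbor Distance Ratio: distance to the first neighbor
    divided by distance to the second neighbor, per query point."""
    nn = NearestNeighbors(n_neighbors=2).fit(reference)
    distances, _ = nn.kneighbors(query)
    # Small epsilon guards against division by zero for exact duplicates.
    return distances[:, 0] / (distances[:, 1] + 1e-12)
```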

These two privacy metrics were designed to provide privacy guarantees on datasets. We can compare them to differential privacy, a popular mathematical bound on the amount of privacy loss. Differentially private mechanisms have to be built into the algorithm's design. However, not all synthetic data generators need differential privacy to generate high-quality, private data. Using these two metrics, we can compare differentially private and non-differentially-private synthetic data generators on an even playing field.

You also don’t need to write any code to generate synthetic data. At MOSTLY AI, we’ve released a community edition of our product. Synthesizing data is as easy as dragging and dropping your original data into your browser.

Code Demo

The code is also available in a Google Colab notebook.
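If you just want a feel for the flow before opening the notebook, it looks roughly like the sketch below. Only the synthesizer names and the benchmark module are mentioned in this post; the import paths, the dataset loader, and the function signatures here are assumptions, so check the README or the notebook for the canonical versions.

```python
# Sketch only: exact import paths and signatures may differ from the
# released library; see the README / Colab notebook for the real ones.
from virtualdatalab.datasets.loader import load_cdnow                                # assumed loader path
from virtualdatalab.synthesizers.flatautoencoder import FlatAutoEncoderSynthesizer   # assumed path
from virtualdatalab.metrics import compare                                           # assumed metrics entry point

target = load_cdnow()                                        # one of the built-in sequential datasets
synthesizer = FlatAutoEncoderSynthesizer()
synthesizer.train(target)
synthetic = synthesizer.generate(number_of_subjects=1000)    # assumed signature
results = compare(target, synthetic)                         # accuracy and privacy metrics side by side
```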

And there you have it! Generating synthetic data and comparing it to the original data, all in less than 5 lines of code.

Analyzing the results

Interpreting the results is just as easy. Here we look at the generated comparison results for the IdentitySynthesizer and FlatAutoEncoderSynthesizer on the cdnow dataset provided in VDL. This table was created with the virtualdatalab.benchmark.benchmark function. An example of this function in action with more synthesizers + datasets can be found in this Google Colab notebook.
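For reference, a call producing a table like the one below would look roughly like this; only the virtualdatalab.benchmark.benchmark function is named above, so the argument names and import paths are assumptions:

```python
# Sketch only: argument names and synthesizer/loader import paths are assumptions.
from virtualdatalab.benchmark import benchmark
from virtualdatalab.datasets.loader import load_cdnow                                # assumed path
from virtualdatalab.synthesizers.identity import IdentitySynthesizer                 # assumed path
from virtualdatalab.synthesizers.flatautoencoder import FlatAutoEncoderSynthesizer   # assumed path

summary = benchmark(
    synthesizers=[IdentitySynthesizer(), FlatAutoEncoderSynthesizer()],
    datasets={"cdnow": load_cdnow()},
)
print(summary)  # one row per synthesizer with MAX/L1D errors and privacy test results
```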

Summary Table: Image by Author

All columns labeled with MAX or L1D are measures of error. A high accuracy corresponds to low values across these columns. We notice that the IdentitySynthesizer has values close to 0. This is no surprise, as the IdentitySynthesizer simply returns splits of the target data. However, that is also precisely the reason why it failed both privacy tests. In contrast, the FlatAutoEncoderSynthesizer has slightly worse error, but passed both tests. This makes sense as the FlatAutoEncoderSynthesizer is generating completely new data, but maintains a statistical relationship with the target data.

Your criteria for whether or not to use a particular synthetic data generation technique will depend on which metrics matter most to you. If you are optimizing for privacy, be wary of any synthetic data generation method that fails the privacy tests.

Try it out yourself

To get started with Virtual Data Lab, fork the repo or try one of our Google Colab notebooks! If you are interested in contributing to the repo, please reach out to me!

Acknowledgements

This research has been supported by the "ICT of the Future" funding programme of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology.

[1] Gartner. 2018. Maverick* Research: Use Simulations to Give Machines Imagination.

