Synthetic Data at the VHA

Approaches in Patient Privacy Preservation

Avik Bhattacharya
Towards Data Science

--

With contributions from (in no particular order): Purnell, Amanda L. (VHA Innovation Ecosystem); Howarth, Gary S. (NIST); Bhattacharya, Avik (Booz Allen Hamilton Inc.); McAuley, Erin (Booz Allen Hamilton Inc.); Hunter-Zinck, Haley (Sage Bionetworks); Watford, Sean (EPA); Task, Christine (KNexus Research).


Introduction

The rise of digital health data is driving significant evolution of the healthcare sector. Biomedical data, including clinical notes, medical imaging, wearable fitness tracker readings, and genomic sequences, can be used by researchers and clinical practitioners to find novel biomarkers, potentially improving the efficacy of care delivery. Additionally, collection of non-health data such as social determinants of health (SDoH) (e.g., income, educational level, employment history, food security, and housing status) can help shed light on geographical or population-level healthcare disparities.

At the same time, the opportunities emerging from this wealth of digital data can come at a steep cost to individual privacy. As data sharing broadens across consumers, data brokers, and technology platforms, the risk to patient privacy has grown.

Statistical and cryptographic methodologies for use with digital health data are currently under development. These privacy-preserving data mining (PPDM) techniques can add noise to data, obscure sensitive details, or enable researchers to use the data for machine learning without ever moving the data. PPDM techniques aim to quantify the trade-offs between data utility and privacy. For example, how much noise can be added to data before scientific utility is compromised? Which use cases are feasible under cryptographic protocols?

To accelerate adoption of PPDM techniques within the healthcare sector, the community would benefit from developing norms around the use of these techniques. How will clinical practitioners and researchers successfully map PPDM methods to specific use cases, and what are the criteria for doing so? What are the standards and quantitative thresholds for health privacy losses?

We offer a discussion of issues and considerations for the integration of PPDM techniques within the biomedical research and clinical enterprises.

What is Synthetic Data?

Briefly, synthetic data is data that mimics realistic patterns but does not correspond to real data records. If data can be generated that reflects the patterns in clinical EHR data without any identifiable correlation to the actual data records used in their generation, this would have significant benefits for data privacy in the research and clinical contexts.

Synthetic data has the potential to address the two seemingly conflicting goals of realism and privacy. It must reflect the feature frequencies and correlations of the underlying real data, and models and predictive algorithms trained on it should perform comparably to those trained on the real data set. At the same time, the privacy risk must be minimized, ideally to zero: the probability of inferring which real records were used to train the generator should remain as low as possible.

Applications for synthetic health data relevant to this audience include:

· Access to hitherto unavailable data for research groups and clinicians.

· Ability to augment training data sets to balance under-represented groups and outcomes in the set.

· Ability to study clinical data over time to identify clusters and patterns.

Generation of Synthetic Data


Synthetic data is typically generated via different statistical methods, depending on the intended use of the data. Rule-based methods (such as Synthea) and Generative Adversarial Network (GAN)-based methods (such as CorGAN and medGAN) have received the most attention thus far for health care data. Other techniques are used outside the medical field, such as multiple imputation, which a variety of national statistical agencies apply to survey data. Probabilistic Graphical Models (PGMs) are also very powerful and are starting to see more use. Numerous commercial applications have also been developed by companies such as Syntegra.io, MDClone, Unlearn, and Diveplane.
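As a point of reference for these model families, the simplest possible generator samples each field independently from its empirical marginal distribution. The toy sketch below (the records and field names are hypothetical, not from any real data set) preserves feature frequencies but destroys cross-field correlations, which is exactly the gap that GAN- and PGM-based methods aim to close:

```python
import random
from collections import Counter

def independent_marginal_synthesis(records, n, seed=0):
    """Naive baseline: draw each field independently from its
    empirical marginal. Feature frequencies are preserved, but
    cross-field correlations are lost."""
    rng = random.Random(seed)
    fields = list(records[0].keys())
    # Per-field value counts estimated from the real records.
    marginals = {f: Counter(r[f] for r in records) for f in fields}
    synthetic = []
    for _ in range(n):
        row = {}
        for f in fields:
            values = list(marginals[f].keys())
            weights = list(marginals[f].values())
            row[f] = rng.choices(values, weights=weights)[0]
        synthetic.append(row)
    return synthetic

# Hypothetical toy records -- not real patient data.
real = [
    {"age_band": "60-69", "dx": "diabetes"},
    {"age_band": "60-69", "dx": "diabetes"},
    {"age_band": "20-29", "dx": "asthma"},
]
fake = independent_marginal_synthesis(real, n=100)
```

Because each field is sampled on its own, roughly a third of the synthetic rows will pair "20-29" with "diabetes", a combination absent from the real records; richer generators exist precisely to avoid this.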

Different models for the generation of synthetic data offer different tradeoffs in terms of risks, benefits, requirements, effort, and cost of use, etc. We attempt to capture some of the differences in synthetic data output from key different models of data generation in the chart below.

Chart summarizing various features of rule-based and GAN-based models.

Selection of Synthetic Data Method/Tool

Privacy Versus Utility (Liu et al. [1])

Selecting the appropriate algorithm and tool for generating synthetic data is critical to the success of clinical and research projects that seek to extract meaningful information from patient data. A successful project must balance a low risk of an individual being identified in the data against a low risk of the noisy data leading researchers to an incorrect conclusion. A useful sanity check is to compare the noise introduced by synthesis to the noise of sampling error. This trade-off is highlighted in the graph segment above (Privacy Versus Utility), from work by Liu et al. [1].
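The sampling-error side of that sanity check can be estimated with a standard bootstrap. The sketch below is a generic illustration (not the method of Liu et al.): it measures the natural sampling noise of a chosen statistic, against which the distortion introduced by a synthesizer on the same statistic can be compared.

```python
import random

def bootstrap_sampling_error(records, statistic, n_boot=200, seed=0):
    """Bootstrap estimate of sampling noise: resample the data with
    replacement many times and report the standard deviation of the
    statistic across resamples. If a synthesizer shifts the same
    statistic by an amount comparable to this value, the distortion
    is within ordinary sampling noise."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        resample = rng.choices(records, k=len(records))
        values.append(statistic(resample))
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return var ** 0.5

# Example: sampling error of the mean of a hypothetical lab value.
lab_values = [1.0, 2.0, 3.0, 4.0, 5.0] * 20  # 100 toy measurements
se = bootstrap_sampling_error(lab_values, lambda xs: sum(xs) / len(xs))
```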

Thus, users of synthetic data must pick methods along a privacy versus utility/realism continuum to find a data generation algorithm that best fits their needs. A simple decision schema for choosing among GAN-based, hybrid, and rule-based methods, based on the realism and privacy needs of a specific project, is shown below.

Selecting the right synthetic PPDM (Image by Authors)

Evaluation of Synthetic Data Method/Tool

For any method used, it is essential to evaluate the quality of the synthetic data generated. One example of an evaluation tool is the k-marginal metric, developed for the first NIST synthetic data challenge and used in all but one sprint across four years of challenges. It evaluates distributional similarity between two data sets across the entire data space, that is, relationships over every possible combination of k features. The NIST challenges found k-marginal to be predictive of solution performance for a wide array of other analytics: if a synthesis solution scores well on the k-marginal metric, it also tends to perform well on other data quality metrics and use cases. The formal definition of the k-marginal metric is given in the NIST challenge documentation.
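A minimal version of the k-marginal idea can be sketched as follows. This simplified implementation enumerates every k-subset of fields and averages the total variation distance between the joint empirical distributions; the actual NIST metric instead samples subsets at random and rescales scores to a 0 to 1000 range.

```python
import itertools
from collections import Counter

def k_marginal_distance(real, synth, k=2):
    """Simplified k-marginal score: for every k-subset of fields,
    compute the total variation distance between the two joint
    empirical distributions, then average. 0 means the k-way
    marginals match exactly; 1 means they are completely disjoint."""
    fields = sorted(real[0].keys())
    distances = []
    for subset in itertools.combinations(fields, k):
        p = Counter(tuple(r[f] for f in subset) for r in real)
        q = Counter(tuple(r[f] for f in subset) for r in synth)
        support = set(p) | set(q)
        tvd = 0.5 * sum(
            abs(p[v] / len(real) - q[v] / len(synth)) for v in support
        )
        distances.append(tvd)
    return sum(distances) / len(distances)
```

Because it averages over many feature combinations rather than checking one handpicked table, a good score is hard to achieve by accident, which is why it tracks performance on downstream analytics.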

Modern data synthesis techniques (such as GANs or the tree-based models used in the R package synthpop) generally use fully conditional synthesis: the model for each variable is fit using all available information, in order to capture all meaningful relationships between variables in the data, including surprising or unexpected ones. This differs significantly from older approaches that built models from a small, handpicked set of variables. When two subgroups of the data have very different distributions across the same variables, some models struggle to capture both groups correctly. In that case, synthesis can be partitioned to prevent the groups from conflicting. NIST’s k-marginal metric is very effective for identifying the locations in the data where this problem occurs. The process of iteratively evaluating synthetic data quality and then adjusting the synthesis process to address any issues uncovered is referred to as synthesizer tuning.
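Fully conditional synthesis can be illustrated with a minimal sequential sketch for categorical data. The field names here are hypothetical, and real implementations such as synthpop fit regression or tree models rather than raw conditional counts, but the structure is the same: each variable is drawn conditioned on the variables synthesized before it.

```python
import random
from collections import Counter

def sequential_conditional_synthesis(records, n, order, seed=0):
    """Synthesize fields one at a time, each drawn from its
    empirical distribution conditioned on the fields already
    synthesized for the row. Unlike independent marginal
    sampling, this preserves cross-field relationships."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for i, field in enumerate(order):
            prefix = tuple(row[f] for f in order[:i])
            counts = Counter(
                r[field] for r in records
                if tuple(r[f] for f in order[:i]) == prefix
            )
            if not counts:  # unseen combination: back off to the marginal
                counts = Counter(r[field] for r in records)
            values = list(counts.keys())
            weights = list(counts.values())
            row[field] = rng.choices(values, weights=weights)[0]
        synthetic.append(row)
    return synthetic
```

Subgroup conflicts of the kind described above show up here when the conditional counts for one group are swamped by another; partitioning the records before synthesis and concatenating the outputs is the corresponding fix.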

Current and Recent Activity

The potential and tangible advantages of synthetic data for preserving privacy have led to some exciting new collaborations. They have also spurred research into the generation of synthetic health records for clinicians and researchers in the federal health space.

  • NIST’s PSCR ran two sequential data challenges starting in 2018. The first, the “Unlinkable Data Challenge,” offered awards to concept papers that proposed mechanisms to protect personally identifiable information while maintaining a dataset’s utility for analysis. The second, the “Differential Privacy Synthetic Data Challenge,” tasked participants with creating new methods, or improving existing methods, of data de-identification while preserving the dataset’s utility for analysis [4]. All solutions were required to satisfy differential privacy, a provable guarantee of individual privacy protection.
  • NIH announced in mid-2020 that MDClone would provide the enabling technology for the synthetic data workstream of the National COVID Cohort Collaborative (N3C) [3], a centralized secure portal for hosting COVID-19 clinical data that accepts data via multiple models and transforms them into a common OMOP model.
  • NIST followed its successful 2018 challenges with the PSCR Differential Privacy Temporal Map Challenge in 2020, which offered public safety data sets and asked for algorithms that preserve data utility to the maximum possible extent while guaranteeing the protection of individual privacy.

Future Areas

Multiple questions are being explored in this space, including:

  • Quantifying the scientific validity of the generated data.
  • Quantifying the accuracy and relevance of models trained on generated data and comparing them to models derived from real data sets.
  • Verifying that the generated data does in fact preserve privacy, and that membership in the real data samples can be inferred from the synthetic data with only minimal probability. The challenges set up by NIST and referenced in the previous section established introductory benchmarks that can be further developed and utilized from a policy perspective for privacy preservation.
  • Evaluating the computational resource load for generating and validating synthetic data sets.
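On the membership-inference question, a cheap first smoke test is the exact-match rate between synthetic and real records. The sketch below is illustrative only: a nonzero rate does not by itself prove a leak (common records can recur by chance), and a zero rate does not prove safety, since partial and attribute disclosure remain possible.

```python
def exact_match_rate(real, synth):
    """Fraction of synthetic rows that exactly reproduce some real
    record. A crude memorization check, not a substitute for a
    full membership-inference evaluation."""
    real_set = {tuple(sorted(r.items())) for r in real}
    matches = sum(tuple(sorted(s.items())) in real_set for s in synth)
    return matches / len(synth)
```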

References

[1] Liu et al. “Privacy-Preserving Monotonicity of Differential Privacy Mechanisms.” (2018). Source URL

[2] Haley Hunter-Zinck, D-Lab Securing Research Data Working Group “Validating the privacy preserving properties of synthetically generated electronic health record data” (2020).

[3] Raths D. News Article- National COVID Cohort Collaborative Preparing to Open Enclave to Researchers (2020). Source URL

[4] Ridgeway, D., Theofanos, M., Manley, T. and Task, C. Challenge Design and Lessons Learned from the 2018 Differential Privacy Challenges, Technical Note (NIST TN), National Institute of Standards and Technology, Gaithersburg, MD, (2021). Source URL. PDF Publication.

--

Senior Lead Technologist @ Booz Allen Hamilton. Supporting the VHA Innovation Ecosystem with a focus on Data Science, Analytics and Healthcare.