Synthetic Data: The Good, the Bad and the Unsorted

A legal take on using synthetic data for AI training

Tea Mustać
Towards Data Science


DISCLAIMER: This is not another pro and con article for the collection.

Collage depicting human-AI collaboration in content moderation. Multiple arms, screens, computer cursors and eyes highlight the extensive human labor involved.
Anne Fehres and Luke Conroy & AI4Media / Better Images of AI / Humans Do The Heavy Data Lifting / CC-BY 4.0

Synthetic data has been touted as a promising avenue for enabling research and product development, and even for supporting whole business models that depend on large quantities of data, especially where real data is scarce, sensitive, or legally problematic to use. However, there appears to be a lot of confusion (caused mostly by generalizations in the available legal literature) as to just how useful synthetic data is for avoiding legal landmines and improving AI models. This confusion has produced a sea of articles listing the same pros and cons from different perspectives, and often citing the same characteristics interchangeably on both sides. So, for instance, while most people appear to agree that scalability and enhanced privacy are definitive pros of relying on synthetic data, cost is mentioned as both a plus[1] and a minus,[2] and so is bias, since synthetic data can be used as a means of bias reduction[3] but can also sometimes cause bias amplification.[4] It all depends on which articles you happen to stumble upon and which particular circumstances and scenarios they considered. The worst thing being that none of them are wrong.

We will not engage in the same exercise here. Instead, we will try to draw some lines between the currently intermingled uses, purposes, concepts, circumstances and situations that contribute to the existing confusion, temporarily complicating things further in order to clear up the fog afterwards. All the while, we will keep our eyes on privacy. (It is, after all, the one topic about which I can at least pretend to have some answers and know what I'm talking about.)

Synthetic data for AI by AI

At least in the legal context, the most often analyzed pros and cons of synthetic data concern its use for privacy-enhancing purposes. In this context, synthetic data is often compared or even equated to anonymous or pseudonymous data. This is unfortunate, first because pseudonymization and anonymization have a long and complicated history of being misinterpreted and misused, meaning there is no strong consensus on what either term means in the context of privacy and the GDPR, which makes them bad reference points. And second because synthetic data is (or should be) something separate and different from both. Although it would be ideal to distance ourselves from these two misunderstood concepts, we will stick with them. We don't live in a bubble, after all, so we might as well make peace with it.

First and foremost, synthetic data should never be pseudonymous; if it is, that means we didn't synthesize it very well. Second, it does (often) fall under the umbrella of anonymous data. Why? To answer that, let's tackle pseudonymization first.

Pseudonymous data is data that can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and securely.[5] Two things to keep in mind here: 1. the possibility of re-identification is still there, and 2. someone holds the key needed for it. This is why pseudonymous data is still considered personal data, and it is also why synthetic data is not pseudonymous. Or, in any event, shouldn't be.
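To make the mechanics concrete, here is a minimal sketch of pseudonymization in Python. Everything in it (the record layout, the field names, the `pseudonymize` helper) is made up for illustration; the point is only that a separately kept lookup table preserves a path back to the individual.

```python
import secrets

# Toy records with a direct identifier (all names and fields are made up).
records = [
    {"name": "Ada Lovelace", "diagnosis": "A"},
    {"name": "Alan Turing", "diagnosis": "B"},
]

# The "additional information" of the GDPR definition: a re-identification
# table that must be kept separately and securely.
lookup = {}

def pseudonymize(record):
    token = secrets.token_hex(8)      # random pseudonym with no semantic content
    lookup[token] = record["name"]    # the path back to the person lives here
    return {"id": token, "diagnosis": record["diagnosis"]}

pseudonymous = [pseudonymize(r) for r in records]

# Whoever obtains `lookup` can re-identify every record, which is exactly
# why pseudonymous data still counts as personal data.
print(pseudonymous)
```

No such lookup table exists (or should exist) for well-synthesized data, which is precisely the distinction drawn here.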

Anonymization, on the other hand, is the process of rendering personal data (drum roll) anonymous.[6] Anonymous data, in turn, either does not relate to any natural person at all, or it once did but has been rendered anonymous in a manner that makes that person no longer identifiable.[7] Several things to watch out for here. First, anonymous data can refer to personal data that has been processed and anonymized. Second, anonymous data can no longer be used to identify the person it originally belonged to (if it belonged to one), at least not using any of the techniques and technologies available today.[8] Third, anonymous data can also be completely non-personal data, such as weather statistics.

Now enter synthetic data. Synthetic data is artificially generated data created to reproduce the characteristics and structure of original data.[9] We can think of it as a sort of proxy for the original data. The original real data doesn't necessarily have to be personal and, even if it was, the new synthetic data is in any event anonymous, as it is completely made up and contains none of the original data it is based on (or at least it shouldn't). This makes synthesis just one of the possible anonymization techniques (leaving aside the debate on whether data can ever be fully anonymized[10]). But it still doesn't mean that the two are always one and the same.
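As a minimal illustration of what reproducing the "characteristics and structure" of original data can mean, the sketch below fits simple per-column statistics to a toy numeric table and samples brand-new rows from them. The two-column layout and the Gaussian model are assumptions chosen for brevity; real synthesizers (GANs, diffusion models, copulas) are far more sophisticated, but the principle stands: the synthetic rows follow the original distribution while no row corresponds to a real individual.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" table: 1,000 people with two numeric attributes,
# say age and income (entirely made-up data for illustration).
real = rng.normal(loc=[40, 55_000], scale=[12, 18_000], size=(1_000, 2))

# Step 1: learn the structure of the real data; here that is just
# the column means and the covariance between columns.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Step 2: sample completely new records from the fitted model.
synthetic = rng.multivariate_normal(mu, cov, size=1_000)

# Same statistical shape, but no synthetic row maps back to a real one.
print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```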

All this is important because, since we are talking about something that should be anonymous, the whole privacy debate would seem completely nonsensical: anonymous data falls entirely outside the scope of the GDPR. And yet we are discussing it. And we are not wrong to do so.

Is synthetic data (privacy’s) friend or foe?

The first reason why we are, well, only partially wrong in discussing synthetic data in terms of privacy is that synthetic data has to be generated, sometimes from data sets containing personal data. This is now (most often, but not necessarily) done by machine learning models using deep learning algorithms, probably because we are talking about processing and synthesizing massive amounts of data, and because recognizing patterns and statistics, regardless of purpose, is something ML models naturally excel at (as opposed to most humans). This means that the mere generation of synthetic data is already a processing operation, which is relevant from a legal point of view, especially if the original data was personal. It also means that common claims about how using synthetic data for training ML models is good or bad neglect two key aspects of the matter: not only do they oversimplify by making synthetic data just about privacy, but they also in most cases overlook that there will still be models digesting personal data in order to synthesize it. These are not marginal points dragged out to overcomplicate an already complicated thing. They are, quite to the contrary, crucial to having a (grown-up) conversation about synthetic data, with or without privacy in the room.

And privacy is often not in the room when we are talking about synthetic data. To point this out once again, synthetic data has a life of its own. It is used as a technique and strategy in various contexts, including situations where real data is scarce, sensitive, or fraught with legal issues and uncertainty. One important, albeit seldom publicly discussed, purpose of synthetic data, for instance, is training AI systems in the military context. There, synthetic data can serve to diversify data sets and provide fine-grained control over data attributes, both of which matter greatly for increasing quality and shortening training cycles in a domain of very scarce, difficult-to-obtain, and highly sensitive real data.[11] You can't read much about these types of tradeoffs in the popular literature, and yet they exist, they are important, and they can be worth considering in other contexts as well.

On the other hand, even when synthetic data is discussed in the context of privacy, we are again over-generalizing and mixing up two different fundamental rights. Even though privacy and data protection overlap to a large extent, they cannot be used as synonyms. And while we can generally use synthetic data as a privacy-preserving technique, the situation with data protection is more complicated, primarily due to issues it can cause, such as bias or lack of accuracy. Although this may sound at first like data protection shutting down yet another promising method, that might just be us jumping to unnecessarily cynical conclusions about the good old GDPR. These same issues also bother AI developers, who often cite data distribution bias, incomplete and inaccurate data, lack of noise, over-smoothing or recursion as issues reducing the performance of models trained on synthetic data. Nobody wants an inaccurate and biased model, regardless of whether that model later affects human beings. The seriousness of these issues, finally, heavily depends on the type of model being trained and its intended purposes. Still, this does mean that lawyers and AI developers are bothered by many of the same things; it might just be less obvious at first sight. This is a good thing. It means we just have to talk to each other and work on problems together.
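Two of those developer complaints, over-smoothing and recursion, are easy to see in miniature. The sketch below repeatedly fits a naive Gaussian synthesizer to its own output; the single-column data and the Gaussian model are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed "real" data: the tails stand in for rare but important cases.
data = rng.standard_t(df=3, size=10_000)
print(f"real data:    std={data.std():.2f}, max|x|={np.abs(data).max():.1f}")

for generation in range(1, 4):
    # Naive synthesizer: fit a Gaussian to the current data and resample.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(mu, sigma, size=10_000)
    print(f"generation {generation}: std={data.std():.2f}, "
          f"max|x|={np.abs(data).max():.1f}")

# The overall spread survives, but the extreme values collapse from the
# very first generation: the rare cases a model may most need to learn
# are smoothed away, and feeding model output back into training only
# compounds the effect.
```

In the generative-modeling literature this recursive degradation is often discussed under the name "model collapse"; for lawyers, the same mechanics surface as accuracy and bias problems.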

Final thoughts

So, is synthetic data a friend or a foe? It is neither, and it is both. Truth be told, we have here a classic example of a double-edged sword: synthetic data creates new problems while resolving some of the existing ones. And this doesn't hold true just for privacy; it also holds for performance goals, where, for instance, scalability and data augmentation can stand opposite bias amplification or generalization concerns. This is not a reason to give up, nor to regurgitate the same old pro-vs-con articles and analyses that either overgeneralize or focus on just one minuscule point of the larger picture, leaving anyone reading a particular article unable to see the forest for the trees.

The utility and appropriateness of synthetic data for training ML models will always depend on the particular circumstances of the case: the type of data needed to train the model (personal, copyright-protected, highly sensitive), the quantity of data required, its availability, and the intended purpose of the model (inaccuracy or bias amplification will carry different weight in a model assessing creditworthiness than in one optimizing a supply chain). So maybe we can start by answering these kinds of questions for any given context, and then proceed to consider the various tradeoffs in a more appropriate setting.

Key takeaways:

· Synthetic data should never be pseudonymous.

· Synthetic data should always be anonymous.

· Synthetic data does not solely revolve around privacy.

· Although generally privacy-preserving, synthetic data can cause other data protection issues.

· Privacy and data protection are not the same thing.

· Some data protection issues also happen to be performance issues. This is good because it means we are all (at least sometimes) trying to fix the same thing.

· All tradeoffs associated with synthetic data are very context-specific and should be discussed within their relevant context.

A banana on a table and an image of a banana on a laptop on the same table. Each of the two bananas has a white frame around it with the word 'Banana' stuck on top of it.
Max Gruber / Better Images of AI / Ceci n’est pas une banane / CC-BY 4.0

[1] Exploring Synthetic Data: Advantages and Use Cases, Intuit Mailchimp, https://mailchimp.com/resources/what-is-synthetic-data/

[2] John Anthony R, When It Comes To AI — Synthetic Data Has A Dirty Little Secret, https://www.linkedin.com/pulse/when-comes-aisynthetic-data-has-dirty-little-secret-radosta/

[3] Michael Yurushkin, How Can Synthetic Data Solve the AI Bias Problem?, brouton lab blog, https://broutonlab.com/blog/ai-bias-solved-with-synthetic-data-generation/

[4] Giuffrè, M., Shung, D.L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 186 (2023). https://doi.org/10.1038/s41746-023-00927-3

[5] Article 4(5) GDPR.

[6] AEPD & EDPS, 10 Misunderstandings Related to Anonymisation, https://edps.europa.eu/system/files/2021-04/21-04-27_aepd-edps_anonymisation_en_5.pdf

[7] Recital 26 GDPR

[8] AEPD & EDPS, 10 Misunderstandings Related to Anonymisation, https://edps.europa.eu/system/files/2021-04/21-04-27_aepd-edps_anonymisation_en_5.pdf

[9] Robert Riemann, Synthetic Data, European Data Protection Supervisor.

[10] Alex Hern, 'Anonymised' data can never be totally anonymous, says study, The Guardian, 23 July 2019, https://www.theguardian.com/technology/2019/jul/23/anonymised-data-never-be-anonymous-enough-study-finds; Emily M. Weitzenboeck, Pierre Lison, Malgorzata Cyndecka, Malcolm Langford, The GDPR and unstructured data: is anonymization possible?, International Data Privacy Law, Volume 12, Issue 3, August 2022, Pages 184–206, https://doi.org/10.1093/idpl/ipac008

[11] H. Deng, Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer, Geneva, Switzerland: UNIDIR, 2023, https://unidir.org/wp-content/uploads/2023/11/UNIDIR_Exploring_Synthetic_Data_for_Artificial_Intelligence_and_Autonomous_Systems_A_Primer.pdf
