
AI meets the law: Bias, fairness, privacy and regulation

Nicolai Baldin on the TDS podcast

Jeremie Harris
Towards Data Science
36 min read · Dec 16, 2020


To select chapters, visit our YouTube video here.

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

Listen on Apple, Google, Spotify

The fields of AI bias and AI fairness are still very young. And just like most young technical fields, they’re dominated by theoretical discussions: researchers argue over what words like “privacy” and “fairness” mean, but don’t do much in the way of applying these definitions to real-world problems.

Slowly but surely, this is all changing though, and government oversight has had a big role to play in that process. Laws like GDPR — passed by the European Union in 2016 — are starting to impose concrete requirements on companies that want to use consumer data, or build AI systems with it. There are pros and cons to legislating machine learning, but one thing’s for sure: there’s no looking back. At this point, it’s clear that government-endorsed definitions of “bias” and “fairness” in AI systems are going to be applied to companies (and therefore to consumers), whether they’re well-developed and thoughtful or not.

Keeping up with the philosophy of AI is a full-time job for most, but actually applying that philosophy to real-world corporate data is its own additional challenge. My guest for this episode of the podcast is doing just that: Nicolai Baldin is a former Cambridge machine learning researcher, and now the founder and CEO of Synthesized, a startup that specializes in helping companies apply privacy, AI fairness and bias best practices to their data. Nicolai is one of relatively few people working on concrete problems in these areas, and has a unique perspective on the space as a result.

Here were some of my favourite take-homes from the conversation:

  • It’s really unclear what “AI bias” actually means. All machine learning systems use patterns in datasets to make predictions, but it’s hard to determine which patterns should qualify as “undesirable biases” that we shouldn’t use, as opposed to “valid patterns” which we should.
  • One of the cornerstones of GDPR (Europe’s flagship data protection regulation) is the distinction between “data controllers” and “data processors”. Different constraints are applied to companies based on which of these categories they fall into. The trouble is, it’s not always clear how a particular company or service will be classified, and there’s a lot of room for ambiguity in interpreting that distinction. Because companies don’t always know how regulators will apply legislation to their specific use case, they’re forced to invest heavily in various forms of (potentially unnecessary) legal and regulatory compliance, just to avoid the risk of a fine down the road.
  • As new algorithms, business models, and machine learning techniques are developed, old user data can be leveraged for different applications — many of which users would never have anticipated when they first handed it over. In the early days of Facebook, for example, deepfakes didn’t exist, and there were fewer reasons for consumers to want to prevent their pictures from being misused. Advances in speech synthesis and mimicry may give YouTubers reason to avoid posting with their real voices online, long after they’ve already given YouTube access to enough speech data to put them at risk. As the value of different types of data changes, the regulatory landscape will have to adapt as well.
  • Companies engage in a fair bit of self-regulation, trying to stay ahead of laws and norms for the purpose of brand protection and goodwill. That’s an important ingredient in the privacy battle, particularly as governments fall further and further behind the state of the art in machine learning and privacy.

You can follow Nicolai on Twitter here, or follow me on Twitter here.

Chapters:

  • 0:00 Intro
  • 1:48 What is Synthesized?
  • 3:37 Europe vs. North America
  • 8:01 Restrictions on companies
  • 15:17 Justifying the added complexity on companies
  • 16:21 Privacy of EU citizens
  • 18:51 Cultural shift regarding privacy
  • 23:22 Algorithmic bias and data bias
  • 29:44 Machine learning influence
  • 33:27 Incentivizing companies
  • 36:02 Reporting and enforcement requirements
  • 41:32 Policy adjustments
  • 45:21 What is synthetic data?
  • 49:10 Wrap-up

Please find the transcript below:

Jeremie (00:00):
Hey everyone, welcome back. Today, we’re talking to Nicolai Baldin, who completed his PhD at the University of Cambridge studying machine learning and statistics, and who’s now the founder and CEO of Synthesized.io, which is the whole reason I wanted to talk to him for this episode of the podcast.

Jeremie (00:13):
Synthesized is currently one of relatively few companies focused entirely on reducing bias and improving fairness in AI systems, as well as improving user privacy in compliance with complex regulations like GDPR. Basically, they help companies anonymize user data and share data without the risk of users’ identities being compromised in the process, or users’ rights being infringed. Now, this is actually a really difficult problem to solve. And it intersects with philosophy, ethics and morality quite a bit, which is why Nicolai has to be well-versed in all those areas, as you’ll see in our conversation.

Jeremie (00:43):
One of the most interesting things about Nicolai and his work is that it’s entirely focused on concrete applications of AI bias, AI fairness and privacy theory. Most of the conversations we’ve had so far, and actually most of the work currently being done in these fields is still theoretical. It’s about people figuring out how we should build systems in the future that won’t infringe users’ rights. And it’s about making sure that systems are built safely and fairly in general. But actually, solving those problems in practice is a whole other kettle of fish. And that’s a big part of what we’ll be talking about today.

Jeremie (01:11):
Hopefully, you get as much out of this conversation as I did. I think the topic is fascinating. And I think Nicolai is just the right person to talk about it. I hope you enjoy the episode.

Jeremie (01:20):
Hi, Nicolai. Thanks so much for joining me for the podcast.

Nicolai (01:24):
Thanks for having me.

Jeremie (01:25):
Cool. How are things going over there in the UK, by the way? Because I think we were talking earlier, you were mentioning that you’re in a second round of lockdown starting soon.

Nicolai (01:33):
Yeah. Actually, the second lockdown starts this Thursday. So I’m actually in the office right now. And that’s probably the last time I’m seeing all my colleagues face to face, and then they’re going into their lockdown for four weeks initially. But yeah, let’s see.

Jeremie (01:48):
Well, thanks for making the time to chat today. I’m especially excited about the idea, the prospect of talking to you just because we speak to so many people on this podcast from academic research backgrounds, people worried about AI bias, AI ethics, AI safety. But it’s quite rare to talk to people who are actually then having to implement these things in the industry and at scale. And I know that’s what you’re doing with Synthesized. So actually, maybe the first place to start here is, what are you working on right now? What is Synthesized? Can you give us a little bit of background on that?

Nicolai (02:19):
Sure. So the mission of Synthesized is to empower any data scientist, any ML engineer, any test engineer with a high volume of high quality datasets for development and testing purposes, staying compliant with regulations. And our vision is to build this so-called decentralized data access platform powered by machine learning to really facilitate innovation, again, in a compliant, privacy-preserving manner.

Jeremie (02:40):
And what do you mean by decentralized? How does that enter into the privacy equation?

Nicolai (02:47):
It’s probably something, well, I would say complementary to the privacy question. So it’s more about the efficiency game. So we’ve seen lots of centralized solutions such as Databricks, such as Snowflake, and we really built something which is, in a way, decentralized. And is not, I would say, forcing clients to move into a certain say, category, into a set of, I would say, databases or data warehouses, data lakes. It’s really in a way, a decentralized solution which is able to work with an arbitrary number of data sources, again, databases, data warehouses, data lakes.

Jeremie (03:21):
It feels like one of the advantages of that from a privacy standpoint would be that if you’re decentralized, if you’re doing more computation at the source, that means data doesn’t have to fly around as much as usual. So there are fewer opportunities for leaks. Is that part of the idea?

Nicolai (03:35):
Yeah, exactly.

Jeremie (03:37):
Okay, sweet. Cool. Well, so I think that background is helpful just for people to understand, because you’re doing all this stuff, because you’re moving a lot of very private data around. You’re also based in the UK. And I think as we’ll get into, there are some policy implications in terms of regulatory compliance that European firms face that aren’t necessarily quite as common in North America.

Jeremie (03:57):
So I think there are a whole bunch of different things that you’re actually sinking your teeth into, from a practical perspective. GDPR is probably the most famous regulation, and I think that’s a good place to start. You have a lot of experience with it. Can you introduce what GDPR is, and what compliance with GDPR requires?

Nicolai (04:13):
Yeah, absolutely. I think maybe before I do that, so let me paint maybe the bigger picture.

Jeremie (04:18):
Yeah.

Nicolai (04:18):
I think the world has become data-driven with terabytes of data being collected, analyzed and stored and all this. And needless to say, as we discuss safety, privacy, ethical use of data, have become key elements of any data-driven company. Only recently have the companies started to adjust in order to comply with the new regulations. And I would say it’s not surprising that we got to this stage, given the amount of inefficiencies which many companies have, and also the speed with which the digital world has been growing.

Nicolai (04:50):
And well, essentially, GDPR for … well, I think we have a technical audience here. And GDPR again, well, it’s a regulation on data protection and privacy in the European Union and the European Economic Area. It was introduced in May 2018. And I think it’s still considered, without exaggeration, the strongest set of rules on again, data protection and privacy, which is meant to enhance how people can access information about them, and also places limits on what companies can actually do with the data.

Nicolai (05:25):
Again, maybe to give you an example, well, GDPR, essentially well, controls how websites, companies or organizations are allowed to handle personal data, which is anything from names, well, email addresses, location data, browser history and many other things. And again, maybe I’ll go on a bit more in detail. So basically, the GDPR has, I would say, two key elements. So it’s data controllers and data processors. So data controllers, essentially, they exercise the overall control over the purposes and means of the processing of personal data, whereas the processors essentially act on behalf of, and it’s very important, only on the instruction of, the data controller.

Nicolai (06:11):
It may sound I guess, a little bit technical, but just to give an example. So if you have, say, an AI service, then basically if you put some data in it, and you would like to essentially get some output, some result from the AI service and ML ops or some amalgam engine, which can be accessed online, then when you essentially upload your data to the service, you become the data controller. And when you essentially ask the service to do something to give you a result, say maybe about some insights about hidden data, the software, the service becomes the data processor. So that’s essentially like a clear … well, an example of these two entities.

Nicolai (06:53):
In the US, you have obviously some, well, similar data protection legislation, such as HIPAA, CCPA. But the question is, why do we have it in Europe? And what’s the purpose? And it was introduced to essentially protect the privacy of EU citizens. And I think it’s also important to emphasize that before GDPR, many countries had their own data regulations, data protection acts. So Germany had its own. UK, we had something else, and France, Spain had different legislations with regards to data privacy.

Nicolai (07:32):
And the idea of GDPR was to really unify, to really harmonize all these legislations and put them together. So now, we have only say, 99 articles, which essentially form the guidelines on how companies should essentially manage and store data about people. And it also essentially, well, outlines all the rights which people have over the data they, well, give away to companies.

Jeremie (08:01):
And so, what are some of the biggest restrictions on companies’ ability to use data that come out of the GDPR?

Nicolai (08:08):
Right now, again, it really places limits on what companies can collect about people. And again, if they exercise, if they become essentially, the data controllers, they have to comply with the given set of rules. And again, those vary from essentially what they can do with the data. And they also have to be able to, in a way, tell the users what the data is used for, and also be able to remove the data when it’s actually asked for by the users. And there are, I think, eight rights.

Nicolai (08:40):
Again, I would say the GDPR was designed not only to place, I would say, obligations on the data controllers and processors, but also in a way to protect the rights of, well, the data rights of the users. And that means again, which varies from okay, how companies will access original data, but also again, ensuring that companies can delete them. You can request essentially some of the data to be deleted, the so-called right to be forgotten. And there are, I think, eight rules laid out by GDPR to essentially protect the rights of individuals as well.

Nicolai (09:19):
I think in all, there are definitely some pros and cons. And what we’ve seen, again, working with many companies is that definitely one of the pros is that, I guess the users’ privacy is protected and people have obviously, control over their personal data. But also, one of the cons which we saw is that because still, companies are trying to understand whether they are data controllers or data processors, and also trying to understand how to essentially implement those rules, as a result of that, the innovation has slowed down.

Nicolai (09:51):
And well, maybe again, just to give you an example, with an insurance company we worked with about a year ago. So it maybe took us up to three months essentially, for them to decide how they’re going to be, well, uploading their datasets into our platform, even though it was on premise, even though it was essentially the data which they already had consent to use. But at the same time, they were still struggling to understand again, how they’re going to be using the datasets with our platform.

Jeremie (10:21):
It’s interesting how these issues come up, because you see similar debates around, for example, the US tax code, just how it’s this giant Frankenstein monster of a document. There’s no way that any human being can plausibly understand and read the tax code. And yeah, it ends up being this big hurdle.

Jeremie (10:41):
I guess that’s one objection to the way GDPR has been implemented. Are there other issues that you found when you’re helping companies enforce it or develop infrastructure to allow them to apply GDPR? Are there other big gaps that you notice in the legislation that seemed maybe unnecessary?

Nicolai (10:59):
I think maybe there are still … well, companies are still trying to, I guess, figure out whether they are data processors or data controllers. And the reason being is that it’s quite a big difference in terms of what companies have to essentially be doing. And the boundaries, again, the line is very thin. So it’s, essentially, we’ve seen again, many companies essentially switching from being a data processor to data controller because they just realized okay, we’re actually a data controller with regards to, say, a new feature which we are rolling out.

Nicolai (11:29):
So essentially, what might happen is that as you develop new features on the platform, as you’re rolling out to the users, your status may change. And it’s very important to essentially ensure that okay, you have all the practices and all the rules to comply with this new … well, with this change as well.

Jeremie (11:48):
So depending on context, a single company could have data processor responsibilities in one context and then take the controller responsibilities in a different one? That’s how it breaks down?

Nicolai (11:58):
Exactly. And I remember again, we had this discussion with one of our, well, lawyers and external law firms. And they were essentially sharing that it’s actually a big debate right now among the AI companies, which is, again, are they data controllers? Are they data processors? And what are the obligations and all of that? So again, there is no silver bullet, and you really have to go really deep to understand some of the examples as well.

Nicolai (12:26):
And I think even on the ICO website they’re right now, with regards to, especially machine learning and artificial intelligence, they’re trying to, in a way, show some examples [inaudible 00:12:35]. In this case, you’re becoming a data controller, in this case, you’re a data processor. And you realize that there’s quite a lot of ambiguity when even on the well, official website, you have examples of again, this is company, this is data processor and this one is data controller.

Nicolai (12:50):
And really emphasizing okay, if you essentially train this neural network, and if you essentially train it in an isolated environment but you don’t really have any access to, well, I would say change in some of the parameters, I would say in some scenario, you may become essentially, a data processor. Whereas if you essentially exercise some additional control over the data, over the neural network, you essentially become the data controller. So it’s kind of [crosstalk 00:13:17].

Jeremie (13:16):
Oh my god.

Nicolai (13:17):
So it’s quite a thing. And again, many AI companies in the UK and Europe have spent quite a bit of time to figure out, how to essentially comply with the new regulation.

Jeremie (13:30):
Just because you said it, I remember for my startup, when we were … GDPR came out, I don’t know, 2016?

Nicolai (13:37):
Mm-hmm (affirmative).

Jeremie (13:37):
Something like that. Or 2017. We went through this phase of freaking out, trying to figure out, which case are we going to fall into? And what our lawyer told us in one case was like, “Yeah, actually, no one knows yet. We’re waiting for the first precedent to be set by actual courts that are going to have to rule on this.” And then everybody’s just navigating this uncertainty until there’s that precedent, where you actually don’t have that much to go on in terms of determining which camp you fall into. Kind of interesting that this data processor, data controller thing is so ambiguous.

Jeremie (14:09):
And it sounds like there’s also a mapping then onto different parts of the data science lifecycle. So like, if you’re doing just hyperparameter tuning, maybe you’re just a data processor. But then if it goes into feature engineering or data cleaning, then maybe that’s more data controlling. Or is that what might happen?

Nicolai (14:28):
Exactly. So it really depends on again, how exactly you use the data in the data … well, ML development cycle. And again, if it’s a simple algorithm, not even necessarily a stochastic ML algorithm. Not even an ML model if it’s just, again, something simple, deterministic, then you may get away with essentially the data processor.

Nicolai (14:52):
But at the same time when you exercise again, additional control, when you exercise again, some … you would like to do some sophisticated features using model selection, model … well, validations of that, you’re likely to become essentially the data controller because there’s going to be some exposure of the datasets to the ML team within the business. So at that stage, you’re likely to become the data controller.

Jeremie (15:17):
And do you think there’s enough value from GDPR, from a privacy standpoint, to justify the added complexity on companies? I know it’s a tough question to answer, probably, but do you have an instinct there?

Nicolai (15:29):
Absolutely. I think we’re going to touch on them. Yeah, I want to touch on a few things here. But one of the things is definitely different data, well, user agreements. Because when you essentially start using, say, an AI service, you get this big, well, user agreement, right?

Jeremie (15:48):
Yeah.

Nicolai (15:49):
And pages. And you need to go through it and understand how your data is going to be treated. GDPR, where it’s actually helpful is that those data agreements now have the same structure, and in a way that you know that privacy is by default. And you know that some of the elements already are meant to be covered by GDPR, and there is no way how you can essentially say, “Hey, man, I’m going to forget about the GDPR and I’m going to allow you to do something.” Because again, it’s impossible.

Nicolai (16:21):
So again, unlike maybe the US and other countries, in the European Union privacy is by default, which is very important. So I think the main benefit is that we can now protect, well, the privacy of EU citizens, which is the main goal, right?

Jeremie (16:42):
Yeah. I feel like this sort of thing is probably … my intuition is it’s going to become more important over time. If only because, if you are giving away your name, your email, your date of birth to Facebook in 2007, I think you’d have a reasonable expectation that there isn’t that much they could do with that data, because the state of machine learning just wasn’t that advanced at the time.

Jeremie (17:05):
But now, we’re starting to find that you can pull out so much more than anyone ever thought, both in terms of compromising privacy and in terms of just making really good predictive models. So it’s almost as if like, giving away the same amount of information implies giving away more power, more rights, more privacy today than it did back in the day. Is that a fair assessment?

Nicolai (17:28):
I think it’s definitely fair. And maybe to add to that, again, I like to say, “Hey, man, there is no way that in two to three years, people are going to say, ‘Hey, we don’t care about privacy, we don’t care about algorithms being fair, we don’t care about data being unbiased.’” And to be honest, we expect only more regulations to come into this area. And again, we believe that data should be treated ethically and in a privacy-preserving manner.

Nicolai (17:54):
And because again, also, it’s interesting that we have inadvertently created quite a bit of … well, I don’t really like the word, but hype around the company as well. And I think it’s a clear indication of how important this topic has become in the society. And we definitely expect more regulations to come into this play. And, again, GDPR at the moment, we focus primarily on privacy, but the topics of biases and the topics of fairness are likely to be covered very soon as well.

Jeremie (18:25):
And just to actually play devil’s advocate on this, because I remember I had a friend in grad school who got … this is back in 2015, he got a ping from Google every Friday, telling him what the traffic was like between his office and the sushi restaurant he’d like to go to every Friday. And I turned to him, I was like, “Dude, is this not creepy? Why would you want this on your phone?”

Jeremie (18:51):
And he goes, “Well, you know what? I’ve kind of gotten used to it. It’s a convenience that I like.” But do you think there’s a possibility that instead of our legislation making privacy more possible by basically just being firmer on these companies, that instead we’re going to see a cultural shift in the direction of less caring about privacy?

Nicolai (19:12):
I don’t think we are going to see, again, people caring less about privacy. And by the way, what we’ve definitely seen is that many companies, especially in the data space, they like to say, “Hey, we’re going to help you unlock data.” So we believe that data should stay locked away or locked down, which is becoming quite popular in the UK and other European countries.

Nicolai (19:37):
There is a big difference between unlocking data and unlocking data’s full potential. And in a way, so the question is, how can we keep original data locked away, but at the same time, build better services and enable them, well, provide them to users? Which is now a goal of many, many companies, including us. And we really built … so I want to touch on it as well, but the platform we’ve built, it really allows us to facilitate and speed up innovation by keeping original data locked away. And I think this is extremely important, to distinguish these two things. Because again, original data has to stay locked away.

Nicolai (20:14):
And we need to essentially bring privacy back to the … well, back to the users, to ensure privacy. But at the same time, how can we enable better services to be built, better models to be served to customers? The fraud detection models, models in healthcare? So again, all those models, they require datasets. And yeah, the question is, how do we essentially ensure the innovation is still going, but at the same time, well, we respect the privacy, we respect the data protection legislation?

Jeremie (20:45):
And when you say, keeping the data locked away, I guess there are a couple of different ways I could imagine that applying. On the one hand, only the company that I’ve explicitly given access to my data, gets to see that data. And I guess there’d be a whole bunch of questions about third party authorization. Like maybe I need to use AWS to train my model, so I have to send my data to AWS.

Jeremie (21:10):
And then there’s a separate question about, I gave you my data, like the Facebook example we were talking about. I give your company my data, with the understanding that technology only allows you to do a certain number of things with it. But then deepfakes comes up, and all of a sudden my photos on Instagram might turn up in a pornographic video or something. And had I known that was a possibility, I never would have given my data away to this company in the first place, or that kind of data.

Jeremie (21:39):
These seem like two different issues. So at the moment when you say, keeping the data locked away, I guess that has more to do with the first case right, just sending it out to other parties?

Nicolai (21:49):
Yeah, absolutely. So we really want to ensure that, again, if we’re talking about healthcare, we’re talking about say, financial services, we don’t really want our personal data to be essentially flying around within the bank or within a hospital, because it’s obviously very sensitive data points. But at the same time, again, the question is, we still want to get better services to the public. And that’s the puzzle we need to solve. And there is a trade-off between those things.

Nicolai (22:22):
And I think one of the things again, we’re going to be talking about, but yeah, what we’ve designed is that it allows companies, and yeah, we work with some of the insurance companies, some of the banks in the UK and Europe, it allows them to innovate much faster. But at the same time, staying compliant with regulations such as GDPR. And we really expect actually banks to make more investments into this area. And not only the banks, but again, healthcare institutions, insurance companies, governments. Because again, there is no way how it’s going to be less. Well, we believe it’s going to be even stricter.

Nicolai (22:57):
And definitely, if there is time to invest in this, then definitely, now is the right time because the sooner the better. And if you invest in this right now, then basically in two or three years, you can become, well, more competitive, because you can essentially invest in something else when other companies are going to be investing in privacy, fairness and other things as well. So it’s definitely happening right now.

Jeremie (23:22):
Actually, speaking of privacy and fairness, things like that, you’ve written a lot about algorithmic bias, data bias, that sort of thing.

Nicolai (23:30):
Yeah.

Jeremie (23:30):
Maybe it’s worth touching on that. Actually, yeah. Can you first introduce those concepts, and draw the distinction between the two?

Nicolai (23:38):
Yeah, absolutely. So I guess, well, the main difference, again, two things. So it’s algorithmic biases and biases in data. And typically, again, the public, we get exposure, well, to some algorithms. Say, when we apply for credit, we get a credit score, or when we go to, say, a hospital. Well, sometimes again, some of the decisions are made by machines, and some of those machines are deterministic. Some of them are stochastic.

Nicolai (24:05):
And typically, if it’s a deterministic system, again, if it’s, say, a recommendation system or if it’s a credit scoring algorithm, they basically implement some rules. And again, imagine, say, a rule-based system. Very simple one; say, in the insurance company, in a hypothetical insurance company. So if I essentially apply for a credit and my age is higher than the threshold, then I’m going to have a certain score. And this is a simple algorithm.

Nicolai (24:35):
Well, this is a bias. This is an algorithmic bias. And well, this simple algorithm, it’s deterministic. And well, it’s a discrimination of course. And because age is a legally protected attribute in the UK, so you cannot really do that. And so that’s illegal.
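
To make the hypothetical concrete, here is a minimal sketch of the kind of deterministic, rule-based scorer Nicolai is describing. The threshold, field names and score values are all illustrative assumptions, not anything used by a real insurer or by Synthesized; the point is simply that a rule which branches directly on a protected attribute like age is biased by construction, no machine learning required.

```python
# Hypothetical rule-based credit scorer (illustrative only).
# Age is a legally protected attribute in the UK, so branching on it
# like this is exactly the kind of algorithmic bias described above.
AGE_THRESHOLD = 40  # made-up threshold

def credit_score(applicant: dict) -> int:
    """Deterministic score: no ML involved, but still discriminatory,
    because the rule conditions directly on a protected attribute."""
    score = 500
    if applicant["age"] > AGE_THRESHOLD:
        score += 100  # older applicants get a boost purely because of age
    if applicant["income"] > 50_000:
        score += 50
    return score

print(credit_score({"age": 55, "income": 30_000}))  # 600
print(credit_score({"age": 25, "income": 30_000}))  # 500: same income, lower score
```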

Nicolai (24:53):
Bias in data is something else. Biased data is independent of the algorithms, and whether they are deterministic, whether it’s based on machine learning or some other techniques. And essentially, bias in data is imagine again, I’m playing hypothetical insurance company scenario. And they have this delinquency score. And they have, say, other attributes in the dataset such as age, such as gender, race and many other things.

Nicolai (25:22):
So the bias in data is when you basically condition on a subgroup of those sensitive variables, and you essentially compare the distribution over the delinquency with the distribution over the entire dataset, over the delinquency when you essentially don’t do any conditioning. Then essentially, we get the bias when there is again, skewness in the conditional distribution, when you essentially focus on the given minority class. And that’s, again, the bias in data.

Nicolai (25:55):
Well, if the algorithm, which is an ML algorithm, is taking in this dataset, then essentially, the results of this algorithm are also going to be biased, even though it’s ML. And essentially, ML is, well, essentially designed to find those biases. Then you know that again, this ML algorithm is biased. And hence, I think it’s very important to look at probably three main things. Again, it’s how was the dataset collected? What’s the quality of the dataset? And it’s also important to benchmark this dataset against a larger population. Because it may happen that not only a subset of this dataset is biased, but it may happen that the entire dataset is biased. So it’s really impossible to understand how to essentially find those biases within it, because all of it is biased.
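
As a rough illustration of the distributional comparison Nicolai describes, here is a short pandas sketch that contrasts the outcome rate conditional on a sensitive subgroup with the rate over the whole dataset. The column names and values are invented for the example, and a real bias audit would use proper statistical distance measures and significance tests rather than a raw difference in means.

```python
import pandas as pd

# Toy dataset in the spirit of the hypothetical insurance example:
# a delinquency flag plus one sensitive attribute (all values invented).
df = pd.DataFrame({
    "delinquent": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
    "gender":     ["f", "f", "f", "m", "m", "m", "m", "f", "m", "f"],
})

overall_rate = df["delinquent"].mean()                 # unconditional outcome rate
by_group = df.groupby("gender")["delinquent"].mean()   # rate conditional on the sensitive attribute

# Crude bias signal: how far each subgroup's rate deviates from the overall rate.
skew = (by_group - overall_rate).abs()
print(overall_rate)   # 0.5
print(by_group)       # f: 0.6, m: 0.4
print(skew)           # 0.1 for each group, i.e. skew in the conditional distribution
```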

Nicolai (26:50):
And this is again, another thing. By the way, at Synthesized, we are releasing the platform to the public, well, in the middle of November. So probably after or, well, before this. The platform is going to be essentially free access, and it’s going to allow any data scientist, any ML engineer, any test engineer, upload a dataset, identify all biases in data and also mitigate those biases. So we’re going to be essentially assigning the so-called fairness score to the dataset. And we’re also going to be saying, “Okay, hey, here are all the biases we found and here is how we can essentially mitigate them.” So we’re going to be doing that as well as part of the platform release. And we really want this to be used by HR departments in big organizations. We really want this to be used by institutions, by academics.

Nicolai (27:41):
And we touched on this previously. So I would say many companies right now, and many people especially, well, in different academic circles, like to talk about bias and fairness. Very few companies do actual work. And for us, it was very important to enable this service so that again, even a not necessarily technically skilled, technically savvy person can essentially upload the dataset and understand those biases and fairness, can essentially, well, mitigate those things and essentially get a better, well-balanced dataset.

Nicolai (28:16):
I can touch on this as well, but we essentially designed the so-called rebalancing technique, which is able to essentially correct those biases in data and make sure that the algorithms are going to be fair as well, which are built using those datasets.
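
Synthesized’s own rebalancing technique isn’t spelled out in the conversation, so as a stand-in, here is one naive way to rebalance a dataset: oversample so that every combination of sensitive group and outcome is equally represented, which equalizes outcome rates across groups. Column names are hypothetical; this is only a sketch of the general idea, not the method the platform uses.

```python
import pandas as pd

def naive_rebalance(df: pd.DataFrame, sensitive_col: str, outcome_col: str,
                    random_state: int = 0) -> pd.DataFrame:
    """Oversample with replacement so every (sensitive group, outcome) cell
    has the same number of rows, equalizing outcome rates across groups.
    A crude illustration of rebalancing; generative approaches can instead
    synthesize new, realistic rows for the under-represented cells."""
    cells = df.groupby([sensitive_col, outcome_col])
    target = cells.size().max()
    resampled = [
        group.sample(n=target, replace=True, random_state=random_state)
        for _, group in cells
    ]
    return pd.concat(resampled).reset_index(drop=True)

# Example, using the toy dataframe from the earlier sketch:
# balanced = naive_rebalance(df, "gender", "delinquent")
# balanced.groupby("gender")["delinquent"].mean()  # -> 0.5 for every group
```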

Jeremie (28:31):
I think it’s a really fascinating problem to tackle, partly because it’s so unclear what the word bias and what the word fairness means. There’s implicitly a prior whenever we talk about bias or fairness. You just said, for example, a dataset is biased or a subset of a dataset is biased if let’s say, the average number of credit defaults in a group of red haired people is different from the average number of credit defaults for the broader population. We then conclude, oh, that’s biased.

Jeremie (29:05):
Whereas you also said, wait, but machine learning algorithms, all they do all day is latch on to that kind of bias. That’s where all their value comes from. And so, how would you define … maybe you just use a strict legal definition, so protected classes like gender and age, those are strictly just biases. But I guess philosophically, do you have a sense of where you would put that dividing line between like, “Oh, this is a bias” versus, “No, this is a useful feature or a useful trend or pattern that we can actually base our algorithm on”?

Nicolai (29:38):
I think as you rightfully mentioned, all datasets are biased.

Jeremie (29:44):
Right.

Nicolai (29:44):
And machine learning is essentially meant to, well, train to essentially find those biases. The question is, again, what does it mean? And is the bias against, say, legally protected set of attributes? That’s, again, the main issue. So we’ve seen people talking about, say, good discrimination versus bad discrimination. I believe that there is no … well, we can’t really separate those things. Again, discrimination is when the bias is against the legally protected attributes. So like, again, sex, age, race, and many other things.

Nicolai (30:18):
And if essentially, the bias is against those legally protected attributes, then it’s discrimination and it’s illegal. Whereas of course, there are many other attributes. And of course, there is some skewness and there is difference in terms of the distribution over those subsets of data, again, versus the overall distribution, the overall dataset. And if those attributes are not, well, obviously safe to use, then this is what machine learning is meant to be doing. And we have, as we discussed again, discriminative models which are essentially meant to find those things.

Nicolai (30:55):
And again, if we are not really focusing on those sensitive attributes, then again, those biases are essentially used to train machine learning models. But we need to be really careful with making sure that we, in a way, treat, well, legally protected attributes, well, respectfully. And this is what by the way, we are doing at Synthesized as well.

Jeremie (31:21):
It’s really interesting how different parts of this problem are abstracted away by different groups. The government is in charge of determining like okay, these are legally protected classes. So anything that leverages these attributes is performing some kind of bad discrimination. Whereas implicitly, everything else is … It strikes me that often, when I talk to people who are working in government, legislating these things, there’s this subtext that we don’t really fully know what we’re doing, partly because you don’t tend to see a lot of high caliber data scientists in government, just because of the way the incentive landscape works.

Jeremie (31:58):
You can make a lot more money working in the private sector. And so it’s one of these … I don’t know. It just feels like one of these industries where you face that uphill battle against incentives right out the gate. You’re wrestling against the best ML engineers at Google and Facebook and Amazon, and all you have is the resources of like the EU, which obviously is really big, but at the same time, there are always going to be these nuances that people can hide different applications of data in weird places, especially if they know what they’re doing.

Nicolai (32:29):
Absolutely. I would even say that there is this entire, I guess, area in statistics and machine learning around sufficient statistics. And sometimes it’s possible to, in a way, hide those biases in, again, other attributes as well, which are in a way correlated with those biases. And even though again, those attributes are not sensitive, say, I don’t know, like location, it’s sometimes possible to again, link them. And it’s very important.

Nicolai (32:55):
And this is what we’ve done a lot, spent quite a bit of time on, is to ensure that those things essentially become sensitive as well. There are these legally protected attributes, which are protected by the, well, data protection laws. And then we have some other attributes derived from those things. And it’s very important to, well, monitor them and also ensure that there is no kind of bias, there is no discrimination against those things. It’s not easy, and this is what we’ve spent quite a bit of time on at the company. Well, at Synthesized as well.

Nicolai (33:27):
But also, well, I guess the important question is, how do we incentivize other companies to do the same? So, how do we incentivize companies to ensure that the datasets are unbiased, that the algorithms are fair? Again, we believe that there are big outside incentives for companies to self-regulate, because again, it’s essentially good for a business’s reputation amongst clients if the clients of that business know that the company takes fairness and biases extremely seriously when making decisions. And also, it’s a very good reputation for current and future employees.

Nicolai (34:05):
And, again, because people actually started looking [inaudible 00:34:10]. And if they know that this business is really careful with all these issues, then it’s definitely good for their reputation as well. Another thing which is quite … it’s also inevitable that companies are going to … well, the regulations are going to be stricter and stricter, and companies are going to invest more capital into these features. And it’s really important to start investing in this area right now, because it might be a little bit too late, say in two to three years. Because again, there are going to be many other things to essentially invest in. And it’s really important to essentially design the technologies which take fairness and biases at their core, as opposed to just trying to fix the stuff which they’re developing right now.

Nicolai (34:56):
So just to give an example, say a data warehouse or data lake, or say a data infrastructure layer. So imagine a big insurance company right now designs something without thinking about privacy, biases, fairness. And then in two to three years they realize, “Okay, so how can that … right now, this entire infrastructure is not compatible with the data protection law.” So it’s very important when designing those big systems to treat privacy, biases, fairness, and incorporate them, well, the relevant tools, at the core of the technology. So really ensuring that we are going to be compliant with best practices.

Nicolai (35:37):
So it’s definitely a real important sector, an area to invest for many, many businesses. And we are talking about, again, like insurance companies, banks, healthcare institutions, governments. We really expect data architects, data engineers and chief data officers, chief technology officers to start looking into this area, well, more and more. So that’s definitely, definitely coming.

Jeremie (36:02):
Is there a question? Because from what I remember of GDPR, again, from panicking about it back in the day, there’s a pretty big scale dependency for the companies. Companies at a certain scale have obviously, higher reporting requirements and higher enforcement requirements, so on. Can you speak to that a little bit? When do a lot of these things kick in?

Nicolai (36:21):
I would definitely say that, also thinking about the growth of the companies, to start looking into this as soon as possible. Because again, hey, if you have growth potential and well, today, you have say, I don’t know, 100 employees, 100 data scientists, in the next two years, this number might double. And it may triple. So it’s very important to really incorporate these principles at the beginning.

Nicolai (36:46):
And, it’s also not very hard. Because with our solution, so what we’ve designed actually, it enables companies to become, in a way, GDPR compliant in just under three to five days. So you can just say, “Hey, man, you just have this entire data access platform and that’s it.” So we essentially take all the risks away and make the company efficient again. And there is not only a huge risk-saving, but it’s also a productivity gain. It’s also saving some of the data engineering because you know that you don’t really need to spend time on your data engineers, your [inaudible 00:37:21], your, well, legal departments to look into this area, because it’s all ready by default. So the privacy, the fairness is by design of the system.

Nicolai (37:32):
And again, many interesting solutions exist in the market, and we are definitely pushing into this area of the game, like it’s actually possible to become GDPR compliant, and essentially making sure that those principles are at the core of the technology, right? As opposed to trying to fix the past, we’re really, well, allowing the companies and businesses to think about the future, and how the entire, again, like the market is going to develop, and really incorporate best practices into the data infrastructure right now.

Jeremie (38:05):
Yeah, I guess one of the challenges of doing this too, is it seems like a one-size-fits-all approach would be kind of challenging to pull off. Because let’s say I secretly am Facebook, and I secretly want to use your age to decide which ad to show you. And I shouldn’t be able to do that under GDPR, or whatever other regulation, but I can always leverage a whole bunch of other proxies, as you mentioned, like there are things that are related to age, but that I could still defend as having to use in the context of some other totally valid use case. And from one company to the next, figuring out where to draw that line seems like it would be just so hard for so many features, let alone interactions between features.

Jeremie (38:43):
Because, if I have one feature that tells me 10% of the information about your age, and then I’ve got another feature that overlaps, tells me like 30%, and then another, eventually, I could use those, but any one of those features could seem totally defensible on its own. How do you think about that problem? And how do you approach it, if you do at this stage, with companies?

Nicolai (39:05):
Yeah, absolutely. And yeah, that’s why I kind of feel like we’ve spent more than two years on developing the platform. And it’s really ensuring that not only do we check for the sensitive attributes, but also we find all the things where kind of the bias can propagate. And again, like the simple example is location, right? Even if you control all the legally protected attributes, it’s still possible to get some leakage of the kind of bias into the location, and so it’s important to check for those things. And yeah, this is what we have been doing as well. And there are various ways how to do so; like again, some correlation analysis, and again, you can do some modeling as well. See, okay, how much I can understand [inaudible 00:39:44] importance when building models and see again, like very slight kind of correlation between different elements in their dataset as well. And yeah, this is what we’ve been focusing on a lot recently.
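
As a rough sketch of the proxy-leakage check Nicolai alludes to (correlation analysis plus modeling with feature importances), the snippet below trains a simple classifier to predict a protected attribute from the remaining columns: if it does much better than chance, some features are acting as proxies, and the feature importances point at the likely culprits. The function and column names are hypothetical, and this is only one crude way to probe for leakage, not Synthesized’s implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def find_proxy_features(df: pd.DataFrame, protected_col: str):
    """Predict the protected attribute from everything else. A cross-validated
    accuracy well above chance suggests other columns (e.g. location) are
    leaking the protected attribute; importances rank the likely proxies."""
    X = pd.get_dummies(df.drop(columns=[protected_col]))  # crude encoding of categoricals
    y = df[protected_col]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    leak_score = cross_val_score(clf, X, y, cv=3).mean()
    clf.fit(X, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return leak_score, importances.sort_values(ascending=False)

# e.g. score, proxies = find_proxy_features(customers_df, "age_band")
# where customers_df and "age_band" are placeholders for your own data.
```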

Jeremie (39:56):
Do you think that the big picture questions, like from a philosophical standpoint, are actually answerable? Do you think it’s actually possible to philosophically go through and say, you know, we’re going to pull out all of the age dependency in this data set? I guess the reason I’m asking is, it seems like even end users might not want that. It seems like an almost irreducibly fuzzy problem, but maybe my instincts are off on this.

Nicolai (40:27):
Yeah, absolutely. So I think you’ve touched on a very important issue, which is, the users may well sometimes want to sacrifice some information just to get better services. Again, like healthcare, we’re talking about insurance, business, finance. I mean, to be honest personally, I’m okay with giving my data away, if it helps to essentially build better models and better solutions. Even though I understand that unfortunately, many healthcare institutions, they don’t really know how to stay compliant, well, are still kind of figuring out how to stay compliant with GDPR and many other, well, with HIPAA in the US and CCPA. Am I okay with giving my data to the healthcare? Oh yes, definitely. I mean, if it helps build better solutions, absolutely. Please, please do it.

Nicolai (41:18):
In this case, we can actually see that even if I’ve given consent, it’s still difficult to essentially use my data. But yes, it’s definitely an important issue.

Jeremie (41:32):
Are you generally optimistic about, from a policy standpoint, our ability to keep up with this stuff and roll out policies that actually makes sense for end users and for companies? I’m thinking, especially as the technology gets more and more advanced. We’ve talked about these deepfakes and unlocking all these new applications, some of which are disturbing and shocking of old data. I expect this is going to continue, but do you think policy is going to be able to keep up with this in the coming decades?

Nicolai (42:03):
I would say it’s inevitable. The question is when? But yeah. So definitely. Again, like right now, even in terms of fairness and biases, again, the regulator started to look into this. And again, like many, I think it’s also very, very important to distinguish between external policies and also internal … well, data governance policies within the bank, right?

Jeremie (42:24):
Mm-hmm (affirmative).

Nicolai (42:26):
And with bigger banks, so they actually have much stricter policies than say, GDPR when we talk about data privacy. And when GDPR came into force two years ago, many banks had already had much stricter policies internally.

Jeremie (42:40):
Right.

Nicolai (42:42):
And it’s more about, “Okay. Yeah, that makes sense.” And many forward-looking businesses have already started investing into the obvious issues around privacy and bias, especially in the insurance space, especially in the healthcare space. Governments as well. So it’s definitely something which is very important, because those guys are going to be … well, are going to be accountable. So especially, if you’re talking about the government and healthcare, for any discriminatory decisions they would make.

Nicolai (43:11):
And even if the regulators are not there, there is going to be some reputational risk, there are going to be some other issues involved. And hence, it’s very important, even independently of forthcoming regulations, to start investing in this area. Because, yeah, I think you’re right that the public is becoming more and more aware of the stakes. And it’s essentially driving the innovation within the businesses. We have many banks right now actually investing a lot, well, in fairness and, well, privacy-preserving technologies and really developing internal solutions as well.

Nicolai (43:46):
And the banks in the US as well, even though GDPR doesn’t apply. But banks understand that it’s inevitable. So it’s not really a question right now. And again, there’s no way people are going to say, “We don’t care about data being unbiased, we don’t care about algorithms being fair.” So it’s definitely, definitely coming. And it’s very important to start looking into this very seriously.

Jeremie (44:12):
It’s interesting, the role the private sector is playing in anticipatory, almost pre-regulation of its own activities. It’s an angle that seems like it’s going to be pretty relevant, especially since these private sector companies are going to be the ones coming up with the algorithms. They’ll know what the future looks like in many cases before government.

Nicolai (44:30):
Absolutely.

Jeremie (44:31):
Awesome. Well, thanks, Nicolai. This was a lot of fun. Actually, before I let you go, I just want to ask, do you have any links that you can recommend for people who want to learn more, either about Synthesized’s platform, or, I know you’ve done a lot of blogging too on bias and fairness in AI on your website, so maybe that’d be a nice link to share.

Nicolai (44:49):
Yeah, absolutely. So we do publish a lot about fairness, biases, algorithmic biases, biases and data, also about privacy-preserving technologies. And our focus is both on essentially, data scientists and engineers, but also on well, C level executives within a big corporate. On policymakers as well. So we try to well, I would say approach this problem from both angles. So one of the things we briefly touched on is the so-called synthetic data. So we essentially write a lot about it.

Jeremie (45:21):
Okay. So one other area I did want to ask you about is the whole area, a rich area of synthetic data. And it’s something that I think I’ve heard a lot of talk about. I’ve heard a lot of companies discussing, “Maybe we can take this approach.” I’d love to get your sense of first off, what is synthetic data? And then what’s its overlap with privacy and with regulation?

Nicolai (45:43):
Absolutely. Well, whilst I would say there are lots of different definitions of synthetic data, it’s very important to, I guess, well define what we mean by truly synthetic data, even like synthesized data. So in a way, it’s the result of this so-called generative model, which learns what original data should look like, and creates a completely new simulated data point which has the same flavor, taste and smell. It smells like original data, but it’s just not original data.

Nicolai (46:12):
And again, as opposed to anonymized datasets, which are just essentially a one-to-one mapping from, say, an original data point to, say, a new data point which is typically obfuscated or, say, masked. So synthesized or truly synthetic data is something which is designed by the ML system, by a generative model, and it has the same statistical properties. Again, if we talk about, well, statistical terms, the same statistical properties as the original data points. And some of the business benefits are really around, well, building better solutions, because you can essentially, in a way, create a variety of examples, similar examples, which do not exist in reality.
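
To make the distinction concrete, here is a deliberately tiny “generative model”: it fits a multivariate normal to the numeric columns of a table and samples brand-new rows that preserve the means and covariances without reproducing any original record. Real synthetic-data systems (including, presumably, Synthesized’s) use far more capable generative models that also handle categorical fields, rare classes and so on; this is only a sketch of the underlying idea.

```python
import numpy as np
import pandas as pd

def toy_synthesize(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Fit a multivariate normal to the numeric columns and sample fresh rows.
    The output matches the original means and covariances (the 'same
    statistical properties' in a very weak sense) but contains no real records."""
    rng = np.random.default_rng(seed)
    numeric = df.select_dtypes(include="number")
    mean = numeric.mean().to_numpy()
    cov = numeric.cov().to_numpy()
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=numeric.columns)

# e.g. synthetic_df = toy_synthesize(claims_df, n_rows=10_000)
# where claims_df is a placeholder for a real (private) table.
```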

Nicolai (46:53):
Think about, say fraud, so fraud detection. So, what if you create a completely new class of fraudsters, and you’re able to backtest that fraud detection system in a much more efficient manner? And this is, again, one of the key benefits of, well, truly synthetic or synthesized datasets, which is you’re able to create, again, a completely new and larger variety of representative examples.

Nicolai (47:17):
I think another, I would say useful benefit is definitely facilitating innovation, because it allows you to innovate, keep original data locked away, but at the same time, without destroying the quality, facilitate innovation by means of simulated datasets. And again, as opposed to anonymizing data, as opposed to scrambling data, we create completely … well, really high quality datasets which have the same properties as the original data, and hence facilitate the innovation in the healthcare space, in the insurance space, in the banking space. But at the same time, respecting privacy of individuals. And obviously, staying compliant with GDPR and [inaudible 00:47:58] is another key, important benefit of synthetic or synthesized datasets.

Jeremie (48:03):
And so, is this the main strategy you used at Synthesized where I guess, as the name implies, it’s mostly about generating synthetic datasets, so that then you can send that synthetic data around and not risk leaking real people’s anonymized data?

Nicolai (48:16):
So what we’ve built is the entire data platform, but, well, a very important part of the platform is really this technology. So the ML core, which is able to essentially create a variety of simulated examples, again, for development and testing purposes, to facilitate innovation in that privacy-preserving, compliant manner. But there is a much bigger wave of technology coming right now, which is the so-called simulations and, well, data synchronization. And we write a lot about it.

Nicolai (48:47):
And well, this is essentially what the platform is able to do. And as part of the product release, so enable again, any data scientist, any ML engineer, any test engineer to essentially, well, again, upload the dataset, understand all biases and mitigate those biases with the so-called data synthesis technology, the technology we’ve developed. And this is going to be available to the public as well.

Jeremie (49:09):
Okay. All that, that’s great. The website is synthesized.io, I believe? Is that-

Nicolai (49:10):
Yeah, it’s synthesized.io. And you can also, well, connect to me on LinkedIn. And well, we also publish a lot on Twitter. And yeah, please just connect. And … yeah.

Jeremie (49:28):
Great. Lots of stuff there for people to check out for sure. And thanks, Nicolai, so much for joining me for the podcast.

Nicolai (49:34):
Thanks for having me. A pleasure. Mm-hmm (affirmative).
