When you think of Data Science, one of the first things that comes to mind is Big Data. But the more you think about data, the more you might wonder what type of data is stored: full names, email addresses, phone numbers, or even someone’s home address. Depending on how specific the data is, it may even include gender, age, occupation, and so on. At first, I gave little thought to this collection, assuming it only served to improve things such as predictive analysis or targeted marketing.
My trending searches for Data Science this week brought up the ethics involved in the collection and use of data. As I said, some think the data is used to improve the intelligent guesses machine learning makes. Others, however, feel that the type of data being stored could be a breach of privacy. So I wanted to see the discussion from both sides. First, I wanted to hear about the potential concerns. Second, I wanted to see how data scientists attempt to address those concerns.
So, in today’s article, we’ll first discuss a few of the privacy concerns when looking at the use or collection of data, and then we’ll talk about a few ways to protect someone’s privacy when you do collect data along your journey as a Data Scientist.
User Consent
One of the first privacy concerns is whether the user consents to having their data collected. Recall that this first part is from the user’s point of view. For example, when you are on a website, your clicks or purchases could be tracked. Or perhaps you are using a popular app that uses your browser history to recommend apps for you. Especially with sensitive data, which could be as specific as a credit card number or as general as a first name, you’ll want to know what is being tracked.
Further, you may even want to know what the data is being used for. At the end of the day, it’s your data and your privacy, so you will want to know what information is being taken and why.
But some websites make this information difficult to find. If a website prompts you to accept or reject cookies, you typically have to click another link to review what exactly is being tracked. It may give you the option to decide what cookies can track, but going through the list could take even more time.
User Consent: The Solution
From the Data Scientist side, you already know what information you will track and the purpose behind collecting it. But you also need to gain user consent. For a website or an app, this can be simple to do. A "terms and conditions" page lays out the rules you enforce on your site or app, and an "accept cookies" prompt gives the user the choice of whether to let you track their data. In the cookie prompt and in your "terms of service", you should clearly state exactly what you track and why.
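To make the idea concrete, here is a minimal sketch of consent-gated tracking in Python. It assumes consent is recorded per purpose (for example "analytics" or "marketing"); the `ConsentRecord` and `Tracker` names are hypothetical and only for illustration, not an existing library.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    """What one user has agreed to (hypothetical structure)."""
    user_id: str
    # Purposes the user explicitly accepted, e.g. {"analytics"}.
    accepted_purposes: set = field(default_factory=set)


@dataclass
class Tracker:
    """Stores an event only when the user consented to its purpose."""
    events: list = field(default_factory=list)

    def track(self, consent: ConsentRecord, purpose: str, event: dict) -> bool:
        if purpose not in consent.accepted_purposes:
            return False  # no consent for this purpose, so the event is dropped
        self.events.append({"user_id": consent.user_id, "purpose": purpose, **event})
        return True


# The user accepted analytics tracking but not marketing.
consent = ConsentRecord(user_id="u1", accepted_purposes={"analytics"})
tracker = Tracker()
stored_analytics = tracker.track(consent, "analytics", {"page": "/home"})
stored_marketing = tracker.track(consent, "marketing", {"page": "/home"})
```

Here only the analytics event is stored. Tying each event to a stated purpose also makes it easier to tell the user exactly what you track and why.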
User Consent: The Dilemma
The app or website you are using may have written lengthy "terms of service" detailing everything it tracks and what you agree to, and it may have detailed what its cookies track before you accept them. However, as a user, do you bother to read either? Especially when they are lengthy, do you at least skim through what information may be tracked, or do you save yourself time and just accept? If you don’t read through, you’re not the only one. Many users don’t have the time or the patience to read long, detailed terms.
The dilemma is that you may provide all the information for your users to read, but they might just click "Accept cookies" or "accept terms of service" without ever really reading the terms. So, did your user give you consent? Yes, perhaps they did. But if they never read what they agreed to, it isn’t informed consent.
Privacy concerns for stolen data
When you track large amounts of data, there has to be some place to store it. Typically, that is one large database. But what happens if someone hacks that database? There’s always a privacy concern, and even with the best security team, you may still have breaches.
Every once in a while, a famous airline will warn its users that its databases were hacked and that credit card information could have been stolen. As a user, you have to worry about your information being sold to the wrong people, or about the wrong people finding a way to access your data anyway.
Avoiding Data Theft
As a Data Scientist, or anyone working in security or development, your goal is to keep the data as secure as possible. But things do happen sometimes, so it is also your job to recover as quickly as possible and to do damage control where needed.
Although security is the fix, it isn’t always the easiest solution. But if you plan to take user data, you have to be sure you’re protecting it as if it were your own.
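One small piece of protecting stored data can be sketched in code: pseudonymizing identifiers before they reach your analytics tables. The example below is only an illustration under my own assumptions. It uses a keyed hash (HMAC-SHA256) from Python’s standard library, and the key and record shown are made up. It is not a complete security solution and does not replace encryption or access control.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    # Normalize first so "Jane@x.com " and "jane@x.com" map to the same token.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real key would live in a secrets manager.
SECRET_KEY = b"example-key-do-not-use"

record = {"email": "Jane.Doe@example.com", "purchase_total": 42.50}

# Keep what the analysis needs and swap the email for a token.
safe_record = {
    "user_token": pseudonymize(record["email"], SECRET_KEY),
    "purchase_total": record["purchase_total"],
}
```

If this table were stolen, the thief would see tokens rather than email addresses. The tokens are only as safe as the key, though, so the key has to be stored separately from the data.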
The Ethics of Fairness and Algorithms
When a machine is making decisions, you have to give it a selection of data to learn from. However, a sample set, no matter how large, may favor some groups over others. For example, your dataset may include demographics that are not true to the population. If you examine a single neighborhood, you, or your machine learning model, may make assumptions about the whole population that are untrue. This could create biased algorithms.
One of the articles I read gives a good example of a biased algorithm. In "Data Science Ethics – What Could Go Wrong and How to Avoid It", Kylie Ying describes how Staples attempted to beat its competitors by offering cheaper prices and better deals. However, under the algorithm Staples used, those better prices ended up in wealthier neighborhoods, which meant richer neighborhoods received cheaper prices than poorer ones.
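A quick way to spot this kind of sampling problem is to compare each group’s share of your sample with its share of the population. The sketch below assumes you already know the population shares; the group names, the numbers, and the 5-point threshold are all invented for illustration.

```python
def representation_gaps(sample_counts: dict, population_shares: dict) -> dict:
    """Return (sample share - population share) for each group."""
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Made-up numbers: a sample drawn mostly from one kind of neighborhood.
sample_counts = {"urban": 800, "suburban": 150, "rural": 50}
population_shares = {"urban": 0.50, "suburban": 0.30, "rural": 0.20}

gaps = representation_gaps(sample_counts, population_shares)

# Flag groups whose share is off by more than 5 percentage points (arbitrary threshold).
flagged = {group: round(gap, 2) for group, gap in gaps.items() if abs(gap) > 0.05}
```

A positive gap means the group is over-represented in the sample and a negative gap means it is under-represented. With these numbers, all three groups are flagged, which suggests the sample should be rebalanced or reweighted before training.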
But fairness isn’t only about the algorithms or the populations. Being fair could also refer to the way data is being used. Ethics include how the data is used or sold. For example, it would be unethical to take someone’s sensitive data and sell it to untrustworthy parties. However, it is not necessarily unethical to sell generic interest-based data to companies targeting sales.
Depending on your dataset, the algorithm may also learn discriminatory assumptions. For example, in the same article, Kylie Ying mentions her video on the controversial role of algorithms used to help determine sentencing and parole. The algorithm was biased against Black defendants and in favor of white defendants, which shows a racial disparity in its results.
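One common way to check for this kind of disparity is to compare error rates across groups. The sketch below computes the false positive rate per group, meaning how often people who did not reoffend were still flagged as high risk. The article I read does not say which metric was used, so this choice is my assumption, and the toy records are invented.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted_high_risk, reoffended in records:
        if not reoffended:  # only people who did not reoffend can be false positives
            negatives[group] += 1
            if predicted_high_risk:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items() if n > 0}


# Invented toy data: (group, predicted_high_risk, reoffended)
records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", False, True),
]

rates = false_positive_rates(records)
disparity = rates["A"] - rates["B"]
```

In this toy data, group A is wrongly flagged twice as often as group B. A large gap like that is a signal to go back and look at the training data and the features the model relies on.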
Making the World of Data Collection a Better Place
As you can see, there are a few gray areas where data science, meaning machine learning and more specifically data collection, can be unethical. However, there are more ethical ways to approach it. The first is to practice transparency in the collection of data. That way, your user knows not only what data you are collecting, but also how you intend to use it. Next, protect privacy. Secure the data as best you can, as if it were your data and not a user’s. For both, you will need to collect carefully. Consider what data you need, so as not to take too much. Consider the context of the data, as it often changes the meaning. And finally, of course, incorporate inclusivity. This way, your algorithms will carry as little bias as possible, and your data will reflect the population and not only the sample.
Conclusion
Today, we looked into the world of ethics within Data Science. We learned about a few different concerns, including transparency (informed consent), security, disparities (whether racial or gender), and ensuring your algorithm does not contain biases. While looking at algorithms, we also looked at two examples of how they can become biased based on your datasets.
While Data Science, specifically data collection and machine learning, is not inherently unethical, there are still several practices you should be aware of before you dive in. I certainly feel like I learned more about the ethics surrounding data science, and why there could be better visibility. Hopefully, you learned something too, and hopefully, you found this article interesting to read. Until next time, cheers!
References:
Data Science Ethics in Practice
Data Science Ethics – What Could Go Wrong and How to Avoid It