Opinion

The Data Dilemma

Think Ethically from the Beginning

David Bruce
Towards Data Science
Nov 3, 2020


As an aspiring data scientist currently immersed in a data science bootcamp, I keep asking myself: how can I work ethically from the get-go? That question pervaded my mind as I watched The Social Dilemma and listened to Tristan Harris and Cathy O’Neil, among many others, discuss the dangers of big data and the unmanned, black-box algorithms that penetrate our subconscious minds and make decisions for us that we no longer understand. The stories of these whistleblowers, loosely grouped under the informal label of the “Conscience of Silicon Valley,” are disturbing. But they aren’t wrong, and they’re speaking with the authority of hindsight. Many of the data scientists and programmers interviewed throughout the documentary mentioned their lack of foresight into what these algorithms could become, or admitted that when an ethical dilemma was posed, the financial bottom line was the knockout punch. In other words, it was too late. So my question again is:

“How can I (we) enter the field of data science, knowing what we know now, with the benefit of the lessons learned by those before us?”


My Personal Dilemma

To be clear, this is not a movie review. I really appreciated the film and learned a lot from it, and the only thing more cringeworthy than the title of this article was the dramatizations interspersed between some of the most interesting interviews with some of the most interesting people in tech I can think of. This post is, instead, a personal reckoning with the dangers of the growing power of the data science industry, and a roadmap for myself (and maybe others) to keep in mind as I seek to be responsible with the tools given to me. What tools and guiding principles do I need to continue down this path? Listening to the harrowing discussions of what big data is doing behind every click and scroll is daunting, but I’m reinvigorated by the ideas of humane technology and how tech can and should be used for purposes greater than profiteering.

I’m personally under no delusions of grandeur, and I don’t presume that I’ll be putting into production the next model that significantly alters the way society runs anytime soon, or ever for that matter. But I would hate to one day find myself wishing I had asked these questions of myself, and of others, sooner. Power, from my point of view, has a unique ability to change us, more often than not in ways we don’t like, unless we stay grounded in reality with checks and balances along the way. Without getting too philosophical, I’ve taken the liberty of researching some other points of view on data ethics and compiled a shortlist of the most important considerations. If you’re more interested in learning about scary black boxes and the ways tech companies are destroying the fabric of society, I absolutely recommend giving The Social Dilemma a watch, or reading Cathy O’Neil’s book Weapons of Math Destruction. They will open your eyes to powers you are probably already aware of, and help you understand how they work.

Where is my data coming from?

This is a multifaceted question about the origins and quality of our data, and a whole host of questions derive from this initial idea of data collection. My classmate, Sidney Kung, opened my eyes to the ethics of data collection and data privacy. Based on her findings, collectors of private or personal data are held to higher standards elsewhere in the world, while here in the US, companies store endless amounts of data with no intent or plan other than “it may come in handy one day.” The truth is, the models we make are only as good as the information we feed them and the metrics by which we judge them. The phrase “garbage in, garbage out” is more than applicable here. If a model is trained on an already broken system, the model will continue to predict or classify in a similar fashion, and will potentially exacerbate those biases as they become more and more mechanically consistent.
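A minimal, purely hypothetical sketch of what “garbage in, garbage out” can look like in practice: if the historical labels a model is trained on already differ sharply across groups, the model will tend to learn and reproduce that disparity. The data and column names below are made up for illustration.

import pandas as pd

# Hypothetical historical data: the labels record past decisions,
# so any disparity here is exactly what a model fit to them will learn.
train = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Base rate of the positive label per group in the training data.
print(train.groupby("group")["label"].mean())
# A: 0.75, B: 0.25 -- a model trained on these labels will tend to
# reproduce this gap, however "objective" the pipeline looks.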

How is my model being used, and what are the consequences of being wrong?

One of the best questions we can ask ourselves is: what is the cost of being wrong? The most common examples involve type 1 and type 2 errors in the context of the justice system and the medical field. Is it more costly to commit a type 1 or a type 2 error in each of these contexts? In the justice system, which is supposed to operate under the presumption of “innocent until proven guilty,” it is in theory better to commit a type 2 error, a “false negative,” than to charge an innocent person with a crime they didn’t commit. In the medical realm, on the other hand, a type 2 error is much costlier: it is better to flag an illness that isn’t actually there than to miss one that really is. But each problem, business or otherwise, has its own nuances and its own cost/benefit analysis. Risk factors and the implications of your research or modeling should be assessed from the beginning. The answers to these questions inevitably inform our processes and help us arrive at the results we are ultimately looking for.
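One minimal way to make that cost/benefit analysis concrete (a hypothetical sketch, not something from the film or the bootcamp) is to attach explicit costs to false positives and false negatives when evaluating a classifier. The cost values below are made up; in practice they come from the domain.

import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    # Average cost per prediction given explicit, domain-specific error costs.
    # Here a false negative is (hypothetically) ten times worse than a false
    # positive, as it might be in a medical screening setting.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp * cost_fp + fn * cost_fn) / len(y_true)

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])
model_a = np.array([1, 1, 0, 0, 0, 0, 0, 1])  # one false negative, one false positive
model_b = np.array([1, 1, 1, 1, 0, 0, 0, 1])  # no false negatives, two false positives

# Both models have the same accuracy (6/8), but very different costs
# once the asymmetry between error types is taken into account.
print(expected_cost(y_true, model_a))  # 1.375
print(expected_cost(y_true, model_b))  # 0.25

Whether that asymmetry is the right one is exactly the kind of question to settle before the modeling starts, not after.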

Where is my model showing bias, and what can I learn from that?

One concept several of my instructors at the Flatiron School have repeated since virtually day one is that “the absence of data is often data in and of itself.” What might we conclude or infer from null or missing values in our dataset? (Again, perhaps this harkens back to where our data is coming from.) Without laughing at my tin hat, asking questions like “Is this dataset complete? Has it been tampered with or even censored? Who stands to gain from analysis of this data?” can help you understand your dataset, and should be considered part of any exploratory data analysis step (a quick sketch of one such missingness check follows below). Eric Siegel wrote a very helpful article on the use of algorithms in predictive policing wherein he confronts the observed bias of an algorithm known as COMPAS and states:

“To that end, let’s educate and guide law enforcement decision makers on the observed inequity.”
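Returning to the missing-data point above, here is a minimal sketch of the kind of check I mean. The file and column names are hypothetical; the idea is simply to ask where the gaps are and whether they cluster around a particular group.

import pandas as pd

def missingness_report(df, group_col=None):
    # Fraction of missing values per column, optionally broken out by group.
    # If missingness is concentrated in one group (one precinct, one
    # demographic), that is itself a finding worth investigating.
    if group_col is None:
        return df.isna().mean().sort_values(ascending=False)
    return df.groupby(group_col).apply(lambda g: g.isna().mean())

# Hypothetical usage -- "arrests.csv" and "precinct" are made-up names:
# df = pd.read_csv("arrests.csv")
# print(missingness_report(df))                       # overall
# print(missingness_report(df, group_col="precinct")) # by group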

Again, Cathy O’Neil (who I think is my new hero, based on the number of times I mention her in this post) is trailblazing in the way she has set out to audit algorithms. Her work at O’Neil Risk Consulting & Algorithmic Auditing (ORCAA) is paving the way for lawmakers to set boundaries on what an algorithm can and cannot be allowed to do given certain biases.

Is there more than one way to interpret my results?

The answer is probably yes, and because of that, what does it mean? Feedback is important in any field, and having more than your own eyes and mind on a problem is crucial. Given the amount of subjectivity in a field that prides itself on objectivity, collaborating will always help us look at problems from more than one angle, and understanding that my interpretation is not the only possible one is an important step in asking questions and going deeper.

The deeper I go into the field, the more subjectivity, chance, and human decision-making I see embedded at every turn of the model-building process, and the quote attributed to statistician George Box, whose name I only know because of this quote, makes more and more sense to me: “all models are wrong, but some are useful.” I hope that putting these thoughts and questions in writing will help solidify them in my mind, help me build useful models, and give you more questions than you came with.

Some links to those doing the hard work of regulating these powerful tools:

Center for Humane Technology

Algorithm Watch

ORCAA
