
With Big Data Comes Big Responsibility

A look at how bias and intellectual dishonesty can affect data analysis


Photo by Isaac Smith on Unsplash

Over the past decade, powerful new tools have enabled organizations to gather and analyze data at much lower cost, giving them previously unimaginable power to predict outcomes and behaviors. Given the granular data and raw computing power now available, data professionals can build models that predict patterns of consumption, travel, and behavior with ever-increasing accuracy. The ethical implications are many, and far more complicated and nuanced than this article can cover. This article addresses one of them: honest engagement with data analysis.

Thanks to the nexus of availability and computability of data, institutions are turning to analysis and regression to solve a larger share of problems and answer a larger share of questions. This places responsibility on the shoulders of those who work with data to provide meaningful, but more importantly, accurate and unbiased analysis to stakeholders.

To meet this standard, a data professional needs to avoid unrigorous statistical methods and inaccurate modeling, yes, but they also need to be statistically and intellectually honest. There are many avenues through which a data professional can end up being dishonest, whether intentionally or unintentionally, and being aware of them helps avoid drawing fallacious conclusions.

Data collection

The first, and possibly most consequential, mistake a data professional can make is collecting data improperly or using data that was improperly collected. This can mean gathering data from unrepresentative samples as a result of selection bias, or misrepresenting your target population. If the individuals sampled in a dataset don’t reflect the population you’re seeking to understand, your study can lead to misleading conclusions.

A common pitfall is selecting elements of a population that skew in favor of the experiment’s hypothesis. It’s therefore important to ensure that a sample is randomly chosen and not subject to common selection biases. Similarly, if you’re using an existing dataset, verify how the data was collected to ensure it’s appropriate for making predictions about your target population.
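As a rough illustration of random selection, the sketch below draws a simple random sample and compares group shares in the sample against the population. The population, the `region` field, and the helper names are all hypothetical, made up for this example. A real check would use whatever attributes matter for your target population.

```python
import random
from collections import Counter

def simple_random_sample(population, k, seed=0):
    """Draw k records without replacement; every record is equally likely."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(population, k)

def group_shares(records, key):
    """Return the fraction of records falling in each group of `key`."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {group: count / total for group, count in counts.items()}

# Hypothetical population: 70% urban, 30% rural (illustrative data only).
population = [
    {"id": i, "region": "urban" if i % 10 < 7 else "rural"}
    for i in range(10_000)
]

sample = simple_random_sample(population, 1_000, seed=42)
pop_shares = group_shares(population, "region")
sample_shares = group_shares(sample, "region")

# Largest gap between population and sample share for any group.
max_gap = max(
    abs(pop_shares[g] - sample_shares.get(g, 0.0)) for g in pop_shares
)
print(pop_shares, sample_shares, round(max_gap, 3))
```

A large gap on an attribute you care about is a hint that the sample may not represent the population, although a small gap on one attribute does not prove the sample is unbiased on others.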

Also worth noting: when data collection involves surveying people, be wary of biased or leading questions. These can heavily skew the data you receive and lead to conclusions that are not based in truth. Likewise, if you’re conducting the surveys personally, be sure not to let respondents know what hypothesis you’re hoping to prove or disprove, as that can bias their responses and similarly cloud your data.

Photo by CDC on Unsplash

Data cleaning

When prepping data for analysis, the ever-loathed process of data cleaning can cause issues if not handled with care. A first step in data cleaning is exploratory data analysis (EDA). This involves looking for missing or incorrect values in your data, visualizing elements of the data to gain insight into its distribution, and checking for outliers.
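Two of these EDA checks can be sketched with the standard library alone: counting missing values and flagging potential outliers. The `ages` list is invented for illustration, missing values are assumed to be encoded as `None`, and the 1.5 × IQR rule is just one common convention for flagging points to inspect, not a rule the article prescribes.

```python
import statistics

def count_missing(values):
    """Count entries recorded as None (assumed encoding for 'missing')."""
    return sum(1 for v in values if v is None)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for manual review."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

# Hypothetical survey ages with two missing entries and one data-entry error.
ages = [34, 29, None, 41, 38, 27, 33, 410, 36, None, 31, 30]

print("missing:", count_missing(ages))
print("flagged:", iqr_outliers(ages))
```

Flagging is only the first step. Whether a flagged value is an error or a genuine extreme is a judgment that should rest on how the data was collected.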

After EDA is complete, we generally deal with missing and extreme values next. If outliers are evidently the result of erroneous data collection, that data can be dropped to allow modeling to properly capture the majority of the data. However, it is important not to overuse this power, and never to drop values because they contradict your hypothesis. Be mindful of the ways your handling of missing and extreme values could change your model and affect the conclusions you draw. Try to keep everything as neutral as possible.
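One way to stay mindful of that impact is to report results with and without the dropped values, using an exclusion rule that is based on plausibility rather than on the hypothesis. The sketch below assumes a hypothetical rule (no recorded age above 120 can be real) and made-up data. The function names are illustrative.

```python
import statistics

def summarize(values):
    """Basic summary used to compare the data before and after exclusions."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def sensitivity_report(values, is_error):
    """Summarize all values and the kept values, and list what was dropped."""
    kept = [v for v in values if not is_error(v)]
    dropped = [v for v in values if is_error(v)]
    return {"all": summarize(values), "kept": summarize(kept), "dropped": dropped}

# Hypothetical ages; 410 is assumed to be a data-entry error.
ages = [34, 29, 41, 38, 27, 33, 410, 36, 31, 30]

# Exclusion rule fixed in advance on plausibility grounds, not on results.
report = sensitivity_report(ages, is_error=lambda age: age > 120)
print(report)
```

Keeping the list of dropped values alongside both summaries makes the decision auditable: a reader can see exactly what was removed and how much it moved the result.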

Photo by Clay Banks on Unsplash

Data modeling

Modeling techniques such as linear and logistic regression, decision trees, gradient boosting, neural networks, and others allow for very fast and effective modeling of data. It may not be as obvious, but there is still room for bias to creep in at this stage. When selecting a model and its hyperparameters, it’s important not to contort the model so that the data fits your hypothesis. The goal should be to uncover the truth within the data, find what the data says objectively, and only then ask whether the original hypothesis should be confirmed or rejected. In the case of predictive modeling, avoiding personal bias in how you define model features is keenly important as well.
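A common guard against tuning a model toward a preferred answer is to fix the candidate settings in advance, choose among them on validation data, and evaluate on a held-out test set only once. The following is a minimal sketch of that workflow, assuming synthetic data and a toy one-dimensional k-nearest-neighbors regressor. Every name and number here is hypothetical and chosen only to keep the example self-contained.

```python
import random

def split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, rows, k):
    """Mean squared error of k-NN predictions on `rows`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data: y = 2x plus noise (illustrative only).
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

candidates = [1, 3, 5, 9]  # decided before looking at any results
train, val, test = split(data, seed=7)

val_errors = {k: mse(train, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# The test set is touched exactly once, after the choice is made.
test_error = mse(train, test, best_k)
print(best_k, round(test_error, 3))
```

The point of the design is the order of operations: because the test set plays no part in choosing `best_k`, the reported test error is not shaped by the analyst's preferences.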

And it’s worth stating the obvious here: if your data is biased, your model will be biased. Spend time considering how and from where your data was collected and what impact on your model that will have.

Presentation

This last area is one that would likely feel patronizing to include if this were written years ago, but given the wave of anti-science and anti-intellectualism that has taken the globe by storm (or taken the "disk" if you’re already victim to this), it would feel naive to not include abuse that can be perpetrated through misleading or false representations of your findings.

Whether it’s by exaggeration, truth bending, or outright lying, it is deeply unethical and harmful to present a conclusion different from what the analysis actually found. In an organization, this can lead to bad decision making. When shared with the public, it can lead to the proliferation of fake news. (And there is strong evidence suggesting that false and misleading information spreads faster than the truth.) We can see clear examples of the harm done by claiming that data supports harmful or hateful worldviews by looking at the horrors of eugenics, or at the manipulated data used to justify "broken windows policing."

By avoiding these pitfalls and making a rigorous, honest effort to find the truth within data, rather than the best way to confirm personal bias or the bias of stakeholders, data professionals can ethically and economically help organizations eliminate waste, improve efficiency, and empower their members.

Thanks for reading. If you found this article helpful, please feel free to share it with your network. For questions, clarifications, criticisms, or inquiries, get in touch on my LinkedIn.

