Photo credit: Serg from the Swarmrobot.org project via Wikipedia

Data Science’s Reproducibility Crisis

Zach Scott
Towards Data Science
6 min read · May 17, 2018


What is Reproducibility in Data Science and Why Should We Care?

Hot on the heels of Joelle Pineau's brilliant talk on Reproducibility, Reusability, and Robustness in Deep Reinforcement Learning at this year's International Conference on Learning Representations (ICLR), it seems like everyone in the data science world (or at least in the data science research world) is talking about replicability and reproducibility.

This problem isn't unique to data science. In fact, according to a 2016 survey by Nature magazine, most scientific fields are facing a reproducibility crisis. Ironically, survey respondents cited insufficient statistical knowledge as one of the most important factors driving that crisis. This result is likely influenced, at least in part, by the high number of respondents from biology and medicine (906 of 1,500), fields where training in the relevant statistics is often suboptimal. One would hope that data science and machine learning practitioners, given the nature of their occupations, have a higher general level of statistical training. Yet data science still faces reproducibility challenges despite its emphasis on statistics and deterministic modeling, and those challenges often highlight the same structural and organizational forces driving the reproducibility crisis in most other scientific fields.

What is Reproducibility?

Before we even get to addressing reproducibility in data science, we need to start with a firm definition. Chris Drummond argues that many discussions of “reproducibility” are actually centered around “replicability” (some refer to this latter attribute as “repeatability”). In his view, replicability is the ability of another person to produce the same results using the same tools and the same data. In a computational field like data science, this goal is frequently trivial in ways that do not hold for “real-world” research. Anyone can fork an open-access repository and run the exact same code using the same data and get the same result. Laboratory environments are rarely so perfectly replicable, which means experimental replication often involves some low-level perturbation of experimental parameters. Usually, even identically replicating someone else’s laboratory work means ordering raw materials from the same source, reformulating their reagents, finding similar equipment at your institution, and following their methods as closely as the publication allows. As bench scientists say, “I replicated this experiment for my own research, and the method also works in my hands.”

But the fidelity of experimental replication differs between laboratory and computational disciplines. The fidelity of computational replication is generally expected to be extremely high: if another researcher applies the same code to the same data, a deterministic algorithm should produce the same, or very nearly the same, results. Most open source projects already meet this replicability requirement, so stopping at this level of experimental reproduction is trivial for most of the meaningful research in the field. Despite its triviality, though, this sort of exercise can still be critically important as a positive control for practitioners rolling out a new tool or algorithm.
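To make that concrete, here is a minimal sketch of such a positive-control check, assuming a scikit-learn pipeline and the built-in iris dataset purely for illustration (neither comes from any of the work discussed above). Running the same deterministic pipeline twice on the same data with the same seed should yield an identical fingerprint.

```python
# Minimal sketch of a replication "positive control": the same code, the same
# data, and the same seed should produce bit-for-bit identical results.
# The dataset, model, and seed are illustrative assumptions, not taken from
# any study referenced in this article.
import hashlib

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def run_pipeline(seed: int = 0) -> str:
    """Train with a fixed seed and return a hash of the cross-validation scores."""
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(model, X, y, cv=5)
    # A hash of the score array is a compact fingerprint for comparing runs.
    return hashlib.sha256(scores.tobytes()).hexdigest()


if __name__ == "__main__":
    first, second = run_pipeline(seed=0), run_pipeline(seed=0)
    print("replicated" if first == second else "results diverged")
```

The same idea scales up in practice to hashing serialized predictions or model weights, with package versions pinned so that the environment itself becomes part of the replication.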

Conversely, in Drummond’s view, reproducibility involves more experimental variation. We can think of experimental reproduction as an activity that exists on a continuum from near-perfect similarity to complete dissimilarity. On the high-fidelity end of the scale, we have a forked project re-executed with no changes. On the other end of the scale, we have the sort of nonsense normally reserved for recipe reviews on cooking blogs. “I didn’t have any flour for this bread recipe, so I substituted ground beef, and it tasted awful!” In this view, experimental replication in a laboratory experiment looks more like reproduction in a computational experiment.

Good reproduction is about finding a middle ground between replication and irrelevance.
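To illustrate what that middle ground might look like in practice, here is a hedged sketch: instead of replaying one frozen configuration, deliberately perturb the experimental parameters (here, the random seed behind the train/test split) and ask whether the qualitative claim still holds. The dataset, model, and the 0.9 accuracy threshold are illustrative assumptions, not drawn from any study discussed above.

```python
# Sketch of reproduction-with-variation: vary the experimental parameters and
# check whether the conclusion survives, rather than replaying one exact run.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

accuracies = []
for seed in range(10):  # perturb the split instead of repeating it exactly
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print(f"accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
# The claim "reproduces" only if it holds across the perturbations, not just once.
print("claim holds across seeds" if min(accuracies) > 0.9 else "claim is fragile")
```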

Why Should We Care?

Reproducible experiments are the foundation of every scientific field and, indeed, of the scientific method itself. Karl Popper said it best in The Logic of Scientific Discovery: "non-reproducible single occurrences are of no significance to science." If you're the only person in the world who can achieve a particular result, others may find it difficult to trust you, especially if they have spent time and effort attempting to reproduce your work. It is reckless and irresponsible to build a product or theory on a single unconfirmed anecdote, and presenting an anecdote as a reliable phenomenon can consume time and resources that would otherwise be spent on genuinely productive work.

Irreproducibility isn’t always malicious or even willful, but it is rarely positive in a scientific field. The effectiveness of scientific contributions lies in their usefulness as a tool or perspective for others to apply to their own problems. We admire researchers who solve problems that we have found intractable or who produce tools to address a dilemma we have struggled with. And as scientists, we should strive to produce tools and ideas that help others accomplish their own goals. In doing so, we (hopefully) enrich our own success and professional standing.

If our standards of reproducibility are lacking or if we fall into the trap of implementing the talismans of reproducibility without regard for their true purpose, we risk wasting our own, and everyone else’s, time. Science is about continuity of thought beyond a single practitioner. When we leave, for whatever reason, someone else should be able to pick up where we left off and continue producing new knowledge. Colleagues should be able to implement our ideas without us hovering over their shoulders.

Science is a way of bringing our unique experiences and interests to bear on the world so that they can help someone else pursue theirs. We can't always foresee how knowledge gained in pursuit of our own interests may help someone else, nor do we need to. We only need to do our best work, solving the problems we're interested in with reliable methods. Knowledge gained in ways that can't be reproduced helps no one and lacks the potential to ever do so. Without reproducible practices, we are simply wasting our own and everyone else's time.

Photo credit: rawpixel.com on Pexels.

Barriers to Data Science Reproducibility

Now that we have a basic framework for what reproducibility is and why it matters, we can start talking about how to improve it. There are several barriers driving the reproducibility crisis in data science, and some of them will be very difficult, if not impossible, to remove. Common laments include data and model availability, infrastructure, publication pressure, and industry standards, along with a host of less frequently discussed issues. Almost all of these problems have multiple, diverse drivers, each of which requires its own solution. And because we're data scientists talking about nebulous, complex concepts, it can help to start with one of our favorite tasks: classification.

Most problems have both "hard" and "soft" factors driving them. Hard drivers represent insurmountable barriers to execution. The availability of suitable infrastructure is a good example: sometimes you simply don't have enough storage or GPUs to reproduce someone else's work. Or maybe you can't access the clinical or commercial data behind a result because you can't obtain permission to use it.

Soft challenges, on the other hand, are problems with a notional solution that industry or professional pressures prevent you from pursuing. The quintessential example is the academic practitioner who would genuinely like to reproduce someone else's work but can't justify spending time on something that journals aren't interested in publishing.

In many cases, addressing the reproducibility challenges facing data scientists requires a nuanced understanding of multiple disparate fields. Most of these problems won't be solved with a single rule or policy, so sometimes the best available solution is simply to start discussing ways to improve the practice of data science and related analytical fields. As this series continues, I hope to take a deep dive into each of the biggest challenges behind the reproducibility crisis in data science and discuss potential solutions that we, as a new and unique industry, can take to address them.

Medical Informatician. I study how our data and the platforms that collect it affect our lives, our minds, and our health.