
The Dark Side of Data Science

The paradigms of data analysis are rooted in the discredited dogmas of Positivism. Companies and institutions who do not primarily deal…

Author: Farcaster at English Wikipedia (Wikimedia Commons)

The problem with the prevailing paradigm of data analytics

The prevailing paradigm of data analytics is rooted in the philosophy of Logical Positivism. Institutions that do not primarily deal with data need a more flexible approach.

The prevailing paradigm of Data Analysis is buried deep in the arid soil of a philosophical school called Logical Positivism. The framework is characterized by three mythical tenets derived from Positivism:

  • Data are the only natural point of departure for all analysis
  • We ensure objectivity in our models by gathering data impartially
  • Data "speak". That is, they guide our minds in the construction of models.

As I discuss in my article "Myths of Modelling: Data Speak", Positivism – and, by association, its mythical beliefs – had been pretty thoroughly discredited by the 1960s. Unfortunately, as is often the case in the history of ideas, the counter-revolution over-compensated. Where the early revolutionaries would loosen the chains of narrow empiricism and open up a more enlightened dialogue between hypotheses and the data that inspire and regulate them, the next generation would throw empiricism out altogether. In the ensuing vacuity of common sense, practitioners had little choice but to crawl back to frameworks steeped in positivism.

As a result, the Wikipedia page on data analysis today starts with the following homage to narrow inductivism:

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

and is furnished with the pearl of a positivist representation at the top of the page, complete with an arrow from data analysis to models (and not the other way) and an activity called "raw" data, maximally removed from theory.

The problems of positivist data paradigms

Data Science has done just fine despite its dubious epistemological underpinnings, but many of its greatest successes took place in companies that primarily deal with data: companies that create data through transactions; companies that monetize data they have generated, whether by accident or by design; companies, in short, that seek out and accumulate analyses, motivated by their data.

But with the acceleration in "digital transformation", institutions that do not primarily deal with data are looking more and more to data to explain the world in which they operate and to address the business and policy problems they face. They seek out and accumulate data, motivated by their analyses.

Here the very success of Data Science can be its undoing, as the paradigms that have been applied with so much success to data looking for analyses run aground on Positivism’s epistemological sand banks when problems go looking for data.

In the following I am assuming we are in this latter situation: that we are analyzing data to a specific end, addressing a problem at hand, "informing conclusions" or "supporting decision-making".

Data are not impartial

The selection of data necessarily supposes relevance, and relevance entails some level of commitment to a narrative connecting the data to the problem at hand. Processing and cleaning of data suffer the same necessity; you cannot know what is signal and what is noise without some notion of relevance or expectation with regard to the significance of the data and how they might relate to the problem at hand. There are no data without conjecture.
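As a small illustration of this point, here is a minimal sketch in plain Python. The readings are invented toy data, and the two conjectures ("stable level" and "linear trend") as well as all function names are assumptions made for the example, not anything taken from a real project. The same ten readings are "cleaned" under each conjecture, and each conjecture nominates a different reading as the noise.

```python
# Hypothetical toy data: a steady upward trend whose last reading dips.
xs = list(range(10))
ys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 5]


def residuals_constant(ys):
    """Conjecture A: the quantity is stable, so readings should sit near the mean."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]


def residuals_trend(xs, ys):
    """Conjecture B: the quantity grows linearly (ordinary least-squares line)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def most_suspect(residuals):
    """Index of the reading that fits the conjecture worst."""
    return max(range(len(residuals)), key=lambda i: abs(residuals[i]))


suspect_if_stable = most_suspect(residuals_constant(ys))
suspect_if_trending = most_suspect(residuals_trend(xs, ys))

print("Noise under 'stable level':", suspect_if_stable)
print("Noise under 'linear trend':", suspect_if_trending)
```

Under the stable-level conjecture the first reading looks most like noise; under the linear-trend conjecture it is the last one. Nothing in the readings themselves settles which of the two should be discarded.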

Data are not necessarily sufficient

Not only are we silently guided by our censored conjecture in the selection of data; if we try to suppress explanatory notions or frameworks, we also have no way to know whether the data we have gathered are the data most relevant to understanding or solving the problem at hand. And we have no mechanism to motivate the search for further data that may provide confirmation, refutation or further insight.

Data are not the only natural starting point for analysis

For analysis of phenomena with a view to forming conclusions or furnishing explanations, potential explanations are a valid starting point. For decision-making, the objectives you are trying to achieve and the decision levers with which you are trying to achieve those objectives are the natural place to start.

Data do not speak, much less explain

Neither the positivists nor their data analytics apologists were ever able to give any account of an objective, transparent process, much less a deductive process, by which an accumulation of data generates an explanatory theory or hypothesis. Data do not speak. We cannot even interpret data without recourse to some theoretical context.

But what of objectivity?

The desire for impartial data arises from the fear of a judicious selection and distortion of data in the service of a pet theory or a hidden agenda. Here’s Karl Popper:

…if we are uncritical we shall always find what we want: we shall look for, and find, confirmation, and we shall look away from, and not see, whatever might be dangerous to our pet theories. In this way it is only too easy to obtain what appears to be overwhelming evidence in favour of a theory…

But theorizing is essential in the discovery, selection and presentation of data. Popper’s solution is to ensure objectivity by creating multiple explanatory theories and then shifting the scrutiny of objectivity to an adversarial arena in which we adjudicate a critical contest between those conjectures.

The way in which knowledge progresses, and especially our scientific knowledge, is by unjustified (and unjustifiable) anticipations, by guesses, by tentative solutions to our problems, by conjectures. These conjectures are controlled by criticism…

Disposing of the notion that data "speak" liberates us to conjecture creatively and puts hypothesis back in its rightful place, in a dialogue with data, not subservient to it. Data motivate multiple conjectures; the attempt to distinguish between hypotheses guides us in the selection of data and motivates the search for and discovery of new data. These in turn test the mettle of our conjectures in a crucible of critical discourse. Hypotheses are refined, rejected, confirmed; they merge and split and generate new hypotheses, which in turn motivate the discovery of additional data, and so on.

