
Myths of Modelling: Data speak

Data keep us honest, but they don't speak; they aren't objective, and they are never free from the taint of theory

Cryteria, CC BY 3.0 <https://creativecommons.org/licenses/by/3.0>, via Wikimedia Commons


The myth

The myth of speaking data has many related forms:

  • Data are the natural starting point for investigation and analysis, and one should approach data pure in mind and cleansed from all taint of theory.
  • Data are objective, or at least somehow more objective than models, which without data are just story-telling with a strong odour of opinionation.
  • We ensure objectivity in our models by impartially gathering data and then letting the data "speak".
  • Data objectively guide our minds in the construction of impartial models, motivated only by "the facts".

This is the myth of the primacy and probity of data.

The origins of the myth

Moritz Schlick, philosopher, physicist, and the founding father of logical positivism. Photo by Theodor Bauer (Public Domain)

This myth is deeply rooted in the arid philosophical soil of Logical Positivism, a philosophical movement centred on an exclusive club of thinkers called the Vienna Circle in the 1930s. Logical positivists believed in the primacy of empirical observation to the extent that any statement that wasn’t itself an empirical datum, or firmly linked to one through a principle of correspondence, was declared meaningless. Only data and logical, analytical relations were worth discussing, and the latter only through their correspondence with the former.

The philosophical foundations of Logical Positivism were mortally undermined after the Second World War: first by Quine’s dissolution of the distinction between data and analytical discourse, then by Popper’s withering dismissal of theory-free observation, and finally by Kuhn’s description of the role of paradigms in the evolution of scientific understanding. By 1960, the movement was "as dead as a philosophical movement ever becomes".

But although Logical Positivism was never intended as a philosophical overhaul of scientific practice (rather, as a scientific overhaul of philosophical practice), scientific practitioners had already drunk deep from positivism’s poisonous springs. Ironically, the very philosophical movements that washed away the pillars of positivism (Quine, Kuhn and social constructivism) served to entrench positivism as the paragon of good scientific practice (at least amongst those scientific practitioners who never read Popper, Quine or Kuhn).

Reports of the death of logical positivism are grossly exaggerated. Logical positivism is, in fact, alive and well and living in engineering and physics textbooks in high schools and universities across the globe.

Why we continue to hold the myth

Most modelling myths are forged and fed by an earnest, well-meaning endeavour to protect ourselves from falsehood, deceit and self-deception, and the myth of the primacy and probity of data is no exception.

The myth is an attempt to guard against what Popper calls theory bias, but which more recently has become known as confirmation bias. Here’s Popper:

…if we are uncritical we shall always find what we want: we shall look for, and find, confirmation, and we shall look away from, and not see, whatever might be dangerous to our pet theories. In this way it is only too easy to obtain what appears to be overwhelming evidence in favour of a theory which, if approached critically, would have been refuted.

We, very rightly, fear the judicious selection and distortion of empirical data in the service of a pet theory or a hidden agenda, or in order to perpetuate the dominance of an existing theory and suppress a plucky challenger.

Unfortunately, we do not fix the unavoidable subjectivity of our theories and the inevitable self-interest in our selection of models by denying the essential role of theory in the discovery, selection and presentation of data.

Why it is a myth

The first problem with the myth of the primacy and probity of data is that without commitment to, or at least a willingness to entertain, a theory or model, we know neither what nor how to observe anything at all.

The existence of data demands that someone has discovered, gathered and to some extent processed and presented those data. But why those data? If we are truly cleansed from the taint of theory, how do we ever decide what might be relevant? And how do we process and present those data without some theoretical concept of what they mean and how they relate to the problems at hand?

The second problem is that we can in no way know that all the data relevant to understanding and solving the problems at hand are readily available to us, and we cannot go looking for them without some theory or model to guide us in that search.

So without theory, we have both far too many data and too few, but there’s more.

The fundamental failing of logical positivism wasn’t philosophical; it was practical. Neither the positivists nor their scientific practice apologists were ever able to give any account of an objective, transparent process, much less a deductive process, by which an accumulation of data generates a theory or hypothesis. Data do not speak. We cannot even interpret data without recourse to some theoretical context. There is no point in cleansing the data of theory if the hands that build hypotheses are contaminated with opinion and theoretical prejudice.

Why it’s a problem

Does it matter that we delude ourselves when we think we can harvest data independently from theoretical commitments?

I argue that it does matter, for three reasons. First and foremost, the problem of theory bias remains unaddressed. Best case, we are as biased as ever, but – convinced of our own good data hygiene – we have persuaded ourselves we are not; worst case, we learn to game the gathering of data and the generation of hypotheses, and we present the process as neutral and impartial in secret, deceitful servitude to hidden agendas and pet theories.

Secondly, it’s driving flawed practice. For example, prevailing "model-free" statistical practices derive from a struggle for objectivity through the blind insistence on reasoning exclusively from data and experiment. The go-to mathematical methods of many non-mathematical disciplines — medicine, public health and economics to name but a handful — are statistical tests that purport to uncover causal relations without contaminating the putative purity of their data by commitment to a causal model.

The problem is that all of these tests are actually built on simple and often quite inappropriate causal models, but the flat denial that this is so obstructs us from assessing the fitness of those models and prevents us from aligning them with our expertise and understanding.
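A minimal simulated sketch can make this concrete. The scenario below is invented for illustration: a "model-free" test of association between X and Y finds a strong correlation, but the data were in fact generated by a hidden common cause Z. Only by committing to a causal model (Z drives both X and Y) and adjusting for Z does the apparent effect vanish — the "neutral" test was silently assuming that no such confounder exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating world: a confounder Z drives both the
# "treatment" X and the outcome Y; X has no causal effect on Y at all.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2 * z + rng.normal(size=n)

# A "model-free" test of association finds a strong correlation...
naive = np.corrcoef(x, y)[0, 1]

# ...but committing to the causal model Z -> X, Z -> Y and regressing
# Z out of both variables makes the apparent effect vanish.
x_resid = x - z * np.dot(x, z) / np.dot(z, z)
y_resid = y - z * np.dot(y, z) / np.dot(z, z)
adjusted = np.corrcoef(x_resid, y_resid)[0, 1]

print(f"naive correlation:    {naive:.2f}")    # strongly positive
print(f"adjusted correlation: {adjusted:.2f}") # near zero
```

The naive test and the adjusted one are both "just arithmetic on the data"; what separates them is the causal model each implicitly assumes, which is exactly the commitment the myth forbids us from examining.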

Finally, the myth prevents us from doing what does actually work: the construction of multiple, preferably causal, models that address the questions we face, align with the data we have and, critically, motivate the discovery of new data and insights that strengthen our understanding and enhance our agency in intervention.

What we should do instead

Here’s Popper again, from his book Objective Knowledge: An Evolutionary Approach.

Whenever a theory appears to you as the only possible one, take this as a sign that you have neither understood the theory nor the problem which it was intended to solve.

The key to ensuring objectivity is to move the scrutiny of objectivity away from the acquisition of data and over to a contest between multiple theories or models.

Data are necessarily the nodes at which our theories intersect with the world. They are essential to the adjudication between theories, and we must do everything in our power to ensure they are not wilfully distorted or suppressed. But by shifting the focus of our pursuit of objectivity, impartiality and real-world correspondence to the contest between multiple theories, we free data from the illusion of independence and open up a much richer, more fruitful relationship between theories and the data that inspire and regulate them.
