
But What is a Model?

A Wittgensteinian Approach to Data Science

Planetarium from 1766, Photo by Sage Ross, Creative Commons

The term model gets thrown around a lot; the word is so ubiquitous that it has nearly lost its meaning. The Wikipedia page alone shows the variety of its usage, covering statistics, astronomy, biology, product design, and art, as well as conceptual models.

The etymology of model is interesting as well: the word passed through French and Italian from the Latin modulus, a diminutive of modus, meaning ‘measure, rhythm, or way’.

Nevertheless, the definition of ‘conceptual model’ captures the broadest sense of the word. Again from Wikipedia:

A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate a subject the model represents.

How does this relate to data science?

In Python, data scientists often use packages such as scikit-learn or statsmodels to run linear regressions, clustering algorithms, random forests, or neural nets on a variety of data for the sake of classification or prediction.

Meanwhile, the ancient Greeks and Romans used a geocentric model of the solar system to make sense of their universe. This cosmological model dominated astronomy until the early seventeenth century: in December 1610, Galileo observed that Venus shows a full set of phases, which the geocentric, or Ptolemaic, model could not produce, lending strong support to a sun-centered model.

These usages of the word model may seem unrelated, but returning to our definition above, they are in fact the same application of the word. Why? Our tangible interaction with Python models consists of Jupyter notebooks, that little * next to a cell while the model is being built (and we pray for no errors), and pickling those cherished best-F1-score models for use on AWS or Heroku. But what is actually going on under the hood in something like scikit-learn? Why call it a model, apart from tradition? What does a pickled random forest model have to do with a little toy model of the solar system with metal spheres, or a styrofoam Bohr model of an atom?

Because all of these models are intended to be representations of reality. That is what unites them. This is a subtle and difficult point to make, but it has more significance than may appear at first glance.

Models, Data Science, and Wittgenstein

Photo by Ben Richards, Public Domain

The Tractatus Logico-Philosophicus, by the philosopher Ludwig Wittgenstein, is often considered one of the most important philosophical works of the twentieth century. The treatise unites logic, science, and philosophy, and culminates in a mystical refutation of itself and, to an extent, of philosophy. It is a beautiful work, laid out in seven overarching propositions, which essentially state:

The world consists of facts, or ‘states of affairs’. We develop representations of these facts in the form of logical propositions. We can never explicitly say what a fact and a representation of that fact have in common; we can only show it. This is an essentially mystical aspect of human life, and it makes philosophizing, including this whole work (the Tractatus), a waste of time except insofar as it liberates you from the urge to continue philosophizing.

For example, when we correctly recognize a mother’s face in the face of a child, what exactly are we seeing ‘in common’? We can’t quite put our finger on it: we see a family resemblance, and that’s it. We see something that we can’t say but nevertheless know to be true, and any attempt to articulate that commonality fails miserably.

Even if we try to reduce this activity to a neurological state, a neuron-based kind of facial recognition and feature firing (currently thought to be the function of the fusiform face area of the temporal lobe), we don’t actually use that reductionist approach in the act of recognition itself: we simply do it. That metaphysical distinction pervades all human activity and is the source, according to Wittgenstein, of our ethical and religious truths, which must stay apart from the realm of scientistic [sic] fact.

Finally, getting back to models (and the point): models are just such representations! When we use scikit-learn to build something as simple as a linear regression model, we are using the computational power of computers to simulate and codify the human brain’s process of recognizing isomorphisms. The coefficients of a regression model express a belief about how strongly each feature influences the target variable, usually for the purpose of prediction.
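
To make the idea of coefficients as a representation concrete, here is a minimal sketch. It assumes a made-up ‘reality’ in which a target depends on two features through a rule we choose ourselves; the rule, the noise level, and the variable names are all illustrative assumptions, not taken from any real dataset. A scikit-learn linear regression sees only the noisy observations, yet its fitted coefficients should end up mirroring the structure of the rule that produced them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical "reality": y = 3*x1 - 2*x2 + 5, observed with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# The model is given only the observations, never the rule itself.
model = LinearRegression().fit(X, y)

# The fitted parameters are the model's representation of the hidden rule.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

The printed values should land close to 3, -2, and 5. In that sense the fitted parameters stand in for the relationships in the system that generated the data, much as the spheres of a planetarium stand in for the planets.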

But a wondrous realization about these models is that, despite the feeling that a Python model is just a sequence of 1s and 0s that cleverly captures gradient-descent-driven loss-function minimization, the process of model building parallels the human activity of prediction and explanation! The human brain, whether making a numerical mental estimate, attempting to verbalize a memory of a physical event, or recognizing the face of an old friend over a few seconds, is picking out isomorphic features of reality in its mental representation. The major difference is that Python models lack sentience: they remain tools that require human understanding to use effectively. I believe this is the fundamental reason why the job of a data scientist is less about model building in Python or large-scale distributed computing on AWS than about producing insight and visualization and expanding top-level understanding of a domain.
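
The loss-function minimization by gradient descent mentioned above can be sketched in a few lines. This is a minimal illustration, assuming synthetic data drawn from a line we pick ourselves and a hand-chosen learning rate and step count. It is not how scikit-learn’s LinearRegression works internally (that estimator solves the least-squares problem directly); it only shows the general mechanism by which a model’s parameters are nudged toward a better representation of the data:

```python
import numpy as np

# Hypothetical data: points scattered around the line y = 2x + 1.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=100)


def mse(w, b):
    """Mean squared error of the line y = w*x + b on the data."""
    return float(np.mean((w * x + b - y) ** 2))


# Start from a model that knows nothing about the data.
w, b = 0.0, 0.0
learning_rate = 0.1  # assumed step size, chosen by hand
initial_loss = mse(w, b)

for _ in range(2000):
    error = w * x + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step downhill on the loss surface.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"w={w:.3f}, b={b:.3f}, loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The slope and intercept should settle near 2 and 1. Nothing in the loop ‘understands’ the line; it only reduces a number, which is why interpreting the result is still left to the human.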

Interestingly, this is also why data science is so broad a field: the scope of model building is as wide as the range of human tasks there are to simulate. Word embeddings for natural language processing, convolutional models for image recognition, and regression models for numerical prediction are all models of sub-domains of the world whose representations, for most of human history, have been generated by humans.

The ultimate data science model, like the perfect map in the short story On Exactitude in Science by Jorge Luis Borges, would just be the world itself! But such a model would be, and is, too cumbersome to wield effectively. We make models to simplify the world in an actionable way, and I believe any data scientist would benefit from this perspective.

That said, I suspect the veteran data scientists already have. Thanks for reading!

