
Data Science is a very technical, in-the-weeds type of work. We are often laser-focused on very specific problems – which is good. We add most of our value by combining our focused attention and our skills to solve problems. But I think it is a good practice to occasionally step back and try to take in the bigger picture.
Studying philosophy is a tool that I have found to be quite effective in helping me think deeply about data science. As a casual student of philosophy, I’ve observed that some fields of philosophical thinking are nicely intertwined with data science. Specifically, I’ve found that metaphysics, causality and epistemology have a lot of theories that are very applicable.
This is the first installment of a multi-part series that discusses various philosophical viewpoints and their implications for data and data science. I’m going to start with the fascinating metaphysical theory of determinism.
What is determinism?
Determinism is a philosophical theory about the nature of our universe. There are multiple nuanced versions of determinism¹, but the overarching idea is that there is no randomness in our universe. Every event has a set of causes which entirely explain the event, and these causes themselves have a set of causes. The chain of causes is unbroken from the beginning of the universe (or maybe there is no beginning of the universe²?).
Below is a quote from Laplace that encapsulates a deterministic viewpoint on the physical world:
"We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes."
Pierre-Simon Laplace, A Philosophical Essay on Probabilities (1814)
I’ve found that determinism pops up in the following data science topics (I’m sure there are a lot more – let me know what I’ve missed!):
- Probability theory
- The concept of irreducible error
- The theoretical ‘god’ model
- Causality and design of experiments
- Random numbers
Probability Theory
The study of probability is largely about understanding how random variables behave. A random variable represents the outcome of a process that has randomness in it – the roll of a die, for example. We can know a lot about how probable certain outcomes are, but we cannot predict the outcome of a single throw with certainty – presumably because of randomness.
The theory of determinism rejects that there is any randomness in the universe. Why then do we have the field of probability, in which we study random variables? Of course, an indeterminist would say that there is randomness in the universe. But a determinist would likely say that the whole field of probability was created because of the ‘epistemic limits’ of humanity.
Epistemic limits bridge the gap between perceived randomness in the universe and the theory of determinism. These limits can be defined as the boundaries of what can be known or understood. If the universe is truly deterministic, we could hypothetically know the outcome of every dice roll (think about Laplace’s quote above). If we were able to gather and understand the causal relationships between all variables that impact each throw, we could calculate the outcome of the roll with 100% confidence. Imagine, though, how much we would have to know to make such a calculation: the imperfections of the die, the exact placement of the die in my hand, exactly how I shake my hand, the barometric pressure that day, the hardness of the landing surface, etc.
A determinist is okay with things appearing random, because in her view the apparent randomness is simply a product of our epistemic limits. Because of these limits, probability remains a very useful field of study regardless of whether or not determinism correctly describes the nature of our universe.
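A minimal sketch of this idea in code (the seed value and the number of rolls are arbitrary choices for illustration): we cannot predict any single roll, yet the aggregate frequencies are highly predictable – which is exactly the regime in which probability theory earns its keep.

```python
import random
from collections import Counter

random.seed(42)  # note: a deterministic algorithm underlies our 'randomness'

# Simulate many die rolls. Each individual outcome is unpredictable to us
# (an epistemic limit), but the aggregate behavior is well described by
# probability theory.
rolls = [random.randint(1, 6) for _ in range(60_000)]
freqs = {face: count / len(rolls) for face, count in sorted(Counter(rolls).items())}

for face, freq in freqs.items():
    print(f"P({face}) ≈ {freq:.3f}")  # each should land near 1/6 ≈ 0.167
```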
Irreducible error
Machine learning models attempt to make predictions given a set of data. Typically, these models are only estimates or approximations of how a system works. In other words, models are often wrong to some degree – we call this error. Determinism has theoretical implications for model error!
A model’s error can come from a combination of three different sources:
- Model approximation
- Unavailable data
- Random noise
Model approximation
When we create a predictive model, we are estimating the true relationships between our target and predictors. We hope we have a close approximation. This is why you may hear ‘estimate the model’ and ‘train the model’ used interchangeably.
For example, when we estimate a linear regression model, we assume that all of our predictors have a linear relationship with our target variable. Violations of this assumption (even small violations) result in at least some amount of error.
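To make the approximation source of error concrete, here is a small sketch (the quadratic relationship and the data range are invented for illustration): the system below is completely noise-free, yet a misspecified linear model still carries error, purely from approximation.

```python
import numpy as np

# A deterministic, noise-free system with a quadratic relationship.
x = np.linspace(-3, 3, 100)
y = x**2  # the 'true' relationship: zero randomness anywhere

# Approximating it with a straight line (a misspecified model) leaves
# error that comes entirely from the model-approximation source.
slope, intercept = np.polyfit(x, y, 1)
linear_mse = np.mean((y - (slope * x + intercept)) ** 2)

# A correctly specified quadratic model drives the error to ~0.
coeffs = np.polyfit(x, y, 2)
quad_mse = np.mean((y - np.polyval(coeffs, x)) ** 2)

print(f"linear model MSE:    {linear_mse:.4f}")  # substantial error, no noise needed
print(f"quadratic model MSE: {quad_mse:.2e}")    # essentially zero
```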
Unavailable data

This type of error comes from missing data that is necessary to describe the system. It can be missing because it is unobservable or impossible to accurately quantify (e.g. driver mood to predict speeding) or because it is simply not available (website was not set up to capture how long a potential customer spent on the checkout page to predict probability of completing a purchase).
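A sketch of this source of error (the target formula, coefficients, and the ‘driver mood’ framing below are all invented for illustration): the system is perfectly deterministic, but a model that cannot see one of its drivers has an error floor it can never fit its way past.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully deterministic target driven by two variables...
x1 = rng.uniform(0, 1, 10_000)
x2 = rng.uniform(0, 1, 10_000)  # imagine 'driver mood': real, but unobserved
y = 2.0 * x1 + 3.0 * x2

# ...but our model only sees x1. The best fit on x1 alone still carries
# error that comes entirely from the missing predictor.
slope, intercept = np.polyfit(x1, y, 1)
mse_missing = np.mean((y - (slope * x1 + intercept)) ** 2)

# With both predictors available, the error essentially vanishes.
X = np.column_stack([x1, x2, np.ones_like(x1)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_full = np.mean((y - X @ beta) ** 2)

print(f"MSE with x2 unavailable: {mse_missing:.3f}")  # ≈ 9 * Var(x2) = 0.75
print(f"MSE with all data:       {mse_full:.2e}")
```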
Random noise
Randomness (assuming it exists) is the third cause of model error. Randomness by definition cannot be predicted, even given all of the necessary features and a perfect machine learning approach.
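A quick sketch (the noise level and relationship are arbitrary choices): even the correctly specified model, fit on plenty of data, cannot beat the injected noise – its error is floored at the noise variance.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0, 1, 50_000)
noise = rng.normal(0, 0.5, 50_000)  # stands in for 'true' randomness, if it exists
y = 2.0 * x + noise

# The model form is exactly right, yet the noise caps how well it can do:
# the MSE is floored near Var(noise) = 0.25.
slope, intercept = np.polyfit(x, y, 1)
mse = np.mean((y - (slope * x + intercept)) ** 2)
print(f"MSE: {mse:.3f}  (noise variance: 0.25)")
```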
Irreducible error
Now that we understand the sources of error in a model, let’s talk about the nature of that error. Generally speaking, error (no matter the source) can be categorized as reducible or irreducible.
- Reducible error can be reduced with improvements to how the model learns from the training data.
- Irreducible error is the amount of error that cannot be eliminated no matter how well our model is fit to the training data.

I think of irreducible error as being further subdivided into ‘local irreducible error’ and ‘universal irreducible error’³.

A. I define local irreducible error as error that cannot be reduced because of the limits of data science tools or of what data are locally or readily available. For example, error that persists after having thoroughly tested all available machine learning algorithms, or error that persists because we do not have access to all of the data points that explain the target variable. Local irreducible error exists because we don’t live in a perfect world; it recognizes that there is only so much we can do with the tools and data we are given.

B. Universal irreducible error is the error that persists when local constraints are lifted. We have to dive into a hypothetical world to get here: this is the error we would observe if we had the perfect machine learning algorithm and all of the data needed to fully explain our target variable.
Diagram of error taxonomy: total model error divides into reducible and irreducible error, and irreducible error further divides into local and universal irreducible error.
With an understanding of the sources and classifications of model error, let’s finally get to how determinism ties into everything!
Here is a thought experiment: if we have a perfect model structure (i.e. our estimated function f′(x) is exactly the true function f(x)), and x is the exhaustive set of all features needed to predict y, would our model still have irreducible error? Or, in the terms I created, would the ‘universal irreducible error’ be greater than 0? Determinism says ‘no!’ Our model would be 100% accurate, because randomness does not exist. If error is not coming from other sources, there is no error – universal irreducible error is always 0 in a deterministic world!
Of course, we can’t go much past a thought experiment with this one because a ‘perfect model’ isn’t possible given humanity’s current epistemic limits.
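We can still simulate the spirit of the thought experiment. Below, a toy ‘universe’ is generated by a known deterministic function, and the ‘model’ is that very same function (the function itself is an arbitrary stand-in): with no randomness and a perfect model, the universal irreducible error is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # The 'true', fully deterministic data-generating process of our toy universe.
    return np.sin(3 * x) + x**2

x = rng.uniform(-2, 2, 1_000)
y = f(x)  # no noise term: a deterministic universe

# The hypothetical perfect model: f'(x) is identical to f(x).
y_hat = f(x)

mse = np.mean((y - y_hat) ** 2)
r2 = 1 - mse / np.var(y)
print(f"universal irreducible error (MSE): {mse}")  # exactly 0.0
print(f"R-squared: {r2}")                           # exactly 1.0
```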
The ‘god’ model
In the previous section, we discussed a hypothetical model that has the perfect formulation and a complete, comprehensive list of predictors. This is what I refer to as a ‘god’ model⁴, meaning a deistic level of knowledge would be required to create such a model.
Under determinism, ‘god’ models are a theoretical possibility. Since randomness doesn’t exist, a perfect model will have perfect predictions.
Going back to epistemic limits, the only reason we cannot create ‘god’ models is because of our limits, not because of the nature of the universe.
Causality and design of experiments
Determinism mandates that everything is strictly causal. Some philosophers feel that causality is a human construct⁵. To accept determinism, one must accept that causality is a real phenomenon. (Note that the implication only runs one way – you don’t have to accept determinism to accept causality.)
This has implications in how we think of the design and execution of experiments. Would you expect that a perfectly controlled experiment would have zero error? In other words, if we could completely isolate individual causes, and run the same experiment a million times, would you expect the exact same results, without any variation ever? If ‘yes’, you are well on your way to becoming a determinist!
Venturing into a perfect hypothetical world is a useful tool, but reality requires that we adapt to its imperfections. Of course we cannot perfectly control experiments – this is why the field of experimental design has provisions for handling apparent randomness and errors. Depending on our opinion of the universe though, we can see these accommodations as necessary only because of our epistemic limits (under determinism) or necessary because randomness is inherent in the universe.
Random numbers
There are random number generators that use random physical processes (e.g. atmospheric noise) to produce numbers that cannot be replicated. Capturing these numbers requires dedicated hardware.
Most data professionals (who don’t mind or even want their random numbers to be replicated – think about setting a seed) only need to use pseudorandom numbers. Pseudorandom numbers appear random, but are created by deterministic algorithms and do not need anything more than a computer program to generate.
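A quick sketch of why seeding matters (the seed value is arbitrary): re-seeding Python’s pseudorandom generator replays the exact same ‘random’ sequence, which is precisely what makes pseudorandom numbers so convenient for reproducible analysis.

```python
import random

# Pseudorandom numbers are fully determined by the seed:
# same seed, same 'random' sequence, every time.
random.seed(123)
first_run = [random.random() for _ in range(3)]

random.seed(123)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True: the sequence is perfectly reproducible
```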
If determinism is true, all ‘random numbers’ are actually pseudorandom numbers – remember, randomness doesn’t exist! Of course, going back again (I’m so sorry) to epistemic limits, the distinction between random numbers and pseudorandom numbers is meaningful because we can easily replicate pseudorandom numbers, while random numbers would require a god-like level of knowledge to replicate. Sorry hackers, determinism is unlikely to be useful to you on this front… at least for now!
Conclusion
With just a little bit of ‘big picture’ thinking, deterministic ideas and implications show up a lot in data science. This line of thinking may not help you solve a specific technical problem at work. But I believe that thinking deeply about how data and the universe are connected will make you a more well-rounded, insightful data scientist.
Notes
- In this article I only cover a basic and general idea of determinism. There are many different versions of determinism. Also, I do not argue for or against determinism – there are multiple alternatives to the theory of determinism – indeterminism, agent causation, dualism etc.
- Whether or not the universe has a beginning or is infinite is a question that philosophers have debated for literally thousands of years. Aristotle believed that observing chains of motion indicated that the universe was eternal, because the chain cannot have a beginning. Many medieval philosophers, like Thomas Aquinas, held that there was a first mover and that mover was God. Stephen Hawking opined that the universe started at the big bang and therefore has a definite starting point.
- I further subdivide irreducible error into ‘local’ and ‘universal’ to help me think about error more thoroughly. Since these are terms that I made up, further searches on the internet probably won’t lead to anything!
- While the ‘god’ model doesn’t really have any practical applications, it is a device that helps me think about the full spectrum of possible models. On the far left (simplest and probably worst prediction) sits the model that simply predicts the average of the target variable; on the far right sits the ‘god’ model. To illustrate in the context of a regression problem, the average model will likely have a test R-squared near 0, while the ‘god’ model will always have a test R-squared of 1. When I develop models, I like to ask myself where my model lands on this spectrum.
- The Humean Regularity Theory suggests that we observe patterns and deduce causation when there isn’t necessarily a causal tie. David Hume believed we could not directly observe causation, so we have no reason to assert that causation is anything other than us noticing patterns.
- Determinism and free will – while well outside of the scope of this article, if you are interested in determinism, you should study the Philosophy of free will. Determinism’s implications on free will and accountability are fascinating, but I couldn’t think of a way that they tied into data science!