I admit – the title is pretty provocative, but do read on to see if this will make sense by the end. I think what motivated me to write this article was my growing sense that with the increased availability of tools and resources to build such models (e.g. R, Python, etc.), came what I perceived to be a disproportionately small increase in awareness of how to use said tools and resources properly.
I find it pretty easy to imagine a scenario where a beginner looking to start doing some predictive modeling turns first to Google. If you Google "how to build a predictive model", you get tons of articles with themes like "Perfect way to build a Predictive Model in less than 10 minutes". Those articles are all fine and dandy – they do teach you the steps and the code to build a predictive model. I’m not knocking them. Their purpose is clear. What stood out to me most, however, was that Google’s search results did not surface the content I’d consider even more important – should you build a predictive model, and if so, how do you do it correctly? Maybe it’s fine that these quick-step articles are the ones that come up first – after all, I’ve written articles like those myself. I do think they have value in that they can draw newcomers to the field and introduce them to this topic, possibly more than a traditional peer-reviewed article written in stereotypically dry academic language could.
However, we must draw a line at the point where newcomers become intermediate users of the tools but fail to become correspondingly rigorous in their approach to predictive modeling.
That’s where I hope my article can come in. I present this article as Part 1 of a series of articles. I aim to discuss three main points:
- Do you really want to build a predictive model? Knowing the difference between explanatory and predictive models.
- If you do, here are the things you should consider.
- What are the stakes? For most of you, they’re low, and you can "build a predictive model" and go on your way. For others, it becomes dangerous.
Explanatory vs. Predictive Models
This section is heavily based on Shmueli’s paper describing the differences between explanatory and predictive modeling – I highly recommend you give it a read!
The main motivation behind this paper is that many people conflate models that have high explanatory power with models that inherently possess high predictive power. OK, what does that mean? Let’s take a trivial example:
IQ = age*b1 + b
where we assume a linear relationship between one’s age and one’s IQ.
Explanatory means we seek to understand how much variability an association explains (in this case, the one between age and IQ). A model with high explanatory power includes independent variables (e.g. age) that explain a large variability in the dependent variable (e.g. IQ). It is important to note that in explanatory modeling we assume an underlying causal hypothesis of something that affects another thing. On the other hand, predictive means we seek to understand how well age can predict IQ in new data (i.e. how accurately the fitted model, with its estimates of b1 and b, predicts IQ for observations it has never seen).
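The contrast can be made concrete with a few lines of code. Below is a minimal sketch using simulated data (the dataset, coefficients, and noise level are invented for illustration): the same fitted linear model is evaluated through an explanatory lens (in-sample R², how much variability age explains) and a predictive lens (error on held-out data the model never saw).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate a toy dataset: IQ has a weak linear association with age plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=500).reshape(-1, 1)
iq = 100 + 0.05 * age.ravel() + rng.normal(0, 15, size=500)

# Hold out 30% of the data to stand in for "new data".
train_X, test_X, train_y, test_y = train_test_split(
    age, iq, test_size=0.3, random_state=0
)

model = LinearRegression().fit(train_X, train_y)

# Explanatory lens: how much in-sample variability in IQ does age explain?
r_squared = model.score(train_X, train_y)

# Predictive lens: how well does the fitted model do on unseen data?
test_mse = mean_squared_error(test_y, model.predict(test_X))

print(f"In-sample R^2: {r_squared:.3f}")
print(f"Out-of-sample MSE: {test_mse:.1f}")
```

The point of the sketch is that these two numbers answer different questions: a model can post a respectable R² on its own training data while still predicting poorly on new observations, and vice versa.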
This difference essentially manifests in four characteristics:
- Explanatory models attempt to capture cause and effect. Predictive models attempt to capture associations precisely.
- Explanatory models tend to be hypothesis-driven. Predictive models tend to be data-driven.
- Explanatory models are retrospective. Predictive models are prospective.
- Explanatory models focus on minimizing bias. Predictive models focus on minimizing both bias and estimation variance.
To understand point #4, we have to quickly talk about the bias-variance tradeoff. You may have heard of it before, perhaps through the popular bull's-eye teaching diagram often used to illustrate it.
Formally, you can represent the expected prediction error for a given observation as:
EPE = Var(Y) + Bias² + Var(f̂(x)) [from The Elements of Statistical Learning]
where Var(Y) is the irreducible noise in the outcome (variability no model can explain away), Bias is the systematic difference between the "true" relationship (which we NEVER know!) and our modeled relationship, and Var(f̂(x)) is the variability of our model's estimates across different training samples.
In explanatory modeling, because we seek to understand the "true" relationship as accurately as possible, minimizing bias is our primary concern. However, for predictive modeling, we want to get the "sweet spot" by minimizing bias and our estimation variability in order to minimize our expected prediction error.
I’ve compiled the following table that illustrates how the differences between explanatory and predictive modeling manifest in common steps when you’re in the process of trying to build your own model.

The TRIPOD Statement
So, ideally you’ve read the previous section and are more informed about whether you want to build an explanatory or predictive model. If you’re still interested in building a predictive model, read on!
Now that we know a little bit more about when to build a predictive model, let’s focus on the how. The how in this case applies more to the considerations you must think about when building a model. This article does not cover the technical how, such as choosing a model or implementing it in a language like R (I am planning on writing that as a follow-up). But I’m choosing to prioritize writing this because I think it’s so important that you understand how to do something correctly instead of copying and pasting code that you find to build some type of regression model.
Luckily, a lot of people put together something called the TRIPOD statement, or "Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis". It was published in multiple health journals, but the items from this statement can also be applied to other predictive contexts (business, politics, etc.). I’ll focus on items specifically from the development checklist (as opposed to the validation checklist) since the original premise of this Medium article is building a predictive model. The bottom line is the following:
- Clearly justify with as much detail as possible the data itself, the purpose of the model, the definition of the predictors, the definition of the outcome, the choice of model
- Interpret the model coefficients correctly and provide confidence intervals (if using a linear model)
- Report discrimination and calibration metrics on holdout data
- Discuss any limitations with using this model (can this model be used on everybody or everything? If not, why?)
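The third item deserves a concrete example, since "discrimination" and "calibration" are often reported sloppily or not at all. Here is a minimal sketch on synthetic data (the dataset and split are invented stand-ins for a real prognostic dataset): a classifier is fit on training data, and both metrics are computed strictly on the holdout set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss

# Synthetic binary-outcome data standing in for a real prognostic dataset.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Discrimination: can the model rank positives above negatives? (AUC)
auc = roc_auc_score(y_test, probs)

# Calibration: do predicted probabilities match observed event frequencies?
# The Brier score is one summary; a calibration plot is even more informative.
brier = brier_score_loss(y_test, probs)

print(f"Holdout AUC: {auc:.3f}")
print(f"Holdout Brier score: {brier:.3f}")
```

Reporting both on held-out data, as TRIPOD asks, guards against the common failure mode of quoting optimistic in-sample performance.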
Does this really apply to me?
I also wanted to include this short section for those who may have read up until this point and still want to build a predictive model. Of course you can! You can absolutely learn how to build predictive models by grabbing a dataset from somewhere (e.g. Kaggle), playing around with the code, and seeing what happens. But I want to emphasize the point I made at the beginning of the article: that’s fine only until you start building predictive models with potential real-world impact without doing your due diligence. In other words, the worst-case scenario to me is when someone with some degree of authority starts making claims about a predictive model without at least understanding the fundamental aspects of predictive modeling. That’s dangerous and misleading.
Explanation and Prediction are Friends
I want to end with a more abstract takeaway about explanatory and predictive modeling. The differences between them are nuanced, and many people may not know them explicitly. However, there is some truth to why people believe the two overlap so much. Both are ultimately important. While prediction may not directly address causal understanding, it can reveal complex associations that generate new hypotheses about poorly understood phenomena. Likewise, explanatory models seek to explain the under- or un-explained. You can’t really have one without the other. But understanding these nuances can make all the difference.
