
How to use the Assumption-Lean Approach for Correct Inference

Why all models are wrong but some are useful.

Photo by Michal Matlon on Unsplash

Statistics is based on assumptions. If those assumptions become invalid, all conclusions based on those assumptions likewise become invalid.

"All models are wrong but some are useful" – George E.P. Box

An assumption-lean approach proposed by statisticians at the University of Pennsylvania outlines how to draw more defensible conclusions from our data. The method describes a new language for interpreting model coefficients, as well as confidence interval calculations that are robust to model misspecification.

While there is nuance in the method, it’s very computationally efficient and straightforward to implement.

Technical TLDR

  • In practice, guaranteeing a correctly specified model is very difficult. A "true" model requires that all relevant predictors are included and that the noise term is IID.
  • Employ the assumption-lean approach. Because it’s often impossible to prove we have all relevant predictors, we assume only that the data are independent and identically distributed (IID).
  • To account for only using the IID assumption, we change our confidence interval calculations. The two methods discussed are the sandwich method and a bootstrap of our observed data.

Ok, but what’s actually going on?

Let’s slow down a bit and understand how we derive conclusions from data.

A Quick Example

In this example, we’re looking to model the equation for force:

Figure 1: the equation for force, F = M·A. Image by author.

The relationship is multiplicative and the two predictor variables are mass (M) and acceleration (A). In theory, if we have a linear model as well as some mass and acceleration data, we could perfectly predict force. Right?

Yes, but we don’t live in theory. We live in the real world, where relationships are fuzzy and messy. For example, what happens if we’re measuring the force of an object falling off a building? Suddenly, we have to include drag force (D) and gravitational acceleration (g). The equation becomes more complex:

Figure 2: equation for the force of a falling object – source. Image by author.

What about wind? What if it’s raining and a water droplet hits the object?

If we don’t account for all of these factors, we may get incorrect estimates of coefficients for our predictors. In mathematical terms, our β below may not be exactly 1.0.

Figure 3: linear regression for the force equation. Image by author.
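
To make this concrete, here is a minimal simulated sketch (my own illustration, not the article’s data) of how an omitted term such as drag can pull the estimated coefficient on M·A away from 1.0. The drag formula and all the numbers below are hypothetical assumptions.

```python
# Hypothetical simulation: how omitting a relevant predictor (drag) biases
# the coefficient on M*A away from its "true" value of 1.0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
mass = rng.uniform(1, 10, n)
accel = rng.uniform(1, 10, n)
drag = 0.5 * accel**2               # illustrative drag term, correlated with acceleration
noise = rng.normal(0, 1, n)         # unsystematic measurement error (the epsilon term)

# The "true" force includes drag, but our regression only uses M*A.
force = mass * accel - drag + noise

X = sm.add_constant(mass * accel)
fit = sm.OLS(force, X).fit()
print(fit.params)                   # the slope on M*A drifts away from 1.0
```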

Force follows a well-known equation, but what happens if we try to model something with an unknown relationship? Can we ever know that all relevant predictors are included?

This was George E.P. Box’s argument: models simplify complex phenomena, so they’re never perfect.

Types of Model Error

Despite the fact that most models are "imperfect," some imperfections are worse than others. Here, we will discuss two categories of error:

The first category is noise (ε). Noise is fluctuation in our data that is completely independent of everything else, i.e. something we couldn’t model even if we had perfect data. In our example above, noise could be the unsystematic measurement error of the force.

Nearly all modeling techniques assume there will be noise in the dependent variable.

The second category is called the error due to model misspecification (η). It is not encompassed in most statistical modeling frameworks. Instead, we often assume a "perfect" model, and thereby η is zero.

But if Dr. Box was right, why do we assume models are correctly specified? Well, this assumption allows us to make causal conclusions. If we have a perfect equation for force, we can causally claim that F=M*A.

In practice, we never have a perfect model, so it’s common to assume error due to model misspecification is simply noise.

The Problem

That assumption is a problem. Statistics is a precise method that relies on assumptions. If the assumptions are wrong, the conclusions cannot be guaranteed to be accurate (although they might be close to the correct answer).

By now, we know that missing predictor variables are problematic, so let’s tackle another key assumption: our model is able to represent the relationship between X and Y. If our data exhibit a linear relationship, we should use a linear model; if the relationship is parabolic, we should regress on X² instead of X.

If we fail to meet this assumption, it’s possible that we will observe very different fitted values when looking at new data.

Figure 4: difference between an incorrectly specified linear model on exponential data and a correctly specified linear model. Image by author.

Take figure 4 as an example. We can see that on the left we have an incorrectly specified model; the relationship between X and Y is nonlinear but we’re fitting with a linear model. If we happen to pull Sample 1 (S1) which has small values of X, the estimated slope of our line will be much smaller than if we sample larger X’s (S2).

Now in practice, we’d hope to have a representative sample of the entire dataset, but this example highlights how sensitive a misspecified model is to slight variations in the data.

On the other hand, if we correctly specify our model, we should see similar slopes for any sample of X, as shown on the right. Moreover, as we fit with larger sample sizes, our accuracy will increase. For a misspecified model, that may not be the case.
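
As a quick sanity check of that intuition, here is a small simulated sketch (made-up data, mirroring the idea in Figure 4 rather than reproducing it): a straight-line fit to exponential data yields very different slopes on low-X and high-X subsamples, while regressing on exp(X) (the correct specification in this toy setup) gives consistent slopes.

```python
# Toy illustration of Figure 4's point: a misspecified linear fit is sensitive
# to which part of the data you sample; a correctly specified fit is not.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 1000)
y = np.exp(x) + rng.normal(0, 1, 1000)        # nonlinear relationship plus noise

def slope(xs, ys):
    return np.polyfit(xs, ys, 1)[0]           # slope of a straight-line fit

s1 = x < 2.5                                  # sample of small X values (like S1)
s2 = x >= 2.5                                 # sample of large X values (like S2)

# Misspecified: regress y on x directly -> the two slopes disagree wildly
print(slope(x[s1], y[s1]), slope(x[s2], y[s2]))

# Correctly specified: regress y on exp(x) -> both slopes are close to 1
print(slope(np.exp(x[s1]), y[s1]), slope(np.exp(x[s2]), y[s2]))
```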

The Solution

To solve this problem, statisticians at the University of Pennsylvania developed an approach that does not assume the model is correct.

This assumption-lean approach only requires that our data are independent and identically distributed (IID). In English, IID means that our observations are not systematically related and that they come randomly sampled from the same population.

With just an IID assumption, we can account for Box’s critique and develop a correct interpretation of our model. However, we do lose causal interpretation because we no longer assume our model includes all relevant predictors.

Here are the two changes required to develop a valid assumption-lean model:

1 – Confidence Interval Calculations

As you might imagine, confidence intervals (CIs) need to be wider for an incorrectly specified model – we’re less confident in our estimate. To account for a potentially misspecified model, two methods were proposed.

The first, called the sandwich method, uses heteroscedasticity-consistent standard errors to estimate CIs that are robust to model misspecification. The computation of these estimates is beyond the scope of this post, but libraries that perform the calculation are available in most programming languages.
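
If you work in Python, a minimal sketch looks like the following (assuming statsmodels; the simulated data is only illustrative). Requesting a heteroscedasticity-consistent covariance such as HC3 when fitting gives the sandwich-style intervals:

```python
# Sketch: classical vs. sandwich (heteroscedasticity-consistent) confidence intervals.
# The data below is simulated and deliberately misspecified (y is nonlinear in x).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = np.exp(x) + rng.normal(0, 1, 200)

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()                # assumes the model is correctly specified
robust = sm.OLS(y, X).fit(cov_type="HC3")     # sandwich covariance estimate

print(classical.conf_int())                   # conventional CIs
print(robust.conf_int())                      # misspecification-robust CIs (wider here)
```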

The second, called the bootstrap standard error, involves resampling the data many times and computing percentiles from the resulting distribution of estimates. Again, the calculation is beyond the scope of this post, but bootstrap sampling is also readily available in most programming languages.
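
A hedged sketch of the bootstrap alternative: resample observation pairs with replacement, refit the model each time, and take percentiles of the resulting slope estimates. The helper function and data below are illustrative, not taken from the paper.

```python
# Sketch: nonparametric (pairs) bootstrap percentile interval for a regression slope.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 200)
y = np.exp(x) + rng.normal(0, 1, 200)         # same kind of misspecified toy data

def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05):
    n = len(y)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)           # resample observation pairs with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    return np.quantile(slopes, [alpha / 2, 1 - alpha / 2])

print(bootstrap_slope_ci(x, y))               # percentile CI for the slope
```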

There is little difference between the CIs estimated by the two methods, so you can use either. The main advantage of the sandwich method is that it’s more computationally efficient. However, bootstrap sampling gives you more to work with when deriving conclusions – for instance, you can check the normality assumption with a QQ plot.

2 – Language of Interpretation

We also need to account for the lack of causality. Often, when interpreting model coefficients, we say:

"For a one-unit change in X, we will see a β-unit change in Y, holding all other predictors constant."

Here, beta (β) is our linear regression coefficient. However, this language assumes a correctly specified model, so to cover the possibility that the model is incorrect, we instead say:

"β is the difference in the best linear approximation to Y for a unit difference in X, holding all other predictors constant."

Is this overkill?

That’s a tough question to answer concisely.

In most data science applications, we don’t have to be perfect. However, if you want to be as statistically rigorous as possible, you should adopt this framework.

In most industry applications, if we’re looking to draw causal inferences from a model, we try to include many predictors to build the most robust model possible. However, if you subscribe to Box’s rationale, a perfect model is impossible.

The author suggests doing exploratory inference using "incorrect" models, then determining true causality with an A/B test.

Implementation Notes

  • The assumption of correct model specification is very widely held. If you’re not a senior employee, it’s a good idea to really understand these nuances before pitching this framework to the rest of your team.
  • Sometimes we don’t need true causality to inform a decision. If we have a good approximation of the "true" model, we can still develop useful inferences – we just can’t guarantee their accuracy.

Thanks for reading! I’ll be writing 43 more posts that bring "academic" research to the DS industry. Check out my comments for links/ideas on developing correct models.

