Data science is a mix of many disciplines – statistical inference, analytics, visualization, classification, and forecasting. The one that people generally find most fascinating is forecasting.
Forecasting the future sounds like magic, whether it is detecting in advance the intent of a potential customer to purchase your product or figuring out where the price of a stock is headed. If we can reliably predict the future of something, then we own a massive advantage.
Machine learning has only served to amplify this magic and mystery. As forecasting models increase in complexity (at the expense of interpretability), they are able to uncover previously unnoticed correlations as well as capitalize on previously hard-to-use data.
But forecasting, at its heart, is actually pretty simple. Today we will introduce a mental model for thinking about forecasting that will hopefully demystify this art and encourage you to add forecasting to your quantitative toolkit.
Forecasting with Linear Regression
One of the first models we learn in statistics is linear regression. It’s versatile, interpretable, and surprisingly powerful. And it is a great tool for illustrating how forecasts work.
The non-math description of how linear regression works is that it seeks to fit a line as optimally as possible through the dots that represent your data. This line attempts to capture the trend in your data. The plot below shows what this line of best fit (in blue) looks like. Let’s unpack what is going on (a minimal code sketch of such a fit follows the list):
- Each dot represents an observation. Its vertical position is the value of the target variable (what we are trying to predict) for the observation represented by the dot. And the dot’s horizontal position is the value of our feature (our independent variable).
- There is clearly a positive correlation between the target variable and our feature – notice how the dots trend upwards and to the right.
- The line of best fit is the line that, while going in the direction of the trend, passes through as many of the dots as possible. This line generalizes the relationship between our target and our feature.
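For the curious, here is a minimal sketch of fitting such a line of best fit in Python with NumPy. The data below is synthetic and made up by me purely for illustration – it is not the data behind the plots:

```python
import numpy as np

# Made-up data for illustration: a noisy positive relationship
rng = np.random.default_rng(0)
feature = rng.uniform(0, 10, size=100)               # independent variable (x-axis)
target = 2.0 * feature + rng.normal(0, 3, size=100)  # target variable (y-axis)

# A degree-1 polyfit returns the slope and intercept of the
# least-squares line of best fit through the dots
slope, intercept = np.polyfit(feature, target, deg=1)
print(f"line of best fit: target ≈ {slope:.2f} * feature + {intercept:.2f}")
```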

So what’s this have to do with forecasting the future? Let’s imagine that we are trying to forecast U.S. GDP (economic) growth. We hypothesize that there is a relationship between GDP growth and corporate profit margins (how profitable companies are). We plot them and see the following relationship (all data is made up by me and for illustrative purposes only):

Strong positive correlation! So do high corporate profit margins predict high GDP growth? Not so fast – this is where the little voice in your head should be whispering:
"Correlation does not imply causation."
- Voice in your head
It’s true that periods of economic expansion are generally accompanied by fat corporate profits (and fat margins). But we are trying to understand whether profit margins predict economic growth. To do this, we need to modify our analysis:
- Wrong: current profit margins vs. current GDP growth.
- Correct: current profit margins vs. next year’s GDP growth.
This distinction is key. We are not trying to find features that are concurrently correlated with GDP growth. Rather, we want to find features that predict future GDP growth. That is, we want to find features correlated with next period’s GDP growth (our target is GDP growth over the next year). Let’s redo our scatter plot with this in mind:

Where did our positive correlation go? Predicting the future is hard. Strong concurrent correlations between variables generally vanish when we switch one of the variables to its future state (and measure predictive correlation).
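In code, the switch from a concurrent to a predictive correlation is just a one-line shift of the target. Here is a rough sketch with pandas – the DataFrame, column names, and numbers are all placeholders I made up:

```python
import pandas as pd

# Placeholder annual data – the values are invented for illustration only
df = pd.DataFrame({
    "year":          [2015, 2016, 2017, 2018, 2019, 2020],
    "profit_margin": [0.08, 0.09, 0.11, 0.10, 0.07, 0.06],
    "gdp_growth":    [0.026, 0.017, 0.023, 0.029, 0.023, -0.028],
})

# Wrong: current profit margins vs. current GDP growth (concurrent correlation)
concurrent = df["profit_margin"].corr(df["gdp_growth"])

# Correct: current profit margins vs. NEXT year's GDP growth (predictive correlation)
df["gdp_growth_next_year"] = df["gdp_growth"].shift(-1)
predictive = df["profit_margin"].corr(df["gdp_growth_next_year"])

print(f"concurrent correlation: {concurrent:.2f}")
print(f"predictive correlation: {predictive:.2f}")
```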
That doesn’t mean that all hope is lost. Usually, if we look hard enough and think creatively, we can still find some truly predictive relationships. But do not expect these relationships to be strong – in my experience (trying to predict the future in financial markets), predictive correlations of 0.20–0.30 are already quite good.
A Mental Model for Understanding Forecasts
Let’s say we continue to diligently test features against future GDP growth and eventually stumble upon the following relationship:

We have found a relationship between future GDP growth and the currently observed growth in the working young (people aged 25–35 years). It’s not as strongly positive as the concurrent relationship (GDP growth and profit margins) we saw earlier, but this is a predictive relationship that we can use to build a forecasting model (again, I emphasize that this is all data I made up in order to illustrate how forecasting works).
So how do we build the model? Well actually it is already done – the blue line of best fit is our model. We make our prediction by simply seeing what the current change in the working young population is (let’s say it is the value denoted by the gold dashed line). We then look at the value along the y-axis (vertical axis) where the gold dashed line intersects with our blue line of best fit (the green dashed line) – that value is our prediction.

Ok cool, that’s pretty easy. And the process is the same with more features or even with more complicated forecasting algorithms.
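If you prefer code to pictures, here is roughly what that procedure looks like using scikit-learn’s LinearRegression. The numbers are invented stand-ins for the plotted data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented history: current growth of the working young population (feature)
# paired with the FOLLOWING year's GDP growth (target)
young_growth = np.array([[0.001], [0.004], [0.006], [0.009], [0.012], [0.015]])
gdp_growth_next = np.array([0.010, 0.015, 0.014, 0.021, 0.024, 0.028])

# The fitted model is our blue line of best fit
model = LinearRegression().fit(young_growth, gdp_growth_next)

# The "gold dashed line": the currently observed value of our feature
current_young_growth = [[0.008]]

# The "green dashed line": the height of the line of best fit at that value
forecast = model.predict(current_young_growth)[0]
print(f"forecast of next year's GDP growth: {forecast:.3f}")
```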
But at a more fundamental level, what does our forecast really mean? Every forecast, regardless of how complicated the model that produced it is, can be thought of as a conditional expectation. A conditional expectation is nothing more than a type of expected value, a.k.a. an average.
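In symbols, with Y as our target, X as our feature, and x the currently observed feature value, the forecast is just

$$\hat{y}(x) \;=\; \mathbb{E}\left[\, Y \mid X = x \,\right]$$

– the expected (average) value of the target, given that the feature takes the value we see right now.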
So am I telling you that the value produced by your fancy XGBoost regressor is nothing more than an average, quite literally the first concept you learn in a beginner statistics class?
Pretty much (with a few caveats). Don’t believe me? Let’s conduct a thought exercise. If I asked you to predict what the weather will be tomorrow (without checking the internet), how would you do it? You would reason something like the following: "Yesterday it was hot, and it’s August right now, and we are in San Jose, so I predict tomorrow to be sunny and hot."
Why did you come to that conclusion? It’s because, maybe even without consciously realizing it, you did a mental inventory of all the days within your memory where you were in San Jose, it was summer, and the weather of the previous few days had been hot. And on average, the weather across those relevant days in your mental inventory was sunny and hot.
So you made your prediction by filtering down to just the relevant observations (summer days in San Jose that had been preceded by hot weather) and taking the average of those observations. That’s generally how we, as people, intuitively make predictions – so why should we expect models and machines to be any different? The answer is that they aren’t different, except that they can process much more data and do so in a less biased manner.
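Here is that mental inventory written out as a toy sketch in pandas – the cities, dates, and temperatures are all made up:

```python
import pandas as pd

# A made-up "mental inventory" of past days
days = pd.DataFrame({
    "city":          ["San Jose", "San Jose", "San Jose", "Seattle", "San Jose"],
    "month":         ["Aug", "Aug", "Jul", "Aug", "Dec"],
    "yesterday_hot": [True, True, True, False, False],
    "high_temp_f":   [92, 95, 90, 71, 58],
})

# Step 1: filter down to just the relevant observations
relevant = days[
    (days["city"] == "San Jose")
    & (days["month"].isin(["Jun", "Jul", "Aug"]))
    & (days["yesterday_hot"])
]

# Step 2: the prediction is the average of those relevant observations
print(f"predicted high for tomorrow: {relevant['high_temp_f'].mean():.0f}F")
```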
So my suggestion for understanding how forecasting the future works is to think of each prediction as derived from an average of the most relevant previously observed observations. Forecasts of the future are just answering the following question:
When I saw something similar to this happen in the past, what typically occurs next?
Visualizing the Conditional Average (a.k.a. our Prediction)
Let’s use our previous example to visualize what we mean when we say that a prediction is a conditional average. The plot below shows how I tend to visualize linear regressions and, more generally, the outputs of predictive models. Each gold rectangle is a slice of the sample (our observations) conditioned upon the feature variable being between certain values. For example, the first gold rectangle on the left might be just those observations of GDP growth (our target, on the y-axis) where the growth rate of the working young population (our feature, on the x-axis) is between 0% and 0.1%.

Notice how the blue prediction line (the line of best fit) falls in the middle of each of those gold rectangles? That’s by design – the middle of each rectangle is the average of that particular slice (recall that each rectangle represents one slice). And as we take smaller and smaller slices (as we increase the number of rectangles while at the same time decreasing the width of each rectangle), the series of conditional averages (represented by the middle of each rectangle) converges to the regression line.
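A quick way to convince yourself of this convergence is to compute the conditional averages directly on a synthetic sample and compare them to the fitted line. Again, the numbers below are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for our scatter plot
rng = np.random.default_rng(1)
feature = rng.uniform(0.0, 0.01, 500)                # e.g. working-young growth
target = 1.5 * feature + rng.normal(0, 0.004, 500)   # e.g. next year's GDP growth

# The "gold rectangles": slice the feature into bins and average the
# target within each slice (one conditional average per slice)
slices = pd.cut(feature, bins=10)
conditional_means = pd.Series(target).groupby(slices, observed=True).mean()

# The blue line of best fit, evaluated at the center of each slice
slope, intercept = np.polyfit(feature, target, deg=1)
centers = np.array([interval.mid for interval in conditional_means.index])
line_values = slope * centers + intercept

# With enough observations per slice the two columns are close, and they
# get closer as we shrink the slices (and grow the sample)
comparison = pd.DataFrame(
    {"conditional_mean": conditional_means.values, "line_of_best_fit": line_values},
    index=centers,
)
print(comparison.round(4))
```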
"But hold on a second!" you are thinking. The middle of some of these rectangles is clearly not the average of the observations within that slice. For example, that first gold rectangle includes only one observation, and it is above our prediction line. So how can I say that the prediction line is the conditional average, a.k.a. the average within that slice?
For simple linear regression, the conditional averages don’t match the averages of the observations in each slice because the regression algorithm enforces a linear relationship between target and feature. But all is not lost! Our mental model is still valid.
OLS (ordinary least squares), the method linear regression uses to find the line of best fit, is like taking a bunch of conditional averages but with a twist. We can use a voting analogy to see what I mean:
- Imagine each gold rectangle as an individual precinct.
- Each precinct wants to set the value of the blue prediction line within the precinct (this is the same as the vertical midpoint of each gold rectangle) as close to the local precinct average as possible (the average Y axis value of all the black dots in the precinct).
- When an individual precinct moves the prediction line, that move affects the value of the prediction line across all precincts (because it must remain a straight line).
- Each precinct’s ability to impact the prediction line is positively proportional to the number of observations within that precinct (more observations, more impact).
The net result is that each precinct does its best to push and pull the line towards its own average. At the same time, each precinct feels and is affected by every push and pull made by the other precincts, especially the ones with a lot of observations. Ultimately, the prediction line settles in such a way that:
- Each precinct is satisfied, having moved the line, even if just a little, towards its own precinct average.
- The overall average of the group is taken into account (since everyone tries to move the prediction line towards their own individual average).
So while the prediction made by a linear regression is not a conditional average in the strictest sense, it is close to one thanks to the way each precinct tugs the prediction line towards its own average.
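The voting analogy can be checked numerically. Below is a rough sketch on made-up data – not an exact equivalence, just an approximation that holds when the precincts are narrow: fitting a line to the precinct averages, with each precinct weighted by its number of observations, lands very close to the ordinary OLS line fit on the raw observations.

```python
import numpy as np
import pandas as pd

# Synthetic observations for illustration
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 400)
y = 0.8 * x + rng.normal(0, 0.2, 400)

# OLS line of best fit on the raw observations
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# The "precincts": slice x into bins and record each precinct's
# average position and its number of observations (its voting power)
df = pd.DataFrame({"x": x, "y": y, "precinct": pd.cut(x, bins=8)})
precincts = df.groupby("precinct", observed=True).agg(
    x_mean=("x", "mean"), y_mean=("y", "mean"), n=("y", "size")
)

# Weighted fit to the precinct averages; np.polyfit squares the weights,
# so passing sqrt(n) weights each precinct by its observation count
w_slope, w_intercept = np.polyfit(
    precincts["x_mean"], precincts["y_mean"], deg=1, w=np.sqrt(precincts["n"])
)

print(f"OLS on raw data:           slope={ols_slope:.3f}, intercept={ols_intercept:.3f}")
print(f"weighted fit to precincts: slope={w_slope:.3f}, intercept={w_intercept:.3f}")
```

The two fits differ slightly because the spread of x values inside each precinct also matters to OLS, but with narrow precincts the compromise-of-averages picture is a good approximation.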
So What?
How do conditional averages help us better understand forecasting? Well, they are the underlying backbone of every forecast. No matter how complicated, every algorithm at its most basic level tries to identify observations that are similar in nature to the ones it is trying to make predictions for. And the predictions it makes are derived from the average values of the target among those most similar observations.
There are even certain algorithms, like k-nearest neighbors or linear regression using only non-ordinal categorical features, that predict explicitly via the conditional average.
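For instance, a k-nearest-neighbors forecast is literally the average target of the k most similar past observations. A quick sketch with scikit-learn and made-up numbers:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up history: feature values paired with next-period targets
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]])
y = np.array([1.0, 1.2, 1.1, 1.6, 1.8, 1.7])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# The forecast for a new observation is an explicit conditional average:
# the mean target of the 3 most similar past observations (0.3, 0.4, 0.5)
print(knn.predict([[0.44]])[0])     # the model's prediction
print(np.mean([1.1, 1.6, 1.8]))     # the same number, computed by hand
```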
So there is nothing magical about a forecast – neither algorithms nor humans can transform current data into precise forecasts of the future. Rather, we are all just using what occurred in the past as a rough guide for what may happen in the future.
Key Takeaways
We should keep two things in mind when building predictive algorithms. First, since we use the past to inform our predictions of the future, context is critical. If we are in an environment that is significantly different from the available history that our model was trained with, then we cannot expect our predictions to perform well.
Second, this mental model is useful for thinking about over-fitting. We can build a model that generalizes reasonably well by following these two principles:
- Make sure that each conditional average includes enough observations (an average of just a few observations is more likely to be wrong than that of many). We can ensure that this is so by not including too many features – too many features (relative to the number of observations) would result in extremely fine slices, each with very few observations.
- Be careful with algorithms that allow an individual precinct (slice) too much influence on its own within-precinct prediction. An algorithm that incorporates too much non-linearity (like a polynomial regression or a deep neural network) would allow each precinct to pull the within-precinct prediction all the way to its own precinct average. This can lead to unreasonably volatile forecasts that don’t generalize well (a short numerical sketch of this effect follows the list).
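Here is the promised sketch of the second principle, on made-up data whose true relationship is just a noisy straight line. A degree-nine polynomial (standing in here for any overly flexible learner) is free to chase each slice’s local average, so its forecasts – especially just outside the observed range – tend to swing around far more than the simple line’s:

```python
import numpy as np

# Made-up sample: the true relationship is a straight line plus noise
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 15))
y = 2.0 * x + rng.normal(0, 0.3, 15)

# A straight line: every "precinct" has to compromise with the others
line = np.polyfit(x, y, deg=1)

# A degree-9 polynomial: each slice of the data can drag the curve onto its
# own local average, however few observations that slice contains
wiggly = np.polyfit(x, y, deg=9)

# Compare forecasts at a point inside the observed range and one just outside it
for x_new in (0.37, 1.05):
    print(f"x={x_new:.2f}  linear: {np.polyval(line, x_new):7.2f}   "
          f"degree-9: {np.polyval(wiggly, x_new):7.2f}")
```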
Hope this was helpful! Cheers!