Notes from Industry
In this article I want to take a page out of a data scientist’s book and explore an important guardrail that the Data Science community has implemented as part of their modeling process – defining model stability independently of model performance.
With the rise of big data, the number of variables available to model a given problem has grown exponentially. For instance, models used to recommend songs or movies no longer rely only on generic variables like previously watched movies or liked songs. Now we can include a whole slew of metrics that make the user profile more dynamic: time of day, weather, history, potential mood, songs liked but never listened to again, and so on. The list can grow in perpetuity, and it probably will. Every day we have the most data we've ever had and the least we will ever have again. More correlations are discovered every day; some are causal and some are not.
How do we know which variables to use and which ones to disregard? What if some variables are significant for predicting one user's behavior but not another's?
Fortunately, data science came up with the idea of defining model stability. It is another way to characterize model performance, but it is not entirely dependent on forecast accuracy. "Stability" is a fluid term whose precise meaning depends on the model at hand; more generally, it is a measure of how a model learns (stability) rather than what it learns (accuracy). Controlling for accuracy, we want to choose the model that is more "stable": one that can be applied to most users, identifies the relevant set of variables consistently, and maintains the ordering of those variables by relevance. All of this is to say that between two very accurate models, we want the one that generalizes better and does not spin out of control when we change things ever so slightly.

For example, a complex, accurate and pragmatic high-frequency trading algorithm can go haywire if security prices become highly volatile; so much so that exchanges like NASDAQ rely on circuit breakers to halt trading if such an event arises, and it has. One such blow-up happened on March 9, 2020. While many factors led to the circuit breakers being triggered, the fact remains that past a certain threshold the models could not be trusted to make accurate or stable decisions. So while data science is conscious of separating stability from accuracy, I wonder whether econometrics should actively work to build models with both accuracy and stability in mind. And if we do, how should we define stability?
Once we start defining stability, it becomes easy to make the definition more and more nuanced. Many machine learning workflows use k-fold or n-fold cross-validation to measure the stability of the variables chosen, and we can easily adapt these methods for econometric modeling. However, since econometric data is inherently temporal, i.e. the data used to train models carries time-ordered relationships, it would behoove us to examine more closely how we can define stability in this variable space.
For starters, unlike k- or n-fold cross-validation, we cannot subset the data randomly without losing its temporal relationships. The models we use to forecast often depend on the assumption that a value at time t is related to its lags {t-1, t-2, …, t-n}. Additionally, depending on how we decide to model these temporal relationships, we use different time series structure representations: ARIMAX, exponential state space, Fourier basis, radial basis, and so on. Each of these exploits temporal relationships slightly differently, which means the learning mechanism (stability) cannot be measured by one generalized cross-validation technique.
For the purposes of this article I want to focus on only one temporal representation, AR structures, and on how stable the algorithm implemented in R's auto.arima function is. The function is part of the forecast package. I hope readers will come forward with ideas on how best to define stability for the other structures I mentioned.
An ARMA(p, q) process can be represented as below:

X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t
Coefficients for each of the lags of X and epsilon are calibrated using an information criterion (AIC), which inherently works to reduce error (improve accuracy). However, AIC holds no information about the stability of the structure; it only evaluates how efficiently a model retains information about the data it trains on.
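As a quick illustration of that selection step (a minimal sketch with a simulated stand-in series, not a prescription), auto.arima lets you choose which information criterion it minimizes:

```r
library(forecast)

# A small simulated AR(1) series as a stand-in for real data
set.seed(1)
y_ar1 <- arima.sim(model = list(ar = 0.6), n = 200)

# auto.arima searches over (p, d, q) and keeps the model that minimizes the
# chosen information criterion - AIC here - which targets fit, not stability
fit <- auto.arima(y_ar1, ic = "aic")
AIC(fit)   # the criterion the search optimized
coef(fit)  # the lag coefficients it selected
```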
If AIC is the criterion for selecting lags and coefficient values, then perhaps we can measure the stability of an ARMA model by looking at:
- The coefficient value assigned to each lag: an accurate and stable model should calculate the correct coefficient, and then keep doing so every time we add another data point for it to train on.
- The model's reaction as we add random perturbations to the data: if we try to trick the model by adding data that does not come from the same population as our training data, a stable model would ideally not get tricked all that easily, even at the cost of accuracy; it should not try to predict the shocks.
Another important characteristic of stability measurement is the idea that all the measurements must happen on subsets of the same sample data. Given the temporal nature of our data, we must adapt these techniques slightly to preserve the temporal information for our models. One solution is to use rolling validation. We often use it as a way to measure out-of-sample forecast accuracy, but we can reuse the same machinery here; we just have it measure different metrics as it rolls forward.
[For a quick catch-up on rolling validation, readers can visit Rob J. Hyndman's blog.]
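As a minimal sketch of that machinery in R, the forecast package ships a rolling-origin helper, tsCV; the wrapper name fc_fun and the simulated series below are my own illustrative choices, not the exact setup used for the figures later in this article:

```r
library(forecast)

# A simulated stand-in series; swap in your own data here
set.seed(2)
y_demo <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 300)

# tsCV refits the model on an expanding window and returns the
# one-step-ahead forecast error at every origin as it rolls forward
fc_fun <- function(x, h) forecast(auto.arima(x), h = h)
errors <- tsCV(y_demo, fc_fun, h = 1)

# Rolling out-of-sample RMSE; the same loop can just as easily record
# which model and coefficients were chosen at each origin
sqrt(mean(errors^2, na.rm = TRUE))
```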
We can easily simulate an AR process, which means we know a priori what the lag coefficients are. Next, we can train the auto.arima algorithm using the rolling validation method and see how quickly and how often it picks up the correct lags and coefficients. See below for a visual representation of the AR process:

The above process has 4 lags with the coefficient vector {0.7, -0.2, 0.5, -0.8} and is 1,000 periods long. The rolling validation/training starts at n = 20, and below is a representation of the coefficients calculated at each iteration. At least for this simulated dataset, the auto.arima algorithm takes approximately 400 data points to approach a numerically stable and acceptably accurate solution for the coefficients (Fig 1). But notice how the out-of-sample accuracy of the calibrated model, even over the first 200 data points where the coefficient estimates are very volatile, is comparable to that of the later data points (Fig 2); the only difference is that the later calibrated model is far more robust to each new data point the algorithm trains on.
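Here is a rough sketch of how such an experiment could be set up; the simulation uses the coefficients above, but the 10-observation step, the object names, and the loop structure are my own illustrative choices rather than the exact code behind Fig 1 and Fig 2:

```r
library(forecast)

# Simulate an AR(4) process with the known coefficient vector from above
set.seed(123)
true_ar <- c(0.7, -0.2, 0.5, -0.8)
y <- arima.sim(model = list(ar = true_ar), n = 1000)

# Rolling (expanding-window) calibration: refit auto.arima as the training
# window grows and record the order and coefficients it settles on each time
origins <- seq(20, 1000, by = 10)   # start at n = 20; step of 10 keeps runtime sane
coef_path <- lapply(origins, function(n) {
  fit <- auto.arima(y[1:n])
  list(n = n, order = arimaorder(fit), coefs = coef(fit))
})

# Inspect the final calibration; earlier entries show how the estimates evolved
tail(coef_path, 1)
```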


To further drive home the point on accuracy vs. stability: another common problem in time series forecasting is training data that contains random discontinuities which don't follow the same underlying dynamics as the series itself. Often, if we are able to identify these shocks accurately, we either remove the affected data points from the training data entirely or smooth them out to reduce their bias on the model. However, if we can measure the robustness of our methods and algorithms, we can make a much more informed choice about how much to modify or engineer the raw data before modeling it. This is not to say that feature engineering is an ineffective approach, but one must stop to reflect on whether it improves the model or merely complicates it. In my opinion, parsimony should not be traded for complex-sounding models that offer little to no improvement in the insights we draw from them.
To test auto.arima's stability, we perturb the data slightly by adding random shocks that do not come from the same distribution as the data we wish to forecast. We know that auto.arima attains stability, for this dataset, at around 400 data points; we can run a rolling validation on a perturbed version of the same time series and see how it compares to its unperturbed peer. See below for the same time series with random discontinuities sprinkled in:
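A minimal sketch of how such a perturbation might be added, continuing from the simulation snippet above (the number, locations, and sizes of the shocks here are assumptions for illustration, not the exact values behind the figures):

```r
# Sprinkle a handful of large shocks, drawn from a different distribution,
# at random locations in the training series
set.seed(456)
shock_idx <- sort(sample(50:950, 8))
y_shocked <- y
y_shocked[shock_idx] <- y_shocked[shock_idx] + rnorm(length(shock_idx), sd = 10)

# Re-run the same rolling calibration on the perturbed series so the chosen
# orders and coefficients can be compared against the unperturbed run above
coef_path_shocked <- lapply(origins, function(n) {
  fit <- auto.arima(y_shocked[1:n])
  list(n = n, order = arimaorder(fit), coefs = coef(fit))
})
```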

And below is how the accuracy of auto.arima on this perturbed series compares to its accuracy on the previous, unperturbed version of our data (yellow lines show where the discontinuities were added):
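A rough sketch of how that accuracy comparison could be computed, reusing y and y_shocked from the earlier snippets (note that refitting auto.arima at every origin is slow):

```r
library(forecast)

# One-step rolling forecast errors on the clean vs. the perturbed series
fc_fun <- function(x, h) forecast(auto.arima(x), h = h)
e_clean   <- tsCV(y, fc_fun, h = 1)
e_shocked <- tsCV(y_shocked, fc_fun, h = 1)

# Compare rolling out-of-sample RMSE before and after the shocks
sqrt(mean(e_clean^2, na.rm = TRUE))
sqrt(mean(e_shocked^2, na.rm = TRUE))
```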

It's easy to see that the forecasting algorithm does in fact become relatively less accurate when we shock the training data randomly. What people tend to forget is that while stability might be defined independently of accuracy, it directly affects it. See below for the coefficients calculated for this new dataset:




The legend shows the true coefficient values, which we know the model converges to accurately on the unperturbed data. However, when the same data is shocked randomly, the coefficient values are thrown completely off, as is evident from the red lines. A few uncorrelated shocks to the data entirely changed the estimated model representation. A stable algorithm should not be influenced too much by an unsubstantiated shock to the data.
The interesting thing here is that the algorithm is unstable, and hence inaccurate, both in the coefficients calculated (the AR1, AR2, AR3 and AR4 values are very different from those on the original dataset) and in the number of coefficients calculated (the original data has no MA1 or MA2 terms, yet the algorithm incorrectly assigns them non-zero values on our perturbed dataset). If auto.arima picks the wrong model representation, then by construction the forecasts it produces with those coefficients will be further from the actual values than those of the model trained on unperturbed data.
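Continuing the sketch, the final fits on the clean and perturbed series can be compared side by side; the object names carry over from the earlier snippets and are illustrative:

```r
# Fit once on the full clean series and once on the full perturbed series
fit_clean   <- auto.arima(y)
fit_shocked <- auto.arima(y_shocked)

# Compare the selected (p, d, q) orders and estimated coefficients; spurious
# MA terms appearing only on the shocked series would signal instability
arimaorder(fit_clean)
arimaorder(fit_shocked)
coef(fit_clean)
coef(fit_shocked)
```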
I want readers to stop and think about the above results a little. When you finish a forecasting exercise and the software spits out its result, which values do you think your algorithm is showing you: the black line or the red line? Since we only see the final value, and not how the algorithm converged to it, the nuance of stability gets lost. Your answer will determine the feature engineering steps you need to take to ensure you have modeled the econometric process responsibly.
In this case, auto.arima sacrificed stability and, in turn, accuracy. Unfortunately, in my experience, breaking stability out from accuracy is not part of the conventional econometric workflow. If we measure accuracy and stability as two different metrics on both raw and processed data, we can make a much more informed decision on whether to engineer the raw data at all. There are many ways to engineer away the bias of a single measurement and even more forecasting algorithms to choose from; it is imperative that we have a consistent and rigorous framework to help us pick each.
Remember, this article has only scratched the surface of measuring econometric stability, but we are already able to provide adequate and pragmatic support for modeling decisions that are often based on "educated" guesses or, worse, hunches.
I hope readers are inspired to come up with newer ways to explore the topic!
Vedant Bedi is an Analyst at Mastercard working on the NAM portfolio development team. Vedant holds a Bachelor’s degree in Mathematics and Economics from NYU (magna cum laude class of 2019) and holds an avid interest in data science, econometrics and its many applications in finance.
Vedant is also an inducted member of Phi Beta Kappa (NYC chapter) – the oldest academic honors society in the United States.