Recession Prediction using Machine Learning

“Subprime is contained” — Bernanke (March 2007)

Terrence Zhang
Towards Data Science


  • To receive periodic updates on the model predictions via email, click here.
  • Recession Prediction chart is currently hosted here.
  • Python script and documentation can be found here.


In a Nutshell

This project presents a machine learning approach to predicting the occurrence of U.S. recessions over 6-month, 12-month, and 24-month time frames. The predictive model uses a handful of employment, inflation, interest rate, and market indicators.

Defining the Problem

Simply put:

Can one predict the timing of future U.S. recessions by looking for clues in economic and market data?

Several interesting attributes of this problem include:

  • Rare-event prediction: Recessions are rare, so positive examples are scarce.
  • Small data set: Because I am using economic data (which is released at monthly or quarterly frequency), I will only have a few hundred data points to work with.
  • Unbalanced class samples: While the imbalance is not severe enough to turn this project into an anomaly detection exercise, one must still be cognizant of sample class imbalance, especially for error calculations.
[Chart: the most extreme example of class imbalance in the exploratory data set]
  • Time series: Typical for a project involving financial data. This affects how one carries out cross-validation (one should not do simple K-fold cross-validation).

Why does this problem need to be solved?

This is a timely problem as the current bull market grows old. Armed with a good sense of recession likelihood:

  • Policy makers could enact countermeasures to decrease the severity of cycle downturns.
  • Market participants can save money by adopting defensive investment strategies.

Main Assumptions

There are 2 main theoretical assumptions at play:

1) U.S. recessions exhibit markers / early warning signs.

There exist plenty of recession “signals” in the form of individual economic or market data series. Individually, these signals have limited information value, but they may become more useful when combined.

2) Future recessions will be similar to historical recessions.

This assumption is much shakier, but it can be mitigated by choosing features that maintain significance despite the changing economic landscape. For example, manufacturing data may have been relevant historically, but may be less relevant going forward as the world goes digital.

What Have Others Tried?

Model Benchmarking / Comparison

Ideally, I would compare my model’s performance to each of these alternative models. Currently, I cannot do this, for the following reasons:

  • Guggenheim model: Model performance data is not publicly released.
  • New York Fed model: Upon closer inspection, their model is built to answer the question “what is the probability that the U.S. will be in a recession X months from now?”, whereas my model is built to answer the question “what is the probability that the U.S. will be in a recession within the next X months?”. For more detail, see the How do I label the class outputs section below.
  • Rabobank model: For the same reason I can’t compare my model performance to the New York Fed model. Additionally, the Rabobank model covers a 17-month time frame, whereas my model covers 6-month, 12-month, and 24-month time frames.
  • Wells Fargo Economics model: Model performance data is not publicly released.

Getting the Data

Some things I had to consider when getting the data:

  • Economic data are released at different frequencies (weekly, monthly, quarterly, etc.). To time-match data points, I settled for only using data that could be sampled monthly. As a result, all predictions must be conducted at a monthly frequency. This means I had to forgo using valuable data series such as GDP, since it is released quarterly.
  • Even when only using monthly data, different data have different release dates over the course of a month. To control for this, all predictions are conducted using the most recent data available as of the 7th day of each month. The FRED API has a parameter that can set this restriction. EDIT (January 2021): the code now pulls data as of the 8th of each month to accommodate more timing edge cases.
  • Varying data history lengths: some series go back to 1900, while others only go back a few years. This means I had to exclude potentially useful data that simply didn’t have enough history.
  • Speaking of history, I needed enough data to encompass as many recessions as I could. In the end, the full data set included 9 recessions since 1955.
  • Economic data gets revised often, but FRED does not provide the original figures. It only provides the revised figures (no matter how far after-the-fact those revisions are made). One alternative would be ALFRED, which shows numbers available as of a particular “vintage” date, but ALFRED’s fatal flaw is that it does not have vintages for every date in a time series. By sticking to the most recent revised numbers from FRED, I make the implicit assumption that revisions are unbiased (i.e. revisions are equally likely to push the original figure up vs. down).

Data Sources

For practical reasons, I used public domain data available through FRED and Yahoo Finance. I did not use potentially useful data that is stuck behind a paywall, such as The Conference Board Leading Economic Index.
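
For illustration, here is a minimal sketch of how two of the monthly inputs might be pulled, using the fredapi and yfinance packages. The packages, placeholder API key, and alignment step are my assumptions (not necessarily the project's actual script), and this sketch omits the as-of-the-7th restriction described above:

```python
# Rough sketch only: pull monthly nonfarm payrolls from FRED and monthly
# S&P 500 closes from Yahoo Finance, then align them on calendar months.
import pandas as pd
from fredapi import Fred
import yfinance as yf

fred = Fred(api_key="YOUR_FRED_API_KEY")  # placeholder key

# Total Nonfarm Payrolls (FRED series PAYEMS), latest revised figures
payrolls = fred.get_series("PAYEMS", observation_start="1955-08-01")

# S&P 500 monthly closes
sp500 = yf.download("^GSPC", start="1955-08-01", interval="1mo")["Close"].squeeze()

# The two series are stamped on different days of the month, so align them
# on calendar months before any feature engineering.
payrolls.index = payrolls.index.to_period("M")
sp500.index = sp500.index.to_period("M")
data = pd.concat({"payrolls": payrolls, "sp500": sp500}, axis=1)
```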

Feature Selection

Some project-specific considerations for feature selection:

  • Curse of Dimensionality. Since the data set is so small (only a few hundred data points), one cannot include too many features in the final model. Otherwise, the model will fail to generalize to out-of-sample data. Therefore, features must be carefully chosen for the incremental value that each feature provides.
  • Domain knowledge is key. Since the underlying process is a complex time series, automated feature selection methods have a high risk of over-fitting to prior data. Therefore, feature selection must be guided by a solid understanding of economic fundamentals.

First, a sneak peek at the final feature list. Note how small it is: only 6 features made the final cut.

  • Total nonfarm payrolls: 3-month change (%) minus 12-month change (%)
  • Fed Funds Rate: 12-month change
  • CPI: 3-month change (%)
  • Treasury spread: 10-year rate minus 3-month rate
  • 10-year treasury rate: 12-month change
  • S&P 500 Index: 12-month change (%)

A general outline of my feature-selection process:

  1. Define the data set on which to perform exploratory data analysis (August 1955 to January 1972) to ensure no intersection with cross-validation periods.
  2. Organize potential features into buckets, based on economic / theoretical characteristics (domain knowledge).
  3. Plot pairwise correlations between each individual feature and each output type (6-month ahead, 12-month ahead, 24-month ahead) using the exploratory data set only (no peeking ahead!).
  4. Move sequentially from feature bucket to feature bucket, such that each bucket has at least one feature in the final data set. To break ties, pick features that have low correlations with features that have already been “accepted” into the final data set.
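
As a sketch of step 3 under some assumptions of mine (a features DataFrame plus one label Series per horizon, names of my choosing), correlations restricted to the exploratory window might be computed like this:

```python
import pandas as pd

def exploratory_correlations(features: pd.DataFrame, labels: dict,
                             start: str = "1955-08-01",
                             end: str = "1972-01-31") -> pd.DataFrame:
    """Correlation of each candidate feature with each label horizon,
    computed on the exploratory window only (no peeking ahead)."""
    window = features.loc[start:end]
    return pd.DataFrame(
        {name: window.corrwith(y.loc[window.index]) for name, y in labels.items()}
    )

# e.g. exploratory_correlations(features, {"6m": y_6m, "12m": y_12m, "24m": y_24m})
```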

Here is how it went:

First, pick 1 Employment feature

  • Civilian Unemployment Rate: Businesses don’t typically start firing until after things start turning sour. Therefore, unemployment typically lags recessions.
  • Total Nonfarm Payrolls: This is a better starting point. Businesses may slow the pace of hiring on expectations of a more challenging economic environment.

I considered the 3-month change in nonfarm payrolls (%), the 12-month change in nonfarm payrolls (%), and the difference between the 3-month change and the 12-month change. It turns out that the difference between the 3-month change and the 12-month change was the best predictor (out of the 3 candidates) in the exploratory data set, so I picked it as my Employment feature.
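
For concreteness, here is a minimal sketch of this transform, assuming a monthly pandas Series of payroll levels (the `payrolls` name is mine, not necessarily the project's). The same short-horizon-minus-long-horizon pattern reappears among the CPI and S&P 500 candidates below:

```python
import pandas as pd

def payrolls_feature(payrolls: pd.Series) -> pd.Series:
    """3-month % change minus 12-month % change in nonfarm payroll levels."""
    chg_3m = payrolls.pct_change(3) * 100    # % change vs. 3 months earlier
    chg_12m = payrolls.pct_change(12) * 100  # % change vs. 12 months earlier
    return chg_3m - chg_12m
```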

Pick 1 Monetary Policy feature

Setting bounds for the Fed Funds Rate is one of the main monetary tools used by the Federal Reserve. I considered the Real Fed Funds Rate, as well as the 12-month change in Fed Funds Rate. The 12-month change in Fed Funds Rate was the best out of the 2 candidates, so I picked this as my Monetary Policy feature.

[Chart: some people believe that the Fed can influence economic activity by setting bounds for the Fed Funds Rate]

Pick 1 Inflation feature

Inflation has a more complex relationship with business cycles, as it can occur during periods of rapid expansion, as well as periods of contraction (stagflation). But there may be higher-order effects between inflation and other features, so it is worth including regardless.

The Consumer Price Index (CPI) is a commonly-used proxy for inflation. I considered the 3-month change in CPI (%), 12-month change in CPI (%), and the difference between the 3-month change and 12-month change. All candidates had low correlations to the Employment feature and Monetary Policy feature, so I picked the 3-month change in CPI as my Inflation feature since it had a slightly higher correlation to the output labels in the exploratory data set.

Pick Bond Market features

I would have loved to use a credit spread index of some type (investment grade or high yield) here, but could not find a credit spread index with data going back to 1955.

So instead I started looking at treasury rates. Treasuries, as a “risk-free” asset class, play an important role in the global asset allocation framework. Shifts in treasury rates can indicate shifting expectations about economic opportunity and market risk.

I grandfathered in the spread between 10-year and 3-month treasury rates, since the yield curve slope has a decent historical track record of predicting recessions in advance. There are several theoretical explanations for this phenomenon. One of them is that if investors perceive “bad times” on the horizon, they allocate money away from risky assets and towards longer-term treasuries, to lock in a “risk-free” return over a longer time frame.

After picking the 10-year vs. 3-month treasury spread as my first indicator, I considered the 3-month treasury rate (12-month change), the 12-month change in the 10-year vs. 3-month treasury spread, and the 10-year treasury rate (12-month change). I rejected the first two candidates since they were highly correlated (90%) with the Monetary Policy feature. I decided to pick the 10-year treasury rate (12-month change) as my second Bond Market feature since it can serve as a gauge of long-term growth expectations.

Pick 1 Stock Market feature

Stock markets are driven by present-day discounting of future expectations, so they are considered to be leading indicators. I considered the 3-month change in the S&P 500 Index, 12-month change in the S&P 500 Index, and the difference between the 3-month change and 12-month change in the S&P 500 Index. The latter 2 candidates both had high correlations to the output labels, but I picked the 12-month change in the S&P 500 Index since it is easier to interpret.
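
Putting the descriptions above together, here is a sketch of how the six final features could be assembled from monthly input columns. The column names, and the exact units (percentage points for rate changes, percent for growth rates), are my assumptions rather than the project's exact code:

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the six final features, given monthly columns:
    payrolls, fed_funds, cpi, treasury_10y, treasury_3m, sp500."""
    feats = pd.DataFrame(index=df.index)
    # Employment: 3-month % change minus 12-month % change in payrolls
    feats["payrolls_3m_vs_12m"] = (
        df["payrolls"].pct_change(3) - df["payrolls"].pct_change(12)
    ) * 100
    # Monetary policy: 12-month change in the Fed Funds Rate
    feats["fed_funds_chg_12m"] = df["fed_funds"].diff(12)
    # Inflation: 3-month % change in CPI
    feats["cpi_chg_3m"] = df["cpi"].pct_change(3) * 100
    # Bond market: 10-year minus 3-month treasury spread
    feats["yield_spread_10y_3m"] = df["treasury_10y"] - df["treasury_3m"]
    # Bond market: 12-month change in the 10-year treasury rate
    feats["treasury_10y_chg_12m"] = df["treasury_10y"].diff(12)
    # Stock market: 12-month % change in the S&P 500
    feats["sp500_chg_12m"] = df["sp500"].pct_change(12) * 100
    return feats.dropna()
```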

Designing the Testing Process

Some things I had to consider when designing the tests:

Time Series? Then no K-Fold Cross-Validation

Since the underlying prediction is a time series, one must conduct model cross-validation and prediction in real-time (i.e. only using data that was available at each point in time!). K-Fold cross-validation would violate this principle.

How are recessions defined?

The National Bureau of Economic Research (NBER) Business Cycle Dating Committee names “Peak” and “Trough” dates for business cycles. I treated the months between an NBER-defined peak and trough as recessionary months for the United States. But there is a huge problem here:

NBER declares “Peak” and “Trough” dates several months (sometimes years) after the fact!

This means one cannot label outputs one period at a time! So to deploy a recession predictor of any sort, one must re-train the model at each formal NBER “trough” announcement (after each recession end). Therefore, one must also take this same approach during backtesting!

If one follows this approach, there are 5 possible tests to run (peak-trough pairs), each with NBER announcement dates:

[Table: NBER peak and trough dates, with announcement dates. Source: https://www.nber.org/cycles.html]

For each test, the training set starts at August 1955 and gets longer with each subsequent test. Similarly, each cross-validation set starts at January 1972 and gets longer with each subsequent test.
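
To make the expanding-window structure concrete, here is a sketch of the walk-forward loop this description implies. The announcement dates must be filled in from the NBER table, and the split mechanics are my reading of the description rather than the project's actual code:

```python
import pandas as pd

TRAIN_START = "1955-08-01"
CV_START = "1972-01-01"

# NBER trough announcement dates, one per completed recession in the sample
# (fill in from https://www.nber.org/cycles.html).
announcement_dates: list = []

def expanding_windows(X: pd.DataFrame):
    """Yield an expanding (training, cross-validation) window per test.

    Both windows grow with each subsequent test, and no data dated after
    that test's NBER announcement is visible when its models are trained."""
    for announce in announcement_dates:
        train_window = X.loc[TRAIN_START:announce]
        cv_window = X.loc[CV_START:announce]
        yield train_window, cv_window
```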

How do I label the class outputs?

At first, this seems easy: Recession = 1, No Recession = 0. That is true, but it is insufficient given the desired model scope. To determine the labeling process, one must first figure out which question is being answered.

Is it:

What is the probability that the U.S. will be in a recession X months from now?

If one wants to answer this question, one must forecast both the start and end of a recession. The model-predicted probabilities should start to drop before the recession actually ends. Needless to say, this is a pretty heroic task.

Another question one could ask is:

What is the probability that the U.S. will be in a recession within the next X months?

This question forecasts the start of a recession, but only “now-casts” a recession end. The model-predicted probabilities should start to drop only after the recession ends. This task is less idealistic, but it still allows market participants to prepare for market tops.

Thus all months that are in a recessionary period will receive a label of “1”. Additionally, the X months (where X = 6 for the 6-month-ahead predictor, X = 12 for the 12-month-ahead predictor, etc.) preceding each recession start will also receive a “1” label. All other months will receive a label of “0”.
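
A sketch of this labeling rule, assuming a monthly Series (here called `in_recession`, a name of my choosing) that is True for NBER-defined recession months:

```python
import pandas as pd

def make_labels(in_recession: pd.Series, horizon_months: int) -> pd.Series:
    """Label recession months, plus the `horizon_months` months preceding
    each recession start, with 1; all other months get 0."""
    labels = in_recession.astype(int)
    # A recession start is a month labeled 1 whose previous month is labeled 0.
    starts = (labels == 1) & (labels.shift(1, fill_value=0) == 0)
    for start_date in starts[starts].index:
        pos = labels.index.get_loc(start_date)
        labels.iloc[max(0, pos - horizon_months):pos] = 1
    return labels

# e.g. y_6m = make_labels(in_recession, 6); y_24m = make_labels(in_recession, 24)
```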

Test Results

I considered a handful of models and ran the aforementioned 5 tests on each model. As stated before, all cross-validation and prediction was carried out in real-time (no cheating!), only using data that was available at each historical point in time.

Models Considered

I considered the following 6 individual models: K Nearest Neighbors, regularized Linear Regression (Elastic Net), Naive Bayes, Support Vector Machines, Gaussian Process, and XGBoost. I tuned hyperparameters via grid-search on cross-validation error. As mentioned earlier:

For each test, the training set starts at August 1955 and gets longer with each subsequent test. Similarly, each cross-validation set starts at January 1972 and gets longer with each subsequent test.
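
As a sketch of what this tuning loop might look like for one of the models (the SVM), given one test's training and cross-validation windows; the parameter grid shown here is hypothetical, not the project's:

```python
from itertools import product

from sklearn.metrics import log_loss
from sklearn.svm import SVC

PARAM_GRID = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}  # hypothetical grid

def grid_search_svm(X_train, y_train, X_cv, y_cv):
    """Pick the SVM hyperparameters with the lowest log-loss on the
    (chronologically later) cross-validation window."""
    best_params, best_loss = None, float("inf")
    for C, gamma in product(PARAM_GRID["C"], PARAM_GRID["gamma"]):
        model = SVC(C=C, gamma=gamma, probability=True, class_weight="balanced")
        model.fit(X_train, y_train)
        # The project uses a class-weighted log-loss; see the next sections.
        loss = log_loss(y_cv, model.predict_proba(X_cv)[:, 1])
        if loss < best_loss:
            best_params, best_loss = {"C": C, "gamma": gamma}, loss
    return best_params, best_loss
```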

Additionally, I tested 2 ensemble models out of the 6 individual models above:

  • A Grand Average model that equally weights each individual model’s predictions
  • A Weighted Average model that weights each individual model’s predictions based on each individual model’s cross-validation error
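
For concreteness, a sketch of these two averaging ensembles, given a DataFrame of each individual model's predicted probabilities (columns = models, rows = months). The inverse-error weighting is my assumption for how cross-validation error could map to weights:

```python
import pandas as pd

def grand_average(preds: pd.DataFrame) -> pd.Series:
    """Equal-weight average of each individual model's predicted probabilities."""
    return preds.mean(axis=1)

def weighted_average(preds: pd.DataFrame, cv_errors: pd.Series) -> pd.Series:
    """Weight each model by the inverse of its cross-validation error,
    so that lower-error models get more influence."""
    weights = 1.0 / cv_errors
    weights = weights / weights.sum()
    return preds.mul(weights, axis=1).sum(axis=1)
```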

I also tested a Stacked ensemble by training a linear regression model on each individual model’s cross-validation predictions. This approach produced atrocious results. My intuitive explanation is that the data set is too small to train a powerful Stacked ensemble. There are simply not enough cross-validation predictions available!

Choosing an Error Metric

Which error metric should I use to quantify prediction error? Since this is a classification problem, metrics such as accuracy, precision/recall, and the F1 score come to mind (read about all of these here). However, none of these metrics are well-suited to this particular problem, because the predicted output class (“0” or “1”) is not as relevant as the predicted probability of being in each class!

Instead, I used weighted log-loss as my error metric. It considers the distance between the predicted probability of class = “1” and the actual class value (“1” or “0”). Log-loss penalizes discrepancies between predicted probabilities and actual class outputs increasingly heavily: a confidently wrong prediction is punished far more than a mildly wrong one.

Adjusting for Class Imbalance

During model training: for some of the models (Elastic Net, Support Vector Machines, and XGBoost), there are built-in parameters to set class weights. I set these parameters such that both classes have an equal effect on the loss function for model training.

During model cross-validation: Using scikit-learn’s implementation of log-loss, I can adjust for sample class imbalance (using the “sample_weight” argument) such that both classes have an equal effect on overall log-loss values for model cross-validation.
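
A sketch of both adjustments using scikit-learn, with toy stand-in data (the features, labels, and SVM hyperparameters here are placeholders, not the project's):

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
# Toy stand-ins for the real feature matrix and (imbalanced) labels.
X_train, y_train = rng.normal(size=(200, 6)), (rng.random(200) < 0.2).astype(int)
X_cv, y_cv = rng.normal(size=(60, 6)), (rng.random(60) < 0.2).astype(int)

# Training side: the built-in class_weight parameter balances the classes'
# effect on the loss function during fitting.
svm = SVC(kernel="rbf", probability=True, class_weight="balanced")
svm.fit(X_train, y_train)

# Cross-validation side: weight each sample so that both classes contribute
# equally to the overall log-loss value.
proba = svm.predict_proba(X_cv)[:, 1]
weights = compute_sample_weight(class_weight="balanced", y=y_cv)
print("weighted log-loss:", log_loss(y_cv, proba, sample_weight=weights))
```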

Prediction Smoothing

Raw probability predictions tend to be volatile, like this:

[Chart: predictions before smoothing]

Therefore, I applied exponential smoothing to the raw probability predictions. I chose the weighting factor (for the exponential smoother) such that the half-life of each prediction is 3 periods:

[Chart: predictions after exponential smoothing]
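
In pandas terms, this smoothing step could look like the following (a sketch with made-up probability values):

```python
import pandas as pd

# Hypothetical raw monthly probability predictions (stand-in values only).
raw_probs = pd.Series(
    [0.05, 0.40, 0.10, 0.55, 0.20, 0.65],
    index=pd.date_range("2019-01-01", periods=6, freq="MS"),
)

# Exponential smoothing with a half-life of 3 periods: each raw prediction's
# influence decays by half every 3 months.
smoothed_probs = raw_probs.ewm(halflife=3).mean()
print(smoothed_probs.round(3))
```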

Final Model

I ended up selecting the Support Vector Machine as the final deployment model. Log-loss values for the 8 models (6 individual models and 2 ensembles), across the 3 time frames, are shown below:

[Table: log-loss by model and time frame. Remember: the smaller the log-loss, the better]

Astute readers may notice that the Support Vector Machine (SVM) does not score the best across all 3 time frames.

Why did I not choose the “best performing” model?

One word: Storytelling.

To elaborate, observe the side-by-side comparison of the Elastic Net, Weighted Average, and Support Vector Machine predictions, for the 6-month time frame:

The Elastic Net and Weighted Average models score better than the SVM (on a weighted log-loss basis) because they predict higher probabilities of recession in general! Essentially, they have a higher “false positive” tendency than the SVM, while having a marginally lower “false negative” tendency than the SVM.

This asymmetry is the key to not selecting the “best performing” model. Models (Elastic Net, Weighted Average) that indicate higher near-term recession probabilities, even during long periods of economic growth, are easier to discredit as “alarmist” and will be harder to sell. On the other hand, a model (SVM) that indicates recession odds closer to 0% during good times, but can also spike to high levels (albeit not as high as the “alarmist” models) during periods of recession, will be harder to discredit.

So I chose the SVM model not because it was the “best performing” model from a statistics view, but because it tells a story that is more consistent with how humans perceive economic conditions!

The chart hosted here displays the SVM model predictions, updated monthly.

To receive periodic updates on the model predictions via email, click here.

Special thanks to Jason Brownlee for his article on the Applied Machine Learning Process.
