ML for Macroeconomics #1

Variable Selection with Random Forests in low-quality data

Philipp Schreiber
Towards Data Science


While Machine Learning (ML) has taken many applied sciences by storm, economics' emphasis on causality has made it difficult to find relevant applications. Over the past years, Susan Athey has successfully pioneered various applications in Microeconomics; students of Macroeconomics, however, have yet to find similar role models to emulate. That is why, in this article, I want to use the issue of variable selection to illustrate a smaller area where ML can easily supplement classical techniques.

Given the aggregate nature of macro analyses, such as cross-country studies, sample sizes are small, which makes high-dimensional data, i.e. many predictors relative to few observations, the most common scenario researchers face. But while the Least Absolute Shrinkage and Selection Operator (Lasso) has long been popular among economists, non-parametric alternatives like the Random Forest (RF) have not enjoyed similar attention. Therefore, together we're going to i) conduct a simulation study to compare the behaviour of the Lasso and RF under low-quality data and ii) apply them to Sala-I-Martin's famous "millions" data set of economic growth.

Disclaimer: This article is based on a course project which is available on GitHub. All the material, further explanations and additional references can be found there.

Refresher on Variable Selection

Variable selection (known as "feature selection" in the ML literature) is one of the most common approaches to dealing with high-dimensional data, i.e. data where the number of predictors is high relative to the number of observations. The idea is that, if only a subset of all covariates influences the dependent variable, correctly identifying this subset can alleviate some of the "curses of dimensionality", such as deteriorating prediction accuracy (Fan and Lv, 2009). The assumption that such a subset exists is known as the sparsity hypothesis.

The Lasso and variable-selection applications of random forests are two examples of the variety of techniques that have been developed over the years. Importantly, as always in statistics, there exists no single optimal procedure. As Hastie et al. (2017) explain:

Different procedures have different operating characteristics, i.e., give rise to different bias-variance tradeoffs as we vary their respective tuning parameters. In fact, depending on the problem setting, the bias-variance tradeoff provided by best-subset selection may be more or less useful than the tradeoff provided by the lasso. (p.582)

Now, what Hastie et al. are referring to is that best-subset selection naturally has very low bias, but, in noisy data, methods such as the Lasso can often deliver better results because they trade some bias for a substantial reduction in variance. Indeed, this type of tradeoff is the central motivation of this article, as we compare the Lasso to another technique built around efficient variance reduction: the RF. Before proceeding to the simulation study, let me also offer a very brief description of the Lasso and the RF.

The Lasso belongs to the class of shrinkage methods, with the special property that, for a sufficiently large penalty term, some coefficients are shrunk exactly to zero. Because the penalty also shrinks the coefficients of the retained variables, the relaxed Lasso has been developed, in which the selection of non-zero coefficients is separated from the overall amount of shrinkage.
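To make this concrete, here is a minimal sketch of how both variants could be fitted in R with the glmnet package. This is not the project's actual code, and the predictor matrix x and outcome vector y are placeholders:

```r
# Minimal sketch: Lasso and relaxed Lasso with glmnet (x: predictor matrix, y: outcome vector)
library(glmnet)

# Standard Lasso: the penalty lambda is chosen by 10-fold cross-validation
cv_lasso   <- cv.glmnet(x, y, alpha = 1)
coef_lasso <- coef(cv_lasso, s = "lambda.min")

# Relaxed Lasso: the selected variables are refitted with less shrinkage;
# cross-validation now runs over lambda (selection) and gamma (shrinkage mixing)
cv_relax   <- cv.glmnet(x, y, alpha = 1, relax = TRUE)
coef_relax <- coef(cv_relax, s = "lambda.min", gamma = "gamma.min")

# The selected variables are simply the covariates with non-zero coefficients
selected_lasso <- setdiff(rownames(coef_lasso)[as.vector(coef_lasso) != 0], "(Intercept)")
selected_relax <- setdiff(rownames(coef_relax)[as.vector(coef_relax) != 0], "(Intercept)")
```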

The defining feature of RFs is the aggregation of decorrelated decision trees, which is particularly effective at reducing variance. While the RF, unlike the Lasso, does not naturally perform variable selection, Genuer et al. (2010) developed a stepwise ascending variable introduction strategy for this purpose and implemented it in the excellent VSURF package, available on CRAN.
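As a rough sketch of what this looks like in practice (again with placeholder x and y, not the project's actual code), the whole procedure is wrapped in a single VSURF call:

```r
# Minimal sketch: variable selection with random forests via the VSURF package
library(VSURF)

set.seed(1)                      # the procedure involves randomness, so fix the seed
vs <- VSURF(x, y, ntree = 2000)  # runs the thresholding, interpretation and prediction steps

vs$varselect.thres   # indices of variables kept after the elimination (thresholding) step
vs$varselect.interp  # smaller set selected for interpretation
vs$varselect.pred    # most parsimonious set, built by stepwise variable introduction

summary(vs)          # overview of the three nested selection steps
plot(vs)             # variable importance and error plots for each step
```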

Simulation Study

Given the dependence of the bias-variance trade-off on the problem structure, it is key for our simulation study to emulate our real-world application, i.e. a cross-country study of economic growth. Besides a (very) small sample size (there are fewer than 200 countries in the world, and usable cross-country samples are often closer to 100), this critically relates to the quality of data we have for any given year: censuses are conducted infrequently, are not globally standardized and rely strongly on data imputation, issues which are exacerbated for less prosperous countries.

We make data quality the focal point of our comparison and operationalise it through the Signal-to-Noise Ratio (SNR). Intuitively, if there is a lot more noise (error) than signal (our explanatory variables), it becomes difficult to identify the effect of the signal. We follow Hastie et al. (2017) and consider ten values of SNR ranging from 0.05 to 6 on a log scale; for comparison, the corresponding values of the "Proportion of Variance Explained" (PVE) are reported as well. Importantly, the authors emphasise that studies are often too optimistic about signal clarity and argue that, in the real world, observational data tends to have log(SNR) < 1!
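For reference, the two quantities are directly linked. Under a linear model y = Xβ + ε with noise variance σ², and following the notation of Hastie et al. (2017):

$$ \text{SNR} = \frac{\operatorname{Var}(x^\top \beta)}{\sigma^2}, \qquad \text{PVE} = \frac{\text{SNR}}{1 + \text{SNR}} $$

So an SNR of 1 corresponds to a PVE of 0.5, while the lowest and highest values we consider, 0.05 and 6, correspond to PVEs of roughly 0.05 and 0.86.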

Log Scale of Signal-to-Noise Ratio
Image by author.

For each of the ten SNR levels, the simulation study generates #sim synthetic data sets, i.e. 10 x #sim in total, and applies the Lasso, the relaxed Lasso and the RF to each of them. The data sets are sparse: only 5 out of 50 covariates influence the outcome. A Toeplitz matrix is used for the variance-covariance matrix, so that a variable's position determines its correlation with the other covariates (correlation decays with the distance between variable indices). Note that we choose a linear DGP, which needs to be taken into account when comparing the results of parametric and non-parametric methods. Finally, the number of simulations is set to 100.
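A stylised version of a single simulation draw could look as follows in R. This is a sketch under the settings just described (50 covariates of which 5 are relevant, a Toeplitz correlation structure, noise scaled to hit a target SNR); the sample size of 100, the coefficient values and the geometric decay rho^|i-j| are illustrative assumptions, not necessarily the project's exact choices:

```r
# Minimal sketch of one synthetic draw: sparse linear DGP with Toeplitz correlation
library(MASS)  # for mvrnorm

n   <- 100  # assumed sample size, comparable to the number of countries
p   <- 50   # number of candidate covariates
s   <- 5    # number of true predictors
rho <- 0.5  # base correlation with direct neighbours
snr <- 1    # target signal-to-noise ratio

Sigma <- toeplitz(rho^(0:(p - 1)))            # correlation decays with index distance
X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

beta   <- c(rep(1, s), rep(0, p - s))         # sparse coefficient vector: only 5 non-zeros
signal <- as.vector(X %*% beta)

sigma <- sqrt(var(signal) / snr)              # scale noise so that Var(X beta) / sigma^2 = SNR
y     <- signal + rnorm(n, sd = sigma)
```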

Now that we are familiar with the conceptual setup of the simulation studies, we can have a look at the results. If you’re interested in the code or more detailed explanations, including on the control of SNR, I cordially invite you to check out the original project on GitHub.

Simulation 1 — The Baseline

Results Dashboard of Simulation 1
Image by author
  • RF consistently closer to true sparsity, with very low variance at all levels of SNR
  • Good retention frequency of the Lasso methods is a by-product of overly dense estimates and hides the number of null models

In the baseline simulation, the general degree of correlation is moderate (0.5 with direct neighbours) without special correlation between the five true predictors. The graph on the top left plots the retention frequency, i.e. how many true predictors were identified. Not surprisingly, all three models become better at identifying the relevant variables as the signal becomes clearer. However, the relative underperformance of the RF between SNR 0.25 and 1.22 is misleading: the graph at the top right shows the number of nonzero coefficients estimated by each model, in other words, its estimated density. Here we can see that the high retention frequency of the Lasso methods is a product of their strong over-estimation of density.

Indeed, while the relaxed Lasso, thanks to its ability to separate shrinkage and control of sparsity, eventually moves closer to the true sparsity, for the realistic levels of SNR it still selects twice as many covariates, on average. Looking at the graph at the bottom right, which zooms in on this “real-world window”, we see that the RF density is firmly centred around the true number, while at the lower end of SNR even the relaxed Lasso selects many null models (i.e. where the model identifies no covariates as significant).

Finally, as is customary, the MSE is also reported (bottom left). We can see that for low levels of SNR the MSE is astronomical and that it decreases rapidly with higher data quality. We might wonder why a non-parametric method apparently outperforms linear methods even though the latter are correctly specified. The explanation is that for the RF we only report the VSURF Out-of-Bag MSE, which is biased downwards because ranking and selection are based on the same observations (Genuer et al., 2015). The main takeaway from this graph is therefore that RF and Lasso show a similar slope in their MSE reduction.

Simulation 2 — High Covariance

Results Dashboard of Simulation 2
Image by author
  • RF very robust to high collinearity, unlike Lasso

Since high collinearity is a common issue in macroeconomic analyses, a better understanding of the methods' relative performance will be important for interpreting the results of the application. That is why, in the second simulation study, we make all true predictors neighbours, thereby giving them a relatively high correlation with one another. Importantly, given the nature of our Toeplitz matrix, this also means that there now exists a set of insignificant variables which is strongly correlated with all true predictors. This set-up targets the well-known weakness of the Lasso in choosing the correct variables when correlation is high (see, e.g., James et al., 2013). In contrast, for the RF, Genuer et al. (2010) find that the variable importance ordering is very robust to the presence of collinearity.

Indeed, the performance of the RF is nearly unchanged from the first study, and the violin plot actually indicates a reduction in variance at the lower end of the realistic SNR window. The standard Lasso, on the other hand, generates more null models and shows a slower increase in retention frequency. Similarly, the relaxed Lasso also generates many null models but, in some cases, inflates models to as many as 20 variables. As a result, even though its retention frequency and average density eventually come close to those of the RF, the methods' performance only converges at high levels of SNR.

The original project contains several additional studies illustrating further aspects of the techniques’ bias-variance trade-offs.

Application

Armed with these insights, we now bring the three techniques to bear on the famous "millions" data set from Sala-I-Martin (1997) [available here]. It consists of 63 macro variables for a cross-section of 134 countries in the 1970s, matching the observations-to-variables ratio of our simulation studies. After standardizing the predictors and replacing missing values with the sample mean, we are left with 119 countries and 62 predictor variables.
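As a rough sketch of this preprocessing step (the data frame millions and the outcome column growth are hypothetical names, not the ones used in the project, and all predictors are assumed numeric), mean imputation and standardization in R can be done as follows:

```r
# Minimal sketch: mean-impute missing predictor values, then standardize
# 'millions' is a hypothetical data frame of numeric columns; 'growth' is the outcome
X <- millions[, setdiff(names(millions), "growth")]

X_imputed <- as.data.frame(lapply(X, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)  # replace NAs with the column's sample mean
  col
}))

X_scaled <- scale(X_imputed)  # centre to zero mean and scale to unit variance
y <- millions$growth
```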

With the iconic title "I Just Ran Two Million Regressions", Sala-I-Martin (1997) set out to develop a faster and more flexible method for identifying relevant macroeconomic growth variables than the "extreme bounds test", which was very popular at the time. He himself proposed selecting variables based on the cumulative distribution function (CDF) of their estimated coefficients, a method we also include in the results table below.

Millions Data Results Table
Millions Data Results Table (Image by author).

Across the four methods, 20 different variables are selected as predictors of 1970s GDP growth rates. While, at first glance, we might find this substantial disagreement surprising, closer inspection of the estimation results points to the presence of high collinearity (discussed in detail in the original project). Recalling the results of the second simulation study, the rate of recovery will be relatively low for all four methods, but we should be comparatively confident in the density estimated by the RF. As such, while the parsimony of the CDF and the relaxed Lasso might seem appealing, this suggests that their estimates are too sparse.

For the sake of comparison, the results from the Lasso methods are ranked by the absolute value of the estimated coefficient. We can see that four variables are not only shared, either by all methods or by the Lasso methods and the RF (marked in green), but are also commonly assigned a high degree of importance: although the random forest assigns only a minor role to religion, it does count "Equipment Investment" (i.e. investment into mechanization) among the very important drivers of economic growth. The same is true for the measure of the openness of the economy, which is also given very high importance by all four methods. In contrast, Sala-I-Martin's CDF method fails to select primary school enrollment, which is assigned primacy by the Lasso methods and ranks third for the RF. However, the apparent instability of the results cautions against placing too much faith in these ranks: for the Lasso, the variables are so difficult to select that randomness in the cross-validation process caused "Life Expectancy" to be included by the relaxed Lasso but not by the standard Lasso.

Taking a step back, we see that there exists a small set of (relatively) strong predictors, but many weak ones. From a macroeconomic perspective, what stands out when comparing the four subset selections is the lack of political variables in favour of social and economic ones: while the CDF, the standard Lasso and the random forest each select one political variable, there is no overlap between their choices. Interestingly, even "Revolution and Coups", which relates to violent conflict, is selected only by the standard Lasso, and even there it is assigned the lowest rank, running contrary to the very successful literature on institutional economics.

Conclusion

Thank you for staying to the end. I hope this article gave you a hands-on example of a simple application of ML to Macroeconomics. While, of course, our simple example with the “Millions” data set does not warrant any causal interpretations — an obvious next step would be time-series data — we were able to see that variable selection by random forest can be more robust to some of the problems of empirical Macroeconomics. If you can think of more macroeconomic applications or have even worked on some, I’d love to hear them, so please feel free to share them below.

Stay safe & stay in touch!

References

  • Fan, J. and Lv, J. (2009). A Selective Overview of Variable Selection in High Dimensional Feature Space (Invited Review Article). Statistica Sinica, 20 (1), pp.101–148.
  • Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31 (14), pp.2225–2236. [Online]. Available at: https://doi.org/10.1016/j.patrec.2010.03.014.
  • Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2015). VSURF: An R Package for Variable Selection Using Random Forests. The R Journal, 7 (2), pp.19–33. [Online]. Available at: https://doi.org/10.32614/RJ-2015-018.
  • Hastie, T., Tibshirani, R. and Tibshirani, R. (2017). Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons. Statistical Science, 35 (4), pp.579–592. [Online]. Available at: https://doi.org/10.1214/19-STS733.
  • James, G. et al. (2013). An introduction to statistical learning. Springer.
  • Sala-I-Martin, X. X. (1997). I Just Ran Two Million Regressions. The American Economic Review, 87 (2), pp.178–183.
