What if we didn’t have to compromise between interpretability and performance?

Benchmarking Machine Learning algorithms in small data

Aurelie Boisbunon
Towards Data Science


In the last couple of years, I have been reading quite a bit about the tradeoff between interpretability and performance (which, by the way, reminds me a lot of the bias-variance tradeoff). I have seen many graphs showing where standard machine learning algorithms stand with respect to both measures, like the one shown below:

  • on one side of the spectrum, linear models are inherently simple and interpretable, but they can only yield good performance when the data fits the hard constraint of linearity
  • on the other side of the spectrum, we find ensemble algorithms such as random forests and, even further along, deep neural networks, which achieve good performance on most data but at the cost of being difficult to interpret without additional strategies (e.g. LIME or SHAP).
Main machine learning algorithms in the space of performance vs interpretability

One type of algorithm is always missing from those graphs, as well as from reference books and material on Machine Learning, namely symbolic regression. Rafael Ruggiero recently wrote an article about it called The Forgotten Machine Learning Algorithm, which says it all!

Symbolic Regression

So, what is this approach, mostly known within the Evolutionary Algorithm community? It consists of building models as mathematical expressions, which can also be represented as expression trees. The following example shows the formula of such a model: it involves a selection of the input features X, their transformation into new features through unary operators (sine, square root and logarithm), and their interactions through binary operators (addition and multiplication).

Example of a symbolic regression model
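To make the tree representation more concrete, here is a minimal Python sketch of how such a model can be encoded and evaluated. The formula and feature names are illustrative only, not the exact model of the figure.

```python
# Minimal sketch of a symbolic-regression model as an expression tree.
# The formula below, log(x1) + x9 * sqrt(x3), is illustrative only.
import math

# Each node is either a feature name, a constant, or (operator, children...)
MODEL = ("add",
         ("log", "x1"),
         ("mul", "x9", ("sqrt", "x3")))

UNARY = {"log": math.log, "sqrt": math.sqrt, "sin": math.sin}
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(node, sample):
    """Recursively evaluate an expression tree on one sample (a dict of features)."""
    if isinstance(node, str):           # leaf: feature name
        return sample[node]
    if isinstance(node, (int, float)):  # leaf: constant
        return node
    op, *children = node
    args = [evaluate(child, sample) for child in children]
    return UNARY[op](*args) if op in UNARY else BINARY[op](*args)

print(evaluate(MODEL, {"x1": 2.0, "x3": 4.0, "x9": 0.5}))  # log(2) + 0.5*sqrt(4) ≈ 1.693
```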

Searching for relevant new features and for how to combine them is a very large search space and a difficult optimization problem, so traditional machine learning is not really equipped to deal with it, and most symbolic regression algorithms rely on genetic programming. Among the possible reasons why symbolic regression is not well known to the ML community, we can cite the following: slower computational time, lack of benchmarks relevant to ML, performance that was not as good, a high number of hyperparameters…
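As a concrete illustration of the genetic-programming flavour of symbolic regression, here is a minimal sketch using the open-source gplearn package, whose SymbolicRegressor follows the scikit-learn API. This is only an illustrative example, not one of the algorithms benchmarked below, and the hyperparameter values are arbitrary.

```python
# Sketch of symbolic regression via genetic programming with gplearn
# (illustrative only; not the tooling used in the benchmark discussed below).
from gplearn.genetic import SymbolicRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = SymbolicRegressor(
    population_size=2000,          # candidate expression trees per generation
    generations=20,                # evolutionary iterations
    function_set=("add", "mul", "sqrt", "log", "sin"),  # allowed operators
    parsimony_coefficient=0.01,    # penalizes overly large trees
    random_state=0,
)
est.fit(X_train, y_train)

print(est._program)               # evolved expression, e.g. add(mul(X3, X8), sin(X0))
print(est.score(X_test, y_test))  # R^2 on held-out data
```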

However, in the last few years, things have started to change:

  • there have been several regression benchmarks comparing different symbolic regression algorithms with more traditional ones (linear models, decision trees, random forests, multi-layer perceptron), one of which covers a large number of ML datasets [1]
  • a couple of recently published symbolic regression algorithms show that they can achieve performance that is statistically equivalent to random forests and close to gradient boosting. These algorithms are Zoetrope Genetic Programming (ZGP) [2] and the Feature Engineering Automation Tool (FEAT) [3].

Performance of symbolic regression algorithms

Before returning to interpretability, let’s have a look at how symbolic regression algorithms perform with respect to more traditional machine learning ones.

The following graphs have been generated from the results of a benchmark performed on almost 100 regression datasets with up to 3,000 observations, taken from the Penn Machine Learning Benchmarks database. They show the distribution of the normalized root-mean-square error (NRMSE) and of the R2 score over 20 runs of each algorithm on all datasets. The algorithms are ordered by their average score: from low to high for NRMSE, which we want as close to 0 as possible, and from high to low for R2, which equals 1 for a perfect fit.
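For reference, here is a minimal sketch of how these two metrics can be computed with scikit-learn. The normalization used for NRMSE below (dividing by the standard deviation of the target) is an assumption for illustration; the benchmark may use a different convention, such as dividing by the range of the target.

```python
# Sketch of the two metrics shown in the figures. The NRMSE normalization
# (standard deviation of y) is an assumption; other conventions exist.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def nrmse(y_true, y_pred):
    """Root-mean-square error normalized by the standard deviation of the target."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / np.std(y_true)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(nrmse(y_true, y_pred))     # 0 for a perfect fit, larger is worse
print(r2_score(y_true, y_pred))  # 1 for a perfect fit
```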

From these graphs, especially the one on R2 scores, we see that four algorithms clearly stand out from the others, namely: Gradient Boosting, Zoetrope Genetic Programming (ZGP), Random Forests, and the Feature Engineering Automation Tool (FEAT). So the take-home message here is that good performance can also be achieved with symbolic regression.

Performance of symbolic regression and machine learning algorithms in normalized root-mean-square error (NRMSE, top) and R2 score (bottom), ordered by average over all datasets. The best algorithms are on the left.

Note that the benchmark was run with default parameters for most methods, and with cross-validation for the linear methods, in order to be as fair as possible.
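For the linear baselines, one plausible cross-validated setup is sketched below using scikit-learn's built-in RidgeCV and LassoCV; the exact configuration used in the benchmark may differ (see the repository linked further down).

```python
# Sketch of cross-validated linear baselines; one plausible setup,
# not necessarily the exact configuration of the benchmark.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))

for name, model in [("ridge", ridge), ("lasso", lasso)]:
    model.fit(X, y)  # regularization strength chosen by cross-validation
    print(name, model.score(X, y))
```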

Interpretability vs performance

We now go back to our initial question: how do these algorithms behave in the space of interpretability vs performance? And more precisely, do we always have to compromise between the two? The answer is no, thanks to symbolic regression algorithms, as we will see next.

The following graph has been generated from the same benchmark as before. It displays performance (median R2) versus interpretability for several symbolic regression algorithms and standard machine learning ones. Interpretability here is measured as the number of nodes in the trees (for symbolic regression or decision trees) or the number of coefficients (for linear models or the multi-layer perceptron).
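The sketch below shows one way such a complexity count could be computed for scikit-learn models, mirroring the description above (node count for trees, coefficient count for linear models and MLPs); the benchmark code may count complexity differently.

```python
# Sketch of a complexity proxy: tree nodes for tree-based models,
# number of coefficients/weights for linear models and MLPs.
# Illustrative only; the benchmark code may use a different count.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, random_state=0)

def complexity(model):
    if hasattr(model, "tree_"):   # decision tree: count nodes
        return model.tree_.node_count
    if hasattr(model, "coefs_"):  # MLP: count all weights
        return sum(w.size for w in model.coefs_)
    if hasattr(model, "coef_"):   # linear model: count coefficients
        return np.asarray(model.coef_).size
    raise ValueError("no complexity rule for this model")

for model in [LinearRegression(), DecisionTreeRegressor(random_state=0),
              MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)]:
    model.fit(X, y)
    print(type(model).__name__, complexity(model))
```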

It shows that the three top symbolic regression algorithms can achieve performance that is as good as random forests, with interpretability close to that of linear models, yielding models that can be understood by humans.

Note the logarithmic scale for interpretability: it means that symbolic regression algorithms can produce models that are up to 1,000 times “simpler” (in terms of number of nodes) with no loss in performance. They also display interactions between variables very clearly, which is very useful in some applications. In the expression tree example given above, an interaction has been found between the variable X9 and the square root of X3.

Actual performance vs interpretability for symbolic regression and machine learning algorithms

Full details of the benchmark are given in [2], and the benchmark code for these results is available in the following GitLab repository: https://gitlab.devenv.mydatamodels.com/publications/bench-zgp-symbolic-regression.

Conclusion

So in the end, it is possible to obtain models that combine good performance with good interpretability, using symbolic regression. Depending on the problem (and its size!), it can sometimes be worth sacrificing a bit of performance (compared, e.g., to deep neural networks) in order to get models that can be interpreted directly, without resorting to additional methods to understand how a model arrives at its predictions.

References

[1] Orzechowski, P., La Cava, W., & Moore, J. H. (2018). Where are we now? A large benchmark study of recent symbolic regression methods. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 1183–1190).

[2] Boisbunon, A., Fanara, C., Grenet, I., Daeden, J., Vighi, A., & Schoenauer, M. (2021). Zoetrope Genetic Programming for Regression. In Genetic and Evolutionary Computation Conference (GECCO’21), July 10–14, 2021. https://doi.org/10.1145/3449639.3459349

[3] La Cava, W., Singh, T. R., Taggart, J., Suri, S., & Moore, J. H. (2019). Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations (ICLR). https://openreview.net/pdf?id=Hke-JhA9Y7
