
Towards analytical descriptions of data rather than mere black boxes
Symbolic regression (i.e. finding equations that relate variables) is less of a black box than regular neural networks: it offers models that not only predict data but also rationalize how those predictions arise. Yet it is far less popular than other machine learning and mathematical modeling techniques. However, as I present here, more and more applications of symbolic regression are becoming real, especially as new methods are developed that integrate physics-inspired constraints, data projection and preprocessing, and new ways to converge to simple, meaningful equations.
Index in a nutshell
- Introduction
- Two novel methods for physics-constrained symbolic regression
- Modern practical applications of symbolic regression in chemical and biological sciences
Introduction
Symbolic regression consists of identifying a mathematical expression that fits a dataset of input and output values. There are many ways to approach the problem and arrive at such analytic expressions connecting inputs to outputs: some are more related to machine learning techniques, others exploit genetic algorithms, some are based on pre-set rules while others explore the space of possible equations more broadly, etc. Finding analytical equations that connect variables is far from trivial, especially when one seeks simple equations that look as "elegant" as those usually derived by theoretical means. For example, the distance traveled by an object experiencing constant acceleration and starting with no speed is simply d = a t² / 2. Or, to cite another example, radioactive decay leads to an exponential decrease of the radiation intensity of a material: I = I(0) exp(-kt). In these examples, given a set of distances traveled at increasing times, or a set of radiation intensities vs. time, symbolic regression would be expected to retrieve the respective equations. But of course, other functions may fit the data just as well without necessarily describing the simple physics behind it.
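To make the idea concrete, here is a toy, brute-force sketch (my own illustration, not any of the published methods discussed below): enumerate a handful of candidate expression forms, fit each one's free coefficient by least squares, and keep the form with the lowest error on the falling-object data:

```python
# Toy brute-force symbolic regression: given (t, d) pairs generated by
# d = a*t**2/2 with a = 9.8, search a tiny space of candidate forms, each
# with one free coefficient c fitted by least squares. A sketch of the
# concept only, not a real symbolic regressor.
import math

a = 9.8
data = [(t, a * t**2 / 2) for t in [0.5, 1.0, 1.5, 2.0, 2.5]]

candidates = [
    ("c*t",      lambda t, c: c * t),
    ("c*t**2",   lambda t, c: c * t**2),
    ("c*t**3",   lambda t, c: c * t**3),
    ("c*exp(t)", lambda t, c: c * math.exp(t)),
]

def fit(f):
    """Least-squares value of c for model d ~ c*f(t, 1), plus the RMSE."""
    c = sum(f(t, 1.0) * d for t, d in data) / sum(f(t, 1.0) ** 2 for t, _ in data)
    mse = sum((f(t, c) - d) ** 2 for t, d in data) / len(data)
    return math.sqrt(mse), c

best_name, best_f = min(candidates, key=lambda cand: fit(cand[1])[0])
rmse, coeff = fit(best_f)
print(best_name, round(coeff, 2))  # c*t**2 4.9, i.e. d = (a/2)*t**2
```

Real symbolic regressors search vastly larger expression spaces with genetic programming or neural guidance, but the selection principle is the same: candidate expressions competing on how well they fit the data.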
Modeling data with symbolic regression has some advantages over modeling data with regular regressions, neural networks, or other mathematical tools. To me the two most important advantages are:
- The possibility of not only modeling the data and predicting new outputs from new inputs, as with other modeling methods, but also actually grasping why the variables are related in a given way. The answer to that why can ultimately help to propose the physical, biological, chemical, financial, etc. (depending on the subject) mechanisms behind the association. For example, exponential decays are often associated with single-step, first-order events whose rate is proportional to a density (an intensity of radiation, a number of cells in a culture, etc.). Over time, as the starting density decreases so does its rate of change, which in integrated form leads to the exp-based equation. Likewise, from elementary integration of acceleration as the second derivative of position over time, one can obtain the equation for distance traveled at constant acceleration.
- If the obtained equation is correct, the model probably has far more extrapolation power than most other mathematical models, especially those based on neural networks, which are often excellent at interpolating but can barely extrapolate even slightly outside their training domains.
In other words, the result of a good symbolic regression is more interpretable, less of a black box, and likely more powerful.
For these advantages to actually materialize, it is important to maximize the chances of getting physically reasonable equations. For this, many methods incorporate physics-constrained navigation of the "equation search space". On the other hand, one needs to sample the equation search space extensively, especially for high-dimensional problems, on which traditional symbolic regression fails more easily. To maximize sampling, simplicity, and physical accuracy, scientists are now building new symbolic regressors by combining machine learning and evolutionary algorithms with physics-based constraints on the equations.
I won’t go into further details about symbolic regression, because there are excellent articles here at TDS (for example this one by Rafael Ruggiero) and also at Wikipedia (here), among other resources. Rather, I focus here on actual, modern examples of scientific applications of symbolic regression, which, as the title of Rafael’s article says, is somewhat of a "forgotten" method, probably overshadowed by regular machine learning methods:
I first recap two important, recent articles that already present applications but are focused mostly on the theory behind the most modern methods for equation discovery from data. Then in the following section I move to some very interesting applications, one actually from my own research.
First, two novel methods for physics-constrained symbolic regression with broad applications
First article: Udrescu and Tegmark 2020
The first article I want to highlight is this one published in 2020 in Science Advances (article here, open access). Its title is "AI Feynman: A physics-inspired method for symbolic regression", and as you can see it references the famous physicist Richard Feynman; indeed, the method and software developed in this work could rediscover all 100 physics equations taken from a book by that author.
The core idea of the work is relatively simple: to build their new symbolic regression algorithm, the authors combine neural network fitting with a set of physics-inspired constraints and equation features. One key component of this work was the realization that the functions appearing in physics (and essentially most scientific applications) often have certain properties that constrain the equation search space. This is described in detail in the paper, but let’s overview them:
- Variables and coefficients have physical units that must be consistent (for example, logarithms are usually applied to ratios or similar arguments in which units cancel out).
- Polynomials are usually of low degree (though not necessarily limited to, say, square powers: think of square roots, or the 1/λ⁴ dependency of Rayleigh light scattering on wavelength, etc.).
- Most equations consist of a single term or a few terms, each of which usually contains no more than two arguments.
- The equations are continuous and smooth, at least within their reasonable domains, and they are most often symmetric in some or all of their variables.
- Finally, variables are usually arranged in small groups, or simply separated into different terms.
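The first of these constraints, dimensional consistency, is easy to sketch in code: if units are tracked as exponent dictionaries, a candidate equation can be rejected before any fitting whenever its terms disagree. This is only a minimal illustration of mine, not part of the AI Feynman code:

```python
# Minimal sketch of a dimensional-consistency check, the first of the
# constraints listed above. Units are exponent dicts, e.g. m/s**2 is
# {"m": 1, "s": -2}; multiplying quantities adds exponents, and terms may
# only be added (or equated) if their units match exactly.
def mul(u, v):
    w = {k: u.get(k, 0) + v.get(k, 0) for k in set(u) | set(v)}
    return {k: e for k, e in w.items() if e != 0}

def addable(*units):
    return all(u == units[0] for u in units)

acceleration = {"m": 1, "s": -2}
time = {"s": 1}
distance = {"m": 1}

# d = a*t**2/2: the units of a*t*t must equal the units of d
lhs = mul(mul(acceleration, time), time)
print(lhs == distance)          # True: dimensionally valid candidate
print(addable(distance, time))  # False: meters cannot be added to seconds
```

A symbolic regressor can apply such a filter to prune huge portions of the search space before spending any effort on fitting.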
The algorithm presented in this paper takes into account all these points within different blocks, and works recursively by splitting the problem into smaller ones and running on these pieces separately. Some key highlights of the method:
First, it is notable that the method does involve stages of brute-force equation search; however, by running on split parts of the bigger problem, it can sample the search space much more efficiently.
Second, the work also uses a special way to reduce the chances of overfitting, by defining the winning function ("winning" referring to a competition of various alternative answers) as in previous works.
Third, as it runs, the program utilizes regular neural networks to create interpolated data that helps the fitting procedures required to assess equations and also those that test for symmetry, smoothness, and term separability.
Fourth, the method applies transformations right on the inputs (for example, feeding in x² instead of just x), based on functional forms commonly found in physical equations, to accelerate equation discovery.
The program described in this paper is available on GitHub. In the tests the authors report, the program takes from tens of seconds to tens of minutes for the symbolic regressions to converge on small numbers of data points. The program was able to recover all 100 equations in Feynman’s book, using only elementary functions (+, −, ∗, /, sqrt, exp, log, sin, cos, arcsin, and tanh) and small rational numbers, as well as e and π. By adding simulated noise to the datasets, the authors also tested the robustness of the method and program.
Here’s the original paper in Science Advances:
You can get more practical with this program by reading this TDS article by Daniel Shapiro, PhD:
Second article: Reinbold et al 2021
The second article I want to highlight was published in Nature Communications in 2021 (open-access article here). Like AI Feynman, the method described in this article exploits various features of real physical equations as constraints for the equation search problem, while seeking parsimonious models that balance accuracy and simplicity. Moreover, since the authors tackled problems where derivatives are very important, they implemented weak formulations of the differential equations to reduce noise sensitivity and eliminate dependence on inaccessible variables. The main reason for employing weak formulations is to turn a differential equation into an integral equation, bypassing the burden of evaluating derivatives. For more about weak forms, see this great blog.
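The payoff of the weak formulation is easy to demonstrate on the radioactive-decay example from the introduction: multiplying dI/dt = -kI by a test function φ that vanishes at both ends and integrating by parts gives ∫ I φ′ dt = k ∫ I φ dt, so k can be estimated from integrals of the raw data, with no derivative ever computed. A minimal numerical sketch of mine, not the authors' implementation:

```python
# Weak-form recovery of the decay rate k from samples of I(t) = I0*exp(-k*t),
# without differentiating the data: with phi(0) = phi(T) = 0, integration by
# parts of dI/dt = -k*I gives  integral(I*phi') = k * integral(I*phi).
import math

k_true, I0, T, n = 0.7, 5.0, 10.0, 2000
ts = [T * i / n for i in range(n + 1)]
I = [I0 * math.exp(-k_true * t) for t in ts]

phi = [math.sin(math.pi * t / T) for t in ts]                  # vanishes at 0 and T
dphi = [(math.pi / T) * math.cos(math.pi * t / T) for t in ts]

def trapz(ys):
    """Trapezoid rule on the uniform grid ts."""
    return (T / n) * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

k_est = trapz([a * b for a, b in zip(I, dphi)]) / trapz([a * b for a, b in zip(I, phi)])
print(round(k_est, 3))  # 0.7
```

Because integration averages out fluctuations while finite differences amplify them, this route is far more robust when the samples of I are noisy, which is exactly the regime the paper targets.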
In their paper, the authors show the application of this method to experimental measurements of the velocity field in a turbulent fluid. From this input data alone, the method allows the reconstruction of inaccessible variables such as the pressure that drives the flow. Although it is presented only for this test case, the method presumably works for other kinds of problems too.
Here’s the original paper in Nature Communications:
Robust learning from noisy, incomplete, high-dimensional experimental data via physically…
Now here they are: modern practical applications of symbolic regression in chemical and biological sciences
Searching for "symbolic regression" in the titles and abstracts of papers in PubMed (the world’s largest online library on the natural sciences) returns 76 results as of September 30, 2021. This doesn’t include articles that mention the term only in their main text, and it is limited to papers in the domains of biology and chemistry, leaving out those in the computer sciences; it is thus a good proxy for papers in which symbolic regression represented an important component of the reported work.
There’s one article published in 1997, which discusses symbolic regression as a method to discover a "function of one variable". Then there’s a void all the way until 2011, when a paper used symbolic regression to find an equation describing glomerular filtration rates, a metric of renal function that is useful for renal transplants. The obtained equation was shown to be superior to other equations existing at that time.
From 2011 on, the number of publications with "symbolic regression" in their title or abstract began to increase steadily, reaching 16 for 2021 as of September 30:

Now let’s look at some of the cases I found most interesting, that illustrate actual applications of modern symbolic regression methods.
This 2020 paper in Nature Communications used symbolic regression on data about the oxygen evolution activities of various perovskites, to understand which variables best predict the activities and through which equations. With this approach the authors identified a simple descriptor, the ratio of two factors often used in that field to characterize perovskite compositions, together with equations to model the activities from this descriptor. This symbolic model led to the discovery of a series of new oxide perovskite catalysts with improved activities, which the authors synthesized and characterized to confirm their high activities. The symbolic regressions were carried out with gplearn, a Python library that extends scikit-learn with this functionality. The paper is here:
Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite…
And this is the gplearn library, used in that work for symbolic regression (and other very interesting methods) in Python:
Welcome to gplearn’s documentation! – gplearn 0.4.1 documentation
The next paper, Phys Rev E 2021, sits at the interface between method development for symbolic regression and actual application to discovering physical laws from distorted video. The article presents a method for unsupervised learning of equations of motion for unlabeled objects in raw video. Imagine a relatively static scene in which an object moves, and you want to obtain the equations of motion of that object without even labeling or purposely tracking it. I chose this article for presentation here because it exemplifies a nice integration of image analysis, pre-processing, projection to low dimensions, and symbolic regression itself.
In this method, an autoencoder first maps each frame of the video into a low-dimensional latent space that simplifies the motions. This acts as a pre-regression whose output is then fed into Pareto-optimal symbolic regression to find the differential equations that describe the motion of the object. The pre-regression step can model the coordinates of unlabeled moving objects even when the video is distorted, as would happen in real-life footage. Extra latent-space dimensions help to avoid topological problems and can be removed later through principal component analysis. Last, by minimizing the overall motions, the method can also automatically discover an inertial reference frame, thus reducing distortions of the final motions (such as those due to a moving camera or background) and hence facilitating the derivation of simple equations.
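The Pareto-optimality criterion itself is simple to sketch: among candidate equations scored by complexity and error, keep only those not dominated by a model that is both simpler and more accurate. The (complexity, error) values below are invented purely for illustration:

```python
# Pareto front over (complexity, error): a model is kept only if no other
# model is simpler-or-equal AND more-accurate-or-equal, with a strict
# improvement in at least one of the two. All numbers here are made up.
models = [
    ("x",              1, 0.90),
    ("a*x + b",        3, 0.40),
    ("a*x**2 + b",     4, 0.41),  # dominated by "a*x + b": more complex and worse
    ("a*sin(b*x) + c", 5, 0.05),
    ("black box",     20, 0.04),
]

def pareto(models):
    front = []
    for name, c, e in models:
        dominated = any(c2 <= c and e2 <= e and (c2 < c or e2 < e)
                        for _, c2, e2 in models)
        if not dominated:
            front.append(name)
    return front

print(pareto(models))  # everything except the dominated quadratic survives
```

The final pick along the front then trades off simplicity against accuracy, which is what keeps the discovered equations parsimonious.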
This paper, presenting a quite novel way of analyzing videos, is here:
Symbolic regression: Discovering physical laws from distorted video – PubMed
Or here is a preprint at arXiv:
Symbolic Pregression: Discovering Physical Laws from Distorted Video
The next paper, Bioessays 2019, addresses the modeling of ecological dynamics, i.e. how populations of various species sharing a habitat evolve over time in relation to each other and to abiotic factors. The work combines symbolic regression with a set of plausible ecological functional responses to reverse-engineer ecosystem dynamics from time-dependent data of organismal abundances. Given the input data, the procedure returns sets of candidate differential equations that describe it, which are then analyzed for their meaning in terms of ecological concepts. As discussed by the authors, using symbolic regression here has two main advantages. First, the resulting differential equations can potentially be interpreted to understand the underlying ecological mechanisms of the ecosystem, such as the types of ecological interactions among species, for example between pairs of predator and prey species. Second, the authors stress that the methodology seems to perform well even with limited or poorly informative data, likely because they themselves provide candidate starting pieces of equations, thus restricting the search to meaningful equations that can be fitted even with low-quality or sparse data.
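The idea of restricting the search to plausible functional responses can be illustrated with a toy example (synthetic, noise-free data of mine, not the authors' code): regress the observed growth rate of a single population onto a small library of candidate terms, here x and x², which recovers a logistic model dx/dt = r·x·(1 − x/K):

```python
# Library-of-terms toy: recover logistic growth dx/dt = r*x - (r/K)*x**2 by
# least-squares regression of "measured" growth rates onto candidate terms
# x and x**2 (normal equations solved by Cramer's rule). Synthetic,
# noise-free data, for illustration only.
r, K = 0.5, 100.0
xs = [1, 5, 10, 20, 40, 60, 80, 95]
ys = [r * x * (1 - x / K) for x in xs]   # observed dx/dt at each abundance x

S11 = sum(x**2 for x in xs)   # <x, x>
S12 = sum(x**3 for x in xs)   # <x, x**2>
S22 = sum(x**4 for x in xs)   # <x**2, x**2>
b1 = sum(x * y for x, y in zip(xs, ys))
b2 = sum(x**2 * y for x, y in zip(xs, ys))

det = S11 * S22 - S12 * S12
coef_x = (b1 * S22 - b2 * S12) / det    # recovers r
coef_x2 = (S11 * b2 - S12 * b1) / det   # recovers -r/K
print(round(coef_x, 4), round(coef_x2, 6))  # 0.5 -0.005
```

Because the candidate terms already carry ecological meaning (intrinsic growth, self-limitation), the fitted coefficients are directly interpretable, which is precisely the advantage the authors emphasize.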
This paper, with this interesting approach, is here:
Revealing Complex Ecological Dynamics via Symbolic Regression – PubMed
Another paper, by me in Molecular Biotechnology 2021, used symbolic regression to model the effects of mutations on protein thermal stability, and at the same time to understand how different factors modulate the effect on stability. As the paper shows, the problem is very hard to model, partly because of the limited amount of data available. But one specific mutation, from amino acid Valine to amino acid Alanine, had 47 documented entries in the dataset. Symbolic regression of this data using three factors of the wild-type (Valine) amino acid in the context of its structure, namely its relative solvent accessibility, secondary structure, and flexibility as quantified from atomic B-factors, returned this equation:
ΔTm (°C) = SS – SS / (8.58 RSA – 0.89) + 13.56 RSA – 7.35
which fitted the data with a correlation coefficient r of 0.68 and a root-mean-square error of 3.3 °C. The equation shows no relevant effect of flexibility, as the model works well with secondary structure (SS) and relative solvent accessibility (RSA) only. In fact, the model’s key terms are the offset of -7.35, which implies a globally destabilizing effect, and a strong positive modulation by RSA through the +13.56 coefficient: the more exposed the amino acid, the stronger the stabilizing effect of the mutation. This makes perfect physical sense, because Valine is a hydrophobic amino acid that prefers to remain hidden from water; when it is highly exposed (high RSA), its replacement by the less hydrophobic Alanine results in stabilization.
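Encoded as a function, the fitted model makes this trend easy to probe (the SS value used below is an arbitrary placeholder for the paper's numeric encoding of secondary structure, chosen only to isolate the RSA dependence):

```python
# The fitted model as a function. The SS value below is an arbitrary
# placeholder for the paper's numeric encoding of secondary structure,
# used only to isolate the dependence on solvent exposure (RSA).
def delta_tm(ss, rsa):
    """Predicted Tm change (deg C) for Valine-to-Alanine mutations."""
    return ss - ss / (8.58 * rsa - 0.89) + 13.56 * rsa - 7.35

buried = delta_tm(1.0, 0.05)   # mostly buried Valine
exposed = delta_tm(1.0, 0.90)  # highly exposed Valine
print(exposed > buried)  # True: exposure makes the mutation more stabilizing
```

This is the kind of quick sanity check that an analytic model permits and a black-box regressor does not.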
The paper also demonstrates that the symbolic regression produces better results than the alternative of using multiple linear regression on the same factors, especially regarding the shape (slope) of the correlation plots:

This paper, featuring an application of simple symbolic regression, is here:
Reviewing Challenges of Predicting Protein Melting Temperature Change Upon Mutation Through the…
Symbolic regression in that work was carried out with TuringBot, an easy-to-use program:
Some additional interesting works
One of the latest papers from the same group that developed AI Feynman describes AI Poincaré, a symbolic regression system that automatically discovers conserved quantities from trajectory data of dynamical systems. In tests on five physics Hamiltonians, the program could discover their conserved quantities, periodic orbits, phase transitions, and breakdown timescales, without any domain knowledge or even a physical model of how the trajectories were produced.
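The underlying notion of a conserved quantity can be illustrated with a harmonic oscillator (a toy check of mine, unrelated to the AI Poincaré implementation): along the trajectory, the energy stays constant while an arbitrary combination of the variables does not:

```python
# Toy check of the conserved-quantity idea: along a harmonic-oscillator
# trajectory x(t) = cos(t), v(t) = -sin(t), the energy E = (v**2 + x**2)/2
# is constant, whereas an arbitrary combination such as x + v is not.
import math

ts = [0.01 * i for i in range(500)]
xs = [math.cos(t) for t in ts]
vs = [-math.sin(t) for t in ts]

E = [(v * v + x * x) / 2 for x, v in zip(xs, vs)]
S = [x + v for x, v in zip(xs, vs)]

def spread(q):
    return max(q) - min(q)

print(spread(E) < 1e-12, spread(S) > 1.0)  # True True
```

AI Poincaré essentially searches for combinations of the trajectory variables whose spread along the data is anomalously small, then expresses them symbolically.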
In the spirit of guided symbolic regression searches like some of the examples above, this other paper feeds in terms from the theoretical equations underlying exciton energetics to build analytic representations of the exciton binding energy.
This review provides an overview of symbolic regression as applied to the materials sciences, with examples from this domain and from engineering problems. Along the way, the review refers to a 2009 Science paper that presented symbolic regression as a way to discover the equations of natural laws from experimental data, a foundation for probably all the other works I presented here.
If you have some interesting articles to share that use symbolic regression, let me know in the comments. I hope to soon be seeing more of this exciting, very useful technique.
Liked this article and want to tip me? [Paypal] -Thanks!
I am a nature, science, technology, programming, and DIY enthusiast. Biotechnologist and chemist, in the wet lab and in computers. I write about everything that lies within my broad sphere of interests. Check out my lists for more stories. Become a Medium member to access all stories by me and other writers, and subscribe to get my new stories by email (original affiliate links of the platform).