
Building bespoke stochastic process models using Rcpp

Advanced model building with R using Rcpp

Thoughts and Theory

Modeling Biological Systems in Time and Space

The scientific literature contains a vast "zoo" of elegant mathematical models for analyzing complex biological systems. Implementing them in R or Python can lead to some interesting, or indeed unexpected, numerical challenges.

Fig1. Spatial (1-D) logistic growth model estimated using Rcpp. Original application was for advantageous alleles spreading spatially across a landscape due to R. A. Fisher (in 1937) (see main text for details). Image by author.

1. The Malthusian growth model (1798) and friends

Modeling complex biological systems has a long history. An exponential population growth model was presented in 1798 (Malthusian growth), and many historically famous scientists have since added their own models to the literature – a tradition which of course continues today. These models cover everything from heredity and genetic evolution (Fisher and his model for waves of advantageous alleles in 1937) to models of the spread of infectious diseases (too numerous to mention; see the COVID-19 literature), through to contributions only made possible by the most recent advances in technology, e.g., genome sequencing technologies enabling the development of predator–prey models of microbial populations in humans.

Many of these models are mathematically elegant: they form concise descriptions of highly complex processes and typically focus on a small number of key parameters or features. What is impressive is how often "classical" models (perhaps with a slight modification) are still used today in real applications. This suggests that such models continue to capture the essence of the complex biological systems which surround us – at least as we currently understand them. For example, even a model as seemingly simple as logistic growth, which dates from around 1838 (discovered by Verhulst), frequently appears in recent (2020) peer-reviewed scientific journal articles on the spread of COVID-19.

Before we progress to the real content of this article, a quick remark on the use of "Advanced model building" in the subtitle. The models in the following examples require some bespoke numerical implementation, which arguably makes this less of a beginner topic. That said, none of the coding or mathematics is advanced, and all the code for running the models and generating the plots presented is self-contained and available on GitHub here.

2. Models, software, data fitting

The two models we consider here are typically referred to as mathematical models (as opposed to regression/statistical models), but our ultimate purpose is statistical: parameter estimation and prediction from real data. A useful analogy is with PK/PD modeling, where the systematic part – the expected value of the biological system over time – is defined using a mathematical model, which is nested inside a statistical model, giving a non-linear mixed effects model formulation. The same concept applies here, but we do not consider only time.

The two models we investigate describe the evolution of a biological system across two continuous dimensions – in the examples here, time and space, and time and probability. The models are defined below and are classical models from the scientific literature. Each model is expressed in terms of a partial differential equation (PDE). PDEs are used because these models describe the evolution of a dynamical system across two continuous dimensions. PDEs are widely used for this purpose, with the other major biological application being modeling across time and age. The latter is particularly important for any process where age is a driver, for example infectious disease modeling, the onset of dementia, demographic modeling or indeed any biological growth process.

Fig 2. Stochastic logistic growth model estimated using Rcpp. Surface is the probability density for a stochastic growth process as it evolves over time from an initial point source. Image by author.

2.1 Models

Model 1 – spatial logistic growth

The first model – Fig1 – extends classical logistic growth to one spatial dimension. It is often referred to as the "Fisher equation" and is defined as:

The first equation defines the evolution of the process over space and time, the following three equations define the necessary boundary and initial conditions needed to solve the equation. n_0(x) is the initial starting value of the system at time 0. Image by author.

This model defines the growth (or spread) of some organism or population across space and time, with n(x,t) being the density of the organism at location x and time t. Space here is one-dimensional, so spread is along a single axis. In Fig1 the initial source is at position x=40 and the axis of spread is constrained to be from x=0 through x=100, and t=0 through 150 time units.

As in classical logistic growth, this model is non-linear and density dependent (the speed of growth depends on the population size) and has three parameters: r, the rate of growth; K, the carrying capacity of the population; and D, the diffusion coefficient. K determines the maximum height of n(x,t) (K=50 in Fig 1), and D determines how wide the spatial spread is (D=1 in Fig 1, with r=0.2).
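Since the model definition above appears only as an image, here is a plain-notation sketch using the parameter names just described; note the zero-flux boundary conditions shown are my assumption of the standard choice for this setup and should be checked against the figure:

```latex
\frac{\partial n}{\partial t}
  = D\,\frac{\partial^{2} n}{\partial x^{2}}
  + r\,n\left(1 - \frac{n}{K}\right),
\qquad
\left.\frac{\partial n}{\partial x}\right|_{x=0}
  = \left.\frac{\partial n}{\partial x}\right|_{x=100} = 0,
\qquad
n(x,0) = n_{0}(x).
```

The first equation is the PDE itself; the remaining three are the two boundary conditions and the initial condition, matching the caption above.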

Model 2 – stochastic logistic growth

The second model – Fig2 – is also an extension of classical logistic growth, but rather than being extended across a spatial dimension, it is extended to evolve stochastically over time. This model is mathematically more complex and is defined as:

The first equation defines the model - see logistic growth - and then a Brownian motion diffusion term (dW) to add randomness. The second equation is the Fokker-Planck equation derived from the first equation, and the third and fourth equations are initial and boundary conditions, L_l and L_u, the lower and upper limits of n. The initial condition - last equation - is the Dirac delta function, this is particularly problematic (see main text). Image by author.

There are a number of possible variants for a stochastic version of classical logistic growth (see here for a discussion), and this is one of the simpler models. The second equation – the [Fokker–Planck equation](https://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation) – is derived easily from the first, by just dropping in its drift and diffusion terms (see here), and describes the evolution of f(n,t), the probability distribution of the size of the population over time. This is the equation we wish to solve. The boundary conditions for this model are also more complex.
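To make the "dropping in terms" step concrete (a generic sketch – the article's exact drift and noise terms appear only in the image above): for an SDE with drift a(n) and diffusion b(n), the corresponding Fokker–Planck equation and point-source initial condition are:

```latex
dn = a(n)\,dt + b(n)\,dW_t
\;\;\Longrightarrow\;\;
\frac{\partial f(n,t)}{\partial t}
  = -\frac{\partial}{\partial n}\bigl[a(n)\,f(n,t)\bigr]
  + \frac{1}{2}\,\frac{\partial^{2}}{\partial n^{2}}\bigl[b(n)^{2}\,f(n,t)\bigr],
\qquad
f(n,0) = \delta(n - n_{0}).
```

For logistic growth, one common choice is the drift a(n) = r n (1 − n/K) with multiplicative noise b(n) = σn, though other variants exist (hence the discussion linked above).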

To help clarify the differences in purpose and interpretation between model 1 and model 2:

  • Model 1 answers the question: "How many organisms are present in the population between spatial position x=10 and x=20 after 10 time units of evolution from an initial source?"
  • Model 2 answers the question: "What is the probability that the size of the population after 100 time units of evolution from an initial source is greater than 80?"

2.2 Software – numerical work with Rcpp

To first give a statistical/empirical context, the solution of model 1 for a given choice of parameters and initial conditions could be used as the mean trajectory in a non-linear regression model (e.g., with Gaussian errors) across space and time. The solution to the PDE for any given set of parameters is just a non-linear function of x and t. A complication is that as we cannot solve this PDE analytically, we cannot write out this solution function in closed form, so it needs to be computed numerically for every set of parameter choices. For example, inside a fitting algorithm such as non-linear least squares, a function which solves the PDE must be called every time the algorithm searches for better-fitting parameter estimates.
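The loop just described can be sketched as follows. This is a toy (all function names are hypothetical): a closed-form logistic curve stands in for the numerical PDE solve, and a crude grid search stands in for a real optimizer such as non-linear least squares – the point is simply that every candidate parameter value triggers a fresh model solve:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the numerical model solve: closed-form logistic growth n(t).
// In the real workflow this call would run the method-of-lines PDE solver.
double logistic(double t, double r, double K, double n0) {
    return K / (1.0 + (K / n0 - 1.0) * std::exp(-r * t));
}

// Sum-of-squares objective: each candidate r requires a fresh model solve.
double sse(const std::vector<double>& ts, const std::vector<double>& ys,
           double r, double K, double n0) {
    double s = 0.0;
    for (std::size_t i = 0; i < ts.size(); ++i) {
        double e = ys[i] - logistic(ts[i], r, K, n0);
        s += e * e;
    }
    return s;
}

// Crude grid search over r (a real fit would use nls()/optim() or similar).
double fit_r(const std::vector<double>& ts, const std::vector<double>& ys,
             double K, double n0) {
    double best_r = 0.0, best = 1e300;
    for (double r = 0.01; r <= 1.0; r += 0.01) {
        double v = sse(ts, ys, r, K, n0);
        if (v < best) { best = v; best_r = r; }
    }
    return best_r;
}

// Hypothetical demo: synthesize noiseless data with r = 0.2, then recover r.
double demo_fit() {
    std::vector<double> ts, ys;
    for (double t = 0.0; t <= 150.0; t += 10.0) {
        ts.push_back(t);
        ys.push_back(logistic(t, 0.2, 50.0, 1.0));
    }
    return fit_r(ts, ys, 50.0, 1.0);
}
```

In the real workflow, logistic() would be replaced by a call to the MOL solver, and fit_r() by nls()/optim() in R or an equivalent routine.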

Model 2 is already a stochastic model with a likelihood function and can (in theory) be directly fitted to available data given a suitable optimization routine (e.g., maximum likelihood estimation). An alternative is to provide the solution to the PDE – which is a probability distribution – to a Markov chain Monte Carlo (MCMC) sampling engine (e.g., r-nimble) for a Bayesian analysis. But the concept is identical to model 1: at each step of the fitting-to-data process (maximum likelihood or Bayesian), a solution is needed to the relevant PDE for any given set of parameters and initial conditions.

Rcpp

Image by author.

There are a number of methodologies for numerically solving PDEs, and arguably the easiest to implement is the method of lines (MOL), which is what we use here, implemented using Rcpp. The MOL approach discretizes a PDE into a system of ODEs by using finite difference approximations for the spatial derivatives (in our examples the "x" and "n" dimensions respectively). The key programming task is preparing ragged 2-d vectors which contain suitable repeating patterns of coefficients. This is not particularly taxing to set up, but it helps to write it out longhand on paper first. The number of "rows" in the 2-d vectors depends on the grid size required (a parameter in the MOL which determines accuracy).
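As an illustration of the discretization step (a sketch with hypothetical names, not the article's code): each interior ODE uses the central difference stencil, and I assume zero-flux boundaries handled via ghost nodes – the actual boundary treatment must match the model's conditions:

```cpp
#include <cstddef>
#include <vector>

// Method-of-lines right-hand side for the Fisher equation:
// interior nodes use the central difference (n[i-1] - 2 n[i] + n[i+1]) / h^2;
// the end nodes use the ghost-node form 2 (n[1] - n[0]) / h^2, which encodes
// a zero-flux boundary (an assumed boundary type, purely for illustration).
std::vector<double> fisher_rhs(const std::vector<double>& n,
                               double D, double r, double K, double h) {
    std::size_t m = n.size();
    std::vector<double> dndt(m);
    for (std::size_t i = 0; i < m; ++i) {
        double diff;
        if (i == 0)          diff = 2.0 * (n[1] - n[0]) / (h * h);
        else if (i == m - 1) diff = 2.0 * (n[m - 2] - n[m - 1]) / (h * h);
        else                 diff = (n[i - 1] - 2.0 * n[i] + n[i + 1]) / (h * h);
        dndt[i] = D * diff + r * n[i] * (1.0 - n[i] / K);
    }
    return dndt;
}
```

A constant state at the carrying capacity K is an easy sanity check, since both the diffusion and logistic terms should then vanish at every node.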

One aspect which needs bespoke attention is coding the boundary conditions. The boundary conditions are the first and last entries in the finite difference scheme and need to be adjusted to the type of boundary condition in the model. An excellent guide to creating a finite difference scheme – and in particular to calculating the terms needed for even complex boundary conditions – can be found here.

The last part needed in MOL is an ODE solver, ideally something fast with adaptive error control/step sizing. In these examples the odeint solver from the Boost library was used with minimal adaptations to an existing example from the odeint documentation.
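odeint itself needs the Boost headers, so as a stand-in here is a fixed-step classical RK4 stepper with the same state/RHS call shape (a sketch only – odeint's controlled steppers additionally adapt the step size to an error tolerance):

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using State = std::vector<double>;
// odeint-style RHS signature: f(state, dstate_dt, t)
using Rhs = std::function<void(const State&, State&, double)>;

// One classical RK4 step of size dt, updating y in place.
void rk4_step(const Rhs& f, State& y, double t, double dt) {
    std::size_t m = y.size();
    State k1(m), k2(m), k3(m), k4(m), tmp(m);
    f(y, k1, t);
    for (std::size_t i = 0; i < m; ++i) tmp[i] = y[i] + 0.5 * dt * k1[i];
    f(tmp, k2, t + 0.5 * dt);
    for (std::size_t i = 0; i < m; ++i) tmp[i] = y[i] + 0.5 * dt * k2[i];
    f(tmp, k3, t + 0.5 * dt);
    for (std::size_t i = 0; i < m; ++i) tmp[i] = y[i] + dt * k3[i];
    f(tmp, k4, t + dt);
    for (std::size_t i = 0; i < m; ++i)
        y[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}

// Check: integrate dy/dt = -y from y(0) = 1 up to time T.
double integrate_decay(double T, double dt) {
    State y{1.0};
    Rhs decay = [](const State& y, State& dydt, double) { dydt[0] = -y[0]; };
    double t = 0.0;
    while (t < T - 1e-12) { rk4_step(decay, y, t, dt); t += dt; }
    return y[0];
}
```

integrate_decay() integrates a one-equation test problem as a check; swapping in the MOL right-hand side (one entry per grid node) gives the PDE solver.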

The Rcpp library was used to implement the MOL solver. It was a good choice: there is already a Boost package for Rcpp, called BH, and the actual coding of a finite difference scheme is straightforward even with minimal C++ knowledge. Using a compiled language is important here for performance reasons, as if these models are to be fitted to data then the MOL solver may need to be called a great many times.

2.3 Code snippets

The full code for the models and plots is available here on GitHub. To give a flavor of the coding needed, below are two gists. The first illustrates one way to implement a finite difference scheme, which in effect involves rolling out a repeating set of coefficients into vectors. These vectors are then used later when defining the ODEs to be solved. This is the second gist.
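The gists themselves are embedded from GitHub; as a rough, hypothetical sketch of the first one's "rolling out" idea (my own names, and a zero-flux boundary chosen purely for illustration), the coefficient rows might be pre-computed like this:

```cpp
#include <cstddef>
#include <vector>

// Pre-compute one row of finite difference coefficients per grid node.
// Interior rows repeat the central stencil {1, -2, 1} / h^2; the first and
// last rows are shorter (hence a ragged 2-d vector) and encode the boundary
// conditions, here a zero-flux (ghost-node) boundary as an illustration.
std::vector<std::vector<double>> make_coeffs(std::size_t m, double h) {
    std::vector<std::vector<double>> c(m);
    double h2 = h * h;
    c[0] = { -2.0 / h2, 2.0 / h2 };               // boundary row (ghost node)
    for (std::size_t i = 1; i + 1 < m; ++i)
        c[i] = { 1.0 / h2, -2.0 / h2, 1.0 / h2 }; // repeating interior pattern
    c[m - 1] = { 2.0 / h2, -2.0 / h2 };           // boundary row (ghost node)
    return c;
}
```

The interior rows all repeat the same pattern; only the first and last rows differ, which is where the boundary conditions enter – and why the 2-d vector is ragged.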

The gist below numerically defines model 2 – a system of ODEs which corresponds to the PDE for model 2. Note the vectors vec2d and vec2dh, which are created similarly to the above gist and define the coefficients for the finite difference scheme. Of particular note is the upper boundary condition – the last line of code – which is hand coded to meet the mathematical requirements of the model definition.

2.4 Modeling and numerical challenges

Below are two additional plots, one each from model 1 and model 2, showing how the numerical solutions for each model evolve over time and space/probability. The solutions look smooth and intuitively reasonable – and indeed have been checked against Mathematica’s PDE solver. However, earlier attempts were not so successful, and it took some work to arrive at these correct solutions.

Fig3. Model 1 - Fisher equation - as Fig1, but now only up to t=50 time units and r=0.1, shows early phase of population growth across distance axis. Image by author.

Finite difference schemes are widely used and known to be reliable and accurate (provided a fine enough grid is used). This was also the experience here, but there were three areas of unexpected – although with hindsight perhaps this should say expected – difficulty: 1. the initial conditions; 2. the boundary conditions; and 3. understanding the actual properties of the model.

Fig 4. Model 2 - Stochastic logistic growth model - as Fig 2. but different viewpoint looking ahead from time t=0. Image by author.

1. Initial conditions

  • PDEs describe smooth functions, so the initial condition (at t=0) must be smooth across the second dimension, i.e., a smooth function across the space axis at time t=0. This means that evolving from a point source at t=0 is not strictly possible. Rather, a narrow smooth function should be used, for example a Gaussian density with very small variance. The practical challenge here is choosing this density/function. If it’s too narrow then the finite difference scheme will not give reliable results, whereas if it’s too wide then the scenario being modeled is no longer evolution from a point source. This is particularly problematic in model 2 because, when fitting these models to data, the process evolves from a known initial observation – hence the Dirac delta function as the initial condition.
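A minimal sketch of the narrow-Gaussian workaround (hypothetical helper names; the grid spacing and sd values are illustrative): discretize a Gaussian on the solver grid and renormalize it so that it integrates to one, i.e., a valid approximation to the Dirac delta:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build a uniform grid from a to b with spacing h.
std::vector<double> make_grid(double a, double b, double h) {
    std::vector<double> x;
    for (double v = a; v <= b + 1e-9; v += h) x.push_back(v);
    return x;
}

// Discretize a Gaussian centered at mu with standard deviation sd on the
// (assumed uniform) grid x, then renormalize by the rectangle rule so the
// discrete mass sum(g) * h equals 1 -- a smooth point-source approximation.
std::vector<double> gaussian_ic(const std::vector<double>& x,
                                double mu, double sd) {
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        double z = (x[i] - mu) / sd;
        g[i] = std::exp(-0.5 * z * z);
    }
    double h = x[1] - x[0], mass = 0.0;
    for (double v : g) mass += v * h;
    for (double& v : g) v /= mass;
    return g;
}

// Discrete mass of a grid function (rectangle rule).
double grid_mass(const std::vector<double>& g, double h) {
    double m = 0.0;
    for (double v : g) m += v * h;
    return m;
}
```

Shrinking sd moves the approximation closer to a point source but, as noted above, eventually destabilizes the finite difference scheme – in practice sd has to be tuned against the grid spacing.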

2. Boundary conditions

  • Numerical errors can propagate from a boundary condition, resulting in unreliable results across the entire finite difference scheme. This does not affect simple boundary conditions, as in model 1, but it can certainly impact boundary conditions such as the upper boundary in model 2. For example, in model 2 the upper boundary condition was correct but the finite difference scheme gave nonsensical results. This was fixed by algebraically simplifying the boundary condition – the correctness of the expression was unchanged. This was a subtle issue, as the original expression did not look numerically problematic, but it seems to have been destabilizing the finite difference scheme.

3. Model properties

  • In Fig 2 the parameters used were alpha=0.1 and beta=100, so in the long term we expect the value of n to average around 100. This was indeed the case when the initial point source was a narrow Gaussian distribution centered at n(0)=10. However, if this same Gaussian distribution is moved to be centered at n(0)=4, then for these same parameter choices the finite difference scheme results become nonsensical. Move it a little further away from the origin, and everything is again fine. This same behavior also occurred when using Mathematica’s PDE solver. On reading the relevant literature, this should not have been a surprise. This model is bounded below by zero – the value of n(t) cannot be less than zero – which means that the lower boundary condition is f(n=0,t)=0, i.e., no probability can accumulate at n=0. So in practice this model is not appropriate for modeling a small population, or a population/organism in the very early phase of growth/spread, as in these cases die-out (n approaching zero) may be a realistic possibility. The practical challenge is identifying in advance when this situation is likely to occur for a given set of parameters and initial conditions.

3. Fitting to data and closing remarks

The ultimate goal is to be able to fit these models to real data sets – inside an additional statistical model (non-linear mixed models, etc.). PDE models are clearly very rich in terms of shape and flexibility, and we have only examined two of the plethora of models across time and space, and time and probability, available in the scientific literature. It is also true that there are some numerical challenges when working with PDEs.

One of the biggest obstacles is how to robustly automate the process of solving a PDE such that it can be embedded inside a data fitting algorithm, such as maximum likelihood or MCMC. We solved the PDEs for model 1 and model 2 for just a few sets of parameter values and initial conditions, and this required some trial and error and some human decision making. The challenge is how to devise an algorithm to do this – one which makes sensible choices when faced with unreliable results, and indeed can first determine that the results produced are unreliable.

On the positive side, model 1 looks relatively tractable in terms of next steps, and this may be similar for other PDEs, for example those with time and age dimensions. More complex models – particularly those describing time and probability, i.e., the probabilistic evolution of processes, such as model 2 – do appear rather more difficult.

