
Sometimes I wonder if I am the only one who, after seeing a bunch of points, tries to draw a curve that somehow follows the trend. I think I am not alone. Everyone has this sort of intuition. It is easy when there are just a few points and the curve I draw is just a straight line, but it keeps getting harder every time I add more points or when the curve I am looking for differs from a straight line. In those cases, a curve fitting process can solve all my problems. I admit it is exciting to enter a bunch of points and find a curve that matches the trend "perfectly". But how does this work? Why is fitting a line not the same as fitting a strange-shaped curve? Everyone is familiar with linear least squares, but what happens when the expression we are trying to match is not linear? I asked myself those questions, and that led me on a journey through math articles, Stack Overflow posts [1], some deep mathematical expressions (at least for me!), and an interesting story about the discovery of an algorithm. This is my attempt to explain all of that in the simplest and yet most useful way I can…
The problem
In some cases, a linear regression is not enough. Sometimes it is necessary to adjust a series of data to a non-linear expression, one that is not linear in its parameters. In these cases, ordinary least squares won’t work for us and we need to resort to different methods. The first time I encountered this situation was when I was trying to fit 2D data to a function that looked like this:

Fortunately, there were many ways in which I could automatically find the best value for Beta. Anyone familiar with nlinfit from MATLAB or with the curve_fit function from SciPy knows that this non-linear regression process is straightforward once you have a mathematical expression for the model. All these libraries work similarly: they use an iterative algorithm to find the parameter or combination of parameters that minimizes the difference between the observed data and the model’s response. Let’s use some equations to express this.
Let’s assume that we have a function f that depends on an independent variable x and a group of parameters a, that is, y = f(x, a). This function models a process whose measured output ŷ we already know. The objective is to find the group of parameters a that produces a y as close as possible to ŷ. One way of measuring how close we are to ŷ is to calculate the sum of the squares of the residuals. A residual is defined as the difference between ŷ and y at each point. This can be expressed as:
ri = ŷi − yi = ŷi − f(xi, a)
In this case, the subscript i refers to the data point we are analyzing. If we are trying to adjust a curve with 100 data points, then we need to calculate the residual for each one of those points. In the end, we will have an r1, r2, r3, and so on, until we reach r100 in this particular example. The sum of the squares of the residuals corresponds to:
S = Σ ri² = Σ (ŷi − f(xi, a))²
Finding the combination of parameters a that generates the lowest possible value of S means that those parameters produce the best possible match between the y calculated from our model and the observed values ŷ.
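To make this concrete, here is a minimal sketch in Python of how the residuals and S could be computed. The exponential model, the synthetic data, and the trial parameters are all hypothetical choices for illustration; they are not the model from this article.

```python
import numpy as np

# Hypothetical model y = f(x, a); this exponential form is only an example.
def f(x, a):
    return a[0] * np.exp(-a[1] * x) + a[2]

x = np.linspace(0.0, 10.0, 100)            # independent variable (100 data points)
y_hat = f(x, np.array([2.0, 0.3, 0.5]))    # "observed" output ŷ (generated synthetically here)

a_trial = np.array([1.5, 0.5, 0.0])        # some trial parameters
r = y_hat - f(x, a_trial)                  # residuals ri = ŷi - f(xi, a)
S = np.sum(r**2)                           # sum of the squared residuals
print(S)
```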
A graphical way of showing the problem
The following plot shows some data points in red and the model response in purple. If we want to measure how well this model fits the data points, we can calculate the differences between the data points (ŷ) and the model response (y) and then sum the squares of these differences (residuals). This idea can be extended to functions that contain multiple independent variables (x1, x2, …, xn).

The solution (Derivatives)
A common way of finding a minimum value of a function is to calculate its derivative with respect to a particular variable. In this case, we want to find the value of a that minimizes the function S. This can be written as:
dS/daj = d(Σ ri²)/daj
The subscript j indicates that there can be multiple parameters a, since the function f depends on the independent variable x and one or more parameters a1, a2, …, aM. In this case, we need to take the partial derivative of the function with respect to each one of these parameters. The minimum value of a function is found where its derivative equals zero. So, our previous equation ends up like this:
dS/daj = d[Σ (ŷi − f(xi, a))²]/daj = 0
Note how I expanded ri to remind you that this residual is simply the difference between the observed value and the value calculated from the model. At this point, it is worth having a graphical explanation of derivatives and of why, when they equal zero, we can say that we have found a minimum (or maximum).
A graphical explanation for the minimization of a function using derivatives
A derivative can be defined as a measure of how a function changes with respect to its arguments. One of the simplest examples is a function of the type y = mx. The derivative of this function with respect to x (dy/dx) is m. This means that for every small change in x, the output y changes by m times that amount. So the derivative of this function represents how much y changes after a change in x. Visually, this can be seen as the slope of a line tangent to the function at a particular point.
The next figure shows a function that is completely different from the straight line we mentioned before. In that function of the type y = mx, the ratio of the change in y with respect to x was always the same regardless of the value of x. In this case, the ratio changes with x. You can see how each of the points shown in the figure has a different slope (m) for its tangent line. This slope represents the derivative of the function evaluated at that particular point. One way of finding minimum and maximum values of a function is to search for the places where the slope is zero. In this case, an x of 24.5 gives us a minimum value while an x of 10 gives us a maximum.

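The values 24.5 and 10 belong to the curve in the figure above. If you want to play with the same idea numerically, here is a small sketch that locates the extrema of an arbitrary, made-up cubic by looking for the points where a finite-difference estimate of the slope changes sign.

```python
import numpy as np

# A made-up curve with one local maximum and one local minimum.
def g(x):
    return x**3 - 3 * x**2 + 2 * x

def slope(func, x, h=1e-6):
    # Central finite difference: a numerical estimate of the derivative at x.
    return (func(x + h) - func(x - h)) / (2 * h)

xs = np.linspace(-1.0, 3.0, 2001)
slopes = slope(g, xs)

# Candidate maxima/minima sit where the slope crosses zero.
crossings = xs[:-1][np.diff(np.sign(slopes)) != 0]
print(crossings)   # roughly 0.42 (maximum) and 1.58 (minimum) for this cubic
```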
It might feel like things are starting to get complicated at this point, but bear with me. It is not as hard as it looks! If you are familiar with calculus and derivatives, you will know that the result of differentiating the last equation is:
dS/daj = −2 Σ (ŷi − f(xi, a)) · df(xi, a)/daj = 0
The term df(xi, a)/daj corresponds to the partial derivative of the function f with respect to each of the parameters a. At this point, it is worth remembering that our model can contain multiple parameters a and we need to find the derivative of the function f with respect to each of these parameters.
Note that this calculation is performed for each of the points that we have in the data. That is why our function f depends on xi and aj: we have i values of x and j values of a. We can compile all these derivatives into a single term known as the Jacobian. What a strange name! The Jacobian is a matrix that contains all the first-order partial derivatives of the function with respect to each of its parameters: one row per data point i and one column per parameter j, so that Jij = df(xi, a)/daj.
Remember that the subscript i represents a particular data point. If our data consists of 100 points and our model has 3 parameters, then this Jacobian has 100 rows and 3 columns.
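When the partial derivatives are easy to write down, the Jacobian can be built analytically; otherwise, it can be approximated with finite differences. Here is a rough sketch of the numerical route for a hypothetical 3-parameter model (again, just an illustrative choice), so that J ends up with 100 rows and 3 columns:

```python
import numpy as np

def jacobian(f, x, a, eps=1e-8):
    """Numerical Jacobian of the model f(x, a): one row per data point i,
    one column per parameter j (forward finite differences)."""
    J = np.zeros((x.size, a.size))
    f0 = f(x, a)
    for j in range(a.size):
        a_step = a.copy()
        a_step[j] += eps                       # perturb one parameter at a time
        J[:, j] = (f(x, a_step) - f0) / eps    # df(xi, a)/daj, approximately
    return J

# Hypothetical 3-parameter model, only for illustration.
def f(x, a):
    return a[0] * np.exp(-a[1] * x) + a[2]

x = np.linspace(0.0, 10.0, 100)
a = np.array([2.0, 0.3, 0.5])
print(jacobian(f, x, a).shape)   # (100, 3)
```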
If we use the Jacobian to re-write the last equation we found for dS/da, we will have:
dS/da = −2 Jᵀ (ŷ − y) = −2 Jᵀ r = 0
Note how I am using matrices to express this equation. I have removed the sum, and now the Jacobian is written as a matrix and the residuals as a vector. Remember that all these equations are solved for all data points simultaneously, so using matrices is really convenient. At this point, I will show you two ways in which we can solve this equation and find the parameters that best fit the initial function f.
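The matrix form is not just cosmetic; it is the element-wise sum written at once for every data point and every parameter. A quick sanity check with stand-in arrays (a random J and r, used only to illustrate the equivalence):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(100, 3))   # stand-in Jacobian: 100 data points, 3 parameters
r = rng.normal(size=100)        # stand-in residual vector ŷ - y

# dS/daj written element by element, one parameter at a time...
grad_loop = np.array([-2 * np.sum(r * J[:, j]) for j in range(J.shape[1])])
# ...and the same quantity written with matrices.
grad_matrix = -2 * J.T @ r
print(np.allclose(grad_loop, grad_matrix))   # True
```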
A little bit more of math
The gradient descent
You might have heard this name before. Gradient descent is an optimization algorithm used to find local minima of a function. The idea behind it is not difficult to understand. Because the function we are trying to minimize is differentiable, we know its gradient at any point, which means we know which direction to take to keep going downhill. At each iteration, we move a little bit closer to the function’s minimum. Two important aspects of the gradient descent method are the initial guess and the size of the steps we take at each iteration. The efficiency of this method is highly dependent on these two things.
What does this have to do with non-linear regressions? Well, we could use the gradient descent method to find the minimum value of the function S. In that case each of the steps we take towards the minimum point can be expressed as:
hGD = α Jᵀ (ŷ − y) = α Jᵀ r
This hGD is added to the current estimate of the parameters, and the process is repeated until we find a minimum or we exceed the maximum number of iterations. The α that appears in the last equation is used to increase or decrease the size of the step we are taking. As I mentioned before, the performance of the gradient descent method has a lot to do with the size of the steps as well as with the initial guess.
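As a sketch of what that loop could look like, here is a bare-bones gradient-descent fit of the same hypothetical exponential model used in the earlier snippets (the model, data, step size and iteration limit are all assumptions made for illustration):

```python
import numpy as np

# Hypothetical model y = a0*exp(-a1*x) + a2 and its analytic Jacobian.
def f(x, a):
    return a[0] * np.exp(-a[1] * x) + a[2]

def jacobian(x, a):
    return np.column_stack((np.exp(-a[1] * x),             # df/da0
                            -a[0] * x * np.exp(-a[1] * x),  # df/da1
                            np.ones_like(x)))               # df/da2

x = np.linspace(0.0, 10.0, 100)
y_hat = f(x, np.array([2.0, 0.3, 0.5]))   # synthetic "observed" data ŷ

a = np.array([1.0, 0.5, 0.0])             # initial guess
alpha = 1e-3                              # step-size factor
for _ in range(20000):
    r = y_hat - f(x, a)                   # residuals ŷ - y
    h_gd = alpha * jacobian(x, a).T @ r   # hGD = α Jᵀ r
    a = a + h_gd
    if np.sum(h_gd**2) < 1e-14:           # stop once the steps become tiny
        break
print(a)   # slowly approaches [2.0, 0.3, 0.5]
```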
Gauss-Newton method
The gradient descent method is widely known and used but it can be quite slow depending on the number of parameters. An alternative is the Gauss-Newton method which, similar to gradient descent, is an iterative procedure in which we take multiple steps until we approach the right solution. In this case, we get a new combination of parameters through:
an+1 = an + hGN
Where hGN represents the step we are taking in the Gauss-Newton method. How can we know the value of hGN at each iteration?
In the Gauss-Newton method, the function f is approximated using a first-order Taylor expansion, which means that:
fi(an+1) = fi(an + hGN) ≈ fi(an) + Σ [dfi(a)/daj] hj
Here the hj are the components of the step hGN. Remember we said that the terms dfi(a)/daj are what we compiled into the Jacobian, so the previous equation can also be written as:
f(an+1) ≈ f(an) + J hGN
If we take the condition Jᵀ (ŷ − y) = 0 that we derived earlier and replace y = f(an) with this approximation of f(an+1), we end up with:
Jᵀ (ŷ − f(an) − J hGN) = 0
Which can be re-organized as:
Jᵀ J hGN = Jᵀ (ŷ − f(an))
And the step is calculated using the following equation:
hGN = (Jᵀ J)⁻¹ Jᵀ (ŷ − f(an))
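A Gauss-Newton sketch with the same hypothetical model used above (the solve call plays the role of the inverse of JᵀJ, which is cheaper and numerically safer than forming the inverse explicitly):

```python
import numpy as np

# Same hypothetical model and analytic Jacobian as in the gradient-descent sketch.
def f(x, a):
    return a[0] * np.exp(-a[1] * x) + a[2]

def jacobian(x, a):
    return np.column_stack((np.exp(-a[1] * x),
                            -a[0] * x * np.exp(-a[1] * x),
                            np.ones_like(x)))

x = np.linspace(0.0, 10.0, 100)
y_hat = f(x, np.array([2.0, 0.3, 0.5]))        # synthetic "observed" data ŷ

a = np.array([1.5, 0.4, 0.3])                  # initial guess, reasonably close
for _ in range(50):
    r = y_hat - f(x, a)                        # residuals ŷ - f(an)
    J = jacobian(x, a)
    h_gn = np.linalg.solve(J.T @ J, J.T @ r)   # (JᵀJ) hGN = Jᵀ (ŷ - f(an))
    a = a + h_gn
    if np.sum(h_gn**2) < 1e-16:                # stop once the steps become tiny
        break
print(a)   # reaches [2.0, 0.3, 0.5] in a handful of iterations
```

As the text suggests, plain Gauss-Newton behaves best when the starting guess is already in the neighborhood of the solution, which is why a fairly close guess is used in this sketch.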
The following chart applies to both methods. In both cases, it is necessary to specify an initial guess for the parameters as well as a stopping criterion. Here, the stopping criteria are a maximum number of iterations or a minimum value of the squared error, whichever is reached first.

Levenberg-Marquardt method or damped least squares
Note that the equations for hGD and hGN are quite similar, and this has a lot to do with the Levenberg-Marquardt method. This method switches between gradient descent and Gauss-Newton depending on how close we are to a solution. The Levenberg-Marquardt method is expressed as:
(Jᵀ J + λ I) hLM = Jᵀ (ŷ − f(an))
In the previous equation, I represents the identity matrix and λ is known as the damping factor. This parameter is what allows the switch between a Gauss-Newton and a gradient descent update. When λ is small, the method takes a Gauss-Newton step, and when λ is large, the step follows the gradient descent method. Generally, the first value of λ is large so that the first steps are taken in the gradient descent direction [2]. The logic behind this is that the Gauss-Newton method is more effective during the final iterations, whereas gradient descent is useful at the beginning of the process, when the algorithm is still far from the ideal solution.
As you can see, the Levenberg-Marquardt algorithm is a combination of the gradient descent and Gauss-Newton algorithms. Because of this, the efficiency of the Levenberg-Marquardt algorithm is also highly dependent on the selection of the initial guess as well as of the damping factor [3]. In addition, how the damping factor is increased and reduced affects the performance of the algorithm. At each iteration, the damping factor is multiplied or divided by a factor depending on how good the previous iteration was. Generally, λ is increased by a factor of 2 and reduced by a factor of 3.
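Putting the pieces together, here is a bare-bones Levenberg-Marquardt sketch for the same hypothetical model as before. The starting λ, the tolerance, and the multiply-by-2 / divide-by-3 damping update follow the description above; a robust implementation such as SciPy’s handles many more details.

```python
import numpy as np

# Hypothetical model y = a0*exp(-a1*x) + a2 and its analytic Jacobian.
def f(x, a):
    return a[0] * np.exp(-a[1] * x) + a[2]

def jacobian(x, a):
    return np.column_stack((np.exp(-a[1] * x),
                            -a[0] * x * np.exp(-a[1] * x),
                            np.ones_like(x)))

def levenberg_marquardt(x, y_hat, a0, lam=100.0, max_iter=200, tol=1e-14):
    a = np.asarray(a0, dtype=float)
    S = np.sum((y_hat - f(x, a))**2)                  # current sum of squared residuals
    for _ in range(max_iter):
        r = y_hat - f(x, a)
        J = jacobian(x, a)
        # Damped step: solve (JᵀJ + λI) hLM = Jᵀ (ŷ - f(an)).
        h = np.linalg.solve(J.T @ J + lam * np.eye(a.size), J.T @ r)
        S_new = np.sum((y_hat - f(x, a + h))**2)
        if S_new < S:           # good step: accept it and damp less (more Gauss-Newton-like)
            a, S = a + h, S_new
            lam /= 3.0
        else:                   # bad step: reject it and damp more (more gradient-descent-like)
            lam *= 2.0
        if np.sum(h**2) < tol:  # stop once the steps become tiny
            break
    return a

x = np.linspace(0.0, 10.0, 100)
y_hat = f(x, np.array([2.0, 0.3, 0.5]))                 # synthetic "observed" data ŷ
print(levenberg_marquardt(x, y_hat, [1.0, 0.5, 0.0]))   # approaches [2.0, 0.3, 0.5]
```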

As an interesting historical side note, Levenberg proposed this method in 1944, but it was not until 19 years later that his method earned notoriety, due in part to the reference made to it in a paper by Marquardt. Levenberg published the algorithm while working at the Frankford Arsenal, an army ammunition plant. In a certain way, it could be said that Marquardt re-discovered the damped least squares method, and that is why today both names are used to refer to it. Pujol [4] has a complete description of the work that was done by Levenberg and Marquardt and of how each of them contributed separately to the algorithm we know today.
The code
The Levenberg-Marquardt algorithm can be implemented in many forms [5]. In this case, I am presenting a very simple implementation of the algorithm in a Python notebook. I am also comparing my results against those of the curve_fit function from SciPy. That function has a more robust implementation of the algorithm that will outperform the one I show you. However, I think this code is a good starting point for anything more complex and for understanding what is happening "under the hood". Although the example shown in the notebook refers to a 2-dimensional problem, the logic behind the algorithm can be applied to multiple cases.
The example included in this notebook refers to decline curve analysis (DCA), a common methodology used in the petroleum engineering world. The notebook contains a brief explanation of DCA and some examples. You can access this notebook here, on my GitHub repository.
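For reference, this is roughly what the SciPy cross-check could look like. The exponential model below is the same hypothetical one used in the earlier sketches, not necessarily the DCA model from the notebook; curve_fit uses a Levenberg-Marquardt routine (method="lm") for unconstrained problems.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model written in the signature curve_fit expects:
# first argument x, then one argument per parameter.
def f(x, a0, a1, a2):
    return a0 * np.exp(-a1 * x) + a2

x = np.linspace(0.0, 10.0, 100)
y_hat = f(x, 2.0, 0.3, 0.5)          # synthetic "observed" data ŷ

popt, pcov = curve_fit(f, x, y_hat, p0=[1.0, 0.5, 0.0], method="lm")
print(popt)                          # should match the simple implementation above
```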
The conclusion
Every day it gets easier to perform calculations that years ago would have taken a long time. However, it is always important to understand where all these calculations come from. Linear and non-linear regression are the basis for many other things that can be done in data analytics and machine learning. Today, when everyone is looking at these fields in search of answers or more efficient ways of doing things, it is important to know how the fundamentals work.
References
- How does the Levenberg–Marquardt algorithm work in detail but in an understandable way? Stack Overflow post, 07/15/2009.
- Gavin, H. (2020). The Levenberg-Marquardt algorithm for nonlinear least squares curve-fitting problems. Department of Civil and Environmental Engineering, Duke University.
- Mittrapiyanuruk, P. A Memo on How to Use the Levenberg-Marquardt Algorithm for Refining Camera Calibration Parameters. Robot Vision Laboratory, Purdue University.
- Pujol, J. (2007). The solution of nonlinear inverse problems and the Levenberg-Marquardt method. Geophysics, vol. 72, no. 4.
- Moré, J. (1977). The Levenberg-Marquardt algorithm: implementation and theory. Proceedings of the 7th Dundee Biennial Conference on Numerical Analysis, University of Dundee, Scotland.