If you’re looking to get into machine learning, linear regression is a great place to start learning the basics. It’s a simple but versatile tool that is widely used in both industry and research, so if you want to apply your machine learning knowledge and skills in a professional or academic environment, it’s key to have a strong handle on it.
Linear regression is one of the few machine learning problems that can have a closed-form solution.
When a closed-form solution is available, it should generally be preferred over an iterative algorithm, as it is the most direct way to find the optimal solution.

What is a closed-form solution?
A closed-form solution is an equation that can be solved in terms of a finite number of standard functions and mathematical operations. Wolfram MathWorld has a very extensive and theoretical definition of it, but it’s convoluted.
A simple way to think about it is to ask whether it is possible to solve your optimization problem with some old-school calculus instead of a full machine learning algorithm. Instead of letting an algorithm iteratively improve the objective, as you do with gradient descent, genetic algorithms and so on, you take the derivative of your loss function, set it equal to zero, and solve for the optimal values of the weights and bias (also known as the m and b of your y = mx + b equation).
This is not possible in all scenarios, but I’ll outline when it can be used later on. If it is possible, the plain math solution will lead you straight to the optimal solution. Let’s dig a little more into how closed-form solutions work.

Closed-form explanation
An optimization problem is closed-form solvable if the objective is differentiable with respect to the weights w and the resulting equation can actually be solved for w; this reasoning applies when the problem is a minimization or maximization.
To check whether that is the case, you take the objective, differentiate it with respect to the weights w, set the derivative equal to zero, and solve for w.
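For intuition, here is that recipe applied to the simplest possible case, a one-dimensional fit y = mx + b with a squared-error loss (the same steps are carried out in matrix form later in the article):

L(m, b) = \frac{1}{2} \sum_{n=1}^{N} (m x_n + b - y_n)^2

Setting \partial L / \partial b = \sum_n (m x_n + b - y_n) = 0 gives b = \bar{y} - m\bar{x}, and setting \partial L / \partial m = \sum_n (m x_n + b - y_n) x_n = 0 then gives

m = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2}

No iterative optimization required.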
The Derivation of the Closed-Form Solution for Linear Regression

In machine learning, we often use 2D visualizations for our poor, little human eyes and brains to better understand. However, we almost never have 2D data. Although the graphs used throughout the article are 2D, this solution applies to problems with multi-dimensional inputs and outputs.
Given a set of N input vectors of dimension D (in the visualization, D = 1, so we have one-dimensional input data) and a set of target vectors of dimension S (in the visualization, S = 1), we’re looking to find a mapping function from inputs to targets.
Observations:

X = \{x_1, x_2, \dots, x_N\}, \quad x_n \in \mathbb{R}^D

Targets:

Y = \{y_1, y_2, \dots, y_N\}, \quad y_n \in \mathbb{R}^S

Mapping Function:

f: \mathbb{R}^D \to \mathbb{R}^S \quad \text{such that} \quad f(x_n) \approx y_n
Since we’re using linear regression, the function f is linear. Any linear function is of the form y = mx + b. A one-dimensional mapping like y = mx + b means a single x value produces a single y value; for example, if y = 2x + 3, then when x = 2, y = 7. In that case, our function looks like f: 1 → 1. In a real-world scenario, however, we have multi-dimensional data, so the formulation is a little different. We are looking to map multi-dimensional inputs to multi-dimensional outputs.
Let’s take this one step at a time and first map from a multi-dimensional input to a one-dimensional output, so f: D → 1. In that case, the input x becomes a vector, and the weight factor m has to be a vector of weights rather than a scalar value. Our mapping function evolves to be:

f(x) = m^T x + b, \quad m \in \mathbb{R}^D, \; b \in \mathbb{R}
If we go one step further and make the leap from multi-dimensional input to multi-dimensional output, we need to find a mapping f: D → S, where D is the dimension of our input and S that of the output. One way to think of this is to find a weight vector m and bias b for each dimension of the output. It’s as if we’re stacking a bunch of these functions:

f_i(x) = m_i^T x + b_i
where i = 1, 2, …, S. If we formally stack these mappings together, the bias b is simply the vector of stacked biases b_i, and the weight matrix W is formed by stacking all the m vectors as columns, so

W = [\, m_1 \;\; m_2 \;\; \dots \;\; m_S \,] \in \mathbb{R}^{D \times S}, \quad b = [\, b_1, b_2, \dots, b_S \,]^T \in \mathbb{R}^S, \quad f(x) = W^T x + b
We rarely have just one sample in a dataset. As mentioned above, we usually have a set of N data samples. We can collect all these samples into two matrices, one for the inputs and one for the targets, where each row is one sample:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in \mathbb{R}^{N \times D}, \quad Y = \begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_N^T \end{bmatrix} \in \mathbb{R}^{N \times S}
We want to pass all of these samples through the function at the same time, which means we need to extend the function to operate on the whole matrix:

f(X) = XW + \mathbf{1}_N b^T

where \mathbf{1}_N is a column of N ones, i.e., the bias is added to every row. The form we would really like at this step is the simpler

f(X) = XW
Surprisingly, it is fairly simple to get to that formulation. We just need to absorb the bias. We’ll change the matrix X by adding a column of 1s to the end, creating:

X \leftarrow [\, X \;\; \mathbf{1}_N \,]

and the matrix W by adding the bias vector as the last row, so W becomes

W \leftarrow \begin{bmatrix} W \\ b^T \end{bmatrix}
Our new dimensions are:

X \in \mathbb{R}^{N \times (D+1)}, \quad W \in \mathbb{R}^{(D+1) \times S}, \quad Y \in \mathbb{R}^{N \times S}
We’ve successfully absorbed the bias and XW approximates Y.
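As a minimal NumPy sketch of the bias-absorption trick (the array sizes here are hypothetical, chosen just for illustration):

```python
import numpy as np

def absorb_bias(X):
    """Append a column of ones so the bias can live inside the weight matrix."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Hypothetical sizes: N = 5 samples, D = 3 input features.
X = np.random.randn(5, 3)
X_aug = absorb_bias(X)
print(X_aug.shape)  # (5, 4), i.e. N x (D + 1)
```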
Now we need to formulate the optimization. Since we want XW to be as close as possible to Y, the optimization is the minimization of the distance between f(X) = XW and Y, or:

W^* = \arg\min_W \; \frac{1}{2} \lVert XW - Y \rVert^2
This is essentially the least-squares loss function, which classically looks like:

L = \frac{1}{2} \sum_{n=1}^{N} \big( f(x_n) - y_n \big)^2
but we’ve adapted it to work with matrices. The ½ is only there to make our lives easier when we get to the derivative. We square the errors because the direction of an error doesn’t matter: if y = 3, it makes no difference whether f(x) = 1 or f(x) = 5, as both predictions are off by 2 and should contribute the same amount of error.
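In NumPy terms, the matrix version of this loss is just half the squared Frobenius norm of the residual; a minimal sketch (the function name and shapes are my own, for illustration):

```python
import numpy as np

def least_squares_loss(X, W, Y):
    """0.5 * squared Frobenius norm of the residual XW - Y."""
    residual = X @ W - Y
    return 0.5 * np.sum(residual ** 2)
```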
As I mentioned at the start, to find the closed-form solution we now need to differentiate this loss with respect to the weights W:

L(W) = \frac{1}{2} \lVert XW - Y \rVert^2 = \frac{1}{2} \operatorname{tr}\!\big( (XW - Y)^T (XW - Y) \big)

= \frac{1}{2} \operatorname{tr}\!\big( W^T X^T X W - 2\, W^T X^T Y + Y^T Y \big)

\frac{\partial L}{\partial W} = X^T X W - X^T Y = X^T (XW - Y)
Hooray! We’ve successfully differentiated our loss function with respect to the weights W. Now all we have to do is set the derivative to zero and solve for W in order to find the optimal solution.

X^T (XW - Y) = 0 \;\;\Longrightarrow\;\; X^T X W = X^T Y \;\;\Longrightarrow\;\; W^* = (X^T X)^{-1} X^T Y
And that’s it! W* represents the optimal weights.
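Here is a minimal NumPy sketch of that closed form (the data and shapes are hypothetical, and np.linalg.solve is used instead of explicitly inverting X^T X, which evaluates the same expression in a numerically safer way):

```python
import numpy as np

def fit_linear_regression(X, Y):
    """Closed-form least-squares fit: solve (X^T X) W = X^T Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy data: N samples, D features plus the absorbed bias column, S outputs.
rng = np.random.default_rng(42)
N, D, S = 200, 3, 2
X = np.hstack([rng.normal(size=(N, D)), np.ones((N, 1))])
W_true = rng.normal(size=(D + 1, S))
Y = X @ W_true + 0.01 * rng.normal(size=(N, S))

W_star = fit_linear_regression(X, Y)
print(np.max(np.abs(W_star - W_true)))  # small: W* recovers the true weights
```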
For those who really love their theoretical linear algebra, feel free to dig into the bewildering world of pseudo-inverses. For a full-column-rank X, the (Moore-Penrose) pseudo-inverse is:

X^{+} = (X^T X)^{-1} X^T, \quad \text{so} \quad W^* = X^{+} Y
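In practice, NumPy already exposes this: np.linalg.pinv computes the pseudo-inverse and np.linalg.lstsq solves the least-squares problem directly. Reusing the X and Y from the sketch above, both of these should give, up to numerical precision, the same W* as the explicit formula:

```python
import numpy as np

W_pinv = np.linalg.pinv(X) @ Y                    # pseudo-inverse route
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # dedicated least-squares solver
```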
There’s also a formal proof of why we can invert the first term, X^T X, as this cannot simply be assumed. First, we assume X is full rank. X is a matrix of dimensions N x D (where, after absorbing the bias, D includes the extra column of ones). Full rank means that the number of linearly independent rows and columns equals min(N, D). N is our sample size and D is the number of features of the input data. For this to work, the number of samples N has to be larger than the number of features D, so we only need to show that the columns are linearly independent. We can reasonably assume that, because if they weren’t, we would have features in the dataset which add no information, like having both Celsius and Fahrenheit temperatures as features of each data point. Those two columns are linearly dependent (given the bias column), so we can just throw one of them out. If X is full rank, then by the properties of rank,

\operatorname{rank}(X^T X) = \operatorname{rank}(X) = D

so X^T X is also full rank, and it is a square D x D matrix. Therefore, it can be inverted!
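A small NumPy sketch of that argument, using the (made-up) Celsius/Fahrenheit example: once a column is a linear combination of the others, X^T X loses rank and the inverse no longer exists:

```python
import numpy as np

rng = np.random.default_rng(0)
celsius = rng.uniform(-10, 35, size=(100, 1))
ones = np.ones((100, 1))

# Independent columns: Celsius + bias column -> X^T X is full rank (invertible).
X_ok = np.hstack([celsius, ones])
print(np.linalg.matrix_rank(X_ok.T @ X_ok))    # 2, full rank

# Add Fahrenheit = 1.8 * Celsius + 32, a linear combination of the existing
# columns, and X^T X becomes rank-deficient, so it cannot be inverted.
X_bad = np.hstack([celsius, 1.8 * celsius + 32, ones])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2 out of 3, not invertible
```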
When to use a closed-form solution
A closed-form solution should be used whenever possible, as it leads directly to the optimal solution. In machine learning, we’re typically looking to minimize an error function or an equivalent problem, and a problem has a closed-form solution if the function being optimized is differentiable and the resulting equation can be solved for the variable we’re optimizing over. Although machine learning turns a lot of heads, sometimes a bit of basic linear algebra is the cleaner solution.
There are a couple of weaknesses regarding this approach, the most obvious of which is that we can only solve linear problems. This severely limits the set of problems which can be solved using this method.
However, you can use basis functions to apply closed-form solutions to non-linear data. A basis function has no learnable parameters of its own; it simply transforms the data into a different space, in which the model is once again linear in its weights.
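As a hedged sketch of the idea (the basis function, data and coefficients below are made up for illustration): suppose the target is quadratic in a single input x. Mapping x through the polynomial basis [x, x², 1] involves nothing to learn, yet the model stays linear in its weights, so the exact same closed form applies:

```python
import numpy as np

def polynomial_basis(x, degree=2):
    """Map a 1-D input to [x, x^2, ..., x^degree, 1] (bias column absorbed)."""
    return np.hstack([x ** d for d in range(1, degree + 1)] + [np.ones_like(x)])

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 2 * x ** 2 - x + 1 + 0.1 * rng.normal(size=(100, 1))  # non-linear target

Phi = polynomial_basis(x, degree=2)               # 100 x 3 design matrix
W_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # same closed form as before
print(W_star.ravel())                             # roughly [-1, 2, 1]
```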
Closed-form solutions are a simple yet elegant way to find an optimal solution to a linear regression problem. In most cases, especially when the number of features is small or moderate, computing the closed-form solution is significantly faster than running an iterative optimization algorithm like gradient descent.