
Comprehensive Study of Least Square Estimation (Part 1)

Ordinary, Constrained, Multi-objective, and Nonlinear least square.

Photo by Lukasz Szmigiel on Unsplash

Least-square estimation is one of the most widely used techniques in machine learning, signal processing, and statistics. It is the standard way of solving linear regression, which is widely used to model continuous outcomes.

It can be modeled as an MMSE estimator or a Bayes estimator with a quadratic cost. I have already written comprehensive articles about various estimators used in machine learning and signal processing. Please check the following article for more information.

Essential Parameter Estimation Techniques in Machine Learning and Signal Processing

I have also made a series of YouTube videos to go more in-depth about various estimation techniques. Please check and subscribe to my YouTube channel for more information.

This article is the first in a multi-part series discussing all the essentials of least-square estimation. Here is the outline of the series:

Outline

  • Ordinary Least Square (OLS) Estimation
  • Solution of OLS problem
  • Orthogonality Principle
  • OLS Using QR factorization (Advanced Reader)
  • Least Square Problem for Matrices
  • Multi-Objective Least-Square (Part 2)
  • Constrained Least-Square (Part 3)
  • Nonlinear Least-Square (Part 4)

In this article, I cover the ordinary least square in depth; in future articles, I will go in-depth into the multi-objective, constrained, and nonlinear least square. I will also link to my YouTube videos on the least-square topic for more information and more thorough explanations.


Ordinary Least-Square (OLS)

Assume A is an m x n matrix where m is greater than or equal to n (A is a tall matrix); then the following overdetermined system of equations has either a unique solution or no solution.

Ax = b
Figure 1: Linear Equation

The unique solution occurs if b is a linear combination of the columns of A, meaning that if you consider the column representation of A, then b lies in the column space of A. Another way to say this is that there exist scalars x₁, x₂, . . ., xₙ such that the following equation holds.

x₁a₁ + x₂a₂ + ⋯ + xₙaₙ = b
Figure 2: Condition for Linear Independence

Note: To see why the solution, if it exists, is unique, recall the assumption of the least-square problem: A is tall and has linearly independent columns, so it has a left inverse. Multiplying both sides of the equation in Figure 1 by the left inverse of A yields the unique solution. The rest of this article explains how to find this solution.

Note: To learn more about overdetermined and underdetermined systems of equations, tall and wide matrices, and the column and row interpretations of matrices, please refer to the following videos:

If Ax = b has no solution, then our goal is to find the x that minimizes the norm of the difference between Ax and b.

x̂ = argminₓ ‖Ax − b‖²
Figure 3: Least Square Problem Formulation

Column Interpretation of the least-square:

For the column interpretation, we are trying to find the linear combination of the columns of A that is closest to the vector b. Mathematically, this can be represented as follows:

x̂ = argminₓ ‖x₁a₁ + x₂a₂ + ⋯ + xₙaₙ − b‖²
Figure 4: The Column Representation of the Least-square

Row Interpretation of the least-square:

For the row interpretation, we are trying to minimize the sum of the m squared residuals corresponding to the m linear equations.

J(x) = ‖Ax − b‖² = r₁² + r₂² + ⋯ + rₘ²
Figure 5: The Row Representation of the Least-Square

r is the vector of residuals, where each component of r is the residual of one linear equation. For example:

r = Ax − b,  rᵢ = aᵢ₁x₁ + aᵢ₂x₂ + ⋯ + aᵢₙxₙ − bᵢ
Figure 6: The Row Representation of the Least-Square
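To make the row interpretation concrete, here is a minimal NumPy sketch. The matrix A, the vector b, and the candidate x below are made-up numbers, purely for illustration: the code builds the residual vector and evaluates the sum-of-squared-residuals objective.

```python
import numpy as np

# A small overdetermined system: 4 equations, 2 unknowns.
# (A, b, and x are made-up numbers, just for illustration.)
A = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 1.0],
              [2.0, 2.0]])
b = np.array([1.0, 2.0, 0.5, 3.0])

x = np.array([0.4, 0.3])   # an arbitrary candidate solution

r = A @ x - b              # residual vector, one entry per equation
J = np.sum(r ** 2)         # least-square objective = sum of squared residuals

print("residuals:", r)
print("objective J(x) =", J)
```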

Solution of the least-square problem

Here I will introduce two methods to solve the OLS problem. These two methods are equivalent.

Method 1: Component-wise Notation

Figure 7: Solution of the Least-Square

This method is based on writing the objective function J in terms of its components, differentiating the objective function with respect to x, and setting the result to zero.

Figure 8
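As a quick sanity check of the derivative that Method 1 produces, the sketch below compares the analytic gradient 2Aᵀ(Ax − b) against a finite-difference approximation of J. The random A, b, and x are placeholders, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # random tall matrix (illustrative data)
b = rng.normal(size=10)
x = rng.normal(size=3)

def J(x):
    r = A @ x - b
    return np.sum(r ** 2)      # the least-square objective

grad_analytic = 2 * A.T @ (A @ x - b)   # gradient from the component-wise derivation

# Finite-difference approximation of the gradient, one component at a time.
eps = 1e-6
grad_numeric = np.array([
    (J(x + eps * e) - J(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```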

Method 2: Matrix-Vector Notation

This method is based on writing the squared Euclidean norm of a vector as the product of the vector's transpose with itself.

J(x) = ‖Ax − b‖² = (Ax − b)ᵀ(Ax − b) = xᵀAᵀAx − 2bᵀAx + bᵀb
∇J(x) = 2AᵀAx − 2Aᵀb = 0  ⟹  x̂ = (AᵀA)⁻¹Aᵀb
Figure 9: Least-Square Solution Using Matrix-Vector Notation

The following equation, which gives rise to the solution of the least-square problem, is called the normal equation.

AᵀAx̂ = Aᵀb
Figure 10: Normal Equation

Methods 1 and 2 produce identical results; however, method 1 works in element-wise form, while method 2 gives a more compact, matrix-form solution.
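Here is a short NumPy sketch of Method 2: it forms and solves the normal equation, then compares the result against NumPy's built-in least-square solver (the data is random, purely for illustration). As a practical note, forming AᵀA explicitly can be numerically less stable, so library solvers or the QR approach discussed later are usually preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 4))   # tall matrix with (almost surely) independent columns
b = rng.normal(size=100)

# Method 2 in code: solve the normal equation (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's built-in least-square solver for comparison.
x_lstsq, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```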


Interpretation of the Solution

The matrix (AᵀA)⁻¹Aᵀ is called the left pseudo-inverse of matrix A. (The product of A transpose and A, AᵀA, is called the Gram matrix and is invertible if A has linearly independent columns.) This means:

((AᵀA)⁻¹Aᵀ)A = (AᵀA)⁻¹(AᵀA) = I

Note: If A is a left-invertible matrix (has linearly independent columns) the least square solution exists and is unique.
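As a small illustration (with random placeholder data), the sketch below builds the left pseudo-inverse explicitly from the formula (AᵀA)⁻¹Aᵀ, checks that it matches NumPy's pinv for a tall, full-column-rank A, and verifies that it is indeed a left inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))   # random illustrative data
b = rng.normal(size=50)

# Left pseudo-inverse built explicitly from the formula (A^T A)^{-1} A^T.
A_pinv_explicit = np.linalg.inv(A.T @ A) @ A.T

# NumPy's pseudo-inverse (computed via the SVD) agrees for a tall, full-column-rank A.
A_pinv = np.linalg.pinv(A)

print(np.allclose(A_pinv_explicit, A_pinv))         # True
print(np.allclose(A_pinv_explicit @ A, np.eye(3)))  # True: it is a left inverse of A
```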

Note: To learn more about gram matrix, left invertible, and right invertible matrices, please refer to the following videos:

Orthogonality Principle:

I discussed the orthogonality principle for generic estimators in the "Essential Parameter Estimation Techniques" article; please refer to the following resources for more information:

Essential Parameter Estimation Techniques in Machine Learning and Signal Processing

Here, I will introduce the orthogonality principle in the context of least-square estimation. The orthogonality principle states that the residual is orthogonal to the optimal estimator.

For example, consider the following figure:

Orthogonality Principle

b is the vector we would like to estimate, r is the residual, and Ax is the estimator.

(Ax)ᵀr = 0
Figure 11: Orthogonality Condition

Now, to check that the least-square solution is the optimal solution, we proceed as follows. If x̂ is a least-square solution, then the normal equation holds, meaning that:

AᵀAx̂ = Aᵀb  ⟺  Aᵀ(Ax̂ − b) = Aᵀr = 0

Therefore, the orthogonality principle holds:

(Ax̂)ᵀr = x̂ᵀAᵀr = x̂ᵀ(AᵀAx̂ − Aᵀb) = 0

where the final equality is due to the normal equation. The above equation states that if x̂ is a least-square solution, then Ax̂ is orthogonal to the residual, and therefore Ax̂ is the optimal estimator of the vector b.
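The orthogonality principle is also easy to verify numerically. The sketch below (random placeholder data again) computes the least-square solution and checks that the residual is orthogonal to Ax̂, and in fact to every column of A:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 3))   # random illustrative data
b = rng.normal(size=30)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-square solution
r = A @ x_hat - b                               # residual at the optimum

# The residual is orthogonal to the estimate A x_hat (and to every column of A).
print(np.isclose((A @ x_hat) @ r, 0.0))   # True (up to round-off)
print(np.allclose(A.T @ r, 0.0))          # True
```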


OLS Using QR factorization (Advanced topic)

This section requires some exposure to QR factorization, which is why I classify it as a more advanced topic. For more information and a step-by-step explanation of the QR method, please watch the following videos:

We first decompose matrix A using QR factorization, letting A = QR, where Q has orthonormal columns (QᵀQ = I) and R is an upper triangular matrix. Then we replace A in the OLS solution with its QR factorization.

x̂ = (AᵀA)⁻¹Aᵀb = ((QR)ᵀ(QR))⁻¹(QR)ᵀb = (RᵀQᵀQR)⁻¹RᵀQᵀb = (RᵀR)⁻¹RᵀQᵀb = R⁻¹Qᵀb
Figure 12: Least Square Solution Using QR Factorization

The last line is the least-square solution expressed via the QR factorization. QR factorization offers an efficient method for solving the least-square problem using the following algorithm:

  • Find the QR factorization of matrix A, namely A = QR.
  • Compute the product of Q transpose and b.
  • Solve Rx = Qᵀb using the back substitution method (since R is an upper triangular matrix); a NumPy sketch of this procedure is given below.
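Here is a minimal NumPy sketch of this algorithm. The back_substitution helper is a small function written just for this illustration, and the data is random placeholder data.

```python
import numpy as np

def back_substitution(R, y):
    """Solve R x = y for an upper-triangular R (illustrative helper)."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 3))   # random illustrative data
b = rng.normal(size=20)

Q, R = np.linalg.qr(A)         # thin QR: Q is 20x3 with orthonormal columns, R is 3x3
x_qr = back_substitution(R, Q.T @ b)

print(np.allclose(x_qr, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```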

Least Square Problem for Matrices

Assume A is an m x n matrix, X is an n x p matrix, and B is an m x p matrix. The goal is to minimize the Frobenius norm of AX − B. The problem is similar to the least-square problem for vectors once the matrices X and B are expressed in terms of their columns.

Note: The Euclidean or Frobenius norm of a matrix is defined as follows (written here in its squared form):

‖A‖² = Σᵢ Σⱼ aᵢⱼ² = ‖a₁‖² + ‖a₂‖² + ⋯ + ‖aₙ‖²

where aᵢⱼ is the element in the iᵗʰ row and jᵗʰ column of matrix A, and aⱼ is the jᵗʰ column of matrix A. What the above equation says is that the squared Frobenius norm of a matrix is the sum of the squared Euclidean norms of its columns. Therefore, the objective simplifies as follows:

‖AX − B‖² = ‖Ax₁ − b₁‖² + ‖Ax₂ − b₂‖² + ⋯ + ‖Axₚ − bₚ‖²

where xₖ and bₖ denote the kᵗʰ columns of X and B.

As can be seen, ‖AX − B‖² (the square of the Frobenius norm) can be written as the sum of p ordinary least-square objective functions, and each of these objective functions can be minimized independently. In practice, we are therefore solving p independent least-square problems to find the p optimal solutions corresponding to the p columns of matrix X. Mathematically, this can be represented as follows:

x̂ₖ = (AᵀA)⁻¹Aᵀbₖ,   k = 1, 2, . . ., p

In a more compact form, the solution can be represented as follows:

X̂ = (AᵀA)⁻¹AᵀB
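The sketch below (random placeholder data) computes the compact solution via the left pseudo-inverse and confirms that it matches solving the p column-wise least-square problems independently:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(40, 3))   # m x n (illustrative data)
B = rng.normal(size=(40, 2))   # m x p: two right-hand sides

# Compact solution: X_hat = (A^T A)^{-1} A^T B, i.e. the left pseudo-inverse applied to B.
X_compact = np.linalg.pinv(A) @ B

# Equivalent column-by-column view: p independent ordinary least-square problems.
X_columns = np.column_stack([
    np.linalg.lstsq(A, B[:, k], rcond=None)[0] for k in range(B.shape[1])
])

print(np.allclose(X_compact, X_columns))   # True
```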

Make sure to check out the second and the third parts after finishing part 1. Part 2 discusses the multi-objective least-square. Part 3 discusses the constrained least square problem.

Comprehensive Study of Least Square Estimation (Part 2)

Comprehensive Study of Least Square Estimation (Part 3)

Conclusion

In this article, I discussed the ordinary least square problem, the row and column interpretations of the problem, the orthogonality principle, and the OLS solution using QR factorization. In the next articles, I will go over the multi-objective and constrained least-square problems. I discussed OLS before in my previous article on the essential parameter estimation techniques (linked above). I also have a complete series of videos on numerical linear algebra and optimization on my YouTube channel, in which I discuss all these topics in detail.

