
Essential Parameter Estimation Techniques in Machine Learning and Signal Processing

Bayesian vs. Frequentist Parameter Estimation


Photo by Jose Llamas on Unsplash

Parameter estimation plays a vital role in Machine Learning, statistics, communication systems, radar, and many other domains. For example, in a digital communication system, you sometimes need to estimate the parameters of the fading channel, the variance of the AWGN (additive white Gaussian noise), IQ (in-phase, quadrature) imbalance parameters, the frequency offset, etc. In machine learning and statistics, you constantly need to estimate and learn the parameters of probability distributions. For example, in Bayesian and causal networks, this corresponds to estimating the CPTs (conditional probability tables) for discrete nodes and the means and variances for continuous nodes. In this article, I will discuss essential parameter estimation techniques used widely in machine learning, AI, signal processing, and digital communication.

Following is the outline for this article:

Outline

  • Frequentists vs. Bayesian
  • Maximum Likelihood (ML) Estimation
  • Maximum a Posteriori (MAP) Estimation
  • Minimum Mean Square Error (MMSE) Estimation
  • Least Square (LS) Estimation
  • Bayes Estimator
  • Properties of the Estimators
Photo by Nature Uninterrupted Photography on Unsplash

Frequentists vs Bayesian Approach

Frequentists and Bayesians are two well-known schools of thought in statistics. They take different approaches to defining statistical concepts such as probability and to performing parameter estimation.

Frequentists define probability as a relative frequency of an event in the long run, while Bayesians define probability as a measure of uncertainty and belief for any event.

Furthermore, the frequentists assume the parameter θ in a population is fixed and unknown. They only use data to construct the likelihood function to estimate the unknown parameter. Bayesians, on the other hand, consider the parameter θ to be a random variable with an unknown distribution. They use both the prior probability and data to construct the posterior distribution.

To better understand the difference between these two, consider the well-known Bayes' law:
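P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}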

P(θ) is the prior belief you have about the parameter before collecting (observing) any data, P(X|θ) is the likelihood function (the probability of observing the data given the parameter), P(θ|X) is the posterior distribution (the belief about the parameter θ after you observe the data), and P(X) is the probability of the data. The Bayesian approach is more computationally intensive than the frequentist approach because of the denominator of Bayes' law: that integral is usually taken over a high-dimensional space, and it may have no closed-form solution or be very expensive to compute. Another problem with the Bayesian approach is the subjective prior P(θ), since in most real-world problems one has no idea what the best prior belief would be. However, the Bayesian approach lets you incorporate the prior belief into your model, which can be beneficial if, for example, domain knowledge gives you a good model for the prior probability.

Photo by Ryan Hafey on Unsplash

Maximum Likelihood (ML) Estimation

ML estimation tries to find the estimate of the parameter θ by maximizing the likelihood function.

Assume we have i.i.d. random samples x₁, x₂, . . ., xₙ with joint density f(x₁, x₂, . . ., xₙ; θ), which depends on the unknown parameter θ. The goal is to estimate this unknown quantity such that it maximizes the probability of observing this random sample. Here is how you formulate the problem using maximum likelihood (a short code sketch follows these steps):

  1. Construct the likelihood function L(θ|x) = f(x₁,x₂, . . .,xₙ;θ).
  2. Use the i.i.d. property of the random samples to break the joint PDF into the product of n marginal PDFs.
  3. Take the logarithm (usually the natural log) to change the product into a summation. This does not change the optimal estimator, since the logarithm is a monotonic function.

Here, LL(θ;x) = log L(θ;x) denotes the log-likelihood function.

  4. Differentiate the log-likelihood function with respect to θ and set it to zero.
  5. The resulting estimator is a function of the observed data only.
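As a quick illustration of these steps, here is a minimal Python sketch that estimates the mean and standard deviation of a Gaussian sample, once by numerically maximizing the log-likelihood and once with the closed-form ML estimates (the true values 2.0 and 1.5, the sample size, and the use of scipy are arbitrary choices for this sketch).

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # i.i.d. samples; true mu=2.0, sigma=1.5 (arbitrary)

def neg_log_likelihood(params, data):
    mu, log_sigma = params                        # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])

print(f"numerical ML : mu = {mu_ml:.3f}, sigma = {sigma_ml:.3f}")
print(f"closed form  : mu = {x.mean():.3f}, sigma = {x.std():.3f}")   # ML variance uses 1/n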
Photo by Jon Flobrant on Unsplash

Maximum a Posteriori (MAP) Estimation

MAP estimation tries to find the estimate of the parameter θ by maximizing the posterior distribution. Recall Bayes' law again, but this time we are not trying to compute the exact value of the posterior. This is important because we do not need to worry about the denominator: it is independent of the parameter. Therefore, all that is needed is to maximize the product of the likelihood and the prior probability.

Remember, P(X|θ) is the same as the likelihood function. Therefore, the MAP estimate is the same as the ML estimate with the inclusion of the prior probability. Following is how you formulate the problem using the MAP:

  1. Construct the posterior distribution.
  2. Use the i.i.d. property of the random samples to simplify the posterior distribution.

Note: We slightly abuse notation by using the letter P for the density function.

  3. Take the logarithm (usually the natural log) to further simplify the above relationship.
  4. Differentiate the above equation with respect to θ and set it to zero.

Note: The MAP and ML estimates are the same if θ follows a uniform distribution. Intuitively, this means all values of θ have equal weight; therefore, knowing the distribution of θ does not give us any additional useful information.

Example 1: Consider a communication system in which the transmitted signal X ~ N(0,σₓ²) (a Gaussian distribution with zero mean and variance σₓ²). The received signal Y can be modeled as Y = X + n, where n ~ N(0,σₙ²). We would like to find the MAP and ML estimates of the transmitted signal X.

ML Estimation:

Y is the received or observed message; you can think of Y as a noisy version of X. This is a standard problem in communication systems: we never know exactly which message was transmitted (otherwise, there would be no point in designing receivers, error-correction codes, etc.). To find the ML estimate, we follow the steps outlined above: first construct the likelihood function P(Y|X=x), then form the log-likelihood, differentiate it with respect to x, and set it to zero.
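Since Y given X = x is Gaussian with mean x and variance σₙ²:

\log P(Y = y \mid X = x) = -\frac{(y - x)^2}{2\sigma_n^2} + C_1, \qquad
\frac{\partial}{\partial x}\,\log P(y \mid x) = \frac{y - x}{\sigma_n^2} = 0
\;\Rightarrow\; \hat{x}_{ML} = y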

C₁ denotes the constant terms that do not depend on x (we do not care about them since they vanish after differentiation).

Interpretation: The ML estimate of x is the observed message y. This means under the maximum likelihood estimation, the best estimate for the transmitted signal is the received noisy signal.

MAP Estimation:

First, we need to construct the posterior distribution by multiplying the likelihood and prior together (Remember, the denominator is not important since it is not a function of the parameter).

The posterior is maximized when the quadratic term in the exponent is minimized. Differentiating with respect to x and setting the derivative to zero results in the following:
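\frac{d}{dx}\left[\frac{(y - x)^2}{2\sigma_n^2} + \frac{x^2}{2\sigma_x^2}\right] = 0
\;\Rightarrow\; \hat{x}_{MAP} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_n^2}\, y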

Interpretation: The MAP estimate of x is linearly proportional to the received signal y. If the variance of the signal goes to infinity (becomes very large), the normal prior approaches a uniform distribution, and the MAP and ML estimates become the same.

To understand this concept better, let's look at a simulation of the estimated x under ML and MAP as the variance of the transmitted signal changes while the variance of the noise is held constant at σₙ² = 10.

As the table and figure show, as the variance of the signal increases, the Gaussian prior becomes more similar to a uniform distribution, and the ML and MAP estimates become closer to each other. For example, when the signal variance is 10 times the noise variance (the red curve), the ML and MAP estimates are almost identical.
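The trend in that comparison can be reproduced with a short simulation sketch like the one below (the swept variance values are illustrative, not necessarily those used for the original figure): for a received value y, the ML estimate is y itself, while the MAP estimate shrinks y by the factor σₓ²/(σₓ² + σₙ²), and the gap between the two vanishes as σₓ² grows relative to σₙ² = 10.

import numpy as np

sigma_n2 = 10.0                                        # noise variance (fixed, as in the text)
y = np.linspace(-10, 10, 5)                            # a few received values (arbitrary)

for sigma_x2 in [1.0, 10.0, 100.0]:                    # signal variance sweep (illustrative)
    x_ml = y                                           # ML estimate: the received value itself
    x_map = (sigma_x2 / (sigma_x2 + sigma_n2)) * y     # MAP estimate: shrunk toward 0
    print(f"sigma_x^2 = {sigma_x2:6.1f} | max |ML - MAP| = {np.max(np.abs(x_ml - x_map)):.3f}")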

Photo by Jason Blackeye on Unsplash

Minimum Mean Square Error (MMSE) Estimation

MMSE is one of the most well-known estimation techniques used widely in machine learning and Signal Processing. For example, Kalman and Wiener filters are both examples of MMSE estimation.

In MMSE, the objective is to minimize the expected value of the squared residual, where the residual is the difference between the true value and the estimated value. The expected squared residual is also known as the MSE (Mean Square Error).

Following is a procedure to solve any MMSE estimation:

  1. Define the estimator.
  2. Construct the MSE (Expected residual square).
  3. Differentiate the MSE with respect to the parameter and set it to zero.
  4. Plug the MMSE estimator from step 3 into the MSE expression to find the minimum residual square.

Example 2: As a first example, we would like to find the best constant estimator y of the random variable X and the corresponding MSE.
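Writing the MSE as a function of the constant y and minimizing it:

J(y) = E[(X - y)^2], \qquad \frac{dJ}{dy} = -2\,E[X] + 2y = 0 \;\Rightarrow\; \hat{y} = E[X] = \mu, \qquad J_{\min} = E[(X - \mu)^2] = \mathrm{Var}(X)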

Interpretation: The best constant estimator of X is the expected value of X (μ). The minimum MSE using the optimal estimator is the variance of X.

Most of the time, we are interested in finding the MMSE estimator after observing some data or evidence. For example, assume we are interested in finding the best estimator for random variable Y after observing random variable X. It can be shown that the MMSE estimator in this case is:
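\hat{Y} = E[Y \mid X = x]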

Recall that E[Y|X=x] is a function of x; in general, it is nonlinear and can be very complex. Therefore, in practice, we mainly consider the class of linear MMSE estimators of the following form:
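\hat{Y} = a_1 X + a_2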

To find the optimal a₁ and a₂ and the resulting MSE, follow steps 1–4 outlined above:
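\frac{\partial E[\epsilon^2]}{\partial a_1} = -2\,E\big[(Y - a_1 X - a_2)\,X\big] = 0, \qquad
\frac{\partial E[\epsilon^2]}{\partial a_2} = -2\,E\big[Y - a_1 X - a_2\big] = 0

where ϵ = Y − (a₁X + a₂) is the residual.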

Now there are two equations and two unknowns, which can be solved in many different ways. For example, we can formulate the problem as follows:
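\begin{bmatrix} E[X^2] & E[X] \\ E[X] & 1 \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} =
\begin{bmatrix} E[XY] \\ E[Y] \end{bmatrix}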

Solving the above matrix equation results in the following estimates for the coefficients:
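a_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad a_2 = E[Y] - a_1 E[X]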

Substituting the optimal coefficients in the E[ϵ²] expression will result in the optimal (minimum) MSE:
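E[\epsilon^2]_{\min} = \mathrm{Var}(Y)\,(1 - \rho^2)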

Interpretation: If X and Y are independent, the covariance between them is zero, and so is the estimate of a₁; the problem then reduces to estimating a random variable with a constant (a₂). The above results confirm this, since the estimate of a₂ = E[Y] and E[ϵ²] = var(Y) (the same results as estimating a random variable with a constant). ρ is the correlation coefficient between X and Y and can vary between -1 and 1. If the two random variables are perfectly correlated (ρ is either +1 or -1), the MSE is zero, which means one variable can perfectly estimate the other.
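To make these formulas concrete, here is a small simulation sketch (the coefficients 2 and 3 and the noise standard deviation 2 are arbitrary choices): it draws correlated pairs (X, Y), computes the sample versions of a₁ and a₂, and checks that the resulting MSE matches var(Y)(1 − ρ²).

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Correlated pair: Y = 2*X + 3 + noise (ground truth chosen arbitrarily for the sketch)
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + 3.0 + rng.normal(0.0, 2.0, n)

# Sample counterparts of a1 = Cov(X, Y)/Var(X) and a2 = E[Y] - a1*E[X]
a1 = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
a2 = y.mean() - a1 * x.mean()

y_hat = a1 * x + a2
mse = np.mean((y - y_hat) ** 2)
rho = np.corrcoef(x, y)[0, 1]

print(f"a1 = {a1:.3f}, a2 = {a2:.3f}")                        # close to 2 and 3
print(f"empirical MSE        = {mse:.3f}")                    # close to the noise variance 4
print(f"var(Y) * (1 - rho^2) = {np.var(y) * (1 - rho**2):.3f}")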

Orthogonality Principle:

The orthogonality principle states that the optimal estimator's residual is perpendicular (orthogonal) to the estimator.

Source: http://fourier.eng.hmc.edu/e59/lectures/signalsystem/node9.html

In the above figure, X hat is an estimator of X and X tilde is the residual. Mathematically, this can be represented as follows:
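E\big[\tilde{X}\,\hat{X}\big] = E\big[(X - \hat{X})\,\hat{X}\big] = 0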

For example, applying the orthogonality principle to the linear MMSE estimator results in:
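E\big[(Y - (a_1 X + a_2))\,X\big] = 0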

This is exactly the same condition obtained by setting the derivative of the MSE with respect to a₁ to zero.

Note: The residual is perpendicular to the plane containing the estimator; therefore, it is perpendicular to every vector in that plane. This is why, in the above equation, we can replace a₁X + a₂ with X. However, it is also valid to use a₁X + a₂ instead of X.

Photo by MIKHAIL VASILYEV on Unsplash

Least Square (LS) Estimation

Least square (LS) estimation is one of the most common estimation techniques. It is used in communication systems to estimate the channel response, in machine learning as a loss or cost function for regression and classification problems, and in optimization to find the line or hyperplane that best fits the data. Regardless of the application, the steps to solve a least square problem are roughly identical. The best way to explain these steps is through an example.

Example 3: Consider you are given a dataset (xₖ, yₖ) k = 1,2, . . . n and you are asked to find the line that best describes the relationship between yₖ and xₖ.

Following are the steps to formulate the least square problem:

  1. Define the model depending on your data. For example, y = ax + b + ϵ, where b is the value of y when x = 0 (the y-intercept), a is the slope, and ϵ (the residual) models the random fluctuation and has mean 0 and variance σ².
  2. Formulate the least square problem by minimizing the expected residual square.

But wait a minute: this is exactly the problem formulation for the MMSE discussed in detail in the previous section.

  3. Differentiate the objective function with respect to a and b and set it to zero.

This is the same result as the linear MMSE. Therefore, replacing the expectations with sample averages, the estimates are:
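\hat{a} = \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sum_{k=1}^{n}(x_k - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\,\bar{x}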

The above figure shows data and the least square fitted line calculated using the estimates of a and b.

Note: The least-square estimator is a special case of the MMSE (namely the linear MMSE).

Least Square Approximation

Now, what if X is an m-dimensional vector and Y is an n-dimensional vector? Then A is an n-by-m matrix. The least-square approximation is concerned with finding the best solution to Y = AX. This is an important problem because sometimes Y is not in the column space of A, so Y = AX has no exact solution; in that case, we look for the solution that minimizes the norm of the difference between Y and AX.

Note: The following relationships hold true:
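J = \|Y - AX\|^2 = (Y - AX)^T(Y - AX) = Y^T Y - Y^T A X - X^T A^T Y + X^T A^T A X, \qquad
X^T A^T Y = (Y^T A X)^T = Y^T A X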

Recall that the transpose of a scalar quantity is the scalar itself; this is why the last relationship holds.

Now to find the optimal X, we need to differentiate J with respect to X and set it to zero.
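Setting the gradient to zero:

\frac{\partial J}{\partial X} = -2 A^T Y + 2 A^T A X = 0 \;\Rightarrow\; A^T A \hat{X} = A^T Y \;\Rightarrow\; \hat{X} = (A^T A)^{-1} A^T Y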

Note: If AᵀA (the product of A transpose and A) is invertible, we expect the estimator to be unique, and it is called the least square solution. If the inverse does not exist, we can replace it with the pseudoinverse. The last equation above, which gives the estimate of X, is also known as the normal equation.
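As a sketch of the normal equation in code (the synthetic line y = 1.5x + 4 and the noise level are arbitrary choices), we stack the model y = ax + b into the matrix form Y = AX with A = [x 1], solve the normal equation, and compare the result with NumPy's built-in least-squares solver, which also handles the pseudoinverse case.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 1.5 * x + 4.0 + rng.normal(0.0, 1.0, x.size)   # noisy line; a=1.5, b=4 chosen arbitrarily

A = np.column_stack([x, np.ones_like(x)])          # design matrix for y = a*x + b

# Normal equation: X_hat = (A^T A)^{-1} A^T y
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Same problem via the built-in least-squares solver (pseudoinverse-based)
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("normal equation:", theta_normal)   # approximately [1.5, 4.0]
print("np.linalg.lstsq:", theta_lstsq)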

Photo by Gilly Stewart on Unsplash

Bayes Estimator

In Bayesian estimation, the parameter θ is modeled as a random variable (Recall Bayesian vs Frequentists section in this article) with a certain probability distribution. The MMSE, LAE (Least Absolute Error), and MAP are all special types of Bayes estimators.

Define the cost or loss function C as the cost of choosing the estimator instead of the true parameter (think of it as how much you lose by using the estimator in place of the true parameter). In Bayes estimation, we minimize the expected loss given the observed data x. This can be defined mathematically as follows:
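\hat{\theta} = \arg\min_{\hat{\theta}}\; E\big[C(\theta, \hat{\theta}) \mid x\big] = \arg\min_{\hat{\theta}} \int C(\theta, \hat{\theta})\, P(\theta \mid x)\, d\theta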

P(θ|x) is the posterior distribution (Refer to the first section in this article) and can be calculated using the Bayes law.

The procedure to solve the Bayes estimation problems is as follows:

  1. Choose a suitable cost function C.
  2. After defining the cost function and simplifying the integrand, differentiate the expression and set it to zero to find the estimator.

The cost function can take many different forms; however, the most well-known are the quadratic and absolute cost functions. For the rest of this section, we will derive the Bayes estimator for the quadratic, absolute, and 0–1 cost functions.

Quadratic Cost Function

If the cost function is quadratic, C is replaced by the squared error between the parameter and its estimate, and we follow the procedure outlined above:
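J(\hat{\theta}) = \int (\theta - \hat{\theta})^2\, P(\theta \mid x)\, d\theta, \qquad
\frac{dJ}{d\hat{\theta}} = -2 \int (\theta - \hat{\theta})\, P(\theta \mid x)\, d\theta = 0
\;\Rightarrow\; \hat{\theta} = \int \theta\, P(\theta \mid x)\, d\theta = E[\theta \mid x]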

Interpretation: The estimator is the conditional expectation of the parameter given the data, i.e., the posterior mean, which is identical to the MMSE result. Therefore, the MMSE estimator is the Bayes estimator when the cost function is quadratic.

Absolute Cost Function

In this case, C is replaced by the absolute error function. The estimator is calculated similarly:

Therefore, for the posterior distribution, the integral from -∞ to the estimate equals the integral from the estimate to ∞. We also know that the integral over the entire domain is 1 (the total probability is 1). Therefore:
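\int_{-\infty}^{\hat{\theta}} P(\theta \mid x)\, d\theta = \int_{\hat{\theta}}^{\infty} P(\theta \mid x)\, d\theta = \frac{1}{2}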

Interpretation: The best estimator under the absolute error cost function is the median of the posterior distribution; this is what the one-half represents in the above equation. The estimator under this cost function is known as the LAE (Least Absolute Error) estimator.

Zero-One Cost Function

In this case, C is zero in a small interval around the estimate and 1 otherwise.
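Formally, for a small width δ > 0 (an auxiliary quantity used only in the derivation):

C(\theta, \hat{\theta}) = \begin{cases} 0, & |\theta - \hat{\theta}| \le \delta \\ 1, & \text{otherwise} \end{cases}, \qquad
J(\hat{\theta}) = 1 - \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} P(\theta \mid x)\, d\theta \approx 1 - 2\delta\, P(\hat{\theta} \mid x)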

Interpretation: Minimizing J is equivalent to maximizing the posterior distribution. Therefore, the estimator under the 0–1 loss function is the mode of the posterior distribution (the value of θ that maximizes the posterior). This is exactly what MAP estimation does, which means the MAP estimator is a Bayes estimator when the cost function is 0–1.

Photo by v2osk on Unsplash

Properties of the Estimators

Estimators possess some properties that distinguish them from each other.

Unbiased Estimator:

An estimator is said to be an unbiased estimator of parameter θ if its expected value is equal to θ. Mathematically this is represented as follows:
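E[\hat{\theta}] = \theta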

The bias of an estimator is defined to be:
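\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta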

Cramer-Rao Bound

It is sometimes referred to as the CRLB (Cramer-Rao lower bound) and is a lower bound on the variance of an estimator. The lower the variance of an estimator, the more certain one can be about the range of values it can take. The CRLB is calculated as follows:
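\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}, \qquad I(\theta) = -E\!\left[\frac{\partial^2 LL(\theta; x)}{\partial \theta^2}\right]

for an unbiased estimator, where I(θ) is the Fisher information.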

Intuition: The following curves represent Gaussian density functions with mean 0 and different variances. The lower the variance, the narrower the density function and the higher the confidence in the estimate; hence, the estimate of the parameter is more accurate.

All the above curves have negative curvature around their peaks, meaning the slope is decreasing there (the second derivative is negative). However, the green curve (σ² = 1) has a much sharper rate of change of the slope than the blue and red curves. This is the intuition behind the second derivative with respect to θ in the CRLB formulation.

Consistent Estimator

An estimator is said to be consistent if it converges in probability to the true parameter as the number of samples approaches infinity.

Intuitively, consistency implies that the estimator becomes more concentrated around θ as the sample size increases.

Bias-Variance Trade-off

Bias-variance trade-off is one of the most well-known concepts used in machine learning and statistics. The idea is that you can express the MSE as the sum of bias square and the variance.
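MSE(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big] = \big(E[\hat{\theta}] - \theta\big)^2 + \mathrm{Var}(\hat{\theta}) = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})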

Therefore, for a fixed MSE, lowering the bias increases the variance; hence, there is a trade-off between these two quantities.

Minimum Variance Unbiased Estimator

If an estimator is unbiased and has the lowest variance among all unbiased estimators of θ, it is called the MVUE. By the bias-variance trade-off, the MSE of the MVUE is equal to the variance of the estimator, since the bias is zero.

Example 4: Consider x₁,x₂, . . ., xₙ, n i.i.d. random samples drawn from a normal distribution with mean μ and variance σ². Find the maximum likelihood estimate of μ and verify whether the estimator is unbiased, consistent, and achieves the CRLB.

Follow the steps outlined in the ML section:
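LL(\mu; x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad
\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
\;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}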

Therefore the ML estimate of μ is the sample mean. Now we need to compute the bias and CRLB as well as check if the estimator is consistent.

Bias:
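E[\hat{\mu}_{ML}] = E\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \mu \;\Rightarrow\; \mathrm{Bias}(\hat{\mu}_{ML}) = 0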

Therefore the sample mean is an unbiased estimator.

Consistency:

To check the consistency we need to find the distribution of the sample mean.

Therefore, the sample mean is normally distributed with mean μ and variance σ²/n. Now, to check for consistency, we let n go to infinity, which causes the variance to go to zero; hence, the sample mean converges to the constant μ.

Variance and CRLB:

To find the CRLB proceed as follows:

  1. Compute the log-likelihood.
  2. Compute the second derivative of the log-likelihood function with respect to the parameter.
  3. Take the expectation of part 2, then invert the result and multiply by -1.

Based on the above maximum likelihood calculation, we already have the result for the first derivative of log-likelihood (Denoted by LL).
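Differentiating once more and applying the CRLB recipe:

\frac{\partial^2 LL}{\partial \mu^2} = -\frac{n}{\sigma^2} \;\Rightarrow\;
CRLB = \frac{-1}{E\!\left[\partial^2 LL / \partial \mu^2\right]} = \frac{\sigma^2}{n} = \mathrm{Var}(\bar{x})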

Therefore, the CRLB and the variance are the same. This means that the sample mean has the lowest variance among all unbiased estimators of μ. Since the sample mean is unbiased and achieves the CRLB, it is the MVUE.

MSE:

According to the bias-variance trade-off, since the bias is zero, the MSE is equal to the CRLB (i.e., the variance in this case).
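A quick numerical check of these conclusions (a sketch with arbitrary values μ = 5 and σ² = 4): repeated sampling shows that the sample mean is centered at μ (unbiased), that its variance matches the CRLB σ²/n, and that it concentrates around μ as n grows (consistency).

import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 5.0, 4.0            # illustrative true parameters (arbitrary)
n_trials = 5_000                 # number of repeated experiments

for n in [10, 100, 1000]:
    samples = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, n))
    mu_hat = samples.mean(axis=1)                 # ML estimate: the sample mean
    print(f"n = {n:5d} | mean of estimates = {mu_hat.mean():.4f} "
          f"| var of estimates = {mu_hat.var():.5f} | CRLB sigma^2/n = {sigma2 / n:.5f}")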

Conclusion

In this article, I discussed the difference between the Bayesian and frequentist approaches, covered several well-known estimation techniques used widely in machine learning and signal processing, and examined some important properties of estimators.

