Absolute difference between sample and population standard deviation plotted against sample size (Image by Author)

The Consistent Estimator

A guide for the regression modeler

Sachin Date
Towards Data Science
9 min read · Jul 24, 2021


A consistent estimator is one which produces a better and better estimate of whatever it is that it’s estimating, as the size of the data sample it is working upon goes on increasing. This improvement continues to the limiting case when the size of the data sample becomes as large as the population, at which point the estimate becomes equal to the true value of the parameter.

Consistency is one of the properties of an estimator, along with other properties such as bias, mean squared error, and efficiency.

Let’s illustrate the concept with a real world data set. We’ll use the same data that we used in the previous article, namely, ocean surface temperatures in the Northeast Atlantic:

North East Atlantic Real Time Sea Surface Temperature data set downloaded from data.world under CC BY 4.0

Let’s load the data set into memory using the Python based Pandas library, and we’ll clean it by removing all the missing values:

import pandas as pd

df = pd.read_csv('NE_Atlantic_Sea_Surface_Temperatures.csv', header=0, infer_datetime_format=True, parse_dates=['time(UTC)'])
df = df.dropna()

Let’s print out the data set:

print(df)
The cleaned up data set with all NaN rows removed (Image by Author)

The cleaned-up data set contains almost 800K data points, which is pretty big. Let’s consider this data set as the population of values. So in this case, we can say that we have access to the population, although in real life, we would always have to deal with a sample of values, and we would never know the full extent of the population.

Let’s calculate and print out the mean and standard deviation of the ‘population’:

pop_mean = df['sea_surface_temperature(degree_C)'].mean()
pop_stddev = df['sea_surface_temperature(degree_C)'].std(ddof=0)
print('Population mean (mu)=' + str(pop_mean))
print('Population standard deviation (sigma)=' + str(pop_stddev))

Here are the values:

Population mean (mu)=11.94113359335031
Population standard deviation (sigma)=1.3292003777893815

Sampling with replacement

To illustrate the concept of estimator consistency, we’ll draw a randomly selected sample [y_1, y_2, …y_i,…,y_n] of size 100 (i.e. n=100) from this ‘population’ of values. We will draw the sample using a technique called sampling with replacement: we randomly draw out the first data point y_1, note down its value, and put it back into the population. We’ll repeat this procedure for all n values.

Sampling with replacement can yield duplicates and therefore it’s not always a practical sampling technique. For instance, imagine that you are selecting volunteers for a clinical trial. If you use sampling with replacement, you could in theory enroll the same person multiple times, which is clearly absurd. However, if your population of values is very large, choosing the same data point more than once is extremely unlikely even with replacement.

The big advantage of using sampling with replacement is that it lets each variable y_i of the sample be treated as an independent, identically distributed (i.i.d.) random variable, an assumption that can simplify a lot of analysis. Although truly i.i.d. variables are practically impossible to encounter in real life, the i.i.d. assumption, ironically, underpins several foundational results in statistical science.
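For instance, here’s how one can draw a with-replacement sample in Pandas (a quick sketch, assuming the df loaded earlier; the random_state value is arbitrary):

#Draw a with-replacement sample of 5 rows; the same row can appear more than once
random_sample = df.sample(n=5, replace=True, random_state=42)
print(random_sample)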

After our little detour into sampling land, let’s get back to our discussion on estimator consistency.

For each sample [y_1, y_2, …y_i,…,y_n], we’ll use the following formulae of sample mean (y_bar) and sample standard deviation (s) as estimators of the population mean µ and standard deviation σ:

Estimators of the population mean and standard deviation (Image by Author)
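Written out, with the sum running over i=1 to n (the n−1 denominator in s matches the default behaviour of Pandas’ .std() used later):

y_bar = (1/n) Σ y_i

s = √( Σ (y_i − y_bar)² / (n−1) )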

We’ll calculate the estimates of the population mean (µ) and standard deviation (σ) using the above formulae on a sample of size 100. Next, we’ll increase the sample size by 100 and repeat the estimation of µ and σ, and we’ll continue to do this until the sample size n approaches the population size N=782668.

Here’s the Python code:

import matplotlib.pyplot as plt
from tqdm import tqdm

increment = 100

#Define arrays to store the sample sizes and the absolute differences between sample and population statistics
sample_sizes = []
sample_means_deltas = []
sample_stddevs_deltas = []

#Increase the sample size by 100 in each iteration, and use tqdm to show a progress bar while we are at it
for sample_size in tqdm(iterable=range(10, len(df), increment), position=0, leave=True):
    #Select a random sample of size=sample_size, with replacement
    random_sample = df.sample(n=sample_size, replace=True)
    #Calculate the sample mean and sample standard deviation (Pandas' .std() uses ddof=1 by default)
    y_bar = random_sample['sea_surface_temperature(degree_C)'].mean()
    s = random_sample['sea_surface_temperature(degree_C)'].std()
    #Store the absolute differences between the sample and population values
    sample_sizes.append(sample_size)
    sample_means_deltas.append(abs(y_bar - pop_mean))
    sample_stddevs_deltas.append(abs(s - pop_stddev))

#Plot |y_bar-mu| versus sample size
plt.plot(sample_sizes, sample_means_deltas)
plt.xlabel('Sample size (n)')
plt.ylabel('|y_bar-mu|')
plt.show()

#Plot |s-sigma| versus sample size
plt.plot(sample_sizes, sample_stddevs_deltas)
plt.xlabel('Sample size (n)')
plt.ylabel('|s-sigma|')
plt.show()

We see the following two plots:

Absolute difference between sample and population means plotted against sample size (Image by Author)
Absolute difference between sample and population standard deviation plotted against sample size (Image by Author)

In both cases, observe that the absolute difference between the estimate and the true value of the parameter progressively reduces as the sample size increases.

Also, notice that the absolute value of the difference between the sample and population means does not become zero even when the sample size (n) equals the population size N=782668. This might seem counter-intuitive. Why would the sample mean not be exactly equal to the population mean if the sample is as large as the population? The answer lies in recalling that we are using the sampling-with-replacement technique for generating a sample. When this technique is used on a finite-sized population, the ‘replacement’ aspect causes the sample to contain many duplicate values, even when the sample size equals the population size. Therefore, even in the case of n=N, the sample is never identical to the population.
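In fact, the probability that any given data point is never picked in N with-replacement draws is (1 − 1/N)^N, which approaches 1/e ≈ 0.368 for large N, so roughly 37% of the population is expected to be absent from such a sample. A quick simulation sketch (drawing index positions rather than the data itself) bears this out:

import numpy as np

N = 782668
#Simulate N with-replacement draws by picking N random index positions
picks = np.random.randint(0, N, size=N)
#Fraction of distinct positions picked: ~0.632, i.e. about 1 - 1/e
print(len(np.unique(picks))/N)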

The Consistent Estimator

It’s not just happenstance that the estimators for the population mean and standard deviation seem to converge to the corresponding population values as the sample size increases.

We can prove that they would always converge to the population values.

Before we prove that, let’s recollect what a consistent estimator is:

A consistent estimator is one which produces a better and better estimate of whatever it is that it’s estimating, as the size of the data sample it is working upon goes on increasing. This improvement continues to the limiting case where the estimate becomes equal to the true value of the parameter when the size of the data sample becomes as large as the population.

We can express consistency in probabilistic terms as follows:

Specification of the sample average as a consistent estimator y_bar of the population mean µ (Image by Author)

In the above equation, we are saying that no matter how infinitesimally tiny you choose some positive value ε, as the sample size n tends to infinity, the probability P(.) of the absolute difference between the average of n sample values y_bar(n) and the population mean µ being greater than ε tends to zero.
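In symbols, the condition shown above is:

lim (n→∞) P( |y_bar(n) − µ| > ε ) = 0, for every ε > 0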

A thought experiment

One can visualize the above equation using a thought experiment as follows:

Choose some tiny positive value of ε. Say ε=0.01.

  1. Start with a randomly selected (with replacement) sample of n values. Compute its average y_bar(n), subtract the population mean µ from it, take the absolute value of the difference and store it away. Repeat this procedure a thousand times to yield one thousand values of the absolute difference |y_bar(n)-µ|.
  2. Divide these 1000 differences into two sets of values as follows: the first set S1 contains the differences that are less than or equal to 0.01, i.e. |y_bar(n)-µ| ≤ ε. The second set S2 contains the values that are greater than 0.01, i.e. |y_bar(n)-µ| > ε.
  3. Compute the probability of the absolute difference being greater than 0.01. This is simply the size of the second set divided by 1000. i.e.
    P(|y_bar(n)-µ| > ε) = sizeof(S2)/1000
  4. Now increase the sample size by 100, and repeat steps 1, 2, 3 to recalculate the probability P(|y_bar(n)-µ| > ε).
  5. What you’ll find is that:
    As sample size n increases, the probability P(|y_bar(n)-µ| > ε) reduces and it gets closer and closer to zero

You’ll find that no matter how small you choose ε, you’ll still see P(|y_bar(n)-µ| > ε) approaching zero as n increases.
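Here is a minimal simulation sketch of this thought experiment, assuming the df and pop_mean computed earlier (the repetition count is kept at 200 instead of 1000, and only three sample sizes are tried, to keep the run time manageable):

epsilon = 0.01
n_repeats = 200
col = 'sea_surface_temperature(degree_C)'

for sample_size in [1000, 10000, 100000]:
    #Count how often |y_bar(n) - mu| exceeds epsilon across the repeated samples
    exceed_count = 0
    for _ in range(n_repeats):
        y_bar = df.sample(n=sample_size, replace=True)[col].mean()
        if abs(y_bar - pop_mean) > epsilon:
            exceed_count += 1
    #The printed probability estimate shrinks toward zero as sample_size grows
    print(sample_size, exceed_count/n_repeats)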

The general condition of consistency for any estimator

For any estimator θ_cap(n) used to estimate the population-level parameter θ, θ_cap(n) is a consistent estimator of θ iff:

The general condition of consistency for any estimator (Image by Author)
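In symbols:

lim (n→∞) P( |θ_cap(n) − θ| > ε ) = 0, for every ε > 0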

The average-of-n-sample-values estimator is a consistent estimator of the population mean

Recollect that at the beginning of the article we said that we can show that y_bar is a consistent estimator of µ.

To show this, we will first introduce the Bienaymé–Chebyshev inequality, which establishes the following fascinating result that applies to a wide variety of probability distributions:

Consider a probability distribution, such as the following Poisson distribution having mean µ and standard deviation σ. In the sample Poisson distribution shown below, µ=20 and σ = sqrt(20)=4.47213.

A Poisson distribution with mean rate=20 (Image by Author)

The Bienaymé–Chebyshev inequality says that the probability of the random variable X attaining a value that is more than k standard deviations away from the mean µ of the probability distribution of X is at most 1/k². It’s expressed as follows:

Bienaymé–Chebyshev inequality (Image by Author)
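In symbols:

P( |X − µ| ≥ kσ ) ≤ 1/k²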

Continuing with our example of a Poisson distributed variable X with mean µ=20 and standard deviation σ = sqrt(20)=4.47213, we get the following table:

For the Poisson random variable X ~ Poisson(µ=20), at most W% of values lie more than k standard deviations away from the mean µ (Image by Author)

k does not have to be an integer. For example, if X=26, its separation from the mean 20 is |X-µ|/σ=|26–20|/4.47213=1.34164 times the standard deviation. So, as per the Bienaymé–Chebyshev inequality, at most 100/(1.34164)² ≈ 56% of the values in the Poisson distribution of X would lie more than 1.34164 standard deviations away from the mean, i.e. be greater than 26 or less than 14.
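As a sanity check, we can compare the exact two-sided tail probability against the Chebyshev bound (a sketch, assuming SciPy is available):

from scipy.stats import poisson

mu = 20
sigma = mu**0.5
#26 lies 6 units, i.e. k=1.34164 standard deviations, away from the mean
k = 6/sigma
#Exact probability of being more than k*sigma away from the mean: X <= 14 or X >= 26
exact = poisson.cdf(14, mu) + (1 - poisson.cdf(25, mu))
print(exact)      #~0.217
print(1/k**2)     #~0.556: the bound holds, though it is loose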

So how does all this help us in proving that the average-of-n-sample-values estimator y_bar is a consistent estimator of the population mean µ?

Let’s state once again what we want to prove, alongside the Bienaymé–Chebyshev inequality:

(Image by Author)

Let’s make the following substitution in the Bienaymé–Chebyshev inequality:

Set random variable X to the sample mean (Image by Author)

If the random variable X is set as the sample mean y_bar, then the mean of the sample mean is simply the population mean µ and the variance of the sample mean can be shown to be σ²/n.
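Both facts follow from the i.i.d. assumption:

E(y_bar) = E( (1/n) Σ y_i ) = (1/n) Σ E(y_i) = (1/n)(nµ) = µ

Var(y_bar) = Var( (1/n) Σ y_i ) = (1/n²) Σ Var(y_i) = (1/n²)(nσ²) = σ²/n

The variance step uses the independence of the y_i, which lets the variance of the sum split into the sum of the variances.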

In the equation of Bienaymé–Chebyshev inequality, when we substitute y_bar for X, we keep µ intact and we replace standard deviation σ with σ/sqrt(n).

Bienaymé–Chebyshev inequality after replacing X with y_bar and σ with σ/sqrt(n) (Image by Author)
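In symbols:

P( |y_bar − µ| ≥ kσ/√n ) ≤ 1/k²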

Now let’s make a second substitution: we set ε = kσ/√n, which is the same as setting k = ε√n/σ:

(Image by Author)

We get the following:

(Image by Author)
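In symbols, with ε = kσ/√n:

P( |y_bar − µ| ≥ ε ) ≤ σ²/(nε²)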

Now we let the sample size n grow without bound, toward the theoretically infinite population size:

(Image by Author)
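In symbols:

lim (n→∞) P( |y_bar − µ| ≥ ε ) ≤ lim (n→∞) σ²/(nε²) = 0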

Evaluating the limit yields the result we were looking for: the probability tends to zero as the sample size becomes arbitrarily large, thereby proving that the average-of-n-sample-values estimator is a consistent estimator of the population mean.

Applicability to regression modeling

A regression model is usually trained on a sample, namely the training data set. After it is trained, the model’s parameters acquire a set of fitted values β_cap. If you train the model on another randomly selected sample of the same size, chances are the trained model will acquire another set of fitted values β’_cap. Training on a third sample data set will yield a third set of fitted parameter values β’’_cap, and so on. Thus, the fitted coefficients β_cap of a regression model are actually a vector of random variables which have a mean and a standard deviation. Practically, a regression model cannot be trained on the entire population of values, so β_cap can never attain the true population-level values β of the coefficients. This is where the connection with consistency comes in: if the model were to be trained on randomly selected samples of larger and larger size, the estimation procedure for β is said to be consistent if P(|β_cap − β| > ε) tends to zero as the size of the training data set tends to infinity.
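To make this concrete, here is a small synthetic sketch (illustrative only; the ‘population’ coefficients beta_0=2.0 and beta_1=3.0 are made up): we fit an ordinary least squares line on progressively larger training samples and watch the fitted values approach the true ones:

import numpy as np

rng = np.random.default_rng(seed=42)
#Hypothetical population-level coefficients
beta_0, beta_1 = 2.0, 3.0

for n in [100, 10000, 1000000]:
    x = rng.uniform(0, 10, size=n)
    y = beta_0 + beta_1*x + rng.normal(0, 5, size=n)
    #np.polyfit returns the highest-degree coefficient first: [beta_1_cap, beta_0_cap]
    beta_1_cap, beta_0_cap = np.polyfit(x, y, deg=1)
    print(n, abs(beta_0_cap - beta_0), abs(beta_1_cap - beta_1))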

References and Copyrights

Data set

North East Atlantic Real Time Sea Surface Temperature data set downloaded from data.world under CC BY 4.0

Images

All other images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

Thanks for reading! If you liked this article, please follow me to receive tips, how-tos and programming advice on regression and time series analysis.
