Empirical Distribution: Everything You Need To Know

Approximate distribution of the data, Dirac Delta function and much more

Shubham Panchal
Towards Data Science


Image Source: Author

Empirical distribution is a term that you might have seen in a number of statistics textbooks, but I discovered it in the book Probabilistic Machine Learning: An Introduction by Kevin Patrick Murphy.

The book gives a mathematical treatment to a number of topics that are otherwise stated as ‘common sense’ in most online blogs and videos. As my interest lies in the math behind ML ( and statistics ), I started reading this book and soon discovered the empirical distribution,

Image Source: A snapshot from the draft of the book ( the draft is openly available at https://probml.github.io/pml-book/book1.html )

At first glance, for a statistics beginner like me, the definition seemed completely unintuitive. I was aware of the delta function ( the Dirac Delta function ), but only from a quantum physics perspective, and its use to describe the PDF of a discrete random variable seemed impossible. PDFs are defined for continuous random variables, yet statisticians had invented PDFs for discrete quantities too.

This story provides the mathematical background on empirical distributions and doesn’t contain many real-world examples.

If you’re in the same situation as me and wish to explore empirical distributions further, read on and complete the story!

1. Let’s Start With A Dictionary

I have developed the habit of looking up words in a dictionary whenever I encounter a new concept. For empirical distributions, the word ‘empirical’ clearly means,

Definition of empirical

1: originating in or based on observation or experience

The first definition from Merriam-Webster is of great use from the perspective of statistics. We often refer to observations or experience as ‘data’ in the world of statistics and computer science. The word empirical means something that is grounded in observations or data and is not a purely theoretical construct.

In the practical use of statistics, we usually don’t have an infinite amount of data. What we have is a tiny fraction of it, popularly called samples, and we need to infer everything from that. So, things inferred from this tiny fraction of data often carry the word empirical, which precisely matches the meaning above.

The next word, distribution, depicts the probability of occurrence of events. We model the quantity of interest ( the quantity that we wish to analyze ) as a random variable and say that the random variable follows some particular probability distribution. In the case of discrete random variables, probability distributions are characterized by probability mass functions ( PMFs ). But in this story, we’re interested in probability densities, or probability density functions ( PDFs ), which are defined for continuous random variables.

Consider a quality control ( QC ) test happening at a juice factory. The QC team cannot check the quality of each and every juice carton, so they take a bunch of samples ( 10–15 cartons ). Using these samples, they try to measure the quality of the whole unit or batch ( maybe 10,000 cartons ). The whole unit/batch from which the samples are taken is called a population in statistics lingo.
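The QC scenario can be sketched in a few lines of Python. Everything below ( the sugar-content measurements, the batch size, the sample size ) is hypothetical, chosen only to illustrate estimating a population quantity from a small sample:

```python
import random

# Hypothetical batch of 10,000 juice cartons; each carton gets a made-up
# "sugar content" in grams, drawn from a Gaussian for illustration only.
random.seed(42)
population = [random.gauss(10.0, 0.5) for _ in range(10_000)]

# The QC team inspects only a handful of cartons...
sample = random.sample(population, 15)

# ...and uses the sample mean as an empirical estimate of the batch mean.
sample_mean = sum(sample) / len(sample)
population_mean = sum(population) / len(population)
```

The sample mean will typically land close to the population mean, which is exactly the kind of inference from a "tiny fraction of data" that earns the label empirical.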

2. Our Goal

Let’s get straight into our goal.

Our goal is to approximate the PDF from a given finite number of samples. The PDF is a continuous object, yet we’ll approximate it from a finite ( and hence discrete ) set of samples.

The resulting approximate PDF would characterize the distribution of the samples and not the true data distribution, and that’s the reason we’ll call it an empirical distribution.

In order to model the true distribution, you’d need an infinite number of samples.

For this transformation, from the ‘discrete’ world to the ‘continuous’ world, we’ll be using the Dirac Delta function. What we finally obtain is called the generalized probability density function.

3. Dirac Delta Function

Things might go fast from here, so I hope you’ll be able to follow along.

Consider a step-function whose step is at x = 0, like,

(1) The step function s and its plot. Image Source: Author

As you may observe, this function is not continuous at x=0 as the limits from both sides don’t match,

(2) The right-hand side and left-hand side limits of the step function s. Image Source: Author

In order to make the function continuous at x=0, why not convert the ‘step’ to a ‘ramp’, that starts from -a and ends at a, where a is some real number?

(3) The ramp function s_a and its plot. This function is continuous at x=0. Image Source: Author

If you didn’t understand how we got that term (1/2a)(x+a), in (3), try calculating the equation of that inclined line using the slope-point form.

If we calculate the limits just as we did in (2), we’ll discover that this new ramp function is continuous at x=0. As our function has now become continuous, we can try differentiating it w.r.t. x,

(4) The derivative of the ramp function s_a and its plot. Image Source: Author

But note, continuity does not imply differentiability ( the converse is true, though ): the ramp is continuous everywhere, yet it is not differentiable at x=-a and x=a, where its slope changes abruptly.
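The ramp function from (3) and its ‘boxcar’ derivative from (4) are easy to sketch in plain Python; the function names s_a and ds_a below are mine, not from the book:

```python
def s_a(x: float, a: float) -> float:
    """Ramp function (3): 0 for x <= -a, (1/(2a))(x + a) on (-a, a), 1 for x >= a."""
    if x <= -a:
        return 0.0
    if x >= a:
        return 1.0
    return (x + a) / (2 * a)

def ds_a(x: float, a: float) -> float:
    """Derivative (4) of the ramp: a 'boxcar' of height 1/(2a) on (-a, a), 0 elsewhere."""
    return 1.0 / (2 * a) if -a < x < a else 0.0
```

A quick central-difference check, `(s_a(h, 1.0) - s_a(-h, 1.0)) / (2 * h)` for small `h`, agrees with `ds_a(0.0, 1.0)`, confirming the piecewise derivative.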

Here comes the interesting part. Take the limit as a approaches 0, and we get the Dirac Delta function,

(5) The Dirac Delta function ( spike-function ) and its plot. Note, the bulge at the bottom of the ‘spike’ is intentional and should not exist at all, by the definition. Image Source: Author

This function is weird, and that’s the reason it is not a ‘function’ in the usual sense. Think of it as a unit of mass squeezed into the single point x=0: infinitely tall and infinitesimally thin. We’ll discuss more of its properties in a few moments. Next, we’ll integrate the function obtained in (4) from negative to positive infinity. This step may look pointless, as we performed differentiation in (4) and are now performing integration,

(6) The Dirac Delta function integrates to 1 from negative to positive infinity. Image Source: Author

This does look similar to what we do with a probability density function,

Image Source: Author

As the result of integration in (6) doesn’t really depend on a, we can write,

Image Source: Author
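We can verify (6) numerically: the area under the boxcar derivative stays at 1 no matter how small a gets. A minimal sketch using midpoint-rule integration ( the helper names are mine ):

```python
def boxcar(x: float, a: float) -> float:
    """Derivative of the ramp: height 1/(2a) on (-a, a), 0 elsewhere."""
    return 1.0 / (2 * a) if -a < x < a else 0.0

def integrate(f, lo: float, hi: float, n: int = 100_000) -> float:
    """Midpoint-rule numerical integration of f over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) * dx for i in range(n))

# As a shrinks, the boxcar gets taller and narrower, but its area stays 1.
areas = [integrate(lambda x, a=a: boxcar(x, a), -1.0, 1.0) for a in (0.5, 0.1, 0.01)]
```

Each entry of `areas` comes out approximately 1, which is what lets us treat the a → 0 limit as integrating to 1 as well.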

Let’s take a break and understand more about the Dirac Delta function. In order to move that spike anywhere on the X-axis, we can make a slight change,

(7) The Dirac Delta function with parameter a that decides the location on the ‘spike’ on the X-axis. Image Source: Author

You can easily derive this expression by modifying the initial step function in (1). In the quantum physics world, when we make a measurement, say of an electron’s position, we collapse the electron’s wave function. The wave function gives us the probability density of the electron’s position, but once we make a measurement, we know its position precisely, and hence the distribution of the electron’s position takes the form of a Dirac Delta function. The spike is at the position x_a where we observed the electron during the measurement,

(8) The collapse of the wave function. For more details, have a look at this video. Image Source: Author

The Dirac Delta function can also be viewed as the limiting case of a Gaussian distribution as its variance approaches zero. The spike occurs at x = μ, where μ is the mean of the Gaussian distribution,

(9) The shape of Gaussian distribution ( mean = 1 ) as the variance approaches zero. The formation of the ‘spike’ is clearly observed. Image Source: Author
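We can watch the spike of (9) form numerically: as the variance shrinks, the Gaussian’s peak height grows without bound while the total area under the curve stays 1. A small sketch ( mean fixed at μ = 1, the shrinking variances chosen arbitrarily ):

```python
import math

def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """PDF of a Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

mu = 1.0
# Peak height of the Gaussian at x = mu for progressively smaller variances.
peaks = [gaussian_pdf(mu, mu, var) for var in (1.0, 0.1, 0.01)]
# The peaks grow as the variance shrinks -- the 'spike' forming at x = mu.
```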

4. Generalized PDF ( our goal )

Let’s get back to our goal of constructing a PDF for a discrete random variable. Consider a discrete random variable X with probability mass function p( X ),

(10) The values random variable X can take. Image Source: Author

For any realization ( value that random variable X can take ) x_i, consider the following expression,

(11) Probability mass function of X multiplied by the Dirac Delta function. Image Source: Author

As we might expect, the spike of the Delta function is shifted to x = x_i, and it is also scaled by a factor of p( X=x_i ), the probability of the random variable X taking the value x_i. Note that the above expression represents a continuous entity. Next, we add such expressions over all x_i,

(12) Expression 11 summed over all x_i. Image Source: Author

This expression resembles the addition of all those scaled spikes and looks like,

(13) The PDF of variable X and its plot with ‘spikes’ at all x. Image Source: Author

Well, this is our generalized PDF, or the PDF derived from the data. The resulting probability distribution is called the empirical distribution. For it to be a valid PDF, it must integrate to 1, a property of all PDFs. When building it from raw samples, we don’t know p( X=x_i ) in advance, so we need an appropriate scaling factor for the spikes. If X can take each of its N observed values with equal probability, that factor is 1/N,

(14) The new PDF integrates to one by choosing the appropriate scaling factor. Image Source: Author
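A delta spike can’t be evaluated on a computer, but integrating the empirical PDF against any test function g reduces to a weighted sum of g at the sample points, thanks to the sifting property of the Delta function. A minimal sketch with hypothetical data:

```python
# Hypothetical observed samples; each contributes a Dirac spike scaled by 1/N.
samples = [2.0, 3.5, 3.5, 5.0]
n = len(samples)
weights = [1.0 / n] * n  # the scaling factor from (14)

def integrate_against(g):
    """Integral of g(x) * p_emp(x) dx collapses to (1/N) * sum of g(x_i)."""
    return sum(w * g(x) for w, x in zip(weights, samples))

total_mass = integrate_against(lambda x: 1.0)  # integrates to exactly 1
empirical_mean = integrate_against(lambda x: x)  # the sample mean
```

Taking g(x) = 1 recovers the normalization in (14), and g(x) = x gives the mean of the empirical distribution, which is just the sample mean.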

And that’s how you get a PDF for a discrete random variable. We also have a corresponding CDF, which looks similar to the step function we discussed earlier,

(15) The corresponding CDF of variable X. Image Source: Author
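The empirical CDF in (15) is straightforward to sketch: it is a step function that jumps by 1/N at every sample point, a stack of the step functions we started with. The data below is hypothetical:

```python
def empirical_cdf(x: float, samples: list[float]) -> float:
    """Fraction of samples less than or equal to x -- a step function in x."""
    return sum(1 for s in samples if s <= x) / len(samples)

data = [2.0, 3.5, 3.5, 5.0]  # hypothetical observed samples
# As x sweeps left to right, empirical_cdf steps from 0 up to 1,
# jumping by 1/N (here 1/4) at each sample; the repeated sample 3.5
# produces a double-height jump of 2/4.
```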

Empirical distributions may not have many direct practical applications as such, but they are of great use for proving various statements that concern the distribution of the data.

The End

I hope you enjoyed the story and the derivation. I’m glad if I made empirical distributions crystal clear in my mind as well as the readers’. Share your thoughts in the comments, and have a nice day ahead!
