Empirical Distribution: Everything You Need To Know
Approximate distribution of the data, Dirac Delta function and much more
Empirical distribution is a term you might have encountered in a number of statistics textbooks, but I discovered it in the book Probabilistic Machine Learning: An Introduction by Kevin Patrick Murphy.
The book provides mathematical treatments of a number of topics that are otherwise stated as ‘common sense’ in most online blogs and videos. As my interest lies in the math behind ML (and statistics), I started reading this book and soon discovered the empirical distribution.
At first glance, as a statistics beginner, I found this definition completely unintuitive. I was aware of the delta function (the Dirac Delta function), but only from a quantum physics perspective, and its use to describe the PDF of a discrete random variable seemed impossible. PDFs are defined for continuous random variables, yet statisticians had, in a sense, invented PDFs for discrete quantities too.
This story provides mathematical background on empirical distributions and doesn’t contain many real-world examples.
If you’re in the same situation as I was, and you wish to explore empirical distributions further, read on and complete the story!
1. Let’s Start With A Dictionary
I have developed the habit of looking up words in a dictionary whenever I encounter a new concept. For empirical distributions, the dictionary makes the word ‘empirical’ clear,
Definition of empirical
1: originating in or based on observation or experience
The first definition from Merriam-Webster is of great use from the perspective of statistics. We often refer to observations or experience as ‘data’ in the world of statistics and computer science. The word empirical means something that is concerned with observations or data and is not a theoretical construct.
In practical statistics, we usually don’t have an infinite amount of data. What we have is a tiny fraction of it, popularly called samples, and we need to infer everything from that fraction. So, things inferred from samples often carry the word empirical, which precisely matches the dictionary meaning.
The next word, distribution, depicts the probability of occurrence of events. We model the quantity of interest (the quantity that we wish to analyze) as a random variable, and we say that the random variable follows some particular probability distribution. In the case of discrete random variables, probability distributions are characterized by probability mass functions (PMFs). But in this story, we’re interested in probability density functions (PDFs), which are defined for continuous random variables.
Consider a quality control (QC) test happening at a juice factory. The QC team cannot check the quality of each and every juice carton, so they’ll take a bunch of samples (10–15 juice cartons). Using these samples, they’ll try to measure the quality of the whole unit or batch (maybe 10,000 juice cartons). The whole unit/batch from which the samples are taken is called a population in statistics lingo.
2. Our Goal
Let’s get straight into our goal.
Our goal is to approximate the PDF from a given finite number of samples. The PDF is a continuous object, and we’ll approximate it from a finite number of samples (which are discrete in nature).
The resulting approximate PDF characterizes the distribution of the samples, not the true data distribution, and that’s the reason we call it an empirical distribution.
To model the true distribution, you’d need an infinite number of samples.
For this transformation, from the ‘discrete’ world to the ‘continuous’ world, we’ll use the Dirac Delta function. What we finally get is called the generalized probability density function.
3. Dirac Delta Function
Things might move fast from here, so I hope you’ll follow along.
Consider a step-function whose step is at x = 0, like,
As you may observe, this function is not continuous at x = 0, as the limits from the two sides don’t match,
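Since the original figures were images, here is my reconstruction of equations (1) and (2) from the surrounding text: the step function and its mismatched one-sided limits.

```latex
% (1) the unit step function with its jump at x = 0
u(x) =
\begin{cases}
0, & x < 0 \\
1, & x \ge 0
\end{cases}

% (2) the one-sided limits at x = 0 disagree, so u is not continuous there
\lim_{x \to 0^{-}} u(x) = 0 \;\neq\; 1 = \lim_{x \to 0^{+}} u(x)
```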
In order to make the function continuous at x = 0, why not convert the ‘step’ to a ‘ramp’ that starts from -a and ends at a, where a is some positive real number?
If you didn’t understand how we got that term (1/2a)(x+a), in (3), try calculating the equation of that inclined line using the slope-point form.
If you calculate the limits just as we did in (2), you’ll discover that this new ramp function is continuous at x = 0. As our function is now continuous, we can try differentiating it w.r.t. x,
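Based on the description above, the ramp function and its piecewise derivative (equations (3) and (4) in the original, reconstructed here) can be written as:

```latex
% (3) the ramp: replace the jump with a linear climb from -a to a
u_a(x) =
\begin{cases}
0, & x < -a \\[4pt]
\dfrac{1}{2a}\,(x + a), & -a \le x \le a \\[4pt]
1, & x > a
\end{cases}

% (4) its derivative: a rectangle of height 1/(2a) and width 2a
\frac{d}{dx}\, u_a(x) =
\begin{cases}
\dfrac{1}{2a}, & -a < x < a \\[4pt]
0, & \text{otherwise}
\end{cases}
```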
But note: continuity does not imply differentiability (the converse is true, though). Strictly, the ramp is not differentiable at x = ±a, so the derivative above holds piecewise.
Here comes the interesting part. Take the limit as a approaches 0, and we get the Dirac Delta function,
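In symbols, the limit reads as follows (an informal description, since the delta ‘function’ is really a distribution):

```latex
\delta(x) = \lim_{a \to 0} \frac{d}{dx}\, u_a(x),
\qquad
\delta(x) =
\begin{cases}
\infty, & x = 0 \\
0, & x \neq 0
\end{cases}
```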
This function is weird, and that’s why it is not a ‘function’ by the usual definition. It is like an infinitely tall, infinitely thin spike at x = 0 that encloses a unit area. We’ll discuss more of its properties in a few moments. Next, we’ll integrate the function obtained in (4) from negative to positive infinity. This step may look pointless, as we performed differentiation in (4) and now we’re performing integration,
This does look similar to what we do with a probability density function,
As the result of integration in (6) doesn’t really depend on a, we can write,
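Carrying out that integration explicitly (my reconstruction of the steps the text refers to as (5) and (6)):

```latex
% (6) the rectangle of height 1/(2a) and width 2a always has unit area
\int_{-\infty}^{\infty} \frac{d}{dx}\, u_a(x)\, dx
= \int_{-a}^{a} \frac{1}{2a}\, dx
= 1

% compare with the defining property of any PDF p(x)
\int_{-\infty}^{\infty} p(x)\, dx = 1

% since the result is independent of a, it survives the limit a -> 0
\int_{-\infty}^{\infty} \delta(x)\, dx = 1
```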
Let’s take a break and understand more about the Dirac Delta function. To move the spike anywhere on the X-axis, we can make a slight change,
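The ‘slight change’ is a translation of the function’s argument, which I’d render as:

```latex
% moving the spike from 0 to x_a by shifting the argument
\delta(x - x_a) =
\begin{cases}
\infty, & x = x_a \\
0, & x \neq x_a
\end{cases}
```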
You can easily derive this expression by modifying the initial step function in (1). In the quantum physics world, when we make a measurement, say of an electron’s position, we are in fact collapsing the electron’s wave function. The wave function gives us the probability density of the electron’s position. But once we make a measurement, we know its position precisely, and hence the distribution of the electron’s position takes the form of a Dirac Delta function. The spike is at the position x_a where we observed the electron while taking the measurement,
The Dirac Delta function can also be viewed as the limiting case of a Gaussian distribution as its variance approaches zero. The spike occurs at x = μ, where μ is the mean of the Gaussian distribution,
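To make that limiting behaviour concrete, here is a small numerical sketch (the function name `gaussian` is mine, not from the book): as σ shrinks, the Gaussian’s total area stays at 1 while its probability mass concentrates ever more tightly around μ.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Gaussian PDF evaluated pointwise on a grid."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-5.0, 5.0, 200_001)
dx = x[1] - x[0]                       # grid spacing for a simple Riemann sum
mu = 0.0

for sigma in (1.0, 0.1, 0.01):
    p = gaussian(x, mu, sigma)
    area = p.sum() * dx                # total area: stays ~1 for every sigma
    near = np.abs(x - mu) < 0.05       # a tiny window around the mean
    mass_near_mu = p[near].sum() * dx  # approaches 1 as sigma -> 0
    print(f"sigma={sigma}: area={area:.3f}, mass near mu={mass_near_mu:.3f}")
```

The printout shows the delta-like behaviour: the area never changes, but for σ = 0.01 essentially all of it sits within ±0.05 of the mean.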
4. Generalized PDF ( our goal )
Let’s get back to our goal of constructing a PDF for a discrete random variable. Consider a discrete random variable X with probability mass function p(X),
For any realization (a value that the random variable X can take) x_i, consider the following expression,
As you might guess, the spike of the Delta function is shifted to x = x_i, but it is also scaled by a factor of p(X = x_i), the probability of the random variable X taking the value x_i. Note that the above expression represents a continuous entity. Next, we add such expressions for all x_i,
This expression resembles the addition of all those scaled spikes and looks like,
Well, this is our generalized PDF, or the PDF derived from the data. The resulting probability distribution is called the empirical distribution. Note that a bare sum of N unscaled spikes would integrate to N, not 1 (a property required of all PDFs), so each spike must carry a probability weight. If X can take each observed value with equal probability, then,
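Putting the pieces together, the generalized PDF and its equal-weight (empirical) special case can be written as follows (my reconstruction; with N samples, each spike receives weight 1/N, so the whole thing integrates to 1):

```latex
% generalized PDF: one scaled spike per realization x_i
p(x) = \sum_{i} p(X = x_i)\, \delta(x - x_i)

% empirical distribution: N samples, each with equal probability 1/N
p_{\mathrm{emp}}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i),
\qquad
\int_{-\infty}^{\infty} p_{\mathrm{emp}}(x)\, dx
= \frac{1}{N} \sum_{i=1}^{N} 1 = 1
```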
And that’s how you get a PDF for a discrete random variable. There is also a corresponding CDF, which looks similar to the step function we discussed earlier,
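As a sketch of that step-like CDF (the name `empirical_cdf` is mine): the empirical CDF jumps by 1/N at every sample, mirroring the step function we started with.

```python
import numpy as np

def empirical_cdf(samples, x):
    """P_emp(X <= x): the fraction of samples at or below each query point x."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    # compare every query point against every sample, then average over samples
    return np.mean(samples[None, :] <= x[:, None], axis=1)

samples = [2.0, 3.5, 3.5, 7.0]              # four observed data points (N = 4)
grid = [0.0, 2.0, 3.0, 3.5, 7.0, 10.0]
print(empirical_cdf(samples, grid))          # steps of 1/4 at each sample value
```

Repeated samples (3.5 appears twice here) simply produce a taller jump, which is exactly the scaled-spike picture from the PDF.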
Empirical distributions may not have many direct practical applications as such, but they are of great use in proving various statements that concern the distribution of the data.
The End
I hope you enjoyed the story and the derivation. I’m glad if I’ve made empirical distributions crystal clear in my mind as well as the readers’. Share your thoughts in the comments, and have a nice day ahead!