
Kernel Density Estimation explained step by step

Intuitive derivation of the KDE formula

Photo by Marcus Urbenz on Unsplash

Introduction

To get a sense of the data distribution, we draw probability density functions (PDFs). We are pleased when the data fit a common density function well, such as the normal, Poisson, or geometric distribution. Then the maximum likelihood approach can be used to fit the density function to the data.

Unfortunately, the data distribution is sometimes too irregular to resemble any of the usual PDFs. In such cases, the Kernel Density Estimator (KDE) provides a rational and visually pleasing representation of the data distribution.

I’ll walk you through the steps of building the KDE, relying on your intuition rather than on a rigorous mathematical derivation.

The Kernel Function

The key to understanding KDE is to think of it as a function made up of building blocks, similar to how different objects are made up of Lego bricks. The distinctive feature of KDE is that it employs only one type of brick, known as the kernel (‘one brick to rule them all’). The key property of this brick is the ability to shift and stretch/shrink. Each datapoint is given a brick, and KDE is the sum of all bricks.

KDE is a composite function made up of one kind of building block referred to as a kernel function.

The kernel function is evaluated for each datapoint separately, and these partial results are summed to form the KDE.

The first step toward KDE is to focus on just one data point. What would you do if asked to create a PDF for a single data point? To begin, assume the point is located at x = 0. The most logical approach is to use a PDF that peaks precisely over that point and decays with distance from it. The function

exp(-x²)

would do the trick.

However, because a PDF is supposed to have a unit area under the curve, we must rescale the result: the function has to be divided by the square root of 2π and stretched by a factor of √2 (3Blue1Brown provides an excellent derivation of these factors). Ultimately, we arrive at our Lego brick, known as the kernel function, which is a valid PDF:

K(x) = (1/√(2π)) · exp(-x²/2)

This kernel is equivalent to a Gaussian distribution with zero mean and unit variance.
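As a quick sanity check, here is a minimal NumPy sketch (the grid limits and resolution are arbitrary choices) that verifies numerically that this kernel has unit area, zero mean, and unit variance:

import numpy as np

# the Gaussian kernel
def K(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

# a fine grid wide enough to hold essentially all of the probability mass
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
y = K(x)

print(np.sum(y)*dx)        # area under the curve, ~1.0
print(np.sum(x*y)*dx)      # mean, ~0.0
print(np.sum(x**2*y)*dx)   # variance, ~1.0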

Let’s play with it for a while. We’ll start by learning to shift it along the x axis.

Take a single data point xᵢ – the i-th point belonging to our dataset X. The shift can be accomplished by subtracting xᵢ from the argument:

K(x - xᵢ)

To make the curve wider or narrower, we can introduce a constant h (the so-called kernel bandwidth) into the argument. It is usually written as a denominator:

K((x - xᵢ)/h)

However, as a result the area under the kernel function gets multiplied by h. Therefore, we have to restore the unit area by dividing by h:

(1/h) · K((x - xᵢ)/h)
You can choose whatever h value you want. Here is how it works: the higher the h, the wider the PDF; the smaller the h, the narrower the PDF.
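To see this in code, here is a minimal Matplotlib sketch (the point xᵢ = 0 and the bandwidth values are arbitrary choices) that plots the shifted and rescaled kernel and confirms that each curve keeps a unit area:

import numpy as np
import matplotlib.pyplot as plt

def K(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

xi = 0.0                      # a single, arbitrarily placed datapoint
x = np.linspace(-3, 3, 1201)
dx = x[1] - x[0]

for h in [0.3, 0.5, 1.0]:
    y = K((x - xi)/h)/h       # shifted, stretched and renormalized kernel
    print(f'h={h}: area = {np.sum(y)*dx:.3f}')   # ~1.0 for every h
    plt.plot(x, y, label=f'$h={h}$')

plt.legend(title='$h$')
plt.show()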

Kernel Density Estimator

Consider some dummy data to see how we can expand the method to multiple points.

# dataset
x = [1.33, 0.3, 0.97, 1.1, 0.1, 1.4, 0.4]

# bandwidth
h = 0.3

For the first data point, we simply use:

f₁(x) = (1/h) · K((x - x₁)/h)

We can do the same with the second datapoint:

f₂(x) = (1/h) · K((x - x₂)/h)

To get a single PDF for the first two points, we must combine these two separate PDFs:

f₁(x) + f₂(x)

Because we added two PDFs with unit area each, the area under the curve becomes 2. To bring it back to one, we divide it by two:

f(x) = (1/2) · (f₁(x) + f₂(x)) = 1/(2h) · [K((x - x₁)/h) + K((x - x₂)/h)]

Although the complete signature of the function f, listing the bandwidth and the datapoints it depends on, could be used for precision, we'll just write f(x) to keep the notation uncluttered.

This is how it works for two datapoints:
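The following minimal sketch, which reuses the kernel K, the first two datapoints, and the bandwidth h = 0.3 from above, builds this two-point PDF and checks that its area is indeed one:

import numpy as np
import matplotlib.pyplot as plt

def K(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

x1, x2 = 1.33, 0.3            # the first two datapoints
h = 0.3                       # bandwidth

x = np.linspace(-1, 3, 2001)
dx = x[1] - x[0]

# sum of the two shifted and rescaled kernels, divided by two
f = (K((x - x1)/h) + K((x - x2)/h))/(2*h)

print(f'area = {np.sum(f)*dx:.3f}')   # ~1.0

plt.plot(x, f, color='gray', linewidth=2.5)
plt.plot([x1, x2], [0, 0], 's', markersize=8, color='black')
plt.show()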

And the final step toward KDE is to take into account all n datapoints.

The Kernel Density Estimator is:

f(x) = 1/(nh) · Σᵢ₌₁ⁿ K((x - xᵢ)/h)

Let’s have some fun with our rediscovered KDE.

import numpy as np
import matplotlib.pyplot as plt

# the Kernel function
def K(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

# dummy dataset
dataset = np.array([1.33, 0.3, 0.97, 1.1, 0.1, 1.4, 0.4])

# x-value range for plotting KDEs
x_range = np.linspace(dataset.min()-0.3, dataset.max()+0.3, num=600)

# bandwidth values for experimentation
H = [0.3, 0.1, 0.03]
n_samples = dataset.size

# line properties for different bandwidth values
color_list = ['goldenrod', 'black', 'maroon']
alpha_list = [0.8, 1, 0.8]
width_list = [1.7, 2.5, 1.7]

plt.figure(figsize=(10,4))
# iterate over bandwidth values
for h, color, alpha, width in zip(H, color_list, alpha_list, width_list):
    total_sum = 0
    # iterate over datapoints
    for i, xi in enumerate(dataset):
        total_sum += K((x_range - xi) / h)
        plt.annotate(r'$x_{}$'.format(i+1),
                     xy=[xi, 0.13],
                     horizontalalignment='center',
                     fontsize=18,
                    )
    y_range = total_sum/(h*n_samples)
    plt.plot(x_range, y_range, 
             color=color, alpha=alpha, linewidth=width, 
             label=f'{h}')

    plt.plot(dataset, np.zeros_like(dataset) , 's', 
             markersize=8, color='black')

plt.xlabel('$x$', fontsize=22)
plt.ylabel('$f(x)$', fontsize=22, rotation='horizontal', labelpad=20)
plt.legend(fontsize=14, shadow=True, title='$h$', title_fontsize=16)
plt.show()

Here we use the Gaussian kernel, but I encourage you to try other kernels. For a review of common families of kernel functions, see the references at the end of this article. In practice, when the dataset is large enough, the type of kernel has no significant effect on the final output.
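For example, one kernel worth trying is the Epanechnikov kernel; the minimal sketch below (the function name K_epanechnikov is just an illustrative choice) could be swapped in for K in the code above:

import numpy as np

# Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], zero elsewhere
def K_epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75*(1 - u**2), 0.0)

Because this kernel has finite support, each datapoint contributes to the estimate only within one bandwidth of its location.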

KDE with Python libraries

The seaborn library employs KDE to offer nice visualizations of data distributions.

import seaborn as sns
sns.set()

fig, ax = plt.subplots(figsize=(10,4))

sns.kdeplot(ax=ax, data=dataset, 
            bw_adjust=0.3,
            linewidth=2.5, fill=True)

# plot datapoints
ax.plot(dataset, np.zeros_like(dataset) + 0.05, 's', 
        markersize=8, color='black')
for i, xi in enumerate(dataset):
    plt.annotate(r'$x_{}$'.format(i+1),
                 xy=[xi, 0.1],
                 horizontalalignment='center',
                 fontsize=18,
                )
plt.show()

Scikit-learn offers the KernelDensity estimator to do a similar job.

from sklearn.neighbors import KernelDensity

dataset = np.array([1.33, 0.3, 0.97, 1.1, 0.1, 1.4, 0.4])

# KernelDensity requires 2D array
dataset = dataset[:, np.newaxis]

# fit KDE to the dataset
kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(dataset)

# x-value range for plotting KDE
x_range = np.linspace(dataset.min()-0.3, dataset.max()+0.3, num=600)

# compute the log-likelihood of each sample
log_density = kde.score_samples(x_range[:, np.newaxis])

plt.figure(figsize=(10,4))
# put labels over datapoints
# dataset is now a 2D array, so flatten it for labeling
for i, xi in enumerate(dataset.ravel()):
    plt.annotate(r'$x_{}$'.format(i+1),
                 xy=[xi, 0.07],
                 horizontalalignment='center',
                 fontsize=18)

# draw KDE curve
plt.plot(x_range, np.exp(log_density), 
         color='gray', linewidth=2.5)

# draw boxes representing datapoints
plt.plot(dataset, np.zeros_like(dataset) , 's', 
         markersize=8, color='black')

plt.xlabel('$x$', fontsize=22)
plt.ylabel('$f(x)$', fontsize=22, rotation='horizontal', labelpad=24)
plt.show()

The scikit-learn solution has the advantage that it can also be used as a generative model to produce synthetic data samples.

# Generate random samples from the model
synthetic_data = kde.sample(100)

plt.figure(figsize=(10,4))

# draw KDE curve
plt.plot(x_range, np.exp(log_density), 
         color='gray', linewidth=2.5)

# draw boxes representing datapoints
plt.plot(synthetic_data, np.zeros_like(synthetic_data) , 's', 
         markersize=6, color='black', alpha=0.5)

plt.xlabel('$x$', fontsize=22)
plt.ylabel('$f(x)$', fontsize=22, rotation='horizontal', labelpad=24)
plt.show()
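Since the result depends so strongly on the bandwidth, one possible way to pick it, sketched below with an arbitrary search grid and cv=3 chosen to suit this tiny dataset, is to select the bandwidth that maximizes the cross-validated log-likelihood:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# search for the bandwidth with the best cross-validated log-likelihood
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    param_grid={'bandwidth': np.linspace(0.05, 1.0, 20)},
                    cv=3)
grid.fit(dataset)          # the 2D dataset array used above

print(grid.best_params_['bandwidth'])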

Conclusions

To summarize, KDE enables us to create a visually appealing PDF from any data without making any assumptions about the underlying process.

The distinguishing features of KDE:

  • it is a function made up of a single type of building block, termed the kernel function;
  • it is a nonparametric estimator, which means that its functional form is determined by the datapoints;
  • the shape of the generated PDF is heavily influenced by the value of the kernel bandwidth h;
  • no optimization technique is required to fit it to the dataset.

The application of KDE to multidimensional data is straightforward, but that is a topic for another story.


Unless otherwise noted, all images are by the author.


References

[1] S. Węglarczyk, Kernel density estimation and its application (2018), ITM Web of Conferences, vol. 23, EDP Sciences.

[2] Y. Soh, Y. Hae, A. Mehmood, R. H. Ashraf, I. Kim, Performance Evaluation of Various Functions for Kernel Density Estimation (2013), Open Journal of Applied Sciences, vol. 3, pp. 58–64.

