Kernel Density Estimation

Kernel density estimation (KDE) is a useful non-parametric estimate of a sample's underlying distribution. Being non-parametric means that no assumptions about the shape of the sample's distribution are made.

The KDE is generated by placing a kernel, e.g. a small Gaussian distribution, over each data point and then summing over all the kernels. Consider the sample below:

# Import numpy and matplotlib to generate and plot a sample
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample drawn from two Gaussian distributions
X = np.concatenate((np.random.normal(0, 1, 80),
                    np.random.normal(8, 1, 20)))[:, np.newaxis]

# Let's visualize its histogram
_ = plt.hist(X, density=True)
Sample generated as a mixture of two independent Gaussian distributions with means 0 and 8.
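
To make the sum-of-kernels construction concrete, here is a minimal hand-rolled sketch (manual_kde and xs are illustrative names, not library functions) that places a Gaussian over each data point and averages them:

# A hand-rolled KDE sketch: average a Gaussian kernel centered on each point
def manual_kde(x, data, bandwidth):
    # Standard normal pdf evaluated at the scaled distance to each data point
    kernels = np.exp(-0.5 * ((x - data) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The estimate is the mean over all the kernels
    return kernels.mean()

xs = np.linspace(-5, 10, 1000)
dens = np.array([manual_kde(x, X[:, 0], 0.5) for x in xs])
plt.plot(xs, dens)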

Using sklearn's KernelDensity, a KDE can be estimated as:

# Import KDE from sklearn
from sklearn.neighbors import KernelDensity

# Fit a kernel density estimator with a Gaussian kernel and bandwidth 0.5
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)

# Score a range of values; score_samples returns the log-density
X_range = np.linspace(-5, 10, 1000)[:, np.newaxis]
estimated_dens = kde.score_samples(X_range)

# And plot it! (exponentiate to recover the density itself)
plt.plot(X_range[:, 0], np.exp(estimated_dens))
Estimated empirical distribution of the sample above.
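
Since score_samples returns log-densities, the values were exponentiated before plotting. As a quick sanity check, the resulting density should integrate to roughly 1 over the scored range (numpy's trapezoidal rule is used here as a sketch):

# The estimated density should integrate to approximately 1
area = np.trapz(np.exp(estimated_dens), X_range[:, 0])
print(area)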

Here kernel can be changed to sum a different distribution over each data point (e.g. 'tophat' or 'epanechnikov'), and bandwidth controls the width of the kernel.
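
As a sketch of how the bandwidth choice affects the estimate (the values 0.2, 0.5, and 2.0 below are arbitrary, chosen only for illustration):

# Compare KDEs fit with different bandwidths
for bw in [0.2, 0.5, 2.0]:
    kde_bw = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    log_dens = kde_bw.score_samples(X_range)
    plt.plot(X_range[:, 0], np.exp(log_dens), label=f'bandwidth={bw}')
plt.legend()

A small bandwidth yields a spiky estimate that follows individual points, while a large one over-smooths the two modes.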
