Kernel Density Estimation

Kernel density estimation (KDE) is a useful non-parametric estimate of a sample's underlying distribution. Being non-parametric means that no assumptions about the shape of the sample's distribution are made.

The KDE is generated by placing a kernel, e.g. a small Gaussian distribution, over each data point and then summing over all the kernels. Consider the sample below:

# Import numpy and matplotlib to generate and plot a sample
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample drawn from two Gaussian distributions
X = np.concatenate((np.random.normal(0, 1, 80),
                    np.random.normal(8, 1, 20)))[:, np.newaxis]

# Let's visualize its histogram
_ = plt.hist(X, density=True)
Sample generated as a mixture of two independent Gaussian distributions with means 0 and 8.
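
To make the sum-of-kernels construction concrete, here is a minimal hand-rolled sketch (manual_kde and xs are illustrative names, not library functions) that places a Gaussian over each data point and averages them:

# A hand-rolled KDE sketch: average a Gaussian kernel centered on each point
def manual_kde(x, data, bandwidth):
    # Standard normal pdf evaluated at the scaled distance to each data point
    kernels = np.exp(-0.5 * ((x - data) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The estimate is the mean over all the kernels
    return kernels.mean()

xs = np.linspace(-5, 10, 1000)
dens = np.array([manual_kde(x, X[:, 0], 0.5) for x in xs])
plt.plot(xs, dens)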

Using sklearn's KernelDensity, a KDE can be estimated as:

# Import KDE from sklearn
from sklearn.neighbors import KernelDensity

# Fit a kernel density estimator with a Gaussian kernel and bandwidth 0.5
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)

# Score a range of values; score_samples returns the log-density
X_range = np.linspace(-5, 10, 1000)[:, np.newaxis]
estimated_dens = kde.score_samples(X_range)

# And plot it! (exponentiate to recover the density itself)
plt.plot(X_range[:, 0], np.exp(estimated_dens))
Estimated empirical distribution of the sample above.
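
Since score_samples returns log-densities, the values were exponentiated before plotting. As a quick sanity check, the resulting density should integrate to roughly 1 over the scored range (numpy's trapezoidal rule is used here as a sketch):

# The estimated density should integrate to approximately 1
area = np.trapz(np.exp(estimated_dens), X_range[:, 0])
print(area)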

Here kernel can be changed to sum a different distribution over each data point (e.g. 'tophat' or 'epanechnikov'), and bandwidth controls the width of the kernel.
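
As a sketch of how the bandwidth choice affects the estimate (the values 0.2, 0.5, and 2.0 below are arbitrary, chosen only for illustration):

# Compare KDEs fit with different bandwidths
for bw in [0.2, 0.5, 2.0]:
    kde_bw = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    log_dens = kde_bw.score_samples(X_range)
    plt.plot(X_range[:, 0], np.exp(log_dens), label=f'bandwidth={bw}')
plt.legend()

A small bandwidth yields a spiky estimate that follows individual points, while a large one over-smooths the two modes.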
