Kernel Density Estimation

Kernel Density Estimation, KDE

KDE is a useful non-parametric estimate of a sample's underlying distribution. Being non-parametric means that no assumptions are made about the sample's distribution.

The KDE is generated by placing a kernel, e.g. a small Gaussian distribution, over each data point and then summing over all the kernels. Consider the sample below:

# Import numpy to generate a sample and matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample from a mixture of two Gaussian distributions
X = np.concatenate((np.random.normal(0, 1, 80),
                    np.random.normal(8, 1, 20)))[:, np.newaxis]

# Let's visualize its histogram
_ = plt.hist(X, density=True)

Figure: Sample generated from a mixture of two independent Gaussian distributions with means 0 and 8.
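
To make the "place a kernel on each point and sum" idea concrete, here is a minimal from-scratch sketch; the gaussian_kde_manual helper and its grid are illustrative assumptions, not part of sklearn or this page:

# Hypothetical from-scratch KDE for illustration: one Gaussian bump
# per data point, averaged and rescaled by the bandwidth.
def gaussian_kde_manual(x_grid, data, bandwidth):
    # Standardized distances between grid points and data points
    diffs = (x_grid[:, None] - data[None, :]) / bandwidth
    # Standard normal kernel evaluated at each distance
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps; dividing by the bandwidth makes the
    # estimate integrate to 1
    return kernels.mean(axis=1) / bandwidth

x_grid = np.linspace(-5, 10, 1000)
plt.plot(x_grid, gaussian_kde_manual(x_grid, X.ravel(), bandwidth=0.5))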

Using sklearn's KernelDensity, a KDE can be estimated as:

# Import KDE from sklearn
from sklearn.neighbors import KernelDensity

# Fit a kernel density with a Gaussian kernel and bandwidth 0.5
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)

# Score samples over a range (score_samples returns the log-density)
X_range = np.linspace(-5, 10, 1000)[:, np.newaxis]
estimated_dens = kde.score_samples(X_range)

# And plot it (exponentiating to recover the density)!
plt.plot(X_range[:, 0], np.exp(estimated_dens))

Figure: Estimated empirical distribution of the sample above.

Here kernel can be swapped for a different distribution to place over each data point, and bandwidth controls the width of the kernel.
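
To see what those two knobs do, the snippet below refits the same sample with a few bandwidths; the specific values 0.2, 0.5 and 2.0 are arbitrary choices for illustration, not from the original page:

# Compare bandwidths on the same sample: small values produce a
# spiky estimate, large values oversmooth the two modes.
for bw in [0.2, 0.5, 2.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    plt.plot(X_range[:, 0],
             np.exp(kde.score_samples(X_range)),
             label=f'bandwidth={bw}')
plt.legend()

Besides 'gaussian', sklearn's KernelDensity also accepts 'tophat', 'epanechnikov', 'exponential', 'linear' and 'cosine' kernels.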