Kullback–Leibler Divergence

A measure of relative entropy, i.e. how much one distribution differs from a reference distribution. Useful in its own right and as the theoretical foundation for AIC.

Count Bayesie gives an excellent explanation and example of Kullback-Leibler divergence for the interested reader, describing KL divergence as a measure of how much information is lost when using an approximation. KL divergence is therefore very important for machine learning, as every fitted model is an approximation of the true underlying relationship between the features and the target variable.

Entropy

Before explaining KL divergence, a short recap of entropy might be needed. Remember that entropy is a measure of how much information is gained by observing the outcome of a random variable. The entropy of a random variable X is defined as:

$$H(X) = -\sum_i P(x_i)\log P(x_i)$$
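
As a quick sketch of the formula in Python (assuming a discrete distribution given as a plain probability vector; the function name `entropy` is just for illustration):

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_i P(x_i) * log P(x_i) for a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with P(x_i) = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))            # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))            # biased coin: ~0.47 bits
```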

Kullback-Leibler Divergence

Kullback-Leibler divergence is a modification of the entropy formula:

$$KL(P\,||\,Q) = \sum_{x\in\chi} P(x)\log\frac{P(x)}{Q(x)}$$

Note that, assuming P is the reference distribution, this is the expected log-difference between the reference distribution and the approximation:

$$KL(P\,||\,Q) = E_P[\log P(x) - \log Q(x)]$$

which, by the linearity of expectation, can be written as

$$KL(P\,||\,Q) = E_P[\log P(x)] - E_P[\log Q(x)]$$
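
A minimal numeric sketch (assuming two discrete distributions over the same support, chosen arbitrarily here) showing that the summation form and the expectation form give the same value; `scipy.stats.entropy` returns the KL divergence when passed two distributions and is used only as a cross-check:

```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

P = np.array([0.36, 0.48, 0.16])   # reference (true) distribution
Q = np.array([1/3, 1/3, 1/3])      # uniform approximation

# Summation form: sum_x P(x) * log(P(x) / Q(x))
kl_sum = np.sum(P * np.log(P / Q))

# Expectation form: E_P[log P(x)] - E_P[log Q(x)]
kl_exp = np.sum(P * np.log(P)) - np.sum(P * np.log(Q))

print(kl_sum, kl_exp)              # both ~0.0853 nats
print(scipy_entropy(P, Q))         # scipy cross-check agrees
```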

If P, i.e. the reference or true distribution, is known, the KL divergence can be used to select the approximation Q that best represents it. Sadly, in machine learning the true distribution is rarely known.

The KL divergence can be rewritten to compare two different approximations directly:

$$KL(P\,||\,Q_0) - KL(P\,||\,Q_1) = E_P[\log Q_1(x)] - E_P[\log Q_0(x)]$$

The term involving the reference distribution P cancels, so only the two approximations appear explicitly; the expectations, however, are still taken with respect to P. In the next section, AIC is introduced as an approximation that can be used without knowing P.
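
A sketch of the model-selection idea (assuming, purely for illustration, that samples from P are available even though P itself is unknown; the specific distributions below are hypothetical choices): the expected log-likelihood terms can be estimated as average log-likelihoods over the samples, and the candidate with the higher average log-likelihood is the one with the smaller KL divergence from P.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# The "true" distribution P is a Gamma; in practice only its samples are observed.
samples = rng.gamma(shape=3.0, scale=2.0, size=5_000)

# Two candidate approximations Q0 and Q1 fitted to the same samples.
q0 = stats.norm(*stats.norm.fit(samples))                # Gaussian fit
a, loc, scale = stats.gamma.fit(samples, floc=0)
q1 = stats.gamma(a, loc=loc, scale=scale)                # Gamma fit

# Monte Carlo estimates of E_P[log Q(x)]; the larger value means smaller KL(P||Q).
print(np.mean(q0.logpdf(samples)))   # Gaussian: lower expected log-likelihood
print(np.mean(q1.logpdf(samples)))   # Gamma: higher, i.e. closer to P
```

AIC approximates this expected log-likelihood comparison while penalising for the fact that the same data are used both to fit and to evaluate each candidate.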
