Kullback–Leibler Divergence

A measure of relative entropy, i.e. how much one distribution differs from a reference distribution. Useful in its own right and as the theoretical foundation for AIC.

Count Bayesie gives an excellent explanation and example of Kullback-Leibler divergence for the interested reader, describing KL divergence as a measure of how much information is lost when using an approximation. KL divergence is therefore very important for machine learning, as every fitted model is an approximation of the true underlying relationship between the features and the target variable.

Entropy

Before explaining KL divergence, a short recap of entropy might be needed. Remember that entropy is a measure of how much information is gained by observing the outcome of a random variable. The entropy of a random variable X is defined as:

$$H(X) = -\sum_i P(x_i)\log P(x_i)$$
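
As a quick sketch of the formula in Python (assuming a discrete distribution given as a plain probability vector; the function name `entropy` is just for illustration):

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_i P(x_i) * log P(x_i) for a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with P(x_i) = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))            # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))            # biased coin: ~0.47 bits
```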

Kullback-Leibler Divergence

Kullback-Leibler divergence is a modification of the entropy formula:

$$KL(P\,||\,Q) = \sum_{x\in\chi} P(x)\log\frac{P(x)}{Q(x)}$$

Note that, assuming P is the reference distribution, this is the expected log-difference between the reference distribution and the approximation:

$$KL(P\,||\,Q) = E_P[\log P(x) - \log Q(x)]$$

which, by the linearity of expectation, can be written as

$$KL(P\,||\,Q) = E_P[\log P(x)] - E_P[\log Q(x)]$$
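
A minimal numeric sketch (assuming two discrete distributions over the same support, chosen arbitrarily here) showing that the summation form and the expectation form give the same value; `scipy.stats.entropy` returns the KL divergence when passed two distributions and is used only as a cross-check:

```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

P = np.array([0.36, 0.48, 0.16])   # reference (true) distribution
Q = np.array([1/3, 1/3, 1/3])      # uniform approximation

# Summation form: sum_x P(x) * log(P(x) / Q(x))
kl_sum = np.sum(P * np.log(P / Q))

# Expectation form: E_P[log P(x)] - E_P[log Q(x)]
kl_exp = np.sum(P * np.log(P)) - np.sum(P * np.log(Q))

print(kl_sum, kl_exp)              # both ~0.0853 nats
print(scipy_entropy(P, Q))         # scipy cross-check agrees
```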

If P, i.e. the reference or true distribution, is known, the KL divergence can be used to select the approximation Q that best represents it. Sadly, in machine learning the true distribution is rarely known.

The KL divergence can be rewritten to compare two different approximations directly:

$$KL(P\,||\,Q_0) - KL(P\,||\,Q_1) = E_P[\log Q_1(x)] - E_P[\log Q_0(x)]$$

The term involving the reference distribution P cancels, so only the two approximations appear explicitly; the expectations, however, are still taken with respect to P. In the next section, AIC is introduced as an approximation that can be used without knowing P.
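
A sketch of the model-selection idea (assuming, purely for illustration, that samples from P are available even though P itself is unknown; the specific distributions below are hypothetical choices): the expected log-likelihood terms can be estimated as average log-likelihoods over the samples, and the candidate with the higher average log-likelihood is the one with the smaller KL divergence from P.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# The "true" distribution P is a Gamma; in practice only its samples are observed.
samples = rng.gamma(shape=3.0, scale=2.0, size=5_000)

# Two candidate approximations Q0 and Q1 fitted to the same samples.
q0 = stats.norm(*stats.norm.fit(samples))                # Gaussian fit
a, loc, scale = stats.gamma.fit(samples, floc=0)
q1 = stats.gamma(a, loc=loc, scale=scale)                # Gamma fit

# Monte Carlo estimates of E_P[log Q(x)]; the larger value means smaller KL(P||Q).
print(np.mean(q0.logpdf(samples)))   # Gaussian: lower expected log-likelihood
print(np.mean(q1.logpdf(samples)))   # Gamma: higher, i.e. closer to P
```

AIC approximates this expected log-likelihood comparison while penalising for the fact that the same data are used both to fit and to evaluate each candidate.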
