Kullback–Leibler Divergence
A measure of relative entropy, i.e. how one distribution differs from a reference distribution. Very useful in its own right and as a theoretical foundation for AIC.
Count Bayesie gives an excellent explanation and example of Kullback-Leibler divergence for the interested reader, describing KL divergence as a measure of how much information is lost when using an approximation. KL divergence is therefore very important for machine learning, since every fitted model is an approximation of the true underlying relationship between the features and the target variable.
Before explaining KL divergence, a short recap of entropy might be needed. Remember, entropy is a measure of how much information is gained in observing the outcome of a random variable. The entropy for a random variable X is defined as:
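$$
H(X) = -\sum_{x} p(x) \log p(x)
$$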
Kullback-Leibler divergence is a modification of the entropy formula:
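$$
D_{KL}(P \,\|\, Q) = \sum_{x} p(x) \log\frac{p(x)}{q(x)}
$$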
Note that, assuming P is the reference distribution, this is the expected log-difference between the reference and the approximation:
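$$
D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log p(x) - \log q(x)\right]
$$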
which, by the linearity of expectation, can be written as:
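$$
D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log p(x)\right] - \mathbb{E}_{x \sim P}\left[\log q(x)\right]
$$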
If P, i.e. the reference or true distribution, is known, the KL divergence can be used to select the best representative approximation Q. Sadly, in machine learning the true distribution is rarely known.
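As a rough illustration of this selection, the sketch below (assuming NumPy and SciPy are available) computes the KL divergence from a known discrete reference distribution to two candidate approximations and keeps the one with the smaller divergence; the distributions `p`, `q1`, and `q2` are made up for the example.

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q)

# Known reference (true) distribution P over a small discrete support.
p = np.array([0.5, 0.3, 0.2])

# Two hypothetical candidate approximations Q1 and Q2.
q1 = np.array([0.4, 0.35, 0.25])
q2 = np.array([0.6, 0.3, 0.1])

# KL(P || Q) = sum_x p(x) * log(p(x) / q(x)).
kl_q1 = rel_entr(p, q1).sum()
kl_q2 = rel_entr(p, q2).sum()

print(f"KL(P || Q1) = {kl_q1:.4f}")
print(f"KL(P || Q2) = {kl_q2:.4f}")

# The candidate with the smaller divergence loses less information
# about P and is the better representative approximation.
best = "Q1" if kl_q1 < kl_q2 else "Q2"
print(f"Best approximation: {best}")
```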
The KL divergence can be rewritten to compare two different approximations directly:
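$$
D_{KL}(P \,\|\, Q_1) - D_{KL}(P \,\|\, Q_2) = \mathbb{E}_{x \sim P}\left[\log q_2(x)\right] - \mathbb{E}_{x \sim P}\left[\log q_1(x)\right]
$$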
This removes the term containing the expectation of the original distribution's own log-density. However, the remaining expected values are still taken with respect to the original distribution P. In the next section, AIC is introduced as an approximation that can be used without knowing P.