AIC & BIC

AIC and BIC provide an analytical way, i.e. without using an extra dataset, to estimate how well a model generalizes.

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two criteria that represent the tradeoff between the training error and model complexity. The model complexity, in this case, is the effective number of parameters used, i.e. the number of non-zero parameters fitted by the model.

AIC and BIC differ in their theoretical background but look similar and can be used similarly, with one main difference in behaviour: AIC tends to find the best generalizing model, while BIC tends to find the "true" model. This is because BIC punishes more complex models harder than AIC.
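To make the notion of "effective number of parameters" concrete, here is a minimal sketch that counts the non-zero coefficients of a sparse scikit-learn model. The synthetic data and the regularization strength are made up for illustration only.

```python
# Minimal sketch: the "effective number of parameters" of a sparse model is
# simply its count of non-zero fitted coefficients. Data and alpha are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only three of the ten features actually drive the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
p_effective = np.count_nonzero(model.coef_)
print(f"effective number of parameters: {p_effective}")
```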

Akaike Information Criterion

Recall the Kullback–Leibler Divergence discussed on the previous page:

KL(P||Q) = E_P[\log P(x)] - E_P[\log Q(x)]

where P is a reference distribution and Q an approximating distribution of P. To further illustrate the point of AIC, let Q depend on a set of parameters θ, i.e. Q(X|θ). In machine learning the distribution P is fixed and we often search for the best set of parameters θ, or even the best function Q. As such, the best approximation for a given function Q can be written as:

\hat{\theta} = \text{argmin}_\theta \left( -E_P[\log Q(X|\theta)] \right)

While simpler, this is still not usable as it depends on the expectation w.r.t. the distribution P. Akaike showed that it can be asymptotically estimated as:

E_P[-2 \log Q(X|\theta)] \approx 2p - 2 \log Q(X|\theta)

where p is the effective number of parameters. This motivates the AIC criterion:

AIC = 2p - 2 \log L(\theta, X)

where L is the likelihood function of the data given the parameters θ.

For smaller datasets the corrected AIC, AICC, might be a better choice, since AIC is only asymptotically unbiased:

AICC = \frac{2p^2 + 2p}{n - p - 1} + AIC

where n is the sample size.
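As a sketch of how these formulas translate into code, the helpers below compute AIC and AICC from the log-likelihood of a least-squares fit, assuming i.i.d. Gaussian errors so the log-likelihood has a closed form. The function names are illustrative, not from any particular library.

```python
# Sketch: AIC / AICC from the Gaussian log-likelihood of a least-squares fit.
import numpy as np

def gaussian_log_likelihood(y, y_pred):
    """Log-likelihood of y under a Gaussian with the MLE residual variance."""
    n = len(y)
    sigma2 = np.mean((y - y_pred) ** 2)          # MLE of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(log_lik, p):
    """AIC = 2p - 2 log L, where p is the effective number of parameters."""
    return 2 * p - 2 * log_lik

def aicc(log_lik, p, n):
    """Small-sample correction added on top of AIC."""
    return aic(log_lik, p) + (2 * p**2 + 2 * p) / (n - p - 1)
```

Whether to count the noise variance as an extra parameter in p is a convention choice; what matters is applying the same convention to every model being compared.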

Bayesian Information Criterion

BIC is motivated from a Bayesian perspective, as an approximation to the log marginal likelihood of the model, and is defined as:

BIC = p \log n - 2 \log L(\theta, X)

where p is again the effective number of parameters and n the sample size. Since log n > 2 once n ≄ 8, BIC punishes larger models harder than AIC.
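Continuing the sketch above, BIC only changes the complexity penalty:

```python
import numpy as np

def bic(log_lik, p, n):
    """BIC = p log n - 2 log L; the penalty now grows with the sample size n."""
    return p * np.log(n) - 2 * log_lik
```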

Note that both criteria are only asymptotically unbiased, meaning that they might not be reliable for small datasets. If your dataset is small or intermediate in size, consider using AICC.

Further note that AIC and BIC values can only be compared between models fitted to the exact same dataset; a value on its own has no meaning. One can thus not proclaim an amazing AIC score for a problem in isolation!
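In practice, many libraries report AIC and BIC directly. The hedged example below compares two candidate models fitted to the same synthetic dataset with statsmodels, which exposes both criteria on the fitted results; lower is better, but only relative to the other candidates on that dataset.

```python
# Hedged example: compare two OLS models fitted to the SAME synthetic dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

small = sm.OLS(y, sm.add_constant(X[:, :2])).fit()  # only the relevant features
large = sm.OLS(y, sm.add_constant(X)).fit()         # all five features

# Lower is better, but the numbers mean nothing outside this dataset.
print(f"small: AIC={small.aic:.1f}  BIC={small.bic:.1f}")
print(f"large: AIC={large.aic:.1f}  BIC={large.bic:.1f}")
```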
