The Bias-Variance Decomposition

A theoretical decomposition of the prediction error that shows the trade-off between bias and variance.

This page briefly summarizes the bias-variance decomposition of the prediction error. The decomposition breaks the prediction error into its components and, most importantly, exposes the trade-off between bias and variance discussed below. The topic is covered in greater detail in chapter 7.3 of The Elements of Statistical Learning.

Given a dataset with features X and target variables Y, assume that there exists a function f such that

$$Y = f(X) + \epsilon$$

where $\epsilon$ is the unexplainable part of the error, with

$$E(\epsilon) = 0, \qquad Var(\epsilon) = \sigma_\epsilon^2$$

Further assuming that the target variables are real-valued and using the squared-error loss, the expected prediction error at an observation $x_0$, for an estimate $\hat{f}$ of $f$, can be written as:

Err(x0)=E[(y0−f^(x0))2∣x0]=σϵ2+E[f^(x0)−f(x0)]2+E[f^(x0)−E[f^(x0)]]2Err(x_0) = E[(y_0-\hat{f}(x_0))^2|x_0]\\=\sigma^2_\epsilon + E[\hat{f}(x_0)-f(x_0)]^2+E[\hat{f}(x_0)-E[\hat{f}(x_0)]]^2Err(x0​)=E[(y0​−f^​(x0​))2∣x0​]=σϵ2​+E[f^​(x0​)−f(x0​)]2+E[f^​(x0​)−E[f^​(x0​)]]2

Note that by the definition of bias we get

biasθ=E[θ^−θ]  ⟹  E[f^(x0)−f(x0)]2=biasf^2bias_\theta=E[\hat{\theta}-\theta] \implies \\ E[\hat{f}(x_0)-f(x_0)]^2 = bias_{\hat{f}}^2biasθ​=E[θ^−θ]⟹E[f^​(x0​)−f(x0​)]2=biasf^​2​

And by the definition of variance:

Var[θ^]=E[θ^−E[θ^]]2Var[\hat\theta] =E[\hat\theta-E[\hat\theta]]^2 Var[θ^]=E[θ^−E[θ^]]2

for any estimator $\hat{\theta}$ of $\theta$, it follows that

σϵ2+E[f^(x0)−f(x0)]2+E[f^(x0)−E[f^(x0)]]2=σϵ2+biasf2+variancef\sigma^2_\epsilon + E[\hat{f}(x_0)-f(x_0)]^2+E[\hat{f}(x_0)-E[\hat{f}(x_0)]]^2 \\ =\sigma^2_\epsilon + bias_f^2 + variance_fσϵ2​+E[f^​(x0​)−f(x0​)]2+E[f^​(x0​)−E[f^​(x0​)]]2=σϵ2​+biasf2​+variancef​

which means that the prediction error at an observation is the sum of the irreducible variance, the squared bias, and the variance of the estimate $\hat{f}$.
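
A small simulation can check the decomposition numerically. Everything in the setup below is an illustrative assumption rather than part of the text: the true function is taken to be $f(x) = \sin(x)$, the noise level is $\sigma_\epsilon = 0.3$, and $\hat{f}$ is a straight-line fit, a deliberately biased model. Bias and variance at a point $x_0$ are estimated by refitting on many fresh training samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the text): the truth is f(x) = sin(x),
# the noise has sd sigma = 0.3, and f-hat is a degree-1 polynomial fit,
# i.e. a deliberately biased model.
f, sigma = np.sin, 0.3
x0, n_train, n_sims = 1.0, 30, 5000

preds = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0.0, np.pi, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)         # fresh training sample
    f_hat = np.polynomial.Polynomial.fit(x, y, deg=1)  # refit the model
    preds[i] = f_hat(x0)                               # its prediction at x0

bias_sq = (preds.mean() - f(x0)) ** 2    # squared systematic error
variance = preds.var()                   # spread of f-hat(x0) across refits

# Direct Monte Carlo estimate of Err(x0) = E[(y0 - f-hat(x0))^2 | x0]
y0 = f(x0) + rng.normal(0.0, sigma, n_sims)
err_direct = np.mean((y0 - preds) ** 2)

print(f"sigma^2 + bias^2 + variance = {sigma**2 + bias_sq + variance:.4f}")
print(f"direct estimate of Err(x0)  = {err_direct:.4f}")
```

With enough simulations the two printed numbers should agree to within Monte Carlo noise.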

As seen in the equations above, the bias is the systematic error of an estimator. High-bias models make rigid assumptions and do not allow a flexible fit to the data. Models of higher complexity fit the data more flexibly and therefore have lower bias, but a fit that follows the data closely can vary from sample to sample, which increases the variance of the fitted model. There is a trade-off between bias and variance: the ideal model captures just enough of the structure in the data to reduce bias without also fitting the noise and inflating its variance.

Note that increasing complexity always lowers the training error of a model, but beyond some point the growth in variance outweighs the reduction in bias and the prediction error rises again. Training error is therefore a poor guide for model selection, and great care must be taken when selecting and designing a model. The sketch below illustrates this on the same simulated setup.
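
As a rough illustration of the trade-off, again under the assumed sinusoidal truth and noise level from the sketch above, with polynomial degree standing in for model complexity, the following loop tracks training error, squared bias, and variance as the degree grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative setup as above; polynomial degree plays the role of
# model complexity.
f, sigma, n_train, n_sims = np.sin, 0.3, 30, 2000
x_grid = np.linspace(0.0, np.pi, 50)  # points over which bias/variance are averaged

for deg in (1, 3, 7):
    train_mse = np.empty(n_sims)
    preds = np.empty((n_sims, x_grid.size))
    for i in range(n_sims):
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        f_hat = np.polynomial.Polynomial.fit(x, y, deg)
        train_mse[i] = np.mean((y - f_hat(x)) ** 2)  # error on the training set
        preds[i] = f_hat(x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"deg={deg}: train MSE={train_mse.mean():.3f}  "
          f"bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"Err est={sigma**2 + bias_sq + variance:.3f}")
```

In runs of this sketch the training MSE falls monotonically with the degree, while the estimated $Err$ is typically smallest at the intermediate degree: the low-degree fit is dominated by bias and the high-degree fit by variance.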

🍂 The Elements of Statistical Learning (Hastie, Tibshirani & Friedman), chapter 7.3