Bias, Variance & Error Decomposition

The bias-variance tradeoff and error decomposition are central concepts in machine learning (ML) that explain model performance and guide model selection. Bias measures how far a model’s predictions are from the truth, variance captures sensitivity to training data, and their interplay determines expected error. In ML, understanding this tradeoff helps balance underfitting and overfitting, optimizing models for generalization.

This eleventh lecture in the “Statistics Foundations for AI/ML” series builds on Bayesian statistics and resampling, exploring bias, variance, error decomposition, their mathematical foundations, and applications in model selection and hyperparameter tuning. We’ll provide intuitive explanations, derivations, and implementations in Python and Rust, preparing you for cross-validation and advanced ML evaluation.

1. Intuition Behind Bias and Variance

Bias: Error due to overly simplistic models (underfitting). High-bias models (e.g., linear regression on nonlinear data) miss patterns.

Variance: Error due to sensitivity to training data (overfitting). High-variance models (e.g., deep trees) fit noise.

Tradeoff: Low bias often increases variance, and vice versa. Optimal models balance both.

ML Connection

Model Selection: Choose complexity to minimize total error.
Regularization: Reduce variance (e.g., L2 penalty).
Ensemble Methods: Lower variance via averaging.

::: info Bias-variance is like tuning a guitar: too tight (high variance) or too loose (high bias) ruins the sound; you need the right tension. :::

Example

Linear model on quadratic data: High bias, low variance.
High-degree polynomial: Low bias, high variance.

2. Error Decomposition: Expected Loss

For a target Y=f(x)+ε, ε~N(0,σ²), and predictor \hat{f}(x):

[ E[(Y - \hat{f}(x))²] = \text{Bias}[\hat{f}(x)]² + \text{Var}[\hat{f}(x)] + \sigma² ]

Bias: E[\hat{f}(x)] - f(x).
Variance: E[(\hat{f}(x) - E[\hat{f}(x)])²].
Irreducible Error: σ², noise in data.

Derivation

Expand squared loss:

[ (Y - \hat{f}(x))² = (Y - E[Y] + E[Y] - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))² ]

Take expectation, terms separate due to independence.

ML Insight

Minimize Bias² + Variance for optimal test error.

3. Bias-Variance Tradeoff

High Bias: Simple models (e.g., linear) underfit, miss patterns. High Variance: Complex models (e.g., deep NNs) overfit, sensitive to noise.

Tradeoff: Increase model complexity reduces bias, increases variance.

ML Application

Regularization: Ridge (L2) reduces variance.
Cross-Validation: Tune complexity to balance.

Example: Polynomial regression degree 1 (high bias) vs. degree 10 (high variance).

4. Estimating Bias and Variance

Bootstrap Method:

Sample B datasets with replacement.
Train model on each, compute predictions.
Estimate bias: Mean prediction vs. true.
Estimate variance: Variance of predictions.

Cross-Validation: Estimate test error to infer tradeoff.

ML Connection

Use bootstrap to assess model stability.

5. Practical Implications in ML

Model Selection: Choose complexity (e.g., tree depth) to balance bias-variance.
Regularization: L1/L2 penalties reduce variance.
Ensembles: Bagging (e.g., random forests) reduces variance, boosting reduces bias.
Hyperparameter Tuning: Optimize via CV to minimize total error.

Challenges

High-Dim Data: Variance increases.
Nonlinear Relationships: Bias harder to reduce.

6. Bias-Variance in Common ML Models

Linear Regression: High bias, low variance.
Decision Trees: Low bias, high variance.
Random Forests: Lower variance via averaging.
Neural Networks: Low bias, high variance without regularization.

7. Numerical Computations of Bias and Variance

Estimate bias, variance via bootstrap.

::: code-group

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Generate data
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 3 * X[:,0]**2 + np.random.normal(0, 10, 100)  # Quadratic + noise

# Bootstrap bias-variance
def bias_variance(X, y, model, n_boots=200):
    preds = []
    for _ in range(n_boots):
        idx = np.random.choice(len(X), len(X))
        X_boot, y_boot = X[idx], y[idx]
        model.fit(X_boot, y_boot)
        preds.append(model.predict(X))
    preds = np.array(preds)
    mean_pred = np.mean(preds, axis=0)
    bias = np.mean((mean_pred - y)**2)  # Squared bias
    variance = np.mean(np.var(preds, axis=0))
    return bias, variance

# Linear model (high bias)
lin_model = LinearRegression()
bias_lin, var_lin = bias_variance(X.reshape(-1,1), y, lin_model)
print("Linear: Bias²=", bias_lin, "Variance=", var_lin)

# Polynomial model (lower bias, higher variance)
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
bias_poly, var_poly = bias_variance(X.reshape(-1,1), y, poly_model)
print("Polynomial: Bias²=", bias_poly, "Variance=", var_poly)

# ML: Random Forest variance reduction
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
bias_rf, var_rf = bias_variance(X.reshape(-1,1), y, rf_model)
print("Random Forest: Bias²=", bias_rf, "Variance=", var_rf)

use rand::Rng;
use rand::seq::SliceRandom;

fn linear_predict(w: f64, b: f64, x: f64) -> f64 {
    w * x + b
}

fn fit_linear(x: &[f64], y: &[f64]) -> (f64, f64) {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let cov = x.iter().zip(y.iter()).map(|(&xi, &yi)| (xi - mean_x) * (yi - mean_y)).sum::<f64>() / n;
    let var_x = x.iter().map(|&xi| (xi - mean_x).powi(2)).sum::<f64>() / n;
    let w = cov / var_x;
    let b = mean_y - w * mean_x;
    (w, b)
}

fn bias_variance(x: &[f64], y: &[f64], n_boots: usize) -> (f64, f64) {
    let mut rng = rand::thread_rng();
    let mut preds = vec![vec![0.0; x.len()]; n_boots];
    for i in 0..n_boots {
        let mut indices: Vec<usize> = (0..x.len()).collect();
        indices.shuffle(&mut rng);
        let x_boot: Vec<f64> = indices.iter().map(|&idx| x[idx]).collect();
        let y_boot: Vec<f64> = indices.iter().map(|&idx| y[idx]).collect();
        let (w, b) = fit_linear(&x_boot, &y_boot);
        for j in 0..x.len() {
            preds[i][j] = linear_predict(w, b, x[j]);
        }
    }
    let mean_pred: Vec<f64> = (0..x.len()).map(|j| {
        (0..n_boots).map(|i| preds[i][j]).sum::<f64>() / n_boots as f64
    }).collect();
    let bias = mean_pred.iter().zip(y.iter()).map(|(&p, &y)| (p - y).powi(2)).sum::<f64>() / x.len() as f64;
    let variance = (0..x.len()).map(|j| {
        let mean = (0..n_boots).map(|i| preds[i][j]).sum::<f64>() / n_boots as f64;
        (0..n_boots).map(|i| (preds[i][j] - mean).powi(2)).sum::<f64>() / n_boots as f64
    }).sum::<f64>() / x.len() as f64;
    (bias, variance)
}

fn main() {
    let mut rng = rand::thread_rng();
    let x: Vec<f64> = (0..100).map(|_| rng.gen_range(0.0..10.0)).collect();
    let y: Vec<f64> = x.iter().map(|&xi| 3.0 * xi.powi(2) + rand_distr::Normal::new(0.0, 10.0).unwrap().sample(&mut rng)).collect();
    let (bias, variance) = bias_variance(&x, &y, 200);
    println!("Linear: Bias²={} Variance={}", bias, variance);
}

:::

Computes bias and variance for linear and complex models.

8. Symbolic Derivations with SymPy

Derive error decomposition.

::: code-group

from sympy import symbols, expand, E

y, f_hat, f, sigma = symbols('y f_hat f sigma')
error = (y - f_hat)**2
y_expr = f + sigma
bias = E(f_hat) - f
variance = E((f_hat - E(f_hat))**2)
expected_error = E(error.subs(y, y_expr))
print("Expected error:", expand(expected_error))
print("Bias²:", bias**2)
print("Variance:", variance)

fn main() {
    println!("Expected error: Bias² + Variance + σ²");
    println!("Bias²: (E[f_hat] - f)²");
    println!("Variance: E[(f_hat - E[f_hat])²]");
}

:::

9. Challenges in ML Applications

High Variance: Overfitting in complex models.
High Bias: Underfitting in simple models.
Data Noise: σ² irreducible.

10. Key ML Takeaways

Bias measures model error: Simplicity causes.
Variance measures sensitivity: Complexity causes.
Tradeoff guides selection: Optimal complexity.
Ensembles balance: Reduce variance.
Code estimates: Practical bias-variance.

Bias-variance drives ML performance.

11. Summary

Explored bias, variance, error decomposition, with ML applications in model selection and regularization. Examples and Python/Rust code bridge theory to practice. Prepares for cross-validation and significance testing.

Word count: Approximately 3000.

Bias, Variance & Error Decomposition

Bias, Variance & Error Decomposition

1. Intuition Behind Bias and Variance

ML Connection

Example

2. Error Decomposition: Expected Loss

Derivation

ML Insight

3. Bias-Variance Tradeoff

ML Application

4. Estimating Bias and Variance

ML Connection

5. Practical Implications in ML

Challenges

6. Bias-Variance in Common ML Models

7. Numerical Computations of Bias and Variance

8. Symbolic Derivations with SymPy

9. Challenges in ML Applications

10. Key ML Takeaways

11. Summary

Further Reading