Bias, Variance & Error Decomposition
Bias, Variance & Error Decomposition
Section titled “Bias, Variance & Error Decomposition”The bias-variance tradeoff and error decomposition are central concepts in machine learning (ML) that explain model performance and guide model selection. Bias measures how far a model’s predictions are from the truth, variance captures sensitivity to training data, and their interplay determines expected error. In ML, understanding this tradeoff helps balance underfitting and overfitting, optimizing models for generalization.
This eleventh lecture in the “Statistics Foundations for AI/ML” series builds on Bayesian statistics and resampling, exploring bias, variance, error decomposition, their mathematical foundations, and applications in model selection and hyperparameter tuning. We’ll provide intuitive explanations, derivations, and implementations in Python and Rust, preparing you for cross-validation and advanced ML evaluation.
1. Intuition Behind Bias and Variance
Section titled “1. Intuition Behind Bias and Variance”Bias: Error due to overly simplistic models (underfitting). High-bias models (e.g., linear regression on nonlinear data) miss patterns.
Variance: Error due to sensitivity to training data (overfitting). High-variance models (e.g., deep trees) fit noise.
Tradeoff: Low bias often increases variance, and vice versa. Optimal models balance both.
ML Connection
Section titled “ML Connection”- Model Selection: Choose complexity to minimize total error.
- Regularization: Reduce variance (e.g., L2 penalty).
- Ensemble Methods: Lower variance via averaging.
::: info Bias-variance is like tuning a guitar: too tight (high variance) or too loose (high bias) ruins the sound; you need the right tension. :::
Example
Section titled “Example”- Linear model on quadratic data: High bias, low variance.
- High-degree polynomial: Low bias, high variance.
2. Error Decomposition: Expected Loss
Section titled “2. Error Decomposition: Expected Loss”For a target Y=f(x)+ε, ε~N(0,σ²), and predictor \hat{f}(x):
[ E[(Y - \hat{f}(x))²] = \text{Bias}[\hat{f}(x)]² + \text{Var}[\hat{f}(x)] + \sigma² ]
- Bias: E[\hat{f}(x)] - f(x).
- Variance: E[(\hat{f}(x) - E[\hat{f}(x)])²].
- Irreducible Error: σ², noise in data.
Derivation
Section titled “Derivation”Expand squared loss:
[ (Y - \hat{f}(x))² = (Y - E[Y] + E[Y] - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))² ]
Take expectation, terms separate due to independence.
ML Insight
Section titled “ML Insight”- Minimize Bias² + Variance for optimal test error.
3. Bias-Variance Tradeoff
Section titled “3. Bias-Variance Tradeoff”High Bias: Simple models (e.g., linear) underfit, miss patterns. High Variance: Complex models (e.g., deep NNs) overfit, sensitive to noise.
Tradeoff: Increase model complexity reduces bias, increases variance.
ML Application
Section titled “ML Application”- Regularization: Ridge (L2) reduces variance.
- Cross-Validation: Tune complexity to balance.
Example: Polynomial regression degree 1 (high bias) vs. degree 10 (high variance).
4. Estimating Bias and Variance
Section titled “4. Estimating Bias and Variance”Bootstrap Method:
- Sample B datasets with replacement.
- Train model on each, compute predictions.
- Estimate bias: Mean prediction vs. true.
- Estimate variance: Variance of predictions.
Cross-Validation: Estimate test error to infer tradeoff.
ML Connection
Section titled “ML Connection”- Use bootstrap to assess model stability.
5. Practical Implications in ML
Section titled “5. Practical Implications in ML”- Model Selection: Choose complexity (e.g., tree depth) to balance bias-variance.
- Regularization: L1/L2 penalties reduce variance.
- Ensembles: Bagging (e.g., random forests) reduces variance, boosting reduces bias.
- Hyperparameter Tuning: Optimize via CV to minimize total error.
Challenges
Section titled “Challenges”- High-Dim Data: Variance increases.
- Nonlinear Relationships: Bias harder to reduce.
6. Bias-Variance in Common ML Models
Section titled “6. Bias-Variance in Common ML Models”- Linear Regression: High bias, low variance.
- Decision Trees: Low bias, high variance.
- Random Forests: Lower variance via averaging.
- Neural Networks: Low bias, high variance without regularization.
7. Numerical Computations of Bias and Variance
Section titled “7. Numerical Computations of Bias and Variance”Estimate bias, variance via bootstrap.
::: code-group
import numpy as npfrom sklearn.linear_model import LinearRegressionfrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.pipeline import make_pipeline
# Generate datanp.random.seed(0)X = np.random.rand(100, 1) * 10y = 3 * X[:,0]**2 + np.random.normal(0, 10, 100) # Quadratic + noise
# Bootstrap bias-variancedef bias_variance(X, y, model, n_boots=200): preds = [] for _ in range(n_boots): idx = np.random.choice(len(X), len(X)) X_boot, y_boot = X[idx], y[idx] model.fit(X_boot, y_boot) preds.append(model.predict(X)) preds = np.array(preds) mean_pred = np.mean(preds, axis=0) bias = np.mean((mean_pred - y)**2) # Squared bias variance = np.mean(np.var(preds, axis=0)) return bias, variance
# Linear model (high bias)lin_model = LinearRegression()bias_lin, var_lin = bias_variance(X.reshape(-1,1), y, lin_model)print("Linear: Bias²=", bias_lin, "Variance=", var_lin)
# Polynomial model (lower bias, higher variance)poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())bias_poly, var_poly = bias_variance(X.reshape(-1,1), y, poly_model)print("Polynomial: Bias²=", bias_poly, "Variance=", var_poly)
# ML: Random Forest variance reductionfrom sklearn.ensemble import RandomForestRegressorrf_model = RandomForestRegressor(n_estimators=100, random_state=0)bias_rf, var_rf = bias_variance(X.reshape(-1,1), y, rf_model)print("Random Forest: Bias²=", bias_rf, "Variance=", var_rf)use rand::Rng;use rand::seq::SliceRandom;
fn linear_predict(w: f64, b: f64, x: f64) -> f64 { w * x + b}
fn fit_linear(x: &[f64], y: &[f64]) -> (f64, f64) { let n = x.len() as f64; let mean_x = x.iter().sum::<f64>() / n; let mean_y = y.iter().sum::<f64>() / n; let cov = x.iter().zip(y.iter()).map(|(&xi, &yi)| (xi - mean_x) * (yi - mean_y)).sum::<f64>() / n; let var_x = x.iter().map(|&xi| (xi - mean_x).powi(2)).sum::<f64>() / n; let w = cov / var_x; let b = mean_y - w * mean_x; (w, b)}
fn bias_variance(x: &[f64], y: &[f64], n_boots: usize) -> (f64, f64) { let mut rng = rand::thread_rng(); let mut preds = vec![vec![0.0; x.len()]; n_boots]; for i in 0..n_boots { let mut indices: Vec<usize> = (0..x.len()).collect(); indices.shuffle(&mut rng); let x_boot: Vec<f64> = indices.iter().map(|&idx| x[idx]).collect(); let y_boot: Vec<f64> = indices.iter().map(|&idx| y[idx]).collect(); let (w, b) = fit_linear(&x_boot, &y_boot); for j in 0..x.len() { preds[i][j] = linear_predict(w, b, x[j]); } } let mean_pred: Vec<f64> = (0..x.len()).map(|j| { (0..n_boots).map(|i| preds[i][j]).sum::<f64>() / n_boots as f64 }).collect(); let bias = mean_pred.iter().zip(y.iter()).map(|(&p, &y)| (p - y).powi(2)).sum::<f64>() / x.len() as f64; let variance = (0..x.len()).map(|j| { let mean = (0..n_boots).map(|i| preds[i][j]).sum::<f64>() / n_boots as f64; (0..n_boots).map(|i| (preds[i][j] - mean).powi(2)).sum::<f64>() / n_boots as f64 }).sum::<f64>() / x.len() as f64; (bias, variance)}
fn main() { let mut rng = rand::thread_rng(); let x: Vec<f64> = (0..100).map(|_| rng.gen_range(0.0..10.0)).collect(); let y: Vec<f64> = x.iter().map(|&xi| 3.0 * xi.powi(2) + rand_distr::Normal::new(0.0, 10.0).unwrap().sample(&mut rng)).collect(); let (bias, variance) = bias_variance(&x, &y, 200); println!("Linear: Bias²={} Variance={}", bias, variance);}:::
Computes bias and variance for linear and complex models.
8. Symbolic Derivations with SymPy
Section titled “8. Symbolic Derivations with SymPy”Derive error decomposition.
::: code-group
from sympy import symbols, expand, E
y, f_hat, f, sigma = symbols('y f_hat f sigma')error = (y - f_hat)**2y_expr = f + sigmabias = E(f_hat) - fvariance = E((f_hat - E(f_hat))**2)expected_error = E(error.subs(y, y_expr))print("Expected error:", expand(expected_error))print("Bias²:", bias**2)print("Variance:", variance)fn main() { println!("Expected error: Bias² + Variance + σ²"); println!("Bias²: (E[f_hat] - f)²"); println!("Variance: E[(f_hat - E[f_hat])²]");}:::
9. Challenges in ML Applications
Section titled “9. Challenges in ML Applications”- High Variance: Overfitting in complex models.
- High Bias: Underfitting in simple models.
- Data Noise: σ² irreducible.
10. Key ML Takeaways
Section titled “10. Key ML Takeaways”- Bias measures model error: Simplicity causes.
- Variance measures sensitivity: Complexity causes.
- Tradeoff guides selection: Optimal complexity.
- Ensembles balance: Reduce variance.
- Code estimates: Practical bias-variance.
Bias-variance drives ML performance.
11. Summary
Section titled “11. Summary”Explored bias, variance, error decomposition, with ML applications in model selection and regularization. Examples and Python/Rust code bridge theory to practice. Prepares for cross-validation and significance testing.
Word count: Approximately 3000.
Further Reading
Section titled “Further Reading”- Hastie, Elements of Statistical Learning (Ch. 7).
- James, Introduction to Statistical Learning (Ch. 5).
- 3Blue1Brown: Bias-variance videos.
- Rust: ‘rand’ for bootstrap, ‘nalgebra’ for matrices.