
Linear Regression

Linear regression is a cornerstone of supervised learning, predicting continuous outputs (e.g., house prices) from input features (e.g., size, age). This section provides a deep dive into its theory, derivations, regularization, and evaluation, with a Rust lab using linfa to illustrate practical implementation. We’ll explore "under the hood" details, including computational efficiency and Rust’s role in ML.

Theory

Linear regression models the relationship between a feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ and output $y$ as a linear function:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w_0 + \mathbf{w}^T \mathbf{x}$$

where $w_0$ is the intercept and $\mathbf{w} = [w_1, \ldots, w_n]$ are the weights (in the derivations below, $\mathbf{x}$ is augmented with a leading 1 so the intercept folds into the weight vector). The goal is to find $\mathbf{w}$ and $w_0$ that minimize the mean squared error (MSE) over $m$ training examples:

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = w_0 + \mathbf{w}^T \mathbf{x}_i$ is the predicted output.
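
Before the derivations, here is a minimal sketch of these two formulas in plain Rust: it computes $\hat{y}_i$ and the MSE for a tiny made-up dataset (the weights and feature values are arbitrary illustrations, not fitted parameters).

rust
// Minimal sketch: predictions y_hat = w0 + w·x and the MSE for a tiny,
// made-up dataset with two features per example (values chosen by hand).
fn main() {
    let w0 = 1.0; // intercept
    let w = [2.0, -0.5]; // weights
    let xs = [[1.0, 4.0], [2.0, 1.0], [3.0, 0.0]]; // feature vectors
    let ys = [1.0, 4.5, 7.0]; // observed targets

    let mut mse = 0.0;
    for (x, y) in xs.iter().zip(ys.iter()) {
        // y_hat = w0 + w1*x1 + w2*x2
        let y_hat = w0 + w.iter().zip(x.iter()).map(|(wi, xi)| wi * xi).sum::<f64>();
        mse += (y - y_hat).powi(2);
    }
    mse /= ys.len() as f64;
    println!("MSE = {}", mse); // 0.0 here, because these weights fit the toy data exactly
}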

Derivation: Normal Equation

To minimize the MSE, we formulate the loss as a function of the parameter vector $\boldsymbol{\theta} = [w_0, w_1, \ldots, w_n]$:

$$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \boldsymbol{\theta}^T \mathbf{x}_i)^2$$

where $\mathbf{x}_i = [1, x_{i1}, \ldots, x_{in}]$ includes the intercept term. In matrix form, for design matrix $X \in \mathbb{R}^{m \times (n+1)}$ and target vector $\mathbf{y} \in \mathbb{R}^m$:

$$J(\boldsymbol{\theta}) = \frac{1}{m} (\mathbf{y} - X\boldsymbol{\theta})^T (\mathbf{y} - X\boldsymbol{\theta})$$

To find the optimal $\boldsymbol{\theta}$, take the gradient with respect to $\boldsymbol{\theta}$ and set it to zero:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = -\frac{2}{m} X^T (\mathbf{y} - X\boldsymbol{\theta}) = 0$$

Solving yields the normal equation:

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$$

This closed-form solution becomes expensive when the number of features $n$ is large, since inverting (or factorizing) $X^T X$ costs $O(n^3)$, and forming $X^T X$ costs $O(m n^2)$. Rust’s nalgebra crate provides efficient linear algebra routines for such operations.
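
To make the closed form concrete, the sketch below solves the normal equation for a tiny made-up dataset using the nalgebra crate (an extra dependency, not used in the lab); it solves the linear system $(X^T X)\boldsymbol{\theta} = X^T \mathbf{y}$ rather than inverting $X^T X$ explicitly.

rust
use nalgebra::{DMatrix, DVector};

fn main() {
    // Design matrix with a leading column of ones for the intercept
    // (3 examples, 1 feature); the targets satisfy y = 1 + 1*x exactly.
    let x = DMatrix::from_row_slice(3, 2, &[
        1.0, 1.0,
        1.0, 2.0,
        1.0, 3.0,
    ]);
    let y = DVector::from_column_slice(&[2.0, 3.0, 4.0]);

    // Normal equation: solve (X^T X) theta = X^T y.
    let xtx = x.transpose() * &x;
    let xty = x.transpose() * &y;
    let theta = xtx.lu().solve(&xty).expect("X^T X should be invertible");

    println!("theta = {}", theta); // approximately [1, 1]
}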

Derivation: Gradient Descent

For large datasets, gradient descent iteratively updates $\boldsymbol{\theta}$ to minimize $J(\boldsymbol{\theta})$:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$

where $\eta$ is the learning rate, and the gradient is:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{2}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$$

Each iteration updates $\boldsymbol{\theta}$ based on the error $X\boldsymbol{\theta} - \mathbf{y}$. Rust’s performance shines here, as linfa leverages fast matrix operations for gradient computation, minimizing memory overhead.
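
The update rule maps almost directly onto ndarray operations; the following sketch runs plain batch gradient descent on a tiny made-up dataset (the learning rate and iteration count are arbitrary choices for illustration, not linfa internals).

rust
use ndarray::{array, Array1, Array2};

fn main() {
    // Design matrix with a leading column of ones for the intercept
    // (3 examples, 1 feature); the targets satisfy y = 1 + 1*x exactly.
    let x: Array2<f64> = array![[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]];
    let y: Array1<f64> = array![2.0, 3.0, 4.0];
    let m = x.nrows() as f64;

    let mut theta: Array1<f64> = Array1::zeros(x.ncols());
    let eta = 0.1; // learning rate

    for _ in 0..5000 {
        // Gradient: (2/m) * X^T (X theta - y)
        let error = x.dot(&theta) - &y;
        let grad = x.t().dot(&error) * (2.0 / m);
        theta = theta - grad * eta; // theta <- theta - eta * gradient
    }
    println!("theta = {:?}", theta); // converges toward [1.0, 1.0]
}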

Regularization

Regularization prevents overfitting by penalizing large weights, improving generalization.

  • Ridge Regression: Adds an L2 penalty to the loss:

    $$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \boldsymbol{\theta}^T \mathbf{x}_i)^2 + \lambda \sum_{j=1}^{n} w_j^2$$

    The normal equation becomes:

    $$\boldsymbol{\theta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$$

    where $\lambda$ controls the penalty strength and $I$ is the identity matrix (with the intercept entry set to zero so $w_0$ is not penalized).

  • Lasso Regression: Uses an L1 penalty, $\lambda \sum_{j=1}^{n} |w_j|$, promoting sparsity by setting some weights to zero. It lacks a closed-form solution, relying on iterative methods like coordinate descent.

Under the Hood: Ridge regularization stabilizes the inversion of $X^T X$ by making it positive definite, while lasso’s sparsity aids feature selection. In the linfa ecosystem these penalties are provided by the linfa-elasticnet crate, whose iterative updates benefit from Rust’s memory safety.
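
For intuition about where lasso’s exact zeros come from: coordinate descent updates each weight with a soft-thresholding operator. Here is a minimal standalone sketch of that operator (an illustration of the math, not linfa-elasticnet’s internal code).

rust
/// Soft-thresholding: S(z, gamma) = sign(z) * max(|z| - gamma, 0).
/// Coordinate descent for lasso applies this to each weight update;
/// values whose magnitude is below the threshold become exactly zero,
/// which is the source of lasso's sparsity.
fn soft_threshold(z: f64, gamma: f64) -> f64 {
    if z > gamma {
        z - gamma
    } else if z < -gamma {
        z + gamma
    } else {
        0.0
    }
}

fn main() {
    for z in [-2.0, -0.3, 0.0, 0.4, 1.5] {
        println!("S({}, 0.5) = {}", z, soft_threshold(z, 0.5));
    }
}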

Evaluation

Model performance is evaluated with:

  • Mean Squared Error (MSE): Quantifies prediction error, as above.
  • Root Mean Squared Error (RMSE): $\sqrt{\text{MSE}}$, in the same units as $y$.
  • R-squared ($R^2$): Proportion of variance explained: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. $R^2 = 1$ indicates a perfect fit; $R^2 = 0$ matches a mean-only model.

Under the Hood: $R^2$ measures model fit relative to a baseline, but a high $R^2$ on training data may indicate overfitting, necessitating regularization or cross-validation.
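
These metrics are easy to compute from a slice of predictions; the helper below is a small standalone sketch (the sample numbers are made up).

rust
// Minimal sketch: MSE, RMSE, and R^2 from predictions and targets.
fn evaluate(preds: &[f64], targets: &[f64]) -> (f64, f64, f64) {
    let m = targets.len() as f64;
    let ss_res: f64 = preds.iter().zip(targets).map(|(p, t)| (p - t).powi(2)).sum();
    let mse = ss_res / m;
    let rmse = mse.sqrt();
    let mean = targets.iter().sum::<f64>() / m;
    let ss_tot: f64 = targets.iter().map(|t| (t - mean).powi(2)).sum();
    let r2 = 1.0 - ss_res / ss_tot;
    (mse, rmse, r2)
}

fn main() {
    let targets = [3.0, 5.0, 7.0, 9.0];
    let preds = [2.8, 5.1, 7.2, 8.9];
    let (mse, rmse, r2) = evaluate(&preds, &targets);
    println!("MSE = {:.3}, RMSE = {:.3}, R^2 = {:.3}", mse, rmse, r2);
}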

Lab: Linear Regression with linfa

You’ll train linear and ridge regression models on a synthetic dataset (house size, age predicting price), compute predictions, and evaluate performance.

  1. Edit src/main.rs in your rust_ml_tutorial project:

    rust
    use linfa::prelude::*;
    use linfa_elasticnet::ElasticNet;
    use linfa_linear::LinearRegression;
    use ndarray::{array, Array1, Array2};
    
    fn main() {
        // Synthetic dataset: features (size in sqft, age in years), target (price in $)
        let x: Array2<f64> = array![
            [1000.0, 5.0], [1500.0, 10.0], [2000.0, 3.0], [2500.0, 8.0], [3000.0, 2.0]
        ];
        let y: Array1<f64> = array![200000.0, 250000.0, 300000.0, 350000.0, 400000.0];
    
        // Create dataset
        let dataset = Dataset::new(x.clone(), y.clone());
    
        // Train ordinary least-squares linear regression
        let model = LinearRegression::new().fit(&dataset).unwrap();
        println!("Linear Intercept: {}, Weights: {:?}", model.intercept(), model.params());
    
        // Train ridge regression: an elastic net with l1_ratio = 0.0 is a pure L2 penalty
        let ridge_model = ElasticNet::params()
            .penalty(0.1)
            .l1_ratio(0.0)
            .fit(&dataset)
            .unwrap();
        println!("Ridge Intercept: {}, Weights: {:?}", ridge_model.intercept(), ridge_model.hyperplane());
    
        // Predict on the training data and evaluate with MSE and R^2
        let predictions = model.predict(&x);
        let mse = predictions.iter().zip(y.iter())
            .map(|(p, t)| (p - t).powi(2)).sum::<f64>() / x.nrows() as f64;
        let y_mean = y.iter().sum::<f64>() / y.len() as f64;
        let ss_tot = y.iter().map(|y| (y - y_mean).powi(2)).sum::<f64>();
        let ss_res = predictions.iter().zip(y.iter()).map(|(p, t)| (p - t).powi(2)).sum::<f64>();
        let r2 = 1.0 - ss_res / ss_tot;
        println!("MSE: {}, R^2: {}", mse, r2);
    
        // Test prediction for an unseen house
        let test_x = array![[2800.0, 4.0]];
        let test_pred = model.predict(&test_x);
        println!("Prediction for size=2800, age=4: {}", test_pred[0]);
    }
  2. Ensure Dependencies:

    • Verify Cargo.toml includes:
      toml
      [dependencies]
      linfa = "0.7.1"
      linfa-linear = "0.7.0"
      linfa-elasticnet = "0.7.0"
      ndarray = "0.15.0"
    • Run cargo build.
  3. Run the Program:

    bash
    cargo run

    Expected Output (approximate):

    Linear Intercept: ~100000, Weights: [~100, ~0]
    Ridge Intercept: ~100000, Weights: [~100, ~0]
    MSE: ~0, R^2: ~1.0
    Prediction for size=2800, age=4: ~380000

Understanding the Results

  • Dataset: Features (size, age) predict prices; in this synthetic set the prices follow an exact linear trend in size (price = 100·size + 100,000), so the data is deliberately easy to fit.
  • Model: Linear regression recovers that trend, learning an intercept of about 100,000 and weights of roughly $w_1 \approx 100$ for size and $w_2 \approx 0$ for age (age carries no signal in this particular dataset).
  • Ridge: The L2 penalty shrinks the weights slightly; on this clean data the effect is negligible, but on noisier data it can improve generalization.
  • Evaluation: The near-zero MSE and $R^2 \approx 1.0$ reflect the exact linear relationship in the synthetic data; real data is noisier, and a regularized model may perform better on unseen examples.
  • Prediction: The test case (size=2800, age=4) yields a price of about 380,000, consistent with the learned trend.

Under the Hood: linfa solves the least-squares problem with optimized linear algebra on ndarray (optionally backed by BLAS/LAPACK) rather than by explicitly inverting $X^T X$. For very large datasets, iterative methods such as gradient descent become preferable, and Rust’s performance and memory safety help keep such memory-intensive computations fast and correct. The ridge penalty adds a diagonal term to $X^T X$, improving numerical stability.
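
To see that diagonal term concretely, here is a minimal sketch of the ridge normal equation $\boldsymbol{\theta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$ using nalgebra on made-up data (an illustration of the math, not linfa’s implementation); the (0, 0) entry of the penalty is zeroed so the intercept is not shrunk.

rust
use nalgebra::{DMatrix, DVector};

fn main() {
    // Design matrix with a leading column of ones (3 examples, 1 feature).
    let x = DMatrix::from_row_slice(3, 2, &[
        1.0, 1.0,
        1.0, 2.0,
        1.0, 3.0,
    ]);
    let y = DVector::from_column_slice(&[2.0, 3.0, 4.0]);
    let lambda = 0.5;

    // Penalty: lambda on the diagonal, except the intercept position (0, 0).
    let mut penalty = DMatrix::identity(2, 2) * lambda;
    penalty[(0, 0)] = 0.0;

    // Ridge normal equation: solve (X^T X + lambda I) theta = X^T y.
    let lhs = x.transpose() * &x + penalty;
    let rhs = x.transpose() * &y;
    let theta = lhs.lu().solve(&rhs).expect("system should be solvable");

    println!("ridge theta = {}", theta); // slope shrunk below the OLS value of 1.0
}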

Next Steps

Continue to Logistic Regression for classification techniques, or revisit Statistics.

Further Reading

  • An Introduction to Statistical Learning by James et al. (Chapter 3)
  • Andrew Ng’s Machine Learning Specialization (Course 1, Week 2)
  • Hands-On Machine Learning by Géron (Chapter 4)
  • linfa Documentation: github.com/rust-ml/linfa