Linear Regression

Linear regression is often called the “hello world” of machine learning. It is one of the oldest and most fundamental supervised learning algorithms, dating back to the 19th century and Francis Galton’s studies on heredity. Despite its simplicity, linear regression remains widely used in economics, biology, engineering, and modern AI pipelines, whether as a baseline model or as a component of more complex systems. In 2025, with the rise of large language models and foundation models, linear regression also serves as a crucial component in techniques like linear probing, where a linear model is fit on features extracted from pretrained embeddings, and as a simple yet effective baseline for evaluating advanced models.

This article is a comprehensive 3000+ word deep dive into linear regression. We go far beyond the surface, exploring not just “how” but “why” it works: its mathematics, geometry, statistical underpinnings, regularization methods, evaluation, and practical implementation in both Rust (linfa) and Python (scikit-learn). The article also covers connections to contemporary ML practice, such as hybrid systems with deep learning and efficient edge computing.
1. Motivation and Intuition

Why do we start ML with linear regression?
- Simplicity: It is easy to understand — predicting one variable as a linear function of others.
- Interpretability: Each coefficient directly tells us how much the output changes with a unit change in input.
- Foundation: Many advanced models (logistic regression, neural networks) build on linear regression ideas.
- Baseline: Linear regression is often used as a benchmark for more sophisticated models, and in 2025 it is routinely used to probe frozen foundation models for quick adaptation.
Real-world Examples

- Economics: Predicting income from years of education.
- Medicine: Predicting blood pressure from BMI and age.
- Business: Predicting sales from advertising spend.
- AI (2025): Linear regression as a “probe” on embeddings from LLMs to quickly adapt to new tasks without fine-tuning.
2. Mathematical Formulation

Single-variable Regression

The simplest case models the relationship between an input $x$ and an output $y$:

$$y = w_0 + w_1 x$$

Here:

- $w_0$ = intercept (bias term).
- $w_1$ = slope (how much $y$ changes per unit of $x$).

This fits a line through the data points.
Multivariate Regression

With $d$ features, we generalize:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$

Here:

- $x = (x_1, \dots, x_d)$ is the feature vector.
- $w = (w_1, \dots, w_d)$ is the weight vector.

In matrix form, with $n$ samples:

$$\hat{y} = Xw$$

where:

- $X \in \mathbb{R}^{n \times (d+1)}$ is the design matrix (with a column of 1s for the intercept).
- $w \in \mathbb{R}^{d+1}$ is the weight vector (including the intercept $w_0$).
- $y \in \mathbb{R}^{n}$ is the output vector.
3. Loss Function: Mean Squared Error (MSE)

The most common objective is to minimize the Mean Squared Error (MSE):

$$\text{MSE}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = w^\top x_i$.

::: info Intuition
- Squaring punishes large errors more than small ones.
- Averaging gives a single measure of performance.
- Minimizing MSE is equivalent to assuming Gaussian noise in the data.
:::
In 2025, while MSE remains standard for regression, alternative losses like the Huber loss are used for robustness on noisy datasets from edge AI devices.
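As a minimal sketch (with made-up numbers), the MSE formula above translates directly into NumPy:

```python
import numpy as np

# Hypothetical targets and model predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# MSE: mean of squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.4166... (= (0.25 + 0 + 1) / 3)
```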
4. Closed-form Solution: Normal Equation

We can minimize the MSE directly by solving:

$$\hat{w} = (X^\top X)^{-1} X^\top y$$

This is called the normal equation.

Derivation

Starting from:

$$L(w) = \lVert y - Xw \rVert^2$$

Take the gradient and set it to zero:

$$\nabla_w L = -2 X^\top (y - Xw) = 0$$

Rearrange:

$$X^\top X w = X^\top y$$

Solve for $w$:

$$\hat{w} = (X^\top X)^{-1} X^\top y$$
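The derivation can be checked numerically; here is a sketch on a toy dataset (values chosen arbitrarily):

```python
import numpy as np

# Toy data following y = 2 + 3x exactly, so the fit should recover it
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # [2. 3.]
```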
Computational Cost

- Matrix inversion is $O(d^3)$ → expensive for large $d$.
- For high dimensions, we prefer iterative methods like gradient descent.
In 2025, with massive datasets, distributed computing (e.g., via Spark or Dask) or approximate methods like stochastic gradient descent are preferred over exact inversion.
5. Gradient Descent Approach

Instead of direct inversion, we use gradient descent.

Update rule:

$$w \leftarrow w - \eta \, \nabla_w L(w)$$

Gradient:

$$\nabla_w L(w) = -\frac{2}{n} X^\top (y - Xw)$$
Variants

- Batch Gradient Descent: Uses all data per step.
- Stochastic Gradient Descent (SGD): Updates using one sample at a time.
- Mini-batch: Compromise between batch and SGD, standard in 2025 for large-scale training.
::: info Why Gradient Descent?
- Scales to large datasets.
- Works with streaming data.
- Forms basis of deep learning optimization. :::
In modern ML, advanced variants like Adam or RMSprop build on SGD for faster convergence in high-dimensional spaces.
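Batch gradient descent on the MSE can be sketched in a few lines (toy data, learning rate chosen by hand):

```python
import numpy as np

# Toy data following y = 1 + 2x exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)  # start from zero weights
eta = 0.1        # learning rate
for _ in range(5000):
    grad = -(2 / len(y)) * X.T @ (y - X @ w)  # gradient of the MSE
    w -= eta * grad
print(w)  # converges to roughly [1. 2.]
```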
6. Regularization

Regularization prevents overfitting by penalizing large weights.

Ridge Regression (L2)

Adds an $L_2$ penalty to the loss:

$$L(w) = \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert_2^2$$

Solution:

$$\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$$

Lasso Regression (L1)

Adds an $L_1$ penalty:

$$L(w) = \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert_1$$

- Promotes sparsity → automatic feature selection.

Elastic Net

Combines the L1 and L2 penalties.
In 2025, regularization is key in federated learning to handle distributed data noise.
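A sketch of the ridge closed form against plain least squares, on random toy data (for brevity, all weights are penalized, though in practice the intercept is usually left unpenalized):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)

lam = 1.0
# Ridge: solve (X^T X + lambda I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# Ordinary least squares for comparison
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True: ridge shrinks the weights
```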
7. Statistical View

Linear regression is not just an algorithm — it has deep statistical roots.
- BLUE: Best Linear Unbiased Estimator (Gauss–Markov theorem).
- MLE Interpretation: Minimizing MSE = maximizing likelihood under Gaussian noise.
- Confidence Intervals: We can estimate uncertainty in coefficients.
With the growth of probabilistic ML in 2025, linear regression often serves as a component in Bayesian models for efficient inference.
8. Evaluation Metrics

- MSE: Penalizes large errors.
- RMSE: Square root of MSE, in the same units as $y$.
- MAE: Mean absolute error (robust to outliers).
- $R^2$: Proportion of variance explained:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Adjusted $R^2$ penalizes extra features.
In 2025, metrics like MAPE (Mean Absolute Percentage Error) are used for time-series regression tasks.
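The metrics above can be computed by hand with NumPy (made-up predictions):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.6])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                     # same units as y
mae = np.mean(np.abs(y_true - y_pred))  # robust to outliers
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(rmse, mae, r2)  # MAE = 0.25, R^2 = 0.985
```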
9. Implementation (Rust & Python)

::: code-group

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy dataset: two features, one target
X = np.array([[1000, 5], [1500, 10], [2000, 3], [2500, 8], [3000, 2]])
y = np.array([200000, 250000, 300000, 350000, 400000])

model = LinearRegression().fit(X, y)

print("Intercept:", model.intercept_)
print("Weights:", model.coef_)
print("Prediction for 2800, 4:", model.predict([[2800, 4]])[0])
```

```rust
use linfa::prelude::*;
use linfa_linear::LinearRegression;
use ndarray::{array, Array1, Array2};

fn main() {
    let x: Array2<f64> = array![
        [1000.0, 5.0],
        [1500.0, 10.0],
        [2000.0, 3.0],
        [2500.0, 8.0],
        [3000.0, 2.0]
    ];
    let y: Array1<f64> = array![200000.0, 250000.0, 300000.0, 350000.0, 400000.0];

    let dataset = Dataset::new(x.clone(), y.clone());

    let model = LinearRegression::default().fit(&dataset).unwrap();
    println!("Intercept: {}", model.intercept());
    println!("Weights: {:?}", model.params());

    let test_x = array![[2800.0, 4.0]];
    let pred = model.predict(&test_x);
    println!("Prediction for 2800,4: {}", pred[0]);
}
```

:::
10. Under the Hood: Numerical Stability

- The normal equation can be unstable if $X^\top X$ is ill-conditioned.
- QR decomposition is often preferred in practice.
- Rust’s `linfa` can use optimized BLAS/LAPACK libraries for stability.
In 2025, with quantum-inspired algorithms emerging, stability in hybrid quantum-classical regression is a hot topic.
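The conditioning issue can be illustrated numerically: forming $X^\top X$ squares the condition number of $X$, which is why solvers that factor $X$ directly (QR, or the SVD used by NumPy’s `np.linalg.lstsq`) are preferred. A sketch on a polynomial design matrix:

```python
import numpy as np

# A moderately ill-conditioned polynomial design matrix
x = np.linspace(0.0, 1.0, 20)
X = np.vander(x, 6)
y = X @ np.ones(6)  # true weights are all ones

# Forming X^T X squares the condition number
print(np.linalg.cond(X) ** 2, np.linalg.cond(X.T @ X))  # same order of magnitude

# lstsq factors X directly and recovers the weights accurately
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to [1. 1. 1. 1. 1. 1.]
```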
11. Extensions

- Polynomial Regression: Adds non-linear terms while staying linear in the weights.
- Generalized Linear Models: Logistic regression is just one variant.
- Regularized Regression: Widely used in ML competitions.
In 2025, extensions include linear regression on quantized models for edge AI.
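Polynomial regression is still linear regression in the expanded features; a minimal NumPy sketch:

```python
import numpy as np

# Target is exactly quadratic, so a degree-2 fit should recover it
x = np.linspace(-2.0, 2.0, 30)
y = x ** 2

# Expand x into [1, x, x^2]: the model stays linear in the weights
X = np.column_stack([np.ones_like(x), x, x ** 2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [0. 0. 1.]
```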
12. Case Studies

- Economics: Predicting salaries from experience and education.
- Medicine: Predicting cholesterol levels from BMI and diet.
- AI (2025): Linear probing on LLM embeddings for quick task adaptation.
- Sustainability: Predicting energy consumption from IoT sensor data in smart grids.
13. Summary

- Linear regression is simple but powerful.
- Connects geometry, statistics, and optimization.
- Forms foundation for more advanced ML models.
Next: Move to Logistic Regression.
Further Reading

- An Introduction to Statistical Learning (Chapter 3).
- Andrew Ng’s ML Course (Week 2).
- Hands-On Machine Learning (Chapter 4).
- Rust `linfa`: github.com/rust-ml/linfa.
Updated September 17, 2025: Added 2025 ML connections, expanded real-world examples, and updated code for clarity.