Linear Regression
Linear regression is often called the “hello world” of machine learning. It is one of the oldest and most fundamental supervised learning algorithms, dating back to the 19th century with Francis Galton’s studies on heredity. Despite its simplicity, linear regression remains widely used in economics, biology, engineering, and even modern AI pipelines as a baseline model or as part of more complex systems. In 2025, with the rise of large language models and foundation models, linear regression serves as a crucial component in techniques like linear probing for feature extraction from pretrained embeddings or as a simple yet effective baseline for evaluating advanced models.
This article is a comprehensive 3000+ word deep dive into linear regression. We’ll go far beyond the surface, exploring not just “how” but “why” it works, its mathematics, geometry, statistical underpinnings, regularization methods, evaluation, and practical implementation in both Rust (linfa) and Python (scikit-learn). We've updated this article to include connections to contemporary ML practices, such as its role in hybrid systems with deep learning and its use in efficient edge computing.
1. Motivation and Intuition
Why do we start ML with linear regression?
- Simplicity: It is easy to understand — predicting one variable as a linear function of others.
- Interpretability: Each coefficient directly tells us how much the output changes with a unit change in input.
- Foundation: Many advanced models (logistic regression, neural networks) build on linear regression ideas.
- Baseline: Linear regression is often used as a benchmark to compare more sophisticated models, especially in 2025 where it's used to probe frozen large models for quick adaptations.
Real-world Examples
- Economics: Predicting income from years of education.
- Medicine: Predicting blood pressure from BMI and age.
- Business: Predicting sales from advertising spend.
- AI (2025): Linear regression as a "probe" on embeddings from LLMs to quickly adapt to new tasks without fine-tuning.
2. Mathematical Formulation
Single-variable Regression
The simplest case models the relationship between a single input $x$ and a single output $y$:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Here:
- $\beta_0$ = intercept (bias term).
- $\beta_1$ = slope (how much $y$ changes per unit change in $x$).
This fits a line through the data points.
Multivariate Regression
With $n$ input features, the model becomes:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

Here:
- $\mathbf{x} = (x_1, \dots, x_n)$ is the feature vector.
- $\boldsymbol{\beta}$ is the weight vector.

In matrix form, with $m$ training samples:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where:
- $X \in \mathbb{R}^{m \times (n+1)}$ is the design matrix (with a column of 1s for the intercept).
- $\boldsymbol{\beta} \in \mathbb{R}^{n+1}$ is the weight vector, including the intercept.
- $\mathbf{y} \in \mathbb{R}^{m}$ is the output vector.
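To make the notation concrete, here is a minimal NumPy sketch (using the same toy housing data as the implementation section later in this article) that builds the design matrix by prepending a column of 1s:

```python
import numpy as np

# Toy data: 5 samples, 2 features (square footage, age in years) -> price
X = np.array([[1000.0, 5.0],
              [1500.0, 10.0],
              [2000.0, 3.0],
              [2500.0, 8.0],
              [3000.0, 2.0]])
y = np.array([200000.0, 250000.0, 300000.0, 350000.0, 400000.0])

# Design matrix: prepend a column of 1s so the intercept beta_0 is absorbed
# into the weight vector.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_design.shape)  # (5, 3): m samples, n + 1 columns
```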
3. Loss Function: Mean Squared Error (MSE)
The most common objective is to minimize the Mean Squared Error (MSE):

$$\text{MSE}(\boldsymbol{\beta}) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

where $\hat{y}_i = \mathbf{x}_i^\top \boldsymbol{\beta}$ is the model's prediction for sample $i$.
Intuition
- Squaring punishes large errors more than small ones.
- Averaging gives a single measure of performance.
- Equivalent to assuming Gaussian noise in data.
In 2025, while MSE remains standard for regression, alternative losses like Huber are used for robustness in noisy datasets from edge AI devices.
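As a quick worked example, here is a minimal NumPy sketch that computes the MSE by hand for three hypothetical predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])

# MSE: average of squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```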
4. Closed-form Solution: Normal Equation
We can minimize the MSE directly by solving:

$$\boldsymbol{\beta} = (X^\top X)^{-1} X^\top \mathbf{y}$$

This is called the normal equation.
Derivation
Starting from the squared-error loss:

$$L(\boldsymbol{\beta}) = \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2$$

Take the gradient and set it to zero:

$$\nabla_{\boldsymbol{\beta}} L = -2 X^\top (\mathbf{y} - X\boldsymbol{\beta}) = 0$$

Rearrange:

$$X^\top X \boldsymbol{\beta} = X^\top \mathbf{y}$$

Solve for $\boldsymbol{\beta}$:

$$\boldsymbol{\beta} = (X^\top X)^{-1} X^\top \mathbf{y}$$

Computational Cost
- Inverting $X^\top X$ costs $O(n^3)$ → expensive when the number of features $n$ is large.
- For high-dimensional problems, we prefer iterative methods like gradient descent.
In 2025, with massive datasets, distributed computing (e.g., via Spark or Dask) or approximate methods like stochastic gradient descent are preferred over exact inversion.
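To illustrate, here is a minimal NumPy sketch of the normal equation on synthetic data (the coefficients below are arbitrary). It solves the linear system $X^\top X \boldsymbol{\beta} = X^\top \mathbf{y}$ rather than explicitly inverting $X^\top X$, which is cheaper and more stable:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
# Design matrix with an intercept column of 1s
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_beta = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=m)

# Normal equation: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, 1.0, -3.0, 0.5]
```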
5. Gradient Descent Approach
Instead of direct inversion, we use gradient descent.
Update rule:

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \alpha \, \nabla_{\boldsymbol{\beta}} \text{MSE}(\boldsymbol{\beta})$$

where $\alpha$ is the learning rate. Gradient:

$$\nabla_{\boldsymbol{\beta}} \text{MSE}(\boldsymbol{\beta}) = -\frac{2}{m} X^\top (\mathbf{y} - X\boldsymbol{\beta})$$
Variants
- Batch Gradient Descent: Uses all data per step.
- Stochastic Gradient Descent (SGD): Updates using one sample at a time.
- Mini-batch: Compromise between batch and SGD, standard in 2025 for large-scale training.
Why Gradient Descent?
- Scales to large datasets.
- Works with streaming data.
- Forms basis of deep learning optimization.
In modern ML, advanced variants like Adam or RMSprop build on SGD for faster convergence in high-dimensional spaces.
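As a concrete sketch, the update rule above can be implemented in a few lines of NumPy. This is batch gradient descent on synthetic data; the learning rate and iteration count are arbitrary choices for this toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])  # intercept column + 2 features
y = X @ np.array([1.0, 2.0, -1.5]) + 0.1 * rng.normal(size=m)

beta = np.zeros(X.shape[1])
alpha = 0.1            # learning rate
for _ in range(500):   # batch gradient descent: uses the full dataset per step
    grad = -(2.0 / m) * X.T @ (y - X @ beta)  # gradient of the MSE
    beta -= alpha * grad

print(beta)  # converges toward [1.0, 2.0, -1.5]
```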
6. Regularization
Regularization prevents overfitting by penalizing large weights.
Ridge Regression (L2)

$$\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2 + \lambda \lVert \boldsymbol{\beta} \rVert_2^2$$

Solution:

$$\boldsymbol{\beta} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$

Lasso Regression (L1)

$$\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2 + \lambda \lVert \boldsymbol{\beta} \rVert_1$$

- Promotes sparsity → automatic feature selection.

Elastic Net
Combines the L1 and L2 penalties, trading off sparsity against the coefficient shrinkage of Ridge.
In 2025, regularization is key in federated learning to handle distributed data noise.
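For a hands-on comparison, here is a minimal scikit-learn sketch (synthetic data; the alpha values are arbitrary) fitting Ridge, Lasso, and Elastic Net side by side. Note how Lasso tends to zero out the uninformative coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features are informative
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: typically drives irrelevant coefficients to 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Ridge:      ", np.round(ridge.coef_, 2))
print("Lasso:      ", np.round(lasso.coef_, 2))
print("ElasticNet: ", np.round(enet.coef_, 2))
```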
7. Statistical View
Linear regression is not just an algorithm — it has deep statistical roots.
- BLUE: Best Linear Unbiased Estimator (Gauss–Markov theorem).
- MLE Interpretation: Minimizing MSE = maximizing likelihood under Gaussian noise.
- Confidence Intervals: We can estimate uncertainty in coefficients.
With the growth of probabilistic ML in 2025, linear regression often serves as a component in Bayesian models for efficient inference.
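As a brief illustration of coefficient uncertainty, here is a minimal sketch using the statsmodels library (not used elsewhere in this article; assumed to be installed) to fit ordinary least squares and report 95% confidence intervals for the coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

X = sm.add_constant(x)               # adds the intercept column
results = sm.OLS(y, X).fit()         # ordinary least squares
print(results.params)                # point estimates for intercept and slope
print(results.conf_int(alpha=0.05))  # 95% confidence intervals for each coefficient
```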
8. Evaluation Metrics
- MSE: Penalizes large errors.
- RMSE: Square root of MSE, in the same units as $y$.
- MAE: Mean absolute error (robust to outliers).
- $R^2$: Proportion of variance explained: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$.
- Adjusted $R^2$: Like $R^2$, but penalized for the number of features, so adding uninformative predictors does not inflate the score.
In 2025, metrics like MAPE (Mean Absolute Percentage Error) are used for time-series regression tasks.
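Here is a minimal scikit-learn sketch computing these metrics on hypothetical predictions for the toy housing data used in the implementation section below:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([200000.0, 250000.0, 300000.0, 350000.0, 400000.0])
y_pred = np.array([210000.0, 240000.0, 305000.0, 345000.0, 395000.0])  # hypothetical model output

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as y
mae = mean_absolute_error(y_true, y_pred)  # robust to outliers
r2 = r2_score(y_true, y_pred)              # proportion of variance explained

print(f"MSE={mse:.1f}, RMSE={rmse:.1f}, MAE={mae:.1f}, R^2={r2:.4f}")
```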
9. Implementation (Rust & Python)
Python (scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy housing data: [square footage, age in years] -> price
X = np.array([[1000, 5], [1500, 10], [2000, 3], [2500, 8], [3000, 2]])
y = np.array([200000, 250000, 300000, 350000, 400000])

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Weights:", model.coef_)
print("Prediction for 2800, 4:", model.predict([[2800, 4]])[0])
```

Rust (linfa):

```rust
use linfa::prelude::*;
use linfa_linear::LinearRegression;
use ndarray::{array, Array1, Array2};

fn main() {
    // Same toy housing data: [square footage, age in years] -> price
    let x: Array2<f64> = array![
        [1000.0, 5.0], [1500.0, 10.0], [2000.0, 3.0], [2500.0, 8.0], [3000.0, 2.0]
    ];
    let y: Array1<f64> = array![200000.0, 250000.0, 300000.0, 350000.0, 400000.0];

    let dataset = Dataset::new(x, y);
    let model = LinearRegression::default().fit(&dataset).unwrap();

    println!("Intercept: {}", model.intercept());
    println!("Weights: {:?}", model.params());

    let test_x = array![[2800.0, 4.0]];
    let pred = model.predict(&test_x);
    println!("Prediction for 2800,4: {}", pred[0]);
}
```

10. Under the Hood: Numerical Stability
- The normal equation can be numerically unstable if $X^\top X$ is ill-conditioned (e.g., when features are highly correlated); see the sketch below.
- QR decomposition (or SVD) is often preferred in practice.
- Rust's linfa can use optimized BLAS/LAPACK libraries for stability.
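A minimal NumPy sketch of the issue: two nearly collinear features make $X^\top X$ ill-conditioned, while np.linalg.lstsq (which uses an SVD-based LAPACK routine rather than forming $X^\top X$) still recovers reasonable coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 1e-8 * rng.normal(size=100)   # nearly a copy of x1 -> ill-conditioning
X = np.column_stack([np.ones(100), x1, x2])
y = 1.0 + 2.0 * x1 + 0.1 * rng.normal(size=100)

print(np.linalg.cond(X.T @ X))  # huge condition number

# Solve the least-squares problem directly, without forming/inverting X^T X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```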
In 2025, with quantum-inspired algorithms emerging, stability in hybrid quantum-classical regression is a hot topic.
11. Extensions
- Polynomial Regression: Adds non-linear terms by expanding the features (see the sketch below).
- Generalized Linear Models: Logistic regression is just one variant.
- Regularized Regression: Ridge, Lasso, and Elastic Net are widely used in ML competitions.
In 2025, extensions include linear regression on quantized models for edge AI.
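As a sketch of the first extension, polynomial regression is just linear regression on expanded features. Here is a minimal scikit-learn pipeline on synthetic quadratic data (the degree and noise level are arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.2, size=100)

# Expand features to (x, x^2), then fit an ordinary linear model on them.
# The model is still linear in the weights.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))  # roughly 0.5*4 - 2 = 0
```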
12. Case Studies
- Economics: Predicting salaries from experience and education.
- Medicine: Predicting cholesterol levels from BMI and diet.
- AI (2025): Linear probing on LLM embeddings for quick task adaptation.
- Sustainability: Predicting energy consumption from IoT sensor data in smart grids.
13. Summary
- Linear regression is simple but powerful.
- Connects geometry, statistics, and optimization.
- Forms foundation for more advanced ML models.
Next: Move to Logistic Regression.
Further Reading
- An Introduction to Statistical Learning (Chapter 3).
- Andrew Ng’s ML Course (Week 2).
- Hands-On Machine Learning (Chapter 4).
- Rust linfa: github.com/rust-ml/linfa.
Updated September 17, 2025: Added 2025 ML connections, expanded real-world examples, and updated code for clarity.