Logistic Regression
Logistic Regression
Section titled “Logistic Regression”Logistic regression is a cornerstone of supervised learning for classification problems, widely used for its simplicity, interpretability, and robustness. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for discrete classes (e.g., spam vs. not spam, disease vs. no disease). In 2025, logistic regression remains critical in ML pipelines, serving as a baseline, a component in hybrid systems with large language models (LLMs), and a lightweight classifier for edge AI and federated learning.
This article offers a comprehensive 3000+ word exploration of logistic regression, covering theory, mathematics, derivations, optimization, regularization, evaluation, and real-world applications. We’ll provide step-by-step implementations in Python (scikit-learn) and Rust (linfa), ensuring a practical and theoretically rigorous guide aligned with modern ML trends.
1. Motivation and Introduction
Section titled “1. Motivation and Introduction”Logistic regression shines where linear regression falters: predicting probabilities in [0,1]. For example, predicting whether a student passes an exam based on study hours risks non-probabilistic outputs (e.g., -0.3 or 1.2) with linear regression. Logistic regression uses the sigmoid function to map real-valued predictions to probabilities, enabling binary or multiclass classification.
Why Logistic Regression in 2025?
- Baseline: Compares against complex models like transformers.
- Interpretability: Coefficients reveal feature importance.
- Efficiency: Lightweight for edge devices (e.g., IoT).
- LLM Probing: Linear probes on pretrained embeddings for task adaptation.
The logistic function is:
[ \sigma(z) = \frac{1}{1 + e^{-z}} ]
where ( z = \mathbf{w}^T \mathbf{x} + b ). The probability of class 1 is:
[ P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) ]
2. Mathematical Formulation
Section titled “2. Mathematical Formulation”For binary classification (( y \in {0,1} )):
[ \hat{y} = P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}} ]
The decision boundary is where ( P(y=1|\mathbf{x}) = 0.5 ), i.e., ( \mathbf{w}^T \mathbf{x} + b = 0 ), a hyperplane in feature space.
2.1 Log-Odds (Logit Transformation)
Section titled “2.1 Log-Odds (Logit Transformation)”Logistic regression models the log-odds:
[ \text{logit}(p) = \ln \left( \frac{p}{1-p} \right) = \mathbf{w}^T \mathbf{x} + b ]
- Odds: ( \frac{p}{1-p} ).
- Logit: Linear in features, enabling interpretability.
ML Connection
Section titled “ML Connection”- Log-odds linearity underpins logistic regression’s use in linear probing for LLMs.
3. Loss Function: Maximum Likelihood Estimation
Section titled “3. Loss Function: Maximum Likelihood Estimation”Logistic regression optimizes parameters (( \mathbf{w}, b )) via Maximum Likelihood Estimation (MLE). For dataset ( D = { (\mathbf{x}i, y_i) }{i=1}^m ):
[ L(\mathbf{w}, b) = \prod_{i=1}^m P(y_i|\mathbf{x}_i) ]
Probability:
[ P(y_i|\mathbf{x}_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i} ]
where ( \hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b) ).
Log-likelihood (to avoid numerical underflow):
[ \ell(\mathbf{w}, b) = \sum_{i=1}^m \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] ]
Minimize negative log-likelihood (NLL), or binary cross-entropy:
[ J(\mathbf{w}, b) = - \ell(\mathbf{w}, b) ]
Connection to Fisher Information
Section titled “Connection to Fisher Information”The Fisher Information Matrix ( I(\mathbf{w}) ) measures curvature of the log-likelihood, used in natural gradient descent (see lecture on Natural Gradient Descent).
4. Optimization with Gradient Descent
Section titled “4. Optimization with Gradient Descent”Minimize ( J(\mathbf{w}, b) ) using gradient descent:
[ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J, \quad b \leftarrow b - \eta \nabla_b J ]
Gradients:
[ \nabla_{\mathbf{w}} J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) \mathbf{x}_i ]
[ \nabla_b J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) ]
Variants
Section titled “Variants”- Batch Gradient Descent: Full dataset.
- Stochastic Gradient Descent (SGD): One sample.
- Mini-batch: Common in 2025 for large-scale data.
In 2025, optimizers like Adam or Adagrad often enhance logistic regression training.
5. Regularization
Section titled “5. Regularization”Regularization prevents overfitting:
- L2 (Ridge):
[ J(\mathbf{w}, b) = - \sum \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda |\mathbf{w}|_2^2 ]
- L1 (Lasso): Promotes sparsity for feature selection.
[ J(\mathbf{w}, b) = - \sum \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda |\mathbf{w}|_1 ]
- Elastic Net: Combines L1 and L2.
In federated learning, regularization stabilizes models across distributed devices.
6. Extension to Multiclass: Softmax Regression
Section titled “6. Extension to Multiclass: Softmax Regression”For ( K ) classes, use softmax:
[ P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}k^T \mathbf{x} + b_k}}{\sum{j=1}^K e^{\mathbf{w}_j^T \mathbf{x} + b_j}} ]
Loss: Categorical Cross-Entropy.
In 2025, softmax regression is used in LLM classification heads.
7. Evaluation Metrics
Section titled “7. Evaluation Metrics”- Accuracy: Fraction of correct predictions.
- Precision/Recall/F1: For imbalanced datasets.
- ROC Curve & AUC: Performance across thresholds.
- Log-Loss: Measures probability calibration.
- Calibration Curve: In 2025, ensures well-calibrated probabilities for decision-making.
8. Numerical Stability and Conditioning
Section titled “8. Numerical Stability and Conditioning”- Ill-Conditioned Features: High correlation in ( \mathbf{X} ) leads to unstable ( \mathbf{w} ). Regularization (ridge) or preconditioning helps.
- Sigmoid Stability: Avoid overflow in ( e^{-z} ) using stable implementations (e.g., log-sum-exp).
- In 2025, libraries like scikit-learn and linfa use optimized BLAS for stability.
See lecture on Numerical Stability.
9. Implementations
Section titled “9. Implementations”Python (scikit-learn)
Section titled “Python (scikit-learn)”::: code-group
import numpy as npfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import accuracy_score, roc_auc_score, log_loss
# Synthetic datasetX = np.array([[1], [2], [3], [4], [5]])y = np.array([0, 0, 0, 1, 1])
# Train logistic regression with L2model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)model.fit(X, y)
# Predictpreds = model.predict(X)probs = model.predict_proba(X)[:,1]
print("Predictions:", preds)print("Probabilities:", probs)print("Accuracy:", accuracy_score(y, preds))print("Log-Loss:", log_loss(y, probs))print("AUC:", roc_auc_score(y, probs))use linfa::prelude::*;use linfa_logistic::LogisticRegression;use ndarray::{array, Array1};
fn main() { // Features and labels let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]]; let y = Array1::from_vec(vec![0, 0, 0, 1, 1]);
// Dataset let dataset = Dataset::new(x.clone(), y.clone());
// Train logistic regression with L2 let model = LogisticRegression::default().fit(&dataset).unwrap();
// Predict let test_x = array![[2.5], [3.5]]; let preds = model.predict(&test_x); let probs = model.predict_probabilities(&test_x).unwrap(); println!("Predictions: {:?}", preds); println!("Probabilities: {:?}", probs);}:::
Dependencies (Cargo.toml):
[dependencies]linfa = "0.7.1"linfa-logistic = "0.7.1"ndarray = "0.15.6"10. Case Study: Breast Cancer Dataset
Section titled “10. Case Study: Breast Cancer Dataset”::: code-group
from sklearn.datasets import load_breast_cancerfrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import classification_report, roc_auc_score, CalibrationDisplayimport matplotlib.pyplot as plt
# Load datasetdata = load_breast_cancer()X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Scale datascaler = StandardScaler()X_train = scaler.fit_transform(X_train)X_test = scaler.transform(X_test)
# Train with L1 regularizationmodel = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)model.fit(X_train, y_train)
# Evaluatey_pred = model.predict(X_test)y_prob = model.predict_proba(X_test)[:,1]print("Classification Report:\n", classification_report(y_test, y_pred))print("AUC:", roc_auc_score(y_test, y_prob))
# Calibration plot (2025 trend)disp = CalibrationDisplay.from_estimator(model, X_test, y_test)plt.title("Calibration Curve")plt.show()use linfa::prelude::*;use linfa_logistic::LogisticRegression;use ndarray::{Array2, Array1};
fn main() { // Placeholder: Load breast cancer dataset (not natively in Rust) let x_train: Array2<f64> = Array2::zeros((455, 30)); // Simplified let y_train: Array1<i32> = Array1::zeros(455); let x_test: Array2<f64> = Array2::zeros((114, 30)); let y_test: Array1<i32> = Array1::zeros(114);
// Dataset let dataset = Dataset::new(x_train, y_train);
// Train with L1 let model = LogisticRegression::default() .max_iterations(1000) .fit(&dataset) .unwrap();
// Evaluate let preds = model.predict(&x_test); println!("Predictions: {:?}", preds);}:::
Note: Rust lacks native breast cancer dataset; use Python for full example or load via CSV.
11. Applications in 2025
Section titled “11. Applications in 2025”- Medical Diagnosis: Predicting disease (e.g., breast cancer).
- Spam Detection: Classifying emails.
- LLM Probing: Linear probes on embeddings for task adaptation.
- Federated Learning: Lightweight classifier on edge devices.
- IoT: Classifying sensor data for anomaly detection.
12. Under the Hood Insights
Section titled “12. Under the Hood Insights”- Discriminative Model: Models ( P(y|\mathbf{x}) ), unlike generative Naive Bayes.
- Linear Separability: Assumes log-odds linearity.
- Deep Learning: Final layer in binary classifiers.
- Numerical Stability: Regularization mitigates multicollinearity.
13. Limitations
Section titled “13. Limitations”- Assumes linear decision boundaries; use kernels or neural nets for nonlinearity.
- Sensitive to multicollinearity; requires feature preprocessing.
- Struggles with high-dimensional sparse data without regularization.
14. Summary
Section titled “14. Summary”Logistic regression remains a workhorse of classification in 2025, balancing simplicity, interpretability, and efficiency. Its theoretical foundations (log-odds, MLE, Fisher Information) and practical applications (LLM probing, federated learning) make it indispensable.
Further Reading
Section titled “Further Reading”- James, Introduction to Statistical Learning (Ch. 4).
- Ng, Machine Learning Yearning.
- Géron, Hands-On ML (Ch. 4).
linfa-logisticdocs: github.com/rust-ml/linfa.
Updated September 17, 2025: Added 2025 trends, expanded applications, and calibration metrics.