Logistic Regression

Logistic regression is a cornerstone of supervised learning for classification problems, widely used for its simplicity, interpretability, and robustness. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for discrete classes (e.g., spam vs. not spam, disease vs. no disease). In 2025, logistic regression remains critical in ML pipelines, serving as a baseline, a component in hybrid systems with large language models (LLMs), and a lightweight classifier for edge AI and federated learning.

This article offers a comprehensive 3000+ word exploration of logistic regression, covering theory, mathematics, derivations, optimization, regularization, evaluation, and real-world applications. We’ll provide step-by-step implementations in Python (scikit-learn) and Rust (linfa), ensuring a practical and theoretically rigorous guide aligned with modern ML trends.


1. Motivation and Introduction

Logistic regression shines where linear regression falters: predicting probabilities in [0,1]. For example, predicting whether a student passes an exam based on study hours risks non-probabilistic outputs (e.g., -0.3 or 1.2) with linear regression. Logistic regression uses the sigmoid function to map real-valued predictions to probabilities, enabling binary or multiclass classification.

Why Logistic Regression in 2025?

  • Baseline: A simple, strong reference point against which complex models such as transformers are compared.
  • Interpretability: Coefficients reveal feature importance.
  • Efficiency: Lightweight for edge devices (e.g., IoT).
  • LLM Probing: Linear probes on pretrained embeddings for task adaptation.

The logistic function is:

[ \sigma(z) = \frac{1}{1 + e^{-z}} ]

where ( z = \mathbf{w}^T \mathbf{x} + b ). The probability of class 1 is:

[ P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) ]
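
As a quick illustration, here is a minimal NumPy sketch (the parameter values are made up for the example) of how the sigmoid turns the linear score into a probability:

python
import numpy as np

def sigmoid(z):
    # Map a real-valued score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5]), -4.0   # illustrative parameters
x = np.array([3.0])            # a single feature vector
z = w @ x + b                  # linear score w^T x + b
print(sigmoid(z))              # P(y=1 | x) ≈ 0.62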


2. Mathematical Formulation

For binary classification (( y \in \{0, 1\} )):

[ \hat{y} = P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}} ]

The decision boundary is where ( P(y=1|\mathbf{x}) = 0.5 ), i.e., ( \mathbf{w}^T \mathbf{x} + b = 0 ), a hyperplane in feature space.
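
For instance, with a single feature and illustrative parameters ( w = 2 ), ( b = -6 ), the boundary is ( 2x - 6 = 0 ), i.e. ( x = 3 ): inputs with ( x > 3 ) receive ( P(y=1|\mathbf{x}) > 0.5 ) and are predicted as class 1, while inputs with ( x < 3 ) are predicted as class 0.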

2.1 Log-Odds (Logit Transformation)

Logistic regression models the log-odds:

[ \text{logit}(p) = \ln \left( \frac{p}{1-p} \right) = \mathbf{w}^T \mathbf{x} + b ]

  • Odds: ( \frac{p}{1-p} ).
  • Logit: Linear in features, enabling interpretability.

ML Connection

  • Log-odds linearity underpins logistic regression’s use in linear probing for LLMs.
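
To make the interpretability point concrete, here is a small scikit-learn sketch (synthetic data; variable names are illustrative): exponentiating a learned weight gives the odds ratio, i.e. the multiplicative change in the odds of class 1 per unit increase in that feature.

python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the single feature is hours studied, the label is pass/fail
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
w = model.coef_[0][0]
print("weight:", w)
print("odds ratio per extra hour:", np.exp(w))  # multiplicative change in the odds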

3. Loss Function: Maximum Likelihood Estimation

Logistic regression optimizes parameters (( \mathbf{w}, b )) via Maximum Likelihood Estimation (MLE). For dataset ( D = \{ (\mathbf{x}_i, y_i) \}_{i=1}^m ):

[ L(\mathbf{w}, b) = \prod_{i=1}^m P(y_i|\mathbf{x}_i) ]

Probability:

[ P(y_i|\mathbf{x}_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i} ]

where ( \hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b) ).

Log-likelihood (taking the log turns the product into a sum and avoids numerical underflow):

[ \ell(\mathbf{w}, b) = \sum_{i=1}^m \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] ]

Minimize negative log-likelihood (NLL), or binary cross-entropy:

[ J(\mathbf{w}, b) = - \ell(\mathbf{w}, b) ]
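
A short NumPy sketch of the binary cross-entropy computation (toy labels and probabilities; the clipping constant is an arbitrary safeguard, and the loss is averaged rather than summed, which differs only by the factor ( m )):

python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])     # predicted P(y=1 | x)
print(binary_cross_entropy(y_true, y_prob))  # ≈ 0.47, matches sklearn.metrics.log_loss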

Connection to Fisher Information

The Fisher Information Matrix ( I(\mathbf{w}) ) measures curvature of the log-likelihood, used in natural gradient descent (see lecture on Natural Gradient Descent).


4. Optimization with Gradient Descent

Minimize ( J(\mathbf{w}, b) ) using gradient descent:

[ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J, \quad b \leftarrow b - \eta \nabla_b J ]

Gradients (written here as averages over the ( m ) training samples):

[ \nabla_{\mathbf{w}} J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) \mathbf{x}_i ]

[ \nabla_b J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) ]
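
A from-scratch sketch of these updates, batch gradient descent on a toy one-feature dataset (the learning rate and iteration count are arbitrary choices for the example):

python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1])
m, n = X.shape

w, b, eta = np.zeros(n), 0.0, 0.1
for _ in range(5000):
    y_hat = sigmoid(X @ w + b)       # predictions for the whole batch
    grad_w = X.T @ (y_hat - y) / m   # (1/m) * sum (y_hat_i - y_i) x_i
    grad_b = np.mean(y_hat - y)      # (1/m) * sum (y_hat_i - y_i)
    w -= eta * grad_w
    b -= eta * grad_b

print("w:", w, "b:", b)              # learned boundary -b/w lands between x = 3 and x = 4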

Variants

  • Batch Gradient Descent: Full dataset.
  • Stochastic Gradient Descent (SGD): One sample.
  • Mini-batch: Common in 2025 for large-scale data.

In 2025, adaptive optimizers such as Adam or Adagrad are often used in place of plain gradient descent, particularly for large or sparse feature sets.


5. Regularization

Regularization prevents overfitting:

  • L2 (Ridge): Shrinks weights smoothly toward zero, reducing variance without eliminating features.

[ J(\mathbf{w}, b) = - \sum_{i=1}^m \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda \|\mathbf{w}\|_2^2 ]

  • L1 (Lasso): Promotes sparsity for feature selection.

[ J(\mathbf{w}, b) = - \sum_{i=1}^m \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda \|\mathbf{w}\|_1 ]

  • Elastic Net: Combines L1 and L2.

In federated learning, regularization stabilizes models across distributed devices.
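
In scikit-learn these penalties map onto the penalty, C, and l1_ratio arguments (C is the inverse regularization strength, so smaller C means stronger regularization); a brief configuration sketch:

python
from sklearn.linear_model import LogisticRegression

# L2 (ridge): the default penalty, works with the default 'lbfgs' solver
ridge = LogisticRegression(penalty='l2', C=1.0)

# L1 (lasso): needs a solver that supports it, e.g. 'liblinear' or 'saga'
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# Elastic net: only the 'saga' solver; l1_ratio blends the L1 and L2 terms
enet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1.0)

Each estimator is then trained with .fit(X, y) exactly as in the unregularized case.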


6. Extension to Multiclass: Softmax Regression

For ( K ) classes, use softmax:

[ P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x} + b_k}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x} + b_j}} ]

Loss: Categorical Cross-Entropy.

In 2025, softmax regression is used in LLM classification heads.
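
A compact NumPy sketch of the softmax mapping itself (toy class scores; subtracting the maximum score is the usual overflow guard):

python
import numpy as np

def softmax(z):
    # Subtract the max score for numerical stability before exponentiating
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # w_k^T x + b_k for K = 3 classes
probs = softmax(scores)
print(probs, probs.sum())           # ≈ [0.659 0.242 0.099], sums to 1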


7. Evaluation Metrics

  • Accuracy: Fraction of correct predictions.
  • Precision/Recall/F1: For imbalanced datasets.
  • ROC Curve & AUC: Performance across thresholds.
  • Log-Loss: Measures probability calibration.
  • Calibration Curve: Checks whether predicted probabilities match observed frequencies; increasingly emphasized in 2025 for downstream decision-making.
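
To make the precision/recall/F1 definitions concrete, here is a small sketch (hand-picked labels and predictions) that computes them from the confusion matrix and checks against scikit-learn:

python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # 4 / (4 + 2) ≈ 0.667
recall = tp / (tp + fn)      # 4 / (4 + 2) ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))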

8. Numerical Stability and Conditioning

  • Ill-Conditioned Features: High correlation in ( \mathbf{X} ) leads to unstable ( \mathbf{w} ). Regularization (ridge) or preconditioning helps.
  • Sigmoid Stability: Avoid overflow in ( e^{-z} ) using stable implementations (e.g., log-sum-exp).
  • In 2025, libraries like scikit-learn and linfa ship numerically careful implementations (stable sigmoid evaluation, well-tested linear algebra backends).

See lecture on Numerical Stability.
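
Below is a sketch of the standard branching trick for a stable sigmoid (one common approach, closely related to log-sum-exp; in practice scipy.special.expit or the library's built-in implementation is preferable):

python
import numpy as np

def stable_sigmoid(z):
    # For z >= 0, 1 / (1 + e^{-z}) never overflows.
    # For z < 0, rewrite as e^{z} / (1 + e^{z}) so we only exponentiate non-positive values.
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])
    out[~pos] = expz / (1.0 + expz)
    return out

z = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])
print(stable_sigmoid(z))  # no overflow warnings, values stay in [0, 1]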


9. Implementations

Python (scikit-learn)

python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

# Synthetic dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

# Train logistic regression with L2
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
model.fit(X, y)

# Predict
preds = model.predict(X)
probs = model.predict_proba(X)[:,1]

print("Predictions:", preds)
print("Probabilities:", probs)
print("Accuracy:", accuracy_score(y, preds))
print("Log-Loss:", log_loss(y, probs))
print("AUC:", roc_auc_score(y, probs))

Rust (linfa)

rust
use linfa::prelude::*;
use linfa_logistic::LogisticRegression;
use ndarray::{array, Array1};

fn main() {
    // Features and labels
    let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
    let y = Array1::from_vec(vec![0, 0, 0, 1, 1]);

    // Dataset
    let dataset = Dataset::new(x.clone(), y.clone());

    // Train logistic regression with L2
    let model = LogisticRegression::default().fit(&dataset).unwrap();

    // Predict
    let test_x = array![[2.5], [3.5]];
    let preds = model.predict(&test_x);
    let probs = model.predict_probabilities(&test_x).unwrap();
    println!("Predictions: {:?}", preds);
    println!("Probabilities: {:?}", probs);
}

Dependencies (Cargo.toml):

toml
[dependencies]
linfa = "0.7.1"
linfa-logistic = "0.7.1"
ndarray = "0.15.6"

10. Case Study: Breast Cancer Dataset

python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.calibration import CalibrationDisplay
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
print("Classification Report:\n", classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))

# Calibration plot (2025 trend)
disp = CalibrationDisplay.from_estimator(model, X_test, y_test)
plt.title("Calibration Curve")
plt.show()

Rust (linfa)

rust
use linfa::prelude::*;
use linfa_logistic::LogisticRegression;
use ndarray::{Array2, Array1};

fn main() {
    // Placeholder: the breast cancer dataset is not bundled with linfa; load real features
    // and labels from CSV in practice. The zero-filled arrays below only illustrate the
    // shapes, and a single-class target like this will not produce a meaningful fit.
    let x_train: Array2<f64> = Array2::zeros((455, 30)); // Simplified
    let y_train: Array1<i32> = Array1::zeros(455);
    let x_test: Array2<f64> = Array2::zeros((114, 30));
    let y_test: Array1<i32> = Array1::zeros(114);

    // Dataset
    let dataset = Dataset::new(x_train, y_train);

    // Train logistic regression (linfa-logistic applies L2-style regularization via alpha)
    let model = LogisticRegression::default()
        .max_iterations(1000)
        .fit(&dataset)
        .unwrap();

    // Evaluate
    let preds = model.predict(&x_test);
    println!("Predictions: {:?}", preds);
}

Note: the breast cancer dataset is not bundled with linfa; load it from CSV (for example with the csv crate) or use the Python example for the complete workflow.


11. Applications in 2025

  • Medical Diagnosis: Predicting disease (e.g., breast cancer).
  • Spam Detection: Classifying emails.
  • LLM Probing: Linear probes on embeddings for task adaptation.
  • Federated Learning: Lightweight classifier on edge devices.
  • IoT: Classifying sensor data for anomaly detection.

12. Under the Hood Insights

  • Discriminative Model: Models ( P(y|\mathbf{x}) ), unlike generative Naive Bayes.
  • Linear Separability: Assumes log-odds linearity.
  • Deep Learning: A binary classifier’s sigmoid output layer is logistic regression applied to learned features.
  • Numerical Stability: Regularization mitigates multicollinearity.

13. Limitations

  • Assumes linear decision boundaries; use kernels or neural nets for nonlinearity.
  • Sensitive to multicollinearity; requires feature preprocessing.
  • Struggles with high-dimensional sparse data without regularization.

14. Summary

Logistic regression remains a workhorse of classification in 2025, balancing simplicity, interpretability, and efficiency. Its theoretical foundations (log-odds, MLE, Fisher Information) and practical applications (LLM probing, federated learning) make it indispensable.


Further Reading

  • James et al., An Introduction to Statistical Learning (Ch. 4).
  • Ng, Machine Learning Yearning.
  • Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Ch. 4).
  • linfa-logistic docs: github.com/rust-ml/linfa.

Updated September 17, 2025: Added 2025 trends, expanded applications, and calibration metrics.