Skip to content

Logistic Regression

Logistic regression is a cornerstone of supervised learning for classification problems, widely used for its simplicity, interpretability, and robustness. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for discrete classes (e.g., spam vs. not spam, disease vs. no disease). In 2025, logistic regression remains critical in ML pipelines, serving as a baseline, a component in hybrid systems with large language models (LLMs), and a lightweight classifier for edge AI and federated learning.

This article offers a comprehensive 3000+ word exploration of logistic regression, covering theory, mathematics, derivations, optimization, regularization, evaluation, and real-world applications. We’ll provide step-by-step implementations in Python (scikit-learn) and Rust (linfa), ensuring a practical and theoretically rigorous guide aligned with modern ML trends.


Logistic regression shines where linear regression falters: predicting probabilities in [0,1]. For example, predicting whether a student passes an exam based on study hours risks non-probabilistic outputs (e.g., -0.3 or 1.2) with linear regression. Logistic regression uses the sigmoid function to map real-valued predictions to probabilities, enabling binary or multiclass classification.

Why Logistic Regression in 2025?

  • Baseline: Compares against complex models like transformers.
  • Interpretability: Coefficients reveal feature importance.
  • Efficiency: Lightweight for edge devices (e.g., IoT).
  • LLM Probing: Linear probes on pretrained embeddings for task adaptation.

The logistic function is:

[ \sigma(z) = \frac{1}{1 + e^{-z}} ]

where ( z = \mathbf{w}^T \mathbf{x} + b ). The probability of class 1 is:

[ P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) ]


For binary classification (( y \in {0,1} )):

[ \hat{y} = P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}} ]

The decision boundary is where ( P(y=1|\mathbf{x}) = 0.5 ), i.e., ( \mathbf{w}^T \mathbf{x} + b = 0 ), a hyperplane in feature space.

Logistic regression models the log-odds:

[ \text{logit}(p) = \ln \left( \frac{p}{1-p} \right) = \mathbf{w}^T \mathbf{x} + b ]

  • Odds: ( \frac{p}{1-p} ).
  • Logit: Linear in features, enabling interpretability.
  • Log-odds linearity underpins logistic regression’s use in linear probing for LLMs.

3. Loss Function: Maximum Likelihood Estimation

Section titled “3. Loss Function: Maximum Likelihood Estimation”

Logistic regression optimizes parameters (( \mathbf{w}, b )) via Maximum Likelihood Estimation (MLE). For dataset ( D = { (\mathbf{x}i, y_i) }{i=1}^m ):

[ L(\mathbf{w}, b) = \prod_{i=1}^m P(y_i|\mathbf{x}_i) ]

Probability:

[ P(y_i|\mathbf{x}_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i} ]

where ( \hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b) ).

Log-likelihood (to avoid numerical underflow):

[ \ell(\mathbf{w}, b) = \sum_{i=1}^m \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] ]

Minimize negative log-likelihood (NLL), or binary cross-entropy:

[ J(\mathbf{w}, b) = - \ell(\mathbf{w}, b) ]

The Fisher Information Matrix ( I(\mathbf{w}) ) measures curvature of the log-likelihood, used in natural gradient descent (see lecture on Natural Gradient Descent).


Minimize ( J(\mathbf{w}, b) ) using gradient descent:

[ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J, \quad b \leftarrow b - \eta \nabla_b J ]

Gradients:

[ \nabla_{\mathbf{w}} J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) \mathbf{x}_i ]

[ \nabla_b J = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) ]

  • Batch Gradient Descent: Full dataset.
  • Stochastic Gradient Descent (SGD): One sample.
  • Mini-batch: Common in 2025 for large-scale data.

In 2025, optimizers like Adam or Adagrad often enhance logistic regression training.


Regularization prevents overfitting:

  • L2 (Ridge):

[ J(\mathbf{w}, b) = - \sum \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda |\mathbf{w}|_2^2 ]

  • L1 (Lasso): Promotes sparsity for feature selection.

[ J(\mathbf{w}, b) = - \sum \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right] + \lambda |\mathbf{w}|_1 ]

  • Elastic Net: Combines L1 and L2.

In federated learning, regularization stabilizes models across distributed devices.


6. Extension to Multiclass: Softmax Regression

Section titled “6. Extension to Multiclass: Softmax Regression”

For ( K ) classes, use softmax:

[ P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}k^T \mathbf{x} + b_k}}{\sum{j=1}^K e^{\mathbf{w}_j^T \mathbf{x} + b_j}} ]

Loss: Categorical Cross-Entropy.

In 2025, softmax regression is used in LLM classification heads.


  • Accuracy: Fraction of correct predictions.
  • Precision/Recall/F1: For imbalanced datasets.
  • ROC Curve & AUC: Performance across thresholds.
  • Log-Loss: Measures probability calibration.
  • Calibration Curve: In 2025, ensures well-calibrated probabilities for decision-making.

  • Ill-Conditioned Features: High correlation in ( \mathbf{X} ) leads to unstable ( \mathbf{w} ). Regularization (ridge) or preconditioning helps.
  • Sigmoid Stability: Avoid overflow in ( e^{-z} ) using stable implementations (e.g., log-sum-exp).
  • In 2025, libraries like scikit-learn and linfa use optimized BLAS for stability.

See lecture on Numerical Stability.


::: code-group

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss
# Synthetic dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])
# Train logistic regression with L2
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
model.fit(X, y)
# Predict
preds = model.predict(X)
probs = model.predict_proba(X)[:,1]
print("Predictions:", preds)
print("Probabilities:", probs)
print("Accuracy:", accuracy_score(y, preds))
print("Log-Loss:", log_loss(y, probs))
print("AUC:", roc_auc_score(y, probs))
use linfa::prelude::*;
use linfa_logistic::LogisticRegression;
use ndarray::{array, Array1};
fn main() {
// Features and labels
let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
let y = Array1::from_vec(vec![0, 0, 0, 1, 1]);
// Dataset
let dataset = Dataset::new(x.clone(), y.clone());
// Train logistic regression with L2
let model = LogisticRegression::default().fit(&dataset).unwrap();
// Predict
let test_x = array![[2.5], [3.5]];
let preds = model.predict(&test_x);
let probs = model.predict_probabilities(&test_x).unwrap();
println!("Predictions: {:?}", preds);
println!("Probabilities: {:?}", probs);
}

:::

Dependencies (Cargo.toml):

[dependencies]
linfa = "0.7.1"
linfa-logistic = "0.7.1"
ndarray = "0.15.6"

::: code-group

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, CalibrationDisplay
import matplotlib.pyplot as plt
# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
print("Classification Report:\n", classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
# Calibration plot (2025 trend)
disp = CalibrationDisplay.from_estimator(model, X_test, y_test)
plt.title("Calibration Curve")
plt.show()
use linfa::prelude::*;
use linfa_logistic::LogisticRegression;
use ndarray::{Array2, Array1};
fn main() {
// Placeholder: Load breast cancer dataset (not natively in Rust)
let x_train: Array2<f64> = Array2::zeros((455, 30)); // Simplified
let y_train: Array1<i32> = Array1::zeros(455);
let x_test: Array2<f64> = Array2::zeros((114, 30));
let y_test: Array1<i32> = Array1::zeros(114);
// Dataset
let dataset = Dataset::new(x_train, y_train);
// Train with L1
let model = LogisticRegression::default()
.max_iterations(1000)
.fit(&dataset)
.unwrap();
// Evaluate
let preds = model.predict(&x_test);
println!("Predictions: {:?}", preds);
}

:::

Note: Rust lacks native breast cancer dataset; use Python for full example or load via CSV.


  • Medical Diagnosis: Predicting disease (e.g., breast cancer).
  • Spam Detection: Classifying emails.
  • LLM Probing: Linear probes on embeddings for task adaptation.
  • Federated Learning: Lightweight classifier on edge devices.
  • IoT: Classifying sensor data for anomaly detection.

  • Discriminative Model: Models ( P(y|\mathbf{x}) ), unlike generative Naive Bayes.
  • Linear Separability: Assumes log-odds linearity.
  • Deep Learning: Final layer in binary classifiers.
  • Numerical Stability: Regularization mitigates multicollinearity.

  • Assumes linear decision boundaries; use kernels or neural nets for nonlinearity.
  • Sensitive to multicollinearity; requires feature preprocessing.
  • Struggles with high-dimensional sparse data without regularization.

Logistic regression remains a workhorse of classification in 2025, balancing simplicity, interpretability, and efficiency. Its theoretical foundations (log-odds, MLE, Fisher Information) and practical applications (LLM probing, federated learning) make it indispensable.


  • James, Introduction to Statistical Learning (Ch. 4).
  • Ng, Machine Learning Yearning.
  • Géron, Hands-On ML (Ch. 4).
  • linfa-logistic docs: github.com/rust-ml/linfa.

Updated September 17, 2025: Added 2025 trends, expanded applications, and calibration metrics.