Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are powerful supervised learning algorithms used primarily for classification and regression, known for their ability to handle nonlinear relationships via the kernel trick and to maximize margins for robust decision boundaries. In 2025, SVMs remain relevant for their mathematical elegance, their role in hybrid systems with large language models (LLMs), and applications in areas like anomaly detection and bioinformatics, where interpretability and high-dimensional data handling are key.
This lecture in the “Foundations for AI/ML” series (core-ml cluster) builds on prior topics like logistic regression and decision trees, exploring SVMs, their theoretical foundations, the kernel trick, regularization, and applications. We’ll provide intuitive explanations, mathematical derivations, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Intuition
SVMs aim to find the hyperplane that separates classes with the maximum margin, ensuring robustness to noise. Unlike logistic regression's probabilistic approach, SVMs focus on support vectors, the critical points that define the decision boundary.
Why SVMs in 2025?
- Robustness: Large margins generalize well.
- Kernel Trick: Handles nonlinearity without explicit features.
- Sparsity: Relies on few support vectors.
- Modern Applications: SVMs on LLM embeddings for classification, edge AI.
Real-World Examples
- Bioinformatics: Classify proteins.
- Finance: Detect fraud.
- AI Pipelines: SVMs on LLM features for efficient decisions.
::: info
SVMs are like a surgeon's precise cut: maximizing the margin between classes for safe, robust separation.
:::
2. Mathematical Formulation
Binary Classification
For data \( D = \{ (\mathbf{x}_i, y_i) \}_{i=1}^m \) with \( y_i \in \{-1, 1\} \), the SVM finds a hyperplane \( \mathbf{w}^T \mathbf{x} + b = 0 \).
Margin: the distance from the hyperplane to the nearest training point; with the constraints normalized so that \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \), the margin is \( 1 / \|\mathbf{w}\| \).
Hard-Margin SVM: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 \) subject to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \) for all \( i \).
Dual Form: maximize \( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^T \mathbf{x}_j \) subject to \( \sum_i \alpha_i y_i = 0 \) and \( \alpha_i \ge 0 \).
Soft-Margin SVM
Introduce slack variables \( \xi_i \) to allow margin violations:
Minimize \( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \) subject to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i \) and \( \xi_i \ge 0 \).
The hyperparameter \( C \) trades margin width against training errors: a large \( C \) penalizes violations heavily, while a small \( C \) favors a wider margin, as the sketch below illustrates.
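A minimal scikit-learn sketch of this trade-off, using synthetic blobs and arbitrary illustrative values of \( C \) (not from the lecture): smaller \( C \) generally widens the margin and retains more support vectors.

```python
# Illustrative sketch: how the soft-margin parameter C affects the number of support vectors.
# Synthetic data and the C values are arbitrary choices for demonstration.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    print(f"C={C:>6}: support vectors per class = {clf.n_support_}")
```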
ML Connection
Section titled “ML Connection”- Dual enables kernel trick for nonlinearity.
3. Kernel Trick
Map the data to a high-dimensional space via a feature map \( \varphi(\mathbf{x}) \), but compute only the kernel \( k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) \), never \( \varphi \) explicitly.
Dual with Kernel:
Maximize \( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j) \) subject to the same constraints.
Prediction: \( \hat{y} = \operatorname{sign}\left( \sum_i \alpha_i y_i \, k(\mathbf{x}, \mathbf{x}_i) + b \right) \).
Kernels:
- Linear: \( k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} \).
- Polynomial: \( k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d \).
- RBF: \( k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) \).
In 2025, custom kernels defined over LLM embeddings are an increasingly common pattern; the sketch below checks the standard kernels numerically.
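To make the trick concrete, here is a hedged NumPy sketch (data and hyperparameters are arbitrary) that computes the three kernels above by hand and verifies them against scikit-learn's pairwise helpers.

```python
# Sketch: the linear, polynomial, and RBF kernels computed manually and via scikit-learn.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(4, 3))
gamma, c, d = 0.5, 1.0, 3  # illustrative hyperparameters

# Manual definitions
K_lin = X @ Y.T
K_poly = (X @ Y.T + c) ** d
sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)

# Library versions agree up to floating-point error
assert np.allclose(K_lin, linear_kernel(X, Y))
assert np.allclose(K_poly, polynomial_kernel(X, Y, degree=d, gamma=1.0, coef0=c))
assert np.allclose(K_rbf, rbf_kernel(X, Y, gamma=gamma))
print("Manual and scikit-learn kernel matrices match.")
```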
4. Optimization and Support Vectors
Quadratic Programming: the dual is a QP solved for the multipliers \( \alpha_i \).
Support vectors: points with \( \alpha_i > 0 \). Those with \( 0 < \alpha_i < C \) lie exactly on the margin; those with \( \alpha_i = C \) may violate it.
Recover \( b \) from any margin support vector \( \mathbf{x}_i \): \( b = y_i - \sum_j \alpha_j y_j \, k(\mathbf{x}_j, \mathbf{x}_i) \).
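As a sanity check of the dual prediction formula and the role of \( b \), the hedged sketch below rebuilds a fitted SVC's decision function from its dual coefficients (scikit-learn's `dual_coef_` stores \( \alpha_i y_i \)), support vectors, and intercept, then compares it to `decision_function`; a binary RBF problem is assumed.

```python
# Sketch: reconstruct sum_i alpha_i y_i k(x, x_i) + b from a fitted binary SVC.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

alpha_y = clf.dual_coef_.ravel()                    # alpha_i * y_i for each support vector
K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)  # k(x, x_i) against the support vectors
manual_decision = K @ alpha_y + clf.intercept_[0]   # intercept_ plays the role of b

assert np.allclose(manual_decision, clf.decision_function(X))
print("Manual dual prediction matches decision_function.")
```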
Derivation
The Lagrangian dual introduces multipliers \( \alpha_i \ge 0 \) for the margin constraints; setting the derivatives with respect to \( \mathbf{w} \) and \( b \) to zero gives \( \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \) and \( \sum_i \alpha_i y_i = 0 \), and substituting these back into the Lagrangian yields the dual objective above.
ML Insight
Section titled “ML Insight”- Sparsity: Few support vectors for efficiency.
5. SVM for Regression (SVR)
Minimize the \( \varepsilon \)-insensitive loss: \( |y - f(\mathbf{x})|_\varepsilon = \max(0,\, |y - f(\mathbf{x})| - \varepsilon) \).
The dual is analogous to the classification case, with an \( \varepsilon \)-tube around the regression function playing the role of the margin.
In ML: robust regression, since errors inside the tube are ignored and larger errors grow only linearly; a loss sketch follows.
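A tiny sketch of the \( \varepsilon \)-insensitive loss on a few illustrative residuals, showing that errors inside the tube cost nothing:

```python
# Sketch: epsilon-insensitive loss max(0, |y - f(x)| - eps) on illustrative values.
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Residuals smaller than eps incur zero loss; larger ones are penalized linearly.
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.00, 1.05, 1.30])
y_pred = np.array([1.02, 1.00, 1.00])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))  # [0.  0.  0.2]
```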
6. Evaluation Metrics
- Accuracy/Precision/Recall/F1: For classification.
- MSE/MAE: For SVR.
- Support Vectors: Measure model complexity.
- Margin: Indicates robustness.
In 2025, SHAP is widely used for SVM explainability; a hedged sketch follows.
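One hedged way to do this is SHAP's model-agnostic `KernelExplainer`; the sketch below assumes the `shap` package is installed and uses `probability=True` so the explainer can attribute class probabilities. KernelExplainer is slow, so the background set is kept small.

```python
# Hedged sketch: model-agnostic SHAP explanations for a fitted SVM (assumes `shap` is installed).
import shap
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', gamma='scale', probability=True, random_state=0).fit(X, y)

background = shap.sample(X, 20, random_state=0)            # small background sample
explainer = shap.KernelExplainer(clf.predict_proba, background)
shap_values = explainer.shap_values(X[:5])                 # attributions for 5 samples
print(len(shap_values))                                    # structure depends on the shap version
```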
7. Applications in Machine Learning (2025)
- Classification: Text categorization, image recognition.
- Regression: Time-series forecasting.
- Anomaly Detection: One-class SVM (see the sketch after this list).
- Bioinformatics: Protein classification.
- Hybrid Systems: SVM on LLM embeddings.
- Edge AI: Lightweight SVM on devices.
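For the anomaly-detection item above, a minimal one-class SVM sketch on synthetic data (nu and gamma are illustrative):

```python
# Sketch: anomaly detection with a one-class SVM; data and hyperparameters are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # "normal" behavior
X_test = np.vstack([rng.normal(size=(10, 2)),              # likely inliers
                    rng.normal(loc=6.0, size=(5, 2))])     # obvious outliers

oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)
print(oc_svm.predict(X_test))  # +1 = inlier, -1 = outlier
```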
Challenges
- Scalability: the kernel matrix is O(n²) in memory, and kernel SVM training scales poorly with n.
- High Dimensions: distances concentrate (curse of dimensionality); prefer a linear kernel or dimensionality reduction.
- Imbalanced Data: weight \( C \) per class (see the sketch below).
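In scikit-learn the per-class weighting of \( C \) is exposed as `class_weight`; a brief sketch on an artificially imbalanced dataset (95/5 split, illustrative settings):

```python
# Sketch: class-weighted C for imbalanced data; dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in [None, 'balanced']:
    # 'balanced' rescales C inversely to class frequencies, typically helping minority-class recall.
    clf = SVC(kernel='rbf', C=1.0, class_weight=cw).fit(X_tr, y_tr)
    print(f"class_weight={cw}: minority-class F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```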
8. Numerical Implementations
Implement SVM for classification and regression.
::: code-group
```python [Python]
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Classification
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print("SVM Classification Accuracy:", accuracy_score(y_test, y_pred))
print("Support Vectors:", svm_clf.n_support_)

# Regression
X_reg = np.random.rand(200, 1) * 10
y_reg = np.sin(X_reg).ravel() + np.random.normal(0, 0.1, 200)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale')
svr.fit(X_train_reg, y_train_reg)
y_pred_reg = svr.predict(X_test_reg)
print("SVR MSE:", mean_squared_error(y_test_reg, y_pred_reg))

# ML: Kernel SVM on nonlinear data
from sklearn.datasets import make_moons
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=0)
svm_nonlin = SVC(kernel='rbf', C=1.0, gamma=0.5)
svm_nonlin.fit(X_moon, y_moon)
print("Nonlinear SVM Accuracy:", accuracy_score(y_moon, svm_nonlin.predict(X_moon)))
```

```rust [Rust]
use linfa::prelude::*;
use linfa_svm::Svm;
use ndarray::{s, Array1, Array2};

fn main() {
    // Classification (placeholder Iris data; replace the zero arrays with real features/labels)
    let x_train: Array2<f64> = Array2::zeros((120, 4));
    let y_train: Array1<i32> = Array1::zeros(120);
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<i32> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = Svm::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(&p, &t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("SVM Classification Accuracy: {}", accuracy);

    // Regression (SVR, placeholder data)
    let x_reg: Array2<f64> = Array2::zeros((200, 1));
    let y_reg: Array1<f64> = Array1::zeros(200);
    let (x_train_reg, x_test_reg, y_train_reg, y_test_reg) = (
        x_reg.slice(s![0..160, ..]).to_owned(),
        x_reg.slice(s![160..200, ..]).to_owned(),
        y_reg.slice(s![0..160]).to_owned(),
        y_reg.slice(s![160..200]).to_owned(),
    );
    let dataset_reg = Dataset::new(x_train_reg, y_train_reg);
    let svr = Svm::params().fit(&dataset_reg).unwrap();
    let preds_reg = svr.predict(&x_test_reg);
    let mse = preds_reg
        .iter()
        .zip(y_test_reg.iter())
        .map(|(&p, &t)| (p - t).powi(2))
        .sum::<f64>()
        / y_test_reg.len() as f64;
    println!("SVR MSE: {}", mse);
}
```

Dependencies (Cargo.toml):

```toml [Cargo.toml]
[dependencies]
linfa = "0.7.1"
linfa-svm = "0.7.1"
ndarray = "0.15.6"
```
:::
Implements SVM for classification and regression.
9. Numerical Stability and High Dimensions
- Kernel Matrix: can become ill-conditioned in high dimensions or with very smooth kernels; adding a small ridge to the diagonal (regularization) helps, as sketched below.
- High-D: Curse of dimensionality; use linear kernel or reduction.
- Stability: decomposition QP solvers such as SMO (used by libsvm) converge reliably.
In 2025, numerical stability also matters for distributed SVMs in federated ML.
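To see the conditioning point concretely, a hedged sketch (illustrative data and gamma) that measures an RBF kernel matrix's condition number and applies the usual fix of adding a small ridge, or jitter, to the diagonal:

```python
# Sketch: kernel-matrix conditioning and diagonal jitter; values are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

# A very smooth kernel (small gamma) makes K nearly constant, hence close to rank one.
K = rbf_kernel(X, gamma=0.001)
print("cond(K):         ", np.linalg.cond(K))

# A small ridge on the diagonal bounds the smallest eigenvalue away from zero.
K_reg = K + 1e-6 * np.eye(K.shape[0])
print("cond(K + 1e-6 I):", np.linalg.cond(K_reg))
```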
10. Case Study: Iris Dataset (Classification)
::: code-group
```python [Python]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train SVM
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
svm.fit(X_train, y_train)

# Evaluate
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))
```

```rust [Rust]
use linfa::prelude::*;
use linfa_svm::Svm;
use ndarray::{Array1, Array2};

fn main() {
    // Placeholder: Iris dataset (replace the zero arrays with real features/labels)
    let x_train: Array2<f64> = Array2::zeros((120, 4));
    let y_train: Array1<i32> = Array1::zeros(120);
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<i32> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = Svm::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(&p, &t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("SVM Accuracy: {}", accuracy);
}
```
:::
Note: Rust requires external data loading; use Python for full example.
11. Under the Hood Insights
- Margin Maximization: Robust to noise.
- Kernel Trick: Nonlinear boundaries.
- Support Vectors: Sparsity for efficiency.
- Dual Form: Enables kernels, QP solving.
12. Limitations
- Scalability: O(n²) kernel matrix.
- High-D: very high-dimensional data favors linear kernels or dimensionality reduction.
- Imbalanced Data: Weighted C.
- Non-Probabilistic: SVC lacks calibrated probabilities by default; use Platt scaling or calibration (see the sketch below).
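Calibrated probabilities can be added either with `SVC(probability=True)` (which applies Platt scaling internally) or explicitly with `CalibratedClassifierCV`; a brief hedged sketch of the latter:

```python
# Sketch: Platt (sigmoid) scaling of an SVM via cross-validated calibration.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = SVC(kernel='rbf', C=1.0, gamma='scale')   # no predict_proba by default
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5).fit(X_tr, y_tr)
print(calibrated.predict_proba(X_te[:3]))        # calibrated class probabilities
```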
13. Summary
SVMs are robust, kernel-based classifiers foundational to ML. In 2025, their role in explainable AI, bioinformatics, and LLM hybrids keeps them vital. The kernel trick and margin maximization address nonlinearity and overfitting.
Further Reading
- Cortes, Vapnik, “Support-Vector Networks” (1995).
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (Ch. 12).
- Schölkopf, Smola, Learning with Kernels.
- linfa-svm docs: github.com/rust-ml/linfa.