Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are powerful supervised learning algorithms used primarily for classification and regression, known for their ability to handle nonlinear relationships via the kernel trick and to maximize margins for robust decision boundaries. In 2025, SVMs remain relevant for their mathematical elegance, their role in hybrid systems with large language models (LLMs), and applications in areas like anomaly detection and bioinformatics, where interpretability and high-dimensional data handling are key.
This lecture in the “Foundations for AI/ML” series (core-ml cluster) builds on prior topics like logistic regression and decision trees, exploring SVMs, their theoretical foundations, the kernel trick, regularization, and applications. We’ll provide intuitive explanations, mathematical derivations, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Intuition
SVMs aim to find the hyperplane that separates classes with the maximum margin, ensuring robustness to noise. Unlike logistic regression's probabilistic approach, SVMs focus on support vectors, the critical points that define the decision boundary.
Why SVMs in 2025?
- Robustness: Large margins generalize well.
- Kernel Trick: Handles nonlinearity without explicit features.
- Sparsity: Relies on few support vectors.
- Modern Applications: SVMs on LLM embeddings for classification, edge AI.
Real-World Examples
- Bioinformatics: Classify proteins.
- Finance: Detect fraud.
- AI Pipelines: SVMs on LLM features for efficient decisions.
::: info
SVMs are like a surgeon's precise cut: maximizing the margin between classes for safe, robust separation.
:::
2. Mathematical Formulation
Binary Classification
For data \( D = \{ (\mathbf{x}_i, y_i) \}_{i=1}^m \) with \( y_i \in \{-1, 1\} \), the SVM finds a hyperplane \( \mathbf{w}^T \mathbf{x} + b = 0 \).
Margin: the distance from the hyperplane to the nearest training point; with the constraints normalized so that \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \), the margin is \( 1 / \|\mathbf{w}\| \).
Hard-Margin SVM: minimize \( \frac{1}{2} \|\mathbf{w}\|^2 \) subject to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \) for all \( i \).
Dual Form: maximize \( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^T \mathbf{x}_j \) subject to \( \sum_i \alpha_i y_i = 0 \) and \( \alpha_i \ge 0 \).
Soft-Margin SVM
Introduce slack variables \( \xi_i \) to allow margin violations:
Minimize \( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \) subject to \( y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i \) and \( \xi_i \ge 0 \).
The hyperparameter \( C \) trades margin width against training errors: a large \( C \) penalizes violations heavily, while a small \( C \) favors a wider margin, as the sketch below illustrates.
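A minimal scikit-learn sketch of this trade-off, using synthetic blobs and arbitrary illustrative values of \( C \) (not from the lecture): smaller \( C \) generally widens the margin and retains more support vectors.

```python
# Illustrative sketch: how the soft-margin parameter C affects the number of support vectors.
# Synthetic data and the C values are arbitrary choices for demonstration.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    print(f"C={C:>6}: support vectors per class = {clf.n_support_}")
```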
ML Connection
Section titled “ML Connection”- Dual enables kernel trick for nonlinearity.
3. Kernel Trick
Map the data to a high-dimensional space via a feature map \( \varphi(\mathbf{x}) \), but compute only the kernel \( k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) \), never \( \varphi \) explicitly.
Dual with Kernel:
Maximize \( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j) \) subject to the same constraints.
Prediction: \( \hat{y} = \operatorname{sign}\left( \sum_i \alpha_i y_i \, k(\mathbf{x}, \mathbf{x}_i) + b \right) \).
Kernels:
- Linear: \( k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} \).
- Polynomial: \( k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d \).
- RBF: \( k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) \).
In 2025, custom kernels defined over LLM embeddings are an increasingly common pattern; the sketch below checks the standard kernels numerically.
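To make the trick concrete, here is a hedged NumPy sketch (data and hyperparameters are arbitrary) that computes the three kernels above by hand and verifies them against scikit-learn's pairwise helpers.

```python
# Sketch: the linear, polynomial, and RBF kernels computed manually and via scikit-learn.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(4, 3))
gamma, c, d = 0.5, 1.0, 3  # illustrative hyperparameters

# Manual definitions
K_lin = X @ Y.T
K_poly = (X @ Y.T + c) ** d
sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)

# Library versions agree up to floating-point error
assert np.allclose(K_lin, linear_kernel(X, Y))
assert np.allclose(K_poly, polynomial_kernel(X, Y, degree=d, gamma=1.0, coef0=c))
assert np.allclose(K_rbf, rbf_kernel(X, Y, gamma=gamma))
print("Manual and scikit-learn kernel matrices match.")
```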
4. Optimization and Support Vectors
Quadratic Programming: the dual is a QP solved for the multipliers \( \alpha_i \).
Support vectors: points with \( \alpha_i > 0 \). Those with \( 0 < \alpha_i < C \) lie exactly on the margin; those with \( \alpha_i = C \) may violate it.
Recover \( b \) from any margin support vector \( \mathbf{x}_i \): \( b = y_i - \sum_j \alpha_j y_j \, k(\mathbf{x}_j, \mathbf{x}_i) \).
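As a sanity check of the dual prediction formula and the role of \( b \), the hedged sketch below rebuilds a fitted SVC's decision function from its dual coefficients (scikit-learn's `dual_coef_` stores \( \alpha_i y_i \)), support vectors, and intercept, then compares it to `decision_function`; a binary RBF problem is assumed.

```python
# Sketch: reconstruct sum_i alpha_i y_i k(x, x_i) + b from a fitted binary SVC.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

alpha_y = clf.dual_coef_.ravel()                    # alpha_i * y_i for each support vector
K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)  # k(x, x_i) against the support vectors
manual_decision = K @ alpha_y + clf.intercept_[0]   # intercept_ plays the role of b

assert np.allclose(manual_decision, clf.decision_function(X))
print("Manual dual prediction matches decision_function.")
```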
Derivation
The Lagrangian dual introduces multipliers \( \alpha_i \ge 0 \) for the margin constraints; setting the derivatives with respect to \( \mathbf{w} \) and \( b \) to zero gives \( \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \) and \( \sum_i \alpha_i y_i = 0 \), and substituting these back into the Lagrangian yields the dual objective above.
ML Insight
Section titled “ML Insight”- Sparsity: Few support vectors for efficiency.
5. SVM for Regression (SVR)
Minimize the \( \varepsilon \)-insensitive loss: \( |y - f(\mathbf{x})|_\varepsilon = \max(0,\, |y - f(\mathbf{x})| - \varepsilon) \).
The dual is analogous to the classification case, with an \( \varepsilon \)-tube around the regression function playing the role of the margin.
In ML: robust regression, since errors inside the tube are ignored and larger errors grow only linearly; a loss sketch follows.
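A tiny sketch of the \( \varepsilon \)-insensitive loss on a few illustrative residuals, showing that errors inside the tube cost nothing:

```python
# Sketch: epsilon-insensitive loss max(0, |y - f(x)| - eps) on illustrative values.
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Residuals smaller than eps incur zero loss; larger ones are penalized linearly.
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.00, 1.05, 1.30])
y_pred = np.array([1.02, 1.00, 1.00])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))  # [0.  0.  0.2]
```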
6. Evaluation Metrics
- Accuracy/Precision/Recall/F1: For classification.
- MSE/MAE: For SVR.
- Support Vectors: Measure model complexity.
- Margin: Indicates robustness.
In 2025, SHAP is widely used for SVM explainability; a hedged sketch follows.
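One hedged way to do this is SHAP's model-agnostic `KernelExplainer`; the sketch below assumes the `shap` package is installed and uses `probability=True` so the explainer can attribute class probabilities. KernelExplainer is slow, so the background set is kept small.

```python
# Hedged sketch: model-agnostic SHAP explanations for a fitted SVM (assumes `shap` is installed).
import shap
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', gamma='scale', probability=True, random_state=0).fit(X, y)

background = shap.sample(X, 20, random_state=0)            # small background sample
explainer = shap.KernelExplainer(clf.predict_proba, background)
shap_values = explainer.shap_values(X[:5])                 # attributions for 5 samples
print(len(shap_values))                                    # structure depends on the shap version
```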
7. Applications in Machine Learning (2025)
- Classification: Text categorization, image recognition.
- Regression: Time-series forecasting.
- Anomaly Detection: One-class SVM (see the sketch after this list).
- Bioinformatics: Protein classification.
- Hybrid Systems: SVM on LLM embeddings.
- Edge AI: Lightweight SVM on devices.
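For the anomaly-detection item above, a minimal one-class SVM sketch on synthetic data (nu and gamma are illustrative):

```python
# Sketch: anomaly detection with a one-class SVM; data and hyperparameters are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # "normal" behavior
X_test = np.vstack([rng.normal(size=(10, 2)),              # likely inliers
                    rng.normal(loc=6.0, size=(5, 2))])     # obvious outliers

oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)
print(oc_svm.predict(X_test))  # +1 = inlier, -1 = outlier
```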
Challenges
- Scalability: the kernel matrix is O(n²) in memory, and kernel SVM training scales poorly with n.
- High Dimensions: distances concentrate (curse of dimensionality); prefer a linear kernel or dimensionality reduction.
- Imbalanced Data: weight \( C \) per class (see the sketch below).
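In scikit-learn the per-class weighting of \( C \) is exposed as `class_weight`; a brief sketch on an artificially imbalanced dataset (95/5 split, illustrative settings):

```python
# Sketch: class-weighted C for imbalanced data; dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in [None, 'balanced']:
    # 'balanced' rescales C inversely to class frequencies, typically helping minority-class recall.
    clf = SVC(kernel='rbf', C=1.0, class_weight=cw).fit(X_tr, y_tr)
    print(f"class_weight={cw}: minority-class F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```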
8. Numerical Implementations
Implement SVM for classification and regression.
::: code-group
```python [Python]
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Classification
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print("SVM Classification Accuracy:", accuracy_score(y_test, y_pred))
print("Support Vectors:", svm_clf.n_support_)

# Regression
X_reg = np.random.rand(200, 1) * 10
y_reg = np.sin(X_reg).ravel() + np.random.normal(0, 0.1, 200)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale')
svr.fit(X_train_reg, y_train_reg)
y_pred_reg = svr.predict(X_test_reg)
print("SVR MSE:", mean_squared_error(y_test_reg, y_pred_reg))

# ML: Kernel SVM on nonlinear data
from sklearn.datasets import make_moons
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=0)
svm_nonlin = SVC(kernel='rbf', C=1.0, gamma=0.5)
svm_nonlin.fit(X_moon, y_moon)
print("Nonlinear SVM Accuracy:", accuracy_score(y_moon, svm_nonlin.predict(X_moon)))
```

```rust [Rust]
use linfa::prelude::*;
use linfa_svm::Svm;
use ndarray::{s, Array1, Array2};

fn main() {
    // Classification (placeholder Iris data; replace the zero arrays with real features/labels)
    let x_train: Array2<f64> = Array2::zeros((120, 4));
    let y_train: Array1<i32> = Array1::zeros(120);
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<i32> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = Svm::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(&p, &t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("SVM Classification Accuracy: {}", accuracy);

    // Regression (SVR, placeholder data)
    let x_reg: Array2<f64> = Array2::zeros((200, 1));
    let y_reg: Array1<f64> = Array1::zeros(200);
    let (x_train_reg, x_test_reg, y_train_reg, y_test_reg) = (
        x_reg.slice(s![0..160, ..]).to_owned(),
        x_reg.slice(s![160..200, ..]).to_owned(),
        y_reg.slice(s![0..160]).to_owned(),
        y_reg.slice(s![160..200]).to_owned(),
    );
    let dataset_reg = Dataset::new(x_train_reg, y_train_reg);
    let svr = Svm::params().fit(&dataset_reg).unwrap();
    let preds_reg = svr.predict(&x_test_reg);
    let mse = preds_reg
        .iter()
        .zip(y_test_reg.iter())
        .map(|(&p, &t)| (p - t).powi(2))
        .sum::<f64>()
        / y_test_reg.len() as f64;
    println!("SVR MSE: {}", mse);
}
```

Dependencies (Cargo.toml):

```toml [Cargo.toml]
[dependencies]
linfa = "0.7.1"
linfa-svm = "0.7.1"
ndarray = "0.15.6"
```
:::
Implements SVM for classification and regression.
9. Numerical Stability and High Dimensions
- Kernel Matrix: can become ill-conditioned in high dimensions or with very smooth kernels; adding a small ridge to the diagonal (regularization) helps, as sketched below.
- High-D: Curse of dimensionality; use linear kernel or reduction.
- Stability: decomposition QP solvers such as SMO (used by libsvm) converge reliably.
In 2025, numerical stability also matters for distributed SVMs in federated ML.
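To see the conditioning point concretely, a hedged sketch (illustrative data and gamma) that measures an RBF kernel matrix's condition number and applies the usual fix of adding a small ridge, or jitter, to the diagonal:

```python
# Sketch: kernel-matrix conditioning and diagonal jitter; values are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

# A very smooth kernel (small gamma) makes K nearly constant, hence close to rank one.
K = rbf_kernel(X, gamma=0.001)
print("cond(K):         ", np.linalg.cond(K))

# A small ridge on the diagonal bounds the smallest eigenvalue away from zero.
K_reg = K + 1e-6 * np.eye(K.shape[0])
print("cond(K + 1e-6 I):", np.linalg.cond(K_reg))
```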
10. Case Study: Iris Dataset (Classification)
::: code-group
```python [Python]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train SVM
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
svm.fit(X_train, y_train)

# Evaluate
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))
```

```rust [Rust]
use linfa::prelude::*;
use linfa_svm::Svm;
use ndarray::{Array1, Array2};

fn main() {
    // Placeholder: Iris dataset (replace the zero arrays with real features/labels)
    let x_train: Array2<f64> = Array2::zeros((120, 4));
    let y_train: Array1<i32> = Array1::zeros(120);
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<i32> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = Svm::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(&p, &t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("SVM Accuracy: {}", accuracy);
}
```
:::
Note: Rust requires external data loading; use Python for full example.
11. Under the Hood Insights
- Margin Maximization: Robust to noise.
- Kernel Trick: Nonlinear boundaries.
- Support Vectors: Sparsity for efficiency.
- Dual Form: Enables kernels, QP solving.
12. Limitations
- Scalability: O(n²) kernel matrix.
- High-D: very high-dimensional data favors linear kernels or dimensionality reduction.
- Imbalanced Data: Weighted C.
- Non-Probabilistic: SVC lacks calibrated probabilities by default; use Platt scaling or calibration (see the sketch below).
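Calibrated probabilities can be added either with `SVC(probability=True)` (which applies Platt scaling internally) or explicitly with `CalibratedClassifierCV`; a brief hedged sketch of the latter:

```python
# Sketch: Platt (sigmoid) scaling of an SVM via cross-validated calibration.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = SVC(kernel='rbf', C=1.0, gamma='scale')   # no predict_proba by default
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5).fit(X_tr, y_tr)
print(calibrated.predict_proba(X_te[:3]))        # calibrated class probabilities
```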
13. Summary
SVMs are robust, kernel-based classifiers foundational to ML. In 2025, their role in explainable AI, bioinformatics, and LLM hybrids keeps them vital. The kernel trick and margin maximization address nonlinearity and overfitting.
Further Reading
- Cortes, Vapnik, “Support-Vector Networks” (1995).
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (Ch. 12).
- Schölkopf, Smola, Learning with Kernels.
- linfa-svm docs: github.com/rust-ml/linfa.