Higher-Order Derivatives - Hessian & Curvature
Higher-order derivatives extend first derivatives to capture acceleration, jerk, and beyond, revealing deeper behaviors like concavity and curvature. In AI and machine learning, they form the Hessian matrix, which quantifies the curvature of loss functions, enabling advanced optimization techniques like Newton's method and informing analyses of training dynamics in neural networks.
This lecture builds on partial derivatives and gradients, exploring second and higher derivatives, Taylor approximations, the Hessian, and their roles in ML. We’ll integrate mathematical foundations with practical applications, supported by intuitions, examples, and implementations in Python and Rust, to illuminate how curvature shapes AI model training and performance.
1. Intuition for Higher-Order Derivatives
First derivative: Slope, rate of change.
Second: Rate of change of slope—concavity. Positive: U-shaped (speeds up), negative: Inverted U (slows down).
Higher: Third (jerk), fourth (snap), etc., model abrupt changes.
ML Connection
- In optimization, second derivatives detect minima (positive curvature).
- Loss landscapes: Flat regions (low curvature) slow training; sharp regions (high curvature) risk overshooting.
::: info
Higher derivatives peel layers of "how change changes," like acceleration in physics or curvature in roads.
:::
Everyday Example
- Position s(t): s' velocity, s'' acceleration, s''' jerk (whiplash feel).
2. Second Derivatives in Single Variable
f''(x) = d/dx [f'(x)] = lim h→0 [f'(x+h) - f'(x)] / h.
Tests: f'' > 0 concave up (min if f' = 0), f'' < 0 concave down (max).
Inflection: f'' = 0 and f'' changes sign.
Examples
- f(x) = x^4: f' = 4x^3, f'' = 12x^2 ≥ 0, global min at 0.
- f(x) = sin(x): f'' = -sin(x), oscillates between concave up and down. (Both examples are checked in the sketch below.)
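As a quick check of the second-derivative test on these two examples, here is a minimal sketch (assuming SymPy is available; symbolic derivatives are covered more fully in Section 9):

```python
from sympy import symbols, diff, solve, sin

x = symbols('x', real=True)

for f in (x**4, sin(x)):
    f1, f2 = diff(f, x), diff(f, x, 2)
    for c in solve(f1, x):           # critical points where f'(x) = 0
        curv = f2.subs(x, c)
        kind = "min" if curv > 0 else "max" if curv < 0 else "test inconclusive (f'' = 0)"
        print(f"f = {f}, critical point x = {c}: f'' = {curv} -> {kind}")
```

For x^4 the test is inconclusive at 0 even though 0 is a global minimum, which is exactly the caveat of relying on f'' alone.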
ML Insight
- Quadratic loss approximations use second derivatives for faster convergence (see the sketch below).
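To make the convergence claim concrete, here is a small illustrative sketch (plain Python; the quadratic function and learning rate are my own choices, not from the lecture). One Newton step x_new = x - f'(x)/f''(x) minimizes the local quadratic model exactly, so on a truly quadratic loss it reaches the minimum in a single step, while a plain gradient step only moves partway:

```python
# Illustrative quadratic "loss" f(x) = 3*(x - 2)**2 + 1, minimum at x = 2
f1 = lambda x: 6 * (x - 2)   # f'(x)
f2 = lambda x: 6.0           # f''(x), constant for a quadratic

x0 = 10.0
x_newton = x0 - f1(x0) / f2(x0)   # Newton step: uses curvature
x_gd = x0 - 0.1 * f1(x0)          # gradient step with learning rate 0.1
print(x_newton, x_gd)             # 2.0 vs 5.2: Newton lands on the minimum
```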
3. Higher-Order Derivatives Beyond Second
n-th derivative: f^{(n)}(x).
Leibniz rule for products: (uv)^{(n)} = sum_{k=0}^n binom(n,k) u^{(k)} v^{(n-k)}.
Applications
- Taylor series: f(x) ≈ sum_{k=0}^n f^{(k)}(a)/k! (x-a)^k.
- Error bounds via remainder.
ML Application
- Polynomial approximations in models; higher terms capture nonlinearity.
Example: e^x = sum_{k=0}^∞ x^k / k!; every derivative of e^x is e^x itself.
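A short sketch of the partial sums (a minimal example in plain Python, comparing against math.exp):

```python
import math

def exp_taylor(x, n):
    # Partial sum of the Taylor series of e^x around 0, up to degree n
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.0
for n in (2, 4, 8):
    approx = exp_taylor(x, n)
    print(f"n={n}: {approx:.6f}, error={abs(approx - math.exp(x)):.2e}")
```

The error shrinks rapidly with n, as the remainder term predicts.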
4. Multivariable Second Partials and Hessian
Partial second: ∂²f/∂x², mixed ∂²f/∂x∂y.
Clairaut's theorem: Mixed partials are equal when they are continuous.
Hessian H: Matrix [∂²f/∂x_i ∂x_j].
Determinant or eigenvalues classify critical points.
Properties
- Symmetric when the second partials are continuous.
- Positive definite (all eigenvalues > 0): Local min.
ML Connection
- In Newton's method: Update Δw = -H^{-1} ∇L.
- Quasi-Newton methods (e.g., BFGS) approximate H.
Example: f(x,y)=x^2 + y^2, H = [[2,0],[0,2]], positive definite, min at (0,0).
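The classification by eigenvalues is easy to automate; below is a minimal NumPy sketch (the saddle example x^2 - y^2 is added here for contrast and is not from the text above):

```python
import numpy as np

def classify(H):
    # Classify a critical point from the eigenvalues of a symmetric Hessian
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local min", eig
    if np.all(eig < 0):
        return "local max", eig
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle", eig
    return "degenerate", eig

print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))   # f = x^2 + y^2 -> local min
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))  # f = x^2 - y^2 -> saddle
```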
5. Curvature Analysis via Hessian
Curvature: How much the function bends.
In 1D: κ = |f''| / (1 + (f')^2)^{3/2}.
Multivar: Mean and principal curvatures come from the Hessian eigenvalues.
Condition number κ(H) = |λ_max / λ_min|; when it is large the landscape is ill-conditioned and gradient descent is slow.
ML Insight
- Loss curvature: High curvature in some directions creates anisotropic landscapes, mitigated by preconditioning (illustrated in the sketch below).
- Saddle points: Eigenvalues of mixed signs.
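A small sketch of what ill-conditioning means in practice (the diagonal Hessian and the stability bound used below are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Anisotropic quadratic loss L(w) = 0.5 * w^T H w with very different curvatures
H = np.diag([1.0, 100.0])
eig = np.linalg.eigvalsh(H)
cond = eig.max() / eig.min()
print("condition number:", cond)   # 100.0 -> ill-conditioned

# For gradient descent on a quadratic, stability requires lr < 2 / lambda_max,
# so only about lambda_min * lr of the error along the flattest direction
# is removed per step.
lr = 1.9 / eig.max()
print("usable lr:", lr, "| fraction of flat-direction error removed per step:", eig.min() * lr)
```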
6. Taylor Expansions in Multiple Variables
f(x) ≈ f(a) + ∇f(a)^T (x-a) + 1/2 (x-a)^T H(a) (x-a) + …
Quadratic term captures local curvature.
Applications
- Approximation errors in ML.
- Trust-region methods limit the step to the region where the approximation holds.
Example: Approximate sin(x) near 0: x - x^3/6.
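Here is a minimal sketch of the quadratic (second-order) Taylor model in two variables, using an illustrative function f(x, y) = sin(x) + y^2 with its gradient and Hessian written out by hand (this example is mine, not from the text):

```python
import numpy as np

def f(v):
    x, y = v
    return np.sin(x) + y**2

def grad(v):
    x, y = v
    return np.array([np.cos(x), 2 * y])

def hess(v):
    x, y = v
    return np.array([[-np.sin(x), 0.0], [0.0, 2.0]])

a = np.array([0.0, 0.0])   # expansion point
d = np.array([0.3, 0.2])   # displacement x - a
taylor2 = f(a) + grad(a) @ d + 0.5 * d @ hess(a) @ d
print(f"true: {f(a + d):.6f}  quadratic Taylor: {taylor2:.6f}")
# The gap (~0.005) is the cubic remainder, dominated here by -x^3/6
```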
7. Geometric Interpretations
1D: The second derivative is the inverse of the radius of curvature (exactly so where f' = 0).
Multivar: Hessian eigenvectors give the directions of principal curvature.
Visual: Paraboloid f = x^2 + 2y^2, curved more steeply along y than along x (see the sketch below).
In ML: Visualize loss surfaces to see valleys and plateaus.
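For the paraboloid above, the principal curvature directions fall out of an eigendecomposition of its (constant) Hessian; a minimal NumPy sketch:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 2*y^2 is constant: diag(2, 4)
H = np.array([[2.0, 0.0], [0.0, 4.0]])
eigvals, eigvecs = np.linalg.eigh(H)
for lam, v in zip(eigvals, eigvecs.T):   # columns of eigvecs are eigenvectors
    print(f"principal curvature {lam} along direction {v}")
# Curvature 2 along x, 4 along y: the bowl is steeper in the y direction
```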
8. Numerical Computation of Higher Derivatives
Central finite difference for the second derivative: f''(x) ≈ [f(x+h) - 2f(x) + f(x-h)] / h^2.
For the Hessian: Apply a similar stencil per entry.
::: code-group

```python
import numpy as np

def f(x):
    return x**3 - 3*x

def second_deriv(f, x, h=1e-5):
    return (f(x + h) - 2*f(x) + f(x - h)) / h**2

x = 1.0
print(f"Second deriv at x={x}: {second_deriv(f, x)}")  # Analytical: 6x, so 6

# Multivariable Hessian
def g(x, y):
    return x**2 + y**2 + x*y

def hessian_approx(g, x, y, h=1e-5):
    fxx = (g(x + h, y) - 2*g(x, y) + g(x - h, y)) / h**2
    fyy = (g(x, y + h) - 2*g(x, y) + g(x, y - h)) / h**2
    fxy = ((g(x + h, y + h) - g(x + h, y - h))
           - (g(x - h, y + h) - g(x - h, y - h))) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])

print("Hessian:", hessian_approx(g, 0, 0))  # [[2,1],[1,2]]

# ML: Hessian for quadratic loss
def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y)**2)

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([1, 2, 3])
w = np.array([0.0, 1.0])  # Intercept, slope approx

# Numerical Hessian (2x2)
h = 1e-5
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei = np.eye(2)[:, i]
        ej = np.eye(2)[:, j]
        fp = loss(w + h*ei + h*ej, X, y)
        fm = loss(w - h*ei - h*ej, X, y)
        fp_mix = loss(w + h*ei - h*ej, X, y)
        fm_mix = loss(w - h*ei + h*ej, X, y)
        H[i, j] = (fp - fp_mix - fm_mix + fm) / (4 * h**2)

print("Approx Hessian for loss:", H)
```

```rust
fn f(x: f64) -> f64 {
    x.powi(3) - 3.0 * x
}

fn second_deriv(f: fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - 2.0 * f(x) + f(x - h)) / h.powi(2)
}

fn g(x: f64, y: f64) -> f64 {
    x.powi(2) + y.powi(2) + x * y
}

fn hessian_approx(g: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> [[f64; 2]; 2] {
    let fxx = (g(x + h, y) - 2.0 * g(x, y) + g(x - h, y)) / h.powi(2);
    let fyy = (g(x, y + h) - 2.0 * g(x, y) + g(x, y - h)) / h.powi(2);
    let fxy = ((g(x + h, y + h) - g(x + h, y - h))
        - (g(x - h, y + h) - g(x - h, y - h))) / (4.0 * h.powi(2));
    [[fxx, fxy], [fxy, fyy]]
}

fn main() {
    let x = 1.0;
    let h = 1e-5;
    println!("Second deriv at x={}: {}", x, second_deriv(f, x, h)); // ~6

    println!("Hessian: {:?}", hessian_approx(g, 0.0, 0.0, h)); // ~[[2,1],[1,2]]

    // ML: Hessian for quadratic loss
    fn loss(w: &[f64], x: &[[f64; 2]], y: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let pred = w[0] * xi[0] + w[1] * xi[1];
            sum += 0.5 * (pred - yi).powi(2);
        }
        sum / x.len() as f64
    }

    let x_data = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]];
    let y_data = [1.0, 2.0, 3.0];
    let w = [0.0, 1.0];
    let mut hess = [[0.0, 0.0], [0.0, 0.0]];
    for i in 0..2 {
        for j in 0..2 {
            let mut wp = w; wp[i] += h; wp[j] += h;
            let mut wm = w; wm[i] -= h; wm[j] -= h;
            let mut wpm = w; wpm[i] += h; wpm[j] -= h;
            let mut wmp = w; wmp[i] -= h; wmp[j] += h;
            hess[i][j] = (loss(&wp, &x_data, &y_data) - loss(&wpm, &x_data, &y_data)
                - loss(&wmp, &x_data, &y_data) + loss(&wm, &x_data, &y_data))
                / (4.0 * h.powi(2));
        }
    }
    println!("Approx Hessian for loss: {:?}", hess);
}
```

:::
These stencils approximate second derivatives and full Hessians numerically.
9. Symbolic Higher Derivatives
SymPy computes higher derivatives exactly.
::: code-group

```python
from sympy import symbols, diff, hessian, sin

x, y = symbols('x y')
f = x**4 + sin(x)
f_xx = diff(f, x, 2)
print("Second deriv:", f_xx)  # 12*x**2 - sin(x)

g = x**2 + y**2 + x*y
H = hessian(g, (x, y))
print("Hessian:", H)  # Matrix([[2, 1], [1, 2]])
```

```rust
// Simulated: Rust lacks a built-in symbolic engine, so print the analytic results.
fn main() {
    // d^2/dx^2 (x^4 + sin x) = 12 x^2 - sin(x)
    println!("Second deriv: 12 x^2 - sin(x)");

    // Hessian of x^2 + y^2 + x*y
    println!("Hessian: [[2,1],[1,2]]");
}
```

:::
10. Second-Order Optimization in ML
Newton's method: Uses the inverse Hessian for curvature-aware steps.
Quasi-Newton: Approximates H (e.g., BFGS).
In deep learning: The full Hessian is rarely used (too large); diagonal approximations or K-FAC stand in.
Challenges: Saddle escape, indefiniteness.
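As a concrete sketch of the Newton update Δw = -H^{-1} ∇L, the snippet below reuses the small least-squares setup from Section 8; because that loss is exactly quadratic, a single Newton step reaches the minimizer (np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 0.0])

# For L(w) = 0.5 * mean((Xw - y)^2): gradient and constant Hessian
grad = X.T @ (X @ w - y) / len(y)
H = X.T @ X / len(y)

w_new = w - np.linalg.solve(H, grad)    # Newton step: dw = -H^{-1} grad
print("after one Newton step:", w_new)  # exact least-squares solution [0, 1]
```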
11. Mean Value Theorem for Higher Orders
The mean value theorem extends to Taylor's theorem with a remainder term.
In ML: Used to bound approximation error and convergence rates.
12. Key ML Takeaways
- Second derivatives reveal the nature of critical points: min/max/saddle detection.
- Hessian curvature: Informs optimizer choice.
- Taylor approx: Local models.
- Numerical/symbolic: Practical computation.
- Landscape insights: Guide architecture/training.
Higher derivatives reveal hidden structures in change.
13. Summary
We explored higher-order derivatives, from second derivatives to Hessians, curvature, and Taylor expansions, with ML ties to optimization. Examples and code in Python and Rust provide hands-on practice. This deepens the calculus toolkit for AI.
Next: Integrals.
Further Reading
- Apostol, Calculus Vol. 2 (multivariable calculus).
- Boyd, Convex Optimization (Hessians).
- 3Blue1Brown: Taylor series.
- Rust: ‘nalgebra’ for matrices.