Higher-Order Derivatives - Hessian & Curvature

Higher-order derivatives extend first derivatives to capture acceleration, jerk, and beyond, revealing deeper behaviors like concavity and curvature. In AI and machine learning, they form the Hessian matrix, which quantifies the curvature of loss functions, enabling advanced optimization techniques like Newton's method and informing analyses of training dynamics in neural networks.

This lecture builds on partial derivatives and gradients, exploring second and higher derivatives, Taylor approximations, the Hessian, and their roles in ML. We'll integrate mathematical foundations with practical applications, supported by intuitions, examples, and implementations in Python and Rust, to illuminate how curvature shapes AI model training and performance.


1. Intuition for Higher-Order Derivatives

First derivative: Slope, rate of change.

Second: Rate of change of slope—concavity. Positive: U-shaped (speeds up), negative: Inverted U (slows down).

Higher: Third (jerk), fourth (snap), etc., model abrupt changes.

ML Connection

  • In optimization, second derivatives detect minima (positive curvature).
  • Loss landscapes: Flat regions (low curvature) slow training; sharp regions (high curvature) risk overshooting.

INFO

Higher derivatives peel layers of "how change changes," like acceleration in physics or curvature in roads.

Everyday Example

  • Position s(t): s' velocity, s'' acceleration, s''' jerk (whiplash feel).

2. Second Derivatives in Single Variable

f''(x) = d/dx [f'(x)] = lim h→0 [f'(x+h) - f'(x)] / h.

Second derivative test: f'' > 0 concave up (local min where f' = 0), f'' < 0 concave down (local max where f' = 0).

Inflection: f''=0 and changes sign.
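
A quick symbolic check of the test (a minimal SymPy sketch, reusing the cubic f(x) = x^3 - 3x that also appears in the numerical section below):

python
from sympy import symbols, diff, solve

x = symbols('x')
f = x**3 - 3*x
fp, fpp = diff(f, x), diff(f, x, 2)   # f' = 3x^2 - 3, f'' = 6x
for c in solve(fp, x):                # critical points: -1 and 1
    print(c, fpp.subs(x, c))          # f''(-1) = -6 (local max), f''(1) = 6 (local min)
print(solve(fpp, x))                  # [0], where f'' changes sign: inflection point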

Examples

  • f(x)=x^4: f'=4x^3, f''=12x^2 ≥0, global min at 0.
  • f(x)=sin(x): f''=-sin(x), oscillates.

ML Insight

  • Quadratic loss approximations use second derivatives for faster convergence.

3. Higher-Order Derivatives Beyond Second

n-th derivative: f^{(n)}(x).

Leibniz rule for products: (uv)^{(n)} = sum_{k=0}^n binom(n,k) u^{(k)} v^{(n-k)}.
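
A quick symbolic check of the Leibniz rule at n = 3 (a minimal sketch with u = x^2 and v = sin(x)):

python
from sympy import symbols, sin, diff, binomial, simplify

x = symbols('x')
u, v, n = x**2, sin(x), 3
# Right-hand side of the Leibniz rule: sum of binom(n, k) * u^(k) * v^(n-k)
leibniz = sum(binomial(n, k) * diff(u, x, k) * diff(v, x, n - k) for k in range(n + 1))
print(simplify(leibniz - diff(u * v, x, n)))  # 0: matches the direct n-th derivative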

Applications

  • Taylor series: f(x) ≈ sum_{k=0}^n f^{(k)}(a)/k! (x-a)^k.
  • Error bounds via remainder.

ML Application

  • Polynomial approximations in models; higher terms capture nonlinearity.

Example: e^x = sum x^k / k!, all derivatives e^x.
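
A short numerical check of the partial sums (a minimal sketch in plain Python):

python
import math

# Partial sums of the Taylor series of e^x at a = 0; every derivative is e^x, so f^(k)(0) = 1
x = 1.0
for n in (2, 4, 8):
    approx = sum(x**k / math.factorial(k) for k in range(n + 1))
    print(n, approx)  # 2.5, 2.7083..., 2.7182787... -> approaches exp(1) ≈ 2.71828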


4. Multivariable Second Partials and Hessian

Partial second: ∂²f/∂x², mixed ∂²f/∂x∂y.

Clairaut's theorem: mixed partials are equal when they are continuous.

Hessian H: Matrix [∂²f/∂x_i ∂x_j].

Determinant or eigenvalues classify critical points.

Properties

  • Symmetric when the second partials are continuous.
  • Positive definite (all eigenvalues > 0): local min.

ML Connection

  • In Newton's method: update Δw = -H^{-1} ∇L.
  • Quasi-Newton methods (e.g., BFGS) approximate H instead of computing it exactly.

Example: f(x,y)=x^2 + y^2, H = [[2,0],[0,2]], positive definite, min at (0,0).
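
A minimal eigenvalue check of this example with NumPy (nothing assumed beyond the Hessian written above):

python
import numpy as np

# Hessian of f(x, y) = x^2 + y^2 from the example above
H = np.array([[2.0, 0.0], [0.0, 2.0]])
eigvals = np.linalg.eigvalsh(H)       # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                        # [2. 2.]
print(bool(np.all(eigvals > 0)))      # True -> positive definite -> local min at (0, 0)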


5. Curvature Analysis via Hessian

Curvature: How much function bends.

In 1D: κ = |f''| / (1 + (f')^2)^{3/2}.

Multivar: Mean/principal curvatures come from the Hessian eigenvalues.

Condition number κ(H) = |λ_max| / |λ_min|; if it is large, H is ill-conditioned and gradient descent converges slowly.
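
A small illustration of conditioning (a sketch with a made-up diagonal Hessian):

python
import numpy as np

# Hypothetical Hessian with very different curvature along the two axes
H = np.array([[2.0, 0.0], [0.0, 200.0]])
lam = np.linalg.eigvalsh(H)
print(lam)                    # [  2. 200.]
print(lam.max() / lam.min())  # 100.0: large condition number, so plain gradient descent converges slowly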

ML Insight

  • Loss curvature: high curvature in some directions and low in others gives anisotropic landscapes; preconditioning corrects for this.
  • Saddle points: eigenvalues of mixed signs.

6. Taylor Expansions in Multiple Variables

f(x) ≈ f(a) + ∇f(a)^T (x-a) + 1/2 (x-a)^T H(a) (x-a) + ...

Quadratic term captures local curvature.

Applications

  • Approximation errors in ML.
  • Trust-region methods limit the step to the region where the approximation holds.

Example: Approximate sin(x) near 0: x - x^3/6.
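
A numerical check of the multivariable expansion (a minimal sketch with the hypothetical test function f(x, y) = x*y^2, its gradient and Hessian written out by hand):

python
import numpy as np

def f(v):
    x, y = v
    return x * y**2

a = np.array([1.0, 1.0])
grad = np.array([1.0, 2.0])              # (y^2, 2xy) evaluated at a
H = np.array([[0.0, 2.0], [2.0, 2.0]])   # [[0, 2y], [2y, 2x]] evaluated at a

v = np.array([1.1, 1.2])
d = v - a
quad = f(a) + grad @ d + 0.5 * d @ H @ d  # second-order Taylor approximation around a
print(f(v), quad)                         # ~1.584 (exact) vs ~1.58 (quadratic approx)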


7. Geometric Interpretations

1D: At a critical point (f' = 0), κ = |f''|, so the second derivative is the inverse of the radius of curvature.

Multivar: Hessian eigenvectors give the directions of principal curvature.

Visual: Paraboloid f = x^2 + 2y^2, steeper along y (curvature 4) than along x (curvature 2).
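
The principal directions for this paraboloid (a minimal sketch; the Hessian is constant here):

python
import numpy as np

# Hessian of f(x, y) = x^2 + 2*y^2
H = np.array([[2.0, 0.0], [0.0, 4.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)   # [2. 4.]: smaller curvature along x, larger along y
print(eigvecs)   # columns are the principal curvature directions (here the coordinate axes)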

In ML: Visualize loss surfaces to spot valleys and plateaus.


8. Numerical Computation of Higher Derivatives

Central finite difference for the second derivative: f''(x) ≈ [f(x+h) - 2f(x) + f(x-h)] / h^2.

For the Hessian: similar differences entry by entry (a four-point stencil for the mixed partials).

python
import numpy as np

def f(x):
    return x**3 - 3*x

def second_deriv(f, x, h=1e-5):
    return (f(x + h) - 2*f(x) + f(x - h)) / h**2

x = 1.0
print(f"Second deriv at x={x}: {second_deriv(f, x)}")  # Analytical: 6x, so 6

# Multivar Hessian
def g(x, y):
    return x**2 + y**2 + x*y

def hessian_approx(g, x, y, h=1e-5):
    fxx = (g(x + h, y) - 2*g(x, y) + g(x - h, y)) / h**2
    fyy = (g(x, y + h) - 2*g(x, y) + g(x, y - h)) / h**2
    fxy = ((g(x + h, y + h) - g(x + h, y - h)) - (g(x - h, y + h) - g(x - h, y - h))) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])

print("Hessian:", hessian_approx(g, 0, 0))  # [[2,1],[1,2]]

# ML: Hessian for quadratic loss
def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y)**2)

X = np.array([[1,1],[1,2],[1,3]])
y = np.array([1,2,3])
w = np.array([0.0, 1.0])  # intercept 0, slope 1 (this w fits the data exactly)

# Numerical Hessian (2x2)
h = 1e-5
H = np.zeros((2,2))
for i in range(2):
    for j in range(2):
        ei = np.eye(2)[:,i]
        ej = np.eye(2)[:,j]
        fp = loss(w + h*ei + h*ej, X, y)
        fm = loss(w - h*ei - h*ej, X, y)
        fp_mix = loss(w + h*ei - h*ej, X, y)
        fm_mix = loss(w - h*ei + h*ej, X, y)
        H[i,j] = (fp - fp_mix - fm_mix + fm) / (4 * h**2)

print("Approx Hessian for loss:", H)
rust
fn f(x: f64) -> f64 {
    x.powi(3) - 3.0 * x
}

fn second_deriv(f: fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - 2.0 * f(x) + f(x - h)) / h.powi(2)
}

fn g(x: f64, y: f64) -> f64 {
    x.powi(2) + y.powi(2) + x * y
}

fn hessian_approx(g: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> [[f64; 2]; 2] {
    let fxx = (g(x + h, y) - 2.0 * g(x, y) + g(x - h, y)) / h.powi(2);
    let fyy = (g(x, y + h) - 2.0 * g(x, y) + g(x, y - h)) / h.powi(2);
    let fxy = ((g(x + h, y + h) - g(x + h, y - h)) - (g(x - h, y + h) - g(x - h, y - h))) / (4.0 * h.powi(2));
    [[fxx, fxy], [fxy, fyy]]
}

fn main() {
    let x = 1.0;
    let h = 1e-5;
    println!("Second deriv at x={}: {}", x, second_deriv(f, x, h));  // ~6

    println!("Hessian: {:?}", hessian_approx(g, 0.0, 0.0, h));  // ~[[2,1],[1,2]]

    // ML: Hessian for quadratic loss
    fn loss(w: &[f64], x: &[[f64; 2]], y: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let pred = w[0] * xi[0] + w[1] * xi[1];
            sum += 0.5 * (pred - yi).powi(2);
        }
        sum / x.len() as f64
    }

    let x_data = [[1.0,1.0], [1.0,2.0], [1.0,3.0]];
    let y_data = [1.0, 2.0, 3.0];
    let w = [0.0, 1.0];
    let mut H = [[0.0, 0.0], [0.0, 0.0]];
    for i in 0..2 {
        for j in 0..2 {
            let mut wp = w;
            wp[i] += h;
            wp[j] += h;
            let mut wm = w;
            wm[i] -= h;
            wm[j] -= h;
            let mut wpm = w;
            wpm[i] += h;
            wpm[j] -= h;
            let mut wmp = w;
            wmp[i] -= h;
            wmp[j] += h;
            H[i][j] = (loss(&wp, &x_data, &y_data) - loss(&wpm, &x_data, &y_data) - loss(&wmp, &x_data, &y_data) + loss(&wm, &x_data, &y_data)) / (4.0 * h.powi(2));
        }
    }
    println!("Approx Hessian for loss: {:?}", H);
}

Both implementations approximate second derivatives and Hessian entries with central differences.


9. Symbolic Higher Derivatives

SymPy gives exact symbolic derivatives.

python
from sympy import symbols, diff, hessian, sin

x, y = symbols('x y')
f = x**4 + sin(x)
f_xx = diff(f, x, 2)
print("Second deriv:", f_xx)

g = x**2 + y**2 + x*y
H = hessian(g, (x,y))
print("Hessian:", H)
rust
// Simulated
fn main() {
    // f'' = 12 x^2 - sin(x)
    println!("Second deriv: 12 x^2 - sin(x)");

    // Hessian [[2,1],[1,2]]
    println!("Hessian: [[2,1],[1,2]]");
}

10. Second-Order Optimization in ML

Newton's method: uses the inverse Hessian to take curvature-aware steps.

Quasi-Newton: approximates H from gradient differences (e.g., BFGS).

In deep learning: the full Hessian is rarely used (too large); diagonal approximations or K-FAC stand in.

Challenges: escaping saddle points, handling indefinite Hessians.
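
A minimal Newton-step sketch on the quadratic loss from section 8 (same X and y; for a quadratic, a single step reaches the minimizer):

python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([5.0, -2.0])             # arbitrary starting point

grad = X.T @ (X @ w - y) / len(y)     # gradient of 0.5 * mean((Xw - y)^2)
H = X.T @ X / len(y)                  # Hessian, constant for a quadratic loss
w_new = w - np.linalg.solve(H, grad)  # Newton update: w - H^{-1} grad
print(w_new)                          # ~[0. 1.]: the exact minimizer after one step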


11. Mean Value Theorem for Higher Orders

Extends to higher orders as Taylor's theorem with a remainder term.

In ML: such remainder bounds feed into convergence analyses.


12. Key ML Takeaways

  • Second derivatives reveal the nature of critical points: min/max/saddle detection.
  • Hessian curvature: Informs optimizer choice.
  • Taylor approx: Local models.
  • Numerical/symbolic: Practical computation.
  • Landscape insights: Guide architecture/training.

Higher derivatives reveal hidden structures in change.


13. Summary

We explored higher-order derivatives, from second derivatives to Hessians, curvature, and Taylor expansions, with ties to optimization in ML. Examples and code in Python and Rust provide hands-on practice. This deepens the calculus toolkit for AI.

Next: Integrals.


Further Reading

  • Apostol, Calculus Vol. 2 (Multivar).
  • Boyd, Convex Optimization (Hessians).
  • 3Blue1Brown: Taylor series.
  • Rust: 'nalgebra' for matrices.