Higher-Order Derivatives - Hessian & Curvature
Higher-order derivatives extend first derivatives to capture acceleration, jerk, and beyond, revealing deeper behaviors like concavity and curvature. In AI and machine learning, they form the Hessian matrix, which quantifies the curvature of loss functions, enabling advanced optimization techniques like Newton's method and informing analyses of training dynamics in neural networks.
This lecture builds on partial derivatives and gradients, exploring second and higher derivatives, Taylor approximations, the Hessian, and their roles in ML. We’ll integrate mathematical foundations with practical applications, supported by intuitions, examples, and implementations in Python and Rust, to illuminate how curvature shapes AI model training and performance.
1. Intuition for Higher-Order Derivatives
First derivative: Slope, rate of change.
Second: Rate of change of slope—concavity. Positive: U-shaped (speeds up), negative: Inverted U (slows down).
Higher: Third (jerk), fourth (snap), etc., model abrupt changes.
ML Connection
- In optimization, second derivatives detect minima (positive curvature).
- Loss landscapes: Flat regions (low curvature) slow training; sharp regions (high curvature) risk overshooting.
::: info
Higher derivatives peel layers of "how change changes," like acceleration in physics or curvature in roads.
:::
Everyday Example
- Position s(t): s' velocity, s'' acceleration, s''' jerk (whiplash feel).
2. Second Derivatives in Single Variable
f''(x) = d/dx [f'(x)] = lim h→0 [f'(x+h) - f'(x)] / h.
Tests: f'' > 0 concave up (min if f' = 0), f'' < 0 concave down (max).
Inflection: f'' = 0 and f'' changes sign.
Examples
- f(x) = x^4: f' = 4x^3, f'' = 12x^2 ≥ 0, global min at 0.
- f(x) = sin(x): f'' = -sin(x), oscillates between concave up and down. (Both examples are checked in the sketch below.)
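As a quick check of the second-derivative test on these two examples, here is a minimal sketch (assuming SymPy is available; symbolic derivatives are covered more fully in Section 9):

```python
from sympy import symbols, diff, solve, sin

x = symbols('x', real=True)

for f in (x**4, sin(x)):
    f1, f2 = diff(f, x), diff(f, x, 2)
    for c in solve(f1, x):           # critical points where f'(x) = 0
        curv = f2.subs(x, c)
        kind = "min" if curv > 0 else "max" if curv < 0 else "test inconclusive (f'' = 0)"
        print(f"f = {f}, critical point x = {c}: f'' = {curv} -> {kind}")
```

For x^4 the test is inconclusive at 0 even though 0 is a global minimum, which is exactly the caveat of relying on f'' alone.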
ML Insight
- Quadratic loss approximations use second derivatives for faster convergence (see the sketch below).
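To make the convergence claim concrete, here is a small illustrative sketch (plain Python; the quadratic function and learning rate are my own choices, not from the lecture). One Newton step x_new = x - f'(x)/f''(x) minimizes the local quadratic model exactly, so on a truly quadratic loss it reaches the minimum in a single step, while a plain gradient step only moves partway:

```python
# Illustrative quadratic "loss" f(x) = 3*(x - 2)**2 + 1, minimum at x = 2
f1 = lambda x: 6 * (x - 2)   # f'(x)
f2 = lambda x: 6.0           # f''(x), constant for a quadratic

x0 = 10.0
x_newton = x0 - f1(x0) / f2(x0)   # Newton step: uses curvature
x_gd = x0 - 0.1 * f1(x0)          # gradient step with learning rate 0.1
print(x_newton, x_gd)             # 2.0 vs 5.2: Newton lands on the minimum
```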
3. Higher-Order Derivatives Beyond Second
n-th derivative: f^{(n)}(x).
Leibniz rule for products: (uv)^{(n)} = sum_{k=0}^n binom(n,k) u^{(k)} v^{(n-k)}.
Applications
- Taylor series: f(x) ≈ sum_{k=0}^n f^{(k)}(a)/k! (x-a)^k.
- Error bounds via remainder.
ML Application
- Polynomial approximations in models; higher terms capture nonlinearity.
Example: e^x = sum_{k=0}^∞ x^k / k!; every derivative of e^x is e^x itself.
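A short sketch of the partial sums (a minimal example in plain Python, comparing against math.exp):

```python
import math

def exp_taylor(x, n):
    # Partial sum of the Taylor series of e^x around 0, up to degree n
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.0
for n in (2, 4, 8):
    approx = exp_taylor(x, n)
    print(f"n={n}: {approx:.6f}, error={abs(approx - math.exp(x)):.2e}")
```

The error shrinks rapidly with n, as the remainder term predicts.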
4. Multivariable Second Partials and Hessian
Partial second: ∂²f/∂x², mixed ∂²f/∂x∂y.
Clairaut's theorem: Mixed partials are equal when they are continuous.
Hessian H: Matrix [∂²f/∂x_i ∂x_j].
Determinant or eigenvalues classify critical points.
Properties
- Symmetric when the second partials are continuous.
- Positive definite (all eigenvalues > 0): Local min.
ML Connection
- In Newton's method: Update Δw = -H^{-1} ∇L.
- Quasi-Newton methods (e.g., BFGS) approximate H.
Example: f(x,y)=x^2 + y^2, H = [[2,0],[0,2]], positive definite, min at (0,0).
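The classification by eigenvalues is easy to automate; below is a minimal NumPy sketch (the saddle example x^2 - y^2 is added here for contrast and is not from the text above):

```python
import numpy as np

def classify(H):
    # Classify a critical point from the eigenvalues of a symmetric Hessian
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local min", eig
    if np.all(eig < 0):
        return "local max", eig
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle", eig
    return "degenerate", eig

print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))   # f = x^2 + y^2 -> local min
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))  # f = x^2 - y^2 -> saddle
```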
5. Curvature Analysis via Hessian
Curvature: How much the function bends.
In 1D: κ = |f''| / (1 + (f')^2)^{3/2}.
Multivar: Mean and principal curvatures come from the Hessian eigenvalues.
Condition number κ(H) = |λ_max / λ_min|; when it is large the landscape is ill-conditioned and gradient descent is slow.
ML Insight
- Loss curvature: High curvature in some directions creates anisotropic landscapes, mitigated by preconditioning (illustrated in the sketch below).
- Saddle points: Eigenvalues of mixed signs.
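A small sketch of what ill-conditioning means in practice (the diagonal Hessian and the stability bound used below are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Anisotropic quadratic loss L(w) = 0.5 * w^T H w with very different curvatures
H = np.diag([1.0, 100.0])
eig = np.linalg.eigvalsh(H)
cond = eig.max() / eig.min()
print("condition number:", cond)   # 100.0 -> ill-conditioned

# For gradient descent on a quadratic, stability requires lr < 2 / lambda_max,
# so only about lambda_min * lr of the error along the flattest direction
# is removed per step.
lr = 1.9 / eig.max()
print("usable lr:", lr, "| fraction of flat-direction error removed per step:", eig.min() * lr)
```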
6. Taylor Expansions in Multiple Variables
f(x) ≈ f(a) + ∇f(a)^T (x-a) + 1/2 (x-a)^T H(a) (x-a) + …
Quadratic term captures local curvature.
Applications
- Approximation errors in ML.
- Trust-region methods limit the step to the region where the approximation holds.
Example: Approximate sin(x) near 0: x - x^3/6.
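Here is a minimal sketch of the quadratic (second-order) Taylor model in two variables, using an illustrative function f(x, y) = sin(x) + y^2 with its gradient and Hessian written out by hand (this example is mine, not from the text):

```python
import numpy as np

def f(v):
    x, y = v
    return np.sin(x) + y**2

def grad(v):
    x, y = v
    return np.array([np.cos(x), 2 * y])

def hess(v):
    x, y = v
    return np.array([[-np.sin(x), 0.0], [0.0, 2.0]])

a = np.array([0.0, 0.0])   # expansion point
d = np.array([0.3, 0.2])   # displacement x - a
taylor2 = f(a) + grad(a) @ d + 0.5 * d @ hess(a) @ d
print(f"true: {f(a + d):.6f}  quadratic Taylor: {taylor2:.6f}")
# The gap (~0.005) is the cubic remainder, dominated here by -x^3/6
```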
7. Geometric Interpretations
1D: The second derivative is the inverse of the radius of curvature (exactly so where f' = 0).
Multivar: Hessian eigenvectors give the directions of principal curvature.
Visual: Paraboloid f = x^2 + 2y^2, curved more steeply along y than along x (see the sketch below).
In ML: Visualize loss surfaces to see valleys and plateaus.
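For the paraboloid above, the principal curvature directions fall out of an eigendecomposition of its (constant) Hessian; a minimal NumPy sketch:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 2*y^2 is constant: diag(2, 4)
H = np.array([[2.0, 0.0], [0.0, 4.0]])
eigvals, eigvecs = np.linalg.eigh(H)
for lam, v in zip(eigvals, eigvecs.T):   # columns of eigvecs are eigenvectors
    print(f"principal curvature {lam} along direction {v}")
# Curvature 2 along x, 4 along y: the bowl is steeper in the y direction
```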
8. Numerical Computation of Higher Derivatives
Central finite difference for the second derivative: f''(x) ≈ [f(x+h) - 2f(x) + f(x-h)] / h^2.
For the Hessian: Apply a similar stencil per entry.
::: code-group

```python
import numpy as np

def f(x):
    return x**3 - 3*x

def second_deriv(f, x, h=1e-5):
    return (f(x + h) - 2*f(x) + f(x - h)) / h**2

x = 1.0
print(f"Second deriv at x={x}: {second_deriv(f, x)}")  # Analytical: 6x, so 6

# Multivariable Hessian
def g(x, y):
    return x**2 + y**2 + x*y

def hessian_approx(g, x, y, h=1e-5):
    fxx = (g(x + h, y) - 2*g(x, y) + g(x - h, y)) / h**2
    fyy = (g(x, y + h) - 2*g(x, y) + g(x, y - h)) / h**2
    fxy = ((g(x + h, y + h) - g(x + h, y - h))
           - (g(x - h, y + h) - g(x - h, y - h))) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])

print("Hessian:", hessian_approx(g, 0, 0))  # [[2,1],[1,2]]

# ML: Hessian for quadratic loss
def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y)**2)

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([1, 2, 3])
w = np.array([0.0, 1.0])  # Intercept, slope approx

# Numerical Hessian (2x2)
h = 1e-5
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei = np.eye(2)[:, i]
        ej = np.eye(2)[:, j]
        fp = loss(w + h*ei + h*ej, X, y)
        fm = loss(w - h*ei - h*ej, X, y)
        fp_mix = loss(w + h*ei - h*ej, X, y)
        fm_mix = loss(w - h*ei + h*ej, X, y)
        H[i, j] = (fp - fp_mix - fm_mix + fm) / (4 * h**2)

print("Approx Hessian for loss:", H)
```

```rust
fn f(x: f64) -> f64 {
    x.powi(3) - 3.0 * x
}

fn second_deriv(f: fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - 2.0 * f(x) + f(x - h)) / h.powi(2)
}

fn g(x: f64, y: f64) -> f64 {
    x.powi(2) + y.powi(2) + x * y
}

fn hessian_approx(g: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> [[f64; 2]; 2] {
    let fxx = (g(x + h, y) - 2.0 * g(x, y) + g(x - h, y)) / h.powi(2);
    let fyy = (g(x, y + h) - 2.0 * g(x, y) + g(x, y - h)) / h.powi(2);
    let fxy = ((g(x + h, y + h) - g(x + h, y - h))
        - (g(x - h, y + h) - g(x - h, y - h))) / (4.0 * h.powi(2));
    [[fxx, fxy], [fxy, fyy]]
}

fn main() {
    let x = 1.0;
    let h = 1e-5;
    println!("Second deriv at x={}: {}", x, second_deriv(f, x, h)); // ~6

    println!("Hessian: {:?}", hessian_approx(g, 0.0, 0.0, h)); // ~[[2,1],[1,2]]

    // ML: Hessian for quadratic loss
    fn loss(w: &[f64], x: &[[f64; 2]], y: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let pred = w[0] * xi[0] + w[1] * xi[1];
            sum += 0.5 * (pred - yi).powi(2);
        }
        sum / x.len() as f64
    }

    let x_data = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]];
    let y_data = [1.0, 2.0, 3.0];
    let w = [0.0, 1.0];
    let mut hess = [[0.0, 0.0], [0.0, 0.0]];
    for i in 0..2 {
        for j in 0..2 {
            let mut wp = w; wp[i] += h; wp[j] += h;
            let mut wm = w; wm[i] -= h; wm[j] -= h;
            let mut wpm = w; wpm[i] += h; wpm[j] -= h;
            let mut wmp = w; wmp[i] -= h; wmp[j] += h;
            hess[i][j] = (loss(&wp, &x_data, &y_data) - loss(&wpm, &x_data, &y_data)
                - loss(&wmp, &x_data, &y_data) + loss(&wm, &x_data, &y_data))
                / (4.0 * h.powi(2));
        }
    }
    println!("Approx Hessian for loss: {:?}", hess);
}
```

:::
These stencils approximate second derivatives and full Hessians numerically.
9. Symbolic Higher Derivatives
SymPy computes higher derivatives exactly.
::: code-group

```python
from sympy import symbols, diff, hessian, sin

x, y = symbols('x y')
f = x**4 + sin(x)
f_xx = diff(f, x, 2)
print("Second deriv:", f_xx)  # 12*x**2 - sin(x)

g = x**2 + y**2 + x*y
H = hessian(g, (x, y))
print("Hessian:", H)  # Matrix([[2, 1], [1, 2]])
```

```rust
// Simulated: Rust lacks a built-in symbolic engine, so print the analytic results.
fn main() {
    // d^2/dx^2 (x^4 + sin x) = 12 x^2 - sin(x)
    println!("Second deriv: 12 x^2 - sin(x)");

    // Hessian of x^2 + y^2 + x*y
    println!("Hessian: [[2,1],[1,2]]");
}
```

:::
10. Second-Order Optimization in ML
Newton's method: Uses the inverse Hessian for curvature-aware steps.
Quasi-Newton: Approximates H (e.g., BFGS).
In deep learning: The full Hessian is rarely used (too large); diagonal approximations or K-FAC stand in.
Challenges: Saddle escape, indefiniteness.
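As a concrete sketch of the Newton update Δw = -H^{-1} ∇L, the snippet below reuses the small least-squares setup from Section 8; because that loss is exactly quadratic, a single Newton step reaches the minimizer (np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 0.0])

# For L(w) = 0.5 * mean((Xw - y)^2): gradient and constant Hessian
grad = X.T @ (X @ w - y) / len(y)
H = X.T @ X / len(y)

w_new = w - np.linalg.solve(H, grad)    # Newton step: dw = -H^{-1} grad
print("after one Newton step:", w_new)  # exact least-squares solution [0, 1]
```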
11. Mean Value Theorem for Higher Orders
The mean value theorem extends to Taylor's theorem with a remainder term.
In ML: Used to bound approximation error and convergence rates.
12. Key ML Takeaways
- Second derivatives reveal the nature of critical points: min/max/saddle detection.
- Hessian curvature: Informs optimizer choice.
- Taylor approx: Local models.
- Numerical/symbolic: Practical computation.
- Landscape insights: Guide architecture/training.
Higher derivatives reveal hidden structures in change.
13. Summary
We explored higher-order derivatives, from second derivatives to Hessians, curvature, and Taylor expansions, with ML ties to optimization. Examples and code in Python and Rust provide hands-on practice. This deepens the calculus toolkit for AI.
Next: Integrals.
Further Reading
- Apostol, Calculus Vol. 2 (multivariable calculus).
- Boyd, Convex Optimization (Hessians).
- 3Blue1Brown: Taylor series.
- Rust: ‘nalgebra’ for matrices.