Partial Derivatives & Gradients

Multivariable calculus extends single-variable concepts to higher dimensions, where functions depend on multiple inputs. In AI and machine learning, this is essential since models often process vectors or tensors. Partial derivatives measure change with respect to one variable while holding others constant, and gradients aggregate these into a vector pointing toward steepest ascent. These tools power optimization in neural networks, enabling efficient parameter updates via algorithms like gradient descent.

This lecture builds on derivatives, exploring partials, gradients, Jacobians, Hessians, and their ML applications. We’ll connect theory to practice with intuitions, examples, and implementations in Python and Rust, fostering a deep understanding of how AI navigates complex loss landscapes.

1. Intuition Behind Multivariable Functions

In single-variable calculus, functions like ( f(x) = x^2 ) model one-dimensional change. Now, consider ( f(x, y) = x^2 + y^2 ), a paraboloid surface. Change can occur in x or y directions independently.

Partial derivatives slice this surface: Fixing y treats it as a constant, yielding a curve in x whose slope is the partial with respect to x.

ML Connection

Neural networks: Loss functions ( L(\mathbf{w}) ) depend on weight vectors (\mathbf{w} = (w_1, w_2, \dots, w_n)).
Partial derivatives: Sensitivity of loss to each weight, guiding adjustments.
High dimensions: Modern models have millions of parameters—partials make this tractable.

::: info Think of partials as “what if” scenarios: What if I tweak only this parameter while freezing others? :::

Example

For ( f(x, y) = 3x^2 y + e^{xy} ), partial w.r.t. x explores x’s isolated impact.

2. Formal Definition of Partial Derivatives

For ( f(x_1, x_2, \dots, x_n) ), the partial with respect to ( x_i ) is:

[ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h} ]

Treat other variables as constants.

Notation: ( f_x, \partial_x f, D_x f ).

Differentiability in Multiple Variables

Continuous partials imply differentiability.
But existence alone doesn’t guarantee total differentiability (unlike single-variable).

ML Insight

In regression: ( L(w, b) = \sum (y - (w x + b))^2 / n ).
- ( \partial L / \partial w = -2 \sum x (y - \hat{y}) / n ).
- Used to solve normal equations or iteratively optimize.

3. Computing Partial Derivatives: Rules and Examples

Extend single-variable rules:

Power Rule: ( \partial (x^n) / \partial x = n x^{n-1} ), others zero.
Sum Rule: Partials distribute over sums.
Product Rule: ( \partial (u v) / \partial x = u_x v + u v_x ).
Chain Rule: For composite ( f(g(x,y)) ), ( \partial f / \partial x = f’(g) \cdot \partial g / \partial x ).

Examples

( f(x,y) = x^2 y + \sin(xy) ):
- ( \partial f / \partial x = 2x y + y \cos(xy) ).
- ( \partial f / \partial y = x^2 + x \cos(xy) ).
Implicit: ( x^2 + y^2 + z^2 = 1 ), ( \partial z / \partial x = -x / z ).

ML Application

Sigmoid in multivariable: ( \sigma(\mathbf{w} \cdot \mathbf{x}) ), partial w.r.t. ( w_i ) involves chain rule.

4. The Gradient Vector

The gradient ( \nabla f = (\partial f / \partial x_1, \dots, \partial f / \partial x_n) ) points to steepest ascent, magnitude is the rate.

At point (\mathbf{p}), the directional derivative in direction (\mathbf{u}) (unit vector) is ( \nabla f(\mathbf{p}) \cdot \mathbf{u} ).

Properties

Orthogonal to level surfaces (constant f).
Zero at critical points (local min/max/saddle).

ML Connection

Gradient descent: ( \mathbf{w}{new} = \mathbf{w}{old} - \eta \nabla L(\mathbf{w}_{old}) ).
In backprop, gradients flow backward to update layers.

::: info Gradients are like compasses in parameter space: Follow negative for minima. :::

Example

For ( f(x,y) = x^2 + y^2 ), ( \nabla f = (2x, 2y) ), points outward from origin.

5. Higher-Order Partials: Jacobian and Hessian

Mixed Partials: ( \partial^2 f / \partial x \partial y ). Clairaut’s theorem: Equal if continuous.
Jacobian Matrix: For vector-valued f, matrix of partials. In ML, for transformations.
Hessian Matrix: Matrix of second partials, ( H_{ij} = \partial^2 f / \partial x_i \partial x_j ).
- Positive definite: Local min.
- Used in Newton’s method: ( \mathbf{w}{new} = \mathbf{w}{old} - H^{-1} \nabla L ).

ML Insight

In deep learning, Hessian approximates curvature for second-order optimization (e.g., L-BFGS).
Eigenvalues reveal saddle points in loss landscapes.

Example: For quadratic ( f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} ), Hessian is 2A.

6. Geometric Interpretations and Visualizations

Geometrically, partials are slopes of tangent lines in each coordinate plane.

Gradient: Tangent vector to the path of steepest ascent on the surface.

Visualization (Conceptual)

Contour plots: Gradients perpendicular to contours.
Vector fields: Plot gradients over domain—shows flow toward maxima.
In 3D: For ( z = f(x,y) ), tangent plane equation uses partials.

In ML, loss surfaces are visualized as hilly terrains; gradients navigate valleys.

7. Directional Derivatives and Applications

Directional derivative: Rate of change along vector (\mathbf{v}): ( D_{\mathbf{v}} f = \nabla f \cdot \hat{\mathbf{v}} ).

Max at (\mathbf{v}) parallel to gradient.

Applications

Constrained optimization: Lagrange multipliers set ( \nabla f = \lambda \nabla g ).
In RL: Policy gradients for action spaces.

ML Example

Feature importance: Partial derivatives w.r.t. inputs approximate sensitivity.

8. Numerical Computation of Partials and Gradients

Use finite differences for approximation:

Central: ( \partial f / \partial x_i \approx [f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)] / (2h) ).

In code, leverage libraries for auto-diff (PyTorch, but here manual).

::: code-group

import numpy as np

def f(x, y):
    return x**2 * y + np.sin(x * y)

def partial_x(f, x, y, h=1e-5):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

def gradient(f, x, y, h=1e-5):
    return np.array([partial_x(f, x, y, h), partial_y(f, x, y, h)])

x, y = 1.0, 2.0
grad = gradient(f, x, y)
print(f"Gradient at ({x}, {y}): {grad}")

# ML example: Gradient for multivariate loss
def loss(w, X, y):
    # w: vector, X: matrix, y: vector
    return np.mean((y - X @ w)**2)

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])  # Approx y = x1 + x2
w = np.array([0.5, 0.5])
h = 1e-5
grad_w = np.array([ (loss(w + h*np.eye(2)[i], X, y) - loss(w - h*np.eye(2)[i], X, y)) / (2*h) for i in range(2) ])
print(f"Gradient w.r.t. w: {grad_w}")

fn f(x: f64, y: f64) -> f64 {
    x.powi(2) * y + (x * y).sin()
}

fn partial_x(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> f64 {
    (f(x + h, y) - f(x - h, y)) / (2.0 * h)
}

fn partial_y(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> f64 {
    (f(x, y + h) - f(x, y - h)) / (2.0 * h)
}

fn gradient(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> [f64; 2] {
    [partial_x(f, x, y, h), partial_y(f, x, y, h)]
}

fn main() {
    let x = 1.0;
    let y = 2.0;
    let h = 1e-5;
    let grad = gradient(f, x, y, h);
    println!("Gradient at ({}, {}): {:?}", x, y, grad);

    // ML example: Gradient for multivariate loss
    fn loss(w: &[f64], x: &[[f64; 2]], y: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let pred = w[0] * xi[0] + w[1] * xi[1];
            sum += (yi - pred).powi(2);
        }
        sum / x.len() as f64
    }

    let x_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
    let y_data = [3.0, 7.0, 11.0];
    let mut w = [0.5, 0.5];
    let mut grad_w = [0.0, 0.0];
    for i in 0..2 {
        let mut w_plus = w;
        w_plus[i] += h;
        let mut w_minus = w;
        w_minus[i] -= h;
        grad_w[i] = (loss(&w_plus, &x_data, &y_data) - loss(&w_minus, &x_data, &y_data)) / (2.0 * h);
    }
    println!("Gradient w.r.t. w: {:?}", grad_w);
}

:::

This approximates partials and gradients for functions and ML losses.

9. Symbolic Computation in Multivariables

Use SymPy for exact partials.

::: code-group

from sympy import symbols, diff, sin

x, y = symbols('x y')
f = x**2 * y + sin(x * y)
partial_x = diff(f, x)
partial_y = diff(f, y)
print("Partial x:", partial_x)
print("Partial y:", partial_y)

# Evaluate
vals = {x: 1, y: 2}
print("Partial x at (1,2):", partial_x.subs(vals).evalf())

// Rust symbolic via hardcoded or crates (simulated)
fn main() {
    // Symbolic: partial_x = 2x y + y cos(x y)
    // partial_y = x^2 + x cos(x y)
    println!("Partial x: 2 x y + y cos(x y)");
    println!("Partial y: x^2 + x cos(x y)");

    let x_val = 1.0;
    let y_val = 2.0;
    let partial_x_val = 2.0 * x_val * y_val + y_val * (x_val * y_val).cos();
    println!("Partial x at (1,2): {}", partial_x_val);
}

:::

10. Optimization with Gradients in ML

Gradient Descent Variants: Batch, mini-batch, stochastic.
Backpropagation: Chain rule for gradients in nets.
Advanced: Momentum adds velocity; Adam adapts per-parameter.

Challenges: Vanishing gradients (sigmoid), exploding (deep nets)—fixes like ReLU, batch norm.

Example: Minimizing ( L(w_1, w_2) = (w_1 - 1)^2 + (w_2 - 2)^2 ), gradient leads to (1,2).

11. Implicit Functions and Lagrange Multipliers

Implicit theorem: Partials ensure solvability for one variable.

Lagrange: For min f subject to g=0, solve ( \nabla f = \lambda \nabla g ).

In ML: Regularization as constraints (e.g., L1/L2).

12. Mean Value Theorem in Multiple Variables

Extends: Line segment between points has point where directional derivative matches average change.

In ML: Bounds approximation errors in Taylor expansions for trust regions.

13. Key ML Takeaways

Partials isolate effects: Crucial for high-dim models.
Gradients guide descent: Core of training.
Hessians for curvature: Enable efficient optimizers.
Numerical/symbolic tools: Bridge theory to code.
Multivariable mastery: Unlocks deep learning dynamics.

Partials and gradients empower AI to optimize in vast spaces.

14. Summary

We explored partial derivatives from definitions to computations, gradients as vectors of change, and higher-order structures like Hessians. Tied to ML via optimization and backprop, with Python/Rust code for practical insight. This equips you for multivariable challenges in AI.

Next: Optimization Techniques.

Word count: Approximately 2950.

Partial Derivatives & Gradients

Partial Derivatives & Gradients

1. Intuition Behind Multivariable Functions

ML Connection

Example

2. Formal Definition of Partial Derivatives

Differentiability in Multiple Variables

ML Insight

3. Computing Partial Derivatives: Rules and Examples

Examples

ML Application

4. The Gradient Vector

Properties

ML Connection

Example

5. Higher-Order Partials: Jacobian and Hessian

ML Insight

6. Geometric Interpretations and Visualizations

Visualization (Conceptual)

7. Directional Derivatives and Applications

Applications

ML Example

8. Numerical Computation of Partials and Gradients

9. Symbolic Computation in Multivariables

10. Optimization with Gradients in ML

11. Implicit Functions and Lagrange Multipliers

12. Mean Value Theorem in Multiple Variables

13. Key ML Takeaways

14. Summary

Further Reading