Partial Derivatives & Gradients
Partial Derivatives & Gradients
Section titled “Partial Derivatives & Gradients”Multivariable calculus extends single-variable concepts to higher dimensions, where functions depend on multiple inputs. In AI and machine learning, this is essential since models often process vectors or tensors. Partial derivatives measure change with respect to one variable while holding others constant, and gradients aggregate these into a vector pointing toward steepest ascent. These tools power optimization in neural networks, enabling efficient parameter updates via algorithms like gradient descent.
This lecture builds on derivatives, exploring partials, gradients, Jacobians, Hessians, and their ML applications. We’ll connect theory to practice with intuitions, examples, and implementations in Python and Rust, fostering a deep understanding of how AI navigates complex loss landscapes.
1. Intuition Behind Multivariable Functions
Section titled “1. Intuition Behind Multivariable Functions”In single-variable calculus, functions like ( f(x) = x^2 ) model one-dimensional change. Now, consider ( f(x, y) = x^2 + y^2 ), a paraboloid surface. Change can occur in x or y directions independently.
Partial derivatives slice this surface: Fixing y treats it as a constant, yielding a curve in x whose slope is the partial with respect to x.
ML Connection
Section titled “ML Connection”- Neural networks: Loss functions ( L(\mathbf{w}) ) depend on weight vectors (\mathbf{w} = (w_1, w_2, \dots, w_n)).
- Partial derivatives: Sensitivity of loss to each weight, guiding adjustments.
- High dimensions: Modern models have millions of parameters—partials make this tractable.
::: info Think of partials as “what if” scenarios: What if I tweak only this parameter while freezing others? :::
Example
Section titled “Example”- For ( f(x, y) = 3x^2 y + e^{xy} ), partial w.r.t. x explores x’s isolated impact.
2. Formal Definition of Partial Derivatives
Section titled “2. Formal Definition of Partial Derivatives”For ( f(x_1, x_2, \dots, x_n) ), the partial with respect to ( x_i ) is:
[ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h} ]
Treat other variables as constants.
Notation: ( f_x, \partial_x f, D_x f ).
Differentiability in Multiple Variables
Section titled “Differentiability in Multiple Variables”- Continuous partials imply differentiability.
- But existence alone doesn’t guarantee total differentiability (unlike single-variable).
ML Insight
Section titled “ML Insight”- In regression: ( L(w, b) = \sum (y - (w x + b))^2 / n ).
- ( \partial L / \partial w = -2 \sum x (y - \hat{y}) / n ).
- Used to solve normal equations or iteratively optimize.
3. Computing Partial Derivatives: Rules and Examples
Section titled “3. Computing Partial Derivatives: Rules and Examples”Extend single-variable rules:
- Power Rule: ( \partial (x^n) / \partial x = n x^{n-1} ), others zero.
- Sum Rule: Partials distribute over sums.
- Product Rule: ( \partial (u v) / \partial x = u_x v + u v_x ).
- Chain Rule: For composite ( f(g(x,y)) ), ( \partial f / \partial x = f’(g) \cdot \partial g / \partial x ).
Examples
Section titled “Examples”- ( f(x,y) = x^2 y + \sin(xy) ):
- ( \partial f / \partial x = 2x y + y \cos(xy) ).
- ( \partial f / \partial y = x^2 + x \cos(xy) ).
- Implicit: ( x^2 + y^2 + z^2 = 1 ), ( \partial z / \partial x = -x / z ).
ML Application
Section titled “ML Application”- Sigmoid in multivariable: ( \sigma(\mathbf{w} \cdot \mathbf{x}) ), partial w.r.t. ( w_i ) involves chain rule.
4. The Gradient Vector
Section titled “4. The Gradient Vector”The gradient ( \nabla f = (\partial f / \partial x_1, \dots, \partial f / \partial x_n) ) points to steepest ascent, magnitude is the rate.
At point (\mathbf{p}), the directional derivative in direction (\mathbf{u}) (unit vector) is ( \nabla f(\mathbf{p}) \cdot \mathbf{u} ).
Properties
Section titled “Properties”- Orthogonal to level surfaces (constant f).
- Zero at critical points (local min/max/saddle).
ML Connection
Section titled “ML Connection”- Gradient descent: ( \mathbf{w}{new} = \mathbf{w}{old} - \eta \nabla L(\mathbf{w}_{old}) ).
- In backprop, gradients flow backward to update layers.
::: info Gradients are like compasses in parameter space: Follow negative for minima. :::
Example
Section titled “Example”- For ( f(x,y) = x^2 + y^2 ), ( \nabla f = (2x, 2y) ), points outward from origin.
5. Higher-Order Partials: Jacobian and Hessian
Section titled “5. Higher-Order Partials: Jacobian and Hessian”- Mixed Partials: ( \partial^2 f / \partial x \partial y ). Clairaut’s theorem: Equal if continuous.
- Jacobian Matrix: For vector-valued f, matrix of partials. In ML, for transformations.
- Hessian Matrix: Matrix of second partials, ( H_{ij} = \partial^2 f / \partial x_i \partial x_j ).
- Positive definite: Local min.
- Used in Newton’s method: ( \mathbf{w}{new} = \mathbf{w}{old} - H^{-1} \nabla L ).
ML Insight
Section titled “ML Insight”- In deep learning, Hessian approximates curvature for second-order optimization (e.g., L-BFGS).
- Eigenvalues reveal saddle points in loss landscapes.
Example: For quadratic ( f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} ), Hessian is 2A.
6. Geometric Interpretations and Visualizations
Section titled “6. Geometric Interpretations and Visualizations”Geometrically, partials are slopes of tangent lines in each coordinate plane.
Gradient: Tangent vector to the path of steepest ascent on the surface.
Visualization (Conceptual)
Section titled “Visualization (Conceptual)”- Contour plots: Gradients perpendicular to contours.
- Vector fields: Plot gradients over domain—shows flow toward maxima.
- In 3D: For ( z = f(x,y) ), tangent plane equation uses partials.
In ML, loss surfaces are visualized as hilly terrains; gradients navigate valleys.
7. Directional Derivatives and Applications
Section titled “7. Directional Derivatives and Applications”Directional derivative: Rate of change along vector (\mathbf{v}): ( D_{\mathbf{v}} f = \nabla f \cdot \hat{\mathbf{v}} ).
Max at (\mathbf{v}) parallel to gradient.
Applications
Section titled “Applications”- Constrained optimization: Lagrange multipliers set ( \nabla f = \lambda \nabla g ).
- In RL: Policy gradients for action spaces.
ML Example
Section titled “ML Example”- Feature importance: Partial derivatives w.r.t. inputs approximate sensitivity.
8. Numerical Computation of Partials and Gradients
Section titled “8. Numerical Computation of Partials and Gradients”Use finite differences for approximation:
- Central: ( \partial f / \partial x_i \approx [f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)] / (2h) ).
In code, leverage libraries for auto-diff (PyTorch, but here manual).
::: code-group
import numpy as np
def f(x, y): return x**2 * y + np.sin(x * y)
def partial_x(f, x, y, h=1e-5): return (f(x + h, y) - f(x - h, y)) / (2 * h)
def partial_y(f, x, y, h=1e-5): return (f(x, y + h) - f(x, y - h)) / (2 * h)
def gradient(f, x, y, h=1e-5): return np.array([partial_x(f, x, y, h), partial_y(f, x, y, h)])
x, y = 1.0, 2.0grad = gradient(f, x, y)print(f"Gradient at ({x}, {y}): {grad}")
# ML example: Gradient for multivariate lossdef loss(w, X, y): # w: vector, X: matrix, y: vector return np.mean((y - X @ w)**2)
X = np.array([[1, 2], [3, 4], [5, 6]])y = np.array([3, 7, 11]) # Approx y = x1 + x2w = np.array([0.5, 0.5])h = 1e-5grad_w = np.array([ (loss(w + h*np.eye(2)[i], X, y) - loss(w - h*np.eye(2)[i], X, y)) / (2*h) for i in range(2) ])print(f"Gradient w.r.t. w: {grad_w}")fn f(x: f64, y: f64) -> f64 { x.powi(2) * y + (x * y).sin()}
fn partial_x(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> f64 { (f(x + h, y) - f(x - h, y)) / (2.0 * h)}
fn partial_y(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> f64 { (f(x, y + h) - f(x, y - h)) / (2.0 * h)}
fn gradient(f: fn(f64, f64) -> f64, x: f64, y: f64, h: f64) -> [f64; 2] { [partial_x(f, x, y, h), partial_y(f, x, y, h)]}
fn main() { let x = 1.0; let y = 2.0; let h = 1e-5; let grad = gradient(f, x, y, h); println!("Gradient at ({}, {}): {:?}", x, y, grad);
// ML example: Gradient for multivariate loss fn loss(w: &[f64], x: &[[f64; 2]], y: &[f64]) -> f64 { let mut sum = 0.0; for (xi, &yi) in x.iter().zip(y.iter()) { let pred = w[0] * xi[0] + w[1] * xi[1]; sum += (yi - pred).powi(2); } sum / x.len() as f64 }
let x_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]; let y_data = [3.0, 7.0, 11.0]; let mut w = [0.5, 0.5]; let mut grad_w = [0.0, 0.0]; for i in 0..2 { let mut w_plus = w; w_plus[i] += h; let mut w_minus = w; w_minus[i] -= h; grad_w[i] = (loss(&w_plus, &x_data, &y_data) - loss(&w_minus, &x_data, &y_data)) / (2.0 * h); } println!("Gradient w.r.t. w: {:?}", grad_w);}:::
This approximates partials and gradients for functions and ML losses.
9. Symbolic Computation in Multivariables
Section titled “9. Symbolic Computation in Multivariables”Use SymPy for exact partials.
::: code-group
from sympy import symbols, diff, sin
x, y = symbols('x y')f = x**2 * y + sin(x * y)partial_x = diff(f, x)partial_y = diff(f, y)print("Partial x:", partial_x)print("Partial y:", partial_y)
# Evaluatevals = {x: 1, y: 2}print("Partial x at (1,2):", partial_x.subs(vals).evalf())// Rust symbolic via hardcoded or crates (simulated)fn main() { // Symbolic: partial_x = 2x y + y cos(x y) // partial_y = x^2 + x cos(x y) println!("Partial x: 2 x y + y cos(x y)"); println!("Partial y: x^2 + x cos(x y)");
let x_val = 1.0; let y_val = 2.0; let partial_x_val = 2.0 * x_val * y_val + y_val * (x_val * y_val).cos(); println!("Partial x at (1,2): {}", partial_x_val);}:::
10. Optimization with Gradients in ML
Section titled “10. Optimization with Gradients in ML”- Gradient Descent Variants: Batch, mini-batch, stochastic.
- Backpropagation: Chain rule for gradients in nets.
- Advanced: Momentum adds velocity; Adam adapts per-parameter.
Challenges: Vanishing gradients (sigmoid), exploding (deep nets)—fixes like ReLU, batch norm.
Example: Minimizing ( L(w_1, w_2) = (w_1 - 1)^2 + (w_2 - 2)^2 ), gradient leads to (1,2).
11. Implicit Functions and Lagrange Multipliers
Section titled “11. Implicit Functions and Lagrange Multipliers”Implicit theorem: Partials ensure solvability for one variable.
Lagrange: For min f subject to g=0, solve ( \nabla f = \lambda \nabla g ).
In ML: Regularization as constraints (e.g., L1/L2).
12. Mean Value Theorem in Multiple Variables
Section titled “12. Mean Value Theorem in Multiple Variables”Extends: Line segment between points has point where directional derivative matches average change.
In ML: Bounds approximation errors in Taylor expansions for trust regions.
13. Key ML Takeaways
Section titled “13. Key ML Takeaways”- Partials isolate effects: Crucial for high-dim models.
- Gradients guide descent: Core of training.
- Hessians for curvature: Enable efficient optimizers.
- Numerical/symbolic tools: Bridge theory to code.
- Multivariable mastery: Unlocks deep learning dynamics.
Partials and gradients empower AI to optimize in vast spaces.
14. Summary
Section titled “14. Summary”We explored partial derivatives from definitions to computations, gradients as vectors of change, and higher-order structures like Hessians. Tied to ML via optimization and backprop, with Python/Rust code for practical insight. This equips you for multivariable challenges in AI.
Next: Optimization Techniques.
Word count: Approximately 2950.
Further Reading
Section titled “Further Reading”- Stewart, Multivariable Calculus (Chapters 14-15).
- Goodfellow et al., Deep Learning (Chapter 4: Numerical Computation).
- 3Blue1Brown: Linear Algebra and Multivariable Calculus videos.
- Rust: Explore ‘ndarray’ for array ops.