Derivatives: Measuring Change
Calculus empowers us to quantify change, and derivatives are the cornerstone for measuring instantaneous rates of change. In artificial intelligence and machine learning, derivatives appear as gradients that drive optimization algorithms such as gradient descent. They tell us how to adjust model parameters to minimize loss functions, enabling models to learn from data. This lecture covers the intuition behind derivatives, their formal definition, computational rules, and practical applications in ML, bridging mathematical rigor with code implementations in Python and Rust.
Whether you’re tuning neural networks or analyzing algorithmic efficiency, mastering derivatives unlocks deeper insights into how AI systems evolve and improve.
1. What is a Derivative? Intuition First
Imagine driving a car: your speedometer shows the instantaneous speed, not the average over the trip. A derivative captures this "instantaneous rate of change" for any function. For a position function ( s(t) ), the derivative ( s'(t) ) gives velocity at time ( t ).
Formally, the derivative of ( f(x) ) at ( x = a ) is the slope of the tangent line to the graph at that point. It’s the limit of the average rate of change over shrinking intervals.
ML Connection
- In gradient descent, the derivative of the loss function with respect to the weights tells us the direction to update parameters: ( w_{new} = w_{old} - \eta \frac{\partial L}{\partial w} ).
- Backpropagation computes derivatives layer by layer in neural networks.
- Derivatives help diagnose issues like vanishing or exploding gradients in deep learning.
::: info
Derivatives aren't just slopes; they're the DNA of optimization, revealing how small tweaks in inputs ripple through outputs.
:::
Everyday Example
- For ( f(x) = x^2 ), the derivative at ( x = 2 ) is 4, meaning the function increases by about 4 units for every small unit increase in ( x ) near 2.
- In economics (analogous to ML cost functions), marginal cost is the derivative of total cost.
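The first example can be checked numerically: as the step shrinks, the average rate of change of ( x^2 ) near ( x = 2 ) approaches 4. A minimal sketch in Python:

```python
# Average rates of change of f(x) = x^2 over [2, 2 + h] approach the derivative 4
def f(x):
    return x ** 2

x = 2.0
for h in [1.0, 0.1, 0.01, 0.001]:
    avg_rate = (f(x + h) - f(x)) / h  # slope of the secant line over [x, x + h]
    print(f"h = {h:<6} average rate = {avg_rate:.6f}")
```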
2. Formal Definition Using Limits
The derivative is defined via limits:
[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]
This is the difference quotient approaching the instantaneous rate as ( h ) shrinks.
If the limit exists, ( f ) is differentiable at ( x ). Differentiability implies continuity, but not vice versa (e.g., ( |x| ) at 0 is continuous but not differentiable).
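The ( |x| ) example can be made concrete: the one-sided difference quotients at 0 approach different values, so the two-sided limit defining the derivative does not exist. A quick sketch:

```python
# One-sided difference quotients of f(x) = |x| at x = 0 disagree
h = 1e-6
right = (abs(0 + h) - abs(0)) / h     # slope from the right: +1
left = (abs(0 - h) - abs(0)) / (-h)   # slope from the left: -1
print(right, left)  # 1.0 -1.0
```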
Conditions for Differentiability
- The left-hand and right-hand limits of the difference quotient must agree.
- No sharp corners, cusps, or vertical tangents.
ML Insight
- Loss functions like mean squared error (MSE) are differentiable, allowing smooth optimization.
- Non-differentiable activations like ReLU use subgradients in practice, but derivatives guide most training.
3. Geometric and Physical Interpretations
Geometrically: the derivative is the tangent slope. For ( f(x) = \sin(x) ), at ( x = 0 ), ( f'(0) = 1 ), a gentle upward slope.
Physically: Acceleration is the derivative of velocity (second derivative of position). In robotics or reinforcement learning, this models dynamic systems.
Visualization (Conceptual)
- Plot ( f(x) = x^3 - x ): tangents at local max/min have zero slope (critical points).
- In ML, the gradient vector field over a loss surface shows paths of steepest descent—valleys lead to minima.
These graphs illustrate how derivatives pinpoint where functions flatten or accelerate.
4. Basic Differentiation Rules
No need to compute limits every time; standard rules simplify the work:
- Constant Rule: ( \frac{d}{dx} c = 0 ).
- Power Rule: ( \frac{d}{dx} x^n = n x^{n-1} ).
- Sum Rule: ( \frac{d}{dx} [f(x) + g(x)] = f'(x) + g'(x) ).
- Product Rule: ( \frac{d}{dx} [f(x) g(x)] = f'(x)g(x) + f(x)g'(x) ).
- Quotient Rule: ( \frac{d}{dx} \left[ \frac{f(x)}{g(x)} \right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2} ).
- Chain Rule: ( \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) ), crucial for composite functions like neural networks.
Examples
- ( f(x) = 3x^4 ): ( f'(x) = 12x^3 ).
- ( f(x) = e^x \sin(x) ): use the product rule → ( f'(x) = e^x \sin(x) + e^x \cos(x) ).
- Chain rule in action: for ( f(x) = (x^2 + 1)^3 ), let ( u = x^2 + 1 ), ( f = u^3 ), so ( f'(x) = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2 ).
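The chain-rule result above can be sanity-checked with a central difference (a sketch; the sample point and tolerance are arbitrary choices):

```python
def f(x):
    return (x ** 2 + 1) ** 3

def f_prime(x):
    # Chain rule: 3(x^2 + 1)^2 * 2x
    return 3 * (x ** 2 + 1) ** 2 * 2 * x

x, h = 1.5, 1e-6
numerical = (f(x + h) - f(x - h)) / (2 * h)  # central difference
print(f_prime(x), numerical)  # the two values agree closely
```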
ML Application
- The chain rule enables backpropagation: derivatives propagate backward through layers.
- In logistic regression, the derivative of the sigmoid is ( \sigma'(x) = \sigma(x)(1 - \sigma(x)) ).
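The sigmoid identity is easy to verify numerically (a minimal sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
identity = sigmoid(x) * (1 - sigmoid(x))                 # sigma'(x) = sigma(x)(1 - sigma(x))
print(numerical, identity)  # both match to many decimal places
```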
5. Higher-Order Derivatives
Derivatives of derivatives reveal concavity and acceleration.
- Second derivative: ( f''(x) = \frac{d}{dx} f'(x) ), measures curvature.
- ( f''(x) > 0 ): concave up (local min possible).
- ( f''(x) < 0 ): concave down (local max possible).
- Third and higher: Used in Taylor expansions for approximations.
ML Connection
- Hessian matrix (second partials) in Newton's method for faster convergence.
- In optimization, positive definite Hessian ensures a minimum.
Example: for ( f(x) = x^4 ), ( f'(x) = 4x^3 ) and ( f''(x) = 12x^2 ), which is always non-negative, so the function is convex with a global minimum at 0.
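Curvature information can be used directly: a one-dimensional Newton step divides the gradient by the second derivative. Sketched below on ( f(x) = x^4 ) (a toy illustration, not the full matrix-valued Newton's method):

```python
# Newton's method in 1D: x <- x - f'(x) / f''(x), applied to f(x) = x^4
def f_prime(x):
    return 4 * x ** 3

def f_double_prime(x):
    return 12 * x ** 2

x = 1.0
for _ in range(20):
    x -= f_prime(x) / f_double_prime(x)  # each step multiplies x by 2/3
print(x)  # approaches the minimizer at 0
```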
6. Partial Derivatives in Multivariable Calculus
ML deals with high-dimensional spaces, so we extend derivatives to functions of multiple variables.
For ( f(x, y) ), partial with respect to ( x ): Treat ( y ) as constant.
[ \frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h} ]
Gradient: ( \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) ), direction of steepest ascent.
ML Example
- In linear regression, the loss is ( L(w, b) = \frac{1}{n} \sum (y_i - (w x_i + b))^2 ).
- ( \frac{\partial L}{\partial w} = -\frac{2}{n} \sum x_i (y_i - \hat{y}_i) ).
- Used in batch gradient descent.
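The partial derivative above can be coded directly and compared against a finite difference on the toy data ( y = 2x ) (a sketch; the sample point ( w = 1.5 ) is an arbitrary choice):

```python
import numpy as np

x_data = np.array([1.0, 2.0, 3.0])
y_data = 2 * x_data  # exact linear relation y = 2x

def loss(w, b):
    return np.mean((y_data - (w * x_data + b)) ** 2)

def grad_w(w, b):
    # Analytic partial: -(2/n) * sum of x_i * (y_i - y_hat_i)
    y_hat = w * x_data + b
    return -2 * np.mean(x_data * (y_data - y_hat))

w, b, h = 1.5, 0.0, 1e-6
numerical = (loss(w + h, b) - loss(w - h, b)) / (2 * h)  # central difference in w
print(grad_w(w, b), numerical)  # both approximately -4.6667
```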
7. Numerical Approximation of Derivatives
When an analytical derivative is unavailable or impractical, use finite differences:
- Forward: ( f'(x) \approx \frac{f(x + h) - f(x)}{h} ), with error ( O(h) ).
- Central: ( f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} ), more accurate, with error ( O(h^2) ).
Choose small ( h ), but watch for floating-point errors.
::: code-group

```python [Python]
import numpy as np

def f(x):
    return x**2 + np.sin(x)

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
analytical = 2 * x + np.cos(x)  # Exact derivative
numerical = numerical_derivative(f, x)
print(f"Analytical derivative at x={x}: {analytical}")
print(f"Numerical derivative at x={x}: {numerical}")

# ML example: gradient for a simple loss
def loss(w, x_data, y_data):
    return np.mean((y_data - w * x_data)**2)

x_data = np.array([1, 2, 3])
y_data = np.array([2, 4, 6])  # Linear: y = 2x
w = 1.5
grad_w = numerical_derivative(lambda w: loss(w, x_data, y_data), w)
print(f"Gradient w.r.t. w at {w}: {grad_w}")
```

```rust [Rust]
fn f(x: f64) -> f64 {
    x.powi(2) + x.sin()
}

fn numerical_derivative(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    let x = 1.0;
    let h = 1e-5;
    let analytical = 2.0 * x + x.cos();
    let numerical = numerical_derivative(f, x, h);

    println!("Analytical derivative at x={}: {}", x, analytical);
    println!("Numerical derivative at x={}: {}", x, numerical);

    // ML example: gradient for a simple loss
    fn loss(w: f64, x_data: &[f64], y_data: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (&xi, &yi) in x_data.iter().zip(y_data.iter()) {
            sum += (yi - w * xi).powi(2);
        }
        sum / x_data.len() as f64
    }

    let x_data = [1.0, 2.0, 3.0];
    let y_data = [2.0, 4.0, 6.0]; // Linear: y = 2x
    let w = 1.5;
    let grad_w = numerical_derivative(|w| loss(w, &x_data, &y_data), w, h);
    println!("Gradient w.r.t. w at {}: {}", w, grad_w);
}
```

:::
This code demonstrates approximating derivatives and applying them to a toy ML loss function.
8. Symbolic Differentiation in Code
For exact derivatives, use symbolic libraries: SymPy in Python; Rust has no comparably mature equivalent, so the example below simulates the result.
::: code-group
```python [Python]
from sympy import symbols, diff, sin

x = symbols('x')
f = x**2 + sin(x)
f_prime = diff(f, x)
print("Symbolic derivative:", f_prime)

# Evaluate at x = 1
value = f_prime.subs(x, 1)
print("At x=1:", value.evalf())
```

```rust [Rust]
fn symbolic_derivative() {
    // Note: Rust lacks built-in symbolic differentiation; simulate or use crates.
    // Hardcoding the derivative: d/dx (x^2 + sin(x)) = 2x + cos(x)
    println!("Symbolic derivative: 2x + cos(x)");

    let x = 1.0;
    let value = 2.0 * x + x.cos();
    println!("At x=1: {}", value);
}

fn main() {
    symbolic_derivative();
}
```

:::
In practice, for Rust, consider crates like nalgebra or implement manually for simple cases.
9. Derivatives in Optimization Algorithms
Gradient descent variants rely on derivatives:
- Vanilla GD: Step proportional to negative gradient.
- Stochastic GD (SGD): Noisy but efficient for large data.
- Adam: Adaptive learning rates using first and second moments.
Challenges
- Local minima: the gradient is zero, but the point is not the global optimum.
- Saddle points: the gradient is zero and the Hessian has mixed-sign eigenvalues.
- ML fixes: momentum, learning rate scheduling.
Example: Optimizing ( L(\theta) = \theta^2 ) converges to 0 via ( \theta \leftarrow \theta - \eta \cdot 2\theta ).
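That update rule runs in a few lines (a sketch; the learning rate and iteration count are arbitrary choices):

```python
theta, eta = 5.0, 0.1
for _ in range(100):
    grad = 2 * theta            # derivative of L(theta) = theta^2
    theta = theta - eta * grad  # each step shrinks theta by a factor of 0.8
print(theta)  # converges toward the minimum at 0
```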
10. Implicit Differentiation and Related Rates
For equations like ( x^2 + y^2 = 1 ) (a circle), differentiate implicitly: ( 2x + 2y \frac{dy}{dx} = 0 \Rightarrow \frac{dy}{dx} = -\frac{x}{y} ).
In ML: Used in constrained optimization or auto-differentiation frameworks like PyTorch.
Related rates: if the volume is ( V = \frac{4}{3}\pi r^3 ), then ( \frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt} ), an application of the chain rule analogous to its use in dynamic models.
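The implicit result ( \frac{dy}{dx} = -\frac{x}{y} ) can be checked on the upper half of the circle, where ( y = \sqrt{1 - x^2} ) is explicit (a minimal sketch):

```python
import math

def y(x):
    # Upper semicircle: solves x^2 + y^2 = 1 for y >= 0
    return math.sqrt(1 - x ** 2)

x, h = 0.6, 1e-6
numerical = (y(x + h) - y(x - h)) / (2 * h)  # central difference
implicit = -x / y(x)                          # dy/dx from implicit differentiation
print(numerical, implicit)  # both approximately -0.75
```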
11. Mean Value Theorem and Applications
MVT: if ( f ) is continuous on ( [a, b] ) and differentiable on ( (a, b) ), there exists ( c \in (a, b) ) where ( f'(c) = \frac{f(b) - f(a)}{b - a} ).
Implications: the MVT connects average and instantaneous rates of change; in ML, it is used to bound approximation errors.
Rolle's Theorem: the special case where ( f(a) = f(b) ), so ( f'(c) = 0 ) for some ( c ), guaranteeing a critical point.
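A concrete MVT instance: for ( f(x) = x^2 ) on ( [0, 2] ), the average rate is ( (4 - 0)/2 = 2 ), and ( f'(c) = 2c ) equals 2 at ( c = 1 ), strictly inside the interval. Sketched in code:

```python
# Mean Value Theorem for f(x) = x^2 on [0, 2]
f = lambda x: x ** 2
a, b = 0.0, 2.0
avg_rate = (f(b) - f(a)) / (b - a)  # average rate of change = 2.0
c = avg_rate / 2                    # solve f'(c) = 2c = avg_rate
print(avg_rate, c)  # 2.0 1.0
```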
12. Key ML Takeaways
- Gradients drive learning: derivatives quantify parameter sensitivity.
- Chain rule = backprop: Enables deep networks.
- Higher derivatives: Inform curvature-aware optimizers.
- Numerical vs. Analytical: Trade accuracy for computability in code.
- Multivariable: Handles real-world high-dim data.
Derivatives transform static functions into dynamic tools for change detection and prediction.
13. Summary
This lecture unpacked derivatives from intuitive slopes to rigorous limits, rules, and multivariable extensions. We connected them to ML optimization, illustrated them with examples, and implemented numerical and symbolic computations in Python and Rust. Understanding derivatives equips you to dissect how AI models adapt and improve.
Next up: integrals, which reverse derivatives to accumulate change.
Further Reading
- Khan Academy: Derivatives section.
- Bishop, Pattern Recognition and Machine Learning (Chapter 5: Neural Networks).
- 3Blue1Brown: Essence of Calculus series (visual intuitions).
- Rust crates: for advanced symbolic math, explore nalgebra or symbolic.