Derivatives: Measuring Change
Calculus empowers us to quantify change, and derivatives are the cornerstone for measuring instantaneous rates of change. In artificial intelligence and machine learning, derivatives appear as gradients that drive optimization algorithms such as gradient descent. They tell us how to adjust model parameters to minimize loss functions, enabling models to learn from data. This lecture covers the intuition behind derivatives, their formal definition, computational rules, and practical applications in ML, bridging mathematical rigor with code implementations in Python and Rust.
Whether you’re tuning neural networks or analyzing algorithmic efficiency, mastering derivatives unlocks deeper insights into how AI systems evolve and improve.
1. What is a Derivative? Intuition First
Imagine driving a car: your speedometer shows the instantaneous speed, not the average over the trip. A derivative captures this "instantaneous rate of change" for any function. For a position function ( s(t) ), the derivative ( s'(t) ) gives velocity at time ( t ).
Formally, the derivative of ( f(x) ) at ( x = a ) is the slope of the tangent line to the graph at that point. It’s the limit of the average rate of change over shrinking intervals.
ML Connection
- In gradient descent, the derivative of the loss function with respect to the weights tells us the direction to update parameters: ( w_{new} = w_{old} - \eta \frac{\partial L}{\partial w} ).
- Backpropagation computes derivatives layer by layer in neural networks.
- Derivatives help diagnose issues like vanishing or exploding gradients in deep learning.
::: info
Derivatives aren't just slopes; they're the DNA of optimization, revealing how small tweaks in inputs ripple through outputs.
:::
Everyday Example
- For ( f(x) = x^2 ), the derivative at ( x = 2 ) is 4, meaning the function increases by about 4 units for every small unit increase in ( x ) near 2.
- In economics (analogous to ML cost functions), marginal cost is the derivative of total cost.
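The first example can be checked numerically: as the step shrinks, the average rate of change of ( x^2 ) near ( x = 2 ) approaches 4. A minimal sketch in Python:

```python
# Average rates of change of f(x) = x^2 over [2, 2 + h] approach the derivative 4
def f(x):
    return x ** 2

x = 2.0
for h in [1.0, 0.1, 0.01, 0.001]:
    avg_rate = (f(x + h) - f(x)) / h  # slope of the secant line over [x, x + h]
    print(f"h = {h:<6} average rate = {avg_rate:.6f}")
```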
2. Formal Definition Using Limits
The derivative is defined via limits:
[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ]
This is the difference quotient approaching the instantaneous rate as ( h ) shrinks.
If the limit exists, ( f ) is differentiable at ( x ). Differentiability implies continuity, but not vice versa (e.g., ( |x| ) at 0 is continuous but not differentiable).
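The ( |x| ) example can be made concrete: the one-sided difference quotients at 0 approach different values, so the two-sided limit defining the derivative does not exist. A quick sketch:

```python
# One-sided difference quotients of f(x) = |x| at x = 0 disagree
h = 1e-6
right = (abs(0 + h) - abs(0)) / h     # slope from the right: +1
left = (abs(0 - h) - abs(0)) / (-h)   # slope from the left: -1
print(right, left)  # 1.0 -1.0
```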
Conditions for Differentiability
- The left-hand and right-hand limits of the difference quotient must agree.
- No sharp corners, cusps, or vertical tangents.
ML Insight
- Loss functions like mean squared error (MSE) are differentiable, allowing smooth optimization.
- Non-differentiable activations like ReLU use subgradients in practice, but derivatives guide most training.
3. Geometric and Physical Interpretations
Geometrically: the derivative is the tangent slope. For ( f(x) = \sin(x) ), at ( x = 0 ), ( f'(0) = 1 ), a gentle upward slope.
Physically: Acceleration is the derivative of velocity (second derivative of position). In robotics or reinforcement learning, this models dynamic systems.
Visualization (Conceptual)
- Plot ( f(x) = x^3 - x ): tangents at local max/min have zero slope (critical points).
- In ML, the gradient vector field over a loss surface shows paths of steepest descent—valleys lead to minima.
These graphs illustrate how derivatives pinpoint where functions flatten or accelerate.
4. Basic Differentiation Rules
No need to compute limits every time; standard rules simplify the work:
- Constant Rule: ( \frac{d}{dx} c = 0 ).
- Power Rule: ( \frac{d}{dx} x^n = n x^{n-1} ).
- Sum Rule: ( \frac{d}{dx} [f(x) + g(x)] = f'(x) + g'(x) ).
- Product Rule: ( \frac{d}{dx} [f(x) g(x)] = f'(x)g(x) + f(x)g'(x) ).
- Quotient Rule: ( \frac{d}{dx} \left[ \frac{f(x)}{g(x)} \right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2} ).
- Chain Rule: ( \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) ), crucial for composite functions like neural networks.
Examples
- ( f(x) = 3x^4 ): ( f'(x) = 12x^3 ).
- ( f(x) = e^x \sin(x) ): use the product rule → ( f'(x) = e^x \sin(x) + e^x \cos(x) ).
- Chain rule in action: for ( f(x) = (x^2 + 1)^3 ), let ( u = x^2 + 1 ), ( f = u^3 ), so ( f'(x) = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2 ).
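The chain-rule result above can be sanity-checked with a central difference (a sketch; the sample point and tolerance are arbitrary choices):

```python
def f(x):
    return (x ** 2 + 1) ** 3

def f_prime(x):
    # Chain rule: 3(x^2 + 1)^2 * 2x
    return 3 * (x ** 2 + 1) ** 2 * 2 * x

x, h = 1.5, 1e-6
numerical = (f(x + h) - f(x - h)) / (2 * h)  # central difference
print(f_prime(x), numerical)  # the two values agree closely
```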
ML Application
- The chain rule enables backpropagation: derivatives propagate backward through layers.
- In logistic regression, the derivative of the sigmoid is ( \sigma'(x) = \sigma(x)(1 - \sigma(x)) ).
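The sigmoid identity is easy to verify numerically (a minimal sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
identity = sigmoid(x) * (1 - sigmoid(x))                 # sigma'(x) = sigma(x)(1 - sigma(x))
print(numerical, identity)  # both match to many decimal places
```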
5. Higher-Order Derivatives
Derivatives of derivatives reveal concavity and acceleration.
- Second derivative: ( f''(x) = \frac{d}{dx} f'(x) ), measures curvature.
- ( f''(x) > 0 ): concave up (local min possible).
- ( f''(x) < 0 ): concave down (local max possible).
- Third and higher: Used in Taylor expansions for approximations.
ML Connection
- Hessian matrix (second partials) in Newton's method for faster convergence.
- In optimization, positive definite Hessian ensures a minimum.
Example: for ( f(x) = x^4 ), ( f'(x) = 4x^3 ) and ( f''(x) = 12x^2 ), which is always non-negative, so the function is convex with a global minimum at 0.
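Curvature information can be used directly: a one-dimensional Newton step divides the gradient by the second derivative. Sketched below on ( f(x) = x^4 ) (a toy illustration, not the full matrix-valued Newton's method):

```python
# Newton's method in 1D: x <- x - f'(x) / f''(x), applied to f(x) = x^4
def f_prime(x):
    return 4 * x ** 3

def f_double_prime(x):
    return 12 * x ** 2

x = 1.0
for _ in range(20):
    x -= f_prime(x) / f_double_prime(x)  # each step multiplies x by 2/3
print(x)  # approaches the minimizer at 0
```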
6. Partial Derivatives in Multivariable Calculus
ML deals with high-dimensional spaces, so we extend derivatives to functions of multiple variables.
For ( f(x, y) ), partial with respect to ( x ): Treat ( y ) as constant.
[ \frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h} ]
Gradient: ( \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) ), direction of steepest ascent.
ML Example
- In linear regression, the loss is ( L(w, b) = \frac{1}{n} \sum (y_i - (w x_i + b))^2 ).
- ( \frac{\partial L}{\partial w} = -\frac{2}{n} \sum x_i (y_i - \hat{y}_i) ).
- Used in batch gradient descent.
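The partial derivative above can be coded directly and compared against a finite difference on the toy data ( y = 2x ) (a sketch; the sample point ( w = 1.5 ) is an arbitrary choice):

```python
import numpy as np

x_data = np.array([1.0, 2.0, 3.0])
y_data = 2 * x_data  # exact linear relation y = 2x

def loss(w, b):
    return np.mean((y_data - (w * x_data + b)) ** 2)

def grad_w(w, b):
    # Analytic partial: -(2/n) * sum of x_i * (y_i - y_hat_i)
    y_hat = w * x_data + b
    return -2 * np.mean(x_data * (y_data - y_hat))

w, b, h = 1.5, 0.0, 1e-6
numerical = (loss(w + h, b) - loss(w - h, b)) / (2 * h)  # central difference in w
print(grad_w(w, b), numerical)  # both approximately -4.6667
```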
7. Numerical Approximation of Derivatives
When an analytical derivative is unavailable or impractical, use finite differences:
- Forward: ( f'(x) \approx \frac{f(x + h) - f(x)}{h} ), with error ( O(h) ).
- Central: ( f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} ), more accurate, with error ( O(h^2) ).
Choose small ( h ), but watch for floating-point errors.
::: code-group

```python [Python]
import numpy as np

def f(x):
    return x**2 + np.sin(x)

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
analytical = 2 * x + np.cos(x)  # Exact derivative
numerical = numerical_derivative(f, x)
print(f"Analytical derivative at x={x}: {analytical}")
print(f"Numerical derivative at x={x}: {numerical}")

# ML example: gradient for a simple loss
def loss(w, x_data, y_data):
    return np.mean((y_data - w * x_data)**2)

x_data = np.array([1, 2, 3])
y_data = np.array([2, 4, 6])  # Linear: y = 2x
w = 1.5
grad_w = numerical_derivative(lambda w: loss(w, x_data, y_data), w)
print(f"Gradient w.r.t. w at {w}: {grad_w}")
```

```rust [Rust]
fn f(x: f64) -> f64 {
    x.powi(2) + x.sin()
}

fn numerical_derivative(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    let x = 1.0;
    let h = 1e-5;
    let analytical = 2.0 * x + x.cos();
    let numerical = numerical_derivative(f, x, h);

    println!("Analytical derivative at x={}: {}", x, analytical);
    println!("Numerical derivative at x={}: {}", x, numerical);

    // ML example: gradient for a simple loss
    fn loss(w: f64, x_data: &[f64], y_data: &[f64]) -> f64 {
        let mut sum = 0.0;
        for (&xi, &yi) in x_data.iter().zip(y_data.iter()) {
            sum += (yi - w * xi).powi(2);
        }
        sum / x_data.len() as f64
    }

    let x_data = [1.0, 2.0, 3.0];
    let y_data = [2.0, 4.0, 6.0]; // Linear: y = 2x
    let w = 1.5;
    let grad_w = numerical_derivative(|w| loss(w, &x_data, &y_data), w, h);
    println!("Gradient w.r.t. w at {}: {}", w, grad_w);
}
```

:::
This code demonstrates approximating derivatives and applying them to a toy ML loss function.
8. Symbolic Differentiation in Code
For exact derivatives, use symbolic libraries: SymPy in Python; Rust has no comparably mature equivalent, so the example below simulates the result.
::: code-group
```python [Python]
from sympy import symbols, diff, sin

x = symbols('x')
f = x**2 + sin(x)
f_prime = diff(f, x)
print("Symbolic derivative:", f_prime)

# Evaluate at x = 1
value = f_prime.subs(x, 1)
print("At x=1:", value.evalf())
```

```rust [Rust]
fn symbolic_derivative() {
    // Note: Rust lacks built-in symbolic differentiation; simulate or use crates.
    // Hardcoding the derivative: d/dx (x^2 + sin(x)) = 2x + cos(x)
    println!("Symbolic derivative: 2x + cos(x)");

    let x = 1.0;
    let value = 2.0 * x + x.cos();
    println!("At x=1: {}", value);
}

fn main() {
    symbolic_derivative();
}
```

:::
In practice, for Rust, consider crates like nalgebra or implement manually for simple cases.
9. Derivatives in Optimization Algorithms
Gradient descent variants rely on derivatives:
- Vanilla GD: Step proportional to negative gradient.
- Stochastic GD (SGD): Noisy but efficient for large data.
- Adam: Adaptive learning rates using first and second moments.
Challenges
- Local minima: the gradient is zero, but the point is not the global optimum.
- Saddle points: the gradient is zero and the Hessian has mixed-sign eigenvalues.
- ML fixes: momentum, learning rate scheduling.
Example: Optimizing ( L(\theta) = \theta^2 ) converges to 0 via ( \theta \leftarrow \theta - \eta \cdot 2\theta ).
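That update rule runs in a few lines (a sketch; the learning rate and iteration count are arbitrary choices):

```python
theta, eta = 5.0, 0.1
for _ in range(100):
    grad = 2 * theta            # derivative of L(theta) = theta^2
    theta = theta - eta * grad  # each step shrinks theta by a factor of 0.8
print(theta)  # converges toward the minimum at 0
```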
10. Implicit Differentiation and Related Rates
For equations like ( x^2 + y^2 = 1 ) (a circle), differentiate implicitly: ( 2x + 2y \frac{dy}{dx} = 0 \Rightarrow \frac{dy}{dx} = -\frac{x}{y} ).
In ML: Used in constrained optimization or auto-differentiation frameworks like PyTorch.
Related rates: if the volume is ( V = \frac{4}{3}\pi r^3 ), then ( \frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt} ), an application of the chain rule analogous to its use in dynamic models.
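The implicit result ( \frac{dy}{dx} = -\frac{x}{y} ) can be checked on the upper half of the circle, where ( y = \sqrt{1 - x^2} ) is explicit (a minimal sketch):

```python
import math

def y(x):
    # Upper semicircle: solves x^2 + y^2 = 1 for y >= 0
    return math.sqrt(1 - x ** 2)

x, h = 0.6, 1e-6
numerical = (y(x + h) - y(x - h)) / (2 * h)  # central difference
implicit = -x / y(x)                          # dy/dx from implicit differentiation
print(numerical, implicit)  # both approximately -0.75
```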
11. Mean Value Theorem and Applications
MVT: if ( f ) is continuous on ( [a, b] ) and differentiable on ( (a, b) ), there exists ( c \in (a, b) ) where ( f'(c) = \frac{f(b) - f(a)}{b - a} ).
Implications: the MVT connects average and instantaneous rates of change; in ML, it is used to bound approximation errors.
Rolle's Theorem: the special case where ( f(a) = f(b) ), so ( f'(c) = 0 ) for some ( c ), guaranteeing a critical point.
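A concrete MVT instance: for ( f(x) = x^2 ) on ( [0, 2] ), the average rate is ( (4 - 0)/2 = 2 ), and ( f'(c) = 2c ) equals 2 at ( c = 1 ), strictly inside the interval. Sketched in code:

```python
# Mean Value Theorem for f(x) = x^2 on [0, 2]
f = lambda x: x ** 2
a, b = 0.0, 2.0
avg_rate = (f(b) - f(a)) / (b - a)  # average rate of change = 2.0
c = avg_rate / 2                    # solve f'(c) = 2c = avg_rate
print(avg_rate, c)  # 2.0 1.0
```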
12. Key ML Takeaways
- Gradients drive learning: derivatives quantify parameter sensitivity.
- Chain rule = backprop: Enables deep networks.
- Higher derivatives: Inform curvature-aware optimizers.
- Numerical vs. Analytical: Trade accuracy for computability in code.
- Multivariable: Handles real-world high-dim data.
Derivatives transform static functions into dynamic tools for change detection and prediction.
13. Summary
This lecture unpacked derivatives from intuitive slopes to rigorous limits, rules, and multivariable extensions. We connected them to ML optimization, illustrated them with examples, and implemented numerical and symbolic computations in Python and Rust. Understanding derivatives equips you to dissect how AI models adapt and improve.
Next up: integrals, which reverse derivatives to accumulate change.
Further Reading
- Khan Academy: Derivatives section.
- Bishop, Pattern Recognition and Machine Learning (Chapter 5: Neural Networks).
- 3Blue1Brown: Essence of Calculus series (visual intuitions).
- Rust crates: for advanced symbolic math, explore nalgebra or symbolic.