
Entropy, Cross-Entropy & KL Divergence

Entropy, cross-entropy, and Kullback-Leibler (KL) divergence are key concepts from information theory that quantify uncertainty, dissimilarity between distributions, and information loss. In machine learning (ML), they serve as loss functions for classification (cross-entropy), regularization in variational autoencoders (KL), and measures of model complexity (entropy). These tools enable efficient training, model comparison, and understanding of probabilistic predictions.

This ninth lecture in the “Probability Foundations for AI/ML” series builds on MLE and MAP, exploring entropy as a measure of uncertainty, cross-entropy as a measure of predictive accuracy, KL divergence as a measure of dissimilarity between distributions, their properties, and their ML applications. We’ll provide intuitive explanations, mathematical derivations, and implementations in Python and Rust, preparing you for Markov chains and Bayesian inference.


Entropy H(P) measures the average uncertainty or information in a distribution P—the “surprise” level or bits needed to encode outcomes.

For a discrete distribution P with probabilities p_i, entropy is high when P is near-uniform (maximum uncertainty) and low when it is concentrated on a few outcomes (predictable).

Formally, entropy is the negative expected log-probability: H(X) = -E[log P(X)].

  • Decision Trees: Splits are chosen to reduce entropy (information gain; see the sketch below).
  • Generative Models: Maximize entropy for diverse outputs.
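
To make the decision-tree use concrete, information gain is the parent node’s entropy minus the weighted entropy of its children. A minimal sketch, assuming NumPy and hypothetical class counts:

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (bits) of a label distribution given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical split: parent node holds 10 positive / 10 negative examples
parent = [10, 10]
left, right = [8, 2], [2, 8]  # children produced by a candidate split

n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted_children = (n_left / n) * entropy_from_counts(left) \
                  + (n_right / n) * entropy_from_counts(right)

info_gain = entropy_from_counts(parent) - weighted_children
print(info_gain)  # ~0.278 bits: how much this split reduces label entropy
```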

::: info Entropy gauges “chaos” in probabilities—like how unpredictable a lottery is. :::

  • Fair coin: H = -2 × 0.5 log2(0.5) = 1 bit (max uncertainty).
  • Biased coin (p=0.9): H≈0.47 bits (more predictable).

For discrete RV with P(X=x_i)=p_i:

H(X) = - sum p_i log p_i (log base 2 for bits, e for nats).

Continuous (differential entropy): h(X) = - ∫ f(x) log f(x) dx.
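
A quick numerical check of the discrete formula against the coin examples above; a minimal sketch, assuming NumPy:

```python
import numpy as np

def discrete_entropy(p, base=2.0):
    """H(X) = -sum p_i log p_i; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(discrete_entropy([0.5, 0.5]))        # 1.0 bit (fair coin)
print(discrete_entropy([0.9, 0.1]))        # ~0.469 bits (biased coin)
print(discrete_entropy([0.5, 0.5], np.e))  # ~0.693 nats (natural log)
```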

Properties:

  • Non-negative: H(X) ≥ 0.
  • H(X) = 0 iff X is deterministic (all probability mass on one outcome).
  • Maximized by the uniform distribution over k outcomes: H = log k.
  • Additive for independent variables: H(X,Y) = H(X) + H(Y) (both checked numerically below).
  • In ML, entropy regularizers exploit these properties to promote diversity, e.g., in GANs.
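
A sketch verifying the maximum-entropy and additivity properties numerically, assuming NumPy (the example distributions are arbitrary):

```python
import numpy as np

def H(p, base=2.0):
    """Discrete entropy in the given base, skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# Uniform over k = 4 outcomes attains the maximum log2(4) = 2 bits
print(H([0.25] * 4))            # 2.0
print(H([0.7, 0.1, 0.1, 0.1]))  # < 2.0 for any non-uniform distribution

# Additivity: for independent X and Y, H(X,Y) = H(X) + H(Y)
px = np.array([0.3, 0.7])
py = np.array([0.2, 0.5, 0.3])
joint = np.outer(px, py).ravel()  # joint distribution under independence
print(H(joint), H(px) + H(py))    # the two values agree
```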

3. Cross-Entropy: Measuring Predictive Accuracy

Cross-entropy H(P,Q) = - sum p_i log q_i.

Measures bits needed to encode P using Q’s code—inefficiency if Q ≠ P.

For continuous: h(P,Q) = - ∫ f_P(x) log f_Q(x) dx.

Properties:

  • H(P,Q) ≥ H(P), equality iff P=Q.
  • Not symmetric.
  • Loss function: Min H(true, pred) in classification.

Example: True P = [1, 0], predicted Q = [0.9, 0.1]: H(P,Q) = -1·log(0.9) - 0·log(0.1) ≈ 0.105 nats.
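
The same numbers can be reproduced directly; a minimal sketch, assuming NumPy and natural logs (nats):

```python
import numpy as np

p = np.array([1.0, 0.0])  # true distribution (one-hot label)
q = np.array([0.9, 0.1])  # predicted distribution

# H(P,Q) = -sum p_i log q_i; terms with p_i = 0 contribute nothing
mask = p > 0
print(-np.sum(p[mask] * np.log(q[mask])))  # ~0.105 nats
```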


KL(P||Q) = sum p_i log (p_i / q_i) = H(P,Q) - H(P).

“Relative entropy”—extra bits to encode P with Q vs optimal.

Continuous: KL(P||Q) = ∫ f_P log (f_P / f_Q) dx.

Properties:

  • KL≥0, =0 iff P=Q.
  • Asymmetric: KL(P||Q) ≠ KL(Q||P).
  • Not a true distance metric (fails the triangle inequality).

Non-negativity follows from Jensen’s inequality: since -log is convex, KL(P||Q) = E_P[-log(q/p)] ≥ -log E_P[q/p] = -log 1 = 0.

  • VAEs: KL regularizes latent dist to prior.
  • Policy gradients: KL constrains updates.

Example: P=Bern(0.5), Q=Bern(0.6), KL(P||Q)≈0.029 bits.
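
Reproducing this example, and illustrating the asymmetry, with a small sketch assuming NumPy (strictly positive probabilities, so no smoothing is needed):

```python
import numpy as np

def kl_bits(p, q):
    """KL(P||Q) = sum p_i log2(p_i / q_i) for strictly positive p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

P = [0.5, 0.5]  # Bernoulli(0.5)
Q = [0.6, 0.4]  # Bernoulli(0.6)
print(kl_bits(P, Q))  # ~0.029 bits
print(kl_bits(Q, P))  # close here, but KL is not symmetric in general
```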


  • Cross-entropy = entropy + KL: H(P,Q) = H(P) + KL(P||Q).
  • Mutual information: I(X;Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) - H(X|Y) (see the sketch after this list).
  • Chain rule: H(X,Y)=H(X)+H(Y|X).
  • Information bottleneck: Min KL for compression.
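
A sketch of the mutual-information identities, assuming NumPy and a hypothetical 2×2 joint distribution:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over two binary variables
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
px = joint.sum(axis=1)  # marginal P(X)
py = joint.sum(axis=0)  # marginal P(Y)

# I(X;Y) as KL(P(X,Y) || P(X)P(Y)), in bits
I_kl = np.sum(joint * np.log2(joint / np.outer(px, py)))

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
H = lambda p: -np.sum(p * np.log2(p))
H_x_given_y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(len(py)))
print(I_kl, H(px) - H_x_given_y)  # the two expressions agree
```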

Entropy for Bernoulli(p): H(p) = -p log p - (1-p) log(1-p).

It is maximized at p = 0.5, where dH/dp = log((1-p)/p) = 0.

KL divergence between Normals: KL(N(μ1, σ1²) || N(μ2, σ2²)) = log(σ2/σ1) + (σ1² + (μ1 - μ2)²) / (2σ2²) - 1/2.

Closed form useful in VAEs.
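
A quick check of the closed form against a Monte Carlo estimate of E_P[log(f_P / f_Q)]; a sketch assuming NumPy, with arbitrary example parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1 = 0.0, 1.0  # P = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0  # Q = N(mu2, s2^2)

# Closed-form KL(P||Q) in nats
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo estimate: average log-density ratio under samples from P
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
log_q = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
print(kl_closed, np.mean(log_p - log_q))  # the estimates agree closely
```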

  • Mode collapse in GANs: High entropy promotes diversity.

  1. Loss Functions: Cross-entropy for classification.
  2. VAEs: KL(q(z|x) || p(z)) regularizes the latent distribution toward the prior.
  3. RL: Entropy bonuses for exploration (see the sketch below).
  4. Clustering: Entropy for purity measures.
  5. Feature Selection: Mutual info for relevance.
  • Log(0) issues: Add smoothing.
  • High-dim entropy estimation hard.
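
As a tiny illustration of the RL item above, an entropy bonus adds β·H(π(·|s)) to the objective so near-deterministic policies are penalized; a sketch with made-up action probabilities, assuming NumPy:

```python
import numpy as np

def policy_entropy(probs):
    """Entropy (nats) of a policy's action distribution at one state."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

beta = 0.01                     # entropy-bonus coefficient (hypothetical value)
exploratory = [0.4, 0.3, 0.3]   # spread-out action probabilities
greedy = [0.98, 0.01, 0.01]     # nearly deterministic policy

print(beta * policy_entropy(exploratory))  # larger bonus -> more exploration
print(beta * policy_entropy(greedy))       # smaller bonus
```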

Compute entropy, cross-entropy, KL.

::: code-group

```python [Python]
import numpy as np
from scipy.stats import entropy

# Discrete entropy of a fair coin
p = np.array([0.5, 0.5])
print("Entropy coin:", entropy(p, base=2))  # 1 bit

# Cross-entropy H(P, Q), with a small epsilon to guard against log(0)
q = np.array([0.9, 0.1])
cross_h = -np.sum(p * np.log(q + 1e-10))
print("Cross-entropy:", cross_h)

# KL divergence (scipy's entropy with two arguments), in nats
kl = entropy(p, q)
print("KL(P||Q):", kl)

# ML: cross-entropy loss for a one-hot label
y_true = np.array([1, 0])
y_pred = np.array([0.8, 0.2])
loss = -np.sum(y_true * np.log(y_pred))
print("CE loss:", loss)

# Differential entropy of a Gaussian (closed form)
mu, sigma = 0, 1
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print("Differential entropy normal:", h_diff)
```

```rust [Rust]
fn entropy(p: &[f64], base: f64) -> f64 {
    -p.iter()
        .map(|&pi| if pi > 0.0 { pi * pi.log(base) } else { 0.0 })
        .sum::<f64>()
}

fn cross_entropy(p: &[f64], q: &[f64]) -> f64 {
    -p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 { pi * qi.ln() } else { 0.0 })
        .sum::<f64>()
}

fn kl_div(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 && qi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum::<f64>()
}

fn main() {
    let p = [0.5, 0.5];
    println!("Entropy coin (base 2): {}", entropy(&p, 2.0)); // 1.0

    let q = [0.9, 0.1];
    println!("Cross-entropy: {}", cross_entropy(&p, &q));
    println!("KL(P||Q): {}", kl_div(&p, &q));

    // ML: cross-entropy loss for a one-hot label
    let y_true = [1.0, 0.0];
    let y_pred = [0.8_f64, 0.2];
    let loss = -y_true
        .iter()
        .zip(y_pred.iter())
        .map(|(&yt, &yp)| if yt > 0.0 { yt * yp.ln() } else { 0.0 })
        .sum::<f64>();
    println!("CE loss: {}", loss);

    // Differential entropy of a standard normal (closed form)
    let sigma: f64 = 1.0;
    let h_diff = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * sigma.powi(2)).ln();
    println!("Differential entropy normal: {}", h_diff);
}
```

:::

This computes entropy, cross-entropy, and KL divergence for discrete distributions, plus the closed-form differential entropy of a Gaussian.


The same quantities can also be derived symbolically as exact expressions.

::: code-group

```python [Python]
from sympy import symbols, log, diff, solve

# Bernoulli entropy H(p) and its maximizer
p = symbols('p', positive=True)
H = -p * log(p) - (1 - p) * log(1 - p)
dH_dp = diff(H, p)
crit_p = solve(dH_dp, p)[0]
print("Max entropy p:", crit_p)  # 1/2

# Exact KL divergence between Bernoulli(p) and Bernoulli(q)
q = symbols('q', positive=True)
KL = p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))
print("KL(P||Q):", KL)
```

```rust [Rust]
fn main() {
    // No symbolic algebra here; we simply report the closed-form results.
    println!("Max entropy p: 0.5");
    println!("KL(P||Q): p ln(p/q) + (1-p) ln((1-p)/(1-q))");
}
```

:::


  • Zero Probs: Log undefined; add smoothing.
  • Differential Entropy: Can be negative; not directly comparable to discrete entropy.
  • High-Dim: Entropy estimation curse.

  • Entropy measures uncertainty: Model complexity.
  • Cross-entropy loss: Predictive accuracy.
  • KL quantifies divergence: Regularization.
  • Properties guide usage: Non-negativity, asymmetry.
  • Code computes: Practical information measures.

Information theory enhances ML.


This lecture explored entropy, cross-entropy, and KL divergence, along with their properties, derivations, and ML applications. Worked examples and Python/Rust code bridge theory to practice, preparing for Markov chains and Bayesian inference.



  • Cover, Thomas, Elements of Information Theory.
  • Bishop, Pattern Recognition (Ch. 1.6).
  • Murphy, Probabilistic ML (Ch. 3).
  • Rust: Implement custom entropy functions.