Entropy, Cross-Entropy & KL Divergence
Entropy, cross-entropy, and Kullback-Leibler (KL) divergence are key concepts from information theory that quantify uncertainty, dissimilarity between distributions, and information loss. In machine learning (ML), they serve as loss functions for classification (cross-entropy), regularization in variational autoencoders (KL), and measures of model complexity (entropy). These tools enable efficient training, model comparison, and understanding of probabilistic predictions.
This ninth lecture in the “Probability Foundations for AI/ML” series builds on MLE and MAP, exploring entropy as uncertainty measure, cross-entropy for predictive accuracy, KL for distribution divergence, their properties, and ML applications. We’ll provide intuitive explanations, mathematical derivations, and implementations in Python and Rust, preparing you for Markov chains and Bayesian inference.
1. Intuition Behind Entropy
Entropy H(P) measures the average uncertainty or information in a distribution P—the “surprise” level or bits needed to encode outcomes.
For a discrete distribution P with probabilities p_i, entropy is high when P is near uniform (maximum uncertainty) and low when P is concentrated on a few outcomes (predictable).
Formally, entropy is the negative expected log-probability: H(P) = -E[log P(X)].
ML Connection
- Decision Trees: Split to reduce entropy (information gain).
- Generative Models: Maximize entropy for diverse outputs.
::: info Entropy gauges “chaos” in probabilities—like how unpredictable a lottery is. :::
Example
- Fair coin: H = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit (max uncertainty).
- Biased coin (p=0.9): H≈0.47 bits (more predictable).
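These two values can be checked numerically. A minimal sketch (the helper `bernoulli_entropy` is introduced here for illustration, not part of any library):

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy of a coin with heads probability p, in bits."""
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]  # convention: 0 * log 0 = 0
    return -np.sum(probs * np.log2(probs))

print(bernoulli_entropy(0.5))  # 1.0 bit (fair coin, max uncertainty)
print(bernoulli_entropy(0.9))  # ≈ 0.469 bits (biased coin, more predictable)
```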
2. Formal Definition of Shannon Entropy
For a discrete RV X with P(X=x_i)=p_i:
H(X) = - sum p_i log p_i (log base 2 for bits, e for nats).
Continuous (differential entropy): h(X) = - ∫ f(x) log f(x) dx.
Properties:
- Non-negative.
- H=0 if deterministic.
- Max for uniform over k outcomes: log k.
- Additivity for indep: H(X,Y)=H(X)+H(Y).
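Two of these properties are easy to verify numerically. A quick NumPy sketch (the helper `H` is defined only for this check): the uniform distribution over k outcomes attains log k, and the entropy of independent variables adds.

```python
import numpy as np

def H(p):
    """Discrete entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Maximum for uniform over k outcomes: log2(k)
k = 4
uniform = np.ones(k) / k
print(H(uniform), np.log2(k))  # both 2.0

# Additivity: joint of independent X, Y is the outer product of marginals
px = np.array([0.3, 0.7])
py = np.array([0.5, 0.25, 0.25])
joint = np.outer(px, py).ravel()
print(H(joint), H(px) + H(py))  # equal
```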
ML Insight
- Entropy regularizers promote diversity in GANs.
3. Cross-Entropy: Measuring Predictive Accuracy
Cross-entropy H(P,Q) = - sum p_i log q_i.
Measures the average bits needed to encode samples from P using a code optimized for Q; the encoding is inefficient when Q ≠ P.
For continuous: h(P,Q) = - ∫ f_P(x) log f_Q(x) dx.
Properties:
- H(P,Q) ≥ H(P), equality iff P=Q.
- Not symmetric.
ML Application
- Loss function: minimize H(true, pred) in classification.
Example: True P=[1,0], pred Q=[0.9,0.1], H(P,Q)= -1log(0.9) -0log(0.1)≈0.105 nats.
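The example above can be reproduced in a few lines (a small sketch; the mask simply drops terms where p_i = 0, using the 0 log 0 = 0 convention):

```python
import numpy as np

p = np.array([1.0, 0.0])   # true one-hot distribution
q = np.array([0.9, 0.1])   # model prediction

# H(P, Q) = -sum p_i log q_i; terms with p_i = 0 contribute nothing
mask = p > 0
cross_h = -np.sum(p[mask] * np.log(q[mask]))
print(cross_h)  # ≈ 0.105 nats, i.e. -log(0.9)
```

Since P is deterministic here, H(P) = 0, so all of the 0.105 nats is encoding inefficiency.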
4. Kullback-Leibler (KL) Divergence
KL(P||Q) = sum p_i log (p_i / q_i) = H(P,Q) - H(P).
“Relative entropy”—extra bits to encode P with Q vs optimal.
Continuous: KL(P||Q) = ∫ f_P log (f_P / f_Q) dx.
Properties:
- KL≥0, =0 iff P=Q.
- Asymmetric: KL(P||Q) ≠ KL(Q||P).
- Not a metric (fails symmetry and the triangle inequality).
Proof Sketch
Non-negativity follows from Jensen’s inequality: since -log is convex, KL(P||Q) = -sum p_i log(q_i/p_i) ≥ -log sum p_i (q_i/p_i) = -log 1 = 0.
ML Connection
- VAEs: KL regularizes latent dist to prior.
- Policy gradients: KL constrains updates.
Example: P=Bern(0.5), Q=Bern(0.6), KL(P||Q)≈0.029 bits.
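A small sketch to verify this value and illustrate the asymmetry (the helper `kl_bits` is defined here for the illustration):

```python
import numpy as np

def kl_bits(p, q):
    """KL(Bern(p) || Bern(q)) in bits."""
    terms = [(p, q), (1 - p, 1 - q)]
    return sum(pi * np.log2(pi / qi) for pi, qi in terms if pi > 0)

print(kl_bits(0.5, 0.6))  # ≈ 0.029 bits, matching the example

# Asymmetry: swapping the arguments changes the value
print(kl_bits(0.5, 0.9))  # ≈ 0.737 bits
print(kl_bits(0.9, 0.5))  # ≈ 0.531 bits
```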
5. Properties and Relationships
- Cross-entropy = entropy + KL: H(P,Q) = H(P) + KL(P||Q).
- Mutual info I(X;Y) = KL(P(X,Y)||P(X)P(Y)) = H(X) - H(X|Y).
- Chain rule: H(X,Y)=H(X)+H(Y|X).
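The first decomposition is easy to confirm numerically for any pair of distributions (a minimal check with arbitrary example distributions):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

H_p = -np.sum(p * np.log(p))        # entropy H(P)
H_pq = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q))      # KL(P || Q)

# Cross-entropy decomposes as entropy plus KL divergence
print(np.isclose(H_pq, H_p + kl))  # True
```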
ML Insight
- Information bottleneck: minimize mutual information with the input (a KL quantity) for compression while preserving task-relevant information.
6. Derivations and Calculations
Entropy for Bernoulli(p): H(p) = -p log p - (1-p) log(1-p).
Max at p=0.5.
KL for Normals: KL(N(μ1,σ1²) || N(μ2,σ2²)) = log(σ2/σ1) + (σ1² + (μ1-μ2)²)/(2σ2²) - 1/2.
Closed form useful in VAEs.
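The closed form can be sanity-checked against a Monte Carlo estimate of E_{x~P}[log f_P(x) - log f_Q(x)] (a sketch with arbitrary example parameters and sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed form (nats)
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo: average log-density ratio under samples from P
x = rng.normal(mu1, s1, 200_000)
log_p = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
log_q = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)  # ≈ 0.443 for both
```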
ML Application
- Mode collapse in GANs: entropy-based terms encourage output diversity.
7. Applications in Machine Learning
- Loss Functions: Cross-entropy for classification.
- VAEs: KL(q(z|x) || p(z)) regularizes the latent distribution toward the prior.
- RL: Entropy bonuses for exploration.
- Clustering: Entropy for purity measures.
- Feature Selection: Mutual info for relevance.
Challenges
- Log(0) issues: Add smoothing.
- High-dim entropy estimation hard.
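The log(0) issue shows up whenever a model assigns probability exactly 0 or 1. A common fix is clipping the predictions before taking logs (a minimal sketch; the epsilon value is an arbitrary choice):

```python
import numpy as np

y_true = np.array([1.0, 0.0])
y_pred = np.array([1.0, 0.0])  # overconfident: 0 * log(0) yields nan

eps = 1e-12
y_safe = np.clip(y_pred, eps, 1 - eps)  # smoothing: keep probs in (0, 1)
loss = -np.sum(y_true * np.log(y_safe))
print(loss)  # tiny positive value instead of nan
```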
8. Numerical Computations
Compute entropy, cross-entropy, KL.
::: code-group

```python [Python]
import numpy as np
from scipy.stats import entropy

# Discrete entropy
p = np.array([0.5, 0.5])
print("Entropy coin:", entropy(p, base=2))  # 1 bit

# Cross-entropy, with smoothing to avoid log(0)
q = np.array([0.9, 0.1])
cross_h = -np.sum(p * np.log(q + 1e-10))
print("Cross-entropy:", cross_h)

# KL divergence (nats); entropy(p, q) computes KL(P||Q)
kl = entropy(p, q, base=np.e)
print("KL(P||Q):", kl)

# ML: cross-entropy loss for a one-hot label
y_true = np.array([1, 0])
y_pred = np.array([0.8, 0.2])
loss = -np.sum(y_true * np.log(y_pred))
print("CE loss:", loss)

# Continuous entropy (Gaussian, closed form)
mu, sigma = 0, 1
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print("Differential entropy normal:", h_diff)
```

```rust [Rust]
fn entropy(p: &[f64], base: f64) -> f64 {
    -p.iter()
        .map(|&pi| if pi > 0.0 { pi * pi.log(base) } else { 0.0 })
        .sum::<f64>()
}

fn cross_entropy(p: &[f64], q: &[f64]) -> f64 {
    -p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 && qi > 0.0 { pi * qi.ln() } else { 0.0 })
        .sum::<f64>()
}

fn kl_div(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 && qi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum::<f64>()
}

fn main() {
    let p = [0.5, 0.5];
    println!("Entropy coin (base 2): {}", entropy(&p, 2.0)); // 1.0

    let q = [0.9, 0.1];
    println!("Cross-entropy: {}", cross_entropy(&p, &q));
    println!("KL(P||Q): {}", kl_div(&p, &q));

    // ML: CE loss for a one-hot label
    let y_true = [1.0, 0.0];
    let y_pred = [0.8, 0.2];
    let loss: f64 = -y_true
        .iter()
        .zip(y_pred.iter())
        .map(|(&yt, &yp)| yt * yp.ln())
        .sum::<f64>();
    println!("CE loss: {}", loss);

    // Differential entropy of a standard normal (closed form)
    let sigma: f64 = 1.0;
    let h_diff = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * sigma.powi(2)).ln();
    println!("Differential entropy normal: {}", h_diff);
}
```

:::

The code computes entropy, cross-entropy, and KL for discrete distributions, plus the closed-form differential entropy of a Gaussian.
9. Symbolic Computations with SymPy
Exact expressions.
::: code-group

```python [Python]
from sympy import symbols, log, diff, solve

# Bernoulli entropy and its maximizer
p = symbols('p', positive=True)
H = -p * log(p) - (1 - p) * log(1 - p)
dH_dp = diff(H, p)
crit_p = solve(dH_dp, p)[0]
print("Max entropy p:", crit_p)  # 1/2

# KL between Bernoulli(p) and Bernoulli(q)
q = symbols('q', positive=True)
KL = p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))
print("KL(P||Q):", KL)
```

```rust [Rust]
// Rust has no standard CAS; the symbolic results are stated directly.
fn main() {
    println!("Max entropy p: 0.5");
    println!("KL(P||Q): p ln(p/q) + (1-p) ln((1-p)/(1-q))");
}
```

:::
10. Challenges in ML
- Zero Probs: Log undefined; add smoothing.
- Differential Entropy Negative: Not comparable to discrete.
- High-Dim: Entropy estimation curse.
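The second point is worth seeing concretely: differential entropy can go negative for sharply concentrated densities, so its absolute value has no "bits of uncertainty" reading. A small sketch using the Gaussian closed form:

```python
import numpy as np

def h_normal(sigma):
    """Differential entropy of N(0, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_normal(1.0))  # ≈ 1.419 nats
print(h_normal(0.1))  # ≈ -0.884 nats: negative, unlike discrete entropy
```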
11. Key ML Takeaways
- Entropy measures uncertainty: Model complexity.
- Cross-entropy loss: Predictive accuracy.
- KL quantifies divergence: Regularization.
- Properties guide usage: Non-negativity, asymmetry.
- Code computes: Practical information measures.
Information theory enhances ML.
12. Summary
Explored entropy, cross-entropy, KL divergence, properties, derivations, with ML applications. Examples and Python/Rust code bridge theory to practice. Prepares for Markov chains and Bayesian inference.
Further Reading
- Cover & Thomas, Elements of Information Theory.
- Bishop, Pattern Recognition (Ch. 1.6).
- Murphy, Probabilistic ML (Ch. 3).
- Rust: Implement custom entropy functions.