Entropy, Cross-Entropy & KL Divergence
Entropy, cross-entropy, and Kullback-Leibler (KL) divergence are key concepts from information theory that quantify uncertainty, dissimilarity between distributions, and information loss. In machine learning (ML), they serve as loss functions for classification (cross-entropy), regularization in variational autoencoders (KL), and measures of model complexity (entropy). These tools enable efficient training, model comparison, and understanding of probabilistic predictions.
This ninth lecture in the "Probability Foundations for AI/ML" series builds on MLE and MAP, exploring entropy as a measure of uncertainty, cross-entropy as a measure of predictive accuracy, KL divergence as a measure of how distributions differ, their properties, and their ML applications. We'll provide intuitive explanations, mathematical derivations, and implementations in Python and Rust, preparing you for Markov chains and Bayesian inference.
1. Intuition Behind Entropy
Entropy H(P) measures the average uncertainty or information content of a distribution P: the expected "surprise", or the average number of bits needed to encode outcomes drawn from it.
For a discrete distribution with probabilities p_i, entropy is high when the distribution is near uniform (maximum uncertainty) and low when it is concentrated on a few outcomes (predictable).
Formally, entropy is the negative expected log-probability: H(P) = -E[log P(X)].
ML Connection
- Decision Trees: Split to reduce entropy (information gain).
- Generative Models: Maximize entropy for diverse outputs.
INFO
Entropy gauges "chaos" in probabilities—like how unpredictable a lottery is.
Example
- Fair coin: H = -2 × 0.5 log2(0.5) = 1 bit (maximum uncertainty).
- Biased coin (p = 0.9): H = -0.9 log2(0.9) - 0.1 log2(0.1) ≈ 0.47 bits (more predictable). Both values are verified in the snippet below.
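A quick numerical check of these two values; a minimal sketch assuming NumPy is available:
import numpy as np

def h_bits(p):
    # Shannon entropy in bits of a discrete distribution, ignoring zero entries
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(h_bits([0.5, 0.5]))  # 1.0 bit (fair coin)
print(h_bits([0.9, 0.1]))  # ~0.469 bits (biased coin)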
2. Formal Definition of Shannon Entropy
For discrete RV with P(X=x_i)=p_i:
H(X) = - sum p_i log p_i (log base 2 for bits, e for nats).
Continuous (differential entropy): h(X) = - ∫ f(x) log f(x) dx.
Properties:
- Non-negative for discrete distributions (differential entropy can be negative).
- H = 0 if X is deterministic.
- Maximum for a uniform distribution over k outcomes: log k.
- Additivity for independent X, Y: H(X,Y) = H(X) + H(Y). Both of the last two properties are checked numerically below.
ML Insight
- Entropy regularizers promote diversity in GANs.
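A small sketch, assuming NumPy, that checks the uniform-maximum and additivity properties on concrete distributions:
import numpy as np

def h_bits(p):
    # Shannon entropy in bits
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Maximum for a uniform distribution over k outcomes is log2(k)
k = 4
print(h_bits(np.ones(k) / k), np.log2(k))  # both 2.0 bits

# Additivity: for independent X and Y the joint entropy is the sum
px = np.array([0.7, 0.3])
py = np.array([0.2, 0.5, 0.3])
joint = np.outer(px, py)  # joint distribution of independent X, Y
print(h_bits(joint), h_bits(px) + h_bits(py))  # equal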
3. Cross-Entropy: Measuring Predictive Accuracy
Cross-entropy H(P,Q) = - sum p_i log q_i.
It measures the average number of bits needed to encode samples from P using a code optimized for Q; the encoding is inefficient whenever Q ≠ P.
For continuous: h(P,Q) = - ∫ f_P(x) log f_Q(x) dx.
Properties:
- H(P,Q) ≥ H(P), equality iff P=Q.
- Not symmetric.
ML Application
- Loss function: minimize H(true labels, predicted probabilities) in classification.
Example: true P = [1, 0], predicted Q = [0.9, 0.1]: H(P,Q) = -1·log(0.9) - 0·log(0.1) ≈ 0.105 nats (checked below).
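A one-line check of this example using the natural log (so the result is in nats); a minimal sketch assuming NumPy:
import numpy as np

p = np.array([1.0, 0.0])
q = np.array([0.9, 0.1])
# Terms with p_i = 0 contribute nothing, so masking them sidesteps log(0)
mask = p > 0
print(float(-np.sum(p[mask] * np.log(q[mask]))))  # ~0.105 nats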
4. Kullback-Leibler (KL) Divergence
KL(P||Q) = sum p_i log (p_i / q_i) = H(P,Q) - H(P).
"Relative entropy"—extra bits to encode P with Q vs optimal.
Continuous: KL(P||Q) = ∫ f_P log (f_P / f_Q) dx.
Properties:
- KL≥0, =0 iff P=Q.
- Asymmetric: KL(P||Q) ≠ KL(Q||P).
- Not a true distance: it violates symmetry and the triangle inequality.
Proof Sketch
Non-negativity follows from Jensen's inequality: since -log is convex, KL(P||Q) = -sum p_i log(q_i / p_i) ≥ -log sum p_i (q_i / p_i) = -log 1 = 0.
ML Connection
- VAEs: KL regularizes latent dist to prior.
- Policy gradients: KL constrains updates.
Example: P = Bernoulli(0.5), Q = Bernoulli(0.6): KL(P||Q) = 0.5 log2(0.5/0.6) + 0.5 log2(0.5/0.4) ≈ 0.029 bits (verified below).
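The same number computed directly; a minimal check assuming NumPy:
import numpy as np

def kl_bits(p, q):
    # KL(P||Q) in bits for discrete distributions on the same support
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

print(kl_bits([0.5, 0.5], [0.6, 0.4]))  # ~0.029 bits
print(kl_bits([0.6, 0.4], [0.5, 0.5]))  # reverse direction differs: KL is asymmetric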
5. Properties and Relationships
- Cross-entropy = entropy + KL.
- Mutual information I(X;Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) - H(X|Y); see the joint-table sketch below.
- Chain rule: H(X,Y)=H(X)+H(Y|X).
ML Insight
- Information bottleneck: Min KL for compression.
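A small sketch, assuming NumPy, computing mutual information for a toy 2x2 joint table in two equivalent ways (the entropy form uses H(X|Y) = H(X,Y) - H(Y) from the chain rule):
import numpy as np

def h_bits(p):
    # Shannon entropy in bits
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint distribution P(X, Y); rows index X, columns index Y
pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

# I(X;Y) as KL between the joint and the product of the marginals
mi_kl = float(np.sum(pxy * np.log2(pxy / np.outer(px, py))))
# I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
mi_h = h_bits(px) + h_bits(py) - h_bits(pxy)
print(mi_kl, mi_h)  # same value, ~0.125 bits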
6. Derivations and Calculations
Entropy for Bernoulli(p): H(p) = -p log p - (1-p) log(1-p).
Max at p=0.5.
KL for Normals: KL(N(μ1, σ1^2) || N(μ2, σ2^2)) = log(σ2/σ1) + (σ1^2 + (μ1 - μ2)^2)/(2σ2^2) - 1/2.
This closed form is used directly in VAE objectives; a numerical check follows the list below.
ML Application
- Entropy terms help counter mode collapse in GANs by encouraging diverse outputs.
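A short sketch, assuming NumPy and SciPy are available, comparing the closed-form Gaussian KL against numerical integration of the definition:
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed form: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Numerical integration of f_P(x) * log(f_P(x) / f_Q(x))
integrand = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))
kl_numeric, _ = integrate.quad(integrand, -20, 20)

print(kl_closed, kl_numeric)  # both ~0.443 nats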
7. Applications in Machine Learning
- Loss Functions: Cross-entropy for classification.
- VAEs: KL(q(z|x) || p(z)) regularizes the approximate latent posterior toward the prior.
- RL: Entropy bonuses for exploration.
- Clustering: Entropy for purity measures.
- Feature Selection: Mutual info for relevance.
Challenges
- log(0) issues: clip or smooth probabilities before taking logs (see the sketch below).
- Entropy estimation is hard in high dimensions.
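A minimal sketch, assuming NumPy, of the clipping trick for a numerically safe cross-entropy loss; the epsilon value is an illustrative choice:
import numpy as np

def safe_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 so log never sees an exact zero
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.sum(np.asarray(y_true) * np.log(y_pred)))

print(safe_cross_entropy([1, 0], [1.0, 0.0]))  # finite instead of inf/nan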
8. Numerical Computations
Python: compute entropy, cross-entropy, and KL divergence.
import numpy as np
from scipy.stats import entropy

# Discrete entropy
p = np.array([0.5, 0.5])
print("Entropy coin:", entropy(p, base=2))  # 1 bit

# Cross-entropy (a small epsilon guards against log(0))
q = np.array([0.9, 0.1])
cross_h = -np.sum(p * np.log(q + 1e-10))
print("Cross-entropy:", cross_h)

# KL divergence (scipy's entropy(p, q) returns KL(P||Q))
kl = entropy(p, q, base=np.e)
print("KL(P||Q):", kl)

# ML: cross-entropy loss for a one-hot label
y_true = np.array([1, 0])
y_pred = np.array([0.8, 0.2])
loss = -np.sum(y_true * np.log(y_pred))
print("CE loss:", loss)

# Continuous entropy (Gaussian): h = 0.5 * ln(2*pi*e*sigma^2)
mu, sigma = 0, 1
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print("Differential entropy normal:", h_diff)
The same computations in Rust:
fn entropy(p: &[f64], base: f64) -> f64 {
    // Shannon entropy; terms with p_i = 0 contribute nothing
    -p.iter().map(|&pi| if pi > 0.0 { pi * pi.log(base) } else { 0.0 }).sum::<f64>()
}

fn cross_entropy(p: &[f64], q: &[f64]) -> f64 {
    // Cross-entropy in nats
    -p.iter().zip(q.iter()).map(|(&pi, &qi)| if pi > 0.0 { pi * qi.ln() } else { 0.0 }).sum::<f64>()
}

fn kl_div(p: &[f64], q: &[f64]) -> f64 {
    // KL(P||Q) in nats
    p.iter().zip(q.iter()).map(|(&pi, &qi)| if pi > 0.0 && qi > 0.0 { pi * (pi / qi).ln() } else { 0.0 }).sum::<f64>()
}

fn main() {
    let p = [0.5, 0.5];
    println!("Entropy coin (base 2): {}", entropy(&p, 2.0)); // 1.0
    let q = [0.9, 0.1];
    println!("Cross-entropy: {}", cross_entropy(&p, &q));
    println!("KL(P||Q): {}", kl_div(&p, &q));
    // ML: cross-entropy loss
    let y_true = [1.0, 0.0];
    let y_pred = [0.8, 0.2];
    let loss = -y_true.iter().zip(y_pred.iter()).map(|(&yt, &yp)| yt * yp.ln()).sum::<f64>();
    println!("CE loss: {}", loss);
    // Differential entropy of a standard normal
    let sigma: f64 = 1.0;
    let h_diff = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * sigma.powi(2)).ln();
    println!("Differential entropy normal: {}", h_diff);
}
These snippets compute entropy, cross-entropy, and KL divergence for discrete distributions, plus the differential entropy of a Gaussian as a continuous example.
9. Symbolic Computations with SymPy
SymPy gives exact symbolic expressions.
from sympy import symbols, log, diff, solve

# Bernoulli entropy and its maximizer
p = symbols('p', positive=True)
H = -p * log(p) - (1 - p) * log(1 - p)
dH_dp = diff(H, p)
crit_p = solve(dH_dp, p)[0]
print("Max entropy p:", crit_p)  # 1/2

# Symbolic KL divergence between Bernoulli(p) and Bernoulli(q)
q = symbols('q', positive=True)
KL = p * log(p/q) + (1 - p) * log((1 - p)/(1 - q))
print("KL(P||Q):", KL)
Rust has no standard symbolic algebra library, so the results derived with SymPy are simply echoed:
fn main() {
    // Values taken from the SymPy derivation above
    println!("Max entropy p: 0.5");
    println!("KL(P||Q): p ln(p/q) + (1-p) ln((1-p)/(1-q))");
}
10. Challenges in ML
- Zero probabilities: log(0) is undefined; clip or smooth probabilities.
- Differential entropy can be negative and is not directly comparable to discrete entropy (see the example below).
- High dimensions: entropy estimation suffers from the curse of dimensionality.
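For instance, a sufficiently narrow Gaussian already has negative differential entropy; a quick check assuming NumPy:
import numpy as np

# Differential entropy of N(0, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)
for sigma in (1.0, 0.1):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    print(sigma, h)  # ~1.42 nats for sigma=1, ~-0.88 nats for sigma=0.1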
11. Key ML Takeaways
- Entropy measures uncertainty and model complexity.
- Cross-entropy loss measures predictive accuracy.
- KL divergence quantifies distribution mismatch and serves as a regularizer.
- Properties such as non-negativity and asymmetry guide how each measure is used.
- The code above turns these measures into practical computations.
Information theory enhances ML.
12. Summary
We explored entropy, cross-entropy, and KL divergence, their properties and derivations, and their ML applications. Worked examples and Python/Rust code bridge theory to practice, preparing for the next lectures on Markov chains and Bayesian inference.
Further Reading
- Cover & Thomas, Elements of Information Theory.
- Bishop, Pattern Recognition and Machine Learning (Ch. 1.6).
- Murphy, Probabilistic Machine Learning (Ch. 3).
- Rust: implement custom entropy functions as an exercise.