
Entropy, Cross-Entropy & KL Divergence

Entropy, cross-entropy, and Kullback-Leibler (KL) divergence are key concepts from information theory that quantify uncertainty, dissimilarity between distributions, and information loss. In machine learning (ML), they serve as loss functions for classification (cross-entropy), regularization in variational autoencoders (KL), and measures of model complexity (entropy). These tools enable efficient training, model comparison, and understanding of probabilistic predictions.

This ninth lecture in the “Probability Foundations for AI/ML” series builds on MLE and MAP, exploring entropy as a measure of uncertainty, cross-entropy as a measure of predictive accuracy, KL divergence as a measure of dissimilarity between distributions, their properties, and their ML applications. We’ll provide intuitive explanations, mathematical derivations, and implementations in Python and Rust, preparing you for Markov chains and Bayesian inference.


Entropy H(P) measures the average uncertainty or information in a distribution P—the “surprise” level or bits needed to encode outcomes.

For a discrete distribution P with probabilities p_i, entropy is high when P is near-uniform (maximum uncertainty) and low when it is concentrated on a few outcomes (predictable).

Formally, entropy is the negative expected log-probability: H(X) = -E[log P(X)].

  • Decision Trees: Splits are chosen to reduce entropy (information gain; see the sketch below).
  • Generative Models: Maximize entropy for diverse outputs.
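
To make the decision-tree use concrete, information gain is the parent node’s entropy minus the weighted entropy of its children. A minimal sketch, assuming NumPy and hypothetical class counts:

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (bits) of a label distribution given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical split: parent node holds 10 positive / 10 negative examples
parent = [10, 10]
left, right = [8, 2], [2, 8]  # children produced by a candidate split

n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted_children = (n_left / n) * entropy_from_counts(left) \
                  + (n_right / n) * entropy_from_counts(right)

info_gain = entropy_from_counts(parent) - weighted_children
print(info_gain)  # ~0.278 bits: how much this split reduces label entropy
```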

::: info Entropy gauges “chaos” in probabilities—like how unpredictable a lottery is. :::

  • Fair coin: H = -2 × 0.5 log2(0.5) = 1 bit (max uncertainty).
  • Biased coin (p=0.9): H≈0.47 bits (more predictable).

For discrete RV with P(X=x_i)=p_i:

H(X) = - sum p_i log p_i (log base 2 for bits, e for nats).

Continuous (differential entropy): h(X) = - ∫ f(x) log f(x) dx.
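
A quick numerical check of the discrete formula against the coin examples above; a minimal sketch, assuming NumPy:

```python
import numpy as np

def discrete_entropy(p, base=2.0):
    """H(X) = -sum p_i log p_i; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(discrete_entropy([0.5, 0.5]))        # 1.0 bit (fair coin)
print(discrete_entropy([0.9, 0.1]))        # ~0.469 bits (biased coin)
print(discrete_entropy([0.5, 0.5], np.e))  # ~0.693 nats (natural log)
```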

Properties:

  • Non-negative: H(X) ≥ 0.
  • H(X) = 0 iff X is deterministic (all probability mass on one outcome).
  • Maximized by the uniform distribution over k outcomes: H = log k.
  • Additive for independent variables: H(X,Y) = H(X) + H(Y) (both checked numerically below).
  • In ML, entropy regularizers exploit these properties to promote diversity, e.g., in GANs.
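
A sketch verifying the maximum-entropy and additivity properties numerically, assuming NumPy (the example distributions are arbitrary):

```python
import numpy as np

def H(p, base=2.0):
    """Discrete entropy in the given base, skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# Uniform over k = 4 outcomes attains the maximum log2(4) = 2 bits
print(H([0.25] * 4))            # 2.0
print(H([0.7, 0.1, 0.1, 0.1]))  # < 2.0 for any non-uniform distribution

# Additivity: for independent X and Y, H(X,Y) = H(X) + H(Y)
px = np.array([0.3, 0.7])
py = np.array([0.2, 0.5, 0.3])
joint = np.outer(px, py).ravel()  # joint distribution under independence
print(H(joint), H(px) + H(py))    # the two values agree
```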

3. Cross-Entropy: Measuring Predictive Accuracy

Cross-entropy H(P,Q) = - sum p_i log q_i.

Measures bits needed to encode P using Q’s code—inefficiency if Q ≠ P.

For continuous: h(P,Q) = - ∫ f_P(x) log f_Q(x) dx.

Properties:

  • H(P,Q) ≥ H(P), equality iff P=Q.
  • Not symmetric.
  • Loss function: Min H(true, pred) in classification.

Example: True P = [1, 0], predicted Q = [0.9, 0.1]: H(P,Q) = -1·log(0.9) - 0·log(0.1) ≈ 0.105 nats.
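
The same numbers can be reproduced directly; a minimal sketch, assuming NumPy and natural logs (nats):

```python
import numpy as np

p = np.array([1.0, 0.0])  # true distribution (one-hot label)
q = np.array([0.9, 0.1])  # predicted distribution

# H(P,Q) = -sum p_i log q_i; terms with p_i = 0 contribute nothing
mask = p > 0
print(-np.sum(p[mask] * np.log(q[mask])))  # ~0.105 nats
```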


KL(P||Q) = sum p_i log (p_i / q_i) = H(P,Q) - H(P).

“Relative entropy”—extra bits to encode P with Q vs optimal.

Continuous: KL(P||Q) = ∫ f_P log (f_P / f_Q) dx.

Properties:

  • KL≥0, =0 iff P=Q.
  • Asymmetric: KL(P||Q) ≠ KL(Q||P).
  • Not a true distance metric (fails the triangle inequality).

Non-negativity follows from Jensen’s inequality: since -log is convex, KL(P||Q) = E_P[-log(q/p)] ≥ -log E_P[q/p] = -log 1 = 0.

  • VAEs: KL regularizes latent dist to prior.
  • Policy gradients: KL constrains updates.

Example: P=Bern(0.5), Q=Bern(0.6), KL(P||Q)≈0.029 bits.
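
Reproducing this example, and illustrating the asymmetry, with a small sketch assuming NumPy (strictly positive probabilities, so no smoothing is needed):

```python
import numpy as np

def kl_bits(p, q):
    """KL(P||Q) = sum p_i log2(p_i / q_i) for strictly positive p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

P = [0.5, 0.5]  # Bernoulli(0.5)
Q = [0.6, 0.4]  # Bernoulli(0.6)
print(kl_bits(P, Q))  # ~0.029 bits
print(kl_bits(Q, P))  # close here, but KL is not symmetric in general
```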


  • Cross-entropy = entropy + KL: H(P,Q) = H(P) + KL(P||Q).
  • Mutual information: I(X;Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) - H(X|Y) (see the sketch after this list).
  • Chain rule: H(X,Y)=H(X)+H(Y|X).
  • Information bottleneck: Min KL for compression.
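
A sketch of the mutual-information identities, assuming NumPy and a hypothetical 2×2 joint distribution:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over two binary variables
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
px = joint.sum(axis=1)  # marginal P(X)
py = joint.sum(axis=0)  # marginal P(Y)

# I(X;Y) as KL(P(X,Y) || P(X)P(Y)), in bits
I_kl = np.sum(joint * np.log2(joint / np.outer(px, py)))

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
H = lambda p: -np.sum(p * np.log2(p))
H_x_given_y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(len(py)))
print(I_kl, H(px) - H_x_given_y)  # the two expressions agree
```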

Entropy for Bernoulli(p): H(p) = -p log p - (1-p) log(1-p).

It is maximized at p = 0.5, where dH/dp = log((1-p)/p) = 0.

KL divergence between Normals: KL(N(μ1, σ1²) || N(μ2, σ2²)) = log(σ2/σ1) + (σ1² + (μ1 - μ2)²) / (2σ2²) - 1/2.

Closed form useful in VAEs.
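
A quick check of the closed form against a Monte Carlo estimate of E_P[log(f_P / f_Q)]; a sketch assuming NumPy, with arbitrary example parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1 = 0.0, 1.0  # P = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0  # Q = N(mu2, s2^2)

# Closed-form KL(P||Q) in nats
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo estimate: average log-density ratio under samples from P
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
log_q = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
print(kl_closed, np.mean(log_p - log_q))  # the estimates agree closely
```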

  • Mode collapse in GANs: High entropy promotes diversity.

  1. Loss Functions: Cross-entropy for classification.
  2. VAEs: KL(q(z|x) || p(z)) regularizes the latent distribution toward the prior.
  3. RL: Entropy bonuses for exploration (see the sketch below).
  4. Clustering: Entropy for purity measures.
  5. Feature Selection: Mutual info for relevance.
  • Log(0) issues: Add smoothing.
  • High-dim entropy estimation hard.
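
As a tiny illustration of the RL item above, an entropy bonus adds β·H(π(·|s)) to the objective so near-deterministic policies are penalized; a sketch with made-up action probabilities, assuming NumPy:

```python
import numpy as np

def policy_entropy(probs):
    """Entropy (nats) of a policy's action distribution at one state."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

beta = 0.01                     # entropy-bonus coefficient (hypothetical value)
exploratory = [0.4, 0.3, 0.3]   # spread-out action probabilities
greedy = [0.98, 0.01, 0.01]     # nearly deterministic policy

print(beta * policy_entropy(exploratory))  # larger bonus -> more exploration
print(beta * policy_entropy(greedy))       # smaller bonus
```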

Compute entropy, cross-entropy, KL.

::: code-group

```python [Python]
import numpy as np
from scipy.stats import entropy

# Discrete entropy of a fair coin
p = np.array([0.5, 0.5])
print("Entropy coin:", entropy(p, base=2))  # 1 bit

# Cross-entropy H(P, Q), with a small epsilon to guard against log(0)
q = np.array([0.9, 0.1])
cross_h = -np.sum(p * np.log(q + 1e-10))
print("Cross-entropy:", cross_h)

# KL divergence (scipy's entropy with two arguments), in nats
kl = entropy(p, q)
print("KL(P||Q):", kl)

# ML: cross-entropy loss for a one-hot label
y_true = np.array([1, 0])
y_pred = np.array([0.8, 0.2])
loss = -np.sum(y_true * np.log(y_pred))
print("CE loss:", loss)

# Differential entropy of a Gaussian (closed form)
mu, sigma = 0, 1
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print("Differential entropy normal:", h_diff)
```

```rust [Rust]
fn entropy(p: &[f64], base: f64) -> f64 {
    -p.iter()
        .map(|&pi| if pi > 0.0 { pi * pi.log(base) } else { 0.0 })
        .sum::<f64>()
}

fn cross_entropy(p: &[f64], q: &[f64]) -> f64 {
    -p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 { pi * qi.ln() } else { 0.0 })
        .sum::<f64>()
}

fn kl_div(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| if pi > 0.0 && qi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum::<f64>()
}

fn main() {
    let p = [0.5, 0.5];
    println!("Entropy coin (base 2): {}", entropy(&p, 2.0)); // 1.0

    let q = [0.9, 0.1];
    println!("Cross-entropy: {}", cross_entropy(&p, &q));
    println!("KL(P||Q): {}", kl_div(&p, &q));

    // ML: cross-entropy loss for a one-hot label
    let y_true = [1.0, 0.0];
    let y_pred = [0.8_f64, 0.2];
    let loss = -y_true
        .iter()
        .zip(y_pred.iter())
        .map(|(&yt, &yp)| if yt > 0.0 { yt * yp.ln() } else { 0.0 })
        .sum::<f64>();
    println!("CE loss: {}", loss);

    // Differential entropy of a standard normal (closed form)
    let sigma: f64 = 1.0;
    let h_diff = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * sigma.powi(2)).ln();
    println!("Differential entropy normal: {}", h_diff);
}
```

:::

This computes entropy, cross-entropy, and KL divergence for discrete distributions, plus the closed-form differential entropy of a Gaussian.


The same quantities can also be derived symbolically as exact expressions.

::: code-group

```python [Python]
from sympy import symbols, log, diff, solve

# Bernoulli entropy H(p) and its maximizer
p = symbols('p', positive=True)
H = -p * log(p) - (1 - p) * log(1 - p)
dH_dp = diff(H, p)
crit_p = solve(dH_dp, p)[0]
print("Max entropy p:", crit_p)  # 1/2

# Exact KL divergence between Bernoulli(p) and Bernoulli(q)
q = symbols('q', positive=True)
KL = p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))
print("KL(P||Q):", KL)
```

```rust [Rust]
fn main() {
    // No symbolic algebra here; we simply report the closed-form results.
    println!("Max entropy p: 0.5");
    println!("KL(P||Q): p ln(p/q) + (1-p) ln((1-p)/(1-q))");
}
```

:::


  • Zero Probs: Log undefined; add smoothing.
  • Differential Entropy: Can be negative; not directly comparable to discrete entropy.
  • High-Dim: Entropy estimation curse.

  • Entropy measures uncertainty: Model complexity.
  • Cross-entropy loss: Predictive accuracy.
  • KL quantifies divergence: Regularization.
  • Properties guide usage: Non-negativity, asymmetry.
  • Code computes: Practical information measures.

Information theory enhances ML.


This lecture explored entropy, cross-entropy, and KL divergence, along with their properties, derivations, and ML applications. Worked examples and Python/Rust code bridge theory to practice, preparing for Markov chains and Bayesian inference.



  • Cover, Thomas, Elements of Information Theory.
  • Bishop, Pattern Recognition (Ch. 1.6).
  • Murphy, Probabilistic ML (Ch. 3).
  • Rust: Implement custom entropy functions.