Maximum A Posteriori (MAP) Estimation
Maximum A Posteriori (MAP) estimation is a Bayesian approach to parameter estimation that incorporates prior knowledge about the parameters by maximizing the posterior probability given the data. In machine learning (ML), MAP sits between MLE and full Bayesian inference: it yields regularized point estimates that help prevent overfitting, as in ridge regression and LASSO. By combining the likelihood with a prior, MAP offers a principled way to update beliefs in probabilistic models.
This eighth lecture in the “Probability Foundations for AI/ML” series builds on MLE, exploring MAP’s formulation, incorporation of priors, comparison to MLE, derivations for common cases, properties, and ML applications. We’ll provide intuitive insights, mathematical derivations, and practical implementations in Python and Rust, preparing you for entropy and advanced Bayesian methods.
1. Intuition Behind MAP Estimation
MAP extends MLE by including prior beliefs: instead of maximizing P(data|θ), it maximizes P(θ|data) ∝ P(data|θ) P(θ), balancing data fit against prior assumptions.
Think of it as MLE with “regularization” from priors—priors pull estimates toward reasonable values when data is scarce.
ML Connection
- Regularization: The L2 penalty in ridge regression is MAP with a Gaussian prior.
- Overfitting Prevention: Priors constrain the parameters when data is sparse.
::: info
MAP is like MLE with a “reality check” from priors, preventing wild estimates from noisy data.
:::
Example
- Coin flips: 3 heads in 5 tosses. The MLE is p = 3/5 = 0.6. With a Beta(2,2) prior (favoring 0.5), the MAP is p = (3+2-1)/(5+2+2-2) = 4/7 ≈ 0.57.
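The coin-flip arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names `bernoulli_mle` and `beta_bernoulli_map` are illustrative, and exact fractions are used only to make the 4/7 visible.

```python
from fractions import Fraction

def bernoulli_mle(k, n):
    """MLE of the heads probability: observed fraction of heads."""
    return Fraction(k, n)

def beta_bernoulli_map(k, n, alpha, beta):
    """Posterior mode under a Beta(alpha, beta) prior.

    Assumes the mode is interior: k + alpha > 1 and n - k + beta > 1.
    """
    return Fraction(k + alpha - 1, n + alpha + beta - 2)

k, n = 3, 5  # 3 heads in 5 tosses
p_mle = bernoulli_mle(k, n)
p_map = beta_bernoulli_map(k, n, alpha=2, beta=2)

print(f"MLE: {p_mle} = {float(p_mle):.3f}")
print(f"MAP: {p_map} = {float(p_map):.3f}")
```

The MAP estimate lands between the prior's preferred value 0.5 and the observed frequency 0.6.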
2. Formal Definition: Posterior, Prior, Likelihood
From Bayes’ theorem:
P(θ|D) = P(D|θ) P(θ) / P(D)
MAP: θ_hat = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ), since P(D) does not depend on θ.
Log-Posterior: log P(θ|D) = log P(D|θ) + log P(θ) + const, where log P(D|θ) is the log-likelihood.
Assumptions
- The prior P(θ) encodes beliefs about θ before seeing the data.
- The likelihood P(D|θ) comes from the model.
ML Insight
- MAP ≡ MLE with a log-prior penalty added to the log-likelihood.
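To make the penalty view concrete, here is a short worked derivation for one common choice, a zero-mean isotropic Gaussian prior θ ~ N(0, τ²I). The symbol λ is introduced here only as shorthand for 1/τ².

```latex
\begin{aligned}
\log P(\theta) &= -\frac{1}{2\tau^2}\lVert\theta\rVert_2^2 + \text{const}, \\
\hat{\theta}_{\text{MAP}}
  &= \arg\max_\theta \Big[ \log P(D \mid \theta) - \frac{1}{2\tau^2}\lVert\theta\rVert_2^2 \Big] \\
  &= \arg\min_\theta \Big[ -\log P(D \mid \theta) + \frac{\lambda}{2}\lVert\theta\rVert_2^2 \Big],
  \qquad \lambda = \frac{1}{\tau^2}.
\end{aligned}
```

A tighter prior (smaller τ²) therefore means a larger L2 penalty, and a flat prior (τ² → ∞) recovers the MLE.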
3. Comparison to MLE
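MLE: maximizes P(D|θ); equivalent to MAP under a flat (uniform) prior.
MAP: maximizes P(θ|D) with an explicit prior.
- The MLE is unbiased in some models (e.g., the Bernoulli proportion); MAP is biased toward the prior.
- MAP typically reduces variance, which helps with small samples.
- As n→∞, MAP → MLE (the data dominate), provided the prior has positive density near the true parameter.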
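The bias, variance, and large-n claims in this comparison can be checked exactly for the Bernoulli case by summing over the binomial distribution of k. This is a sketch under stated assumptions: the true value `p_true = 0.6` and the Beta(2,2) prior are illustrative choices, and the helper names are hypothetical.

```python
from math import comb

def exact_moments(estimator, n, p_true):
    """Exact bias, variance and MSE of an estimator of p when k ~ Binomial(n, p_true)."""
    pmf = [comb(n, k) * p_true**k * (1 - p_true)**(n - k) for k in range(n + 1)]
    est = [estimator(k, n) for k in range(n + 1)]
    mean = sum(w * e for w, e in zip(pmf, est))
    var = sum(w * (e - mean) ** 2 for w, e in zip(pmf, est))
    bias = mean - p_true
    return bias, var, bias**2 + var

def mle(k, n):
    return k / n

def map_beta22(k, n):
    # Posterior mode under a Beta(2, 2) prior: (k + 2 - 1) / (n + 2 + 2 - 2)
    return (k + 1) / (n + 2)

p_true = 0.6  # assumed true heads probability, for illustration only
for n in (5, 50, 500):
    b_mle, v_mle, mse_mle = exact_moments(mle, n, p_true)
    b_map, v_map, mse_map = exact_moments(map_beta22, n, p_true)
    print(f"n={n:4d}  MLE: bias={b_mle:+.4f} var={v_mle:.5f} mse={mse_mle:.5f}"
          f"  |  MAP: bias={b_map:+.4f} var={v_map:.5f} mse={mse_map:.5f}")
```

With these settings the MAP estimate trades a small bias for lower variance, and the two estimators become nearly indistinguishable as n grows. If the true p is far from the prior mean (for example close to 0 or 1), the bias term can dominate and the MAP estimate can have the larger MSE.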
When to Use MAP
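- Sparse data: Priors stabilize the estimate.
- Regularization: Priors prevent extreme parameter values.
Example: With unknown mean, the MLE of a normal variance divides by n and is biased downward; a MAP estimate with an inverse-gamma prior pulls it toward the prior, which stabilizes it at small n.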
4. Deriving MAP for Common Distributions
Bernoulli with Beta Prior
Likelihood: L(p) = p^k (1-p)^{n-k}.
Prior: Beta(α,β), P(p) ∝ p^{α-1} (1-p)^{β-1}.
Posterior ∝ p^{k+α-1} (1-p)^{n-k+β-1}.
MAP: p_hat = (k+α-1)/(n+α+β-2).
For α=β=1 (uniform prior), MAP = MLE = k/n. The formula gives the interior mode when k+α > 1 and n-k+β > 1.
Normal Mean with Normal Prior
Likelihood (known σ^2): L(μ) ∝ exp(-n (μ - \bar{x})^2 / (2σ^2)).
Prior: N(μ_0, τ^2), P(μ) ∝ exp(-(μ - μ_0)^2 / (2τ^2)).
Posterior ∝ exp(-(μ - μ_hat)^2 / (2 var_hat)), i.e., N(μ_hat, var_hat) with var_hat = 1/(n/σ^2 + 1/τ^2), where μ_hat is a precision-weighted average of \bar{x} and μ_0.
MAP: μ_hat = (n \bar{x}/σ^2 + μ_0/τ^2) / (n/σ^2 + 1/τ^2).
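The precision-weighted average can be computed directly. The sketch below assumes a known noise variance `sigma2` and uses made-up data; the function name `normal_mean_map` is illustrative.

```python
import numpy as np

def normal_mean_map(x, sigma2, mu0, tau2):
    """MAP estimate (and posterior variance) of a normal mean with known variance.

    The prior is N(mu0, tau2); the result is a precision-weighted average
    of the sample mean and the prior mean.
    """
    n = len(x)
    x_bar = np.mean(x)
    data_precision = n / sigma2      # n / sigma^2
    prior_precision = 1.0 / tau2     # 1 / tau^2
    post_var = 1.0 / (data_precision + prior_precision)
    mu_map = post_var * (data_precision * x_bar + prior_precision * mu0)
    return mu_map, post_var

x = np.array([4.8, 5.6, 5.1, 6.0])  # illustrative data
mu_map, post_var = normal_mean_map(x, sigma2=1.0, mu0=0.0, tau2=1.0)
print("sample mean:", np.mean(x))
print("MAP mean:", mu_map, "posterior variance:", post_var)
```

Because the posterior is Gaussian, its mode equals its mean, so the MAP estimate here coincides with the posterior mean. Increasing `tau2` weakens the prior and moves the estimate toward the sample mean.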
Normal Variance with Inverse-Gamma Prior
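With known mean μ and prior σ^2 ~ Inverse-Gamma(a, b), P(σ^2) ∝ (σ^2)^{-(a+1)} exp(-b/σ^2), the posterior is Inverse-Gamma(a + n/2, b + S/2) with S = Σ_i (x_i - μ)^2.
MAP: σ^2_hat = (b + S/2) / (a + n/2 + 1).
This is a weighted average of the prior mode b/(a+1) and the MLE S/n, so the prior shrinks the estimate toward itself rather than removing bias in general. The unknown-mean case is more involved.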
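As a sketch of the simplest version of this case, assume the mean μ is known and the prior on σ² is Inverse-Gamma(a, b) with density proportional to (σ²)^(-(a+1)) exp(-b/σ²). The posterior is then Inverse-Gamma(a + n/2, b + S/2) with S = Σ(x_i - μ)², and its mode is (b + S/2)/(a + n/2 + 1). The data and hyperparameters below are illustrative.

```python
import numpy as np

def normal_variance_mle(x, mu):
    """MLE of the variance when the mean mu is known."""
    return np.mean((x - mu) ** 2)

def normal_variance_map(x, mu, a, b):
    """Posterior mode of sigma^2 under an Inverse-Gamma(a, b) prior, known mean mu."""
    n = len(x)
    s = np.sum((x - mu) ** 2)
    return (b + 0.5 * s) / (a + 0.5 * n + 1.0)

x = np.array([1.0, -1.0, 2.0, -2.0])  # illustrative data, known mean 0
a, b = 2.0, 2.0                       # assumed prior hyperparameters

var_mle = normal_variance_mle(x, mu=0.0)
var_map = normal_variance_map(x, mu=0.0, a=a, b=b)
prior_mode = b / (a + 1.0)

print("prior mode:", prior_mode)
print("MLE:", var_mle)
print("MAP:", var_map)
```

The MAP estimate is a weighted average of the prior mode (weight a + 1) and the MLE (weight n/2), so it always falls between the two.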
ML Application
- Ridge Regression: MAP with a zero-mean Gaussian prior on the coefficients β.
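For linear regression the ridge MAP estimate has a closed form, β = (XᵀX + λI)⁻¹Xᵀy. The sketch below assumes unit noise variance and a N(0, I/λ) prior on every coefficient, including the intercept; the function name `ridge_map` and the toy data are illustrative.

```python
import numpy as np

def ridge_map(X, y, lam):
    """Closed-form MAP for linear regression with a N(0, I/lam) prior on beta.

    Solves (X^T X + lam * I) beta = X^T y; lam = 0 gives ordinary least squares
    when X^T X is invertible.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept column plus one feature
y = np.array([2.0, 3.0, 4.0])

beta_mle = ridge_map(X, y, lam=0.0)  # no prior: least squares
beta_map = ridge_map(X, y, lam=0.1)  # Gaussian prior: ridge

print("MLE beta:", beta_mle)
print("MAP beta:", beta_map)
print("norms:", np.linalg.norm(beta_mle), np.linalg.norm(beta_map))
```

The prior shrinks the coefficient vector toward zero, so the MAP solution has a smaller norm than the least-squares one. In practice the intercept is often left unpenalized; that choice is not made here.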
5. Properties of MAP Estimates
- Consistency: If the prior has positive density around the true parameter, MAP is consistent under the same conditions as MLE.
- Asymptotic Normality: Same limiting distribution as MLE; the prior matters mainly at small n.
- Bias-Variance Tradeoff: The prior introduces bias but reduces variance.
- Invariance: The MAP estimate is not invariant under reparametrization (unlike the MLE), because the posterior density picks up a Jacobian factor.
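The lack of invariance is easy to see numerically. The sketch below takes a Beta(5, 4) posterior (3 heads in 5 tosses with a Beta(2, 2) prior) and finds its mode twice: once in p, and once in the log-odds φ = log(p/(1-p)), where the change of variables multiplies the density by the Jacobian p(1-p). The grid ranges and sizes are arbitrary choices.

```python
import numpy as np

a, b = 5.0, 4.0  # Beta(a, b) posterior: 3 heads in 5 tosses, Beta(2, 2) prior

# Mode in the original parametrization p
p_grid = np.linspace(1e-4, 1 - 1e-4, 200_001)
log_post_p = (a - 1) * np.log(p_grid) + (b - 1) * np.log1p(-p_grid)
p_map = p_grid[np.argmax(log_post_p)]

# Mode in the log-odds parametrization phi = log(p / (1 - p)).
# The density of phi gains the Jacobian dp/dphi = p (1 - p),
# which raises both exponents by one.
phi_grid = np.linspace(-8.0, 8.0, 200_001)
p_of_phi = 1.0 / (1.0 + np.exp(-phi_grid))
log_post_phi = a * np.log(p_of_phi) + b * np.log1p(-p_of_phi)
phi_map = phi_grid[np.argmax(log_post_phi)]
p_from_phi_map = 1.0 / (1.0 + np.exp(-phi_map))

print("MAP in p:                     ", p_map)           # close to 4/7
print("MAP in phi, mapped back to p: ", p_from_phi_map)  # close to 5/9
```

The two answers differ (about 4/7 versus 5/9), whereas the MLE k/n is the same in either parametrization.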
ML Insight
- MAP’s regularization often improves generalization.
6. Choosing Priors for MAP
Conjugate Priors: The posterior belongs to the same family as the prior (e.g., Beta for Bernoulli), giving closed-form MAP estimates.
Non-Informative: Uniform or Jeffreys priors, used when little prior knowledge is available.
Informative: Based on domain knowledge.
In ML: Zero-mean Gaussian priors on neural-network weights, which correspond to L2 weight decay.
7. Numerical Optimization for MAP
Similar to MLE, but add log P(θ) to the objective.
Use gradient ascent on the log-posterior (or gradient descent on its negative).
In code: Include the prior term in the objective.
::: code-group

```python
import numpy as np
from scipy.optimize import minimize

# MAP for Bernoulli with Beta prior (log-posterior up to a constant)
def log_post_bern(p, k, n, alpha=2, beta=2):
    return (k + alpha - 1) * np.log(p) + (n - k + beta - 1) * np.log(1 - p)

k, n = 3, 5
# minimize passes a length-1 array, so index it to return a scalar objective
res = minimize(lambda p: -log_post_bern(p[0], k, n), 0.5, bounds=[(0.01, 0.99)])
print("MAP p Bernoulli:", res.x[0])

# MAP for linear regression with Gaussian prior (ridge).
# Returns the NEGATIVE log-posterior, so it can be passed to minimize directly.
def neg_log_post_lin(beta, X, y, lam=0.1):
    pred = X @ beta
    lik = -0.5 * np.sum((y - pred)**2)
    prior = -lam / 2 * np.sum(beta**2)
    return -(lik + prior)

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 3, 4])
beta_init = np.zeros(2)
res = minimize(neg_log_post_lin, beta_init, args=(X, y))
print("MAP β linear (ridge):", res.x)

# ML: MAP for logistic regression with L2 penalty (negative log-posterior).
# The toy data are linearly separable, so the prior is what keeps β finite.
def neg_log_post_logistic(beta, X, y, lam=0.1):
    p = 1 / (1 + np.exp(-X @ beta))
    lik = np.sum(y * np.log(p + 1e-10) + (1 - y) * np.log(1 - p + 1e-10))
    prior = -lam / 2 * np.sum(beta**2)
    return -(lik + prior)

X_log = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y_log = np.array([0, 0, 1, 1])
beta_init = np.zeros(2)
res_log = minimize(neg_log_post_logistic, beta_init, args=(X_log, y_log))
print("MAP β logistic (L2):", res_log.x)
```

```rust
fn log_post_bern(p: f64, k: f64, n: f64, alpha: f64, beta: f64) -> f64 {
    (k + alpha - 1.0) * p.ln() + (n - k + beta - 1.0) * (1.0 - p).ln()
}

fn main() {
    // MAP Bernoulli (simple grid search over p = 0.01, ..., 0.99)
    let k = 3.0;
    let n = 5.0;
    let alpha = 2.0;
    let beta = 2.0;
    let mut max_p = 0.0;
    let mut max_val = f64::NEG_INFINITY;
    for i in 1..100 {
        let p = i as f64 / 100.0;
        let val = log_post_bern(p, k, n, alpha, beta);
        if val > max_val {
            max_val = val;
            max_p = p;
        }
    }
    println!("MAP p Bernoulli: {}", max_p);

    // MAP linear (ridge, simple GD on the negative log-posterior)
    let x = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]];
    let y = [2.0, 3.0, 4.0];
    let lam = 0.1;
    let mut beta = [0.0, 0.0];
    let eta = 0.01;
    // 5000 steps: with this step size, 1000 steps stop short of convergence
    for _ in 0..5000 {
        let mut grad_lik = [0.0, 0.0];
        for (i, &yi) in y.iter().enumerate() {
            let pred = beta[0] * x[i][0] + beta[1] * x[i][1];
            let err = pred - yi;
            grad_lik[0] += err * x[i][0];
            grad_lik[1] += err * x[i][1];
        }
        let grad_prior = [lam * beta[0], lam * beta[1]];
        let grad = [grad_lik[0] + grad_prior[0], grad_lik[1] + grad_prior[1]];
        beta[0] -= eta * grad[0];
        beta[1] -= eta * grad[1];
    }
    println!("MAP β linear (ridge): {:?}", beta);
}
```

:::
This code computes MAP estimates for a Bernoulli parameter with a Beta prior and for linear and logistic regression with a Gaussian prior.
8. Symbolic MAP with SymPy
Derive closed forms symbolically.
::: code-group

```python
from sympy import symbols, diff, solve, log, simplify

# Bernoulli likelihood with a Beta(alpha, beta) prior
p, k, n, alpha, beta_sym = symbols('p k n alpha beta', positive=True)
l_post = (k + alpha - 1) * log(p) + (n - k + beta_sym - 1) * log(1 - p)
dl_dp = diff(l_post, p)
p_map = solve(dl_dp, p)[0]
print("MAP p Bernoulli:", simplify(p_map))

# Normal mean with known sigma and prior N(0, τ^2); n observations with mean x_bar
mu, x_bar = symbols('mu x_bar', real=True)
tau, sigma = symbols('tau sigma', positive=True)
l_post_norm = -n * (mu - x_bar)**2 / (2 * sigma**2) - (mu - 0)**2 / (2 * tau**2)
dl_mu = diff(l_post_norm, mu)
mu_map = solve(dl_mu, mu)[0]
print("MAP μ Normal:", simplify(mu_map))
```

```rust
fn main() {
    // Closed forms only; Rust has no symbolic algebra in the standard library
    println!("MAP p Bernoulli: (k + alpha - 1)/(n + alpha + beta - 2)");
    println!("MAP μ Normal: (n * x_bar / sigma^2) / (n / sigma^2 + 1 / tau^2)");
}
```

:::
9. Applications in ML
- Ridge/LASSO: MAP with Gaussian/Laplace priors on the coefficients.
- Bayesian Neural Nets: MAP gives a point estimate of the weights in place of the full posterior.
- Sparse Models: A Laplace prior encourages sparsity.
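The difference between the two priors shows up already in a one-dimensional problem. Assume a single observation z of a coefficient β with unit-variance Gaussian noise. With a Laplace prior (density proportional to exp(-λ|β|)) the MAP estimate is the soft-threshold of z; with a Gaussian prior (density proportional to exp(-λβ²/2)) it is a uniform rescaling. The function names and values below are illustrative.

```python
import numpy as np

def map_laplace(z, lam):
    """argmin_b 0.5 * (z - b)^2 + lam * |b|  (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def map_gaussian(z, lam):
    """argmin_b 0.5 * (z - b)^2 + 0.5 * lam * b^2  (proportional shrinkage)."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.4, 1.5])  # illustrative noisy coefficients
lam = 0.5

b_laplace = map_laplace(z, lam)
b_gaussian = map_gaussian(z, lam)

print("Laplace prior (LASSO-like):", b_laplace)
print("Gaussian prior (ridge-like):", b_gaussian)
print("exact zeros under Laplace:", int(np.sum(b_laplace == 0.0)))
```

Observations with |z| ≤ λ are set exactly to zero under the Laplace prior, while the Gaussian prior shrinks every coefficient but zeroes none of the nonzero ones.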
10. Challenges and Considerations
- Prior Selection: Subjective; the choice affects the estimate, especially at small n.
- Computation: Same as MLE, with extra prior terms in the objective.
- Non-Conjugate Priors: No closed form in general; requires numerical optimization.
11. Key ML Takeaways
- MAP incorporates priors: For regularization.
- Posterior max: Balances data and belief.
- Comparison to MLE: Adds bias, reduces variance.
- Conjugate priors simplify: Closed forms.
- Code optimizes posteriors: Practical MAP.
MAP enhances MLE with prior knowledge.
12. Summary
This lecture explored MAP estimation from intuition to derivations with Beta and Normal priors, its properties, and ML applications such as ridge regression. The examples and Python/Rust code bridge theory to practice. It prepares you for the upcoming material on entropy and Markov chains.
Further Reading
- Wasserman, All of Statistics (Ch. 13).
- Bishop, Pattern Recognition (Ch. 3.4).
- Murphy, Probabilistic ML (Ch. 7).
- Rust: ‘argmin’ for optimization.