Bayesian Inference for Machine Learning
Bayesian inference treats parameters as random variables, updating prior beliefs with data to form posterior distributions. In machine learning (ML), it enables uncertainty quantification, regularization, and flexible modeling, as seen in Bayesian neural networks, Gaussian processes, and probabilistic programming. Unlike frequentist methods, Bayesian approaches provide full distributions over parameters, allowing better decision-making under uncertainty.
This eleventh and final lecture in the “Probability Foundations for AI/ML” series builds on entropy, Markov chains, and estimation techniques, exploring Bayesian basics, conjugate priors, approximate inference methods (MCMC, variational), and ML applications. We’ll provide intuitive explanations, mathematical derivations, and practical implementations in Python and Rust, concluding the series with a solid foundation in probabilistic ML.
1. Intuition Behind Bayesian Inference
Bayesian inference uses Bayes’ theorem to update beliefs: Posterior ∝ Likelihood × Prior.
Prior: Initial belief about parameters. Likelihood: Data’s probability given parameters. Posterior: Updated belief.
Think of it as learning from experience—start with assumptions, revise with evidence.
ML Connection
- Uncertainty: Posteriors quantify confidence (e.g., credible intervals).
- Regularization: Priors prevent overfitting.
::: info
Bayesian inference turns “gut feelings” (priors) into data-driven knowledge (posteriors).
:::
Example
- Coin bias θ: uniform Beta(1,1) prior, 3 heads in 5 flips → posterior Beta(4,3), mean 4/7 ≈ 0.57.
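The coin-flip update can be checked numerically with a minimal sketch (assuming SciPy is available; the counts are the ones from the example):

```python
from scipy import stats

# Beta(1,1) prior; data: 3 heads in 5 flips -> posterior Beta(1+3, 1+2) = Beta(4,3)
alpha, beta = 1, 1
heads, flips = 3, 5
posterior = stats.beta(alpha + heads, beta + flips - heads)
print(posterior.mean())  # 4/7 ≈ 0.571
```

The posterior mean (α + k)/(α + β + n) = 4/7 matches the hand calculation.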
2. Formal Bayesian Framework
P(θ|D) = P(D|θ) P(θ) / P(D)
Marginal Likelihood (Evidence): P(D) = ∫ P(D|θ) P(θ) dθ.
The evidence is often intractable, motivating approximate inference.
Point Estimates: the MAP (posterior mode) or the posterior mean.
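For the Beta-Bernoulli model the evidence integral has a closed form in terms of the Beta function, which a quadrature check can confirm; a sketch assuming SciPy, using the coin-flip counts from Section 1:

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

k, n, a, b = 3, 5, 1.0, 1.0

# Evidence P(D) = ∫ θ^k (1-θ)^(n-k) · Beta(θ; a, b) dθ, for one specific flip sequence
integrand = lambda t: t**k * (1 - t)**(n - k) * t**(a - 1) * (1 - t)**(b - 1) / beta_fn(a, b)
evidence, _ = quad(integrand, 0, 1)

# Closed form: B(a+k, b+n-k) / B(a, b); MAP of the Beta(a+k, b+n-k) posterior
closed_form = beta_fn(a + k, b + n - k) / beta_fn(a, b)
map_estimate = (a + k - 1) / (a + b + n - 2)  # mode, valid when both shapes > 1
```

Note that the MAP (3/5 = 0.6) differs from the posterior mean (4/7 ≈ 0.57): different point summaries of the same posterior.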
Priors
- Conjugate: posterior stays in the prior’s family (e.g., Beta for Bernoulli).
- Non-informative: Flat or Jeffreys.
ML Insight
- Probabilistic models: the full P(θ|D) supports model averaging and ensembles.
3. Conjugate Priors and Closed-Form Posteriors
A prior is conjugate when its family is closed under updating by the likelihood.
Beta-Bernoulli: prior Beta(α,β), Binomial likelihood with k successes in n trials, posterior Beta(α+k, β+n−k).
Normal-Normal: prior N(μ0, τ0²) on the mean with known likelihood variance σ²; the posterior is Normal with precision 1/τn² = 1/τ0² + n/σ² and mean μn = τn² (μ0/τ0² + n x̄/σ²).
Gamma-Poisson: prior Gamma(α, β), Poisson likelihood, posterior Gamma(α + Σ x_i, β + n).
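The Normal-Normal update can be sketched as a small helper (assuming NumPy; the prior and data values are illustrative):

```python
import numpy as np

def normal_normal_update(mu0, tau0_sq, sigma_sq, data):
    """Posterior of a Normal mean under a N(mu0, tau0_sq) prior, known noise variance sigma_sq."""
    n, xbar = len(data), float(np.mean(data))
    precision = 1 / tau0_sq + n / sigma_sq                 # precisions add
    mu_n = (mu0 / tau0_sq + n * xbar / sigma_sq) / precision
    return mu_n, 1 / precision                             # posterior mean and variance

mu_n, var_n = normal_normal_update(0.0, 10.0, 1.0, [1.2, 0.8, 1.0])
```

With more data the likelihood term n/σ² dominates, so the posterior mean shrinks toward x̄ and away from μ0.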
Properties
- Easy computation.
- Interpret α,β as pseudo-observations.
ML Application
- Bayesian A/B testing with Beta posteriors.
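A hedged sketch of Beta-based A/B testing by posterior simulation (assuming NumPy; the conversion counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts: A converts 40/200, B converts 55/200; Beta(1,1) priors on each rate
post_a = rng.beta(1 + 40, 1 + 160, size=100_000)
post_b = rng.beta(1 + 55, 1 + 145, size=100_000)

# P(rate_B > rate_A) estimated by comparing paired posterior draws
prob_b_better = float((post_b > post_a).mean())
print(f"P(B > A) ≈ {prob_b_better:.3f}")
```

Because both posteriors are Beta, sampling is exact here; the Monte Carlo step only approximates the comparison probability.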
4. Approximate Inference: MCMC Methods
For non-conjugate models, sample from the posterior via Markov chain Monte Carlo (MCMC).
Metropolis-Hastings: propose θ′ from q(θ′|θ); accept with probability min(1, [P(θ′|D) q(θ|θ′)] / [P(θ|D) q(θ′|θ)]).
Gibbs Sampling: sample each parameter from its full conditional in turn.
Hamiltonian MC (HMC): uses gradients of the log-posterior for efficient exploration.
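A minimal random-walk Metropolis sampler makes the accept/reject rule concrete; with a symmetric Gaussian proposal the q-ratio cancels. This targets the Beta(4,3) coin posterior from Section 1 (NumPy assumed; step size and chain length are illustrative):

```python
import numpy as np

def metropolis(log_post, init, prop_sd, steps, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the q-ratio cancels."""
    rng = np.random.default_rng(seed)
    x, lp = init, log_post(init)
    samples = np.empty(steps)
    for i in range(steps):
        x_new = x + rng.normal(0.0, prop_sd)
        lp_new = log_post(x_new)
        if np.log(rng.uniform()) < lp_new - lp:  # accept with prob min(1, ratio)
            x, lp = x_new, lp_new
        samples[i] = x
    return samples

# Target: unnormalized Beta(4,3) posterior for a coin bias (3 heads, 2 tails, flat prior)
def log_post(p):
    return 3 * np.log(p) + 2 * np.log(1 - p) if 0 < p < 1 else -np.inf

samples = metropolis(log_post, 0.5, 0.1, 20_000)
print(samples[5000:].mean())  # should be near 4/7 ≈ 0.571
```

Discarding the first draws as burn-in and working with log-densities (to avoid underflow) are standard practices, both used here.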
ML Connection
- Pyro, Stan for probabilistic programming.
5. Variational Inference (VI)
Approximate the posterior with a tractable family q(θ|φ); minimizing D_KL(q ‖ P(θ|D)) is equivalent to maximizing the ELBO = E_q[log P(D, θ) − log q(θ)].
Mean-field: Assume q factorizes.
In ML: typically faster than MCMC on large datasets, at the cost of approximation bias.
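As one sketch of VI (the reparameterization-gradient formulation; others exist), the following fits a Gaussian q to the conjugate Normal-mean posterior, where the exact answer is available for comparison. All hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, 50)     # x_i ~ N(mu, 1); infer mu
tau2, sigma2 = 100.0, 1.0           # prior mu ~ N(0, tau2); known noise variance

# Fit q(mu) = N(m, s^2) by stochastic gradient ascent on the ELBO,
# using the reparameterization mu = m + s * eps, eps ~ N(0, 1).
m, log_s, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    s = np.exp(log_s)
    eps = rng.normal(size=32)
    mu = m + s * eps
    # d log p(D, mu) / d mu, evaluated at each sample
    dlp = (data.sum() - len(data) * mu) / sigma2 - mu / tau2
    m += lr * dlp.mean()                            # ELBO gradient wrt m
    log_s += lr * ((dlp * s * eps).mean() + 1.0)    # +1.0 from the entropy term

# Exact conjugate posterior for comparison
prec = len(data) / sigma2 + 1 / tau2
exact_mean, exact_sd = data.sum() / sigma2 / prec, prec ** -0.5
```

In this conjugate case the Gaussian family contains the true posterior, so q should converge to it; in non-conjugate models the same machinery yields the closest Gaussian in KL.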
6. Bayesian vs. Frequentist in ML
Frequentist methods give point estimates, confidence intervals, and p-values; Bayesian methods give full posterior distributions and credible intervals.
Bayesian advantages: Incorporate priors, full uncertainty.
Challenges: Computational cost, prior sensitivity.
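The contrast can be made concrete on the coin example: a Wald confidence interval versus an equal-tailed credible interval (SciPy assumed; the Wald interval is one of several frequentist choices):

```python
import numpy as np
from scipy import stats

heads, flips = 3, 5

# Frequentist: MLE and a 95% Wald confidence interval
mle = heads / flips
se = np.sqrt(mle * (1 - mle) / flips)
wald_ci = (mle - 1.96 * se, mle + 1.96 * se)

# Bayesian: Beta(1,1) prior -> Beta(4,3) posterior; 95% equal-tailed credible interval
posterior = stats.beta(1 + heads, 1 + flips - heads)
credible = posterior.interval(0.95)
```

The credible interval is a direct probability statement about θ given the data; the confidence interval is a statement about the procedure over repeated sampling. With n = 5 the Wald interval can also extend outside [0, 1], while the Beta credible interval cannot.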
7. Applications in Machine Learning
- Bayesian Neural Nets: priors on weights, posteriors for predictive uncertainty.
- Gaussian Processes: Bayesian non-parametric regression.
- Variational Autoencoders: VI over latent variables.
- RL: Thompson sampling draws actions from posteriors.
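A small Thompson-sampling sketch for Bernoulli bandits, assuming NumPy; the arm reward rates are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.3, 0.5, 0.7])    # hypothetical Bernoulli arm reward probabilities
wins = np.ones(3)                          # Beta(1,1) prior per arm
losses = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(2000):
    theta = rng.beta(wins, losses)         # one posterior draw per arm
    arm = int(np.argmax(theta))            # pull the arm with the best draw
    reward = rng.random() < true_rates[arm]
    wins[arm] += reward                    # conjugate Beta update
    losses[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # the best arm (index 2) should dominate
```

Posterior sampling balances exploration and exploitation automatically: uncertain arms occasionally produce high draws and get tried, while the posterior of the best arm concentrates.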
Challenges
- Scalability: MCMC slow for big models.
8. Numerical Bayesian Inference
Sample from posteriors and compute summary statistics such as means.
::: code-group
```python [Python]
import numpy as np
import pymc as pm
import arviz as az

# Conjugate: Beta-Bernoulli (3 heads, 2 tails in 5 flips)
k, n = 3, 5
with pm.Model() as model:
    p = pm.Beta('p', alpha=1, beta=1)
    obs = pm.Bernoulli('obs', p=p, observed=np.array([1, 1, 1, 0, 0]))
    trace = pm.sample(1000)

print("Posterior mean p:", trace.posterior['p'].mean())

# MCMC for a normal mean
data = np.random.normal(0, 1, 100)
with pm.Model() as norm_model:
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    trace_norm = pm.sample(1000)

print("Posterior mean μ:", trace_norm.posterior['mu'].mean())

# ML: Bayesian linear regression
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 3, 4])
with pm.Model() as blr:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal('sigma', sigma=10)
    mu = pm.math.dot(X, beta)
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=y)
    trace_blr = pm.sample(1000)

print("Posterior beta mean:", trace_blr.posterior['beta'].mean(dim=("chain", "draw")))
```

```rust [Rust]
use rand::Rng;

// Metropolis sampler for the Beta-Bernoulli posterior (independent uniform proposal,
// which is symmetric, so the proposal ratio cancels).
fn metropolis_bern(k: f64, n: f64, alpha: f64, beta: f64, steps: usize) -> f64 {
    let mut rng = rand::thread_rng();
    let mut p: f64 = 0.5;
    let mut sum = 0.0;
    for _ in 0..steps {
        let p_prop: f64 = rng.gen_range(0.0..1.0);
        // Log unnormalized posterior: (k+α-1) ln p + (n-k+β-1) ln(1-p)
        let post_curr = (k + alpha - 1.0) * p.ln() + (n - k + beta - 1.0) * (1.0 - p).ln();
        let post_prop = (k + alpha - 1.0) * p_prop.ln() + (n - k + beta - 1.0) * (1.0 - p_prop).ln();
        let accept = (post_prop - post_curr).exp().min(1.0);
        if rng.gen::<f64>() < accept {
            p = p_prop;
        }
        sum += p;
    }
    sum / steps as f64
}

fn main() {
    let (k, n, alpha, beta) = (3.0, 5.0, 1.0, 1.0);
    println!("Posterior mean p (MCMC): {}", metropolis_bern(k, n, alpha, beta, 10_000));

    // Normal-mean MCMC with a N(0, 10) prior follows the same pattern; omitted for brevity.
}
```
:::
The code implements conjugate sampling, MCMC posterior estimation, and Bayesian linear regression.
9. Symbolic Bayesian with SymPy
Conjugate models admit exact symbolic posteriors.
::: code-group
```python [Python]
from sympy import symbols, integrate

p, k, n, alpha, beta_sym = symbols('p k n alpha beta', positive=True)

# Unnormalized Beta posterior kernel
post = p**(k + alpha - 1) * (1 - p)**(n - k + beta_sym - 1)
norm = integrate(post, (p, 0, 1))
mean = integrate(p * post, (p, 0, 1)) / norm
print("Posterior mean:", mean)
```

```rust [Rust]
fn main() {
    // Closed form for the Beta posterior mean
    println!("Posterior mean: (k + alpha)/(n + alpha + beta)");
}
```
:::
10. Challenges in Bayesian ML
- Computation: MCMC and VI are only approximate and can be expensive.
- Prior Sensitivity: results depend on a subjective prior choice.
- Scalability: high-dimensional posteriors are hard to explore.
11. Key ML Takeaways
- Bayesian inference updates beliefs: prior → posterior via Bayes’ theorem.
- Conjugate priors give closed-form posteriors.
- MCMC and VI approximate intractable posteriors in complex models.
- Posteriors quantify uncertainty for downstream decisions.
- The code above implements each inference approach numerically and symbolically.
Bayesian inference empowers ML under uncertainty.
12. Summary
We explored Bayesian inference from intuition through conjugate priors to approximate methods (MCMC, VI), with ML applications. Worked examples and Python/Rust code bridge theory to practice, concluding the series with a solid probabilistic ML foundation.
Further Reading
- Gelman et al., Bayesian Data Analysis.
- Murphy, Probabilistic Machine Learning (Ch. 7–9).
- McElreath, Statistical Rethinking.
- Rust: the ‘rand’ crate for sampling, custom MCMC implementations.