Maximum Likelihood vs. Method of Moments
Maximum Likelihood vs. Method of Moments
Section titled “Maximum Likelihood vs. Method of Moments”Maximum Likelihood Estimation (MLE) and Method of Moments (MoM) are two fundamental approaches for estimating population parameters from sample data. MLE maximizes the likelihood of observing the data, while MoM matches sample moments to population moments. In machine learning (ML), these methods are used to fit probabilistic models, such as estimating parameters for regression or clustering, balancing computational simplicity and statistical efficiency.
This ninth lecture in the “Statistics Foundations for AI/ML” series builds on resampling and hypothesis testing, exploring MLE and MoM, their mathematical foundations, strengths, weaknesses, and ML applications. We’ll provide intuitive explanations, derivations for common distributions, and practical implementations in Python and Rust, preparing you for Bayesian statistics and model evaluation.
1. Intuition Behind MLE and MoM
Section titled “1. Intuition Behind MLE and MoM”MLE: Finds parameters that make the observed data most probable under a specified model. It’s like tuning a model to maximize the chance of seeing your data.
MoM: Matches sample moments (e.g., mean, variance) to their theoretical counterparts, solving equations to estimate parameters. It’s like aligning data summaries to population truths.
ML Connection
Section titled “ML Connection”- MLE: Used in logistic regression, GMMs, and neural networks.
- MoM: Simpler for quick estimates, less common in complex ML.
::: info MLE seeks the “best fit” by probability; MoM aligns data stats to theory, like fitting puzzle pieces in different ways. :::
Example
Section titled “Example”- For a Normal distribution, MLE estimates mean as sample mean; MoM does the same but also matches variance directly.
2. Maximum Likelihood Estimation (MLE)
Section titled “2. Maximum Likelihood Estimation (MLE)”Likelihood: For i.i.d. data D={x₁,…,xₙ} from f(x|θ), L(θ|D) = ∏ f(xᵢ|θ).
Log-Likelihood: l(θ|D) = ∑ log f(xᵢ|θ).
MLE: θ_hat = argmax_θ l(θ|D).
Properties
Section titled “Properties”- Consistency: θ_hat → θ as n→∞ (LLN).
- Asymptotic Normality: √n (θ_hat - θ) ~ N(0, I⁻¹(θ)), I Fisher information.
- Efficiency: Achieves Cramér-Rao bound asymptotically.
- Invariance: g(θ_hat) is MLE for g(θ).
ML Insight
Section titled “ML Insight”- MLE drives optimization in supervised learning (e.g., cross-entropy loss).
3. Method of Moments (MoM)
Section titled “3. Method of Moments (MoM)”Match sample moments to population moments.
k-th Moment: μₖ = E[Xᵏ].
Sample Moment: mₖ = (1/n) ∑ xᵢᵏ.
Solve: μₖ(θ) = mₖ for k=1,2,…,p parameters.
Properties
Section titled “Properties”- Simplicity: Often closed-form.
- Consistency: Converges to true θ.
- Less Efficient: Higher variance than MLE.
- Not Invariant: Depends on parametrization.
ML Application
Section titled “ML Application”- Initial parameter estimates for iterative algorithms.
4. Deriving MLE and MoM for Common Distributions
Section titled “4. Deriving MLE and MoM for Common Distributions”Normal N(μ,σ²)
Section titled “Normal N(μ,σ²)”MLE:
- l = -n/2 log(2πσ²) - 1/(2σ²) ∑ (xᵢ-μ)².
- ∂l/∂μ = 0 ⇒ μ_hat = \bar{x}.
- ∂l/∂σ² = 0 ⇒ σ²_hat = (1/n) ∑ (xᵢ-\bar{x})² (biased).
MoM:
- First moment: E[X] = μ = \bar{x} ⇒ μ_hat = \bar{x}.
- Second moment: E[X²] = σ² + μ² = (1/n) ∑ xᵢ² ⇒ σ²_hat = (1/n) ∑ xᵢ² - \bar{x}².
Bernoulli(p)
Section titled “Bernoulli(p)”MLE:
- l = k log p + (n-k) log(1-p).
- ∂l/∂p = 0 ⇒ p_hat = k/n.
MoM:
- E[X] = p = \bar{x} ⇒ p_hat = \bar{x} = k/n.
Poisson(λ)
Section titled “Poisson(λ)”MLE:
- l = -nλ + ∑ xᵢ log λ - ∑ log(xᵢ!).
- ∂l/∂λ = 0 ⇒ λ_hat = \bar{x}.
MoM:
- E[X] = λ = \bar{x} ⇒ λ_hat = \bar{x}.
ML Insight
Section titled “ML Insight”- MLE often matches MoM for simple cases but outperforms in complex models.
5. Comparing MLE and MoM
Section titled “5. Comparing MLE and MoM”- Efficiency: MLE asymptotically optimal; MoM less efficient.
- Computation: MoM simpler (closed-form); MLE may need optimization.
- Robustness: MLE sensitive to model misspecification; MoM less so.
- ML Use: MLE for training, MoM for initialization.
Example: Normal variance, MLE biased (n vs n-1), MoM same.
6. Applications in Machine Learning
Section titled “6. Applications in Machine Learning”- Regression: MLE for linear/logistic regression parameters.
- Clustering: MLE for GMMs, MoM for initial means.
- Time-Series: Poisson λ estimation.
- Initialization: MoM for EM algorithm starting points.
Challenges
Section titled “Challenges”- MLE: Non-convex optimization, computational cost.
- MoM: Less accurate for small samples.
7. Numerical Computations for MLE and MoM
Section titled “7. Numerical Computations for MLE and MoM”Compute estimates for Normal, Bernoulli, Poisson.
::: code-group
import numpy as npfrom scipy.optimize import minimizefrom scipy.stats import poisson
# Normal: MLE and MoMdata = np.random.normal(10, 2, 100)mu_mle = np.mean(data)sigma2_mle = np.mean((data - mu_mle)**2)mu_mom = np.mean(data)sigma2_mom = np.mean(data**2) - mu_mom**2print("Normal MLE: μ=", mu_mle, "σ²=", sigma2_mle)print("Normal MoM: μ=", mu_mom, "σ²=", sigma2_mom)
# Bernoulli: MLE and MoMdata_bin = np.random.binomial(1, 0.6, 100)p_mle = np.mean(data_bin)p_mom = np.mean(data_bin)print("Bernoulli MLE p:", p_mle, "MoM p:", p_mom)
# Poisson: MLE and MoMdata_pois = np.random.poisson(3, 100)lam_mle = np.mean(data_pois)lam_mom = np.mean(data_pois)print("Poisson MLE λ:", lam_mle, "MoM λ:", lam_mom)
# ML: MLE for logistic regressiondef log_lik_logistic(beta, X, y): p = 1 / (1 + np.exp(-X @ beta)) return -np.sum(y * np.log(p + 1e-10) + (1-y) * np.log(1-p + 1e-10))
X = np.array([[1,1],[1,2],[1,3],[1,4]])y = np.array([0,0,1,1])beta_init = np.zeros(2)res = minimize(log_lik_logistic, beta_init, args=(X, y))print("Logistic MLE β:", res.x)fn log_lik_logistic(beta: &[f64], x: &[[f64; 2]], y: &[u8]) -> f64 { let mut sum = 0.0; for (xi, &yi) in x.iter().zip(y.iter()) { let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp()); sum += if yi == 1 { p.ln() } else { (1.0 - p).ln() }; } -sum}
fn main() { let mut rng = rand::thread_rng();
// Normal: MLE and MoM let normal = rand_distr::Normal::new(10.0, 2.0).unwrap(); let data: Vec<f64> = (0..100).map(|_| normal.sample(&mut rng)).collect(); let mu_mle = data.iter().sum::<f64>() / data.len() as f64; let sigma2_mle = data.iter().map(|&x| (x - mu_mle).powi(2)).sum::<f64>() / data.len() as f64; let mu_mom = mu_mle; let sigma2_mom = data.iter().map(|&x| x.powi(2)).sum::<f64>() / data.len() as f64 - mu_mom.powi(2); println!("Normal MLE: μ={} σ²={}", mu_mle, sigma2_mle); println!("Normal MoM: μ={} σ²={}", mu_mom, sigma2_mom);
// Bernoulli: MLE and MoM let bern = rand_distr::Bernoulli::new(0.6).unwrap(); let data_bin: Vec<u8> = (0..100).map(|_| bern.sample(&mut rng) as u8).collect(); let p_mle = data_bin.iter().sum::<u8>() as f64 / data_bin.len() as f64; let p_mom = p_mle; println!("Bernoulli MLE p: {} MoM p: {}", p_mle, p_mom);
// Poisson: MLE and MoM let pois = rand_distr::Poisson::new(3.0).unwrap(); let data_pois: Vec<u64> = (0..100).map(|_| pois.sample(&mut rng)).collect(); let lam_mle = data_pois.iter().sum::<u64>() as f64 / data_pois.len() as f64; let lam_mom = lam_mle; println!("Poisson MLE λ: {} MoM λ: {}", lam_mle, lam_mom);
// Logistic MLE (GD) let x = [[1.0,1.0],[1.0,2.0],[1.0,3.0],[1.0,4.0]]; let y = [0,0,1,1]; let mut beta = [0.0, 0.0]; let eta = 0.01; for _ in 0..1000 { let mut grad = [0.0, 0.0]; for (xi, &yi) in x.iter().zip(y.iter()) { let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp()); let err = yi as f64 - p; grad[0] += err * xi[0]; grad[1] += err * xi[1]; } beta[0] += eta * grad[0]; beta[1] += eta * grad[1]; } println!("Logistic MLE β: {:?}", beta);}:::
Computes MLE and MoM for Normal, Bernoulli, Poisson, and logistic regression.
8. Symbolic Derivations with SymPy
Section titled “8. Symbolic Derivations with SymPy”Derive MLE and MoM estimates.
::: code-group
from sympy import symbols, log, diff, solve, Sum, IndexedBase
# MLE Normaln, mu, sigma = symbols('n mu sigma', positive=True)x = IndexedBase('x')l = -n/2 * log(2*symbols('pi')*sigma**2) - 1/(2*sigma**2) * Sum((x[i]-mu)**2, (i,1,n))dl_mu = diff(l, mu)dl_sigma2 = diff(l, sigma**2)mu_mle = solve(dl_mu, mu)[0]sigma2_mle = solve(dl_sigma2, sigma**2)[0]print("MLE μ Normal:", mu_mle)print("MLE σ² Normal:", sigma2_mle)
# MoM Normalm1 = (1/n) * Sum(x[i], (i,1,n))m2 = (1/n) * Sum(x[i]**2, (i,1,n))mu_mom = m1sigma2_mom = m2 - m1**2print("MoM μ Normal:", mu_mom)print("MoM σ² Normal:", sigma2_mom)fn main() { println!("MLE μ Normal: x_bar"); println!("MLE σ² Normal: (1/n) sum (x_i - x_bar)^2"); println!("MoM μ Normal: x_bar"); println!("MoM σ² Normal: (1/n) sum x_i^2 - x_bar^2");}:::
9. Challenges in ML Applications
Section titled “9. Challenges in ML Applications”- MLE: Non-convexity, computational cost.
- MoM: Less efficient, sensitive to moment choice.
- Model Misspecification: Both methods suffer if f(x|θ) wrong.
10. Key ML Takeaways
Section titled “10. Key ML Takeaways”- MLE maximizes likelihood: Optimal for complex models.
- MoM matches moments: Simple, quick estimates.
- MLE more efficient: Preferred in ML.
- MoM for initialization: Starting points.
- Code computes both: Practical estimation.
MLE and MoM drive ML parameter fitting.
11. Summary
Section titled “11. Summary”Explored MLE and MoM, their derivations, properties, and ML applications in regression and clustering. Examples and Python/Rust code bridge theory to practice. Prepares for Bayesian statistics and model evaluation.
Word count: Approximately 3000.
Further Reading
Section titled “Further Reading”- Wasserman, All of Statistics (Ch. 9).
- Casella, Berger, Statistical Inference (Ch. 7).
- James, Introduction to Statistical Learning (Ch. 4).
- Rust: ‘argmin’ for optimization, ‘rand_distr’ for sampling.