Maximum Likelihood vs. Method of Moments

Maximum Likelihood Estimation (MLE) and Method of Moments (MoM) are two fundamental approaches for estimating population parameters from sample data. MLE maximizes the likelihood of observing the data, while MoM matches sample moments to population moments. In machine learning (ML), these methods are used to fit probabilistic models, such as estimating parameters for regression or clustering, balancing computational simplicity and statistical efficiency.

This ninth lecture in the “Statistics Foundations for AI/ML” series builds on resampling and hypothesis testing, exploring MLE and MoM, their mathematical foundations, strengths, weaknesses, and ML applications. We’ll provide intuitive explanations, derivations for common distributions, and practical implementations in Python and Rust, preparing you for Bayesian statistics and model evaluation.

1. Intuition Behind MLE and MoM

MLE: Finds parameters that make the observed data most probable under a specified model. It’s like tuning a model to maximize the chance of seeing your data.

MoM: Matches sample moments (e.g., mean, variance) to their theoretical counterparts, solving equations to estimate parameters. It’s like aligning data summaries to population truths.

ML Connection

MLE: Used in logistic regression, GMMs, and neural networks.
MoM: Simpler for quick estimates, less common in complex ML.

::: info MLE seeks the “best fit” by probability; MoM aligns data stats to theory, like fitting puzzle pieces in different ways. :::

Example

For a Normal distribution, MLE estimates mean as sample mean; MoM does the same but also matches variance directly.

2. Maximum Likelihood Estimation (MLE)

Likelihood: For i.i.d. data D={x₁,…,xₙ} from f(x|θ), L(θ|D) = ∏ f(xᵢ|θ).

Log-Likelihood: l(θ|D) = ∑ log f(xᵢ|θ).

MLE: θ_hat = argmax_θ l(θ|D).

Properties

Consistency: θ_hat → θ as n→∞ (LLN).
Asymptotic Normality: √n (θ_hat - θ) ~ N(0, I⁻¹(θ)), I Fisher information.
Efficiency: Achieves Cramér-Rao bound asymptotically.
Invariance: g(θ_hat) is MLE for g(θ).

ML Insight

MLE drives optimization in supervised learning (e.g., cross-entropy loss).

3. Method of Moments (MoM)

Match sample moments to population moments.

k-th Moment: μₖ = E[Xᵏ].

Sample Moment: mₖ = (1/n) ∑ xᵢᵏ.

Solve: μₖ(θ) = mₖ for k=1,2,…,p parameters.

Properties

Simplicity: Often closed-form.
Consistency: Converges to true θ.
Less Efficient: Higher variance than MLE.
Not Invariant: Depends on parametrization.

ML Application

Initial parameter estimates for iterative algorithms.

4. Deriving MLE and MoM for Common Distributions

Normal N(μ,σ²)

MLE:

l = -n/2 log(2πσ²) - 1/(2σ²) ∑ (xᵢ-μ)².
∂l/∂μ = 0 ⇒ μ_hat = \bar{x}.
∂l/∂σ² = 0 ⇒ σ²_hat = (1/n) ∑ (xᵢ-\bar{x})² (biased).

MoM:

First moment: E[X] = μ = \bar{x} ⇒ μ_hat = \bar{x}.
Second moment: E[X²] = σ² + μ² = (1/n) ∑ xᵢ² ⇒ σ²_hat = (1/n) ∑ xᵢ² - \bar{x}².

Bernoulli(p)

MLE:

l = k log p + (n-k) log(1-p).
∂l/∂p = 0 ⇒ p_hat = k/n.

MoM:

E[X] = p = \bar{x} ⇒ p_hat = \bar{x} = k/n.

Poisson(λ)

MLE:

l = -nλ + ∑ xᵢ log λ - ∑ log(xᵢ!).
∂l/∂λ = 0 ⇒ λ_hat = \bar{x}.

MoM:

E[X] = λ = \bar{x} ⇒ λ_hat = \bar{x}.

ML Insight

MLE often matches MoM for simple cases but outperforms in complex models.

5. Comparing MLE and MoM

Efficiency: MLE asymptotically optimal; MoM less efficient.
Computation: MoM simpler (closed-form); MLE may need optimization.
Robustness: MLE sensitive to model misspecification; MoM less so.
ML Use: MLE for training, MoM for initialization.

Example: Normal variance, MLE biased (n vs n-1), MoM same.

6. Applications in Machine Learning

Regression: MLE for linear/logistic regression parameters.
Clustering: MLE for GMMs, MoM for initial means.
Time-Series: Poisson λ estimation.
Initialization: MoM for EM algorithm starting points.

Challenges

MLE: Non-convex optimization, computational cost.
MoM: Less accurate for small samples.

7. Numerical Computations for MLE and MoM

Compute estimates for Normal, Bernoulli, Poisson.

::: code-group

import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

# Normal: MLE and MoM
data = np.random.normal(10, 2, 100)
mu_mle = np.mean(data)
sigma2_mle = np.mean((data - mu_mle)**2)
mu_mom = np.mean(data)
sigma2_mom = np.mean(data**2) - mu_mom**2
print("Normal MLE: μ=", mu_mle, "σ²=", sigma2_mle)
print("Normal MoM: μ=", mu_mom, "σ²=", sigma2_mom)

# Bernoulli: MLE and MoM
data_bin = np.random.binomial(1, 0.6, 100)
p_mle = np.mean(data_bin)
p_mom = np.mean(data_bin)
print("Bernoulli MLE p:", p_mle, "MoM p:", p_mom)

# Poisson: MLE and MoM
data_pois = np.random.poisson(3, 100)
lam_mle = np.mean(data_pois)
lam_mom = np.mean(data_pois)
print("Poisson MLE λ:", lam_mle, "MoM λ:", lam_mom)

# ML: MLE for logistic regression
def log_lik_logistic(beta, X, y):
    p = 1 / (1 + np.exp(-X @ beta))
    return -np.sum(y * np.log(p + 1e-10) + (1-y) * np.log(1-p + 1e-10))

X = np.array([[1,1],[1,2],[1,3],[1,4]])
y = np.array([0,0,1,1])
beta_init = np.zeros(2)
res = minimize(log_lik_logistic, beta_init, args=(X, y))
print("Logistic MLE β:", res.x)

fn log_lik_logistic(beta: &[f64], x: &[[f64; 2]], y: &[u8]) -> f64 {
    let mut sum = 0.0;
    for (xi, &yi) in x.iter().zip(y.iter()) {
        let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp());
        sum += if yi == 1 { p.ln() } else { (1.0 - p).ln() };
    }
    -sum
}

fn main() {
    let mut rng = rand::thread_rng();

    // Normal: MLE and MoM
    let normal = rand_distr::Normal::new(10.0, 2.0).unwrap();
    let data: Vec<f64> = (0..100).map(|_| normal.sample(&mut rng)).collect();
    let mu_mle = data.iter().sum::<f64>() / data.len() as f64;
    let sigma2_mle = data.iter().map(|&x| (x - mu_mle).powi(2)).sum::<f64>() / data.len() as f64;
    let mu_mom = mu_mle;
    let sigma2_mom = data.iter().map(|&x| x.powi(2)).sum::<f64>() / data.len() as f64 - mu_mom.powi(2);
    println!("Normal MLE: μ={} σ²={}", mu_mle, sigma2_mle);
    println!("Normal MoM: μ={} σ²={}", mu_mom, sigma2_mom);

    // Bernoulli: MLE and MoM
    let bern = rand_distr::Bernoulli::new(0.6).unwrap();
    let data_bin: Vec<u8> = (0..100).map(|_| bern.sample(&mut rng) as u8).collect();
    let p_mle = data_bin.iter().sum::<u8>() as f64 / data_bin.len() as f64;
    let p_mom = p_mle;
    println!("Bernoulli MLE p: {} MoM p: {}", p_mle, p_mom);

    // Poisson: MLE and MoM
    let pois = rand_distr::Poisson::new(3.0).unwrap();
    let data_pois: Vec<u64> = (0..100).map(|_| pois.sample(&mut rng)).collect();
    let lam_mle = data_pois.iter().sum::<u64>() as f64 / data_pois.len() as f64;
    let lam_mom = lam_mle;
    println!("Poisson MLE λ: {} MoM λ: {}", lam_mle, lam_mom);

    // Logistic MLE (GD)
    let x = [[1.0,1.0],[1.0,2.0],[1.0,3.0],[1.0,4.0]];
    let y = [0,0,1,1];
    let mut beta = [0.0, 0.0];
    let eta = 0.01;
    for _ in 0..1000 {
        let mut grad = [0.0, 0.0];
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp());
            let err = yi as f64 - p;
            grad[0] += err * xi[0];
            grad[1] += err * xi[1];
        }
        beta[0] += eta * grad[0];
        beta[1] += eta * grad[1];
    }
    println!("Logistic MLE β: {:?}", beta);
}

:::

Computes MLE and MoM for Normal, Bernoulli, Poisson, and logistic regression.

8. Symbolic Derivations with SymPy

Derive MLE and MoM estimates.

::: code-group

from sympy import symbols, log, diff, solve, Sum, IndexedBase

# MLE Normal
n, mu, sigma = symbols('n mu sigma', positive=True)
x = IndexedBase('x')
l = -n/2 * log(2*symbols('pi')*sigma**2) - 1/(2*sigma**2) * Sum((x[i]-mu)**2, (i,1,n))
dl_mu = diff(l, mu)
dl_sigma2 = diff(l, sigma**2)
mu_mle = solve(dl_mu, mu)[0]
sigma2_mle = solve(dl_sigma2, sigma**2)[0]
print("MLE μ Normal:", mu_mle)
print("MLE σ² Normal:", sigma2_mle)

# MoM Normal
m1 = (1/n) * Sum(x[i], (i,1,n))
m2 = (1/n) * Sum(x[i]**2, (i,1,n))
mu_mom = m1
sigma2_mom = m2 - m1**2
print("MoM μ Normal:", mu_mom)
print("MoM σ² Normal:", sigma2_mom)

fn main() {
    println!("MLE μ Normal: x_bar");
    println!("MLE σ² Normal: (1/n) sum (x_i - x_bar)^2");
    println!("MoM μ Normal: x_bar");
    println!("MoM σ² Normal: (1/n) sum x_i^2 - x_bar^2");
}

:::

9. Challenges in ML Applications

MLE: Non-convexity, computational cost.
MoM: Less efficient, sensitive to moment choice.
Model Misspecification: Both methods suffer if f(x|θ) wrong.

10. Key ML Takeaways

MLE maximizes likelihood: Optimal for complex models.
MoM matches moments: Simple, quick estimates.
MLE more efficient: Preferred in ML.
MoM for initialization: Starting points.
Code computes both: Practical estimation.

MLE and MoM drive ML parameter fitting.

11. Summary

Explored MLE and MoM, their derivations, properties, and ML applications in regression and clustering. Examples and Python/Rust code bridge theory to practice. Prepares for Bayesian statistics and model evaluation.

Word count: Approximately 3000.

Maximum Likelihood vs. Method of Moments

Maximum Likelihood vs. Method of Moments

1. Intuition Behind MLE and MoM

ML Connection

Example

2. Maximum Likelihood Estimation (MLE)

Properties

ML Insight

3. Method of Moments (MoM)

Properties

ML Application

4. Deriving MLE and MoM for Common Distributions

Normal N(μ,σ²)

Bernoulli(p)

Poisson(λ)

ML Insight

5. Comparing MLE and MoM

6. Applications in Machine Learning

Challenges

7. Numerical Computations for MLE and MoM

8. Symbolic Derivations with SymPy

9. Challenges in ML Applications

10. Key ML Takeaways

11. Summary

Further Reading