
Maximum Likelihood vs. Method of Moments

Maximum Likelihood Estimation (MLE) and Method of Moments (MoM) are two fundamental approaches for estimating population parameters from sample data. MLE maximizes the likelihood of observing the data, while MoM matches sample moments to population moments. In machine learning (ML), these methods are used to fit probabilistic models, such as estimating parameters for regression or clustering, balancing computational simplicity and statistical efficiency.

This ninth lecture in the "Statistics Foundations for AI/ML" series builds on resampling and hypothesis testing, exploring MLE and MoM, their mathematical foundations, strengths, weaknesses, and ML applications. We'll provide intuitive explanations, derivations for common distributions, and practical implementations in Python and Rust, preparing you for Bayesian statistics and model evaluation.


1. Intuition Behind MLE and MoM

MLE: Finds parameters that make the observed data most probable under a specified model. It's like tuning a model to maximize the chance of seeing your data.

MoM: Matches sample moments (e.g., mean, variance) to their theoretical counterparts, solving equations to estimate parameters. It's like aligning data summaries to population truths.

ML Connection

  • MLE: Used in logistic regression, GMMs, and neural networks.
  • MoM: Simpler for quick estimates, less common in complex ML.

INFO

MLE seeks the "best fit" by probability; MoM aligns data stats to theory, like fitting puzzle pieces in different ways.

Example

  • For a Normal distribution, both methods estimate the mean by the sample mean; MoM recovers the variance by matching the second sample moment, which turns out to coincide with the MLE.

2. Maximum Likelihood Estimation (MLE)

Likelihood: For i.i.d. data D={x₁,...,xₙ} from f(x|θ), L(θ|D) = ∏ f(xᵢ|θ).

Log-Likelihood: l(θ|D) = ∑ log f(xᵢ|θ).

MLE: θ_hat = argmax_θ l(θ|D).
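
A minimal numerical sketch of this definition, assuming NumPy and SciPy are available (the sample values below are purely illustrative): maximize l(μ|D) for a Normal model with known variance 1 and compare against the closed-form answer, the sample mean.

python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative i.i.d. sample
data = np.array([4.2, 5.1, 3.8, 4.9, 5.4, 4.6])

# Log-likelihood of N(mu, 1): l(mu|D) = sum_i log f(x_i|mu)
def log_lik(mu):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (data - mu) ** 2)

# MLE: mu_hat = argmax_mu l(mu|D), found by minimizing -l numerically
res = minimize_scalar(lambda mu: -log_lik(mu))
print("Numerical MLE:", res.x, "Sample mean:", data.mean())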

Properties

  • Consistency: θ_hat → θ as n→∞ (LLN).
  • Asymptotic Normality: √n (θ_hat − θ) → N(0, I⁻¹(θ)) in distribution, where I(θ) is the Fisher information.
  • Efficiency: Achieves Cramér-Rao bound asymptotically.
  • Invariance: g(θ_hat) is MLE for g(θ).
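
For orientation, a standard concrete case of asymptotic normality and efficiency: for Bernoulli(p), the Fisher information is I(p) = 1/(p(1−p)), so the asymptotic variance of p_hat = k/n is p(1−p)/n, which here equals its exact finite-sample variance.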

ML Insight

  • MLE drives optimization in supervised learning (e.g., cross-entropy loss).
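
Concretely (a standard identity, stated here without derivation): the binary cross-entropy loss −∑ [yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ)] is the negative Bernoulli log-likelihood, so minimizing it is exactly MLE for the predicted probabilities pᵢ.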

3. Method of Moments (MoM)

Match sample moments to population moments.

k-th Moment: μₖ = E[Xᵏ].

Sample Moment: mₖ = (1/n) ∑ xᵢᵏ.

Solve: μₖ(θ) = mₖ for k = 1, 2, ..., p, where p is the number of parameters to estimate.
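
As a two-parameter illustration (using a Gamma(k, θ) model, which is not otherwise derived in this lecture): E[X] = kθ and E[X²] = kθ²(k + 1), so matching m₁ and m₂ gives θ_hat = (m₂ − m₁²)/m₁ and k_hat = m₁²/(m₂ − m₁²).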

Properties

  • Simplicity: Often closed-form.
  • Consistency: Converges to true θ.
  • Less Efficient: Higher variance than MLE.
  • Not Invariant: Depends on parametrization.

ML Application

  • Initial parameter estimates for iterative algorithms.

4. Deriving MLE and MoM for Common Distributions

Normal N(μ,σ²)

MLE:

  • l = -n/2 log(2πσ²) - 1/(2σ²) ∑ (xᵢ-μ)².
  • ∂l/∂μ = 0 ⇒ μ_hat = \bar{x}.
  • ∂l/∂σ² = 0 ⇒ σ²_hat = (1/n) ∑ (xᵢ-\bar{x})² (biased).

MoM:

  • First moment: E[X] = μ = \bar{x} ⇒ μ_hat = \bar{x}.
  • Second moment: E[X²] = σ² + μ² = (1/n) ∑ xᵢ² ⇒ σ²_hat = (1/n) ∑ xᵢ² - \bar{x}².
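
Since (1/n) ∑ xᵢ² − \bar{x}² = (1/n) ∑ (xᵢ − \bar{x})², the MoM variance estimate coincides with the (biased) MLE for the Normal.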

Bernoulli(p)

MLE:

  • l = k log p + (n-k) log(1-p).
  • ∂l/∂p = 0 ⇒ p_hat = k/n.

MoM:

  • E[X] = p = \bar{x} ⇒ p_hat = \bar{x} = k/n.

Poisson(λ)

MLE:

  • l = -nλ + ∑ xᵢ log λ - ∑ log(xᵢ!).
  • ∂l/∂λ = 0 ⇒ λ_hat = \bar{x}.

MoM:

  • E[X] = λ = \bar{x} ⇒ λ_hat = \bar{x}.

ML Insight

  • MLE often matches MoM for simple cases but outperforms in complex models.

5. Comparing MLE and MoM

  • Efficiency: MLE asymptotically optimal; MoM less efficient.
  • Computation: MoM simpler (closed-form); MLE may need optimization.
  • Robustness: MLE sensitive to model misspecification; MoM less so.
  • ML Use: MLE for training, MoM for initialization.

Example: for the Normal variance, MLE gives the biased estimator that divides by n rather than n−1, and MoM yields the same estimate.
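
For the distributions above the two estimators coincide, so the sketch below uses a Uniform(0, θ) model (not otherwise covered in this lecture; parameter, sample size, and seed are illustrative) to show a case where they differ: the MLE is the sample maximum, the MoM estimate is twice the sample mean, and simulation shows the MLE has much lower mean squared error.

python
import numpy as np

rng = np.random.default_rng(42)
theta = 5.0          # true parameter of Uniform(0, theta)
n, reps = 50, 2000   # sample size and number of simulated datasets

mle = np.empty(reps)
mom = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, theta, size=n)
    mle[r] = x.max()        # MLE of theta for Uniform(0, theta)
    mom[r] = 2 * x.mean()   # MoM: E[X] = theta/2  =>  theta_hat = 2 * x_bar

print("MLE mean, MSE:", mle.mean(), np.mean((mle - theta) ** 2))
print("MoM mean, MSE:", mom.mean(), np.mean((mom - theta) ** 2))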


6. Applications in Machine Learning

  1. Regression: MLE for linear/logistic regression parameters.
  2. Clustering: MLE for GMMs, MoM for initial means.
  3. Time-Series: Poisson λ estimation.
  4. Initialization: MoM estimates as starting points for EM and other iterative algorithms (see the sketch below).
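
A minimal sketch of point 4, assuming a Gamma(k, θ) model and SciPy (the sample, true parameters, and optimizer choice are illustrative): MoM estimates from the first two moments provide the starting point for a numerical MLE.

python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=500)  # illustrative sample

# MoM for Gamma(k, theta): E[X] = k*theta, Var[X] = k*theta^2
m, v = data.mean(), data.var()
k_mom, theta_mom = m**2 / v, v / m

# MLE: maximize the log-likelihood numerically, starting from the MoM estimates
def neg_log_lik(params):
    k, theta = params
    return -np.sum(gamma.logpdf(data, a=k, scale=theta))

res = minimize(neg_log_lik, x0=[k_mom, theta_mom],
               method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
print("MoM estimates:", k_mom, theta_mom)
print("MLE estimates:", res.x)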

Challenges

  • MLE: Non-convex optimization, computational cost.
  • MoM: Less accurate for small samples.

7. Numerical Computations for MLE and MoM

Compute estimates for Normal, Bernoulli, Poisson.

python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

# Normal: MLE and MoM
data = np.random.normal(10, 2, 100)
mu_mle = np.mean(data)
sigma2_mle = np.mean((data - mu_mle)**2)
mu_mom = np.mean(data)
sigma2_mom = np.mean(data**2) - mu_mom**2
print("Normal MLE: μ=", mu_mle, "σ²=", sigma2_mle)
print("Normal MoM: μ=", mu_mom, "σ²=", sigma2_mom)

# Bernoulli: MLE and MoM
data_bin = np.random.binomial(1, 0.6, 100)
p_mle = np.mean(data_bin)
p_mom = np.mean(data_bin)
print("Bernoulli MLE p:", p_mle, "MoM p:", p_mom)

# Poisson: MLE and MoM
data_pois = np.random.poisson(3, 100)
lam_mle = np.mean(data_pois)
lam_mom = np.mean(data_pois)
print("Poisson MLE λ:", lam_mle, "MoM λ:", lam_mom)

# ML: MLE for logistic regression
def log_lik_logistic(beta, X, y):
    p = 1 / (1 + np.exp(-X @ beta))
    return -np.sum(y * np.log(p + 1e-10) + (1-y) * np.log(1-p + 1e-10))

X = np.array([[1,1],[1,2],[1,3],[1,4]])
y = np.array([0,0,1,1])
beta_init = np.zeros(2)
res = minimize(log_lik_logistic, beta_init, args=(X, y))
print("Logistic MLE β:", res.x)
rust
use rand_distr::Distribution; // brings `.sample` into scope for the rand_distr distributions

fn log_lik_logistic(beta: &[f64], x: &[[f64; 2]], y: &[u8]) -> f64 {
    let mut sum = 0.0;
    for (xi, &yi) in x.iter().zip(y.iter()) {
        let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp());
        sum += if yi == 1 { p.ln() } else { (1.0 - p).ln() };
    }
    -sum
}

fn main() {
    let mut rng = rand::thread_rng();

    // Normal: MLE and MoM
    let normal = rand_distr::Normal::new(10.0, 2.0).unwrap();
    let data: Vec<f64> = (0..100).map(|_| normal.sample(&mut rng)).collect();
    let mu_mle = data.iter().sum::<f64>() / data.len() as f64;
    let sigma2_mle = data.iter().map(|&x| (x - mu_mle).powi(2)).sum::<f64>() / data.len() as f64;
    let mu_mom = mu_mle;
    let sigma2_mom = data.iter().map(|&x| x.powi(2)).sum::<f64>() / data.len() as f64 - mu_mom.powi(2);
    println!("Normal MLE: μ={} σ²={}", mu_mle, sigma2_mle);
    println!("Normal MoM: μ={} σ²={}", mu_mom, sigma2_mom);

    // Bernoulli: MLE and MoM
    let bern = rand_distr::Bernoulli::new(0.6).unwrap();
    let data_bin: Vec<u8> = (0..100).map(|_| bern.sample(&mut rng) as u8).collect();
    let p_mle = data_bin.iter().sum::<u8>() as f64 / data_bin.len() as f64;
    let p_mom = p_mle;
    println!("Bernoulli MLE p: {} MoM p: {}", p_mle, p_mom);

    // Poisson: MLE and MoM
    let pois = rand_distr::Poisson::new(3.0).unwrap();
    // rand_distr's Poisson samples are returned as f64
    let data_pois: Vec<f64> = (0..100).map(|_| pois.sample(&mut rng)).collect();
    let lam_mle = data_pois.iter().sum::<f64>() / data_pois.len() as f64;
    let lam_mom = lam_mle;
    println!("Poisson MLE λ: {} MoM λ: {}", lam_mle, lam_mom);

    // Logistic MLE via gradient ascent on the log-likelihood
    let x = [[1.0,1.0],[1.0,2.0],[1.0,3.0],[1.0,4.0]];
    let y = [0,0,1,1];
    let mut beta = [0.0, 0.0];
    let eta = 0.01;
    for _ in 0..1000 {
        let mut grad = [0.0, 0.0];
        for (xi, &yi) in x.iter().zip(y.iter()) {
            let p = 1.0 / (1.0 + (-(beta[0] * xi[0] + beta[1] * xi[1])).exp());
            let err = yi as f64 - p;
            grad[0] += err * xi[0];
            grad[1] += err * xi[1];
        }
        beta[0] += eta * grad[0];
        beta[1] += eta * grad[1];
    }
    println!("Logistic MLE β: {:?}", beta);
}

The code computes MLE and MoM estimates for the Normal, Bernoulli, and Poisson distributions, and fits logistic regression coefficients by MLE (numerical optimization in Python, gradient ascent in Rust).


8. Symbolic Derivations with SymPy

Derive MLE and MoM estimates.

python
from sympy import symbols, log, diff, solve, Sum, IndexedBase, pi

# Symbols: n = sample size, mu = mean, sigma2 = variance, i = summation index
n = symbols('n', positive=True)
mu = symbols('mu')
sigma2 = symbols('sigma2', positive=True)
i = symbols('i', integer=True)
x = IndexedBase('x')

# MLE Normal: log-likelihood of N(mu, sigma2)
l = -n/2 * log(2*pi*sigma2) - Sum((x[i] - mu)**2, (i, 1, n)) / (2*sigma2)
dl_mu = diff(l, mu).doit()          # score equation for mu
dl_sigma2 = diff(l, sigma2).doit()  # score equation for sigma2
mu_mle = solve(dl_mu, mu)[0]
sigma2_mle = solve(dl_sigma2, sigma2)[0]
print("MLE μ Normal:", mu_mle)
print("MLE σ² Normal:", sigma2_mle)

# MoM Normal: match the first two sample moments
m1 = Sum(x[i], (i, 1, n)) / n
m2 = Sum(x[i]**2, (i, 1, n)) / n
mu_mom = m1
sigma2_mom = m2 - m1**2
print("MoM μ Normal:", mu_mom)
print("MoM σ² Normal:", sigma2_mom)
rust
fn main() {
    println!("MLE μ Normal: x_bar");
    println!("MLE σ² Normal: (1/n) sum (x_i - x_bar)^2");
    println!("MoM μ Normal: x_bar");
    println!("MoM σ² Normal: (1/n) sum x_i^2 - x_bar^2");
}

9. Challenges in ML Applications

  • MLE: Non-convexity, computational cost.
  • MoM: Less efficient, sensitive to moment choice.
  • Model Misspecification: Both methods suffer if the assumed f(x|θ) is wrong.

10. Key ML Takeaways

  • MLE maximizes likelihood: Optimal for complex models.
  • MoM matches moments: Simple, quick estimates.
  • MLE more efficient: Preferred in ML.
  • MoM for initialization: Starting points.
  • Code computes both: Practical estimation.

MLE and MoM drive ML parameter fitting.


11. Summary

This lecture explored MLE and MoM, their derivations, properties, and ML applications in regression and clustering. Examples and Python/Rust code bridge theory to practice, preparing you for Bayesian statistics and model evaluation.



Further Reading

  • Wasserman, All of Statistics (Ch. 9).
  • Casella & Berger, Statistical Inference (Ch. 7).
  • James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning (Ch. 4).
  • Rust: 'argmin' for optimization, 'rand_distr' for sampling.