
Gaussian Mixture Models & EM Algorithm

Gaussian Mixture Models (GMMs) are probabilistic models that represent data as a mixture of Gaussian distributions, useful for clustering, density estimation, and generative tasks. The Expectation-Maximization (EM) algorithm is a powerful iterative method for finding maximum likelihood estimates in models with latent variables, like GMMs. In 2025, GMMs and EM remain foundational in ML for unsupervised learning, serving as baselines for advanced generative models like VAEs and diffusion models, and in applications like anomaly detection and speech recognition.

This lecture in the "Foundations for AI/ML" series (core-ml cluster) builds on clustering and Naive Bayes, exploring GMMs, the EM algorithm, their theoretical foundations, derivations, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.


1. Motivation and Intuition

GMMs assume the data is generated from a mixture of k Gaussians, each with its own mean, covariance, and mixing weight. EM fits the parameters by alternating an expectation step (softly assign points to components) with a maximization step (update the parameters given those assignments).

Why GMMs & EM in 2025?

  • Unsupervised Learning: Cluster data without labels.
  • Density Estimation: Model complex distributions.
  • Baseline: For advanced models like VAEs.
  • Modern Applications: Anomaly detection in IoT, speech processing with LLMs.

Real-World Examples

  • Audio Processing: Cluster sound features for recognition.
  • Image Segmentation: Group pixels by color distributions.
  • AI Pipelines: GMM on LLM embeddings for clustering.

INFO

GMMs are like mixing colors to paint a picture—EM iteratively refines the palette for the best fit.


2. Mathematical Formulation of GMMs

A GMM with $k$ components models the density as:

\[ p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j) \]

  • $\pi_j$: mixing weights, with $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$.
  • $\mathcal{N}(x \mid \mu_j, \Sigma_j)$: Gaussian density with mean $\mu_j$ and covariance $\Sigma_j$.

Log-likelihood:

\[ \ell(\theta) = \sum_{i=1}^{m} \log \left( \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) \right) \]

There is no closed-form MLE because the sum sits inside the logarithm, which is exactly the situation EM is designed for.

Latent Variables

Introduce a latent component indicator $z_i \sim \mathrm{Multinomial}(\pi)$ with $x_i \mid z_i = j \sim \mathcal{N}(\mu_j, \Sigma_j)$.

The complete-data joint $p(x, z \mid \theta)$ factorizes and is easy to optimize, which EM exploits.
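
To make the generative story and the "sum inside the log" concrete, here is a small NumPy/SciPy sketch with hypothetical parameters for a two-component mixture (the values are illustrative, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical 2-component GMM parameters
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

# Generative process: sample z_i, then x_i | z_i
m = 500
z = rng.choice(2, size=m, p=pi)
X = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])

# Log-likelihood: log-sum-exp over components (the sum inside the log)
log_comp = np.stack([
    np.log(pi[j]) + multivariate_normal(mu[j], Sigma[j]).logpdf(X)
    for j in range(2)
], axis=1)                                   # shape (m, k)
log_lik = np.logaddexp.reduce(log_comp, axis=1).sum()
print("Log-likelihood under the true parameters:", log_lik)
```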

ML Connection

  • GMMs model multimodal data.

3. EM Algorithm: Expectation-Maximization

EM is an iterative method for maximum likelihood estimation in models with latent variables.

E-Step: compute responsibilities
\[ r_{ij} = P(z_i = j \mid x_i, \theta^{(t)}) = \frac{\pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{l=1}^{k} \pi_l^{(t)} \, \mathcal{N}(x_i \mid \mu_l^{(t)}, \Sigma_l^{(t)})} \]

M-Step: update the parameters
\[ \theta^{(t+1)} = \arg\max_{\theta} \; \mathbb{E}_{z \sim P(z \mid x, \theta^{(t)})} \big[ \log p(x, z \mid \theta) \big] \]

For a GMM the M-step has closed-form updates (a from-scratch sketch follows this list):

  • $\pi_j = \frac{1}{m} \sum_{i=1}^{m} r_{ij}$
  • $\mu_j = \dfrac{\sum_{i=1}^{m} r_{ij} \, x_i}{\sum_{i=1}^{m} r_{ij}}$
  • $\Sigma_j = \dfrac{\sum_{i=1}^{m} r_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^{m} r_{ij}}$
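
A minimal from-scratch NumPy sketch of these updates (our own illustration, not a library API; it omits the log-space responsibilities and covariance regularization that production code would need):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    # Responsibilities r[i, j] = P(z_i = j | x_i, theta)
    m, k = X.shape[0], len(pi)
    r = np.zeros((m, k))
    for j in range(k):
        r[:, j] = pi[j] * multivariate_normal(mu[j], Sigma[j]).pdf(X)
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    # Closed-form updates for weights, means, covariances
    m, k = r.shape
    Nj = r.sum(axis=0)                       # effective counts per component
    pi = Nj / m
    mu = (r.T @ X) / Nj[:, None]
    Sigma = np.zeros((k, X.shape[1], X.shape[1]))
    for j in range(k):
        d = X - mu[j]
        Sigma[j] = (r[:, j, None] * d).T @ d / Nj[j]
    return pi, mu, Sigma

# A few EM iterations on toy data with random initialization
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pi = np.array([0.5, 0.5])
mu = X[rng.choice(len(X), 2, replace=False)]
Sigma = np.array([np.cov(X.T) for _ in range(2)])
for _ in range(20):
    r = e_step(X, pi, mu, Sigma)
    pi, mu, Sigma = m_step(X, r)
print("Estimated means:\n", mu)
```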

Derivation

For any distribution $q(z)$, Jensen's inequality gives the evidence lower bound (ELBO): $\ell(\theta) \ge \mathbb{E}_{q}[\log p(x, z \mid \theta)] + H(q)$. Choosing $q(z_i) = P(z_i \mid x_i, \theta^{(t)})$ in the E-step makes the bound tight at $\theta^{(t)}$; since the entropy term $H(q)$ does not depend on $\theta$, the M-step only needs to maximize $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z}[\log p(x, z \mid \theta) \mid x, \theta^{(t)}]$.
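
For reference, the bound can be written out explicitly (a standard derivation, with $q_i$ an arbitrary distribution over components):

```latex
\begin{align*}
\ell(\theta)
  &= \sum_{i=1}^{m} \log \sum_{j=1}^{k} q_i(j)\,
     \frac{\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{q_i(j)} \\
  &\ge \sum_{i=1}^{m} \sum_{j=1}^{k} q_i(j)\,
     \log \frac{\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{q_i(j)}
     \qquad \text{(Jensen, $\log$ concave)} \\
  &= \mathbb{E}_{q}\!\left[\log p(x, z \mid \theta)\right] + H(q),
\end{align*}
```

and choosing $q_i(j) = r_{ij}$ makes the first term equal to $Q(\theta \mid \theta^{(t)})$ with the bound tight at $\theta = \theta^{(t)}$.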

ML Insight

  • EM for latent variable models like VAEs.

4. Convergence and Properties

Each EM iteration never decreases the log-likelihood, so the algorithm converges to a local maximum (or saddle point) of $\ell(\theta)$.

Convergence is monotonic but can be slow, especially when components overlap heavily.

Initialization: k-means centroids for $\mu_j$, equal weights $\pi_j = 1/k$, and data-driven covariances for $\Sigma_j$; multiple restarts reduce sensitivity to local maxima.

In 2025, stochastic and online EM variants handle large-scale data.
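
In scikit-learn, initialization and restarts map directly onto GaussianMixture's constructor arguments; a short sketch with illustrative settings:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k-means initialization, several restarts to reduce sensitivity to local maxima
gmm = GaussianMixture(
    n_components=4,
    init_params="kmeans",  # initialize means from k-means
    n_init=5,              # keep the best of 5 EM runs
    max_iter=200,
    tol=1e-4,              # stop when the lower-bound gain falls below tol
    random_state=0,
)
gmm.fit(X)
print("Converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")
print("Final lower bound per sample:", gmm.lower_bound_)
```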


5. GMM Variants

Diagonal GMM: diagonal $\Sigma_j$; far fewer parameters and faster fitting.

Tied GMM: a single covariance $\Sigma$ shared across components.

Bayesian GMM: priors on the parameters, typically fit with variational inference; superfluous components receive near-zero weight.

In ML: variational EM is used for scalability.
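
In scikit-learn these variants correspond to the covariance_type argument and the BayesianGaussianMixture class; a quick comparison sketch on synthetic blobs:

```python
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Covariance structure trades flexibility for parameter count
for cov_type in ["full", "diag", "tied", "spherical"]:
    gmm = GaussianMixture(n_components=4, covariance_type=cov_type, random_state=0).fit(X)
    print(cov_type, "BIC:", round(gmm.bic(X), 1))

# Bayesian GMM with a Dirichlet process prior: extra components get near-zero weight
bgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
print("Effective weights:", bgmm.weights_.round(3))
```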


6. Evaluation Metrics

  • Log-Likelihood: Higher better.
  • BIC/AIC: penalize complexity: $\mathrm{BIC} = -2\ell + p \log m$, where $p$ is the number of free parameters (not the number of components).
  • Silhouette: Cluster quality.

In 2025, the ELBO serves the same role for variational GMMs.
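
For a full-covariance GMM in $d$ dimensions, the free parameter count is $p = kd + k\frac{d(d+1)}{2} + (k-1)$. The sketch below computes BIC by hand on synthetic data and compares it with scikit-learn's gmm.bic; the two should agree:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
m, d = X.shape
k = 4

gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)

# Free parameters of a full-covariance GMM: means + covariances + (k - 1) weights
p = k * d + k * d * (d + 1) // 2 + (k - 1)

log_lik = gmm.score(X) * m           # score() is the mean per-sample log-likelihood
bic_manual = -2 * log_lik + p * np.log(m)
print("Manual BIC:", round(bic_manual, 2), "| sklearn BIC:", round(gmm.bic(X), 2))
print("AIC:", round(gmm.aic(X), 2))
```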


7. Applications in Machine Learning (2025)

  1. Clustering: Unsupervised grouping.
  2. Density Estimation: Model data distributions.
  3. Anomaly Detection: Low probability points.
  4. Generative Models: Baseline for VAEs.
  5. Speech Recognition: Acoustic modeling.
  6. LLM Applications: Cluster embeddings for topics.

Challenges

  • Local Maxima: EM is sensitive to initialization; use multiple restarts.
  • High Dimensions: the curse of dimensionality makes covariance estimation unreliable; apply dimensionality reduction first (see the sketch after this list).
  • k Selection: choose k with BIC or a Bayesian GMM.
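
A common mitigation, sketched below on hypothetical 50-dimensional synthetic data, is to project with PCA before fitting the GMM:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline

# High-dimensional synthetic data (50 features)
X, _ = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=0)

# Project to 10 dimensions, then fit the GMM on the reduced representation
pipe = make_pipeline(PCA(n_components=10, random_state=0),
                     GaussianMixture(n_components=4, random_state=0))
pipe.fit(X)
labels = pipe.predict(X)
print("Cluster sizes:", [int((labels == j).sum()) for j in range(4)])
```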

8. Numerical Implementations

Implement GMM with EM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# GMM clustering
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)
print("GMM Silhouette:", silhouette_score(X, labels))

plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("GMM Clustering")
plt.show()

# Parameters estimated by EM
print("Means:", gmm.means_)
print("Covariances:", gmm.covariances_)

# ML: Density estimation
probs = gmm.score_samples(X)
print("Log-likelihood mean:", np.mean(probs))

# Anomaly detection: flag the 5% of points with the lowest log-density
threshold = np.percentile(probs, 5)
anomalies = X[probs < threshold]
plt.scatter(X[:,0], X[:,1], c='lightgray')
plt.scatter(anomalies[:,0], anomalies[:,1], c='red', marker='x')
plt.title("GMM Anomalies")
plt.show()
```

```rust
use linfa::prelude::*;
use linfa_clustering::GaussianMixtureModel;
use ndarray::Array2;

fn main() {
    // Toy stand-in for make_blobs: two deterministic, offset clusters
    let x: Array2<f64> = Array2::from_shape_fn((300, 2), |(i, j)| {
        let offset = if i < 150 { 0.0 } else { 5.0 };
        offset + 0.1 * ((i * 7 + j * 13) % 10) as f64
    });

    // Unsupervised dataset (no targets needed)
    let dataset = DatasetBase::from(x.clone());
    let gmm = GaussianMixtureModel::params(4)
        .fit(&dataset)
        .expect("GMM fitting failed");
    let labels = gmm.predict(&x);
    // Silhouette is not provided out of the box; compute it manually if needed

    println!("GMM labels: {:?}", labels);
    println!("Means: {:?}", gmm.means());
}
```

Note: GMM support in Rust (linfa-clustering) is more limited than scikit-learn's; use Python for the full EM feature set.


9. Case Study: Customer Segmentation (GMM)

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Generate customer data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

# Find optimal k (BIC)
bic = []
for k in range(1, 10):
    gmm = GaussianMixture(n_components=k)
    gmm.fit(X)
    bic.append(gmm.bic(X))
plt.plot(range(1,10), bic)
plt.title("BIC for k")
plt.show()

# Train
gmm = GaussianMixture(n_components=4)
gmm.fit(X)
labels = gmm.predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("Customer Segmentation")
plt.show()
```

```rust
use linfa::prelude::*;
use linfa_clustering::GaussianMixtureModel;
use ndarray::Array2;

fn main() {
    // Toy stand-in for the customer blobs
    let x: Array2<f64> = Array2::from_shape_fn((300, 2), |(i, j)| {
        let offset = if i < 150 { 0.0 } else { 5.0 };
        offset + 0.1 * ((i * 7 + j * 13) % 10) as f64
    });

    let dataset = DatasetBase::from(x.clone());
    let gmm = GaussianMixtureModel::params(4)
        .fit(&dataset)
        .expect("GMM fitting failed");
    let labels = gmm.predict(&x);
    println!("Labels: {:?}", labels);
}
```

Note: visualization in Rust requires an additional plotting crate (e.g., plotters).


10. Under the Hood Insights

  • EM: iterative optimization over latent responsibilities and parameters.
  • Convergence: to a local maximum; run multiple initializations.
  • Covariance Types: full, diagonal, tied, spherical.
  • Regularization: add a small ε to the covariance diagonals to keep them positive definite (see the sketch below).
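
In scikit-learn this regularization is exposed as the reg_covar argument; a small sketch on nearly collinear data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Nearly collinear data: the second feature is a tiny perturbation of the first,
# so unregularized covariance estimates are close to singular
x1 = rng.normal(size=500)
X = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=500)])

gmm = GaussianMixture(n_components=2, reg_covar=1e-6, random_state=0).fit(X)
print("Smallest covariance eigenvalue per component:")
for cov in gmm.covariances_:
    print(np.linalg.eigvalsh(cov).min())
```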

11. Limitations

  • EM Local Maxima: sensitive to initialization; needs restarts.
  • High Dimensions: covariance estimation degrades; reduce dimensionality first.
  • k Selection: BIC/AIC choices can be sensitive to sample size and model misfit.
  • Non-Gaussian Clusters: GMMs assume elliptical components and fit other shapes poorly.

12. Summary

GMMs and EM are powerful for clustering and density estimation. In 2025, their role in generative baselines and anomaly detection keeps them vital. Initialization and regularization address limitations.


Further Reading

  • Dempster, Laird, and Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm".
  • Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (Ch. 8).
  • linfa-clustering docs: github.com/rust-ml/linfa.
  • Bilmes (1998), "A Gentle Tutorial of the EM Algorithm".