Gaussian Mixture Models & EM Algorithm
Gaussian Mixture Models & EM Algorithm
Section titled “Gaussian Mixture Models & EM Algorithm”Gaussian Mixture Models (GMMs) are probabilistic models that represent data as a mixture of Gaussian distributions, useful for clustering, density estimation, and generative tasks. The Expectation-Maximization (EM) algorithm is a powerful iterative method for finding maximum likelihood estimates in models with latent variables, like GMMs. In 2025, GMMs and EM remain foundational in ML for unsupervised learning, serving as baselines for advanced generative models like VAEs and diffusion models, and in applications like anomaly detection and speech recognition.
This lecture in the “Foundations for AI/ML” series (core-ml cluster) builds on clustering and Naive Bayes, exploring GMMs, the EM algorithm, their theoretical foundations, derivations, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Intuition
Section titled “1. Motivation and Intuition”GMMs assume data is generated from a mixture of k Gaussians, each with mean, covariance, and weight. EM optimizes parameters by alternating expectation (assign probabilities) and maximization (update parameters).
Why GMMs & EM in 2025?
- Unsupervised Learning: Cluster data without labels.
- Density Estimation: Model complex distributions.
- Baseline: For advanced models like VAEs.
- Modern Applications: Anomaly detection in IoT, speech processing with LLMs.
Real-World Examples
Section titled “Real-World Examples”- Audio Processing: Cluster sound features for recognition.
- Image Segmentation: Group pixels by color distributions.
- AI Pipelines: GMM on LLM embeddings for clustering.
::: info GMMs are like mixing colors to paint a picture—EM iteratively refines the palette for the best fit. :::
2. Mathematical Formulation of GMMs
Section titled “2. Mathematical Formulation of GMMs”A GMM with k components:
[ p(x) = \sum_{j=1}^k \pi_j \mathcal{N}(x | \mu_j, \Sigma_j) ]
- \pi_j: Mixing weights, sum \pi_j = 1.
- \mathcal{N}(x | \mu_j, \Sigma_j): Gaussian with mean \mu_j, covariance \Sigma_j.
Log-likelihood:
[ \ell(\theta) = \sum_{i=1}^m \log \left( \sum_{j=1}^k \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j) \right) ]
No closed-form MLE due to sum inside log.
Latent Variables
Section titled “Latent Variables”Introduce z_i ~ Multinomial(\pi), x_i | z_i=j ~ N(\mu_j, \Sigma_j).
Joint p(x,z|θ) tractable.
ML Connection
Section titled “ML Connection”- GMMs model multimodal data.
3. EM Algorithm: Expectation-Maximization
Section titled “3. EM Algorithm: Expectation-Maximization”EM: Iterative method for MLE with latents.
E-Step: Compute responsibilities r_{ij} = P(z_i=j | x_i, θ^t) = \pi_j N(x_i | \mu_j, \Sigma_j) / sum \pi_l N(x_i | \mu_l, \Sigma_l).
M-Step: Update θ^{t+1} = argmax E_z [log p(x,z|θ) | x, θ^t].
For GMM:
- \pi_j = (1/m) sum r_{ij}
- \mu_j = sum r_{ij} x_i / sum r_{ij}
- \Sigma_j = sum r_{ij} (x_i - \mu_j)(x_i - \mu_j)^T / sum r_{ij}
Derivation
Section titled “Derivation”ELBO: Q(θ|θ^t) = E_z [log p(x,z|θ) | x, θ^t] ≥ \ell(θ) (Jensen).
Maximize Q + H(q), but EM maximizes Q.
ML Insight
Section titled “ML Insight”- EM for latent variable models like VAEs.
4. Convergence and Properties
Section titled “4. Convergence and Properties”EM increases likelihood each step, converges to local max.
Convergence: Monotonic, but may be slow.
Initialization: k-means for μ, equal π.
In 2025, EM variants for large data.
5. GMM Variants
Section titled “5. GMM Variants”Diagonal GMM: Diagonal Σ, faster.
Tied GMM: Shared Σ.
Bayesian GMM: Priors on parameters.
In ML: Variational EM for scalability.
6. Evaluation Metrics
Section titled “6. Evaluation Metrics”- Log-Likelihood: Higher better.
- BIC/AIC: Penalize complexity: BIC = -2ℓ + k log m.
- Silhouette: Cluster quality.
In 2025, ELBO for variational GMMs.
7. Applications in Machine Learning (2025)
Section titled “7. Applications in Machine Learning (2025)”- Clustering: Unsupervised grouping.
- Density Estimation: Model data distributions.
- Anomaly Detection: Low probability points.
- Generative Models: Baseline for VAEs.
- Speech Recognition: Acoustic modeling.
- LLM Applications: Cluster embeddings for topics.
Challenges
Section titled “Challenges”- Local Maxima: EM sensitive to init.
- High-D: Curse; use reduction.
- k Selection: BIC for optimal k.
8. Numerical Implementations
Section titled “8. Numerical Implementations”Implement GMM with EM.
::: code-group
import numpy as npfrom sklearn.mixture import GaussianMixturefrom sklearn.metrics import silhouette_scorefrom sklearn.datasets import make_blobsimport matplotlib.pyplot as plt
# GMM clusteringX, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0)gmm.fit(X)labels = gmm.predict(X)print("GMM Silhouette:", silhouette_score(X, labels))
plt.scatter(X[:,0], X[:,1], c=labels)plt.title("GMM Clustering")plt.show()
# EM steps visibleprint("Means:", gmm.means_)print("Covariances:", gmm.covariances_)
# ML: Density estimationprobs = gmm.score_samples(X)print("Log-likelihood mean:", np.mean(probs))
# Anomaly detectionanomalies = X[probs < np.percentile(probs, 5)]plt.scatter(anomalies[:,0], anomalies[:,1], c='red', marker='x')plt.title("GMM Anomalies")plt.show()use linfa::prelude::*;use linfa_bayes::GaussianMixtureModel;use ndarray::{Array2, Array1};
fn main() { let mut rng = rand::thread_rng(); let x: Array2<f64> = Array2::zeros((300, 2)); // Generate blobs placeholder
let dataset = Dataset::new(x.clone(), Array1::zeros(300)); let gmm = GaussianMixtureModel::params(4).fit(&dataset).unwrap(); let labels = gmm.predict(&x); // Silhouette not natively; compute manually
println!("GMM Labels: {:?}", labels); println!("Means: {:?}", gmm.means());
// Density placeholder}:::
Note: Rust GMM support limited; use Python for full EM.
9. Case Study: Customer Segmentation (GMM)
Section titled “9. Case Study: Customer Segmentation (GMM)”::: code-group
from sklearn.datasets import make_blobsfrom sklearn.mixture import GaussianMixturefrom sklearn.metrics import silhouette_scoreimport matplotlib.pyplot as pltimport numpy as np
# Generate customer dataX, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
# Find optimal k (BIC)bic = []for k in range(1, 10): gmm = GaussianMixture(n_components=k) gmm.fit(X) bic.append(gmm.bic(X))plt.plot(range(1,10), bic)plt.title("BIC for k")plt.show()
# Traingmm = GaussianMixture(n_components=4)gmm.fit(X)labels = gmm.predict(X)plt.scatter(X[:,0], X[:,1], c=labels)plt.title("Customer Segmentation")plt.show()use linfa::prelude::*;use linfa_bayes::GaussianMixtureModel;use ndarray::Array2;
fn main() { let x: Array2<f64> = Array2::zeros((300, 2)); // Generate blobs placeholder
let gmm = GaussianMixtureModel::params(4).fit(&x).unwrap(); let labels = gmm.predict(&x); println!("Labels: {:?}", labels);}:::
Note: Rust requires plotting libraries for visualization.
10. Under the Hood Insights
Section titled “10. Under the Hood Insights”- EM: Iterative optimization for latents.
- Convergence: To local max; multiple inits.
- Covariance Types: Full, diagonal, tied.
- Regularization: Add small ε to covariances.
11. Limitations
Section titled “11. Limitations”- EM Local Maxima: Sensitive to init.
- High-D: Curse; use reduction.
- k Selection: BIC sensitive.
- Non-Gaussian: GMM assumes Gaussians.
12. Summary
Section titled “12. Summary”GMMs and EM are powerful for clustering and density estimation. In 2025, their role in generative baselines and anomaly detection keeps them vital. Initialization and regularization address limitations.
Further Reading
Section titled “Further Reading”- Dempster, “Maximum Likelihood from Incomplete Data via the EM Algorithm”.
- Hastie, Elements of Statistical Learning (Ch. 8).
linfa-bayesdocs: github.com/rust-ml/linfa.- Bilmes, “A Gentle Tutorial of the EM Algorithm” (1998).