Gaussian Mixture Models & EM Algorithm
Gaussian Mixture Models (GMMs) are probabilistic models that represent data as a mixture of Gaussian distributions, useful for clustering, density estimation, and generative tasks. The Expectation-Maximization (EM) algorithm is a powerful iterative method for finding maximum likelihood estimates in models with latent variables, like GMMs. In 2025, GMMs and EM remain foundational in ML for unsupervised learning, serving as baselines for advanced generative models like VAEs and diffusion models, and in applications like anomaly detection and speech recognition.
This lecture in the "Foundations for AI/ML" series (core-ml cluster) builds on clustering and Naive Bayes, exploring GMMs, the EM algorithm, their theoretical foundations, derivations, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Intuition
GMMs assume the data is generated from a mixture of k Gaussians, each with its own mean, covariance, and mixing weight. EM fits the parameters by alternating an expectation step (compute soft assignment probabilities) and a maximization step (update the parameters given those assignments).
Why GMMs & EM in 2025?
- Unsupervised Learning: Cluster data without labels.
- Density Estimation: Model complex distributions.
- Baseline: For advanced models like VAEs.
- Modern Applications: Anomaly detection in IoT, speech processing with LLMs.
Real-World Examples
- Audio Processing: Cluster sound features for recognition.
- Image Segmentation: Group pixels by color distributions.
- AI Pipelines: GMM on LLM embeddings for clustering.
INFO
GMMs are like mixing colors to paint a picture—EM iteratively refines the palette for the best fit.
2. Mathematical Formulation of GMMs
A GMM with k components:
[ p(x) = \sum_{j=1}^k \pi_j \mathcal{N}(x | \mu_j, \Sigma_j) ]
- \pi_j: mixing weights, with \sum_{j=1}^k \pi_j = 1 and \pi_j \ge 0.
- \mathcal{N}(x | \mu_j, \Sigma_j): Gaussian with mean \mu_j, covariance \Sigma_j.
Log-likelihood:
[ \ell(\theta) = \sum_{i=1}^m \log \left( \sum_{j=1}^k \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j) \right) ]
There is no closed-form MLE for \theta = \{\pi_j, \mu_j, \Sigma_j\}_{j=1}^k because the sum sits inside the log.
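Numerically, the log-likelihood is evaluated with the log-sum-exp trick to avoid underflow. A minimal sketch, assuming hand-picked illustrative parameters and random placeholder data:
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
# Hypothetical 2-component GMM parameters (illustration only)
pis = np.array([0.6, 0.4])
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
X = np.random.randn(100, 2)  # placeholder data
# log p(x_i) = logsumexp_j [ log pi_j + log N(x_i | mu_j, Sigma_j) ]
log_terms = np.column_stack([
    np.log(pis[j]) + multivariate_normal.logpdf(X, mus[j], sigmas[j])
    for j in range(len(pis))
])
log_lik = logsumexp(log_terms, axis=1).sum()
print("Log-likelihood:", log_lik)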
Latent Variables
Introduce z_i ~ Multinomial(\pi), x_i | z_i=j ~ N(\mu_j, \Sigma_j).
The complete-data likelihood p(x, z | \theta) factorizes and is tractable, which is what EM exploits.
ML Connection
- GMMs model multimodal data.
3. EM Algorithm: Expectation-Maximization
EM: Iterative method for MLE with latents.
E-Step: Compute responsibilities r_{ij} = P(z_i = j | x_i, \theta^t) = \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j) / \sum_{l=1}^k \pi_l \mathcal{N}(x_i | \mu_l, \Sigma_l).
M-Step: Update θ^{t+1} = argmax E_z [log p(x,z|θ) | x, θ^t].
For GMM:
- \pi_j = (1/m) \sum_{i=1}^m r_{ij}
- \mu_j = \sum_{i=1}^m r_{ij} x_i / \sum_{i=1}^m r_{ij}
- \Sigma_j = \sum_{i=1}^m r_{ij} (x_i - \mu_j)(x_i - \mu_j)^T / \sum_{i=1}^m r_{ij}
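A minimal NumPy sketch of one EM iteration implementing exactly these updates (no regularization or convergence check; illustration only):
import numpy as np
from scipy.stats import multivariate_normal
def em_step(X, pis, mus, sigmas):
    m, k = X.shape[0], len(pis)
    # E-step: r[i, j] = pi_j N(x_i|mu_j, Sigma_j) / sum_l pi_l N(x_i|mu_l, Sigma_l)
    r = np.column_stack([
        pis[j] * multivariate_normal.pdf(X, mus[j], sigmas[j]) for j in range(k)
    ])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form updates weighted by the responsibilities
    Nj = r.sum(axis=0)                       # effective count per component
    pis = Nj / m
    mus = [r[:, j] @ X / Nj[j] for j in range(k)]
    sigmas = [(r[:, j, None] * (X - mus[j])).T @ (X - mus[j]) / Nj[j]
              for j in range(k)]
    return pis, mus, sigmas
# Repeat em_step until the log-likelihood stops improving.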
Derivation
ELBO: for any distribution q over z, Jensen's inequality gives \ell(\theta) \ge E_q[\log p(x, z | \theta)] + H(q).
Setting q = p(z | x, \theta^t) makes the bound tight; since H(q) does not depend on \theta, the M-step only needs to maximize Q(\theta | \theta^t) = E_z[\log p(x, z | \theta) | x, \theta^t].
ML Insight
- EM for latent variable models like VAEs.
4. Convergence and Properties
EM increases likelihood each step, converges to local max.
Convergence: Monotonic in the log-likelihood, but can be slow near flat regions.
Initialization: k-means for the means, uniform mixing weights, and the sample covariance for each \Sigma_j is a common choice.
In 2025, stochastic and online EM variants scale the algorithm to large datasets.
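In scikit-learn, initialization and restarts are controlled by init_params and n_init; a short sketch using k-means initialization with multiple restarts to reduce sensitivity to local maxima:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
# k-means init (scikit-learn's default) plus 10 restarts; the best run is kept.
gmm = GaussianMixture(n_components=4, init_params='kmeans', n_init=10,
                      max_iter=200, random_state=0)
gmm.fit(X)
print("Converged:", gmm.converged_, "in", gmm.n_iter_, "iterations")
print("Average per-sample log-likelihood:", gmm.score(X))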
5. GMM Variants
Diagonal GMM: Diagonal Σ, faster.
Tied GMM: Shared Σ.
Bayesian GMM: Priors on parameters.
In ML: Variational EM trades exact inference for scalability and automatic component pruning.
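scikit-learn exposes these variants through covariance_type and the separate BayesianGaussianMixture class (variational inference with priors on the parameters); a quick comparison sketch:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
# Full, tied, diagonal, and spherical covariances trade flexibility for parameter count.
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=4, covariance_type=cov_type, random_state=0).fit(X)
    print(f"{cov_type:>9}: BIC = {gmm.bic(X):.1f}")
# Bayesian GMM with a Dirichlet-process prior can prune unused components.
bgmm = BayesianGaussianMixture(n_components=10,
                               weight_concentration_prior_type='dirichlet_process',
                               random_state=0).fit(X)
print("Effective weights:", bgmm.weights_.round(3))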
6. Evaluation Metrics
- Log-Likelihood: Higher better.
- BIC/AIC: Penalize complexity; BIC = -2\ell + p \log m, where p is the number of free parameters and m the number of samples.
- Silhouette: Cluster quality.
In 2025, the ELBO replaces the log-likelihood as the quantity to monitor for variational GMMs.
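All of these metrics are available on a fitted scikit-learn model; a brief sketch:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)
print("Total log-likelihood:", gmm.score(X) * len(X))  # score() is per-sample
print("BIC:", gmm.bic(X), "AIC:", gmm.aic(X))          # lower is better
print("Silhouette:", silhouette_score(X, labels))      # higher is better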
7. Applications in Machine Learning (2025)
- Clustering: Unsupervised grouping.
- Density Estimation: Model data distributions.
- Anomaly Detection: Low probability points.
- Generative Models: Baseline for VAEs (see the sampling sketch after this list).
- Speech Recognition: Acoustic modeling.
- LLM Applications: Cluster embeddings for topics.
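As a sketch of the generative use above, a fitted GMM can draw new points directly (anomaly scoring with score_samples appears in Section 8):
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
# Draw 10 new points plus the index of the component each came from.
X_new, comp_idx = gmm.sample(n_samples=10)
print(X_new.shape, comp_idx)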
Challenges
- Local Maxima: EM sensitive to init.
- High-D: covariance estimation suffers from the curse of dimensionality; apply dimensionality reduction (e.g., PCA) first.
- k Selection: BIC for optimal k.
8. Numerical Implementations
We implement GMM clustering with EM in Python (scikit-learn), followed by a Rust sketch with linfa.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# GMM clustering
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)
print("GMM Silhouette:", silhouette_score(X, labels))
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("GMM Clustering")
plt.show()
# EM steps visible
print("Means:", gmm.means_)
print("Covariances:", gmm.covariances_)
# ML: Density estimation
probs = gmm.score_samples(X)
print("Log-likelihood mean:", np.mean(probs))
# Anomaly detection
anomalies = X[probs < np.percentile(probs, 5)]
plt.scatter(anomalies[:,0], anomalies[:,1], c='red', marker='x')
plt.title("GMM Anomalies")
plt.show()

use linfa::prelude::*;
use linfa_clustering::GaussianMixtureModel;
use ndarray::Array2;
fn main() {
    // Placeholder data; in practice, generate blobs (e.g., with ndarray-rand).
    let x: Array2<f64> = Array2::zeros((300, 2));
    let dataset = DatasetBase::from(x.clone());
    // Fit a 4-component GMM with EM.
    let gmm = GaussianMixtureModel::params(4)
        .fit(&dataset)
        .expect("GMM fitting failed");
    let labels = gmm.predict(&x);
    // Silhouette is not provided natively; compute it manually if needed.
    println!("GMM Labels: {:?}", labels);
    println!("Means: {:?}", gmm.means());
}
Note: Rust GMM support (linfa-clustering) is more limited than scikit-learn; use Python for the full EM workflow.
9. Case Study: Customer Segmentation (GMM)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
# Generate customer data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
# Find optimal k (BIC)
bic = []
for k in range(1, 10):
    gmm = GaussianMixture(n_components=k)
    gmm.fit(X)
    bic.append(gmm.bic(X))
plt.plot(range(1,10), bic)
plt.title("BIC for k")
plt.show()
# Train
gmm = GaussianMixture(n_components=4)
gmm.fit(X)
labels = gmm.predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("Customer Segmentation")
plt.show()

use linfa::prelude::*;
use linfa_clustering::GaussianMixtureModel;
use ndarray::Array2;
fn main() {
    // Placeholder data; in practice, generate blobs of customer features.
    let x: Array2<f64> = Array2::zeros((300, 2));
    let dataset = DatasetBase::from(x.clone());
    let gmm = GaussianMixtureModel::params(4)
        .fit(&dataset)
        .expect("GMM fitting failed");
    let labels = gmm.predict(&x);
    println!("Labels: {:?}", labels);
}
Note: Rust requires a separate plotting library for visualization.
10. Under the Hood Insights
- EM: Iterative optimization for latents.
- Convergence: To local max; multiple inits.
- Covariance Types: Full, diagonal, tied.
- Regularization: Add small ε to covariances.
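In scikit-learn, the ε term is the reg_covar parameter (default 1e-6), added to the diagonal of every covariance; a sketch of raising it to guard against singular covariances when a component collapses onto very few points:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=4, covariance_type='full',
                      reg_covar=1e-3, random_state=0).fit(X)
# The smallest eigenvalue of each covariance stays bounded away from zero.
for cov in gmm.covariances_:
    print(np.linalg.eigvalsh(cov).min())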
11. Limitations
- EM Local Maxima: Sensitive to init.
- High-D: covariance estimation degrades in high dimensions; apply dimensionality reduction first.
- k Selection: BIC can be sensitive to sample size and initialization.
- Non-Gaussian: GMMs assume elliptical (Gaussian) components and can misfit other cluster shapes.
12. Summary
GMMs and EM are powerful for clustering and density estimation. In 2025, their role in generative baselines and anomaly detection keeps them vital. Initialization and regularization address limitations.
Further Reading
- Dempster, Laird, and Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977).
- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (Ch. 8).
- Bilmes, "A Gentle Tutorial of the EM Algorithm" (1998).
- linfa (linfa-clustering) docs: github.com/rust-ml/linfa