Clustering: k-Means & Hierarchical
Clustering is an unsupervised learning technique that groups similar data points without labels, uncovering hidden patterns in data. k-Means partitions data into k clusters by minimizing intra-cluster variance, while Hierarchical clustering builds a tree of clusters, offering flexibility in choosing the number of clusters. In 2025, these algorithms remain essential in ML for customer segmentation, anomaly detection, and as preprocessing steps in pipelines with large language models (LLMs) for embedding analysis.
This lecture in the “Foundations for AI/ML” series (core-ml cluster) builds on prior topics like decision trees and ensembles, exploring k-Means and Hierarchical clustering, their theoretical foundations, algorithms, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Intuition
Clustering discovers structure in unlabeled data, grouping similar points while separating dissimilar ones. k-Means is simple and scalable, ideal for spherical clusters, while hierarchical clustering flexibly captures nested or non-spherical structure.
Why Clustering in 2025?
- Unsupervised Learning: Handles unlabeled data common in big data.
- Interpretability: Reveals data patterns.
- Baseline: A reference point for advanced methods like DBSCAN.
- Modern Applications: Cluster LLM embeddings for topic discovery.
Real-World Examples
- Marketing: Customer segmentation.
- Biology: Gene expression clustering.
- AI Pipelines: Cluster LLM outputs for summarization.
::: info Clustering is like organizing a messy room—k-Means assigns to fixed bins, Hierarchical builds a nested hierarchy. :::
2. Mathematical Formulation
Clustering minimizes within-group dissimilarity while maximizing between-group dissimilarity.
Objective: For k clusters, minimize sum of intra-cluster variances.
Distance Metrics:
- Euclidean: For continuous features.
- Manhattan: Robust to outliers.
- Cosine: For high-d embeddings.
ML Connection
- The choice of metric shapes the clusters: Euclidean favors compact spherical groups, while cosine groups by direction, as the sketch below illustrates.
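To make the metrics concrete, here is a minimal NumPy sketch comparing the three distances on a pair of toy vectors. Note that cosine distance ignores magnitude, which is why it suits LLM embeddings:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])  # v is u scaled by 2

euclidean = np.linalg.norm(u - v)  # straight-line distance
manhattan = np.abs(u - v).sum()    # sum of coordinate differences
cosine = 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # 1 - cos similarity

print(f"Euclidean: {euclidean:.3f}")  # 3.742
print(f"Manhattan: {manhattan:.3f}")  # 6.000
print(f"Cosine:    {cosine:.3f}")     # 0.000: scaled copies count as identical
```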
3. k-Means Clustering
Algorithm:
- Initialize k centroids (random or k-means++).
- Assign points to nearest centroid.
- Update centroids as cluster means.
- Repeat until convergence.
Objective: minimize J = Σ_{i=1}^{m} ‖x_i − μ_{c_i}‖², where c_i is the cluster assigned to point x_i and μ_{c_i} is that cluster's centroid.
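The assignment/update loop above is short enough to write from scratch. Here is a minimal NumPy sketch of Lloyd's algorithm, using random initialization for brevity (production code would use k-means++):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```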
EM View
Expectation step: assign points to clusters. Maximization step: update the cluster means.
Derivation
Optimal k-partitioning is NP-hard; Lloyd's algorithm is an efficient heuristic that converges to a local (not necessarily global) optimum.
Variants
- k-Means++: Smart initialization that spreads initial centroids apart.
- Mini-Batch k-Means: Scalable for large data.
In 2025, k-Means is routinely applied to LLM embeddings for topic clustering, with mini-batch variants keeping it tractable at scale; a sketch follows.
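As a scalability illustration, scikit-learn's MiniBatchKMeans updates centroids from small random batches instead of full passes over the data; the dataset size and batch size below are arbitrary choices for the sketch:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# 100k points would be slow for naive full-batch updates on modest hardware
X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print("Inertia:", mbk.inertia_)
```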
4. Hierarchical Clustering
Agglomerative (Bottom-Up):
- Start with n singleton clusters.
- Merge closest clusters iteratively.
- Stop at desired k or single cluster.
Divisive (Top-Down): Start with one, split recursively.
Linkage Criteria:
- Single: Min distance.
- Complete: Max distance.
- Average: Mean distance.
- Ward: Minimize variance increase.
Dendrogram
A tree diagram of the merge hierarchy; cutting it at a chosen height yields a flat clustering with that many clusters, as the sketch below shows.
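SciPy can compute the full merge hierarchy and draw the dendrogram; a small sample keeps the tree readable:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
Z = linkage(X, method='ward')  # Ward: merge the pair that least increases variance
dendrogram(Z)
plt.title("Dendrogram (Ward linkage)")
plt.ylabel("Merge distance")
plt.show()
```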
ML Application
- Hierarchical clustering yields taxonomies in biology (e.g., phylogenetic trees).
5. Choosing k and Validation
Elbow Method: Plot inertia vs. k and pick the “elbow” where improvements flatten.
Silhouette Score: Measures within-cluster cohesion vs. between-cluster separation.
Gap Statistic: Compares observed within-cluster dispersion to a null reference distribution.
In ML pipelines, k is treated as a hyperparameter and chosen by sweeping the metrics above across candidate values, as in the sketch below.
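A minimal elbow-method sketch on synthetic data: inertia always decreases as k grows, so we look for the bend where the decrease flattens.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow Method")  # the bend near k=4 suggests four clusters
plt.show()
```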
6. Evaluation Metrics
Internal:
- Inertia: Sum of squared distances to centroids.
- Silhouette: s = (b − a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster.
External (with labels):
- ARI, NMI: Compare the clustering to ground-truth labels; both are invariant to label permutation (see the sketch below).
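Label-permutation invariance is the point of these metrics: only the grouping matters, not which integer names each cluster. A quick sketch:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # identical grouping, permuted label names

print("ARI:", adjusted_rand_score(y_true, y_pred))           # 1.0
print("NMI:", normalized_mutual_info_score(y_true, y_pred))  # 1.0
```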
In 2025, probabilistic clustering methods (e.g., Gaussian mixtures) complement these metrics with calibrated cluster-membership probabilities.
7. Applications in Machine Learning (2025)
- Customer Segmentation: Marketing groups.
- Anomaly Detection: Outliers appear as small or distant clusters (see the sketch after this list).
- Image Segmentation: Pixel clustering.
- Bioinformatics: Gene clustering.
- LLM Analysis: Cluster embeddings for topics.
- Edge AI: Lightweight hierarchical for devices.
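For the anomaly-detection use case, one common pattern is to flag points far from every centroid; the 99th-percentile threshold below is an arbitrary choice for this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = np.vstack([X, [[15.0, 15.0], [-15.0, 15.0]]])  # inject two outliers

km = KMeans(n_clusters=3, random_state=0).fit(X)
dists = km.transform(X).min(axis=1)   # distance to the nearest centroid
threshold = np.percentile(dists, 99)  # assumption: top 1% are anomalous
print("Anomaly indices:", np.where(dists > threshold)[0])
```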
Challenges
- k-Means: Sensitive to initialization; assumes roughly spherical, similarly sized clusters.
- Hierarchical: At least O(n²) time and memory.
- High-D: The curse of dimensionality erodes distance contrasts; apply dimensionality reduction first.
8. Numerical Implementations
We implement k-Means and hierarchical clustering below, in Python (scikit-learn) and Rust (linfa).
::: code-group
```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# k-Means
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=0)
labels = kmeans.fit_predict(X)
print("k-Means Silhouette:", silhouette_score(X, labels))

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='red')
plt.title("k-Means Clustering")
plt.show()

# Hierarchical
hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_h = hier.fit_predict(X)
print("Hierarchical Silhouette:", silhouette_score(X, labels_h))

plt.scatter(X[:, 0], X[:, 1], c=labels_h)
plt.title("Hierarchical Clustering")
plt.show()

# ML: Cluster LLM embeddings (random placeholder vectors)
embeddings = np.random.rand(100, 768)  # stand-in for LLM embeddings
kmeans_llm = KMeans(n_clusters=5)
labels_llm = kmeans_llm.fit_predict(embeddings)
print("LLM Embeddings Clusters:", np.unique(labels_llm))
```

```rust
use linfa::prelude::*;
use linfa_clustering::KMeans;
use ndarray::{Array1, Array2};

fn main() {
    // Generate blobs placeholder (all zeros; swap in real data)
    let x: Array2<f64> = Array2::zeros((200, 2));
    let dataset = Dataset::new(x.clone(), Array1::<f64>::zeros(200));

    let kmeans = KMeans::params(3).fit(&dataset).unwrap();
    let labels = kmeans.predict(&x);
    // Silhouette is not built in; compute it manually if needed

    // Hierarchical clustering is not in linfa; k-Means shown for the demo
    println!("k-Means Labels: {:?}", labels);
}
```
:::
Note: Rust clustering support is limited; use Python for full hierarchical clustering.
9. Case Study: Customer Segmentation (k-Means)
::: code-group
```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Generate customer data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

# Find the optimal k via silhouette scores
sil = []
for k in range(2, 10):
    km = KMeans(n_clusters=k)
    km.fit(X)
    sil.append(silhouette_score(X, km.labels_))
plt.plot(range(2, 10), sil)
plt.title("Silhouette Score vs. k")
plt.show()

# Train the final model
km = KMeans(n_clusters=4)
labels = km.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Customer Segmentation")
plt.show()
```

```rust
use linfa::prelude::*;
use linfa_clustering::KMeans;
use ndarray::Array2;

fn main() {
    // Generate blobs placeholder (all zeros; swap in real data)
    let x: Array2<f64> = Array2::zeros((300, 2));
    let dataset = DatasetBase::from(x.clone());

    let km = KMeans::params(4).fit(&dataset).unwrap();
    let labels = km.predict(&x);
    println!("Labels: {:?}", labels);
}
```
:::
Note: Rust requires plotting libraries for visualization.
10. Under the Hood Insights
- k-Means: A hard-assignment variant of the EM algorithm.
- Hierarchical: Linkage affects dendrogram.
- Validation: Silhouette for optimal k.
- Scalability: Mini-batch k-Means for large data.
11. Limitations
- k-Means: Assumes spherical clusters; results are sensitive to the choice of k.
- Hierarchical: Computationally expensive for large n.
- High-D: Distance contrasts shrink in high dimensions; reduce dimensionality first (see the sketch after this list).
- Outliers: Distort centroids; use robust variants (e.g., k-medoids).
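For the high-dimensional case, a common remedy is to project with PCA before clustering. The 768-dimensional random matrix below is a hypothetical stand-in for LLM embeddings, and 50 components is an arbitrary choice for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = np.random.rand(500, 768)  # stand-in for an LLM embedding matrix
pipe = make_pipeline(PCA(n_components=50), KMeans(n_clusters=5, random_state=0))
labels = pipe.fit_predict(X)  # PCA transform, then cluster the projections
print("Cluster sizes:", np.bincount(labels))
```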
12. Summary
Clustering with k-Means and Hierarchical uncovers data patterns. In 2025, their role in LLM embedding analysis and anomaly detection keeps them vital. Validation and preprocessing address limitations.
Further Reading
- Hartigan, Clustering Algorithms.
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (Ch. 14).
- Rokach & Maimon, “Clustering Methods” (2005).
- linfa-clustering docs: github.com/rust-ml/linfa.