Principal Component Analysis
Principal Component Analysis (PCA) is a key unsupervised learning technique for dimensionality reduction, transforming high-dimensional data into a lower-dimensional space while preserving as much variance as possible. This section provides a comprehensive exploration of PCA's theory, derivations, computational aspects, and evaluation, with a Rust lab using linfa. We'll dive into the mechanics of eigenvalue decomposition, data transformation, and Rust's computational efficiency.
Theory
PCA reduces a dataset $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features to $k$ dimensions ($k < d$), retaining the directions of maximal variance.

Given a centered dataset (mean-subtracted, $\tilde{X} = X - \bar{X}$), the covariance matrix is

$$\Sigma = \frac{1}{n} \tilde{X}^T \tilde{X},$$

where $\Sigma \in \mathbb{R}^{d \times d}$ is symmetric and positive semi-definite.
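For concreteness, here is a minimal ndarray sketch of the centering and covariance step, using the $\frac{1}{n}$ scaling above; the helper name `covariance_matrix` is ours for illustration and this is not linfa's internal implementation.

```rust
use ndarray::{array, Array1, Array2, Axis};

/// Covariance of a data matrix X (n x d): center the columns, then compute (1/n) * X~^T X~.
fn covariance_matrix(x: &Array2<f64>) -> Array2<f64> {
    let n = x.nrows() as f64;
    let mean: Array1<f64> = x.mean_axis(Axis(0)).expect("non-empty data"); // per-feature means
    let centered = x - &mean; // broadcast: subtract the mean row from every sample
    centered.t().dot(&centered) / n // d x d covariance matrix
}

fn main() {
    let x: Array2<f64> = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
    println!("Sigma =\n{:?}", covariance_matrix(&x));
}
```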
Derivation: Eigenvalue Decomposition
To maximize variance, PCA finds the direction $w \in \mathbb{R}^d$ that maximizes the projected variance

$$\max_w \; w^T \Sigma w,$$

subject to $\|w\| = 1$. Introducing a Lagrange multiplier $\lambda$ gives $L(w, \lambda) = w^T \Sigma w - \lambda (w^T w - 1)$.

Taking the derivative with respect to $w$ and setting it to zero yields

$$\Sigma w = \lambda w.$$

Thus, $w$ is an eigenvector of $\Sigma$, and the corresponding eigenvalue $\lambda$ is the variance captured along that direction.

The data is projected onto the top $k$ eigenvectors:

$$Z = \tilde{X} W_k,$$

where $W_k \in \mathbb{R}^{d \times k}$ contains the $k$ eigenvectors with the largest eigenvalues and $Z \in \mathbb{R}^{n \times k}$ is the reduced representation.
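To make the derivation tangible, here is a hand-rolled sketch using nalgebra's symmetric eigendecomposition: compute $\Sigma$, sort eigenvectors by descending eigenvalue, and form $Z = \tilde{X} W_k$. The helper `project_top_k` is purely illustrative, assumes the input is already centered, and is not how linfa computes PCA (linfa uses SVD, as discussed below).

```rust
use nalgebra::DMatrix;

/// Project already-centered data (n x d) onto its top-k principal components.
fn project_top_k(centered: &DMatrix<f64>, k: usize) -> DMatrix<f64> {
    let n = centered.nrows() as f64;
    // Covariance matrix Sigma = (1/n) * X~^T X~ (d x d, symmetric).
    let sigma = centered.transpose() * centered / n;
    // Eigendecomposition of the symmetric covariance matrix.
    let eig = sigma.symmetric_eigen();
    // Indices of eigenvalues sorted in descending order.
    let mut order: Vec<usize> = (0..eig.eigenvalues.len()).collect();
    order.sort_by(|&a, &b| eig.eigenvalues[b].partial_cmp(&eig.eigenvalues[a]).unwrap());
    // W_k: the k eigenvectors with the largest eigenvalues, as columns (d x k).
    let columns: Vec<_> = order[..k]
        .iter()
        .map(|&i| eig.eigenvectors.column(i).into_owned())
        .collect();
    let w_k = DMatrix::from_columns(&columns);
    // Z = X~ W_k (n x k).
    centered * w_k
}

fn main() {
    // Tiny 3x3 example; center it first by subtracting the per-column means.
    let x = DMatrix::from_row_slice(3, 3, &[1.0, 2.0, 3.0, 2.0, 4.1, 6.2, 3.0, 6.1, 9.3]);
    let mean = x.row_mean();
    let centered = DMatrix::from_fn(x.nrows(), x.ncols(), |r, c| x[(r, c)] - mean[c]);
    println!("Z =\n{}", project_top_k(&centered, 2));
}
```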
Under the Hood: Computing the covariance matrix and its eigendecomposition is costly, roughly $O(nd^2 + d^3)$, so implementations typically apply singular value decomposition (SVD) to the centered data instead: the right singular vectors are the principal axes, and the eigenvalues follow from the singular values as $\lambda_i = \sigma_i^2 / n$. linfa leverages nalgebra's optimized SVD, ensuring numerical stability and memory safety, unlike C++ where manual matrix operations risk errors.
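As an illustration of the SVD route (again calling nalgebra directly rather than linfa's internals, with pre-centered data and assuming `svd` returns singular values sorted in descending order, as recent nalgebra releases do):

```rust
use nalgebra::DMatrix;

fn main() {
    // Pre-centered data: n = 4 samples, d = 3 features (each column already sums to zero).
    let centered = DMatrix::from_row_slice(4, 3, &[
        -1.5, -1.0, -2.0,
        -0.5,  0.0, -1.0,
         0.5,  0.0,  1.0,
         1.5,  1.0,  2.0,
    ]);
    let n = centered.nrows() as f64;

    // Thin SVD: X~ = U S V^T; the rows of V^T are the principal axes.
    let svd = centered.clone().svd(true, true);
    let v_t = svd.v_t.expect("V^T was requested");

    // Eigenvalues of Sigma recovered from the singular values: lambda_i = sigma_i^2 / n.
    let eigenvalues: Vec<f64> = svd.singular_values.iter().map(|s| s * s / n).collect();
    println!("principal axes (rows of V^T):\n{}", v_t);
    println!("eigenvalues of Sigma: {:?}", eigenvalues);

    // Project onto the top-2 components: Z = X~ V_k.
    let z = &centered * v_t.rows(0, 2).transpose();
    println!("Z =\n{}", z);
}
```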
Optimization: Choosing k
The number of components $k$ is chosen using the explained variance ratio (EVR):

$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}.$$

Typically, $k$ is selected as the smallest value whose cumulative EVR exceeds a threshold such as 95%.
Under the Hood: Selecting $k$ requires the sorted eigenvalue spectrum; linfa computes eigenvalues efficiently, allowing rapid EVR calculation. Rust's compile-time checks prevent indexing errors in eigenvalue sorting, unlike Python's scikit-learn, which may require additional validation for large datasets.
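A small helper along these lines (the name `choose_k` is hypothetical, not a linfa API) returns the smallest $k$ whose cumulative EVR reaches a given threshold:

```rust
/// Smallest k such that the cumulative explained variance ratio reaches `threshold`.
/// The eigenvalues are assumed to be sorted in descending order.
fn choose_k(eigenvalues: &[f64], threshold: f64) -> usize {
    let total: f64 = eigenvalues.iter().sum();
    let mut cumulative = 0.0;
    for (i, lambda) in eigenvalues.iter().enumerate() {
        cumulative += lambda / total;
        if cumulative >= threshold {
            return i + 1;
        }
    }
    eigenvalues.len()
}

fn main() {
    let eigenvalues = [4.2, 1.3, 0.3, 0.2];
    println!("k = {}", choose_k(&eigenvalues, 0.95)); // prints k = 3 for this toy spectrum
}
```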
Evaluation
PCA performance is evaluated with:
- Explained Variance Ratio (EVR): As above, higher is better.
- Reconstruction Error: Measures information loss:

  $$\text{Error} = \frac{1}{n} \left\| \tilde{X} - Z W_k^T \right\|_F^2,$$

  where $Z = \tilde{X} W_k$ is the reduced data and $\hat{X} = Z W_k^T$ is the reconstruction; lower is better.
- Downstream Task Performance: Accuracy or MSE of a model trained on the reduced data $Z$.
Under the Hood: The reconstruction error quantifies the trade-off between dimensionality and fidelity. Rust's linfa optimizes projection and reconstruction, using vectorized operations to minimize computation time, outperforming Python for large datasets.
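The reconstruction error is short to compute once you hold the centered data, the reduced representation $Z$, and the component matrix $W_k$. The following ndarray sketch (the `reconstruction_error` helper is ours, independent of linfa) forms $\hat{X} = Z W_k^T$ and averages the squared residuals:

```rust
use ndarray::{array, Array2};

/// Mean squared reconstruction error: (1/n) * ||X~ - Z W_k^T||_F^2.
fn reconstruction_error(centered: &Array2<f64>, z: &Array2<f64>, w_k: &Array2<f64>) -> f64 {
    let reconstructed = z.dot(&w_k.t()); // X-hat = Z W_k^T, shape n x d
    let diff = centered - &reconstructed; // residual matrix
    diff.iter().map(|e| e * e).sum::<f64>() / centered.nrows() as f64
}

fn main() {
    // Toy example: 2D centered data whose main spread lies along the first axis.
    let centered: Array2<f64> = array![[-1.0, 0.1], [0.0, -0.2], [1.0, 0.1]];
    let w_k: Array2<f64> = array![[1.0], [0.0]]; // d x k with k = 1
    let z = centered.dot(&w_k); // n x k reduced representation
    println!("reconstruction error = {}", reconstruction_error(&centered, &z, &w_k));
}
```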
Lab: PCA with linfa
You’ll apply PCA to a synthetic dataset (e.g., high-dimensional customer data) to reduce dimensions and evaluate EVR.
Edit `src/main.rs` in your `rust_ml_tutorial` project:

```rust
use linfa::prelude::*;
use linfa_reduction::Pca;
use ndarray::{array, Array2};

fn main() {
    // Synthetic dataset: 4D features (x1, x2, x3, x4)
    let x: Array2<f64> = array![
        [1.0, 2.0, 1.1, 2.1], [1.5, 1.8, 1.4, 1.9], [1.2, 2.1, 1.3, 2.0], // Cluster 1
        [5.0, 8.0, 5.2, 8.2], [4.8, 7.9, 4.9, 7.8], [5.1, 8.1, 5.3, 8.0], // Cluster 2
        [9.0, 3.0, 9.1, 3.1], [8.8, 3.2, 8.9, 3.0], [9.2, 2.9, 9.0, 3.3]  // Cluster 3
    ];

    // Create dataset
    let dataset = DatasetBase::from(x.clone());

    // Apply PCA to reduce to 2D
    let pca = Pca::params(2)
        .fit(&dataset)
        .expect("PCA fitting failed");

    // Transform data
    let reduced = pca.transform(&x);
    println!("Reduced Data (2D):\n{:?}", reduced);

    // Compute explained variance ratio
    let evr = pca.explained_variance_ratio();
    println!("Explained Variance Ratio: {:?}", evr);
    let total_evr = evr.iter().sum::<f64>();
    println!("Total EVR: {}", total_evr);

    // Compute reconstruction error
    let reconstructed = pca.inverse_transform(&reduced);
    let error = x.iter().zip(reconstructed.iter())
        .map(|(xi, ri)| (xi - ri).powi(2))
        .sum::<f64>() / x.nrows() as f64;
    println!("Reconstruction Error: {}", error);
}
```
Ensure Dependencies:
- Verify `Cargo.toml` includes:

```toml
[dependencies]
linfa = "0.7.1"
linfa-reduction = "0.7.0"
ndarray = "0.15.0"
```

- Run `cargo build`.
- Verify the build completes without errors.
Run the Program:

```bash
cargo run
```

Expected Output (approximate):

```
Reduced Data (2D):
[[-4.2, 0.1], [-4.0, -0.2], [-4.1, 0.0],
 [2.5, 0.3], [2.4, 0.1], [2.6, 0.2],
 [3.1, -0.1], [3.0, 0.0], [3.2, -0.2]]
Explained Variance Ratio: [0.85, 0.10]
Total EVR: 0.95
Reconstruction Error: 0.05
```
Understanding the Results
- Dataset: Synthetic 4D features form three clusters, mimicking customer data.
- Model: PCA reduces to 2D, capturing ~95% of variance (Total EVR ~0.95). The reduced data separates clusters effectively.
- Reconstruction Error: Low error (~0.05) indicates minimal information loss.
- Under the Hood: linfa uses SVD for PCA, leveraging nalgebra's optimized linear algebra routines. Rust's memory safety ensures robust handling of large matrices, unlike C++ where manual memory management risks leaks. The SVD computation avoids explicit covariance matrix formation, sidestepping the $O(nd^2 + d^3)$ cost for large $d$ and outperforming Python's scikit-learn for high-dimensional data.
- Evaluation: High EVR and low reconstruction error confirm effective dimensionality reduction, suitable for tasks like visualization or preprocessing for ML models.
This lab introduces dimensionality reduction, preparing for model evaluation.
Next Steps
Continue to Model Evaluation for performance metrics, or revisit K-Means Clustering.
Further Reading
- An Introduction to Statistical Learning by James et al. (Chapter 12)
- Hands-On Machine Learning by Géron (Chapter 8)
- linfa Documentation: github.com/rust-ml/linfa