Multivariate Statistics - Correlated Features & MANOVA
Multivariate statistics analyze multiple variables simultaneously, capturing their interrelationships and dependencies. In machine learning (ML), understanding correlated features through covariance and correlation matrices is crucial for feature engineering, dimensionality reduction (e.g., PCA), and model interpretation. Multivariate Analysis of Variance (MANOVA) extends ANOVA to compare groups across multiple dependent variables, useful for assessing complex ML outcomes. These tools are essential for handling high-dimensional data and uncovering patterns in real-world datasets.
This fifteenth lecture in the “Statistics Foundations for AI/ML” series builds on nonparametric statistics and statistical significance, exploring multivariate concepts, covariance matrices, correlation analysis, MANOVA, and their ML applications. We’ll provide intuitive explanations, mathematical derivations, and practical implementations in Python and Rust, preparing you for time-series analysis and causal inference.
1. Why Multivariate Statistics Matter in ML
ML datasets often involve multiple features (e.g., image pixels, sensor readings), which are frequently correlated. Multivariate statistics:
- Quantify relationships via covariance and correlation.
- Reduce dimensionality by exploiting correlations (e.g., PCA).
- Compare groups across multiple outcomes using MANOVA.
ML Connection
- Feature Engineering: Remove redundant correlated features.
- PCA: Use covariance matrix for dimensionality reduction.
- MANOVA: Compare ML models across multiple metrics (e.g., accuracy, F1).
::: info
Multivariate statistics are like a 3D map of data, revealing how features dance together and how groups differ across multiple dimensions.
:::
Example
- Dataset with height, weight, age: Covariance shows how they co-vary; MANOVA tests if groups (e.g., genders) differ in all three.
2. Covariance and Correlation Matrices
Covariance Matrix
For p variables X = (X₁,…,Xₚ), the covariance matrix Σ is:
[ \Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] ]
Sample covariance matrix S:
[ S_{ij} = \frac{1}{n-1} \sum_{k=1}^n (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j) ]
Diagonal entries are variances; off-diagonal entries are covariances.
Correlation Matrix
Normalized covariance:
[ R_{ij} = \frac{S_{ij}}{\sqrt{S_{ii} S_{jj}}} ]
R_{ij} ∈ [-1, 1] measures the strength of the linear relationship between X_i and X_j.
Properties
- Σ symmetric, positive semi-definite.
- R diagonal = 1, R_{ij} = ρ(X_i, X_j).
ML Application
- PCA: Eigen decomposition of Σ for feature reduction.
Example: For height and weight data, S shows a positive covariance, and R ≈ 0.7 indicates a strong positive correlation.
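As a quick sanity check, the normalization formula above can be verified numerically. This is a minimal sketch assuming NumPy and simulated bivariate normal data with a true correlation of 0.7 (the seed and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated bivariate normal data with true correlation 0.7
X = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=500)

S = np.cov(X, rowvar=False)     # sample covariance matrix
d = np.sqrt(np.diag(S))         # standard deviations sqrt(S_ii)
R_manual = S / np.outer(d, d)   # R_ij = S_ij / sqrt(S_ii * S_jj)

print(R_manual)
print(np.allclose(R_manual, np.corrcoef(X, rowvar=False)))  # True
```

Normalizing S by the outer product of the standard deviations reproduces NumPy's `corrcoef` exactly, and the off-diagonal entry lands near the true correlation of 0.7.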
3. Multivariate Normal Distribution
The multivariate normal N(μ, Σ) models p-dimensional data:
[ f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) ]
where μ is the mean vector and Σ the covariance matrix.
Properties
- Marginals are normal.
- Σ determines feature dependencies.
ML Connection
- Gaussian processes, generative models assume multivariate normal.
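To make the density formula concrete, this sketch evaluates it term by term with NumPy and checks the result against `scipy.stats.multivariate_normal` (the particular μ, Σ, and evaluation point are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([0.5, -0.2])

# f(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)) / ((2*pi)^{p/2} |Sigma|^{1/2})
p = len(mu)
diff = x - mu
norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
pdf_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm_const

print(pdf_manual)
print(multivariate_normal(mu, Sigma).pdf(x))  # matches the manual value
```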
4. MANOVA: Comparing Groups Across Multiple Variables
MANOVA tests if k groups differ across p dependent variables.
H₀: μ₁ = μ₂ = … = μ_k (mean vectors equal).
Model: Y_{ij} = μ_i + ε_{ij}, ε_{ij} ~ N(0, Σ).
Test Statistics
- Wilks’ Lambda: Λ = |W| / |T|, where W is the within-group and T the total sum-of-squares-and-cross-products (SSCP) matrix.
- Pillai’s Trace, Hotelling’s T², Roy’s Largest Root: Alternative metrics.
These statistics are approximated by an F distribution to obtain a p-value.
ML Application
- Compare models across accuracy, precision, recall.
Example: MANOVA on two groups’ feature vectors; p < 0.05 suggests the group mean vectors differ.
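Wilks’ Lambda can be computed from first principles as a ratio of determinants of SSCP matrices. A minimal sketch on simulated two-group data (the group means, covariance, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
g1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 50)
g2 = rng.multivariate_normal([1, 1], [[1, 0.5], [0.5, 1]], 50)
X = np.vstack([g1, g2])

def sscp(a, center):
    """Sum-of-squares-and-cross-products matrix around a given center."""
    d = a - center
    return d.T @ d

W = sscp(g1, g1.mean(axis=0)) + sscp(g2, g2.mean(axis=0))  # within-group SSCP
T = sscp(X, X.mean(axis=0))                                # total SSCP

wilks = np.linalg.det(W) / np.linalg.det(T)
print("Wilks' Lambda:", wilks)  # near 0 favors rejecting H0; near 1 favors H0
```

Because the simulated group means differ by (1, 1), Λ comes out well below 1, consistent with rejecting H₀.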
5. Assumptions and Diagnostics
- Normality: Variables multivariate normal.
- Homogeneity: Equal covariance matrices across groups.
- Independence: Observations independent.
Diagnostics: Q-Q plots, Box’s M test.
In ML: Robust methods (e.g., permutation-based) if assumptions fail.
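The permutation-based fallback mentioned above can be sketched as follows; the test statistic (squared distance between group mean vectors), the sample sizes, and the seed are illustrative choices, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.normal([0, 0], 1.0, size=(30, 2))
g2 = rng.normal([1, 1], 1.0, size=(30, 2))
X = np.vstack([g1, g2])
labels = np.array([0] * 30 + [1] * 30)

def stat(X, labels):
    """Squared Euclidean distance between the two group mean vectors."""
    return np.sum((X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0)) ** 2)

observed = stat(X, labels)
# Reshuffle group labels to build the null distribution of the statistic
perms = [stat(X, rng.permutation(labels)) for _ in range(999)]
p_value = (1 + sum(s >= observed for s in perms)) / 1000
print("permutation p-value:", p_value)
```

No normality or equal-covariance assumption is needed: the p-value is the fraction of label shuffles producing a statistic at least as extreme as the observed one.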
6. Applications in Machine Learning
- Feature Engineering: Remove highly correlated features using R.
- PCA: Eigen decomposition of S for dimensionality reduction.
- MANOVA: Compare model performance across multiple metrics.
- Clustering: Use Σ in Gaussian mixture models.
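The correlation-based feature filter in the list above can be sketched as a greedy pass over R; the 0.95 threshold and the synthetic features are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly duplicates x1
x3 = rng.normal(size=n)              # independent feature
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)
# Greedy filter: drop any feature with |r| > 0.95 against an already-kept one
keep = []
for j in range(X.shape[1]):
    if all(abs(R[j, k]) <= 0.95 for k in keep):
        keep.append(j)
print("kept feature indices:", keep)  # x2 is dropped as a near-duplicate of x1
```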
Challenges
- High-Dimensionality: Σ estimation noisy.
- Non-Normality: Requires robust alternatives.
7. Numerical Multivariate Computations
Compute covariance, correlation, MANOVA.
::: code-group

```python [Python]
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal
from statsmodels.multivariate.manova import MANOVA
from sklearn.decomposition import PCA

# Covariance and correlation matrices
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 100)
cov_matrix = np.cov(data.T, bias=False)
corr_matrix = np.corrcoef(data.T)
print("Covariance matrix:", cov_matrix)
print("Correlation matrix:", corr_matrix)

# Multivariate normal PDF
mvn = multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]])
x = np.array([0, 0])
print("MVN PDF at (0,0):", mvn.pdf(x))

# MANOVA on two simulated groups
group1 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 50)
group2 = np.random.multivariate_normal([1, 1], [[1, 0.5], [0.5, 1]], 50)
data = np.vstack([group1, group2])
groups = ['A'] * 50 + ['B'] * 50
df = pd.DataFrame(data, columns=['var1', 'var2'])
df['group'] = groups
manova = MANOVA.from_formula('var1 + var2 ~ group', data=df)
print("MANOVA results:", manova.mv_test())

# ML: PCA on the covariance structure
pca = PCA(n_components=2)
pca.fit(data)
print("PCA explained variance:", pca.explained_variance_ratio_)
```

```rust [Rust]
use rand_distr::{Distribution, Normal};

fn cov_matrix(data: &[[f64; 2]]) -> [[f64; 2]; 2] {
    let n = data.len() as f64;
    let mean = data
        .iter()
        .fold([0.0, 0.0], |acc, row| [acc[0] + row[0], acc[1] + row[1]]);
    let mean = [mean[0] / n, mean[1] / n];
    let mut cov = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            cov[i][j] = data
                .iter()
                .map(|row| (row[i] - mean[i]) * (row[j] - mean[j]))
                .sum::<f64>()
                / (n - 1.0);
        }
    }
    cov
}

fn corr_matrix(cov: &[[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut corr = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            corr[i][j] = cov[i][j] / (cov[i][i] * cov[j][j]).sqrt();
        }
    }
    corr
}

fn main() {
    let mut rng = rand::thread_rng();
    let normal = Normal::new(0.0, 1.0).unwrap();
    // Simulate a bivariate normal with ρ = 0.5: x2 = 0.5·x1 + sqrt(1 − 0.25)·z
    let data: Vec<[f64; 2]> = (0..100)
        .map(|_| {
            let x1 = normal.sample(&mut rng);
            let x2 = 0.5 * x1 + (0.75_f64).sqrt() * normal.sample(&mut rng);
            [x1, x2]
        })
        .collect();
    let cov = cov_matrix(&data);
    let corr = corr_matrix(&cov);
    println!("Covariance matrix: {:?}", cov);
    println!("Correlation matrix: {:?}", corr);
}
```

:::
Computes covariance, correlation, and MANOVA.
8. Theoretical Insights
Covariance Matrix: Captures linear dependencies, basis for PCA.
MANOVA: Tests equality of mean vectors, generalizes ANOVA.
Multivariate Normal: Foundation for many ML models.
ML Insight
- Σ drives dimensionality reduction and clustering.
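To see concretely how Σ drives dimensionality reduction, PCA can be derived from the eigendecomposition of the sample covariance matrix alone. A minimal NumPy sketch (the simulated covariance and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=500)

# PCA from first principles: eigendecomposition of the sample covariance matrix
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # sort principal components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print("explained variance ratio:", explained)
```

The columns of `eigvecs` are the principal directions, and the normalized eigenvalues match the `explained_variance_ratio_` that scikit-learn's PCA reports.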
9. Challenges in ML Applications
- High-Dimensionality: Σ estimation noisy, requires regularization.
- Non-Normality: Use robust MANOVA or permutation tests.
- Computational Cost: Matrix operations scale poorly.
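For the high-dimensionality point, shrinkage estimators such as Ledoit-Wolf (available in scikit-learn) regularize the sample covariance toward a structured target. A sketch where the true covariance is the identity, so estimation error can be measured directly (the sizes n = 50, p = 40 are illustrative):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

# High-dimensional regime: p close to n makes the sample covariance noisy
rng = np.random.default_rng(5)
n, p = 50, 40
X = rng.normal(size=(n, p))  # true covariance is the identity

S = np.cov(X, rowvar=False)
lw = LedoitWolf().fit(X)

# Shrinkage pulls the estimate toward a scaled identity, reducing error
err_sample = np.linalg.norm(S - np.eye(p))
err_shrunk = np.linalg.norm(lw.covariance_ - np.eye(p))
print(err_sample, err_shrunk)
```

In this regime the shrunk estimate is markedly closer to the true covariance than the raw sample estimate.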
10. Key ML Takeaways
- Covariance captures relationships: For feature analysis.
- Correlation standardizes: Easy interpretation.
- MANOVA compares groups: Across multiple metrics.
- PCA leverages Σ: For dimensionality reduction.
- Code computes matrices: Practical multivariate stats.
Multivariate stats enhance ML’s high-dimensional analysis.
11. Summary
Explored multivariate statistics, including covariance, correlation, MANOVA, and their ML applications in feature engineering and group comparisons. Examples and Python/Rust code bridge theory to practice. Prepares for time-series and causal inference.
Further Reading
- Anderson, An Introduction to Multivariate Statistical Analysis.
- Hastie, Elements of Statistical Learning (Ch. 14).
- James, Introduction to Statistical Learning (Ch. 10).
- Rust: ‘nalgebra’ for matrices, ‘statrs’ for stats.