Correlation & Covariance in Data
Correlation & Covariance in Data
Section titled “Correlation & Covariance in Data”Covariance and correlation quantify relationships between variables, revealing how they move together or apart. In machine learning (ML), these measures are essential for feature engineering, identifying redundant variables, and performing dimensionality reduction via techniques like Principal Component Analysis (PCA). Covariance indicates the direction of linear relationships, while correlation normalizes this to a scale, making it easier to interpret. Together, they help ML practitioners understand data dependencies, improve model efficiency, and avoid overfitting.
This third lecture in the “Statistics Foundations for AI/ML” series builds on descriptive statistics and distributions, delving into covariance, correlation, their mathematical properties, and their applications in ML. We’ll provide intuitive explanations, rigorous derivations, and practical implementations in Python and Rust, preparing you for sampling and inference topics in the series.
1. Intuition Behind Covariance and Correlation
Section titled “1. Intuition Behind Covariance and Correlation”Covariance: Measures whether two variables increase or decrease together. Positive covariance means they move in the same direction; negative means opposite.
Correlation: Normalizes covariance to [-1,1], indicating strength and direction of a linear relationship. A correlation of 1 means perfect linear increase; -1, perfect decrease; 0, no linear relationship.
ML Connection
Section titled “ML Connection”- Feature Selection: High correlation suggests redundant features.
- PCA: Covariance matrix drives dimensionality reduction.
- Model Diagnostics: Correlated residuals indicate model issues.
::: info Covariance is like tracking how two dancers move together; correlation tells you how perfectly synchronized they are. :::
Example
Section titled “Example”- Heights and weights: Positive covariance (taller people tend to weigh more).
- Correlation ~0.7, strong but not perfect linear relation.
2. Formal Definition of Covariance
Section titled “2. Formal Definition of Covariance”For random variables X, Y with means E[X]=μ_X, E[Y]=μ_Y:
[ \text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y ]
Sample covariance (n samples):
[ s_{xy} = \frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y}) ]
Properties
Section titled “Properties”- Cov(X,X) = Var(X).
- Cov(X,Y) = Cov(Y,X) (symmetric).
- Cov(aX+b,cY+d) = ac Cov(X,Y).
- Cov(X,Y)=0 if X,Y independent (not converse).
ML Insight
Section titled “ML Insight”- Covariance matrix for multivariate data guides PCA.
Example: X=[1,2,3], Y=[2,4,6], \bar{x}=2, \bar{y}=4, s_{xy} = (1·2 + 0·0 + 1·2)/(3-1)=2.
3. Correlation: Normalized Covariance
Section titled “3. Correlation: Normalized Covariance”Pearson Correlation Coefficient:
[ \rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} ]
Sample correlation:
[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} ]
Properties
Section titled “Properties”- -1 ≤ ρ ≤ 1.
- ρ=±1 implies perfect linear relation.
- ρ=0 means no linear relation (nonlinear possible).
ML Application
Section titled “ML Application”- Feature redundancy: High |r| suggests collinearity.
Example: X,Y perfectly linear, r=1.
4. Covariance Matrix for Multiple Variables
Section titled “4. Covariance Matrix for Multiple Variables”For vector X=(X_1,…,X_p), covariance matrix Σ:
[ \Sigma_{ij} = \text{Cov}(X_i,X_j) ]
Diagonal: Variances. Off-diagonal: Covariances.
Sample covariance matrix:
[ S = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X}) ]
ML Connection
Section titled “ML Connection”- PCA: Eigen decomposition of Σ.
- Gaussian Models: Σ for multivariate normals.
5. Correlation vs. Causation
Section titled “5. Correlation vs. Causation”Correlation ≠ causation. High correlation may reflect confounding.
In ML: Use causal inference for true effects.
Example: Ice cream sales correlate with drownings (summer confounder).
6. Correlation Types Beyond Pearson
Section titled “6. Correlation Types Beyond Pearson”- Spearman: Rank-based, for monotonic relations.
- Kendall: Concordance-based, robust for small samples.
ML Insight
Section titled “ML Insight”- Spearman for non-linear feature relationships.
7. Applications in Machine Learning
Section titled “7. Applications in Machine Learning”- Feature Selection: Remove high |r| features to reduce redundancy.
- PCA: Cov matrix eigenvalues for dim reduction.
- Regularization: Ridge penalizes correlated weights.
- Diagnostics: Correlated errors suggest model misspecification.
Challenges
Section titled “Challenges”- High-dim: Σ estimation noisy.
- Nonlinear relations: Pearson misses.
8. Numerical Computations of Covariance and Correlation
Section titled “8. Numerical Computations of Covariance and Correlation”Compute sample covariance, correlation, matrix.
::: code-group
import numpy as npfrom scipy.stats import spearmanr
# Sample covariance and correlationX = np.array([1, 2, 3, 4])Y = np.array([2, 4, 5, 8])cov = np.cov(X, Y, bias=False)[0,1]corr = np.corrcoef(X, Y)[0,1]print("Cov(X,Y):", cov, "Corr(X,Y):", corr)
# Covariance matrixdata = np.array([[1,2], [2,4], [3,5], [4,8]])cov_matrix = np.cov(data.T, bias=False)print("Cov matrix:", cov_matrix)
# ML: Feature correlationfeatures = np.random.multivariate_normal([0,0], [[1,0.5],[0.5,1]], 100)corr_matrix = np.corrcoef(features.T)print("Feature corr matrix:", corr_matrix)
# Spearmanspearman_corr, _ = spearmanr(X, Y)print("Spearman corr:", spearman_corr)fn cov(x: &[f64], y: &[f64]) -> f64 { let n = x.len() as f64; let mean_x = x.iter().sum::<f64>() / n; let mean_y = y.iter().sum::<f64>() / n; x.iter().zip(y.iter()).map(|(&xi, &yi)| (xi - mean_x) * (yi - mean_y)).sum::<f64>() / (n - 1.0)}
fn corr(x: &[f64], y: &[f64]) -> f64 { let cov_xy = cov(x, y); let var_x = cov(x, x); let var_y = cov(y, y); cov_xy / (var_x * var_y).sqrt()}
fn main() { let x = [1.0, 2.0, 3.0, 4.0]; let y = [2.0, 4.0, 5.0, 8.0]; println!("Cov(X,Y): {}", cov(&x, &y)); println!("Corr(X,Y): {}", corr(&x, &y));
// Cov matrix let data = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 8.0]]; let n = data.len() as f64; let mean = data.iter().fold([0.0, 0.0], |acc, row| [acc[0] + row[0], acc[1] + row[1]]); let mean = [mean[0] / n, mean[1] / n]; let mut cov_matrix = [[0.0; 2]; 2]; for i in 0..2 { for j in 0..2 { cov_matrix[i][j] = data.iter().map(|row| (row[i] - mean[i]) * (row[j] - mean[j])).sum::<f64>() / (n - 1.0); } } println!("Cov matrix: {:?}", cov_matrix);}:::
Computes covariance, correlation, and matrices.
9. Symbolic Computations with SymPy
Section titled “9. Symbolic Computations with SymPy”Derive covariance, correlation.
::: code-group
from sympy import symbols, Sum, IndexedBase, Indexed
n = symbols('n', integer=True, positive=True)x, y = IndexedBase('x'), IndexedBase('y')mean_x = (1/n) * Sum(x[i], (i, 1, n))mean_y = (1/n) * Sum(y[i], (i, 1, n))cov = (1/(n-1)) * Sum((x[i] - mean_x) * (y[i] - mean_y), (i, 1, n))print("Cov(X,Y):", cov)
var_x = (1/(n-1)) * Sum((x[i] - mean_x)**2, (i, 1, n))var_y = (1/(n-1)) * Sum((y[i] - mean_y)**2, (i, 1, n))corr = cov / sqrt(var_x * var_y)print("Corr(X,Y):", corr)fn main() { println!("Cov(X,Y): (1/(n-1)) sum (x_i - mean_x)(y_i - mean_y)"); println!("Corr(X,Y): Cov(X,Y) / sqrt(Var(X) Var(Y))");}:::
10. Challenges in ML Applications
Section titled “10. Challenges in ML Applications”- High-Dimensionality: Cov matrix estimation noisy.
- Nonlinear Relationships: Pearson misses; use Spearman.
- Causation Misinterpretation: Requires careful analysis.
11. Key ML Takeaways
Section titled “11. Key ML Takeaways”- Covariance tracks co-movement: For multivariate models.
- Correlation standardizes: Easy interpretation.
- Cov matrix powers PCA: Dim reduction.
- Feature selection critical: Avoid redundancy.
- Code computes: Practical correlations.
Covariance and correlation shape ML data understanding.
12. Summary
Section titled “12. Summary”Explored covariance and correlation, their properties, derivations, and ML applications like PCA and feature selection. Examples and Python/Rust code bridge theory to practice. Prepares for sampling and inference.
Word count: Approximately 3000.
Further Reading
Section titled “Further Reading”- Wasserman, All of Statistics (Ch. 4).
- James, Introduction to Statistical Learning (Ch. 10).
- 3Blue1Brown: Correlation videos.
- Rust: ‘nalgebra’ for matrices, ‘statrs’ for stats.