
Correlation & Covariance in Data

Covariance and correlation quantify relationships between variables, revealing how they move together or apart. In machine learning (ML), these measures are essential for feature engineering, identifying redundant variables, and performing dimensionality reduction via techniques like Principal Component Analysis (PCA). Covariance indicates the direction of a linear relationship, while correlation normalizes it to a fixed [-1, 1] scale, making it easier to interpret. Together, they help ML practitioners understand data dependencies, improve model efficiency, and avoid overfitting.

This third lecture in the "Statistics Foundations for AI/ML" series builds on descriptive statistics and distributions, delving into covariance, correlation, their mathematical properties, and their applications in ML. We'll provide intuitive explanations, rigorous derivations, and practical implementations in Python and Rust, preparing you for sampling and inference topics in the series.


1. Intuition Behind Covariance and Correlation

Covariance: Measures whether two variables increase or decrease together. Positive covariance means they move in the same direction; negative means opposite.

Correlation: Normalizes covariance to [-1,1], indicating strength and direction of a linear relationship. A correlation of 1 means perfect linear increase; -1, perfect decrease; 0, no linear relationship.

ML Connection

  • Feature Selection: High correlation suggests redundant features.
  • PCA: Covariance matrix drives dimensionality reduction.
  • Model Diagnostics: Correlated residuals indicate model issues.

INFO

Covariance is like tracking how two dancers move together; correlation tells you how perfectly synchronized they are.

Example

  • Heights and weights: Positive covariance (taller people tend to weigh more).
  • Correlation ≈ 0.7: a strong but not perfect linear relationship (simulated in the sketch below).
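
A minimal simulation of this example, assuming illustrative means and standard deviations (170 cm / 7 cm for height, 70 kg / 12 kg for weight) and a target correlation of 0.7; the numbers are invented for illustration, not real anthropometric data:

python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative parameters: height (cm), weight (kg), correlation 0.7
mean = [170.0, 70.0]
cov = [[7.0**2, 0.7 * 7.0 * 12.0],
       [0.7 * 7.0 * 12.0, 12.0**2]]

heights, weights = rng.multivariate_normal(mean, cov, size=1000).T

print("Cov(height, weight):", np.cov(heights, weights)[0, 1])        # positive
print("Corr(height, weight):", np.corrcoef(heights, weights)[0, 1])  # roughly 0.7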

2. Formal Definition of Covariance

For random variables X, Y with means E[X]=μ_X, E[Y]=μ_Y:

[ \text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y ]

Sample covariance (n samples):

[ s_{xy} = \frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y}) ]

Properties

  • Cov(X,X) = Var(X).
  • Cov(X,Y) = Cov(Y,X) (symmetric).
  • Cov(aX+b,cY+d) = ac Cov(X,Y).
  • If X and Y are independent, then Cov(X,Y) = 0 (the converse does not hold).

ML Insight

  • Covariance matrix for multivariate data guides PCA.

Example: X=[1,2,3], Y=[2,4,6], \bar{x}=2, \bar{y}=4, s_{xy} = ((-1)(-2) + 0·0 + (1)(2))/(3-1) = 2 (see the check below).
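
A quick numerical check of this hand computation (both the formula above and np.cov use the n-1 denominator):

python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Sample covariance from the definition: sum of (x_i - mean_x)(y_i - mean_y), divided by n-1
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(s_xy)                # 2.0
print(np.cov(x, y)[0, 1])  # 2.0, matches the hand calculation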


3. Correlation: Normalized Covariance

Pearson Correlation Coefficient:

[ \rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} ]

Sample correlation:

[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} ]

Properties

  • -1 ≤ ρ ≤ 1.
  • ρ=±1 implies perfect linear relation.
  • ρ=0 means no linear relation (nonlinear possible).

ML Application

  • Feature redundancy: High |r| suggests collinearity.

Example: if Y is an exact increasing linear function of X, then r = 1 (checked below).
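
A minimal check, using an arbitrary exact linear relationship (Y = 3X + 1) chosen only for illustration:

python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0  # exact increasing linear relationship

# Sample correlation from the definition, then NumPy's built-in
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(num / den)                # 1.0 (up to floating-point error)
print(np.corrcoef(x, y)[0, 1])  # 1.0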


4. Covariance Matrix for Multiple Variables

For vector X=(X_1,...,X_p), covariance matrix Σ:

[ \Sigma_{ij} = \text{Cov}(X_i,X_j) ]

Diagonal: Variances. Off-diagonal: Covariances.

Sample covariance matrix:

[ S = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X}) ]

ML Connection

  • PCA: Eigen decomposition of Σ.
  • Gaussian Models: Σ for multivariate normals.
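
A minimal sketch of both points, assuming a small synthetic dataset: build S from the centered data matrix and take its eigendecomposition, which is what PCA does with the covariance matrix.

python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

# Sample covariance matrix: S = X_c^T X_c / (n - 1), with X_c the column-centered data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (len(X) - 1)
print(np.allclose(S, np.cov(X, rowvar=False)))  # True

# PCA: eigenvectors of S are the principal directions, eigenvalues the variances along them
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # sort from largest to smallest variance
print("Explained variances:", eigvals[order])
print("Principal directions (columns):\n", eigvecs[:, order])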

5. Correlation vs. Causation

Correlation ≠ causation. High correlation may reflect confounding.

In ML: Use causal inference for true effects.

Example: Ice cream sales correlate with drownings (summer confounder).
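
The confounder is easy to reproduce in a simulation: below, a shared temperature variable drives both series, so they correlate strongly even though neither causes the other. All numbers are invented for illustration.

python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical confounder: daily temperature drives both quantities
temperature = rng.normal(25.0, 5.0, size=365)
ice_cream_sales = 10.0 * temperature + rng.normal(0.0, 20.0, size=365)
drownings = 0.3 * temperature + rng.normal(0.0, 1.0, size=365)

# Strong correlation despite no causal link between the two series
print(np.corrcoef(ice_cream_sales, drownings)[0, 1])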


6. Correlation Types Beyond Pearson

  • Spearman: Rank-based, for monotonic relations.
  • Kendall: Concordance-based, robust for small samples.

ML Insight

  • Spearman for non-linear feature relationships.
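
A minimal comparison on a monotonic but nonlinear relationship (y = e^x): the rank-based coefficients report a perfect monotonic association, while Pearson's coefficient does not.

python
import numpy as np
from scipy.stats import spearmanr, kendalltau

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)  # monotonic but strongly nonlinear

pearson = np.corrcoef(x, y)[0, 1]
spearman, _ = spearmanr(x, y)
kendall, _ = kendalltau(x, y)
print("Pearson:", pearson)    # noticeably below 1
print("Spearman:", spearman)  # 1.0, ranks match exactly
print("Kendall:", kendall)    # 1.0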

7. Applications in Machine Learning

  1. Feature Selection: Remove one of each pair of high-|r| features to reduce redundancy (see the sketch after this list).
  2. PCA: Eigenvalues and eigenvectors of the covariance matrix drive dimensionality reduction.
  3. Regularization: Ridge regression stabilizes coefficient estimates when features are highly correlated (multicollinearity).
  4. Diagnostics: Correlated errors suggest model misspecification.
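
A minimal sketch of application 1, greedily dropping any feature whose absolute correlation with an already-kept feature exceeds a threshold; the synthetic data and the 0.9 cutoff are illustrative assumptions.

python
import numpy as np

rng = np.random.default_rng(3)

# Three features; the third is nearly a copy of the first
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
f3 = f1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([f1, f2, f3])

corr = np.corrcoef(X, rowvar=False)
threshold = 0.9
keep = []
for j in range(X.shape[1]):
    # Keep feature j only if it is not highly correlated with any already-kept feature
    if all(abs(corr[j, k]) < threshold for k in keep):
        keep.append(j)
print("Kept feature indices:", keep)  # e.g. [0, 1]; feature 2 dropped as redundant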

Challenges

  • High-dim: Σ estimation noisy.
  • Nonlinear relations: Pearson misses.

8. Numerical Computations of Covariance and Correlation

Compute sample covariance, correlation, matrix.

python
import numpy as np
from scipy.stats import spearmanr

# Sample covariance and correlation
X = np.array([1, 2, 3, 4])
Y = np.array([2, 4, 5, 8])
cov = np.cov(X, Y, bias=False)[0,1]
corr = np.corrcoef(X, Y)[0,1]
print("Cov(X,Y):", cov, "Corr(X,Y):", corr)

# Covariance matrix
data = np.array([[1,2], [2,4], [3,5], [4,8]])
cov_matrix = np.cov(data.T, bias=False)
print("Cov matrix:", cov_matrix)

# ML: Feature correlation
features = np.random.multivariate_normal([0,0], [[1,0.5],[0.5,1]], 100)
corr_matrix = np.corrcoef(features.T)
print("Feature corr matrix:", corr_matrix)

# Spearman
spearman_corr, _ = spearmanr(X, Y)
print("Spearman corr:", spearman_corr)
rust
fn cov(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    x.iter().zip(y.iter()).map(|(&xi, &yi)| (xi - mean_x) * (yi - mean_y)).sum::<f64>() / (n - 1.0)
}

fn corr(x: &[f64], y: &[f64]) -> f64 {
    let cov_xy = cov(x, y);
    let var_x = cov(x, x);
    let var_y = cov(y, y);
    cov_xy / (var_x * var_y).sqrt()
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0];
    let y = [2.0, 4.0, 5.0, 8.0];
    println!("Cov(X,Y): {}", cov(&x, &y));
    println!("Corr(X,Y): {}", corr(&x, &y));

    // Cov matrix
    let data = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 8.0]];
    let n = data.len() as f64;
    let mean = data.iter().fold([0.0, 0.0], |acc, row| [acc[0] + row[0], acc[1] + row[1]]);
    let mean = [mean[0] / n, mean[1] / n];
    let mut cov_matrix = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            cov_matrix[i][j] = data.iter().map(|row| (row[i] - mean[i]) * (row[j] - mean[j])).sum::<f64>() / (n - 1.0);
        }
    }
    println!("Cov matrix: {:?}", cov_matrix);
}

Both snippets compute the sample covariance, the correlation, and the covariance matrix; the Python version additionally computes a feature correlation matrix and Spearman's coefficient.


9. Symbolic Computations with SymPy

Derive covariance, correlation.

python
from sympy import symbols, Sum, IndexedBase, sqrt

n = symbols('n', integer=True, positive=True)
i = symbols('i', integer=True)
x, y = IndexedBase('x'), IndexedBase('y')
mean_x = (1/n) * Sum(x[i], (i, 1, n))
mean_y = (1/n) * Sum(y[i], (i, 1, n))
cov = (1/(n-1)) * Sum((x[i] - mean_x) * (y[i] - mean_y), (i, 1, n))
print("Cov(X,Y):", cov)

var_x = (1/(n-1)) * Sum((x[i] - mean_x)**2, (i, 1, n))
var_y = (1/(n-1)) * Sum((y[i] - mean_y)**2, (i, 1, n))
corr = cov / sqrt(var_x * var_y)
print("Corr(X,Y):", corr)
rust
fn main() {
    println!("Cov(X,Y): (1/(n-1)) sum (x_i - mean_x)(y_i - mean_y)");
    println!("Corr(X,Y): Cov(X,Y) / sqrt(Var(X) Var(Y))");
}

10. Challenges in ML Applications

  • High-Dimensionality: Cov matrix estimation noisy.
  • Nonlinear Relationships: Pearson misses; use Spearman.
  • Causation Misinterpretation: Requires careful analysis.

11. Key ML Takeaways

  • Covariance tracks co-movement: For multivariate models.
  • Correlation standardizes: Easy interpretation.
  • Cov matrix powers PCA: Dim reduction.
  • Feature selection critical: Avoid redundancy.
  • Code makes it concrete: Compute covariances and correlations directly in Python and Rust.

Covariance and correlation shape ML data understanding.


12. Summary

This lecture explored covariance and correlation: their properties, derivations, and ML applications such as PCA and feature selection. Worked examples and Python/Rust code bridge theory to practice, preparing you for the sampling and inference topics that follow in the series.



Further Reading

  • Wasserman, All of Statistics (Ch. 4).
  • James, Introduction to Statistical Learning (Ch. 10).
  • 3Blue1Brown: Correlation videos.
  • Rust: 'nalgebra' for matrices, 'statrs' for stats.