Skip to content

Hypothesis Testing - p-values, t-tests, Chi-square

Hypothesis Testing - p-values, t-tests, Chi-square

Section titled “Hypothesis Testing - p-values, t-tests, Chi-square”

Hypothesis testing is a statistical framework for making decisions about population parameters based on sample data, using p-values to assess evidence and tests like t-tests and Chi-square to compare groups or distributions. In machine learning (ML), hypothesis testing is crucial for evaluating model performance, comparing algorithms, and selecting significant features, ensuring robust and interpretable results.

This sixth lecture in the “Statistics Foundations for AI/ML” series builds on estimation and confidence intervals, exploring the principles of hypothesis testing, p-values, one-sample and two-sample t-tests, Chi-square tests, and their ML applications. We’ll provide intuitive explanations, mathematical derivations, and practical implementations in Python and Rust, preparing you for ANOVA and advanced inference.


ML often involves decisions: Is one model better than another? Are features significant? Hypothesis testing provides a principled way to answer these by:

  • Comparing sample statistics to population parameters.
  • Quantifying evidence against a null hypothesis.
  • Controlling error rates (Type I, Type II).
  • Model Comparison: Test if Model A outperforms Model B.
  • Feature Selection: Identify significant predictors.
  • A/B Testing: Evaluate treatment effects in experiments.

::: info Hypothesis testing is like a courtroom trial: you weigh evidence (data) to decide if a claim (null hypothesis) holds. :::

  • Test if a new ML model’s accuracy (0.85) exceeds a baseline (0.80) using a t-test.

Null Hypothesis (H₀): Default claim, e.g., no difference (μ=μ₀).

Alternative Hypothesis (H₁): What you aim to prove, e.g., μ≠μ₀ or μ>μ₀.

Test Statistic: Quantifies how far sample deviates from H₀ (e.g., t-statistic).

p-value: Probability of observing test statistic at least as extreme under H₀.

Significance Level (α): Threshold (e.g., 0.05) to reject H₀ if p<α.

  • Type I: Reject true H₀ (false positive, α).
  • Type II: Fail to reject false H₀ (false negative, β).
  • p-values guide feature importance in tree-based models.

p-value = P(observe test statistic | H₀).

Small p-value (e.g., <0.05) suggests H₀ unlikely.

  • Not probability H₀ true.
  • Lower p → stronger evidence against H₀.
  • Test significance of model improvements.

Example: t-test for mean=0, p=0.03 suggests reject H₀.


Tests if sample mean \bar{x} differs from μ₀.

Statistic: t = (\bar{x} - μ₀) / (s / \sqrt{n}), s sample std dev, n sample size.

Distribution: t ~ t_{n-1} under H₀: μ=μ₀.

CI: \bar{x} ± t_{n-1,α/2} \cdot s/\sqrt{n}.

From CLT, \bar{x} ~ N(μ, σ²/n), use s for σ, t-dist for small n.

  • Test if model error mean differs from zero.

Example: n=30, \bar{x}=10, s=2, H₀: μ=9, t≈2.74, p<0.05, reject H₀.


Compares means of two groups.

Statistic: t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1²/n_1 + s_2²/n_2}.

Assumptions: Normality, equal variances (or Welch’s for unequal).

  • Compare two models’ accuracies.

Example: Model A: \bar{x}_1=0.85, n_1=100; Model B: \bar{x}_2=0.80, n_2=100, p<0.05 suggests difference.


Goodness-of-Fit: Test if observed frequencies match expected.

[ \chi² = \sum \frac{(O_i - E_i)²}{E_i} ]

Independence: Test if variables independent in contingency table.

[ \chi² = \sum \frac{(O_{ij} - E_{ij})²}{E_{ij}}, E_{ij} = (row total \cdot column total) / n ]

Distribution: χ² with df=(k-1) or (r-1)(c-1).

  • Feature selection: Test feature-label independence.

Example: Contingency table, p<0.05 suggests dependence.


Power: 1-β, probability of rejecting false H₀.

Depends on effect size, α, n.

In ML: Ensure sufficient n for detecting model improvements.


  1. Model Comparison: t-test for accuracy differences.
  2. Feature Selection: Chi-square for categorical features.
  3. A/B Testing: Test treatment effects.
  4. Diagnostics: Check residual assumptions.
  • Multiple Testing: Adjust p-values (Bonferroni).
  • Non-Normality: Use non-parametric tests.

Perform t-tests, Chi-square.

::: code-group

import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, chi2_contingency
# One-sample t-test
data = np.random.normal(10, 2, 30)
t_stat, p_val = ttest_1samp(data, popmean=9)
print("One-sample t-test: t=", t_stat, "p=", p_val)
# Two-sample t-test
data1 = np.random.normal(0.85, 0.1, 100)
data2 = np.random.normal(0.80, 0.1, 100)
t_stat, p_val = ttest_ind(data1, data2)
print("Two-sample t-test: t=", t_stat, "p=", p_val)
# Chi-square independence
table = [[10, 20, 30], [20, 20, 20]] # Contingency
chi2, p, dof, _ = chi2_contingency(table)
print("Chi-square: stat=", chi2, "p=", p)
# ML: Feature significance
features = np.array([[1,0], [0,1], [1,1], [0,0]])
labels = [1,0,1,0]
table = [[sum((features[:,0] == i) & (labels == j)) for j in [0,1]] for i in [0,1]]
chi2, p, _, _ = chi2_contingency(table)
print("Feature-label Chi-square p:", p)
use rand::Rng;
use rand_distr::Normal;
fn t_stat_1samp(data: &[f64], popmean: f64) -> (f64, f64) {
let n = data.len() as f64;
let mean = data.iter().sum::<f64>() / n;
let s = (data.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)).sqrt();
let t = (mean - popmean) / (s / n.sqrt());
// p-value approximation (simplified)
(t, 0.0) // Actual p requires t-dist table
}
fn chi2_test(table: &[[u64; 2]]) -> (f64, f64) {
let row_totals: Vec<u64> = table.iter().map(|row| row.iter().sum()).collect();
let col_totals: Vec<u64> = (0..2).map(|j| table.iter().map(|row| row[j]).sum()).collect();
let n: u64 = row_totals.iter().sum();
let mut chi2 = 0.0;
for i in 0..2 {
for j in 0..2 {
let e = row_totals[i] as f64 * col_totals[j] as f64 / n as f64;
chi2 += (table[i][j] as f64 - e).powi(2) / e;
}
}
(chi2, 0.0) // p-value requires chi2 dist
}
fn main() {
let mut rng = rand::thread_rng();
let normal = Normal::new(10.0, 2.0).unwrap();
let data: Vec<f64> = (0..30).map(|_| normal.sample(&mut rng)).collect();
let (t, p) = t_stat_1samp(&data, 9.0);
println!("One-sample t-test: t={} p={}", t, p);
// Chi-square
let table = [[10, 20], [20, 20]];
let (chi2, p) = chi2_test(&table);
println!("Chi-square: stat={} p={}", chi2, p);
}

:::

Performs t-tests, Chi-square tests.


Derive test statistics.

::: code-group

from sympy import symbols, sqrt, Sum, IndexedBase
n, mu0, s = symbols('n mu0 s', positive=True)
x_bar = symbols('x_bar')
t = (x_bar - mu0) / (s / sqrt(n))
print("t-statistic:", t)
x, y = IndexedBase('x'), IndexedBase('y')
n1, n2 = symbols('n1 n2', positive=True)
s1, s2 = symbols('s1 s2', positive=True)
t_two = (Sum(x[i], (i,1,n1))/n1 - Sum(y[i], (i,1,n2))/n2) / sqrt(s1**2/n1 + s2**2/n2)
print("Two-sample t:", t_two)
fn main() {
println!("t-statistic: (x_bar - μ0)/(s/√n)");
println!("Two-sample t: (x_bar1 - x_bar2)/√(s1²/n1 + s2²/n2)");
}

:::


  • Multiple Testing: False positives; use corrections.
  • Non-Normality: Non-parametric alternatives.
  • Small Samples: t-tests sensitive.

  • Hypothesis testing decisions: Model/feature significance.
  • p-values quantify evidence: Against H₀.
  • t-tests compare means: Model performance.
  • Chi-square for categorical: Feature selection.
  • Code tests practically: ML applications.

Testing drives ML validation.


Explored hypothesis testing, p-values, t-tests, Chi-square, with ML applications like model comparison. Examples and Python/Rust code bridge theory to practice. Prepares for ANOVA and resampling.

Word count: Approximately 3000.


  • Wasserman, All of Statistics (Ch. 10).
  • James, Introduction to Statistical Learning (Ch. 5).
  • Khan Academy: Hypothesis testing videos.
  • Rust: ‘statrs’ for statistical tests.