Experimental Design & A/B Testing in ML

Experimental design provides a structured approach to testing hypotheses and evaluating interventions, with A/B testing as a key method in machine learning (ML) to compare models, features, or system changes. By controlling variables, randomizing assignments, and analyzing outcomes, experimental design ensures reliable conclusions about causal effects. In ML, A/B testing is used to optimize algorithms, validate feature impacts, and improve user-facing systems like recommendation engines.

This eighteenth and final lecture in the “Statistics Foundations for AI/ML” series builds on causal inference and time series, exploring principles of experimental design, A/B testing, factorial designs, power analysis, and their ML applications. We’ll provide intuitive explanations, mathematical foundations, and practical implementations in Python and Rust, concluding the series with a robust toolkit for ML experimentation.

1. Why Experimental Design and A/B Testing Matter in ML

ML systems require rigorous evaluation to ensure improvements are real, not due to chance or bias. Experimental design and A/B testing:

Isolate causal effects of changes (e.g., new model vs. baseline).
Control confounding variables through randomization.
Optimize systems via data-driven decisions.

ML Connection

Model Comparison: Test if a new algorithm improves accuracy.
Feature Evaluation: Assess feature impact on performance.
System Optimization: Improve user metrics (e.g., click-through rates).

::: info A/B testing is like a taste test: you compare two recipes to see which one customers prefer, ensuring fairness with randomization. :::

Example

Test two recommendation algorithms: A/B test measures click-through rate differences.

2. Principles of Experimental Design

Randomization: Assign units (e.g., users, samples) randomly to treatments to eliminate bias.

Control: Include a baseline group to measure effect size.

Replication: Multiple observations per treatment for reliable estimates.

Blocking: Group similar units to reduce variability.

ML Insight

Randomization ensures unbiased model comparisons.

3. A/B Testing: The Core Framework

A/B Testing: Randomly assign units to two groups (A: control, B: treatment), compare outcomes.

Hypothesis:

H₀: No difference (e.g., μ_A = μ_B).
H₁: Difference exists (e.g., μ_B > μ_A).

Test Statistic: Often t-test for means or z-test for proportions.

p-value: Probability of observing result under H₀.

ML Application

Compare model accuracies or user engagement metrics.

Example: A/B test on click-through rates, p<0.05 suggests new version improves.

4. Factorial Designs: Testing Multiple Factors

Factorial Design: Test multiple factors (e.g., model type, learning rate) and interactions.

2^k Design: k factors, each at two levels (e.g., high/low).

Model: y = μ + α_i + β_j + (αβ)_{ij} + ε, for two factors.

ML Connection

Test combinations of hyperparameters (e.g., optimizer, batch size).

Example: 2² design tests model type and regularization strength.

5. Power Analysis and Sample Size

Power: 1-β, probability of rejecting false H₀.

Factors:

Effect size (e.g., mean difference).
α (significance level, e.g., 0.05).
Sample size n.
Variance σ².

Sample Size Formula (two-sample t-test):

[ n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2} ]

δ effect size, z critical values.

ML Application

Determine n for detecting model improvement.

6. Multiple Testing in Experiments

Multiple A/B tests inflate Type I errors.

Corrections:

Bonferroni: α’ = α/m.
FDR (Benjamini-Hochberg): Control false positives.

In ML: Adjust for multiple feature tests.

7. Applications in Machine Learning

Model Comparison: A/B test new algorithms vs. baseline.
Feature Selection: Test feature impact on performance.
System Optimization: Improve metrics (e.g., conversion rates).
Hyperparameter Tuning: Factorial designs for combinations.

Challenges

Non-i.i.d. Data: Time-series, user interactions.
Small Effects: Require large n.
Multiple Testing: False positives.

8. Numerical A/B Testing and Design

Implement A/B tests, factorial designs.

::: code-group

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower

# A/B test: Compare model accuracies
model_a = np.random.normal(0.85, 0.05, 100)  # Control
model_b = np.random.normal(0.87, 0.05, 100)  # Treatment
t_stat, p_val = ttest_ind(model_a, model_b)
print("A/B Test: t=", t_stat, "p=", p_val)

# Power analysis
power = TTestIndPower()
n = power.solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1)
print("Sample size needed:", n)

# Factorial design (simplified 2x2)
data = {
    'model1_low': np.random.normal(0.85, 0.05, 50),
    'model1_high': np.random.normal(0.87, 0.05, 50),
    'model2_low': np.random.normal(0.90, 0.05, 50),
    'model2_high': np.random.normal(0.92, 0.05, 50)
}
from statsmodels.formula.api import ols
import pandas as pd
df = pd.DataFrame({
    'y': np.concatenate([v for v in data.values()]),
    'model': ['model1']*100 + ['model2']*100,
    'reg': ['low']*50 + ['high']*50 + ['low']*50 + ['high']*50
})
model = ols('y ~ model + reg + model:reg', data=df).fit()
print("Factorial ANOVA:", model.summary())

# ML: Feature A/B test
from sklearn.ensemble import RandomForestClassifier
X = np.random.rand(200, 2)
y = (X[:,0] + X[:,1] > 1).astype(int)
X_new = np.hstack([X, np.random.rand(200, 1)])
model_base = RandomForestClassifier(random_state=0).fit(X, y)
model_new = RandomForestClassifier(random_state=0).fit(X_new, y)
base_acc = model_base.score(X, y)
new_acc = model_new.score(X_new, y)
print("Feature A/B test: Base=", base_acc, "New=", new_acc)

use rand::Rng;
use rand_distr::Normal;

fn t_test_two_sample(x1: &[f64], x2: &[f64]) -> (f64, f64) {
    let n1 = x1.len() as f64;
    let n2 = x2.len() as f64;
    let mean1 = x1.iter().sum::<f64>() / n1;
    let mean2 = x2.iter().sum::<f64>() / n2;
    let var1 = x1.iter().map(|&x| (x - mean1).powi(2)).sum::<f64>() / (n1 - 1.0);
    let var2 = x2.iter().map(|&x| (x - mean2).powi(2)).sum::<f64>() / (n2 - 1.0);
    let t = (mean1 - mean2) / (var1 / n1 + var2 / n2).sqrt();
    (t, 0.0)  // p-value requires t-dist table
}

fn main() {
    let mut rng = rand::thread_rng();
    let model_a: Vec<f64> = (0..100).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect();
    let model_b: Vec<f64> = (0..100).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect();
    let (t, p) = t_test_two_sample(&model_a, &model_b);
    println!("A/B Test: t={} p={}", t, p);

    // Simplified factorial design (2x2 ANOVA-like)
    let data = [
        (0..50).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
        (0..50).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
        (0..50).map(|_| Normal::new(0.90, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
        (0..50).map(|_| Normal::new(0.92, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>()
    ];
    let means = data.iter().map(|d| d.iter().sum::<f64>() / d.len() as f64).collect::<Vec<f64>>();
    println!("Factorial means: {:?}", means);
}

:::

Implements A/B testing and factorial design.

8. Theoretical Foundations

Randomization: Eliminates confounding.

Factorial Design: Models main effects and interactions.

Power Analysis: Ensures sufficient n for detection.

ML Insight

A/B testing validates ML system improvements.

9. Challenges in ML Experiments

Non-i.i.d. Data: Time-series, user dependencies.
Multiple Testing: False positives require correction.
Small Effect Sizes: Large n needed.

10. Key ML Takeaways

Experimental design controls bias: Randomization.
A/B testing compares: Models, features.
Factorial designs test interactions: Multiple factors.
Power ensures detection: Sample size.
Code implements experiments: Practical ML.

Experimental design drives ML validation.

11. Summary

Explored experimental design, A/B testing, factorial designs, and power analysis, with ML applications in model optimization and evaluation. Examples and Python/Rust code bridge theory to practice. Concludes series with ML experimentation toolkit.

Word count: Approximately 3000.

Experimental Design & A/B Testing in ML

Experimental Design & A/B Testing in ML

1. Why Experimental Design and A/B Testing Matter in ML

ML Connection

Example

2. Principles of Experimental Design

ML Insight

3. A/B Testing: The Core Framework

ML Application

4. Factorial Designs: Testing Multiple Factors

ML Connection

5. Power Analysis and Sample Size

ML Application

6. Multiple Testing in Experiments

7. Applications in Machine Learning

Challenges

8. Numerical A/B Testing and Design

8. Theoretical Foundations

ML Insight

9. Challenges in ML Experiments

10. Key ML Takeaways

11. Summary

Further Reading