Experimental Design & A/B Testing in ML
Experimental design provides a structured approach to testing hypotheses and evaluating interventions, with A/B testing as a key method in machine learning (ML) to compare models, features, or system changes. By controlling variables, randomizing assignments, and analyzing outcomes, experimental design ensures reliable conclusions about causal effects. In ML, A/B testing is used to optimize algorithms, validate feature impacts, and improve user-facing systems like recommendation engines.
This eighteenth and final lecture in the "Statistics Foundations for AI/ML" series builds on causal inference and time series, exploring principles of experimental design, A/B testing, factorial designs, power analysis, and their ML applications. We'll provide intuitive explanations, mathematical foundations, and practical implementations in Python and Rust, concluding the series with a robust toolkit for ML experimentation.
1. Why Experimental Design and A/B Testing Matter in ML
ML systems require rigorous evaluation to ensure improvements are real, not due to chance or bias. Experimental design and A/B testing:
- Isolate causal effects of changes (e.g., new model vs. baseline).
- Control confounding variables through randomization.
- Optimize systems via data-driven decisions.
ML Connection
- Model Comparison: Test if a new algorithm improves accuracy.
- Feature Evaluation: Assess feature impact on performance.
- System Optimization: Improve user metrics (e.g., click-through rates).
INFO
A/B testing is like a taste test: you compare two recipes to see which one customers prefer, ensuring fairness with randomization.
Example
- Test two recommendation algorithms: A/B test measures click-through rate differences.
2. Principles of Experimental Design
Randomization: Assign units (e.g., users, samples) randomly to treatments to eliminate bias.
Control: Include a baseline group to measure effect size.
Replication: Multiple observations per treatment for reliable estimates.
Blocking: Group similar units to reduce variability.
ML Insight
- Randomization ensures unbiased model comparisons.
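As a minimal sketch of randomization and blocking (the user table and 'platform' field below are invented for illustration), the following Python snippet assigns hypothetical users to control and treatment, both completely at random and within blocks:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
# 1,000 hypothetical users; 'platform' is a blocking covariate (names are illustrative)
users = pd.DataFrame({
    'user_id': np.arange(1000),
    'platform': rng.choice(['mobile', 'desktop'], size=1000),
})
# Complete randomization: each user assigned to control (A) or treatment (B)
users['group'] = rng.choice(['A', 'B'], size=len(users))
# Blocked randomization: balance A/B within each platform block
users['group_blocked'] = None
for platform, idx in users.groupby('platform').groups.items():
    labels = np.tile(['A', 'B'], len(idx) // 2 + 1)[:len(idx)]
    users.loc[idx, 'group_blocked'] = rng.permutation(labels)
print(users.groupby(['platform', 'group_blocked']).size())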
3. A/B Testing: The Core Framework
A/B Testing: Randomly assign units to two groups (A: control, B: treatment), compare outcomes.
Hypothesis:
- H₀: No difference (e.g., μ_A = μ_B).
- H₁: Difference exists (e.g., μ_B > μ_A).
Test Statistic: Often t-test for means or z-test for proportions.
p-value: Probability of a result at least as extreme as the one observed, assuming H₀ is true.
ML Application
- Compare model accuracies or user engagement metrics.
Example: An A/B test on click-through rates where p < 0.05 suggests the new version improves on the control.
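As a worked sketch of this framework on made-up click data, the snippet below runs a one-sided two-proportion z-test on click-through rates with statsmodels:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
# Hypothetical click counts and impressions for variants A (control) and B (treatment)
clicks = np.array([520, 580])
impressions = np.array([10000, 10000])
# alternative='smaller' tests H1: p_A < p_B, i.e., the treatment has a higher CTR
z_stat, p_val = proportions_ztest(clicks, impressions, alternative='smaller')
print(f"z = {z_stat:.3f}, one-sided p = {p_val:.4f}")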
4. Factorial Designs: Testing Multiple Factors
Factorial Design: Test multiple factors (e.g., model type, learning rate) and interactions.
2^k Design: k factors, each at two levels (e.g., high/low).
Model (two factors): [ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} ], where α_i and β_j are main effects, (αβ)_{ij} is the interaction, and k indexes replicates.
ML Connection
- Test combinations of hyperparameters (e.g., optimizer, batch size).
Example: 2² design tests model type and regularization strength.
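As a small illustration (factor names invented for the example), the snippet below enumerates the four treatment cells of a 2² design over model type and regularization strength; Section 8 fits the corresponding two-factor model:
from itertools import product
# Hypothetical factors, each at two levels
factors = {
    'model': ['logistic', 'random_forest'],
    'regularization': ['low', 'high'],
}
# Full 2^2 design: every combination of levels is one treatment cell
for run, levels in enumerate(product(*factors.values()), start=1):
    print(f"run {run}: {dict(zip(factors.keys(), levels))}")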
5. Power Analysis and Sample Size
Power: 1 − β, the probability of correctly rejecting a false H₀.
Factors:
- Effect size (e.g., mean difference).
- α (significance level, e.g., 0.05).
- Sample size n.
- Variance σ².
Sample Size Formula (two-sample t-test):
[ n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2} ]
where δ is the minimum detectable mean difference, σ² the (common) variance, and z_{\alpha/2}, z_β standard normal critical values; n is the required sample size per group.
ML Application
- Determine n for detecting model improvement.
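As a minimal sketch (assuming σ = 0.05, a minimum detectable difference δ = 0.02, α = 0.05, and 80% power), the normal-approximation formula above can be evaluated directly and cross-checked against statsmodels:
import numpy as np
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower
sigma, delta, alpha, power = 0.05, 0.02, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value z_{alpha/2}
z_beta = norm.ppf(power)           # z_beta corresponding to the target power
n_formula = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
# statsmodels expects the standardized effect size d = delta / sigma
n_t_based = TTestIndPower().solve_power(effect_size=delta / sigma, alpha=alpha, power=power)
print(f"n per group: {np.ceil(n_formula):.0f} (normal approx.), {np.ceil(n_t_based):.0f} (t-based)")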
6. Multiple Testing in Experiments
Running multiple A/B tests inflates the overall Type I error rate.
Corrections:
- Bonferroni: α' = α/m.
- FDR (Benjamini-Hochberg): Control false positives.
In ML: Adjust for multiple feature tests.
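A minimal sketch (with made-up p-values from several feature tests) of applying both corrections with statsmodels:
from statsmodels.stats.multitest import multipletests
p_values = [0.001, 0.012, 0.034, 0.046, 0.21, 0.58]  # hypothetical per-feature p-values
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print("Bonferroni rejections:        ", reject_bonf)
print("Benjamini-Hochberg rejections:", reject_bh)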
7. Applications in Machine Learning
- Model Comparison: A/B test new algorithms vs. baseline.
- Feature Selection: Test feature impact on performance.
- System Optimization: Improve metrics (e.g., conversion rates).
- Hyperparameter Tuning: Factorial designs for combinations.
Challenges
- Non-i.i.d. Data: Time-series, user interactions.
- Small Effects: Require large n.
- Multiple Testing: False positives.
8. Numerical A/B Testing and Design
We implement an A/B test, a power analysis, and a simplified factorial design, first in Python and then in Rust.
Python implementation:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower
# A/B test: Compare model accuracies
model_a = np.random.normal(0.85, 0.05, 100) # Control
model_b = np.random.normal(0.87, 0.05, 100) # Treatment
t_stat, p_val = ttest_ind(model_a, model_b)
print("A/B Test: t=", t_stat, "p=", p_val)
# Power analysis
power = TTestIndPower()
n = power.solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1)
print("Sample size needed per group:", n)
# Factorial design (simplified 2x2)
data = {
'model1_low': np.random.normal(0.85, 0.05, 50),
'model1_high': np.random.normal(0.87, 0.05, 50),
'model2_low': np.random.normal(0.90, 0.05, 50),
'model2_high': np.random.normal(0.92, 0.05, 50)
}
from statsmodels.formula.api import ols
import pandas as pd
df = pd.DataFrame({
'y': np.concatenate([v for v in data.values()]),
'model': ['model1']*100 + ['model2']*100,
'reg': ['low']*50 + ['high']*50 + ['low']*50 + ['high']*50
})
from statsmodels.stats.anova import anova_lm
model = ols('y ~ model + reg + model:reg', data=df).fit()
print("Factorial ANOVA:\n", anova_lm(model, typ=2))
# ML: Feature A/B test
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = np.random.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_new = np.hstack([X, np.random.rand(200, 1)])  # candidate (noise) feature added
# Hold out a test set so the comparison reflects generalization, not training fit
X_tr, X_te, Xn_tr, Xn_te, y_tr, y_te = train_test_split(X, X_new, y, test_size=0.5, random_state=0)
model_base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
model_new = RandomForestClassifier(random_state=0).fit(Xn_tr, y_tr)
base_acc = model_base.score(X_te, y_te)
new_acc = model_new.score(Xn_te, y_te)
print("Feature A/B test: Base=", base_acc, "New=", new_acc)
Rust implementation:
use rand_distr::{Distribution, Normal};
// Welch's two-sample t statistic; the p-value is returned as 0.0 because
// computing it would require a t-distribution CDF (e.g., from the statrs crate).
fn t_test_two_sample(x1: &[f64], x2: &[f64]) -> (f64, f64) {
let n1 = x1.len() as f64;
let n2 = x2.len() as f64;
let mean1 = x1.iter().sum::<f64>() / n1;
let mean2 = x2.iter().sum::<f64>() / n2;
let var1 = x1.iter().map(|&x| (x - mean1).powi(2)).sum::<f64>() / (n1 - 1.0);
let var2 = x2.iter().map(|&x| (x - mean2).powi(2)).sum::<f64>() / (n2 - 1.0);
let t = (mean1 - mean2) / (var1 / n1 + var2 / n2).sqrt();
(t, 0.0) // p-value requires t-dist table
}
fn main() {
let mut rng = rand::thread_rng();
let model_a: Vec<f64> = (0..100).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect();
let model_b: Vec<f64> = (0..100).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect();
let (t, p) = t_test_two_sample(&model_a, &model_b);
println!("A/B Test: t={} p={}", t, p);
// Simplified factorial design (2x2 ANOVA-like)
let data = [
(0..50).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.90, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.92, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>()
];
let means = data.iter().map(|d| d.iter().sum::<f64>() / d.len() as f64).collect::<Vec<f64>>();
println!("Factorial means: {:?}", means);
}
Both implementations run an A/B test and a simplified 2×2 factorial design; the Python version also performs a power analysis and a feature A/B test.
9. Theoretical Foundations
Randomization: Eliminates confounding.
Factorial Design: Models main effects and interactions.
Power Analysis: Ensures sufficient n for detection.
ML Insight
- A/B testing validates ML system improvements.
10. Challenges in ML Experiments
- Non-i.i.d. Data: Time-series, user dependencies.
- Multiple Testing: False positives require correction.
- Small Effect Sizes: Large n needed.
11. Key ML Takeaways
- Experimental design controls bias: Randomization.
- A/B testing compares: Models, features.
- Factorial designs test interactions: Multiple factors.
- Power ensures detection: Sample size.
- Code implements experiments: Practical ML.
Experimental design drives ML validation.
12. Summary
We explored experimental design, A/B testing, factorial designs, and power analysis, with ML applications in model optimization and evaluation. Worked examples and Python/Rust code bridge theory to practice, concluding the series with a practical toolkit for ML experimentation.
Further Reading
- Montgomery, Design and Analysis of Experiments.
- Kohavi, Trustworthy Online Controlled Experiments.
- James, Introduction to Statistical Learning (Ch. 5).
- Rust crates: 'rand' and 'rand_distr' for randomization, 'statrs' for statistical tests.