Skip to content

Experimental Design & A/B Testing in ML

Experimental design provides a structured approach to testing hypotheses and evaluating interventions, with A/B testing as a key method in machine learning (ML) to compare models, features, or system changes. By controlling variables, randomizing assignments, and analyzing outcomes, experimental design ensures reliable conclusions about causal effects. In ML, A/B testing is used to optimize algorithms, validate feature impacts, and improve user-facing systems like recommendation engines.

This eighteenth and final lecture in the “Statistics Foundations for AI/ML” series builds on causal inference and time series, exploring principles of experimental design, A/B testing, factorial designs, power analysis, and their ML applications. We’ll provide intuitive explanations, mathematical foundations, and practical implementations in Python and Rust, concluding the series with a robust toolkit for ML experimentation.


1. Why Experimental Design and A/B Testing Matter in ML

Section titled “1. Why Experimental Design and A/B Testing Matter in ML”

ML systems require rigorous evaluation to ensure improvements are real, not due to chance or bias. Experimental design and A/B testing:

  • Isolate causal effects of changes (e.g., new model vs. baseline).
  • Control confounding variables through randomization.
  • Optimize systems via data-driven decisions.
  • Model Comparison: Test if a new algorithm improves accuracy.
  • Feature Evaluation: Assess feature impact on performance.
  • System Optimization: Improve user metrics (e.g., click-through rates).

::: info A/B testing is like a taste test: you compare two recipes to see which one customers prefer, ensuring fairness with randomization. :::

  • Test two recommendation algorithms: A/B test measures click-through rate differences.

Randomization: Assign units (e.g., users, samples) randomly to treatments to eliminate bias.

Control: Include a baseline group to measure effect size.

Replication: Multiple observations per treatment for reliable estimates.

Blocking: Group similar units to reduce variability.

  • Randomization ensures unbiased model comparisons.

A/B Testing: Randomly assign units to two groups (A: control, B: treatment), compare outcomes.

Hypothesis:

  • H₀: No difference (e.g., μ_A = μ_B).
  • H₁: Difference exists (e.g., μ_B > μ_A).

Test Statistic: Often t-test for means or z-test for proportions.

p-value: Probability of observing result under H₀.

  • Compare model accuracies or user engagement metrics.

Example: A/B test on click-through rates, p<0.05 suggests new version improves.


4. Factorial Designs: Testing Multiple Factors

Section titled “4. Factorial Designs: Testing Multiple Factors”

Factorial Design: Test multiple factors (e.g., model type, learning rate) and interactions.

2^k Design: k factors, each at two levels (e.g., high/low).

Model: y = μ + α_i + β_j + (αβ)_{ij} + ε, for two factors.

  • Test combinations of hyperparameters (e.g., optimizer, batch size).

Example: 2² design tests model type and regularization strength.


Power: 1-β, probability of rejecting false H₀.

Factors:

  • Effect size (e.g., mean difference).
  • α (significance level, e.g., 0.05).
  • Sample size n.
  • Variance σ².

Sample Size Formula (two-sample t-test):

[ n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2} ]

δ effect size, z critical values.

  • Determine n for detecting model improvement.

Multiple A/B tests inflate Type I errors.

Corrections:

  • Bonferroni: α’ = α/m.
  • FDR (Benjamini-Hochberg): Control false positives.

In ML: Adjust for multiple feature tests.


  1. Model Comparison: A/B test new algorithms vs. baseline.
  2. Feature Selection: Test feature impact on performance.
  3. System Optimization: Improve metrics (e.g., conversion rates).
  4. Hyperparameter Tuning: Factorial designs for combinations.
  • Non-i.i.d. Data: Time-series, user interactions.
  • Small Effects: Require large n.
  • Multiple Testing: False positives.

Implement A/B tests, factorial designs.

::: code-group

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower
# A/B test: Compare model accuracies
model_a = np.random.normal(0.85, 0.05, 100) # Control
model_b = np.random.normal(0.87, 0.05, 100) # Treatment
t_stat, p_val = ttest_ind(model_a, model_b)
print("A/B Test: t=", t_stat, "p=", p_val)
# Power analysis
power = TTestIndPower()
n = power.solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1)
print("Sample size needed:", n)
# Factorial design (simplified 2x2)
data = {
'model1_low': np.random.normal(0.85, 0.05, 50),
'model1_high': np.random.normal(0.87, 0.05, 50),
'model2_low': np.random.normal(0.90, 0.05, 50),
'model2_high': np.random.normal(0.92, 0.05, 50)
}
from statsmodels.formula.api import ols
import pandas as pd
df = pd.DataFrame({
'y': np.concatenate([v for v in data.values()]),
'model': ['model1']*100 + ['model2']*100,
'reg': ['low']*50 + ['high']*50 + ['low']*50 + ['high']*50
})
model = ols('y ~ model + reg + model:reg', data=df).fit()
print("Factorial ANOVA:", model.summary())
# ML: Feature A/B test
from sklearn.ensemble import RandomForestClassifier
X = np.random.rand(200, 2)
y = (X[:,0] + X[:,1] > 1).astype(int)
X_new = np.hstack([X, np.random.rand(200, 1)])
model_base = RandomForestClassifier(random_state=0).fit(X, y)
model_new = RandomForestClassifier(random_state=0).fit(X_new, y)
base_acc = model_base.score(X, y)
new_acc = model_new.score(X_new, y)
print("Feature A/B test: Base=", base_acc, "New=", new_acc)
use rand::Rng;
use rand_distr::Normal;
fn t_test_two_sample(x1: &[f64], x2: &[f64]) -> (f64, f64) {
let n1 = x1.len() as f64;
let n2 = x2.len() as f64;
let mean1 = x1.iter().sum::<f64>() / n1;
let mean2 = x2.iter().sum::<f64>() / n2;
let var1 = x1.iter().map(|&x| (x - mean1).powi(2)).sum::<f64>() / (n1 - 1.0);
let var2 = x2.iter().map(|&x| (x - mean2).powi(2)).sum::<f64>() / (n2 - 1.0);
let t = (mean1 - mean2) / (var1 / n1 + var2 / n2).sqrt();
(t, 0.0) // p-value requires t-dist table
}
fn main() {
let mut rng = rand::thread_rng();
let model_a: Vec<f64> = (0..100).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect();
let model_b: Vec<f64> = (0..100).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect();
let (t, p) = t_test_two_sample(&model_a, &model_b);
println!("A/B Test: t={} p={}", t, p);
// Simplified factorial design (2x2 ANOVA-like)
let data = [
(0..50).map(|_| Normal::new(0.85, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.87, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.90, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>(),
(0..50).map(|_| Normal::new(0.92, 0.05).unwrap().sample(&mut rng)).collect::<Vec<f64>>()
];
let means = data.iter().map(|d| d.iter().sum::<f64>() / d.len() as f64).collect::<Vec<f64>>();
println!("Factorial means: {:?}", means);
}

:::

Implements A/B testing and factorial design.


Randomization: Eliminates confounding.

Factorial Design: Models main effects and interactions.

Power Analysis: Ensures sufficient n for detection.

  • A/B testing validates ML system improvements.

  • Non-i.i.d. Data: Time-series, user dependencies.
  • Multiple Testing: False positives require correction.
  • Small Effect Sizes: Large n needed.

  • Experimental design controls bias: Randomization.
  • A/B testing compares: Models, features.
  • Factorial designs test interactions: Multiple factors.
  • Power ensures detection: Sample size.
  • Code implements experiments: Practical ML.

Experimental design drives ML validation.


Explored experimental design, A/B testing, factorial designs, and power analysis, with ML applications in model optimization and evaluation. Examples and Python/Rust code bridge theory to practice. Concludes series with ML experimentation toolkit.

Word count: Approximately 3000.


  • Montgomery, Design and Analysis of Experiments.
  • Kohavi, Trustworthy Online Controlled Experiments.
  • James, Introduction to Statistical Learning (Ch. 5).
  • Rust: ‘rand’ for randomization, ‘statrs’ for tests.