Why Probability in ML?
Machine learning (ML) thrives in environments of uncertainty—data is noisy, predictions are probabilistic, and models must generalize from incomplete information. Probability provides the mathematical language to quantify, manage, and reason about this uncertainty. Whether classifying images, forecasting stock prices, or training generative models, probability underpins every decision an ML model makes. Understanding why and how probability integrates with ML is the first step to mastering the foundations of artificial intelligence.
This lecture kicks off the "Probability Foundations for AI/ML" series, introducing the necessity of probability, basic concepts like sample spaces and events, fundamental rules (axioms, addition, multiplication), and their immediate relevance to ML tasks. We'll ground these ideas in intuitive explanations, mathematical rigor, and practical examples, with implementations in Python and Rust to connect theory to code. By the end, you'll see why probability is the backbone of ML and be ready for deeper probabilistic concepts.
1. The Role of Probability in Machine Learning
ML models learn patterns from data, but real-world data is inherently uncertain:
- Noisy Inputs: Sensor errors, missing values, or outliers.
- Incomplete Information: Training data never covers all scenarios.
- Stochastic Processes: User behavior, market trends, or physical systems evolve randomly.
Probability quantifies this uncertainty, enabling models to:
- Predict with confidence intervals (e.g., 95% chance of rain).
- Optimize decisions under risk (e.g., expected rewards in reinforcement learning).
- Update beliefs with new data (e.g., Bayesian inference in spam filtering).
ML Connection
- Classification: Probabilities over classes (softmax outputs).
- Generative Models: Sample from distributions (GANs, VAEs).
- Uncertainty Quantification: Confidence in predictions for safety-critical applications.
INFO
Probability turns guesswork into principled reasoning, like a weather forecast assigning chances to rain vs. sun.
Everyday Example
- Predicting if an email is spam: Probability combines features (words, sender) to estimate P(spam|email).
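As a minimal sketch of this example (with made-up counts, purely illustrative), the conditional probability P(spam | keyword) can be estimated directly from a labeled inbox as a ratio of counts:
# Hypothetical counts from a labeled inbox (illustrative numbers only).
n_emails = 1000          # total emails
n_keyword = 120          # emails containing the word "free"
n_spam_and_keyword = 90  # emails that are spam AND contain "free"

# Conditional probability as a ratio of counts:
# P(spam | keyword) = P(spam ∩ keyword) / P(keyword)
p_spam_given_keyword = n_spam_and_keyword / n_keyword
print("P(spam | 'free') ≈", p_spam_given_keyword)  # 0.75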
2. Uncertainty in Machine Learning: Why It Matters
Uncertainty arises in:
- Aleatoric: Inherent randomness (e.g., dice rolls).
- Epistemic: Lack of knowledge (e.g., limited training data).
Probability models both:
- P(y|x): Likelihood of label y given input x.
- P(θ|D): Uncertainty in model parameters θ given data D.
In practice:
- Regression: Predict house prices with error bounds.
- Reinforcement Learning: Balance exploration vs. exploitation via probabilities.
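One small illustration of the last bullet: an ε-greedy policy, a standard exploration heuristic, uses a probability ε to decide between exploring and exploiting. The reward estimates below are made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)
q_estimates = np.array([0.2, 0.5, 0.3])  # hypothetical estimated rewards per action
epsilon = 0.1                            # exploration probability

def epsilon_greedy(q, eps):
    # With probability eps explore (uniform random action), otherwise exploit (best estimate).
    if rng.random() < eps:
        return rng.integers(len(q))
    return int(np.argmax(q))

actions = [epsilon_greedy(q_estimates, epsilon) for _ in range(10000)]
print("Fraction choosing best action:", np.mean(np.array(actions) == 1))  # ≈ 1 - eps + eps/3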
3. Foundations: Sample Spaces and Events
Sample Space (Ω): Set of all possible outcomes of a random experiment.
- Discrete: {1,2,3,4,5,6} for a die.
- Continuous: [0,∞) for waiting times.
Event: Subset of Ω (e.g., {2,4,6} for even rolls).
Sigma-Algebra (Σ): Collection of measurable events, closed under complements and countable unions (and hence countable intersections).
ML Connection
- Features as sample spaces: Pixel values, word embeddings.
- Events as predictions: "Image is a cat" = subset of image space.
Example: Coin toss, Ω={H,T}, event {H} for heads.
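A minimal sketch, assuming a fair die: the sample space and an event are just Python sets, and under a uniform measure P(A) = |A| / |Ω|.
# Sample space for one roll of a fair die and the event "even roll".
omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}

# Under a uniform probability measure, P(A) = |A| / |Ω|.
p_even = len(even) / len(omega)
print("P(even) =", p_even)  # 0.5

# Complement event within the sample space.
odd = omega - even
print("P(odd) =", len(odd) / len(omega))  # 0.5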
4. Probability Measures and Axioms
Probability P: Σ → [0,1], satisfying:
- P(A)≥0 for event A.
- P(Ω)=1.
- Countable additivity: P(∪_i A_i) = sum_i P(A_i) for pairwise disjoint A_i.
Derived Rules
- Complement: P(A^c)=1-P(A).
- Union: P(A∪B)=P(A)+P(B)-P(A∩B).
- Inclusion-Exclusion for n events.
ML Insight
- Models estimate P(class|features) from training data; the axioms guarantee these estimates are non-negative and sum to 1 across classes.
Example: Two dice, P(sum=7)=6/36=1/6 (pairs (1,6),(2,5),...).
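The two-dice example and the derived rules can be checked by brute-force enumeration; the sketch below assumes a uniform measure on the 36 ordered pairs.
from itertools import product

# All 36 equally likely ordered outcomes for two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    # Uniform measure: P(A) = |A| / |Ω|.
    return len(event) / len(omega)

sum_is_7   = {o for o in omega if o[0] + o[1] == 7}
first_is_6 = {o for o in omega if o[0] == 6}

print("P(sum=7) =", prob(sum_is_7))  # 6/36 ≈ 0.1667
# Complement rule: P(A^c) = 1 - P(A)
print("Complement:", 1 - prob(sum_is_7), "=", prob(set(omega) - sum_is_7))
# Union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = prob(sum_is_7 | first_is_6)
rhs = prob(sum_is_7) + prob(first_is_6) - prob(sum_is_7 & first_is_6)
print("Union rule holds:", abs(lhs - rhs) < 1e-12)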
5. Discrete vs. Continuous Probability
Discrete: Finite or countable outcomes; probabilities P(X=x) are assigned directly by a probability mass function (PMF).
- E.g., Bernoulli (coin flip), P(X=1)=p.
Continuous: Uncountably many outcomes; probabilities come from a density (PDF) f(x), with P(a≤X≤b)=∫_a^b f(x) dx and P(X=x)=0 for any single point (see the sketch at the end of this section).
- E.g., Normal N(μ,σ^2).
ML Application
- Discrete: Classification labels.
- Continuous: Regression outputs, latent variables in VAEs.
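A short sketch contrasting the two cases: a Bernoulli PMF gives P(X=x) directly, while for a Normal the probability of an interval comes from integrating the PDF, here via the CDF in scipy.stats.
from scipy.stats import bernoulli, norm

# Discrete: Bernoulli(p) assigns mass directly to outcomes 0 and 1.
p = 0.3
print("P(X=1) =", bernoulli.pmf(1, p))   # 0.3
print("P(X=0) =", bernoulli.pmf(0, p))   # 0.7

# Continuous: single points have probability zero;
# intervals get P(a ≤ X ≤ b) = F(b) - F(a).
a, b = -1.0, 1.0
print("P(-1 ≤ X ≤ 1) =", norm.cdf(b) - norm.cdf(a))  # ≈ 0.6827 for N(0, 1)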
6. Basic Probability Rules in ML Contexts
Addition Rule: P(A∪B)=P(A)+P(B)-P(A∩B).
- ML: Probability a prediction falls in one of several (possibly overlapping) outcome sets, e.g. P(A∪B) for two candidate labels.
Multiplication Rule: P(A∩B)=P(A)P(B|A).
- ML: Joint probs in graphical models.
Law of Total Probability: P(B)=sum_i P(B|A_i)P(A_i) for a partition {A_i} of the sample space.
- ML: Marginalize over hidden variables.
Example: Spam filter, P(spam ∩ keyword)=P(keyword|spam)P(spam).
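A minimal sketch of the spam-filter example with made-up numbers: the multiplication rule gives the joint P(spam ∩ keyword), and the law of total probability marginalizes over the two classes to get P(keyword).
# Hypothetical probabilities, purely illustrative.
p_spam = 0.3
p_keyword_given_spam = 0.6
p_keyword_given_ham = 0.05

# Multiplication rule: P(spam ∩ keyword) = P(keyword | spam) * P(spam)
p_joint = p_keyword_given_spam * p_spam
print("P(spam ∩ keyword) =", p_joint)  # 0.18

# Law of total probability: P(keyword) = Σ_i P(keyword | class_i) * P(class_i)
p_keyword = p_keyword_given_spam * p_spam + p_keyword_given_ham * (1 - p_spam)
print("P(keyword) =", p_keyword)       # 0.215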
7. Modeling Uncertainty in ML: Examples
- Classification: Softmax outputs P(y_i|x) for class i (see the sketch after this list).
- Regression: Gaussian noise model, y~N(w^T x, σ^2).
- Decision Trees: Class proportions at leaf nodes serve as probabilities.
- Bayesian Networks: Factorize joint distributions, e.g. P(data, hypothesis)=P(data|hypothesis)P(hypothesis).
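A short sketch of the first two bullets: softmax turns raw scores into class probabilities that sum to 1, and a Gaussian noise model draws regression targets around w^T x. The weights, inputs, and noise level below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical network scores
probs = softmax(logits)
print("P(y_i | x):", probs, "sum =", probs.sum())

# Regression with Gaussian noise: y ~ N(w^T x, sigma^2)
w, x, sigma = np.array([1.5, -0.5]), np.array([2.0, 1.0]), 0.3
y_samples = rng.normal(loc=w @ x, scale=sigma, size=5)
print("Sampled y around", w @ x, ":", y_samples)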
8. Numerical Probability Computations
Probabilities can be estimated numerically with Monte Carlo simulation: run many random trials and take empirical frequencies. The Python version below is followed by the same computations in Rust.
import numpy as np
from scipy.stats import norm

# Discrete: Monte Carlo estimate of P(head) for a (possibly biased) coin.
def coin_toss_prob(n_trials=10000, p_head=0.5):
    outcomes = np.random.binomial(1, p_head, n_trials)
    return np.mean(outcomes)

print("P(head) ≈", coin_toss_prob())  # ~0.5

# Continuous: exact P(X ≤ 1) for a standard normal via its CDF.
p = norm.cdf(1, loc=0, scale=1)
print("P(X≤1) normal:", p)

# ML: toy naive Bayes with binary features and two classes.
# Returns the unnormalized joint P(C=1, features); dividing by the
# evidence P(features) would give the posterior P(C=1 | features).
def naive_bayes(X, y, test):
    p_class = np.mean(y)  # P(C=1)
    p_features = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        p_features[i] = np.mean(X[y == 1, i])  # P(f_i = 1 | C=1)
    prob = p_class * np.prod([p if t == 1 else 1 - p
                              for p, t in zip(p_features, test)])
    return prob

X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0])
test = [1, 1]
print("P(C=1, test) (unnormalized):", naive_bayes(X, y, test))
The same computations in Rust:
use rand_distr::{Binomial, Normal, Distribution};

// Monte Carlo estimate of P(head) for a (possibly biased) coin.
fn coin_toss_prob(n_trials: usize, p_head: f64) -> f64 {
    let binom = Binomial::new(1, p_head).unwrap();
    let mut rng = rand::thread_rng();
    let mut sum = 0.0;
    for _ in 0..n_trials {
        sum += binom.sample(&mut rng) as f64;
    }
    sum / n_trials as f64
}

// Toy naive Bayes with binary features and two classes; returns the
// unnormalized joint P(C=1, features).
fn naive_bayes(x: &[[u8; 2]], y: &[u8], test: &[u8]) -> f64 {
    let p_class = y.iter().map(|&v| v as f64).sum::<f64>() / y.len() as f64; // P(C=1)
    let mut p_features = [0.0; 2];
    for i in 0..2 {
        let mut sum = 0u64;
        let mut count = 0u64;
        for (xi, &yi) in x.iter().zip(y.iter()) {
            if yi == 1 {
                sum += xi[i] as u64;
                count += 1;
            }
        }
        p_features[i] = sum as f64 / count as f64; // P(f_i = 1 | C=1)
    }
    p_class
        * test
            .iter()
            .zip(p_features.iter())
            .map(|(&t, &p)| if t == 1 { p } else { 1.0 - p })
            .product::<f64>()
}

fn main() {
    println!("P(head) ≈ {}", coin_toss_prob(10000, 0.5));

    // Monte Carlo estimate of the standard normal CDF at 1.
    let normal = Normal::new(0.0, 1.0).unwrap();
    let n = 10000;
    let mut rng = rand::thread_rng();
    let mut count = 0;
    for _ in 0..n {
        if normal.sample(&mut rng) <= 1.0 {
            count += 1;
        }
    }
    println!("P(X≤1) normal: {}", count as f64 / n as f64);

    // Naive Bayes toy example.
    let x = [[1, 0], [0, 1], [1, 1], [0, 0]];
    let y = [1, 0, 1, 0];
    let test = [1, 1];
    println!("P(C=1, test) (unnormalized): {}", naive_bayes(&x, &y, &test));
}
Both versions simulate the coin toss, estimate the normal CDF, and run the toy naive Bayes computation; the Monte Carlo estimates approach the exact values as the number of trials grows.
9. Symbolic Probability Computations
Exact calculations with SymPy in Python; the Rust snippet afterwards simply prints the corresponding hand-derived results.
from sympy import symbols, Rational

# Bernoulli PMF: P(X = k) = p^k * (1 - p)^(1 - k) for k in {0, 1}.
p = Rational(1, 2)
k = symbols('k')
bernoulli_pmf = p**k * (1 - p)**(1 - k)
print("Bernoulli P(X=1):", bernoulli_pmf.subs(k, 1))  # 1/2

# ML: joint probability via the multiplication rule, kept symbolic.
p_spam, p_keyword_given_spam = symbols('p_spam p_keyword_given_spam')
joint = p_keyword_given_spam * p_spam
print("P(spam ∩ keyword):", joint)
fn main() {
    // No symbolic algebra here; these values mirror the SymPy results above.
    println!("Bernoulli P(X=1): 1/2");
    println!("P(spam ∩ keyword): p_keyword_given_spam * p_spam");
}
10. Probability Axioms in ML Modeling
Axioms ensure consistency:
- Classification: Class probabilities must satisfy sum_i P(y_i|x)=1 (e.g., softmax outputs).
- Monte Carlo: Additivity lets estimates over disjoint events be summed (see the check below).
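A quick numerical check of additivity, sketched with a simulated fair die: for disjoint events, the Monte Carlo estimate of P(A ∪ B) matches P(A) + P(B).
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # simulated fair-die rolls

# Disjoint events: A = {1, 2}, B = {6}
p_a = np.mean(np.isin(rolls, [1, 2]))
p_b = np.mean(rolls == 6)
p_ab = np.mean(np.isin(rolls, [1, 2, 6]))  # A ∪ B

print("P(A) + P(B) ≈", p_a + p_b)
print("P(A ∪ B)   ≈", p_ab)  # additivity: the two estimates agree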
11. Challenges in Probabilistic ML
- High dimensions: The curse of dimensionality makes density estimation and sampling hard.
- Non-measurable sets: Rarely an issue in practice; restricting attention to a sigma-algebra of measurable events avoids them.
12. Key ML Takeaways
- Probability quantifies uncertainty: Core to predictions.
- Sample spaces define problems: Discrete/continuous.
- Rules structure models: Addition, multiplication.
- Code computes probs: Simulations, exact.
- Foundation for advanced: Sets stage for distributions.
Probability is ML's language of uncertainty.
13. Summary
This lecture introduced why probability drives ML, from sample spaces and events to the basic rules, with applications in classification and regression. Worked examples and Python/Rust code connect the concepts to practice and prepare the ground for the next topic: random variables and distributions.
Further Reading
- Wasserman, All of Statistics (Ch. 1).
- Bishop, Pattern Recognition (Ch. 1).
- 3Blue1Brown: Probability videos.
- Rust: 'rand', 'rand_distr' crates.