Naive Bayes & Probabilistic Models
Naive Bayes classifiers are probabilistic models based on Bayes’ theorem, assuming independence between features (the “naive” part). They are efficient for classification tasks, particularly in high-dimensional spaces like text data, and remain relevant in 2025 for applications in spam filtering, sentiment analysis, and as baselines or components in hybrid systems with large language models (LLMs). Probabilistic models, in general, incorporate uncertainty, enabling robust predictions and interpretability.
This lecture in the “Foundations for AI/ML” series (core-ml cluster) builds on prior topics like logistic regression and k-NN, exploring Naive Bayes classifiers, their theoretical foundations, derivations, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.
1. Motivation and Introduction
Naive Bayes is a family of simple probabilistic classifiers based on Bayes’ theorem with the naive assumption of conditional independence between features. Despite this simplification, it performs surprisingly well on many tasks, especially with high-dimensional data.
Why Naive Bayes in 2025?
- Efficiency: Fast training and prediction, ideal for edge devices.
- Interpretability: Feature probabilities reveal insights.
- Baseline: Provides a quick benchmark against complex models like transformers.
- Modern Applications: Used in hybrid systems, e.g., Naive Bayes on LLM embeddings for quick classification.
Naive Bayes models P(class|features) using Bayes’ theorem:
\[ P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{P(\mathbf{x})} \]
Under the independence assumption: P(\mathbf{x} | C_k) = \prod_i P(x_i | C_k).
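To make the theorem concrete, here is a minimal sketch with made-up spam-filter numbers (the probabilities below are illustrative assumptions, not estimates from any dataset):

```python
# Hypothetical spam-filter numbers, chosen for illustration only.
p_spam = 0.4                 # prior P(spam)
p_ham = 0.6                  # prior P(ham)
p_word_given_spam = 0.7      # likelihood P("free" | spam)
p_word_given_ham = 0.1       # likelihood P("free" | ham)

# Evidence P("free") via the law of total probability
evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Posterior P(spam | "free") by Bayes' theorem
posterior_spam = p_word_given_spam * p_spam / evidence
print(round(posterior_spam, 3))  # → 0.824
```

A single moderately spam-indicative word already pushes the posterior well above the 0.4 prior.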
2. Mathematical Formulation
Bayes’ Theorem Recap
From Bayes’ theorem:
\[ P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{\sum_j P(\mathbf{x} | C_j) P(C_j)} \]
- P(C_k): Prior probability of class k.
- P(\mathbf{x} | C_k): Likelihood.
- P(\mathbf{x}): Evidence (normalizer).
Naive Assumption
Assume the features are independent given the class:
\[ P(\mathbf{x} | C_k) = \prod_{i=1}^n P(x_i | C_k) \]
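As a sketch of what the independence assumption buys, the toy example below (with invented per-feature probabilities for binary features) reduces the class-conditional likelihood to a simple product:

```python
import numpy as np

# Illustrative per-feature conditional probabilities P(x_i = 1 | C_k)
# for a 3-feature binary observation, two hypothetical classes.
p_feat_given_c = {
    "c0": [0.2, 0.9, 0.3],
    "c1": [0.8, 0.4, 0.7],
}
x = [1, 0, 1]  # observed feature vector

def naive_likelihood(probs, x):
    # Product of per-feature likelihoods under the independence assumption
    terms = [p if xi == 1 else 1 - p for p, xi in zip(probs, x)]
    return float(np.prod(terms))

for c, probs in p_feat_given_c.items():
    print(c, naive_likelihood(probs, x))
# c0: 0.2 * 0.1 * 0.3 = 0.006
# c1: 0.8 * 0.6 * 0.7 = 0.336
```

No joint distribution over feature combinations is ever estimated; only n per-feature tables per class are needed.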
Variants
- Gaussian Naive Bayes: Features ~ Normal.
- Multinomial Naive Bayes: For counts (e.g., text).
- Bernoulli Naive Bayes: For binary features.
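A brief illustration of the Bernoulli variant on toy binary word-presence features (the tiny dataset is invented for demonstration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary features: [contains "free", contains "meeting"]
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

bnb = BernoulliNB()  # default alpha=1.0 applies Laplace smoothing
bnb.fit(X, y)
print(bnb.predict([[1, 0]]))  # message with "free" only → [1]
```

Unlike `MultinomialNB`, the Bernoulli variant also penalizes the *absence* of a word, since each feature contributes either P(x_i = 1 | C_k) or 1 − P(x_i = 1 | C_k).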
ML Connection
- In 2025, Naive Bayes serves as a lightweight classifier in federated learning or on embeddings from LLMs.
3. Deriving the Classifier
Prior: P(C_k) = N_k / N, where N_k is the number of training samples in class k.
Likelihood: For Gaussian Naive Bayes:
\[ P(x_i | C_k) = \frac{1}{\sqrt{2\pi \sigma_{k,i}^2}} \exp\left( -\frac{(x_i - \mu_{k,i})^2}{2\sigma_{k,i}^2} \right) \]
Take log for stability:
\[ \log P(C_k | \mathbf{x}) = \log P(C_k) + \sum_i \log P(x_i | C_k) - \log P(\mathbf{x}) \]
Classify as \arg\max_k \log P(C_k | \mathbf{x}); since \log P(\mathbf{x}) is the same for every class, it can be dropped from the comparison.
Laplace Smoothing
For zero counts: P(x_i | C_k) = (count + 1) / (N_k + V), where V is the vocabulary size.
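A quick hand computation of the smoothing formula, using illustrative counts:

```python
# Laplace (add-one) smoothing for a word never seen in class k.
count_word_in_class = 0      # word never appeared in class k's training docs
total_words_in_class = 100   # N_k: total word tokens in class k
vocab_size = 50              # V

unsmoothed = count_word_in_class / total_words_in_class
# 0.0 — a single unseen word would zero out the whole product

smoothed = (count_word_in_class + 1) / (total_words_in_class + vocab_size)
print(smoothed)  # 1/150 ≈ 0.0067
```

The unseen word now gets a small but nonzero probability, so one novel token no longer vetoes the entire class.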
4. Training and Prediction
Training:
- Compute priors P(C_k).
- For each feature, estimate P(x_i | C_k) per variant.
Prediction:
- Compute posteriors for each class.
- Select max posterior class.
Complexity: Training is O(nd) for n samples and d features; prediction is O(Kd) per sample for K classes.
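The training and prediction steps above can be sketched from scratch in NumPy. This is a minimal illustration of the Gaussian variant, not a production implementation; the variance floor of 1e-9 and the synthetic two-blob data are arbitrary choices for the demo:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors, per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small variance floor for numerical stability
    vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, vars_

def predict_gaussian_nb(X, classes, priors, means, vars_):
    """argmax_k of log P(C_k) + sum_i log N(x_i; mu_ki, sigma_ki^2)."""
    # log-posterior (up to the constant log P(x)), shape (n_samples, n_classes)
    log_post = np.log(priors) + np.sum(
        -0.5 * np.log(2 * np.pi * vars_[None, :, :])
        - (X[:, None, :] - means[None, :, :]) ** 2 / (2 * vars_[None, :, :]),
        axis=2,
    )
    return classes[np.argmax(log_post, axis=1)]

# Two well-separated synthetic Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([[0.0, 0.0], [5.0, 5.0]]), *params))  # → [0 1]
```

Working entirely in log space implements the stability trick from the derivation: products of many small likelihoods become sums, avoiding underflow.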
ML Connection
- Fast for high-dimensional text data in NLP.
5. Regularization and Smoothing
Laplace (Add-One): Avoids zero probabilities for unseen feature values.
Dirichlet Prior: Generalizes Laplace smoothing; add-α counts correspond to a symmetric Dirichlet prior over the multinomial parameters.
In practice, smoothing is essential for sparse data such as text, where many feature-class counts are zero.
6. Evaluation Metrics
- Accuracy: Fraction of correct predictions.
- Precision/Recall/F1: Preferred for imbalanced classes.
- ROC-AUC: Threshold-independent ranking quality of the predicted scores.
- Log-Loss: Penalizes confident but wrong probability estimates.
In 2025, the Brier score is widely used to assess probability calibration in probabilistic ML.
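A sketch of these metrics on scikit-learn’s built-in breast-cancer dataset, chosen here only as a convenient binary task:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, log_loss, brier_score_loss

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB().fit(X_tr, y_tr)
proba = gnb.predict_proba(X_te)[:, 1]  # P(class 1) per test sample

print("Accuracy:", accuracy_score(y_te, gnb.predict(X_te)))
print("Log-loss:", log_loss(y_te, proba))
print("Brier:   ", brier_score_loss(y_te, proba))
```

Accuracy only checks the argmax; log-loss and the Brier score also judge how trustworthy the probabilities themselves are.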
7. Applications in Machine Learning (2025)
- Text Classification: Spam detection, sentiment analysis.
- Medical Diagnosis: Disease classification from symptoms.
- Recommendation: User categorization.
- LLM Integration: Naive Bayes on embeddings for lightweight tasks.
- Federated Learning: Efficient on-device classifier.
- Anomaly Detection: Detect deviations from class distributions.
Challenges
- Independence Assumption: Rarely holds exactly; mitigated by feature engineering.
- High Dimensionality: Irrelevant or correlated features dilute the signal; use feature selection or dimensionality reduction.
- Imbalanced Data: Use weighted priors or oversampling.
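One way to apply weighted priors in scikit-learn is the `priors` argument of `GaussianNB`; the imbalanced toy data below is invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Imbalanced toy data: 95 samples of class 0, 5 of class 1
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(2, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Default: priors estimated from data, heavily favoring class 0
default_nb = GaussianNB().fit(X, y)

# Override with uniform priors to counteract the imbalance
balanced_nb = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

point = np.array([[1.2, 1.2]])  # ambiguous region between the two blobs
print(default_nb.predict_proba(point))
print(balanced_nb.predict_proba(point))
```

The likelihood terms are identical in both models; only the prior odds change, so the uniform-prior model assigns strictly more posterior mass to the minority class.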
8. Numerical Implementations
We implement Naive Bayes for classification in Python (scikit-learn) and Rust (linfa).
::: code-group
```python [Python]
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Gaussian Naive Bayes
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Gaussian NB Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Multinomial Naive Bayes (text example)
from sklearn.feature_extraction.text import CountVectorizer

texts = ["love this movie", "hate this film", "great story", "bad acting"]
labels = [1, 0, 1, 0]
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(texts)
mnb = MultinomialNB()
mnb.fit(X_text, labels)
test_text = vectorizer.transform(["love great story"])
print("Multinomial NB Prediction:", mnb.predict(test_text)[0])
```

```rust [Rust]
use linfa::prelude::*;
use linfa_bayes::GaussianNb;
use ndarray::{Array1, Array2};

fn main() {
    // Placeholder: the Iris dataset is not bundled with linfa;
    // load it from CSV (e.g. with the `csv` crate) in practice.
    let x_train: Array2<f64> = Array2::zeros((120, 4)); // Simplified
    let y_train: Array1<usize> = Array1::zeros(120); // usize labels, as in linfa examples
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<usize> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = GaussianNb::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(p, t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("Gaussian NB Accuracy: {}", accuracy);

    // Multinomial NB placeholder
}
```

:::
Note: Rust requires external data loading; use Python for full Iris example.
9. Case Study: SMS Spam Detection (Text Classification)
::: code-group
```python [Python]
from sklearn.datasets import fetch_20newsgroups  # Placeholder for SMS data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load data (use an SMS spam dataset in practice)
newsgroups = fetch_20newsgroups(subset='train',
                                categories=['alt.atheism', 'sci.space'])
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train
mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)

# Evaluate
y_pred = mnb.predict(X_test_vec)
print(classification_report(y_test, y_pred))
```

```rust [Rust]
use linfa::prelude::*;
use linfa_bayes::MultinomialNb;
use ndarray::{Array1, Array2};

fn main() {
    // Placeholder: raw text is not supported directly;
    // supply pre-computed (e.g. TF-IDF) feature matrices.
    let x_train: Array2<f64> = Array2::zeros((800, 1000)); // TF-IDF example
    let y_train: Array1<usize> = Array1::zeros(800);
    let x_test: Array2<f64> = Array2::zeros((200, 1000));
    let y_test: Array1<usize> = Array1::zeros(200);

    let dataset = Dataset::new(x_train, y_train);
    let model = MultinomialNb::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds
        .iter()
        .zip(y_test.iter())
        .filter(|(p, t)| p == t)
        .count() as f64
        / y_test.len() as f64;
    println!("Multinomial NB Accuracy: {}", accuracy);
}
```

:::
Note: Rust requires TF-IDF implementation; use Python for full text example.
10. Under the Hood Insights
- Generative Model: Models joint P(\mathbf{x}, y), unlike discriminative logistic regression.
- Independence Assumption: “Naive” but effective with smoothing.
- Scalability: Handles high-d sparse data well.
- Probability Calibration: Raw posteriors are often overconfident because correlated features get double-counted; apply Platt scaling or isotonic regression when calibrated probabilities matter.
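When calibrated probabilities matter, one option is to wrap Naive Bayes in scikit-learn’s `CalibratedClassifierCV`; the sketch below compares Brier scores before and after isotonic calibration on the built-in breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Raw Naive Bayes vs. NB wrapped in cross-validated isotonic calibration
raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic",
                                    cv=5).fit(X_tr, y_tr)

print("Raw Brier:       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("Calibrated Brier:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```

Class predictions usually change little; what improves is how closely the predicted probabilities track the observed frequencies.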
11. Limitations
- Independence Assumption: Rarely true, reduces accuracy for correlated features.
- High Dimensionality: Benefits from feature selection.
- Zero Probability: Smoothing essential.
- Imbalanced Data: Requires class weighting.
12. Summary
Naive Bayes is a fast, probabilistic classifier excelling in high-dimensional data. In 2025, its efficiency in federated learning, LLM integration, and anomaly detection keeps it vital. Its simplicity and interpretability make it a strong baseline.
Further Reading
- James et al., An Introduction to Statistical Learning (Ch. 4).
- Géron, Hands-On Machine Learning (Ch. 4).
- linfa-bayes docs: github.com/rust-ml/linfa.
- Zhang, “The Naïve Bayes Classifier” (2005).