Naive Bayes & Probabilistic Models

Naive Bayes classifiers are probabilistic models based on Bayes' theorem, assuming independence between features (the "naive" part). They are efficient for classification tasks, particularly in high-dimensional spaces like text data, and remain relevant in 2025 for applications in spam filtering, sentiment analysis, and as baselines or components in hybrid systems with large language models (LLMs). Probabilistic models, in general, incorporate uncertainty, enabling robust predictions and interpretability.

This lecture in the "Foundations for AI/ML" series (core-ml cluster) builds on prior topics like logistic regression and k-NN, exploring Naive Bayes classifiers, their theoretical foundations, derivations, and applications. We’ll provide intuitive explanations, mathematical insights, and practical implementations in Python (scikit-learn) and Rust (linfa), ensuring a rigorous yet practical guide aligned with 2025 ML trends.


1. Motivation and Introduction

Naive Bayes is a family of simple probabilistic classifiers based on Bayes' theorem with the naive assumption of conditional independence between features. Despite this simplification, it performs surprisingly well on many tasks, especially with high-dimensional data.

Why Naive Bayes in 2025?

  • Efficiency: Fast training and prediction, ideal for edge devices.
  • Interpretability: Feature probabilities reveal insights.
  • Baseline: A strong, cheap baseline for comparison against complex models like transformers.
  • Modern Applications: Used in hybrid systems, e.g., Naive Bayes on LLM embeddings for quick classification.

Naive Bayes models P(class|features) using Bayes' theorem:

[ P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{P(\mathbf{x})} ]

Under the independence assumption, P(\mathbf{x} | C_k) = \prod_i P(x_i | C_k).
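
To make this concrete, here is a tiny hand-computed example with made-up numbers (two classes, two features); it simply evaluates the formula above and normalizes by the evidence:

python
import numpy as np

# Hypothetical priors P(C_k) and per-feature likelihoods P(x_i | C_k) for one observed x.
priors = np.array([0.6, 0.4])            # P(C_0), P(C_1)
likelihoods = np.array([[0.8, 0.3],      # P(x_1 | C_0), P(x_2 | C_0)
                        [0.1, 0.7]])     # P(x_1 | C_1), P(x_2 | C_1)

joint = priors * likelihoods.prod(axis=1)    # P(C_k) * prod_i P(x_i | C_k)
posterior = joint / joint.sum()              # divide by the evidence P(x)
print(posterior)                             # ~[0.837, 0.163] -> predict C_0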


2. Mathematical Formulation

Bayes' Theorem Recap

From Bayes' theorem:

[ P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{\sum_j P(\mathbf{x} | C_j) P(C_j)} ]

  • P(C_k): Prior probability of class k.
  • P(\mathbf{x} | C_k): Likelihood.
  • P(\mathbf{x}): Evidence (normalizer).

Naive Assumption

Assume the features are conditionally independent given the class:

[ P(\mathbf{x} | C_k) = \prod_{i=1}^n P(x_i | C_k) ]

Variants

  • Gaussian Naive Bayes: Continuous features modeled as class-conditional normal distributions.
  • Multinomial Naive Bayes: Count features (e.g., word counts in text).
  • Bernoulli Naive Bayes: Binary (present/absent) features.
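
As a quick sketch (with synthetic toy data, not from the lecture), the three variants map directly onto scikit-learn estimators; the only point here is which feature type each one expects:

python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                                      # two toy classes

X_cont = rng.normal(size=(200, 3)) + y[:, None]                       # continuous -> GaussianNB
X_counts = rng.poisson(lam=2.0 + np.repeat(y[:, None], 5, axis=1))    # counts     -> MultinomialNB
X_bin = (rng.random((200, 4)) < 0.3 + 0.3 * y[:, None]).astype(int)   # binary     -> BernoulliNB

for model, X in [(GaussianNB(), X_cont), (MultinomialNB(), X_counts), (BernoulliNB(), X_bin)]:
    print(type(model).__name__, round(model.fit(X, y).score(X, y), 3))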

ML Connection

  • In 2025, Naive Bayes serves as a lightweight classifier in federated learning or on embeddings from LLMs.

3. Deriving the Classifier

Prior: P(C_k) = N_k / N, where N_k is the number of training samples in class k and N is the total number of samples.

Likelihood: For Gaussian:

[ P(x_i | C_k) = \frac{1}{\sqrt{2\pi \sigma_{k,i}^2}} \exp\left( -\frac{(x_i - \mu_{k,i})^2}{2\sigma_{k,i}^2} \right) ]

Take logarithms for numerical stability (products of many small probabilities underflow):

[ \log P(C_k | \mathbf{x}) = \log P(C_k) + \sum_i \log P(x_i | C_k) - \log P(\mathbf{x}) ]

Classify as argmax_k \log P(C_k | \mathbf{x}); the evidence term \log P(\mathbf{x}) is the same for every class and can be dropped.
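
A minimal NumPy sketch of exactly this computation (the helper names are hypothetical, not a library API; sklearn's GaussianNB is the reference implementation):

python
import numpy as np

def gaussian_nb_fit(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # avoid division by zero
    return classes, priors, means, variances

def gaussian_nb_predict(X, classes, priors, means, variances):
    # log P(C_k) + sum_i log N(x_i; mu_ki, sigma_ki^2); the evidence term is dropped.
    diff = X[:, None, :] - means[None, :, :]                              # shape (n, K, d)
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=2)
    log_post = np.log(priors) + log_lik                                   # shape (n, K)
    return classes[np.argmax(log_post, axis=1)]

# Sanity check on Iris: should closely match sklearn's GaussianNB.
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
print((gaussian_nb_predict(X, *gaussian_nb_fit(X, y)) == y).mean())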

Laplace Smoothing

For zero counts (multinomial model): P(x_i | C_k) = (count of x_i in class k + 1) / (total count of all features in class k + V), where V is the vocabulary size.
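
A tiny numeric illustration with made-up counts (V = 4 vocabulary terms, two of them never seen in this class):

python
import numpy as np

counts = np.array([3, 0, 1, 0])                 # per-word counts within one class
V = counts.size                                 # vocabulary size
unsmoothed = counts / counts.sum()              # [0.75, 0.0, 0.25, 0.0] -- zeros kill the product
smoothed = (counts + 1) / (counts.sum() + V)    # [0.5, 0.125, 0.25, 0.125]
print(unsmoothed, smoothed)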


4. Training and Prediction

Training:

  1. Compute priors P(C_k).
  2. For each feature, estimate P(x_i | C_k) per variant.

Prediction:

  1. Compute posteriors for each class.
  2. Select max posterior class.

Complexity: Training is O(n d) for n samples and d features (a single pass over the data); prediction is O(K d) per sample for K classes.
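
For the multinomial variant, the two steps above reduce to counting; a minimal sketch with add-one smoothing (hypothetical helper names, dense count matrix assumed):

python
import numpy as np

def multinomial_nb_fit(X_counts, y, alpha=1.0):
    """Training: class priors plus smoothed per-class word frequencies. X_counts is a dense (n, V) matrix."""
    classes = np.unique(y)
    log_priors = np.log([np.mean(y == c) for c in classes])
    word_counts = np.array([X_counts[y == c].sum(axis=0) for c in classes]) + alpha
    log_likes = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_priors, log_likes

def multinomial_nb_predict(X_counts, classes, log_priors, log_likes):
    """Prediction: log P(C_k) + sum_i x_i * log P(word_i | C_k), then argmax over classes."""
    scores = X_counts @ log_likes.T + log_priors
    return classes[np.argmax(scores, axis=1)]

# Usage: multinomial_nb_predict(X_new, *multinomial_nb_fit(X_train_counts, y_train))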

ML Connection

  • Fast for high-d text data in NLP.

5. Regularization and Smoothing

Laplace (Add-One): Adds one to every count so that unseen feature values never receive zero probability.

Dirichlet Prior: Generalizes Laplace to add-α (Lidstone) smoothing for the multinomial model.

In ML practice, smoothing is essential for sparse data such as text.
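
In scikit-learn, the smoothing strength is exposed as the alpha parameter of MultinomialNB (alpha=1 is Laplace/add-one; fractional alpha is Lidstone smoothing, the Dirichlet-prior view). A short sketch on toy texts:

python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize now", "meeting at noon", "win a free prize", "lunch meeting today"]
labels = [1, 0, 1, 0]                                  # toy labels: 1 = spam, 0 = ham
X = CountVectorizer().fit_transform(texts)

for alpha in (1.0, 0.5, 0.01):                         # stronger -> weaker smoothing
    model = MultinomialNB(alpha=alpha).fit(X, labels)
    print(alpha, model.predict_proba(X[:1]).round(3))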


6. Evaluation Metrics

  • Accuracy: Fraction of correct predictions.
  • Precision/Recall/F1: Better suited to imbalanced classes.
  • ROC-AUC: How well predicted scores rank positives above negatives.
  • Log-Loss: Penalizes confident but wrong probability estimates.

In 2025, the Brier score is also commonly reported as a direct measure of probability calibration in probabilistic ML.
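
A compact sketch computing these metrics for a Gaussian NB model on a synthetic binary problem (the dataset and split here are arbitrary choices for illustration):

python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss, brier_score_loss

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # P(class 1)
pred = (proba >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_te, pred))
print("F1      :", f1_score(y_te, pred))
print("ROC-AUC :", roc_auc_score(y_te, proba))
print("log-loss:", log_loss(y_te, proba))
print("Brier   :", brier_score_loss(y_te, proba))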


7. Applications in Machine Learning (2025)

  1. Text Classification: Spam detection, sentiment analysis.
  2. Medical Diagnosis: Disease classification from symptoms.
  3. Recommendation: User categorization.
  4. LLM Integration: Naive Bayes on embeddings for lightweight tasks (see the sketch after this list).
  5. Federated Learning: Efficient on-device classifier.
  6. Anomaly Detection: Detect deviations from class distributions.
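
To illustrate the "Naive Bayes on embeddings" pattern from item 4, here is a minimal sketch; random vectors stand in for whatever embedding model you actually use, and the injected class signal is purely synthetic:

python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Stand-in for LLM/sentence embeddings: replace with the output of your embedding model.
n_docs, emb_dim = 200, 384
labels = rng.integers(0, 2, size=n_docs)
embeddings = rng.normal(size=(n_docs, emb_dim)) + 0.5 * labels[:, None]  # weak synthetic class signal

print(cross_val_score(GaussianNB(), embeddings, labels, cv=5).mean())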

Challenges

  • Independence Assumption: Rarely holds exactly; mitigated by feature engineering or selection.
  • High Dimensionality: Redundant or irrelevant features add noise and violate independence; feature selection or dimensionality reduction helps.
  • Imbalanced Data: Use weighted priors or oversampling.

8. Numerical Implementations

Implement Naive Bayes for classification.

python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Gaussian Naive Bayes
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Gaussian NB Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Multinomial Naive Bayes (text example)
from sklearn.feature_extraction.text import CountVectorizer
texts = ["love this movie", "hate this film", "great story", "bad acting"]
labels = [1, 0, 1, 0]
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(texts)
mnb = MultinomialNB()
mnb.fit(X_text, labels)
test_text = vectorizer.transform(["love great story"])
print("Multinomial NB Prediction:", mnb.predict(test_text)[0])
rust
use linfa::prelude::*;
use linfa_bayes::GaussianNb;
use ndarray::{Array2, Array1};

fn main() {
    // Placeholder data: in practice, load Iris via CSV or the linfa-datasets crate.
    // All-zero arrays will not train a meaningful model; they only illustrate the API shape.
    let x_train: Array2<f64> = Array2::zeros((120, 4)); // Simplified
    let y_train: Array1<i32> = Array1::zeros(120);
    let x_test: Array2<f64> = Array2::zeros((30, 4));
    let y_test: Array1<i32> = Array1::zeros(30);

    let dataset = Dataset::new(x_train, y_train);
    let model = GaussianNb::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds.iter().zip(y_test.iter()).filter(|(&p, &t)| p == t).count() as f64 / y_test.len() as f64;
    println!("Gaussian NB Accuracy: {}", accuracy);

    // Multinomial NB placeholder
}

Note: The Rust example uses zero-filled placeholder arrays; load real data (for example via CSV or the linfa-datasets crate) before expecting meaningful output, or use the Python version for the full Iris example.


9. Case Study: SMS Spam Detection (Text Classification)

python
from sklearn.datasets import fetch_20newsgroups  # Placeholder for SMS data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load data (use SMS spam dataset in practice)
newsgroups = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train
mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)

# Evaluate
y_pred = mnb.predict(X_test_vec)
print(classification_report(y_test, y_pred))
rust
use linfa::prelude::*;
use linfa_bayes::MultinomialNb;
use ndarray::{Array2, Array1};

fn main() {
    // Placeholder: vectorize the text (e.g., TF-IDF) beforehand; linfa operates on numeric feature matrices
    let x_train: Array2<f64> = Array2::zeros((800, 1000)); // TF-IDF example
    let y_train: Array1<i32> = Array1::zeros(800);
    let x_test: Array2<f64> = Array2::zeros((200, 1000));
    let y_test: Array1<i32> = Array1::zeros(200);

    let dataset = Dataset::new(x_train, y_train);
    let model = MultinomialNb::params().fit(&dataset).unwrap();
    let preds = model.predict(&x_test);
    let accuracy = preds.iter().zip(y_test.iter()).filter(|(&p, &t)| p == t).count() as f64 / y_test.len() as f64;
    println!("Multinomial NB Accuracy: {}", accuracy);
}

Note: The Rust sketch assumes the text has already been vectorized (e.g., TF-IDF computed elsewhere); use the Python version for an end-to-end text example.


10. Under the Hood Insights

  • Generative Model: Models the joint P(\mathbf{x}, y), unlike discriminative logistic regression, which models P(y | \mathbf{x}) directly.
  • Independence Assumption: "Naive" but effective with smoothing.
  • Scalability: Handles high-d sparse data well.
  • Probability Calibration: Raw Naive Bayes posteriors are often overconfident because the independence assumption double-counts correlated evidence; apply calibration (e.g., Platt scaling or isotonic regression) when reliable probabilities matter, as sketched below.
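
When calibrated probabilities matter, a standard remedy is to wrap the model in scikit-learn's CalibratedClassifierCV; a minimal sketch on synthetic data (the method and cv values are arbitrary choices):

python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr)

print("raw Brier       :", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated Brier:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))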

11. Limitations

  • Independence Assumption: Rarely true; accuracy degrades when features are strongly correlated.
  • High-D: Benefits from feature selection.
  • Zero Probability: Smoothing essential.
  • Imbalanced Data: Requires class weighting.

12. Summary

Naive Bayes is a fast, probabilistic classifier excelling in high-dimensional data. In 2025, its efficiency in federated learning, LLM integration, and anomaly detection keeps it vital. Its simplicity and interpretability make it a strong baseline.


Further Reading

  • James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning (Ch. 4).
  • Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (Ch. 4).
  • linfa-bayes docs: github.com/rust-ml/linfa.
  • Zhang, "The Naïve Bayes Classifier" (2005).