
Statistics

Statistics provides tools to analyze data and evaluate machine learning (ML) models. This section introduces descriptive statistics, hypothesis testing, and confidence intervals, with a Rust lab using the statrs crate.

Descriptive Statistics

Descriptive statistics summarize data through measures like:

  • Mean: Average value, $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.
  • Variance: Spread of data, $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$.
  • Standard Deviation: $\sigma = \sqrt{\sigma^2}$.

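In ML, these summarize datasets (for example, the center and spread of a feature) and model performance (for example, the mean and spread of prediction errors).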
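The three formulas above can be sketched in plain Rust without any crate. This is a minimal illustration, not a replacement for statrs: the function names `mean`, `population_variance`, and `std_dev` and the sample values are made up for this example, and the variance divides by n to match the definition given here.

```rust
/// Arithmetic mean: (1/n) * sum of x_i. Returns None for empty input.
fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    Some(xs.iter().sum::<f64>() / xs.len() as f64)
}

/// Population variance: (1/n) * sum of (x_i - mean)^2.
fn population_variance(xs: &[f64]) -> Option<f64> {
    let mu = mean(xs)?;
    Some(xs.iter().map(|x| (x - mu).powi(2)).sum::<f64>() / xs.len() as f64)
}

/// Standard deviation: square root of the population variance.
fn std_dev(xs: &[f64]) -> Option<f64> {
    population_variance(xs).map(f64::sqrt)
}

fn main() {
    // Hypothetical values, chosen only for illustration.
    let values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    println!("mean     = {:?}", mean(&values));
    println!("variance = {:?}", population_variance(&values));
    println!("std dev  = {:?}", std_dev(&values));
}
```

Returning `Option<f64>` avoids dividing by zero on an empty slice. Note that the lab later in this section divides by n - 1 (the sample variance) instead, which is the usual choice when the data are a sample rather than the whole population.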

Hypothesis Testing

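Hypothesis testing assesses whether observed data are consistent with a null hypothesis (for example, "the two groups have the same mean"). A two-sample t-test compares the means of two groups to determine whether they differ by more than sampling noise would explain.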

Example: Test if a model’s predictions have a different mean error than a baseline. The t-statistic is:

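$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where $\bar{x}_i$, $s_i^2$, and $n_i$ are the mean, sample variance (computed with divisor $n_i - 1$), and size of group $i$.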

Confidence Intervals

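A confidence interval gives a range of plausible values for a parameter (e.g., the mean) at a chosen confidence level (e.g., 95%). For a mean:

$$\bar{x} \pm z \frac{s}{\sqrt{n}}$$

where $z$ is the z-score (e.g., 1.96 for 95%), $s$ is the sample standard deviation, and $n$ is the sample size. The z-score is a large-sample approximation; for small samples, a critical value from the t-distribution with $n - 1$ degrees of freedom gives a wider, more appropriate interval.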
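As a rough sketch of the interval formula, the function below computes the mean plus or minus z times s over the square root of n in plain Rust. The name `mean_confidence_interval` and the sample values are assumptions made for this example; the standard deviation uses the n - 1 divisor, and the caller supplies z (1.96 for roughly 95%).

```rust
/// Normal-approximation confidence interval for the mean:
/// mean ± z * s / sqrt(n), where s is the sample standard deviation.
/// Returns None when fewer than two observations are given.
fn mean_confidence_interval(xs: &[f64], z: f64) -> Option<(f64, f64)> {
    if xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    // Sample variance with the n - 1 divisor.
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let margin = z * var.sqrt() / n.sqrt();
    Some((mean - margin, mean + margin))
}

fn main() {
    // Illustrative sample; z = 1.96 corresponds to roughly 95% confidence.
    let sample = [2.1, 2.3, 2.0, 2.2, 2.4];
    if let Some((lo, hi)) = mean_confidence_interval(&sample, 1.96) {
        println!("95% CI for the mean: ({:.3}, {:.3})", lo, hi);
    }
}
```

With only five observations the z-based interval is likely too narrow; swapping 1.96 for a t critical value is the usual adjustment in that case.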

Lab: T-Test with statrs

You’ll perform a t-test on two synthetic datasets using statrs to compare their means.

  1. Edit src/main.rs in your rust_ml_tutorial project:

    rust
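    // ContinuousCDF must be in scope for `cdf`; Distribution provides `mean` on Data.
    use statrs::distribution::{ContinuousCDF, StudentsT};
    use statrs::statistics::{Data, Distribution};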
    
    fn main() {
        // Synthetic datasets
        let data1 = vec![2.1, 2.3, 2.0, 2.2, 2.4]; // Group 1
        let data2 = vec![2.5, 2.7, 2.4, 2.6, 2.8]; // Group 2
    
        // Compute means
        let mean1 = Data::new(data1.clone()).mean().unwrap();
        let mean2 = Data::new(data2.clone()).mean().unwrap();
        println!("Mean1: {}, Mean2: {}", mean1, mean2);
    
        // Compute the two-sample t-statistic (unpooled standard error)
        let t_stat = t_test(&data1, &data2);
        println!("T-statistic: {}", t_stat);
    
        // Two-tailed p-value with df = n1 + n2 - 2 = 8 (assumes equal variances)
        let t_dist = StudentsT::new(0.0, 1.0, 8.0).unwrap();
        let p_value = 2.0 * (1.0 - t_dist.cdf(t_stat.abs()));
        println!("P-value: {}", p_value);
    }
    
    fn t_test(data1: &[f64], data2: &[f64]) -> f64 {
        let n1 = data1.len() as f64;
        let n2 = data2.len() as f64;
        let mean1 = data1.iter().sum::<f64>() / n1;
        let mean2 = data2.iter().sum::<f64>() / n2;
    
        let var1 = data1.iter().map(|x| (x - mean1).powi(2)).sum::<f64>() / (n1 - 1.0);
        let var2 = data2.iter().map(|x| (x - mean2).powi(2)).sum::<f64>() / (n2 - 1.0);
    
        let se = ((var1 / n1) + (var2 / n2)).sqrt();
        (mean1 - mean2) / se
    }
  2. Ensure Dependencies:

    • Verify Cargo.toml includes:
      toml
      [dependencies]
      statrs = "0.16.0"
    • Run cargo build.
  3. Run the Program:

    bash
    cargo run

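    Expected Output (approximate; floating-point rounding may add trailing digits):

    Mean1: 2.2, Mean2: 2.6
    T-statistic: -4.0
    P-value: 0.004

    Both samples have a sample variance of 0.025, so the standard error is sqrt(0.025/5 + 0.025/5) = 0.1 and t = (2.2 - 2.6) / 0.1 = -4.0. A p-value below the conventional 0.05 threshold suggests the means differ significantly.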
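The lab fixes the degrees of freedom at 8, which is reasonable when the two groups have similar variances. When variances or group sizes differ, one common alternative is the Welch-Satterthwaite approximation. The sketch below is not part of the lab's code; `sample_variance` and `welch_df` are hypothetical helper names introduced here for illustration.

```rust
/// Sample variance with the n - 1 divisor.
fn sample_variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
}

/// Welch-Satterthwaite approximate degrees of freedom:
/// (q1 + q2)^2 / (q1^2 / (n1 - 1) + q2^2 / (n2 - 1)), with q_i = s_i^2 / n_i.
/// Assumes each slice has at least two elements and nonzero variance.
fn welch_df(a: &[f64], b: &[f64]) -> f64 {
    let (n1, n2) = (a.len() as f64, b.len() as f64);
    let q1 = sample_variance(a) / n1;
    let q2 = sample_variance(b) / n2;
    (q1 + q2).powi(2) / (q1.powi(2) / (n1 - 1.0) + q2.powi(2) / (n2 - 1.0))
}

fn main() {
    // Same synthetic groups as the lab.
    let data1 = [2.1, 2.3, 2.0, 2.2, 2.4];
    let data2 = [2.5, 2.7, 2.4, 2.6, 2.8];
    println!("Welch df: {:.3}", welch_df(&data1, &data2));
}
```

For the lab's data the two variances and sizes are equal, so this approximation agrees with df = 8; the result could be passed to `StudentsT::new` in place of the hard-coded 8.0 when the groups are less balanced.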

Understanding the Results

  • Datasets: Two groups with slightly different means (~2.2 vs. ~2.6).
  • T-Test: The t-statistic and p-value indicate a significant difference, relevant for ML model evaluation.
  • ML Relevance: Hypothesis testing validates model performance (e.g., comparing error rates).

This lab prepares you for statistical ML methods.

Learning from Official Resources

Deepen Rust skills with:

  • The Rust Programming Language (The Book): Free at doc.rust-lang.org/book.
  • Programming Rust: By Blandy, Orendorff, and Tindall.

Next Steps

Proceed to Core Machine Learning for regression techniques, or revisit Probability.

Further Reading

  • An Introduction to Statistical Learning by James et al. (Chapter 5)
  • Andrew Ng’s Machine Learning Specialization (Course 1, Week 3)
  • statrs Documentation: docs.rs/statrs