Statistics

Statistics provides tools to analyze data and evaluate machine learning (ML) models. This section introduces descriptive statistics, hypothesis testing, and confidence intervals, with a Rust lab using the statrs crate.

Descriptive Statistics

Descriptive statistics summarize data through measures like:

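Mean: Average value, $μ = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$.
Variance: Spread of data around the mean, $σ^{2} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - μ)^{2}$. This is the population form; the sample variance divides by $n - 1$ instead of $n$.
Standard Deviation: Square root of the variance, $σ = \sqrt{σ^{2}}$, expressed in the same units as the data.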

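In ML, these measures summarize datasets (e.g., the center and spread of a feature before normalization) and model performance (e.g., the mean and variability of prediction errors).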
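The three formulas above can be sketched in plain Rust without any crates. This is a minimal illustration, not the statrs API: the function names `mean`, `population_variance`, and `population_std_dev` are hypothetical helpers, and the input values are made up for the example.

```rust
/// Arithmetic mean of a slice; returns None for empty input.
fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    Some(xs.iter().sum::<f64>() / xs.len() as f64)
}

/// Population variance (divides by n), matching the formula above.
fn population_variance(xs: &[f64]) -> Option<f64> {
    let mu = mean(xs)?;
    let sum_sq = xs.iter().map(|x| (x - mu).powi(2)).sum::<f64>();
    Some(sum_sq / xs.len() as f64)
}

/// Standard deviation as the square root of the population variance.
fn population_std_dev(xs: &[f64]) -> Option<f64> {
    population_variance(xs).map(f64::sqrt)
}

fn main() {
    // Hypothetical values, chosen only so the arithmetic is easy to follow.
    let values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    println!("mean     = {:?}", mean(&values));
    println!("variance = {:?}", population_variance(&values));
    println!("std dev  = {:?}", population_std_dev(&values));
}
```

For these values the mean is 5, the population variance is 4, and the standard deviation is 2. Returning `Option` keeps the empty-slice case explicit instead of producing `NaN`.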

Hypothesis Testing

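Hypothesis testing assesses whether observed data are consistent with a null hypothesis (for example, "the two groups have the same mean"). A two-sample t-test compares the means of two groups to determine whether the difference is larger than sampling variation alone would explain.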

Example: Test if a model’s predictions have a different mean error than a baseline. The t-statistic is:

$$
t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}
$$

where $\bar{x}_{i}$, $s_{i}^{2}$, and $n_{i}$ are the sample mean, sample variance, and size of group $i$.
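
As a worked illustration with made-up summary numbers (not taken from any real model), suppose the model's errors have mean 0.30 and variance 0.02 over 50 samples, and the baseline's errors have mean 0.25 and variance 0.03 over 50 samples. Substituting into the t-statistic:

$$
t = \frac{0.30 - 0.25}{\sqrt{\frac{0.02}{50} + \frac{0.03}{50}}} = \frac{0.05}{\sqrt{0.001}} \approx 1.58
$$

With samples of this size, the two-tailed critical value at the 5% level is about 2, so a statistic of roughly 1.58 would not indicate a significant difference.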

Confidence Intervals

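A confidence interval gives a range of plausible values for a parameter (e.g., a mean) at a chosen confidence level (e.g., 95%). For a mean:

$$
\bar{x} \pm z \frac{s}{\sqrt{n}}
$$

where $\bar{x}$ is the sample mean, $z$ is the critical value of the standard normal distribution (e.g., 1.96 for 95%), $s$ is the sample standard deviation, and $n$ is the sample size. For small samples, a t critical value with $n - 1$ degrees of freedom is commonly used in place of $z$.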
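
The interval formula can be sketched as a small standard-library function. This is an illustration under stated assumptions: `mean_confidence_interval` is a hypothetical helper name, the caller supplies the critical value `z`, and the data are made up.

```rust
/// Normal-approximation confidence interval for a mean:
/// sample mean +/- z * s / sqrt(n), where s is the sample standard deviation.
/// Returns None when there are fewer than two points (s is undefined).
fn mean_confidence_interval(xs: &[f64], z: f64) -> Option<(f64, f64)> {
    if xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    // Sample variance divides by n - 1.
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let half_width = z * var.sqrt() / n.sqrt();
    Some((mean - half_width, mean + half_width))
}

fn main() {
    // Hypothetical measurements, for illustration only.
    let xs = [2.1, 2.3, 2.0, 2.2, 2.4];
    // 1.96 is the standard normal critical value for a 95% interval.
    if let Some((lo, hi)) = mean_confidence_interval(&xs, 1.96) {
        println!("95% CI: ({:.3}, {:.3})", lo, hi);
    }
}
```

For these five values the interval is roughly (2.06, 2.34). With so few points the normal approximation is optimistic; passing the t critical value for 4 degrees of freedom (about 2.776) instead of 1.96 gives a wider interval.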

Lab: T-Test with `statrs`

You’ll perform a t-test on two synthetic datasets using statrs to compare their means.

Edit src/main.rs in your rust_ml_tutorial project:

rust
```
// ContinuousCDF provides `cdf`; statistics::Distribution provides `mean`.
use statrs::distribution::{ContinuousCDF, StudentsT};
use statrs::statistics::{Data, Distribution};

fn main() {
    // Synthetic datasets
    let data1 = vec![2.1, 2.3, 2.0, 2.2, 2.4]; // Group 1
    let data2 = vec![2.5, 2.7, 2.4, 2.6, 2.8]; // Group 2

    // Compute means
    let mean1 = Data::new(data1.clone()).mean().unwrap();
    let mean2 = Data::new(data2.clone()).mean().unwrap();
    println!("Mean1: {}, Mean2: {}", mean1, mean2);

    // Two-sample t-statistic with an unpooled standard error.
    // Because both groups have the same size, this equals the
    // pooled (equal-variance) statistic.
    let t_stat = t_test(&data1, &data2);
    println!("T-statistic: {}", t_stat);

    // Two-tailed p-value with df = n1 + n2 - 2 = 8
    let t_dist = StudentsT::new(0.0, 1.0, 8.0).unwrap();
    let p_value = 2.0 * (1.0 - t_dist.cdf(t_stat.abs()));
    println!("P-value: {}", p_value);
}

/// Computes t = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2).
fn t_test(data1: &[f64], data2: &[f64]) -> f64 {
    let n1 = data1.len() as f64;
    let n2 = data2.len() as f64;
    let mean1 = data1.iter().sum::<f64>() / n1;
    let mean2 = data2.iter().sum::<f64>() / n2;

    // Sample variances (divide by n - 1)
    let var1 = data1.iter().map(|x| (x - mean1).powi(2)).sum::<f64>() / (n1 - 1.0);
    let var2 = data2.iter().map(|x| (x - mean2).powi(2)).sum::<f64>() / (n2 - 1.0);

    // Standard error of the difference in means
    let se = ((var1 / n1) + (var2 / n2)).sqrt();
    (mean1 - mean2) / se
}
```

Ensure Dependencies:
- Verify Cargo.toml includes:
  toml
```
[dependencies]
statrs = "0.16.0"
```
- Run cargo build.
Run the Program:
bash
```
cargo run
```
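Expected Output (approximate; floating-point rounding may add trailing digits):
```
Mean1: 2.2, Mean2: 2.6
T-statistic: -4.0
P-value: 0.00395
```
Both groups have a sample variance of 0.025, so the standard error is $\sqrt{0.025/5 + 0.025/5} = 0.1$ and $t = (2.2 - 2.6)/0.1 = -4$. A p-value below the conventional 0.05 threshold suggests the means differ significantly.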

Understanding the Results

Datasets: Two groups with slightly different means (~2.2 vs. ~2.6).
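T-Test: The t-statistic measures the difference in means in units of its standard error, and the small p-value indicates that a difference this large would be unlikely if the true means were equal.
ML Relevance: Hypothesis testing helps check whether a difference in model performance (e.g., the error rates of two models) is larger than chance variation would explain.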
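
The lab fixes the degrees of freedom at 8, which is $n_{1} + n_{2} - 2$. When group sizes or variances differ, a common alternative is the Welch-Satterthwaite approximation. The sketch below is an optional extension rather than part of the lab; `sample_variance` and `welch_df` are hypothetical helper names, and it uses only the standard library.

```rust
/// Sample variance (divides by n - 1).
fn sample_variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
}

/// Welch-Satterthwaite approximation to the degrees of freedom
/// for a two-sample t-test with an unpooled standard error.
fn welch_df(a: &[f64], b: &[f64]) -> f64 {
    let (n1, n2) = (a.len() as f64, b.len() as f64);
    // Each q is one group's contribution to the squared standard error.
    let (q1, q2) = (sample_variance(a) / n1, sample_variance(b) / n2);
    (q1 + q2).powi(2) / (q1.powi(2) / (n1 - 1.0) + q2.powi(2) / (n2 - 1.0))
}

fn main() {
    // Same synthetic groups as in the lab.
    let data1 = [2.1, 2.3, 2.0, 2.2, 2.4];
    let data2 = [2.5, 2.7, 2.4, 2.6, 2.8];
    println!("Welch df: {:.3}", welch_df(&data1, &data2));
}
```

For the lab's data the two groups have equal sizes and equal sample variances, so the approximation also comes out at 8 (up to floating-point rounding), which agrees with the value passed to `StudentsT::new`. With unequal groups the result is generally smaller than $n_{1} + n_{2} - 2$ and need not be an integer.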

This lab prepares you for statistical ML methods.

Learning from Official Resources

Deepen Rust skills with:

The Rust Programming Language (The Book): Free at doc.rust-lang.org/book.
Programming Rust: By Blandy, Orendorff, and Tindall.

Next Steps

Proceed to Core Machine Learning for regression techniques, or revisit Probability.
