Data Summaries: Mean, Median, Mode, Variance
Descriptive statistics provide tools to summarize and understand datasets, extracting key features such as central tendency and spread. In artificial intelligence and machine learning (ML), these summaries are essential for data exploration, feature engineering, normalization, and identifying patterns or anomalies. Measures like the mean and variance drive preprocessing, while the median and mode offer robustness to outliers, helping models train effectively on real-world data.
This first lecture in the “Statistics Foundations for AI/ML” series introduces the core concepts of mean (arithmetic, geometric, harmonic), median, mode, and variance, exploring their definitions, calculations, properties, and ML relevance. We’ll blend intuitive explanations with mathematical rigor, supported by examples and implementations in Python and Rust, laying the groundwork for sampling, inference, and advanced statistical methods.
1. The Role of Descriptive Statistics in ML
ML models learn from data, but raw data is often messy. Descriptive statistics condense information:
- Central Tendency: Where data clusters (mean, median, mode).
- Dispersion: How spread out data is (variance, range).
These summaries help:
- Detect outliers.
- Normalize features.
- Choose appropriate models (e.g., Gaussian assumptions).
ML Connection
- Preprocessing: Mean-variance scaling for gradient descent.
- Evaluation: Mean error metrics.
::: info Descriptive statistics turn data chaos into actionable insights, like a map summarizing a landscape. :::
Example
- Dataset [1, 2, 3, 4, 100]: the mean (22) is inflated by the outlier; the median (3) is robust.
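A quick Python sketch (using NumPy) makes the contrast concrete:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 100])
print("Mean:", np.mean(data))      # 22.0: dragged up by the outlier
print("Median:", np.median(data))  # 3.0: unaffected
```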
2. Mean: The Average Value
Arithmetic Mean: \bar{x} = (1/n) sum x_i.
It measures the data's center of gravity.
Properties:
- Sensitive to outliers.
- E[X] for population.
- Linear: \bar{a x + b} = a \bar{x} + b.
Geometric Mean
G = (prod x_i)^{1/n}, defined for positive x_i.
Used for growth rates.
Harmonic Mean
H = n / sum (1/x_i), used for averaging rates.
ML Application
- Mean for feature centering.
- Geometric for ratios in finance ML.
Example: the arithmetic mean of [1, 2, 3] is 2; the variance is computed around this value.
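A minimal stdlib-only sketch comparing the three means on an illustrative dataset [1, 2, 4]:

```python
import math

x = [1.0, 2.0, 4.0]
am = sum(x) / len(x)                 # arithmetic mean: 7/3
gm = math.prod(x) ** (1 / len(x))    # geometric mean: (1*2*4)^(1/3) = 2
hm = len(x) / sum(1 / v for v in x)  # harmonic mean: 3 / 1.75
print(am, gm, hm)
```

For positive data, AM ≥ GM ≥ HM always holds, with equality only when all values are equal.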
3. Median: The Middle Value
Median: Middle value in sorted data (average of middle two for even n).
Robust to outliers.
Properties:
- 50th percentile.
- Minimizes absolute deviation.
ML Insight
- Outlier-resistant in robust regression.
Example: [1, 2, 3, 4, 100] has median 3 (vs. mean 22).
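The minimizing property can be checked numerically; a small sketch comparing total absolute deviation around the median versus the mean:

```python
data = [1, 2, 3, 4, 100]
mean = sum(data) / len(data)           # 22.0
median = sorted(data)[len(data) // 2]  # 3 (odd n)

def total_abs_dev(center):
    # sum of |x - center| over the dataset
    return sum(abs(x - center) for x in data)

print(total_abs_dev(median))  # 101
print(total_abs_dev(mean))    # 156.0
```

No other center value achieves a smaller total absolute deviation than the median.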
4. Mode: The Most Frequent Value
Mode: Value with highest frequency.
Unimodal, bimodal, etc.
Properties:
- Applies to categorical data.
- Not necessarily unique.
ML Application
- Mode for imputing missing data.
Example: [1,2,2,3], mode=2.
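A short Python sketch using the standard library; `statistics.multimode` illustrates that the mode need not be unique:

```python
from collections import Counter
from statistics import multimode

data = [1, 2, 2, 3]
mode_val, freq = Counter(data).most_common(1)[0]
print(mode_val, freq)              # 2 2
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]: a bimodal dataset
```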
5. Variance and Standard Deviation: Measuring Spread
Variance: Var = (1/n) sum (x_i - \bar{x})^2 (sample variance divides by n-1).
Standard Deviation: σ = sqrt(Var).
Properties:
- Var≥0.
- Var(aX+b)=a^2 Var(X).
- Population Var = E[(X-μ)^2] = E[X^2] - μ^2.
ML Connection
- Variance for uncertainty in predictions.
- Standardization: (x - mean)/std.
Example: [1, 2, 3] has mean 2 and Var = (1 + 0 + 1)/3 ≈ 0.67 (population).
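A NumPy sketch checking the population vs. sample conventions and the E[X^2] - μ^2 identity on this example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
mu = np.mean(x)
pop_var = np.var(x)               # divides by n: 2/3
samp_var = np.var(x, ddof=1)      # divides by n-1: 1.0
identity = np.mean(x**2) - mu**2  # E[X^2] - mu^2, matches pop_var
print(pop_var, samp_var, identity)
```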
6. Other Summary Statistics: Skewness, Kurtosis, Quartiles
Skewness: Asymmetry, γ = E[(X-μ)^3]/σ^3.
Kurtosis: Tailedness, κ = E[(X-μ)^4]/σ^4 - 3.
Quartiles: Q1, Q2(median), Q3, IQR=Q3-Q1 for outliers.
ML Application
- Skewness guides transformations (log for right-skew).
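A brief NumPy sketch of IQR-based outlier flagging (Tukey fences) and a log transform for right skew:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 100])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                            # 4 - 2 = 2
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # fences at -1 and 7
outliers = data[(data < lo) | (data > hi)]
print(outliers)        # [100]
print(np.log1p(data))  # log transform compresses the right tail
```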
7. Applications in Machine Learning
- Data Preprocessing: Mean-variance normalization.
- Feature Engineering: Use median for robust stats.
- Model Evaluation: Mean absolute error, variance explained.
- Anomaly Detection: Mode for frequent patterns.
Challenges
- Outliers skew mean/variance; use median/IQR.
8. Numerical Computations of Summaries
Compute the mean, median, mode, and variance of a small dataset.
::: code-group
```python
import numpy as np
from scipy.stats import mode

# Summaries
data = np.array([1, 2, 3, 4, 100])
mean = np.mean(data)
median = np.median(data)
mode_val = mode(data, keepdims=True).mode[0]  # smallest value on ties
variance = np.var(data)
std = np.std(data)
print("Mean:", mean, "Median:", median, "Mode:", mode_val, "Var:", variance, "Std:", std)

# ML: Feature scaling
data_norm = (data - mean) / std
print("Normalized:", data_norm)

# Robust: Median absolute deviation
mad = np.median(np.abs(data - median))
print("MAD:", mad)
```

```rust
fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 100.0];
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;

    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = sorted[sorted.len() / 2]; // middle element for odd n

    // Mode: most frequent value (keeps the first on ties)
    let mut mode_count = 0;
    let mut mode_val = data[0];
    for &v in data.iter() {
        let count = data.iter().filter(|&&x| x == v).count();
        if count > mode_count {
            mode_count = count;
            mode_val = v;
        }
    }

    let variance = data.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / n;
    let std = variance.sqrt();
    println!("Mean: {}, Median: {}, Mode: {}, Var: {}, Std: {}", mean, median, mode_val, variance, std);

    // ML: Feature scaling
    let data_norm: Vec<f64> = data.iter().map(|&x| (x - mean) / std).collect();
    println!("Normalized: {:?}", data_norm);

    // MAD: median of absolute deviations from the median
    let mut abs_dev: Vec<f64> = data.iter().map(|&x| (x - median).abs()).collect();
    abs_dev.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mad = abs_dev[abs_dev.len() / 2];
    println!("MAD: {}", mad);
}
```
:::
This computes the summaries and applies feature scaling.
9. Symbolic Computations with SymPy
Derive symbolic expressions for the mean and variance.
::: code-group
```python
from sympy import symbols, Sum, Indexed

i, n = symbols('i n', integer=True)
x = Indexed('x', i)
mean = (1 / n) * Sum(x, (i, 1, n))
print("Mean:", mean)

var = (1 / n) * Sum((x - mean)**2, (i, 1, n))
print("Variance:", var)
```

```rust
fn main() {
    // No SymPy equivalent in Rust; print the formulas instead
    println!("Mean: (1/n) sum x_i");
    println!("Variance: (1/n) sum (x_i - mean)^2");
}
```
:::
10. Challenges in Descriptive Stats for ML
- Outliers: Skew mean/variance; use robust measures.
- High dimensionality: summaries are computed per feature; the curse of dimensionality limits their usefulness.
- Categorical: Mode for non-numeric.
11. Key ML Takeaways
Section titled “11. Key ML Takeaways”- Mean centers data: For normalization.
- Median robust: To outliers.
- Mode captures frequency: Categorical summaries.
- Variance spreads: Uncertainty measure.
- Code computes: Practical stats.
Summaries essential for data understanding.
12. Summary
This lecture introduced descriptive statistics: the mean (arithmetic, geometric, harmonic), median, mode, and variance, with their properties and ML applications. Examples and Python/Rust code bridge theory to practice, preparing for distributions and sampling.
Further Reading
- Wasserman, All of Statistics (Ch. 1).
- James, Introduction to Statistical Learning (Ch. 2).
- Khan Academy: Descriptive stats videos.
- Rust: the `statrs` crate for statistics functions.