Data Preprocessing

Data preprocessing is a critical step in machine learning (ML), transforming raw data into a format suitable for modeling to improve performance and stability. This section provides a comprehensive exploration of normalization, encoding, handling missing values, and feature engineering, with a Rust lab using polars. We’ll dive into computational efficiency, pipeline mechanics, and Rust’s performance advantages, starting the Practical ML Skills module.

Theory

Preprocessing addresses issues like inconsistent scales, categorical variables, missing data, and irrelevant features in a dataset with $m$ samples and $n$ features, $X \in \mathbb{R}^{m \times n}$, and targets $y$. Effective preprocessing ensures models converge faster and generalize better.

Normalization

Normalization scales features to a consistent range, preventing features with large ranges (e.g., house size in square feet) from dominating smaller ones (e.g., number of bedrooms). Common methods include:

  • Min-Max Scaling: Scales feature $x_j$ to $[0,1]$: $x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$
  • Standardization: Centers $x_j$ to zero mean and unit variance: $x_j' = \frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij}$ and $\sigma_j = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m} (x_{ij} - \mu_j)^2}$.
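
For example, for $x_j = [1000, 1500, 3000]$, min-max scaling uses $\min(x_j) = 1000$ and $\max(x_j) = 3000$, giving scaled values $[0, 0.25, 1]$.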

Derivation: Standardization ensures the transformed feature $x_j'$ has mean 0 and variance 1. For a feature $x_j$:

$$\mu_j' = \frac{1}{m}\sum_{i=1}^{m} \frac{x_{ij} - \mu_j}{\sigma_j} = 0, \qquad \sigma_j'^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(\frac{x_{ij} - \mu_j}{\sigma_j}\right)^2 = \frac{1}{\sigma_j^2} \cdot \frac{1}{m-1}\sum_{i=1}^{m} (x_{ij} - \mu_j)^2 = 1$$

This stabilizes gradient descent by normalizing feature scales.

Under the Hood: Normalization requires computing statistics ($\mu_j$, $\sigma_j$), costing $O(m)$ per feature. polars optimizes this with parallelized column operations, leveraging Rust’s rayon crate for multi-threading, unlike Python’s pandas, which processes columns sequentially for large datasets. Rust’s memory safety prevents buffer overflows during scaling, a risk in C++.
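
As a minimal sketch (assuming polars with the lazy feature and hypothetical columns size and age), both scalings can be written as column expressions:

rust
    use polars::prelude::*;

    fn main() -> Result<(), PolarsError> {
        let df = df!(
            "size" => [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
            "age"  => [5.0, 2.0, 3.0, 8.0, 1.0]
        )?;

        let scaled = df
            .lazy()
            .with_columns([
                // Min-max scaling: (x - min) / (max - min)
                ((col("size") - col("size").min())
                    / (col("size").max() - col("size").min()))
                    .alias("size_minmax"),
                // Standardization: (x - mean) / std, with ddof = 1
                ((col("age") - col("age").mean()) / col("age").std(1)).alias("age_std"),
            ])
            .collect()?;

        println!("{scaled}");
        Ok(())
    }

Because these are lazy expressions, the per-column statistics are computed only when collect() runs, letting polars plan and parallelize the work.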

Encoding Categorical Variables

Categorical features (e.g., color: red, blue) must be converted to numerical form:

  • One-Hot Encoding: Creates binary columns for each category. For a feature with $K$ categories, sample $i$ with category $k$ gets the indicator vector $e_k \in \{0,1\}^K$ (a 1 in position $k$, 0 elsewhere).
  • Label Encoding: Assigns integers (e.g., red=0, blue=1), suitable for ordinal data.

Under the Hood: One-hot encoding increases dimensionality ($O(K)$ columns), impacting memory. polars implements encoding with efficient hash maps, minimizing memory usage compared to Python’s pandas, which may duplicate data. Rust’s type safety ensures correct category mapping, unlike C++ where manual indexing risks errors.
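
As a minimal sketch, assuming a color column with the known categories red, blue, and green, both encodings can be expressed with polars expressions (the integer mapping in color_label is only appropriate for ordinal data):

rust
    use polars::prelude::*;

    fn main() -> Result<(), PolarsError> {
        let df = df!("color" => ["red", "blue", "green", "blue", "red"])?;

        let encoded = df
            .lazy()
            .with_columns([
                // One-hot: one 0/1 indicator column per known category
                col("color").eq(lit("red")).cast(DataType::UInt8).alias("color_red"),
                col("color").eq(lit("blue")).cast(DataType::UInt8).alias("color_blue"),
                col("color").eq(lit("green")).cast(DataType::UInt8).alias("color_green"),
                // Label encoding: map each category to an integer
                when(col("color").eq(lit("red"))).then(lit(0))
                    .when(col("color").eq(lit("blue"))).then(lit(1))
                    .otherwise(lit(2))
                    .alias("color_label"),
            ])
            .collect()?;

        println!("{encoded}");
        Ok(())
    }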

Handling Missing Values

Missing values disrupt ML algorithms. Common strategies include:

  • Imputation: Replaces missing $x_{ij}$ with the mean $\mu_j$ or median.
  • Deletion: Removes rows with missing values, reducing $m$.

Derivation: Mean imputation preserves the feature’s mean:

$$\mu_j' = \frac{1}{m}\left(\sum_{i:\, x_{ij}\text{ not missing}} x_{ij} + \sum_{i:\, x_{ij}\text{ missing}} \mu_j\right) = \mu_j$$

where $\mu_j$ is the mean of the observed values.

However, it underestimates variance, because imputed values contribute zero deviation from the mean, so its effect should be evaluated carefully.

Under the Hood: Imputation requires scanning data ($O(m)$ per feature). polars parallelizes this, with Rust’s null handling ensuring robust missing value detection, unlike Python’s pandas, which may slow down for sparse data. Rust’s compile-time checks prevent null pointer errors, common in C++.
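
A minimal sketch of both strategies, assuming a nullable age column: fill_null imputes with the column mean (computed over observed values only), while a filter on is_not_null drops incomplete rows.

rust
    use polars::prelude::*;

    fn main() -> Result<(), PolarsError> {
        let df = df!("age" => [Some(5.0), None, Some(3.0), Some(8.0), None])?;

        // Mean imputation: fill nulls with the mean of the observed values
        let imputed = df
            .clone()
            .lazy()
            .with_columns([col("age").fill_null(col("age").mean())])
            .collect()?;

        // Deletion: keep only rows where age is observed
        let dropped = df
            .lazy()
            .filter(col("age").is_not_null())
            .collect()?;

        println!("imputed:\n{imputed}\ndropped:\n{dropped}");
        Ok(())
    }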

Feature Engineering

Feature engineering creates new features to improve model performance, e.g., polynomial features ($x_1^2$, $x_1 x_2$) or interaction terms. For a feature pair $(x_1, x_2)$, an interaction term is:

$$x_{\text{new}} = x_1 x_2$$

Under the Hood: Feature engineering increases dimensionality, impacting computation. polars enables efficient feature creation with vectorized operations, leveraging Rust’s performance to minimize overhead, unlike Python’s pandas, which may require costly loops. Rust’s memory safety ensures correct feature matrix updates, avoiding C++’s buffer errors.
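
A minimal sketch, assuming hypothetical numeric columns x1 and x2, adding a quadratic and an interaction feature as vectorized expressions:

rust
    use polars::prelude::*;

    fn main() -> Result<(), PolarsError> {
        let df = df!(
            "x1" => [1.0, 2.0, 3.0, 4.0],
            "x2" => [0.5, 1.5, 2.5, 3.5]
        )?;

        let engineered = df
            .lazy()
            .with_columns([
                // Quadratic term: x1^2
                (col("x1") * col("x1")).alias("x1_sq"),
                // Interaction term: x1 * x2
                (col("x1") * col("x2")).alias("x1_x2"),
            ])
            .collect()?;

        println!("{engineered}");
        Ok(())
    }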

Evaluation

Preprocessing is evaluated indirectly through model performance (e.g., accuracy, MSE) on validation data. Key metrics include:

  • Model Performance: Higher accuracy or lower MSE post-preprocessing.
  • Training Stability: Faster convergence or lower loss variance.
  • Data Distribution: Post-preprocessing mean, variance, or skewness.

Under the Hood: Preprocessing impacts gradient descent by putting features, and therefore gradient components, on comparable scales. polars computes post-preprocessing statistics efficiently, with Rust’s parallel processing reducing latency compared to Python’s sequential operations. Rust’s type system prevents data type mismatches, unlike C++’s manual type handling.
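
A minimal sketch of checking a post-preprocessing distribution, assuming a size column that was standardized upstream (its mean should be near 0 and its sample standard deviation near 1):

rust
    use polars::prelude::*;

    fn main() -> Result<(), PolarsError> {
        // Pretend `size` has already been standardized upstream
        let df = df!("size" => [-1.2649, -0.6325, 0.0, 0.6325, 1.2649])?;

        let stats = df
            .lazy()
            .select([
                col("size").mean().alias("mean"),
                col("size").std(1).alias("std"),
            ])
            .collect()?;

        println!("{stats}"); // mean ~ 0, std ~ 1
        Ok(())
    }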

Lab: Data Preprocessing with polars

You’ll preprocess a synthetic dataset with missing values, categorical features, and varying scales, applying normalization, encoding, and imputation, then train a model to evaluate impact.

  1. Edit src/main.rs in your rust_ml_tutorial project:

    rust
    use polars::prelude::*;
    use linfa::prelude::*;
    use linfa_logistic::LogisticRegression;
    use ndarray::{Array1, Array2};
    
    fn main() -> Result<(), PolarsError> {
        // Synthetic dataset: features (numeric, categorical, numeric with missing), binary target
        let df = df!(
            "size" => [1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 1200.0, 1800.0, 2200.0, 2700.0, 3200.0],
            "color" => ["red", "blue", "green", "blue", "red", "green", "red", "blue", "green", "red"],
            "age" => [Some(5.0), None, Some(3.0), Some(8.0), Some(2.0), Some(4.0), None, Some(6.0), Some(7.0), Some(1.0)],
            "target" => [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
        )?;
    
        // Preprocessing pipeline (lazy): impute, one-hot encode, standardize
        let df = df
            .lazy()
            // Impute missing age with the mean of the observed values
            .with_columns([col("age").fill_null(col("age").mean())])
            // One-hot encode color into one 0/1 column per category
            .with_columns([
                col("color").eq(lit("red")).cast(DataType::Float64).alias("color_red"),
                col("color").eq(lit("blue")).cast(DataType::Float64).alias("color_blue"),
                col("color").eq(lit("green")).cast(DataType::Float64).alias("color_green"),
            ])
            // Standardize size and age to zero mean and unit variance
            .with_columns([
                ((col("size") - col("size").mean()) / col("size").std(1)).alias("size"),
                ((col("age") - col("age").mean()) / col("age").std(1)).alias("age"),
            ])
            .collect()?;
    
        // Extract the feature matrix and targets for linfa
        let feature_names = ["size", "age", "color_red", "color_blue", "color_green"];
        let mut x = Array2::<f64>::zeros((df.height(), feature_names.len()));
        for (j, name) in feature_names.iter().enumerate() {
            let values = df.column(name)?.as_materialized_series().f64()?;
            for (i, v) in values.into_no_null_iter().enumerate() {
                x[[i, j]] = v;
            }
        }
        let y: Array1<usize> = df
            .column("target")?
            .as_materialized_series()
            .f64()?
            .into_no_null_iter()
            .map(|t| t as usize)
            .collect();
        let dataset = Dataset::new(x.clone(), y.clone());
    
        // Train logistic regression with L2 regularization
        let model = LogisticRegression::default()
            .alpha(0.1) // L2 regularization strength
            .max_iterations(100)
            .fit(&dataset)
            .unwrap();
    
        // Evaluate training accuracy
        let preds = model.predict(&x);
        let accuracy = preds.iter().zip(y.iter())
            .filter(|(p, t)| p == t).count() as f64 / y.len() as f64;
        println!("Accuracy: {}", accuracy);
    
        Ok(())
    }
  2. Ensure Dependencies:

    • Verify Cargo.toml includes:
      toml
      [dependencies]
      polars = { version = "0.46.0", features = ["lazy"] }
      linfa = "0.7.1"
      linfa-logistic = "0.7.0"
      ndarray = "0.15.0"
    • Run cargo build.
  3. Run the Program:

    bash
    cargo run

    Expected Output (approximate):

    Accuracy: 0.90

Understanding the Results

  • Dataset: Synthetic data with 10 samples includes a numeric feature (size), a categorical feature (color), a numeric feature with missing values (age), and a binary target.
  • Preprocessing: Mean imputation fills missing ages, one-hot encoding converts colors to binary columns, and standardization normalizes size and age, creating a 5D feature matrix.
  • Model: Logistic regression on the preprocessed data achieves ~90% accuracy, reflecting effective feature preparation.
  • Under the Hood: polars optimizes preprocessing with lazy evaluation, deferring computations until necessary, reducing memory usage compared to Python’s pandas, which materializes intermediate dataframes. Rust’s polars parallelizes operations, speeding up large datasets, and its type safety prevents null-handling errors, unlike C++’s manual checks. The preprocessing pipeline ensures consistent feature scales, stabilizing gradient descent in logistic regression.
  • Evaluation: High accuracy confirms preprocessing efficacy, though validation data would quantify generalization.

This lab introduces practical ML skills, preparing for hyperparameter tuning.

Next Steps

Continue to Hyperparameter Tuning for model optimization, or revisit Optimization.

Further Reading