Data Preprocessing

Data preprocessing is a critical step in machine learning (ML): it transforms raw data into a format suitable for modeling, improving both performance and training stability. This section provides a comprehensive exploration of normalization, encoding, handling missing values, and feature engineering, with a Rust lab using polars. We’ll dive into computational efficiency, pipeline mechanics, and Rust’s performance advantages, starting the Practical ML Skills module.
Theory
Preprocessing addresses issues like inconsistent scales, categorical variables, missing data, and irrelevant features in a dataset $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, $x_i \in \mathbb{R}^d$, and targets $y_i$. Effective preprocessing ensures models converge faster and generalize better.
Normalization
Normalization scales features to a consistent range, preventing features with large ranges (e.g., house size in square feet) from dominating smaller ones (e.g., number of bedrooms). Common methods include:
- Min-Max Scaling: Scales feature $x_j$ to $[0, 1]$: $x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$.
- Standardization: Centers $x_j$ to zero mean and unit variance: $x_j' = \frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$, $\sigma_j^2 = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \mu_j)^2$.
Derivation: Standardization ensures $x_j'$ has mean 0 and variance 1. For a feature $x_j$:

$$\mathbb{E}[x_j'] = \mathbb{E}\left[\frac{x_j - \mu_j}{\sigma_j}\right] = \frac{\mathbb{E}[x_j] - \mu_j}{\sigma_j} = 0, \qquad \mathrm{Var}(x_j') = \mathrm{Var}\left(\frac{x_j - \mu_j}{\sigma_j}\right) = \frac{\sigma_j^2}{\sigma_j^2} = 1.$$

This stabilizes gradient descent by normalizing feature scales.
Under the Hood: Normalization requires computing statistics ($\mu_j$, $\sigma_j$), costing $O(n)$ per feature. polars optimizes this with parallelized column operations, leveraging Rust’s rayon crate for multi-threading, unlike Python’s pandas, which processes columns sequentially for large datasets. Rust’s memory safety prevents buffer overflows during scaling, a risk in C++.
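To make the two scalings concrete, here is a minimal standalone sketch over a plain `Vec<f64>`, independent of polars (the function names are illustrative, not from any library):

```rust
/// Min-max scaling: maps each value into [0, 1].
fn min_max_scale(xs: &[f64]) -> Vec<f64> {
    let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    xs.iter().map(|&x| (x - min) / (max - min)).collect()
}

/// Standardization: shifts to zero mean and scales to unit (population) variance.
fn standardize(xs: &[f64]) -> Vec<f64> {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    xs.iter().map(|&x| (x - mean) / std).collect()
}

fn main() {
    let sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0];
    println!("{:?}", min_max_scale(&sizes)); // [0.0, 0.25, 0.5, 0.75, 1.0]
    println!("{:?}", standardize(&sizes));
}
```

Note that min-max scaling is sensitive to outliers (a single extreme value compresses everything else toward 0), which is one reason standardization is often preferred for gradient-based models.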
Encoding Categorical Variables
Categorical features (e.g., color: red, blue) must be converted to numerical form:
- One-Hot Encoding: Creates binary columns for each category. For a feature with $K$ categories, a sample with category $k$ gets the vector $e_k \in \{0, 1\}^K$, with a 1 in position $k$ and 0 elsewhere.
- Label Encoding: Assigns integers (e.g., red=0, blue=1), suitable for ordinal data.
Under the Hood: One-hot encoding increases dimensionality ($K$ columns per categorical feature), impacting memory. polars implements encoding with efficient hash maps, minimizing memory usage compared to Python’s pandas, which may duplicate data. Rust’s type safety ensures correct category mapping, unlike C++, where manual indexing risks errors.
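The hash-map approach described above can be sketched directly: map each distinct category to a column index, then emit one binary row per sample (the helper name is illustrative, not a library API):

```rust
use std::collections::HashMap;

/// One-hot encode a categorical column: each distinct category becomes
/// one binary column, with columns ordered by first appearance.
fn one_hot(values: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    // Assign each new category the next free column index
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut categories: Vec<String> = Vec::new();
    for &v in values {
        if !index.contains_key(v) {
            index.insert(v, categories.len());
            categories.push(v.to_string());
        }
    }
    // Emit a K-dimensional binary row per sample
    let k = categories.len();
    let rows = values
        .iter()
        .map(|&v| {
            let mut row = vec![0.0; k];
            row[index[v]] = 1.0;
            row
        })
        .collect();
    (categories, rows)
}

fn main() {
    let (cats, rows) = one_hot(&["red", "blue", "green", "blue"]);
    println!("{:?}", cats); // ["red", "blue", "green"]
    println!("{:?}", rows); // [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
}
```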
Handling Missing Values
Missing values disrupt ML algorithms. Common strategies include:
- Imputation: Replaces missing with the mean or median.
- Deletion: Removes rows with missing values, reducing .
Derivation: Mean imputation preserves the feature’s mean. If $m$ of the $n$ values are missing and each is replaced by the mean $\mu_j$ of the $n - m$ observed values, then

$$\mu_j' = \frac{1}{n}\left(\sum_{i \in \text{obs}} x_{ij} + m\,\mu_j\right) = \frac{(n - m)\mu_j + m\,\mu_j}{n} = \mu_j.$$

However, it underestimates variance, since imputed values contribute zero squared deviation, so results require careful evaluation.
Under the Hood: Imputation requires scanning the data ($O(n)$ per feature). polars parallelizes this, with Rust’s null handling ensuring robust missing value detection, unlike Python’s pandas, which may slow down for sparse data. Rust’s compile-time checks prevent null pointer errors, common in C++.
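Mean imputation over an `Option<f64>` column can be sketched in a few lines, mirroring how Rust represents nullable values (the helper name is illustrative):

```rust
/// Mean imputation: replace each missing entry (None) with the
/// mean of the observed (Some) values.
fn impute_mean(xs: &[Option<f64>]) -> Vec<f64> {
    // Compute the mean over observed values only
    let observed: Vec<f64> = xs.iter().filter_map(|x| *x).collect();
    let mean = observed.iter().sum::<f64>() / observed.len() as f64;
    // Fill nulls with that mean
    xs.iter().map(|&x| x.unwrap_or(mean)).collect()
}

fn main() {
    let age = [Some(5.0), None, Some(3.0), Some(8.0)];
    // Observed mean is (5 + 3 + 8) / 3 ≈ 5.333
    println!("{:?}", impute_mean(&age));
}
```

Modeling missing data as `Option<f64>` makes the null check explicit at compile time, which is the point made above about Rust preventing null pointer errors.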
Feature Engineering
Feature engineering creates new features to improve model performance, e.g., polynomial features ($x_j^2$, $x_j^3$) or interaction terms. For a feature pair $(x_1, x_2)$, a quadratic expansion is:

$$\phi(x_1, x_2) = (x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2).$$
Under the Hood: Feature engineering increases dimensionality, impacting computation. polars enables efficient feature creation with vectorized operations, leveraging Rust’s performance to minimize overhead, unlike Python’s pandas, which may require costly loops. Rust’s memory safety ensures correct feature matrix updates, avoiding C++’s buffer errors.
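The quadratic expansion above can be sketched as a simple vectorized map over feature pairs (the helper is illustrative, not a library API):

```rust
/// Expand each feature pair (x1, x2) with quadratic and interaction
/// terms: [x1, x2, x1^2, x2^2, x1*x2].
fn quadratic_features(rows: &[(f64, f64)]) -> Vec<[f64; 5]> {
    rows.iter()
        .map(|&(x1, x2)| [x1, x2, x1 * x1, x2 * x2, x1 * x2])
        .collect()
}

fn main() {
    let expanded = quadratic_features(&[(2.0, 3.0)]);
    println!("{:?}", expanded); // [[2.0, 3.0, 4.0, 9.0, 6.0]]
}
```

Expanding 2 features to 5 illustrates the dimensionality cost: a full degree-2 expansion of $d$ features produces $O(d^2)$ columns.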
Evaluation
Preprocessing is evaluated indirectly through model performance (e.g., accuracy, MSE) on validation data. Key metrics include:
- Model Performance: Higher accuracy or lower MSE post-preprocessing.
- Training Stability: Faster convergence or lower loss variance.
- Data Distribution: Post-preprocessing mean, variance, or skewness.
Under the Hood: Preprocessing impacts gradient descent by putting features on comparable scales, which conditions the gradients. polars computes post-preprocessing statistics efficiently, with Rust’s parallel processing reducing latency compared to Python’s sequential operations. Rust’s type system prevents data type mismatches, unlike C++’s manual type handling.
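A quick way to check the post-preprocessing distribution metric above is to recompute mean and variance after scaling; this sketch (helper names illustrative) verifies that standardization yields mean ≈ 0 and variance ≈ 1:

```rust
/// Mean and (population) variance of a column.
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var)
}

fn main() {
    let raw = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0];
    let (m, v) = mean_var(&raw);
    // Standardize using the population standard deviation...
    let z: Vec<f64> = raw.iter().map(|&x| (x - m) / v.sqrt()).collect();
    // ...then verify the post-preprocessing distribution
    let (zm, zv) = mean_var(&z);
    println!("mean = {:.3}, var = {:.3}", zm, zv); // mean ≈ 0.000, var ≈ 1.000
}
```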
Lab: Data Preprocessing with polars
You’ll preprocess a synthetic dataset with missing values, categorical features, and varying scales, applying normalization, encoding, and imputation, then train a model to evaluate impact.
1. Edit src/main.rs in your rust_ml_tutorial project:

   ```rust
   use polars::prelude::*;
   use linfa::prelude::*;
   // LogisticRegression lives in the linfa-logistic crate (not linfa-linear)
   use linfa_logistic::LogisticRegression;
   use ndarray::Array1;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Synthetic dataset: numeric feature (size), categorical feature (color),
       // numeric feature with missing values (age), and a binary target
       let df = df!(
           "size" => [1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 1200.0, 1800.0, 2200.0, 2700.0, 3200.0],
           "color" => ["red", "blue", "green", "blue", "red", "green", "red", "blue", "green", "red"],
           "age" => [Some(5.0), None, Some(3.0), Some(8.0), Some(2.0), Some(4.0), None, Some(6.0), Some(7.0), Some(1.0)],
           "target" => [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
       )?;

       // Preprocessing pipeline (lazy): impute, one-hot encode, standardize
       let df = df
           .lazy()
           // Impute missing age values with the column mean
           .with_column(col("age").fill_null(col("age").mean()))
           // One-hot encode color into three binary columns
           .with_columns([
               col("color").eq(lit("red")).cast(DataType::Float64).alias("color_red"),
               col("color").eq(lit("blue")).cast(DataType::Float64).alias("color_blue"),
               col("color").eq(lit("green")).cast(DataType::Float64).alias("color_green"),
           ])
           // Standardize size and age to zero mean and unit variance
           .with_columns([
               ((col("size") - col("size").mean()) / col("size").std(1)).alias("size"),
               ((col("age") - col("age").mean()) / col("age").std(1)).alias("age"),
           ])
           .collect()?;

       // Extract the feature matrix and target vector
       let features = df
           .select(["size", "age", "color_red", "color_blue", "color_green"])?
           .to_ndarray::<Float64Type>(IndexOrder::C)?;
       let targets: Array1<usize> = df
           .column("target")?
           .f64()?
           .into_no_null_iter()
           .map(|t| t as usize)
           .collect();

       // Train logistic regression with L2 regularization
       let dataset = Dataset::new(features.clone(), targets.clone());
       let model = LogisticRegression::default()
           .alpha(0.1)
           .max_iterations(100)
           .fit(&dataset)?;

       // Evaluate training accuracy
       let preds = model.predict(&features);
       let accuracy = preds
           .iter()
           .zip(targets.iter())
           .filter(|(p, t)| p == t)
           .count() as f64
           / targets.len() as f64;
       println!("Accuracy: {}", accuracy);
       Ok(())
   }
   ```

2. Ensure Dependencies: Verify Cargo.toml includes (the `ndarray` feature is required for `to_ndarray`):

   ```toml
   [dependencies]
   polars = { version = "0.46", features = ["lazy", "ndarray"] }
   linfa = "0.7.1"
   linfa-logistic = "0.7.0"
   ndarray = "0.15.0"
   ```

   Run cargo build.

3. Run the Program:

   ```shell
   cargo run
   ```

   Expected Output (approximate):

   Accuracy: 0.9
Understanding the Results
- Dataset: The synthetic data (10 samples) includes a numeric feature (size), a categorical feature (color), a numeric feature with missing values (age), and a binary target.
- Preprocessing: Mean imputation fills missing ages, one-hot encoding converts colors to binary columns, and standardization normalizes size and age, producing a 5-dimensional feature matrix.
- Model: Logistic regression on the preprocessed data achieves ~90% accuracy, reflecting effective feature preparation.
- Under the Hood: polars optimizes preprocessing with lazy evaluation, deferring computations until collection, which reduces memory usage compared to Python’s pandas, which materializes intermediate dataframes. polars parallelizes operations, speeding up large datasets, and Rust’s type safety prevents null-handling errors, unlike C++’s manual checks. The preprocessing pipeline ensures consistent feature scales, stabilizing gradient descent in logistic regression.
- Evaluation: High training accuracy confirms preprocessing efficacy, though held-out validation data would be needed to quantify generalization.
This lab introduces practical ML skills, preparing for hyperparameter tuning.
Next Steps
Further Reading
Section titled “Further Reading”- An Introduction to Statistical Learning by James et al. (Chapter 6)
- Hands-On Machine Learning by Géron (Chapter 3)
- polars Documentation: github.com/pola-rs/polars
- linfa Documentation: github.com/rust-ml/linfa