Data Preprocessing
Preparation and standardization of raw data through cleaning, transformation, and normalization to improve analyses and models.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Design
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
- Introduction of bias through incorrect cleaning
- Overfitting from excessive feature engineering
- Scalability issues with large data volumes
- Encapsulate transformations as reusable components
- Careful versioning of schemas and pipelines
- Introduce automated tests for data quality
I/O & resources
- Raw datasets from various sources
- Schema definitions and metadata
- Quality rules and validation specifications
- Cleaned and standardized datasets
- Computed features and transformation logs
- Validation reports and metrics
Description
Data preprocessing prepares raw data for analysis and modeling through cleaning, transformation, and normalization. It reduces noise, handles missing values, and standardizes formats so that algorithms and reports receive consistent inputs. It is commonly used within data pipelines and machine learning workflows.
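The three steps named in the description can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the record layout (`city`, `amount`), the median as fill value, and min-max scaling are all assumptions chosen for the example.

```python
from statistics import median

def preprocess(records):
    """Clean, impute, and min-max normalize a list of {'city', 'amount'} dicts."""
    # Cleaning: standardize the text format and drop records without a city.
    cleaned = [
        {"city": r["city"].strip().lower(), "amount": r["amount"]}
        for r in records
        if r.get("city") and r["city"].strip()
    ]
    observed = [r["amount"] for r in cleaned if r["amount"] is not None]
    if not observed:
        # Nothing to learn a fill value or a scale from; return cleaned records as-is.
        return cleaned
    # Imputation: replace missing amounts with the median of the observed ones.
    fill = median(observed)
    for r in cleaned:
        if r["amount"] is None:
            r["amount"] = fill
    # Normalization: rescale amounts to [0, 1]; a constant column maps to 0.0.
    lo, hi = min(observed), max(observed)
    span = hi - lo
    for r in cleaned:
        r["amount"] = (r["amount"] - lo) / span if span else 0.0
    return cleaned

raw = [
    {"city": " Berlin ", "amount": 10.0},
    {"city": "berlin", "amount": None},
    {"city": "", "amount": 5.0},
    {"city": "PARIS", "amount": 30.0},
]
result = preprocess(raw)
```

Here the record with an empty city is dropped, the missing amount is filled with the median of the remaining values, and the amounts are rescaled to the range 0 to 1.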
✔Benefits
- Improved accuracy of analyses and models
- Consistent data formats across systems
- Early error detection and reduced rework
✖Limitations
- Effort to develop and maintain pipelines
- Loss of information from improper transformations
- Misinterpretation due to inappropriate imputation
Trade-offs
Metrics
- Share of cleaned records
Percentage of records that pass validation rules.
- Error rate after preprocessing
Number of faulty records per million after processing.
- Pipeline runtime
Average time to process a given data volume.
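As a rough sketch, the three metrics above could be computed as follows. The validation rule `is_valid` and the sample records are hypothetical; a real pipeline would plug in its own quality rules and would average the runtime over several runs.

```python
import time

def share_valid_pct(records, is_valid):
    """Percentage of records that pass the validation rule."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if is_valid(r)) / len(records)

def errors_per_million(records, is_valid):
    """Faulty records per million, measured on the processed output."""
    if not records:
        return 0.0
    faulty = sum(1 for r in records if not is_valid(r))
    return 1_000_000 * faulty / len(records)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) for one run."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical rule: an amount must be present and non-negative.
def is_valid(record):
    return record.get("amount") is not None and record["amount"] >= 0

records = [{"amount": 1.0}, {"amount": 2.5}, {"amount": None}, {"amount": 4.0}]
share = share_valid_pct(records, is_valid)
error_rate = errors_per_million(records, is_valid)
```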
Examples & implementations
E-commerce sales analysis
Unifying transaction data and removing duplicates before monthly reporting.
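A minimal sketch of this scenario, assuming each transaction carries a hypothetical `order_id` and an `amount` that may arrive as a string or a number. Duplicates are detected on the normalized identifier and the first occurrence is kept; other deduplication keys are equally plausible.

```python
def unify_and_dedupe(transactions):
    """Normalize ids and amounts, then keep the first record per order id."""
    seen = set()
    unique = []
    for t in transactions:
        # Unify the format so that " a-1 " and "A-1" count as the same order.
        key = t["order_id"].strip().upper()
        if key in seen:
            continue
        seen.add(key)
        unique.append({"order_id": key, "amount": round(float(t["amount"]), 2)})
    return unique

transactions = [
    {"order_id": "a-1", "amount": "19.90"},
    {"order_id": " A-1 ", "amount": 19.9},
    {"order_id": "B-2", "amount": "5"},
]
deduped = unify_and_dedupe(transactions)
```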
Sensor data preprocessing
Smoothing and imputation of readings in an IoT data stream.
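One way to sketch this is forward-fill imputation followed by a trailing moving average, which only looks at past readings and therefore suits a stream. Both choices are assumptions made for illustration, as is the convention that a missing reading is represented by `None`. The sketch also assumes the first reading is present.

```python
def forward_fill(readings):
    """Replace each missing reading (None) with the last observed value."""
    filled, last = [], None
    for x in readings:
        if x is not None:
            last = x
        filled.append(last)
    return filled

def moving_average(values, window=3):
    """Trailing moving average; early positions use the values seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

readings = [1.0, None, 4.0, None, 7.0]
filled = forward_fill(readings)
smoothed = moving_average(filled, window=3)
```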
Customer segmentation
Feature computation and scaling of customer attributes prior to clustering.
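The following sketch computes two hypothetical features per customer (average order value and order count) and standardizes them to zero mean and unit variance. Scaling matters here because distance-based clustering is otherwise dominated by the feature with the largest numeric range. The field names are assumptions, and every customer is assumed to have at least one order.

```python
from statistics import mean, pstdev

def customer_features(customers):
    """Feature computation: [average order value, number of orders] per customer."""
    return [[c["total_spend"] / c["orders"], c["orders"]] for c in customers]

def standardize(rows):
    """Z-score each column; a constant column is left centered at 0."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

customers = [
    {"total_spend": 100.0, "orders": 2},
    {"total_spend": 300.0, "orders": 4},
]
features = customer_features(customers)
scaled = standardize(features)
```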
Implementation steps
Define requirements and quality rules.
Implement and version pipelines modularly.
Ensure monitoring, tests and reproducibility.
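The steps above can be sketched as a small modular pipeline: quality rules are declared as data, each transformation is a plain function, and a version string is kept alongside the pipeline. All names here (`RULES`, `PIPELINE_VERSION`, `strip_ids`) are hypothetical and only illustrate the structure.

```python
PIPELINE_VERSION = "1.0.0"  # hypothetical version tag, bumped when steps change

# Hypothetical quality rules: rule name -> predicate over one record.
RULES = {
    "has_id": lambda r: bool(r.get("id")),
    "non_negative_amount": lambda r: r.get("amount") is not None and r["amount"] >= 0,
}

def validate(records):
    """Return the passing records and a count of violations per rule."""
    passed, violations = [], {name: 0 for name in RULES}
    for r in records:
        failed = [name for name, rule in RULES.items() if not rule(r)]
        for name in failed:
            violations[name] += 1
        if not failed:
            passed.append(r)
    return passed, violations

def strip_ids(records):
    """Example transformation step: trim whitespace around ids."""
    return [{**r, "id": r["id"].strip()} for r in records]

def run_pipeline(records, steps):
    """Apply each step in order; steps are reusable, independently testable functions."""
    for step in steps:
        records = step(records)
    return records

raw = [
    {"id": " a ", "amount": 1.0},
    {"id": "", "amount": 2.0},
    {"id": "b", "amount": -1.0},
]
passed, report = validate(raw)
output = run_pipeline(passed, [strip_ids])
```

Because each step is an ordinary function, it can be covered by a unit test on its own, and the violation report can feed monitoring.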
⚠️ Technical debt & bottlenecks
Technical debt
- Hardcoded mappings and missing tests
- Legacy transformations without refactoring
- Lack of observability and logging
Known bottlenecks
Misuse examples
- Excessive imputation without domain checks
- Dropping critical values labeled as 'noise' without analysis
- Using test data to select or fit transformation rules (data leakage)
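A common way to keep held-out data from leaking into transformation parameters is to learn them from the training split only and then apply them unchanged everywhere else. The sketch below illustrates this with a hypothetical `MedianImputer` class; the split and the values are made up for the example.

```python
from statistics import median

class MedianImputer:
    """Learns a fill value from one dataset and applies it to any other."""

    def fit(self, values):
        # Parameters are estimated from the data passed here and nowhere else.
        self.fill_ = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]

train = [1.0, None, 3.0, 5.0]
test = [None, 100.0]

imputer = MedianImputer().fit(train)      # fitted on the training split only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)     # the test split reuses the learned value
```

The outlier in the test split does not influence the fill value, which is what a held-out evaluation requires.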
Typical traps
- Loss of information through aggressive normalization
- Undocumented edge cases in the pipeline
- Unnoticed drift in source formats
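Drift in source formats is easier to notice when incoming records are compared against an explicit expected schema. The check below is a minimal sketch; `EXPECTED_SCHEMA` and its fields are hypothetical, and a production setup would more likely rely on a dedicated schema or validation tool.

```python
# Hypothetical expected schema: field name -> Python type.
EXPECTED_SCHEMA = {"id": str, "amount": float, "ts": str}

def schema_drift(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable differences between record and schema."""
    issues = []
    for field, typ in expected.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            actual = type(record[field]).__name__
            issues.append(f"type changed: {field} is {actual}, expected {typ.__name__}")
    # Fields the schema does not know about, in a stable order.
    for field in sorted(record.keys() - expected.keys()):
        issues.append(f"unexpected field: {field}")
    return issues

drifted = {"id": "1", "amount": "9.5", "ts": "2024-01-01", "region": "EU"}
issues = schema_drift(drifted)
```

Logging or alerting on a non-empty result turns silent drift into a visible event.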
Required skills
Architectural drivers
Constraints
- Availability and quality of source data
- Privacy and compliance requirements
- Limited compute resources in production environments