method · Data · Analytics · Integration · Software Engineering

Data Preprocessing

Preparation and standardization of raw data through cleaning, transformation, and normalization to improve analyses and models.

Established
Medium

Classification

  • Medium
  • Technical
  • Design
  • Intermediate

Technical context

  • Database and data warehouse systems (e.g., Postgres, Snowflake)
  • ETL/ELT tools and orchestrators (e.g., Airflow, dbt)
  • Stream processing platforms (e.g., Kafka, Flink)

Principles & goals

  • Early validation of data quality
  • Idempotent, reproducible transformations
  • Separation of cleaning, transformation, and feature engineering
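
The idempotence principle can be illustrated with a minimal sketch. The record layout (`email`, `country`) and the function name `clean_record` are hypothetical and chosen only for illustration; the point is that running the cleaning step twice gives the same result as running it once, so a pipeline rerun is safe.

```python
def clean_record(record: dict) -> dict:
    """Return a cleaned copy; the input record is not modified."""
    cleaned = dict(record)
    # Trim whitespace and lowercase the (hypothetical) email field.
    cleaned["email"] = record["email"].strip().lower()
    # Normalize the (hypothetical) country code to upper case.
    cleaned["country"] = record["country"].strip().upper()
    return cleaned


raw = {"email": "  Alice@Example.COM ", "country": " de"}
once = clean_record(raw)
twice = clean_record(once)

# Idempotent: a second pass changes nothing, so reruns are safe.
assert once == twice
```

Returning a copy instead of mutating the input also keeps the raw record available for comparison and logging.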
Build
Domain, Team

Use cases & scenarios

Compromises

  • Introduction of bias through incorrect cleaning
  • Overfitting from excessive feature engineering
  • Scalability issues with large data volumes

  • Encapsulate transformations as reusable components
  • Careful versioning of schemas and pipelines
  • Introduce automated tests for data quality

I/O & resources

  • Raw datasets from various sources
  • Schema definitions and metadata
  • Quality rules and validation specifications

  • Cleaned and standardized datasets
  • Computed features and transformation logs
  • Validation reports and metrics
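
How quality rules turn raw records into a cleaned dataset plus a validation report can be sketched as follows. The rule names, the fields `amount` and `currency`, and the function `validate` are assumptions made for this example, not part of any specific tool.

```python
from typing import Callable

# Hypothetical quality rules: rule name -> predicate over one record.
RULES: dict[str, Callable[[dict], bool]] = {
    "amount_non_negative": lambda r: r.get("amount", -1) >= 0,
    "currency_present": lambda r: bool(r.get("currency")),
}


def validate(records: list[dict]) -> tuple[list[dict], dict[str, int]]:
    """Return records that pass all rules and a failure count per rule."""
    passed = []
    failures = {name: 0 for name in RULES}
    for record in records:
        failed = [name for name, rule in RULES.items() if not rule(record)]
        for name in failed:
            failures[name] += 1
        if not failed:
            passed.append(record)
    return passed, failures


raw = [
    {"amount": 10.0, "currency": "EUR"},
    {"amount": -5.0, "currency": "EUR"},
    {"amount": 3.5, "currency": ""},
]
clean, report = validate(raw)
```

Here `clean` holds only the first record, and `report` counts one failure for each rule. Keeping the rules in a plain mapping makes them easy to version next to the schema definitions.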

Description

Data preprocessing prepares raw data for analysis and modeling through cleaning, transformation, and normalization. It reduces noise, handles missing values, and standardizes formats so that algorithms and reports receive consistent inputs. It is commonly used within data pipelines and machine learning workflows.
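
Two of the steps named above, handling missing values and standardizing formats, can be sketched with the Python standard library. The list of accepted date formats and the choice of median imputation are assumptions for this example; which formats and which imputation strategy are appropriate depends on the data and the domain.

```python
from datetime import datetime
from statistics import median

# Hypothetical date formats seen in the raw sources.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def to_iso_date(text: str) -> str:
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def impute_median(values: list) -> list:
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in values]


dates = [to_iso_date(d) for d in ["2024-03-01", "02.03.2024", "03/03/2024"]]
prices = impute_median([10.0, None, 14.0, 12.0])
```

Unrecognized formats raise an error instead of being guessed, which surfaces format problems early rather than silently corrupting the data.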

  • Improved accuracy of analyses and models
  • Consistent data formats across systems
  • Early error detection and reduced rework

  • Effort to develop and maintain pipelines
  • Loss of information from improper transformations
  • Misinterpretation due to inappropriate imputation

  • Share of cleaned records

    Percentage of records that pass validation rules.

  • Error rate after preprocessing

    Number of faulty records per million after processing.

  • Pipeline runtime

    Average time to process a given data volume.
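
The three metrics above can be computed as in the following sketch. The function names and the sample counts are illustrative assumptions; in practice the counts would come from the validation step of the pipeline.

```python
import time


def share_of_clean_records(passed: int, total: int) -> float:
    """Percentage of records that pass validation rules."""
    return 100.0 * passed / total


def errors_per_million(faulty: int, total: int) -> float:
    """Number of faulty records per million processed records."""
    return 1_000_000 * faulty / total


def timed(func, *args):
    """Run func and return (result, elapsed seconds) from a monotonic clock."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


share = share_of_clean_records(passed=9_950, total=10_000)  # 99.5
rate = errors_per_million(faulty=3, total=2_000_000)        # 1.5
_, seconds = timed(sorted, list(range(1000)))
```

Averaging `seconds` over several runs on a fixed data volume gives the pipeline runtime metric.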

E-commerce sales analysis

Unifying transaction data and removing duplicates before monthly reporting.
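
A minimal sketch of duplicate removal, assuming each transaction carries a hypothetical `transaction_id` field whose spelling may differ in case and whitespace between source systems:

```python
def deduplicate(transactions: list[dict]) -> list[dict]:
    """Keep the first occurrence of each transaction ID, preserving order."""
    seen = set()
    unique = []
    for tx in transactions:
        # Normalize the hypothetical ID field so 'a-1 ' and 'A-1' match.
        key = tx["transaction_id"].strip().upper()
        if key not in seen:
            seen.add(key)
            unique.append(tx)
    return unique


transactions = [
    {"transaction_id": "A-1", "amount": 20.0},
    {"transaction_id": "a-1 ", "amount": 20.0},
    {"transaction_id": "B-7", "amount": 5.5},
]
unique = deduplicate(transactions)
```

The second record is dropped as a duplicate of the first. Whether the first or the most recent occurrence should win is a domain decision that this sketch does not settle.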

Sensor data preprocessing

Smoothing and imputation of readings in an IoT data stream.
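
One possible approach, shown here as a sketch rather than a recommended method, combines forward-fill imputation with a trailing moving average. The function name `smooth_stream` and the window size are assumptions for illustration.

```python
from collections import deque


def smooth_stream(readings, window: int = 3):
    """Yield a trailing moving average; None readings reuse the last value."""
    recent = deque(maxlen=window)
    last = None
    for value in readings:
        if value is None:
            if last is None:
                continue  # nothing to carry forward yet, skip the gap
            value = last  # forward-fill imputation
        last = value
        recent.append(value)
        yield sum(recent) / len(recent)


smoothed = list(smooth_stream([10.0, None, 13.0, 16.0], window=2))
```

Because the function is a generator with a bounded window, it processes readings one at a time and keeps memory use constant, which suits a stream.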

Customer segmentation

Feature computation and scaling of customer attributes prior to clustering.
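
Scaling matters before clustering because distance-based methods otherwise let the attribute with the largest numeric range dominate. The sketch below uses z-score standardization as one common choice; the attribute names and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


# Hypothetical customer attributes on very different scales.
annual_spend = [200.0, 400.0, 600.0]
visits = [2.0, 4.0, 6.0]

scaled_spend = standardize(annual_spend)
scaled_visits = standardize(visits)
```

After scaling, both attributes lie on the same scale, so neither dominates a distance computation merely because of its unit.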

1. Define requirements and quality rules.
2. Implement and version pipelines modularly.
3. Ensure monitoring, tests, and reproducibility.
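
A modular pipeline can be sketched as an ordered list of small, independently testable steps. The step names, the `amount` field, and the version tag below are hypothetical; orchestration tools provide their own mechanisms for the same idea.

```python
from functools import reduce
from typing import Callable

Step = Callable[[list[dict]], list[dict]]

PIPELINE_VERSION = "1.0.0"  # hypothetical version tag, bumped with each change


def drop_missing_amount(rows: list[dict]) -> list[dict]:
    """Cleaning step: remove rows without an amount."""
    return [r for r in rows if r.get("amount") is not None]


def add_amount_cents(rows: list[dict]) -> list[dict]:
    """Transformation step: derive an integer cents column."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


def run_pipeline(rows: list[dict], steps: list[Step]) -> list[dict]:
    """Apply each step in order to the output of the previous one."""
    return reduce(lambda data, step: step(data), steps, rows)


STEPS: list[Step] = [drop_missing_amount, add_amount_cents]
result = run_pipeline([{"amount": 1.25}, {"amount": None}], STEPS)
```

Each step can be covered by its own unit test, and recording `PIPELINE_VERSION` with every output helps reproduce a given run.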

⚠️ Technical debt & bottlenecks

  • Hardcoded mappings and missing tests
  • Legacy transformations without refactoring
  • Lack of observability and logging

  • I/O bottlenecks with large raw data
  • Compute-intensive transformations
  • Missing metadata about data provenance

  • Excessive imputation without domain checks
  • Dropping critical values labeled as 'noise' without analysis
  • Selecting transformation rules using test data, which leaks information into the model

  • Loss of information through aggressive normalization
  • Undocumented edge cases in the pipeline
  • Unnoticed drift in source formats
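
Drift in source formats can be made visible with a simple check of incoming records against an expected schema. The schema, field names, and the function `detect_drift` are assumptions for this sketch.

```python
# Hypothetical expected schema: field name -> Python type.
EXPECTED_SCHEMA = {"order_id": str, "amount": float, "created_at": str}


def detect_drift(record: dict) -> list[str]:
    """Return human-readable differences between a record and the schema."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Report fields the schema does not know, in a stable order.
    for field in sorted(record.keys() - EXPECTED_SCHEMA.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


issues = detect_drift({"order_id": 42, "amount": 9.99, "currency": "EUR"})
```

Logging or alerting on a non-empty result turns silent format changes into visible events.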

  • Data modeling and SQL skills
  • Knowledge of data cleaning and transformation
  • Understanding of performance and scaling concerns

  • Data quality and verifiability
  • Scalability of processing
  • Repeatability and reproducibility

  • Availability and quality of source data
  • Privacy and compliance requirements
  • Limited compute resources in production environments