Data Quality
Concept for ensuring and managing data quality through metrics, governance, and improvement processes.
Classification
- Complexity: Medium
- Impact area: Business
- Decision type: Organizational
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Trade-offs
- Focusing on measurable metrics instead of actual value
- Excessive gates that hinder innovation and speed
- Lack of domain acceptance leading to workarounds
Recommendations
- Start with a few business-relevant metrics
- Integrate automated tests into CI/CD
- Define ownership and SLAs per data product
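The recommendation to integrate automated tests into CI/CD can be sketched as a simple gate function that fails a pipeline run when too many records miss required fields. Field names, record layout, and the threshold below are assumptions for illustration, not part of the concept itself.

```python
# Minimal sketch of an automated data-quality check suitable for a CI/CD step.
# REQUIRED_FIELDS and the threshold are illustrative assumptions.

REQUIRED_FIELDS = ["customer_id", "email"]

def check_completeness(records, required=REQUIRED_FIELDS, threshold=0.95):
    """Return True if the share of complete records meets the threshold."""
    if not records:
        return False
    complete = sum(
        1 for r in records if all(r.get(f) not in (None, "") for f in required)
    )
    return complete / len(records) >= threshold

records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},  # incomplete: empty required field
    {"customer_id": 3, "email": "c@example.com"},
]
print(check_completeness(records, threshold=0.6))   # 2/3 complete -> True
print(check_completeness(records, threshold=0.95))  # 2/3 complete -> False
```

In a CI/CD pipeline, a failing return value (or a raised assertion) would block the deployment of the data product until the quality issue is resolved.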
I/O & resources
- Data sources and their schemas
- Business rules and acceptance criteria
- Metadata and data lineage
- Quality metrics and dashboards
- Alerts and error reports
- Improved data products and contracts
Description
Data quality describes the fitness of data for specific purposes, characterized by accuracy, completeness, consistency, and timeliness. The concept covers measurement methods, governance, data lineage and processes for improvement. It is vital for reliable analytics, operational processes and automated decision-making.
✔ Benefits
- Increased reliability of analytics and reporting
- Reduced error costs in operational processes
- Better decision basis for management
✖ Limitations
- Requires organizational alignment and ownership
- Complete error-free data is often unattainable
- Measurement and automation have initial implementation costs
Metrics
- Completeness rate
Share of records with required fields populated.
- Accuracy rate
Share of values validated against authoritative sources.
- Freshness/latency
Time since last update of relevant data fields.
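The three metrics above can be computed with straightforward helper functions. The record layout and reference data are illustrative assumptions; real implementations would plug these into a monitoring pipeline.

```python
# Sketches of the three core metrics; data shapes are assumed for illustration.
from datetime import datetime, timezone

def completeness_rate(records, required_fields):
    """Share of records with all required fields populated."""
    if not records:
        return 0.0
    ok = sum(1 for r in records
             if all(r.get(f) not in (None, "") for f in required_fields))
    return ok / len(records)

def accuracy_rate(values, authoritative):
    """Share of values matching an authoritative source, compared by key."""
    if not values:
        return 0.0
    ok = sum(1 for k, v in values.items() if authoritative.get(k) == v)
    return ok / len(values)

def freshness_hours(last_update, now=None):
    """Hours since the last update of the relevant data field."""
    now = now or datetime.now(timezone.utc)
    return (now - last_update).total_seconds() / 3600

records = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": ""}]
print(completeness_rate(records, ["id", "email"]))      # -> 0.5
print(accuracy_rate({"a": 1, "b": 2}, {"a": 1, "b": 3}))  # -> 0.5
last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
print(freshness_hours(last, now))                       # -> 6.0
```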
Examples & implementations
Customer master data consolidation
Harmonizing IDs and addresses, enriching missing fields, introducing duplicate detection.
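Duplicate detection in master data typically starts by building a normalized comparison key from name and address fields. The normalization rules below (lowercasing, trimming, one address abbreviation) are a deliberately simplified assumption; production systems use far richer matching logic.

```python
# Simplified duplicate detection via normalized comparison keys.
# The normalization rules are illustrative assumptions.

def normalize(record):
    """Build a comparison key from normalized name and address fields."""
    return (
        record["name"].strip().lower(),
        record["address"].strip().lower().replace("street", "st"),
    )

def find_duplicates(records):
    """Return pairs of records whose normalized keys collide."""
    seen, dupes = {}, []
    for r in records:
        key = normalize(r)
        if key in seen:
            dupes.append((seen[key], r))
        else:
            seen[key] = r
    return dupes

customers = [
    {"name": "Ada Lovelace", "address": "1 Main Street"},
    {"name": " ada lovelace ", "address": "1 Main St"},
]
print(len(find_duplicates(customers)))  # -> 1 duplicate pair detected
```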
BI dashboard with quality gate
Dashboards are published only when core metrics like completeness and timeliness meet defined thresholds.
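Such a quality gate reduces to a boolean check of current metric values against defined thresholds. The metric names and threshold values here are assumptions for illustration.

```python
# Sketch of a publish gate for a BI dashboard; thresholds are assumed values.
THRESHOLDS = {"completeness": 0.95, "freshness_hours": 24}

def passes_quality_gate(metrics, thresholds=THRESHOLDS):
    """Allow publishing only if completeness and timeliness meet thresholds."""
    return (
        metrics["completeness"] >= thresholds["completeness"]
        and metrics["freshness_hours"] <= thresholds["freshness_hours"]
    )

print(passes_quality_gate({"completeness": 0.97, "freshness_hours": 3}))   # True
print(passes_quality_gate({"completeness": 0.90, "freshness_hours": 3}))   # False
```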
Data trust for ML models
Continuous monitoring pipelines check data drift, missing labels and inconsistencies before training and inference.
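A crude version of such a pre-training check compares a feature's current distribution against its training baseline and blocks on missing labels. The mean-shift indicator and the 0.2 drift limit below are simplifying assumptions; real pipelines use proper statistical drift tests.

```python
# Crude drift and label check before training; thresholds are assumptions.
import statistics

def drift_score(baseline, current):
    """Relative shift of the mean, used here as a simple drift indicator."""
    b, c = statistics.mean(baseline), statistics.mean(current)
    return abs(c - b) / (abs(b) or 1.0)

def gate_training(baseline, current, labels=None, max_drift=0.2):
    """Block training on strong feature drift or missing labels."""
    if labels is not None and any(label is None for label in labels):
        return False
    return drift_score(baseline, current) <= max_drift

print(gate_training([10, 10, 10], [10.5, 10.5, 10.5]))        # small shift -> True
print(gate_training([10, 10, 10], [20, 20, 20]))              # strong drift -> False
print(gate_training([10, 10, 10], [10, 10, 10], [1, None]))   # missing label -> False
```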
Implementation steps
1. Initial assessment and definition of core metrics
2. Introduce monitoring and validation pipelines
3. Operationalize data contracts and governance processes
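A data contract can be operationalized as a machine-readable specification that downstream validation checks enforce. The contract fields below (name, owner, SLA, field types) are an assumed minimal shape, not a standard.

```python
# Minimal sketch of a data contract and a validation check against it.
# Contract contents are illustrative assumptions.
CONTRACT = {
    "name": "customer_orders",
    "owner": "sales-data-team",       # assumed owning team, illustration only
    "sla_freshness_hours": 24,
    "fields": {"order_id": int, "amount": float, "customer_id": int},
}

def validate_against_contract(record, contract=CONTRACT):
    """Check that a record has all contracted fields with the right types."""
    for field, ftype in contract["fields"].items():
        if field not in record:
            return False, f"missing field: {field}"
        if not isinstance(record[field], ftype):
            return False, f"wrong type for field: {field}"
    return True, "ok"

print(validate_against_contract({"order_id": 1, "amount": 9.99, "customer_id": 7}))
print(validate_against_contract({"order_id": 1, "customer_id": 7}))
```

Keeping the contract in version control alongside the producing service makes ownership and SLA changes reviewable, which supports the governance process described above.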
⚠️ Technical debt & bottlenecks
Technical debt
- Ad-hoc remediation scripts without tests
- Missing data lineage for historical remediation
- Outdated validation rules after system changes
Known bottlenecks
Misuse examples
- Optimizing 'completeness' metric in isolation while critical fields are missing
- Automatically deleting suspicious records without review
- Governance rules preventing necessary fast remediations
Typical traps
- Relying on single metrics instead of holistic assessment
- Ignoring context and domain logic in validations
- Over-specification of rules that are hard to maintain
Required skills
Architectural drivers
Constraints
- Privacy and compliance requirements
- Limited resources for data maintenance
- Heterogeneous system landscape