Data Lake
A centralized, scalable repository for raw and heterogeneous data in native formats to support analytics and integrations.
Classification
- Complexity: High
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Compromises
Risks
- Inaccurate or missing metadata hinders reuse.
- Insufficient access controls lead to security and compliance breaches.
- Monolithic usage without boundaries blocks team autonomy.
Mitigations
- Define metadata and catalog strategies early (see the catalog-entry sketch after this list).
- Enforce fine-grained, role-based access control.
- Use storage tiering and automated archiving.
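To make the first mitigation concrete, below is a minimal sketch of a catalog entry registered at ingestion time; the field names (`owner`, `schema_version`, `contains_pii`, etc.) are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Minimal metadata record for one dataset in the lake (illustrative)."""
    dataset_name: str    # logical name, e.g. "orders_raw"
    storage_path: str    # physical location in the lake
    owner: str           # team accountable for quality and maintenance
    source_system: str   # where the data originates
    schema_version: str  # tracked so consumers can detect changes
    contains_pii: bool   # drives access-control and retention decisions
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example entry created as part of an ingestion run:
entry = CatalogEntry(
    dataset_name="orders_raw",
    storage_path="s3://example-lake/raw/orders/2024/06/01/",
    owner="commerce-data-team",
    source_system="order-service",
    schema_version="1.2",
    contains_pii=True,
)
```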
I/O & resources
Inputs
- Source data feeds (batch and stream)
- Metadata sources and data inventory
- Access and retention policies
Outputs
- Raw data archive
- Cleaned and formatted datasets
- Metadata catalog and provenance information
Description
A Data Lake is a centralized repository that stores large volumes of raw, heterogeneous data in native formats to support analytics, machine learning workflows, and operational integration. It relies on schema-on-read semantics and flexible ingestion pipelines, and it separates storage from compute. Proper governance, metadata cataloging, and lifecycle policies are essential to maintain data quality, discoverability, and controlled usage.
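As a minimal sketch of schema-on-read, the snippet below lands raw JSON lines untouched and applies a schema only when the data is read for analysis; file paths and field names are illustrative assumptions.

```python
import json
import pandas as pd

# Ingestion side: land events in the raw zone exactly as they arrive.
# No schema is enforced at write time; records may vary in shape.
raw_lines = [
    '{"user_id": 1, "event": "click", "ts": "2024-06-01T12:00:00Z"}',
    '{"user_id": 2, "event": "purchase", "ts": "2024-06-01T12:05:00Z", "amount": 49.9}',
]
with open("raw_events.jsonl", "w") as f:
    f.write("\n".join(raw_lines))

# Analysis side: the schema is applied when the data is read, not when it is written.
with open("raw_events.jsonl") as f:
    records = [json.loads(line) for line in f]
df = pd.DataFrame.from_records(records)
df["ts"] = pd.to_datetime(df["ts"])        # type decisions deferred to the reader
df["amount"] = df["amount"].fillna(0.0)    # missing fields tolerated at ingest time
print(df.dtypes)
```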
✔ Benefits
- High flexibility for ingesting heterogeneous data formats.
- Scalable storage of large volumes at comparatively low cost.
- Support for diverse analytics and ML use cases via access to raw data.
✖ Limitations
- Without governance, data sprawl and poor data quality can occur.
- Interactive query performance requires additional optimization (e.g., columnar formats, partitioning).
- Costs can grow uncontrollably without lifecycle management.
Trade-offs
Metrics
- Storage cost per TB
Cost to store one terabyte over a defined period.
- Time‑to‑value for datasets
Time from data arrival to usability in analytics or ML.
- Percentage of datasets with metadata
Share of datasets with complete metadata and provenance information (a computation sketch follows this list).
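A minimal sketch of the metadata-coverage metric, assuming catalog entries are simple dictionaries; the set of required fields is an illustrative policy, not a standard.

```python
REQUIRED_FIELDS = ("owner", "schema_version", "source_system")  # assumed policy

def metadata_coverage(entries: list[dict]) -> float:
    """Percentage of datasets whose catalog entries carry all required fields."""
    if not entries:
        return 0.0
    complete = sum(
        1 for e in entries
        if all(e.get(f) not in (None, "") for f in REQUIRED_FIELDS)
    )
    return 100.0 * complete / len(entries)

print(metadata_coverage([
    {"owner": "commerce-data-team", "schema_version": "1.2", "source_system": "order-service"},
    {"owner": "", "schema_version": "1.0", "source_system": "clickstream"},
]))  # -> 50.0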
Examples & implementations
Global e‑commerce data lake
A retail company consolidates clickstream, orders, and logs to enable personalized recommendations and analytics.
Financial services compliance archive
Banks use the data lake as an auditable long‑term archive that preserves audit trails for regulators.
IoT platform with time series data
A manufacturer stores telemetry from distributed devices for analysis and predictive maintenance.
Implementation steps
1. Inventory data sources and define objectives.
2. Make architectural decisions: storage layout, metadata catalog, access controls.
3. Implement ingestion pipelines and the metadata catalog (a pipeline sketch follows these steps).
4. Introduce lifecycle policies, monitoring, and cost controls.
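A minimal batch-ingestion sketch for step 3, using a local filesystem as a stand-in for object storage; the path layout, date partitioning scheme, and in-memory catalog are illustrative assumptions.

```python
import shutil
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("lake")  # stand-in for an object store bucket

def ingest_batch(source_file: str, dataset: str, catalog: list) -> Path:
    """Copy a source file into the raw zone, partitioned by ingestion date,
    and record a minimal catalog entry (illustrative, not a standard API)."""
    partition = date.today().strftime("%Y/%m/%d")
    target_dir = LAKE_ROOT / "raw" / dataset / partition
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # raw data is kept byte-for-byte
    catalog.append({"dataset": dataset, "path": str(target), "partition": partition})
    return target

catalog: list = []
# Usage: ingest_batch("orders_2024-06-01.jsonl", "orders", catalog)
```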
⚠️ Technical debt & bottlenecks
Technical debt
- Non‑standard file formats without conversion strategy.
- Missing or incomplete metadata catalogs.
- Ad hoc schemas and transformation logic in user scripts.
Known bottlenecks
Misuse examples
- Storing sensitive data without encryption or access control.
- Using it as a permanent dump for all raw data without classification.
- Directly querying large raw datasets without optimized formats (see the conversion sketch below).
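To illustrate the last misuse, a minimal sketch that converts raw JSON lines into a columnar format before interactive querying; it assumes `pandas` with a Parquet engine such as `pyarrow` installed, and the file names are illustrative.

```python
import pandas as pd

# Raw JSON lines are convenient to land but slow to scan repeatedly.
raw = pd.read_json("lake/raw/orders/2024/06/01/orders.jsonl", lines=True)

# One-time conversion to a columnar, compressed format in the curated zone;
# analytical queries should hit this file, not the raw zone.
raw.to_parquet("lake/curated/orders/2024-06-01.parquet", index=False)

# Downstream readers can now load only the columns they need.
amounts = pd.read_parquet(
    "lake/curated/orders/2024-06-01.parquet", columns=["order_id", "amount"]
)
```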
Typical traps
- Unclear responsibilities for data quality and maintenance.
- Missing automation for lifecycle management (a lifecycle-rule sketch follows this list).
- Overestimating cloud providers' out‑of‑the‑box capabilities.
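As a minimal sketch of automating lifecycle management, the snippet below applies an S3 lifecycle rule with `boto3`; the bucket name, prefix, and day thresholds are illustrative assumptions, and other object stores offer equivalent mechanisms.

```python
import boto3

s3 = boto3.client("s3")

# Transition raw objects to cold storage after 90 days and expire them after
# three years; the thresholds are placeholders to be set by retention policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```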
Required skills
Architectural drivers
Constraints
- Budget for storage and compute
- Existing data protection and compliance requirements
- Technical integration capability of existing systems