Tags: concept, Data, Architecture, Platform

Data Lake

A centralized, scalable repository for raw and heterogeneous data in native formats to support analytics and integrations.

Established
High

Classification

  • High
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Cloud object storage (e.g., Azure Blob, S3)
  • ETL/ELT tools (e.g., Spark, Databricks)
  • Metadata catalogs (e.g., Apache Atlas, Glue)
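
As a rough illustration of how these components fit together, the following is a minimal PySpark sketch that reads raw JSON events from object storage and writes a columnar copy to a curated zone; the bucket, prefixes and zone names are hypothetical.

    # Minimal ingestion sketch (PySpark): land raw events and persist a columnar
    # copy for analytics. Bucket, prefix and zone names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

    # Read heterogeneous raw events as they arrived; the schema is inferred at read time.
    raw_events = spark.read.json("s3a://example-lake/raw/clickstream/2024/06/")

    # Leave the raw zone untouched and write a cleaned, columnar copy to a curated zone.
    raw_events.dropDuplicates().write.mode("append").parquet(
        "s3a://example-lake/curated/clickstream/"
    )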

Principles & goals

  • Separate storage and compute to enable independent scaling.
  • Operate a metadata catalog to support discoverability and governance.
  • Define clear access and lifecycle policies to control compliance and costs (see the lifecycle sketch below).
Build
Enterprise, Domain
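
To make the lifecycle principle concrete, here is a minimal sketch that applies a tiering and expiration rule with boto3 against Amazon S3; it assumes AWS credentials are configured, and the bucket name, prefix and retention periods are hypothetical.

    # Lifecycle sketch (boto3/S3): move cold raw data to archive storage and
    # expire it after a retention period. Bucket, prefix and periods are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-raw-zone",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    # Tier rarely used raw data down after 90 days ...
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    # ... and delete it after roughly five years.
                    "Expiration": {"Days": 1825},
                }
            ]
        },
    )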

Risks & mitigations

Risks:

  • Inaccurate or missing metadata hinders reuse.
  • Insufficient access controls lead to security and compliance breaches.
  • Monolithic usage without boundaries blocks team autonomy.

Mitigations:

  • Define metadata and catalog strategies early.
  • Control data access finely via roles (see the access sketch after this list).
  • Use storage tiering and automated archiving.
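
As one way to apply role-based access, the sketch below attaches a bucket policy that grants a single analyst role read access to a sensitive prefix; the account ID, role and bucket names are hypothetical, and comparable controls exist on other platforms.

    # Access-control sketch (boto3/S3): allow a dedicated analyst role to read
    # objects under the sensitive/ prefix. Account, role and bucket names are hypothetical.
    import json

    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AnalystReadSensitiveZone",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::123456789012:role/lake-analyst"},
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::example-lake/sensitive/*",
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(Bucket="example-lake", Policy=json.dumps(policy))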

I/O & resources

Inputs:

  • Source data feeds (batch and stream)
  • Metadata sources and data inventory
  • Access and retention policies

Outputs:

  • Raw data archive
  • Cleaned and formatted datasets
  • Metadata catalog and provenance information

Description

A Data Lake is a centralized repository that stores large volumes of raw, heterogeneous data in native formats to support analytics, machine learning workflows, and operational integration. It relies on schema-on-read, flexible ingestion pipelines, and separates storage from compute. Proper governance, metadata cataloging, and lifecycle policies are essential to maintain data quality, discoverability, and controlled usage.
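
To illustrate schema-on-read, the following minimal PySpark sketch declares a schema only when the data is consumed, leaving the stored files untouched; the column names and path are hypothetical.

    # Schema-on-read sketch (PySpark): the consumer declares the structure at query
    # time instead of enforcing it on write. Columns and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("created_at", TimestampType()),
    ])

    # Apply the schema while reading the raw JSON files; other consumers may read
    # the same files with a different projection.
    orders = spark.read.schema(order_schema).json("s3a://example-lake/raw/orders/")
    orders.groupBy("customer_id").sum("amount").show()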

Advantages:

  • High flexibility for ingesting heterogeneous data formats.
  • Scalable storage of large volumes at comparatively low cost.
  • Support for diverse analytics and ML use cases via access to raw data.

Drawbacks:

  • Without governance, data sprawl and poor data quality can occur.
  • Interactive query performance requires additional optimization.
  • Costs can grow uncontrollably without lifecycle management.

Key metrics:

  • Storage cost per TB

    Monetary cost to store one terabyte over defined periods.

  • Time‑to‑value for datasets

    Time from data arrival to usability in analytics or ML.

  • Percentage of datasets with metadata

    Share of datasets that include complete metadata and provenance info.
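
The metadata-coverage metric can be computed directly from a catalog inventory; below is a small sketch over a hypothetical, hard-coded inventory (in practice the entries would come from the catalog's API).

    # Metric sketch: share of datasets with complete metadata and provenance.
    # The inventory is hypothetical; real entries would come from the catalog API.
    datasets = [
        {"name": "clickstream", "owner": "web-analytics", "schema": True, "lineage": True},
        {"name": "orders", "owner": "sales", "schema": True, "lineage": False},
        {"name": "device_telemetry", "owner": None, "schema": False, "lineage": False},
    ]

    def has_complete_metadata(entry):
        return bool(entry["owner"]) and entry["schema"] and entry["lineage"]

    complete = sum(1 for d in datasets if has_complete_metadata(d))
    coverage = 100 * complete / len(datasets)
    print(f"Datasets with complete metadata: {coverage:.0f}%")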

Use cases & scenarios

Global e‑commerce data lake

A retail company consolidates clickstream, orders and logs to enable personalized recommendations and analytics.

Financial services compliance archive

Banks use the data lake for auditable long‑term archiving and preservation of audit trails.

IoT platform with time series data

A manufacturer stores telemetry from distributed devices for analysis and predictive maintenance.

Implementation steps

  1. Inventory data sources and define objectives.
  2. Make architectural decisions: storage, metadata, access controls.
  3. Implement ingestion pipelines and a metadata catalog (see the catalog sketch after these steps).
  4. Introduce lifecycle policies, monitoring and cost control.
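
For step 3, one possible way to register a curated dataset in a metadata catalog is sketched below using the AWS Glue Data Catalog via boto3; the database, table, columns and location are hypothetical, and tools such as Apache Atlas offer comparable APIs.

    # Catalog sketch (boto3/Glue): register a curated Parquet dataset so it becomes
    # discoverable. Database, table, columns and location are hypothetical.
    import boto3

    glue = boto3.client("glue")

    glue.create_table(
        DatabaseName="lake_curated",
        TableInput={
            "Name": "clickstream",
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet", "owner": "web-analytics"},
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "session_id", "Type": "string"},
                    {"Name": "url", "Type": "string"},
                    {"Name": "event_time", "Type": "timestamp"},
                ],
                "Location": "s3://example-lake/curated/clickstream/",
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )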

⚠️ Technical debt & bottlenecks

  • Non‑standard file formats without conversion strategy.
  • Missing or incomplete metadata catalogs.
  • Ad hoc schemas and transformation logic in user scripts.

Typical bottlenecks:

  • Ingest throughput
  • Metadata management
  • Query performance

Anti‑patterns:

  • Storing sensitive data without encryption or access control (see the encryption sketch after this list).
  • Using it as a permanent dump for all raw data without classification.
  • Directly querying large raw datasets without optimized formats.
  • Unclear responsibilities for data quality and maintenance.
  • Missing automation for lifecycle management.
  • Overestimating cloud providers' out‑of‑the‑box capabilities.
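
To counter the first anti‑pattern, default server‑side encryption can be enforced on the storage layer; the sketch below uses boto3 against S3 with a hypothetical bucket and KMS key alias.

    # Encryption sketch (boto3/S3): enforce default server-side encryption with a
    # customer-managed KMS key. Bucket and key identifiers are hypothetical.
    import boto3

    boto3.client("s3").put_bucket_encryption(
        Bucket="example-lake",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "alias/example-lake-key",
                    },
                    "BucketKeyEnabled": True,
                }
            ]
        },
    )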

Related knowledge areas:

  • Data architecture and storage models
  • Data pipelines and streaming
  • Data governance and security

Key challenges:

  • Storage scalability
  • Integration of heterogeneous data sources
  • Governance, security and compliance

Considerations:

  • Budget for storage and compute
  • Existing data protection and compliance requirements
  • Technical integration capability of existing systems