Tags: concept, Data, Architecture, Platform

Data Lake

A centralized, scalable repository for raw and heterogeneous data in native formats to support analytics and integrations.

Established
High

Classification

  • High
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Cloud object storage (e.g., Azure Blob, S3)
  • ETL/ELT tools (e.g., Spark, Databricks)
  • Metadata catalogs (e.g., Apache Atlas, Glue)
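
As a rough illustration of how these components fit together, the following is a minimal PySpark sketch that reads raw JSON events from object storage and writes a columnar copy to a curated zone; the bucket, prefixes and zone names are hypothetical.

    # Minimal ingestion sketch (PySpark): land raw events and persist a columnar
    # copy for analytics. Bucket, prefix and zone names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

    # Read heterogeneous raw events as they arrived; the schema is inferred at read time.
    raw_events = spark.read.json("s3a://example-lake/raw/clickstream/2024/06/")

    # Leave the raw zone untouched and write a cleaned, columnar copy to a curated zone.
    raw_events.dropDuplicates().write.mode("append").parquet(
        "s3a://example-lake/curated/clickstream/"
    )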

Principles & goals

  • Separate storage and compute to enable independent scaling.
  • Operate a metadata catalog to support discoverability and governance.
  • Define clear access and lifecycle policies to control compliance and costs (see the lifecycle sketch below).
Build
Enterprise, Domain
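
To make the lifecycle principle concrete, here is a minimal sketch that applies a tiering and expiration rule with boto3 against Amazon S3; it assumes AWS credentials are configured, and the bucket name, prefix and retention periods are hypothetical.

    # Lifecycle sketch (boto3/S3): move cold raw data to archive storage and
    # expire it after a retention period. Bucket, prefix and periods are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-raw-zone",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    # Tier rarely used raw data down after 90 days ...
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    # ... and delete it after roughly five years.
                    "Expiration": {"Days": 1825},
                }
            ]
        },
    )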

Risks & mitigations

Risks:

  • Inaccurate or missing metadata hinders reuse.
  • Insufficient access controls lead to security and compliance breaches.
  • Monolithic usage without boundaries blocks team autonomy.

Mitigations:

  • Define metadata and catalog strategies early.
  • Control data access finely via roles (see the access sketch after this list).
  • Use storage tiering and automated archiving.
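
As one way to apply role-based access, the sketch below attaches a bucket policy that grants a single analyst role read access to a sensitive prefix; the account ID, role and bucket names are hypothetical, and comparable controls exist on other platforms.

    # Access-control sketch (boto3/S3): allow a dedicated analyst role to read
    # objects under the sensitive/ prefix. Account, role and bucket names are hypothetical.
    import json

    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AnalystReadSensitiveZone",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::123456789012:role/lake-analyst"},
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::example-lake/sensitive/*",
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(Bucket="example-lake", Policy=json.dumps(policy))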

I/O & resources

Inputs:

  • Source data feeds (batch and stream)
  • Metadata sources and data inventory
  • Access and retention policies

Outputs:

  • Raw data archive
  • Cleaned and formatted datasets
  • Metadata catalog and provenance information

Description

A Data Lake is a centralized repository that stores large volumes of raw, heterogeneous data in native formats to support analytics, machine learning workflows, and operational integration. It relies on schema-on-read, flexible ingestion pipelines, and separates storage from compute. Proper governance, metadata cataloging, and lifecycle policies are essential to maintain data quality, discoverability, and controlled usage.
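
To illustrate schema-on-read, the following minimal PySpark sketch declares a schema only when the data is consumed, leaving the stored files untouched; the column names and path are hypothetical.

    # Schema-on-read sketch (PySpark): the consumer declares the structure at query
    # time instead of enforcing it on write. Columns and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("created_at", TimestampType()),
    ])

    # Apply the schema while reading the raw JSON files; other consumers may read
    # the same files with a different projection.
    orders = spark.read.schema(order_schema).json("s3a://example-lake/raw/orders/")
    orders.groupBy("customer_id").sum("amount").show()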

Advantages:

  • High flexibility for ingesting heterogeneous data formats.
  • Scalable storage of large volumes at comparatively low cost.
  • Support for diverse analytics and ML use cases via access to raw data.

Drawbacks:

  • Without governance, data sprawl and poor data quality can occur.
  • Interactive query performance requires additional optimization.
  • Costs can grow uncontrollably without lifecycle management.

Key metrics:

  • Storage cost per TB

    Monetary cost to store one terabyte over defined periods.

  • Time‑to‑value for datasets

    Time from data arrival to usability in analytics or ML.

  • Percentage of datasets with metadata

    Share of datasets that include complete metadata and provenance info.
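
The metadata-coverage metric can be computed directly from a catalog inventory; below is a small sketch over a hypothetical, hard-coded inventory (in practice the entries would come from the catalog's API).

    # Metric sketch: share of datasets with complete metadata and provenance.
    # The inventory is hypothetical; real entries would come from the catalog API.
    datasets = [
        {"name": "clickstream", "owner": "web-analytics", "schema": True, "lineage": True},
        {"name": "orders", "owner": "sales", "schema": True, "lineage": False},
        {"name": "device_telemetry", "owner": None, "schema": False, "lineage": False},
    ]

    def has_complete_metadata(entry):
        return bool(entry["owner"]) and entry["schema"] and entry["lineage"]

    complete = sum(1 for d in datasets if has_complete_metadata(d))
    coverage = 100 * complete / len(datasets)
    print(f"Datasets with complete metadata: {coverage:.0f}%")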

Use cases & scenarios

Global e‑commerce data lake

A retail company consolidates clickstream, orders and logs to enable personalized recommendations and analytics.

Financial services compliance archive

Banks use the data lake for auditable long‑term archiving and preservation of audit trails.

IoT platform with time series data

A manufacturer stores telemetry from distributed devices for analysis and predictive maintenance.

Implementation steps

  1. Inventory data sources and define objectives.
  2. Make architectural decisions: storage, metadata, access controls.
  3. Implement ingestion pipelines and a metadata catalog (see the catalog sketch after these steps).
  4. Introduce lifecycle policies, monitoring and cost control.
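
For step 3, one possible way to register a curated dataset in a metadata catalog is sketched below using the AWS Glue Data Catalog via boto3; the database, table, columns and location are hypothetical, and tools such as Apache Atlas offer comparable APIs.

    # Catalog sketch (boto3/Glue): register a curated Parquet dataset so it becomes
    # discoverable. Database, table, columns and location are hypothetical.
    import boto3

    glue = boto3.client("glue")

    glue.create_table(
        DatabaseName="lake_curated",
        TableInput={
            "Name": "clickstream",
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet", "owner": "web-analytics"},
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "session_id", "Type": "string"},
                    {"Name": "url", "Type": "string"},
                    {"Name": "event_time", "Type": "timestamp"},
                ],
                "Location": "s3://example-lake/curated/clickstream/",
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )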

⚠️ Technical debt & bottlenecks

  • Non‑standard file formats without conversion strategy.
  • Missing or incomplete metadata catalogs.
  • Ad hoc schemas and transformation logic in user scripts.

Typical bottlenecks:

  • Ingest throughput
  • Metadata management
  • Query performance

Anti‑patterns:

  • Storing sensitive data without encryption or access control (see the encryption sketch after this list).
  • Using it as a permanent dump for all raw data without classification.
  • Directly querying large raw datasets without optimized formats.
  • Unclear responsibilities for data quality and maintenance.
  • Missing automation for lifecycle management.
  • Overestimating cloud providers' out‑of‑the‑box capabilities.
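
To counter the first anti‑pattern, default server‑side encryption can be enforced on the storage layer; the sketch below uses boto3 against S3 with a hypothetical bucket and KMS key alias.

    # Encryption sketch (boto3/S3): enforce default server-side encryption with a
    # customer-managed KMS key. Bucket and key identifiers are hypothetical.
    import boto3

    boto3.client("s3").put_bucket_encryption(
        Bucket="example-lake",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "alias/example-lake-key",
                    },
                    "BucketKeyEnabled": True,
                }
            ]
        },
    )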

Related knowledge areas:

  • Data architecture and storage models
  • Data pipelines and streaming
  • Data governance and security

Key challenges:

  • Storage scalability
  • Integration of heterogeneous data sources
  • Governance, security and compliance

Considerations:

  • Budget for storage and compute
  • Existing data protection and compliance requirements
  • Technical integration capability of existing systems