Catalog
Concept · Data · Architecture · Analytics · Platform

Lambda Architecture

An architectural pattern combining batch and real-time processing for scalable data platforms.

Lambda Architecture is a structural architectural principle for large-scale data processing that separates batch and speed pipelines.
Established
High

Classification

  • High
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Apache Kafka for ingest and streaming
  • Apache Spark / Hadoop for batch processing
  • NoSQL store (e.g. Cassandra) as serving store

Principles & goals

  • Separate batch and real-time processing to combine accuracy and latency.
  • Serving layer as the unified read layer integrating results from both pipelines.
  • Idempotent processing and unique timestamps as the basis for reconciliation.
Build
Domain, Team
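The idempotency principle above can be sketched in a few lines: events carry a unique identifier and timestamp, and the consumer applies each event at most once, so redelivery by the broker never double-counts. This is an illustrative sketch; the `Event` and `IdempotentStore` names are assumptions, not part of any specific framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # globally unique, assigned at ingest (assumption)
    timestamp: int  # unique timestamp, basis for later reconciliation
    amount: float

class IdempotentStore:
    """Applies each event at most once, keyed by its unique event_id."""

    def __init__(self):
        self.seen: set[str] = set()
        self.total = 0.0

    def apply(self, event: Event) -> bool:
        if event.event_id in self.seen:  # duplicate delivery: no effect
            return False
        self.seen.add(event.event_id)
        self.total += event.amount
        return True

store = IdempotentStore()
e = Event("evt-1", 1700000000, 9.99)
store.apply(e)
store.apply(e)  # redelivered by the broker: ignored, total unchanged
```

The same dedup-by-key idea applies regardless of the concrete store; in practice the seen-set would live in the serving store itself (e.g. as an upsert key) rather than in process memory.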

Use cases & scenarios

Compromises

Risks:

  • Divergent results between speed and batch views can erode trust.
  • High operational costs due to parallel infrastructure for batch and real-time.
  • Lack of automated reconciliation leads to manual error corrections.

Mitigations:

  • Use idempotent events and unique timestamps.
  • Implement automated reconciliation processes.
  • Establish monitoring for latency, throughput and divergence.
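The reconciliation mitigation above can be sketched as a periodic job that compares the two views per key, treats the batch result as authoritative, and reports the diverging keys for alerting. The `reconcile` function and its tolerance parameter are illustrative assumptions, not a reference implementation.

```python
def reconcile(speed_view: dict, batch_view: dict, tolerance: float = 0.0):
    """Compare speed and batch views per key; batch results are authoritative.

    Returns the corrected view and the set of diverging keys, which
    feeds the divergence-rate metric and alerting.
    """
    diverging = set()
    corrected = dict(speed_view)
    for key, batch_value in batch_view.items():
        speed_value = speed_view.get(key)
        if speed_value is None or abs(speed_value - batch_value) > tolerance:
            diverging.add(key)
        corrected[key] = batch_value  # batch layer overwrites the speed estimate
    return corrected, diverging

speed = {"2024-05-01": 1040.0, "2024-05-02": 998.0}
batch = {"2024-05-01": 1050.0, "2024-05-02": 998.0}
corrected, diverging = reconcile(speed, batch)
```

Running such a job automatically after every batch cycle is what replaces the manual error corrections named in the risks above.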

I/O & resources

Inputs:

  • Real-time event streams
  • Raw data for batch processing
  • Metadata, timestamps and schema definitions

Outputs:

  • Low-latency metrics and alerts
  • Corrected, final aggregations
  • Serving index for read APIs

Description

Lambda Architecture is a structural architectural principle for large-scale data processing that separates batch and speed pipelines. It defines batch, speed and serving layers to combine accuracy with low latency. Typical decisions involve trade-offs around consistency, complexity and operational cost in data integration.

Strengths:

  • Combines low latency and high accuracy via specialized pipelines.
  • Clear separation of responsibilities eases scaling of individual layers.
  • Batch layer allows full corrections and re-processing on errors.

Weaknesses:

  • Considerable implementation and operational overhead due to duplicated logic.
  • Complex consistency and error-handling models between layers.
  • Increasing maintenance effort when data models change.

  • End-to-end latency

    Time from event arrival to availability in the serving layer.

  • Batch processing duration

    Duration of complete batch jobs to produce corrected results.

  • Data divergence rate

    Share of inconsistent values between speed and batch views.
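The metrics above are cheap to compute from instrumented event records. A minimal sketch, assuming each served event records its arrival and serving timestamps (field names are illustrative):

```python
import statistics

def end_to_end_latencies(events):
    """Per-event latency from arrival to availability in the serving layer."""
    return [e["served_at"] - e["arrived_at"] for e in events]

def divergence_rate(speed_view, batch_view):
    """Share of keys whose speed value disagrees with the batch value."""
    keys = batch_view.keys()
    diverging = sum(1 for k in keys if speed_view.get(k) != batch_view[k])
    return diverging / len(keys) if keys else 0.0

events = [
    {"arrived_at": 0, "served_at": 120},   # timestamps in ms (assumption)
    {"arrived_at": 10, "served_at": 95},
]
p50_latency = statistics.median(end_to_end_latencies(events))
rate = divergence_rate({"a": 1, "b": 2}, {"a": 1, "b": 3})
```

Tracking percentiles rather than averages matters here, since tail latency is what users of real-time dashboards actually notice.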

Real-time analytics platform with Spark and Kafka

Combination of Apache Spark (batch), Spark Streaming (speed) and Kafka as ingest for low latency.

Log analysis with separate serving index

Batch computation yields corrected aggregations, speed layer supplies dashboards, serving layer indexes results.

Hybrid reporting in an e-commerce system

Real-time conversion metrics combined with daily computed revenue figures in the serving layer.
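The hybrid-reporting scenario above reduces to a simple merge rule in the serving layer: corrected batch figures win wherever they exist, and real-time estimates fill the not-yet-batched remainder. A minimal sketch with hypothetical revenue data:

```python
def serving_view(batch_view: dict, speed_view: dict) -> dict:
    """Unified read view: batch results override speed estimates per key."""
    return {**speed_view, **batch_view}  # later dict wins on key collisions

# Daily batch job has finalized revenue up to yesterday; the speed layer
# holds a running estimate for recent days (figures are illustrative).
batch_revenue = {"2024-05-01": 18230.50, "2024-05-02": 17990.00}
speed_revenue = {"2024-05-02": 17800.00, "2024-05-03": 4120.75}
view = serving_view(batch_revenue, speed_revenue)
```

In a real system the merge happens at query time against two stores (or a pre-merged index), but the precedence rule is the same.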

1. Analyze requirements for latency, accuracy and volume.

2. Define data flows, storage locations and interfaces.

3. Develop a minimal speed layer for critical dashboards.

4. Implement the batch layer with re-processing capability.

5. Build a serving layer and establish validation processes.
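The "minimal speed layer" in step 3 can be as small as a tumbling-window aggregation over the incoming stream. A dependency-free sketch (a production speed layer would use a streaming framework such as Spark Streaming; window size and timestamps here are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(event_timestamps, window_ms: int) -> dict:
    """Minimal speed layer: approximate per-window event counts for
    dashboards; the batch layer later recomputes exact figures."""
    counts = defaultdict(int)
    for ts in event_timestamps:          # ts = event timestamp in ms
        window_start = ts - ts % window_ms
        counts[window_start] += 1
    return dict(counts)

counts = tumbling_window_counts([100, 950, 1001, 1999, 2000], window_ms=1000)
```

Keeping the speed layer this simple is deliberate: its results are provisional by design, so complexity (and exactness) belongs in the batch layer of step 4.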

⚠️ Technical debt & bottlenecks

Technical debt:

  • Ad-hoc reconciliations instead of proper re-processing pipelines.
  • Outdated batch jobs that cannot handle new schemas.
  • Insufficient test coverage for divergence cases.

Bottlenecks:

  • Batch latency
  • Serving index performance
  • Data quality and schema evolution

Anti-patterns:

  • Implementing only the speed layer without batch corrections.
  • Fully duplicating complex logic in both pipelines.
  • Neglecting reconciliation tests before production.
  • Underestimating the operational effort for parallel pipelines.
  • Missing unique timestamps complicate corrections.
  • Serving layer becomes a bottleneck due to insufficient indexing.
  • Data engineering and distributed systems
  • Streaming frameworks and batch processing concepts
  • Operations and observability of large data pipelines

  • Expected event volume and scaling requirements
  • Requirements for latency and accuracy of results
  • Operational capacity for monitoring and re-processing

  • Existing infrastructure for batch and stream processing
  • Limited resources for parallel operation of multiple layers
  • Regulatory requirements for data retention and correction