Catalog
concept#Data#Analytics#Architecture#Platform

Big Data

Big Data denotes practices and technologies for storing, processing and analysing very large, heterogeneous datasets to derive actionable insights.

Established
High

Classification

  • High
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Apache Kafka for streaming
  • Apache Spark for batch and stream processing
  • Data warehouse and BI tools
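
A minimal sketch of how these components typically fit together, assuming PySpark with the spark-sql-kafka connector package available; the broker address, topic name, and field names are illustrative, not part of this entry:

    # Illustrative only: consumes an assumed "clickstream" Kafka topic and
    # aggregates page views per minute with Spark Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("clickstream-aggregation").getOrCreate()

    # Read the topic as an unbounded streaming DataFrame.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
        .option("subscribe", "clickstream")                # assumed topic
        .load()
    )

    # Kafka delivers the payload as bytes; extract a field from the JSON value.
    parsed = events.select(
        F.get_json_object(F.col("value").cast("string"), "$.page").alias("page"),
        F.col("timestamp"),
    )

    # Count page views per one-minute window, tolerating 5 minutes of lateness.
    counts = (
        parsed
        .withWatermark("timestamp", "5 minutes")
        .groupBy(F.window("timestamp", "1 minute"), "page")
        .count()
    )

    # Write rolling aggregates to the console; a real pipeline would target
    # a warehouse or lakehouse table queried by the BI tools listed above.
    counts.writeStream.outputMode("update").format("console").start().awaitTermination()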

Principles & goals

  • Design for horizontal scalability
  • Schema-on-read for flexible integration
  • Embed data governance and privacy by design
Build
Enterprise, Domain, Team
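
As one illustration of the schema-on-read principle above, the following sketch stores raw JSON unchanged and applies a schema only when the data is read; paths and field names are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # The schema lives with the consuming query, not with the stored files,
    # so new fields in the raw data never break ingestion.
    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("created_at", TimestampType()),
    ])

    orders = spark.read.schema(order_schema).json("s3a://raw-zone/orders/")  # assumed path

    # Privacy by design: drop direct identifiers before publishing the
    # dataset outside the restricted zone.
    orders.drop("customer_id").write.mode("overwrite").parquet(
        "s3a://curated-zone/orders/"  # assumed path
    )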

Compromises

  • Privacy breaches and legal consequences
  • Misinterpreting correlations as causation
  • Loss of data quality with poor preparation

  • Establish a metadata catalog early
  • Automate monitoring and cost tracking
  • Measure and continuously improve data quality (see the sketch after this list)
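
A minimal sketch, assuming an orders dataset with order_id and customer_id columns, of the kind of automated data-quality gate these recommendations point to:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-gate").getOrCreate()
    orders = spark.read.parquet("s3a://raw-zone/orders/")  # assumed path

    # Measure simple quality indicators on every load.
    total = orders.count()
    null_customers = orders.filter(F.col("customer_id").isNull()).count()
    duplicates = total - orders.dropDuplicates(["order_id"]).count()

    metrics = {
        "null_customer_rate": null_customers / total if total else 0.0,
        "duplicate_rate": duplicates / total if total else 0.0,
    }

    # Thresholds are illustrative; in practice they come from data contracts.
    if metrics["null_customer_rate"] > 0.01 or metrics["duplicate_rate"] > 0.001:
        raise ValueError(f"Data quality gate failed: {metrics}")

The same metrics can also feed the monitoring and cost-tracking dashboards mentioned above.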

I/O & resources

  • Raw data from production sources
  • Infrastructure for storage and processing
  • Metadata and data catalogs
  • Analytical datasets and reports
  • APIs and data services for applications
  • Model datasets for machine learning workflows

Description

Big Data refers to practices, technologies, and organizational approaches for processing very large, heterogeneous, and rapidly growing datasets. It covers storage, processing, integration and analysis to extract actionable insights. Emphasis is on scalability, data quality, governance, privacy, infrastructure requirements and operational cost.

  • Enables deep insights from large heterogeneous datasets
  • Supports data-driven decisions and automation
  • Scalable analytics for historical and real-time data

  • High infrastructure and operational costs
  • Complexity integrating heterogeneous sources
  • Requires specialised skills and processes

  • Throughput (events/s or GB/s)

    Measures the volume of data processed per time unit.

  • Latency (ms or s)

    Time between arrival of a data item and its processing completion.

  • Cost per terabyte

    Total cost for storage and processing per data volume.
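
The following helpers show, with hypothetical names and purely illustrative figures, how these three metrics are typically computed:

    def throughput_events_per_s(event_count: int, window_seconds: float) -> float:
        """Events processed per second over a measurement window."""
        return event_count / window_seconds

    def latency_ms(arrival_ts: float, completion_ts: float) -> float:
        """Time from arrival of an item to completion of its processing, in ms."""
        return (completion_ts - arrival_ts) * 1000.0

    def cost_per_terabyte(total_cost: float, volume_tb: float) -> float:
        """Total storage and processing cost divided by the data volume."""
        return total_cost / volume_tb

    # Illustrative figures: 12 million events in a 10-minute window,
    # and 18,000 in monthly cost for 450 TB of stored and processed data.
    assert throughput_events_per_s(12_000_000, 600) == 20_000.0
    assert cost_per_terabyte(18_000, 450) == 40.0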

Use cases & scenarios

Log analysis in e-commerce

Processing server and clickstream logs to detect fraud and optimise conversion rates.

Sensor data in manufacturing

Streaming analysis of machine metrics for predictive maintenance and reduced downtime.

Customer segmentation using historical transaction data

Batch analyses of large transaction datasets to identify customer patterns and personalised offers.
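
A sketch of such a batch segmentation, assuming a transactions table with customer_id, amount, and ts columns; paths and the choice of k are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()
    tx = spark.read.parquet("s3a://curated-zone/transactions/")  # assumed path

    # Recency/frequency/monetary features per customer.
    features = tx.groupBy("customer_id").agg(
        F.datediff(F.current_date(), F.max("ts")).alias("recency_days"),
        F.count("*").alias("frequency"),
        F.sum("amount").alias("monetary"),
    )

    assembled = VectorAssembler(
        inputCols=["recency_days", "frequency", "monetary"], outputCol="features"
    ).transform(features)

    # Cluster customers into five segments for targeted offers.
    model = KMeans(k=5, seed=42).fit(assembled)
    model.transform(assembled).select("customer_id", "prediction").write.mode(
        "overwrite"
    ).parquet("s3a://curated-zone/customer_segments/")  # assumed output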

  1. Define goals and metrics
  2. Inventory data sources and set priorities (see the sketch after this list)
  3. Make infrastructure and architecture decisions and run proofs of concept
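
For step 2, a lightweight source inventory can start as a simple structure like the hypothetical sketch below and later seed the metadata catalog:

    from dataclasses import dataclass

    @dataclass
    class DataSource:
        name: str
        owner: str          # owning system or team
        kind: str           # "batch" or "stream"
        contains_pii: bool  # drives governance and access decisions
        priority: int       # 1 = first onboarding wave

    # All entries are hypothetical examples.
    inventory = [
        DataSource("clickstream", "web shop", "stream", contains_pii=True, priority=1),
        DataSource("machine_metrics", "plant MES", "stream", contains_pii=False, priority=2),
        DataSource("erp_exports", "ERP team", "batch", contains_pii=False, priority=3),
    ]

    # Onboard high-priority sources first; PII sources get a governance review.
    for source in sorted(inventory, key=lambda s: s.priority):
        print(source.priority, source.name, "PII review" if source.contains_pii else "")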

⚠️ Technical debt & bottlenecks

  • Undocumented, unstructured schemas
  • Ad-hoc scripts instead of maintainable pipelines
  • Missing metadata system and data lineage

  • Storage cost
  • Network bandwidth
  • Data quality

  • Unvetted release of personal data to analysts
  • Optimising only for throughput without quality controls
  • Migrating large datasets without integrity testing
  • Overestimating data quality in legacy systems
  • Ignoring statutory retention requirements
  • Neglecting observability in data pipelines

  • Data engineering and distributed systems
  • Data modelling and ETL/ELT
  • Data governance and privacy expertise

  • Data volume and growth rate
  • Data variety and integration needs
  • Latency requirements and throughput

  • Regulatory requirements (e.g. GDPR)
  • Budget and operational effort
  • Legacy systems and incompatible formats