Catalog
concept#Data#Analytics#Architecture#Platform

Big Data

Big Data denotes practices and technologies for storing, processing and analysing very large, heterogeneous datasets to derive actionable insights.

Established
High

Classification

  • High
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Apache Kafka for streaming
  • Apache Spark for batch and stream processing
  • Data warehouse and BI tools
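
A minimal sketch of how these components typically fit together, assuming PySpark with the spark-sql-kafka connector package available; the broker address, topic name, and field names are illustrative, not part of this entry:

    # Illustrative only: consumes an assumed "clickstream" Kafka topic and
    # aggregates page views per minute with Spark Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("clickstream-aggregation").getOrCreate()

    # Read the topic as an unbounded streaming DataFrame.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
        .option("subscribe", "clickstream")                # assumed topic
        .load()
    )

    # Kafka delivers the payload as bytes; extract a field from the JSON value.
    parsed = events.select(
        F.get_json_object(F.col("value").cast("string"), "$.page").alias("page"),
        F.col("timestamp"),
    )

    # Count page views per one-minute window, tolerating 5 minutes of lateness.
    counts = (
        parsed
        .withWatermark("timestamp", "5 minutes")
        .groupBy(F.window("timestamp", "1 minute"), "page")
        .count()
    )

    # Write rolling aggregates to the console; a real pipeline would target
    # a warehouse or lakehouse table queried by the BI tools listed above.
    counts.writeStream.outputMode("update").format("console").start().awaitTermination()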

Principles & goals

  • Design for horizontal scalability
  • Schema-on-read for flexible integration
  • Embed data governance and privacy by design
Build
Enterprise, Domain, Team
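
As one illustration of the schema-on-read principle above, the following sketch stores raw JSON unchanged and applies a schema only when the data is read; paths and field names are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # The schema lives with the consuming query, not with the stored files,
    # so new fields in the raw data never break ingestion.
    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("created_at", TimestampType()),
    ])

    orders = spark.read.schema(order_schema).json("s3a://raw-zone/orders/")  # assumed path

    # Privacy by design: drop direct identifiers before publishing the
    # dataset outside the restricted zone.
    orders.drop("customer_id").write.mode("overwrite").parquet(
        "s3a://curated-zone/orders/"  # assumed path
    )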

Compromises

  • Privacy breaches and legal consequences
  • Misinterpreting correlations as causation
  • Loss of data quality with poor preparation

  • Establish a metadata catalog early
  • Automate monitoring and cost tracking
  • Measure and continuously improve data quality (see the sketch after this list)
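
A minimal sketch, assuming an orders dataset with order_id and customer_id columns, of the kind of automated data-quality gate these recommendations point to:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-gate").getOrCreate()
    orders = spark.read.parquet("s3a://raw-zone/orders/")  # assumed path

    # Measure simple quality indicators on every load.
    total = orders.count()
    null_customers = orders.filter(F.col("customer_id").isNull()).count()
    duplicates = total - orders.dropDuplicates(["order_id"]).count()

    metrics = {
        "null_customer_rate": null_customers / total if total else 0.0,
        "duplicate_rate": duplicates / total if total else 0.0,
    }

    # Thresholds are illustrative; in practice they come from data contracts.
    if metrics["null_customer_rate"] > 0.01 or metrics["duplicate_rate"] > 0.001:
        raise ValueError(f"Data quality gate failed: {metrics}")

The same metrics can also feed the monitoring and cost-tracking dashboards mentioned above.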

I/O & resources

  • Raw data from production sources
  • Infrastructure for storage and processing
  • Metadata and data catalogs
  • Analytical datasets and reports
  • APIs and data services for applications
  • Model datasets for machine learning workflows

Description

Big Data refers to practices, technologies, and organizational approaches for processing very large, heterogeneous, and rapidly growing datasets. It covers storage, processing, integration and analysis to extract actionable insights. Emphasis is on scalability, data quality, governance, privacy, infrastructure requirements and operational cost.

  • Enables deep insights from large heterogeneous datasets
  • Supports data-driven decisions and automation
  • Scalable analytics for historical and real-time data

  • High infrastructure and operational costs
  • Complexity integrating heterogeneous sources
  • Requires specialised skills and processes

  • Throughput (events/s or GB/s)

    Measures the volume of data processed per time unit.

  • Latency (ms or s)

    Time between arrival of a data item and its processing completion.

  • Cost per terabyte

    Total cost for storage and processing per data volume.
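
The following helpers show, with hypothetical names and purely illustrative figures, how these three metrics are typically computed:

    def throughput_events_per_s(event_count: int, window_seconds: float) -> float:
        """Events processed per second over a measurement window."""
        return event_count / window_seconds

    def latency_ms(arrival_ts: float, completion_ts: float) -> float:
        """Time from arrival of an item to completion of its processing, in ms."""
        return (completion_ts - arrival_ts) * 1000.0

    def cost_per_terabyte(total_cost: float, volume_tb: float) -> float:
        """Total storage and processing cost divided by the data volume."""
        return total_cost / volume_tb

    # Illustrative figures: 12 million events in a 10-minute window,
    # and 18,000 in monthly cost for 450 TB of stored and processed data.
    assert throughput_events_per_s(12_000_000, 600) == 20_000.0
    assert cost_per_terabyte(18_000, 450) == 40.0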

Use cases & scenarios

Log analysis in e-commerce

Processing server and clickstream logs to detect fraud and optimise conversion rates.

Sensor data in manufacturing

Streaming analysis of machine metrics for predictive maintenance and reduced downtime.

Customer segmentation using historical transaction data

Batch analyses of large transaction datasets to identify customer patterns and personalised offers.
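
A sketch of such a batch segmentation, assuming a transactions table with customer_id, amount, and ts columns; paths and the choice of k are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()
    tx = spark.read.parquet("s3a://curated-zone/transactions/")  # assumed path

    # Recency/frequency/monetary features per customer.
    features = tx.groupBy("customer_id").agg(
        F.datediff(F.current_date(), F.max("ts")).alias("recency_days"),
        F.count("*").alias("frequency"),
        F.sum("amount").alias("monetary"),
    )

    assembled = VectorAssembler(
        inputCols=["recency_days", "frequency", "monetary"], outputCol="features"
    ).transform(features)

    # Cluster customers into five segments for targeted offers.
    model = KMeans(k=5, seed=42).fit(assembled)
    model.transform(assembled).select("customer_id", "prediction").write.mode(
        "overwrite"
    ).parquet("s3a://curated-zone/customer_segments/")  # assumed output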

  1. Define goals and metrics
  2. Inventory data sources and set priorities (see the sketch after this list)
  3. Make infrastructure and architecture decisions and run proofs of concept
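
For step 2, a lightweight source inventory can start as a simple structure like the hypothetical sketch below and later seed the metadata catalog:

    from dataclasses import dataclass

    @dataclass
    class DataSource:
        name: str
        owner: str          # owning system or team
        kind: str           # "batch" or "stream"
        contains_pii: bool  # drives governance and access decisions
        priority: int       # 1 = first onboarding wave

    # All entries are hypothetical examples.
    inventory = [
        DataSource("clickstream", "web shop", "stream", contains_pii=True, priority=1),
        DataSource("machine_metrics", "plant MES", "stream", contains_pii=False, priority=2),
        DataSource("erp_exports", "ERP team", "batch", contains_pii=False, priority=3),
    ]

    # Onboard high-priority sources first; PII sources get a governance review.
    for source in sorted(inventory, key=lambda s: s.priority):
        print(source.priority, source.name, "PII review" if source.contains_pii else "")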

⚠️ Technical debt & bottlenecks

  • Undocumented, unstructured schemas
  • Ad-hoc scripts instead of maintainable pipelines
  • Missing metadata system and data lineage

  • Storage cost
  • Network bandwidth
  • Data quality

  • Unvetted release of personal data to analysts
  • Optimising only for throughput without quality controls
  • Migrating large datasets without integrity testing
  • Overestimating data quality in legacy systems
  • Ignoring statutory retention requirements
  • Neglecting observability in data pipelines

  • Data engineering and distributed systems
  • Data modelling and ETL/ELT
  • Data governance and privacy expertise

  • Data volume and growth rate
  • Data variety and integration needs
  • Latency requirements and throughput

  • Regulatory requirements (e.g. GDPR)
  • Budget and operational effort
  • Legacy systems and incompatible formats