Big Data
Big Data denotes practices and technologies for storing, processing and analysing very large, heterogeneous datasets to derive actionable insights.
Classification
- Complexity: High
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
- Privacy breaches and legal consequences
- Misinterpreting correlations as causation
- Loss of data quality with poor preparation
Recommendations
- Establish a metadata catalog early (a minimal catalog-entry sketch follows this list)
- Automated monitoring and cost tracking
- Measure and continuously improve data quality
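To make the metadata-catalog recommendation concrete, here is a minimal Python sketch of what a catalog entry might record; the `CatalogEntry` fields and example values are assumptions for illustration, and a dedicated data-catalog tool would normally manage this information.

```python
from dataclasses import dataclass, field

# Illustrative catalog entry; the fields and values are assumptions, not a standard.
@dataclass
class CatalogEntry:
    dataset: str
    description: str
    owner: str
    source_system: str
    update_frequency: str
    personal_data: bool
    upstream_datasets: list[str] = field(default_factory=list)  # coarse data lineage

catalog = {
    "orders_cleaned": CatalogEntry(
        dataset="orders_cleaned",
        description="Deduplicated orders with validated totals",
        owner="data-engineering",
        source_system="erp_orders",
        update_frequency="daily",
        personal_data=True,
        upstream_datasets=["erp_orders_raw"],
    )
}
print(catalog["orders_cleaned"].upstream_datasets)
```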
I/O & resources
- Raw data from production sources
- Infrastructure for storage and processing
- Metadata and data catalogs
- Analytical datasets and reports
- APIs and data services for applications
- Model datasets for machine learning workflows
Description
Big Data refers to practices, technologies, and organizational approaches for processing very large, heterogeneous, and rapidly growing datasets. It covers storage, processing, integration, and analysis to extract actionable insights, with an emphasis on scalability, data quality, governance, privacy, infrastructure requirements, and operational cost.
✔ Benefits
- Enables deep insights from large heterogeneous datasets
- Supports data-driven decisions and automation
- Scalable analytics for historical and real-time data
✖ Limitations
- High infrastructure and operational costs
- Complexity integrating heterogeneous sources
- Requires specialised skills and processes
Trade-offs
Metrics
- Throughput (events/s or GB/s)
Measures the volume of data processed per time unit.
- Latency (ms or s)
Time between arrival of a data item and its processing completion.
- Cost per terabyte
Total storage and processing cost per terabyte of data handled.
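To show how these metrics can be derived in practice, the short Python sketch below computes throughput, latency, and cost per terabyte from hypothetical pipeline statistics; all record counts, timestamps, and cost figures are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical batch statistics collected from a pipeline run.
records_processed = 12_000_000          # events in the batch (assumed)
bytes_processed = 1.8 * 10**12          # ~1.8 TB of raw input (assumed)
window = timedelta(hours=1)             # processing window

# Throughput: volume of data processed per unit of time.
events_per_second = records_processed / window.total_seconds()
gb_per_second = bytes_processed / 10**9 / window.total_seconds()

# Latency: time between arrival of a data item and completion of its processing.
arrival = datetime(2024, 5, 1, 12, 0, 0)
completed = datetime(2024, 5, 1, 12, 0, 0, 250_000)  # 250 ms later
latency_ms = (completed - arrival).total_seconds() * 1000

# Cost per terabyte: total storage and processing cost per data volume.
monthly_cost_eur = 42_000.0             # infrastructure + operations (assumed)
monthly_volume_tb = 1_300.0             # data volume handled per month (assumed)
cost_per_tb = monthly_cost_eur / monthly_volume_tb

print(f"{events_per_second:,.0f} events/s, {gb_per_second:.2f} GB/s")
print(f"latency: {latency_ms:.0f} ms, cost: {cost_per_tb:.2f} EUR/TB")
```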
Examples & implementations
Log analysis in e-commerce
Processing server and clickstream logs to detect fraud and optimise conversion rates.
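A minimal sketch of the idea, assuming newline-delimited JSON clickstream events with hypothetical `session_id`, `event`, and `ts` fields; a production pipeline would run comparable logic on a distributed engine rather than in a single process, and the click-rate threshold merely stands in for real fraud heuristics or models.

```python
import json
from collections import defaultdict

# Flag sessions with an implausibly high click rate (a simple fraud signal)
# and report checkout conversion. Field names and thresholds are assumptions;
# "ts" is assumed to be seconds since the start of the session.
MAX_CLICKS_PER_MINUTE = 120

def analyse_clickstream(lines):
    sessions = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        sessions[event["session_id"]].append(event)

    suspicious, conversions = [], 0
    for session_id, events in sessions.items():
        events.sort(key=lambda e: e["ts"])
        duration_min = max((events[-1]["ts"] - events[0]["ts"]) / 60, 1 / 60)
        if len(events) / duration_min > MAX_CLICKS_PER_MINUTE:
            suspicious.append(session_id)
        if any(e["event"] == "checkout" for e in events):
            conversions += 1

    return {
        "sessions": len(sessions),
        "conversion_rate": conversions / max(len(sessions), 1),
        "suspicious_sessions": suspicious,
    }

sample = [
    '{"session_id": "s1", "event": "view", "ts": 0}',
    '{"session_id": "s1", "event": "checkout", "ts": 30}',
    '{"session_id": "s2", "event": "view", "ts": 0}',
]
print(analyse_clickstream(sample))
```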
Sensor data in manufacturing
Streaming analysis of machine metrics for predictive maintenance and reduced downtime.
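As an illustrative sketch only, the following snippet flags machine readings that deviate strongly from a rolling baseline, a simple stand-in for predictive-maintenance analytics; the window size, z-score threshold, and simulated vibration values are assumptions, and a real deployment would typically run this in a stream processor.

```python
import statistics
from collections import deque

# Flag readings that deviate strongly from the recent rolling baseline.
# Window size and z-score threshold are illustrative assumptions.
WINDOW_SIZE = 50
Z_THRESHOLD = 3.0

def detect_anomalies(readings):
    window = deque(maxlen=WINDOW_SIZE)
    alerts = []
    for i, value in enumerate(readings):
        if len(window) == WINDOW_SIZE:
            mean = statistics.fmean(window)
            stdev = statistics.pstdev(window) or 1e-9
            if abs(value - mean) / stdev > Z_THRESHOLD:
                alerts.append((i, value))  # candidate for a maintenance check
        window.append(value)
    return alerts

# Simulated vibration metric: a stable baseline with one sudden spike.
stream = [1.0 + 0.01 * (i % 5) for i in range(200)]
stream[150] = 5.0
print(detect_anomalies(stream))
```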
Customer segmentation using historical transaction data
Batch analyses of large transaction datasets to identify customer patterns and personalised offers.
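One possible shape of such a batch analysis is a simple RFM-style (recency, frequency, monetary) segmentation, sketched below with pandas; the column names, reference date, and median-based segmentation rule are assumptions rather than a prescribed method.

```python
import pandas as pd

# Hypothetical transaction extract: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-10", "2023-11-20",
         "2024-02-01", "2024-02-15", "2024-03-01"]),
    "amount": [120.0, 80.0, 35.0, 500.0, 230.0, 410.0],
})
reference_date = pd.Timestamp("2024-04-01")

# Aggregate per customer: recency (days since last order), frequency, monetary value.
rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (reference_date - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Coarse segmentation: above-median spend combined with recent activity.
rfm["segment"] = "standard"
rfm.loc[
    (rfm["monetary"] > rfm["monetary"].median()) & (rfm["recency"] < 60),
    "segment",
] = "high_value"
print(rfm)
```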
Implementation steps
Define goals and metrics
Inventory data sources and set priorities (see the inventory sketch after these steps)
Make infrastructure and architecture decisions and run proofs of concept
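For the inventory step above, one lightweight starting point is a machine-readable list of sources with owners, volumes, and priorities, as in this sketch; every field name and entry is purely illustrative.

```python
from dataclasses import dataclass

# Purely illustrative inventory entries; field names and ratings are assumptions.
@dataclass
class DataSource:
    name: str
    owner: str
    format: str
    estimated_volume_gb_per_day: float
    contains_personal_data: bool
    business_priority: int  # 1 = highest

inventory = [
    DataSource("webshop_clickstream", "e-commerce team", "json", 400.0, True, 1),
    DataSource("erp_orders", "finance", "relational", 15.0, True, 1),
    DataSource("machine_telemetry", "manufacturing", "avro", 900.0, False, 2),
]

# Shortlist sources for a first proof of concept: highest priority, smallest volume first.
poc_candidates = sorted(
    (s for s in inventory if s.business_priority == 1),
    key=lambda s: s.estimated_volume_gb_per_day,
)
for source in poc_candidates:
    print(source.name, source.estimated_volume_gb_per_day, "GB/day")
```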
⚠️ Technical debt & bottlenecks
Technical debt
- Undocumented, unstructured schemas
- Ad-hoc scripts instead of maintainable pipelines
- Missing metadata system and data lineage
Known bottlenecks
Misuse examples
- Unvetted release of personal data to analysts
- Optimising only for throughput without quality controls
- Migrating large datasets without integrity testing
Typical traps
- Overestimating data quality in legacy systems
- Ignoring statutory retention requirements
- Neglecting observability in data pipelines
Required skills
Architectural drivers
Constraints
- Regulatory requirements (e.g. GDPR)
- Budget and operational effort
- Legacy systems and incompatible formats