Data Format
A data format defines the structured representation of information for storage and transmission between systems.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
Risks
- Incompatible schema changes can cause outages
- A poorly chosen binary format hinders debugging
- Lack of governance leads to a proliferation of formats
Mitigations
- Use formal schema definitions and CI validation
- Limit breaking changes via compatible extensions
- Document semantics and example payloads
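A CI check for compatible extensions could look roughly like the sketch below. It assumes a hypothetical, simplified schema representation (a dict mapping field names to a type and a required flag) and an assumed policy: a change counts as breaking if it removes a field, changes a field's type, or adds a new required field. Real schema languages define their own compatibility rules, so treat this as an illustration only.

```python
from typing import Dict, List

# Hypothetical simplified schema: field name -> {"type": str, "required": bool}
Schema = Dict[str, Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """Return human-readable reasons why `new` breaks users of `old`."""
    problems: List[str] = []
    # Existing fields must survive with an unchanged type.
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} {spec['type']} -> {new[name]['type']}"
            )
    # Newly added fields must be optional.
    for name, spec in new.items():
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


order_v1: Schema = {
    "id": {"type": "string", "required": True},
    "amount": {"type": "number", "required": True},
}
# Compatible extension: only adds an optional field.
order_v2: Schema = {**order_v1, "currency": {"type": "string", "required": False}}
# Breaking change: the type of an existing field changes.
order_v3: Schema = {**order_v1, "amount": {"type": "string", "required": True}}

print(breaking_changes(order_v1, order_v2))
print(breaking_changes(order_v1, order_v3))
```

In a pipeline, a non-empty result for a proposed schema would fail the build before the change reaches consumers.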
I/O & resources
Inputs
- Source systems and data models
- Compatibility and performance requirements
- Available libraries and runtime environments
Outputs
- Defined schema and serialization rules
- Implemented validation and tests
- Governance and migration policies
Description
A data format defines the structured representation of data for storage, transmission, and interpretation across systems. It specifies syntax, semantics, data types, and serialization conventions, typically by adopting a common standard (e.g., JSON, XML, Avro). It also covers versioning, schema evolution, and rules for compatibility, validation, performance, and extensibility.
✔ Benefits
- Improved interoperability between systems
- Easier validation and error detection
- Better compression and performance with suitable formats
✖ Limitations
- Format decisions can complicate later migrations
- Not all formats support complex data types equally well
- Overhead from metadata or the serialization step
Trade-offs
Metrics
- Schema compatibility rate
Percentage of changes that remain compatible with existing consumers.
- Payload size
Average bytes per message/record to estimate bandwidth and storage.
- Parsing latency
Time required for serialization/deserialization in the data processing path.
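The three metrics above can be approximated with a few helper functions. This is a minimal sketch that assumes JSON as the format under measurement and uses only the standard library; the function names are hypothetical, and a production measurement would sample real traffic rather than an in-memory list.

```python
import json
import time
from statistics import mean
from typing import Iterable, List


def average_payload_size(records: Iterable[dict]) -> float:
    """Average size in bytes of each record encoded as compact UTF-8 JSON."""
    return mean(
        len(json.dumps(r, separators=(",", ":")).encode("utf-8")) for r in records
    )


def parse_latency_seconds(payloads: List[bytes], repeats: int = 100) -> float:
    """Average wall-clock time to deserialize one payload with json.loads."""
    start = time.perf_counter()
    for _ in range(repeats):
        for p in payloads:
            json.loads(p)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(payloads))


def compatibility_rate(change_results: List[bool]) -> float:
    """Percentage of schema changes that were judged compatible (True)."""
    return 100.0 * sum(change_results) / len(change_results)


sample = [{"id": i, "name": f"item-{i}"} for i in range(1000)]
encoded = [json.dumps(r).encode("utf-8") for r in sample]

print(f"avg payload size: {average_payload_size(sample):.1f} bytes")
print(f"avg parse latency: {parse_latency_seconds(encoded) * 1e6:.2f} microseconds")
print(f"compatibility rate: {compatibility_rate([True, True, False, True]):.0f}%")
```

Running the same helpers against a second encoding of the same records gives a like-for-like comparison between candidate formats.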
Examples & implementations
- REST API with JSON Schema
A service uses JSON Schema to validate and document its API payloads.
- Event streaming with Avro and Schema Registry
Events are serialized in Avro and versioned via a schema registry.
- Analytical data lake with Parquet
Batch data is stored as Parquet to optimize queries and compression.
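The first example, validating API payloads against a schema, can be illustrated with the standard library alone. The `validate` function below is a hand-rolled sketch that understands only a small subset of JSON Schema keywords (`type`, `required`, `properties`); it is not a conforming JSON Schema implementation, and `ORDER_SCHEMA` is a hypothetical schema invented for the illustration.

```python
import json
from typing import Any, Dict, List

# Maps JSON Schema type names to Python types (subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def validate(instance: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        py_type = _TYPES[expected]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(instance, bool) and expected in ("number", "integer")
        if not isinstance(instance, py_type) or bool_as_number:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}.{name}: required property missing")
        for name, sub in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], sub, f"{path}.{name}"))
    return errors


# Hypothetical schema for an order payload.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "amount"],
    "properties": {
        "id": {"type": "string"},
        "amount": {"type": "number"},
    },
}

# The amount arrives as a string, which the schema rejects.
payload = json.loads('{"id": "A-1", "amount": "12.50"}')
print(validate(payload, ORDER_SCHEMA))  # ['$.amount: expected number']
```

A real service would more likely delegate to an established validator library; the point here is only that the schema, not ad-hoc code, decides what counts as a valid payload.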
Implementation steps
- Analyze requirements and existing formats
- Select a suitable format and define schemas
- Introduce validation, registry and documentation
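To make the registry step more concrete, here is a toy in-memory registry that versions schemas per subject and refuses incompatible updates. The class `InMemorySchemaRegistry`, the flat schema representation, and the compatibility policy are all assumptions made for this sketch; they do not mirror the API of any particular registry product.

```python
from typing import Callable, Dict, List

Schema = Dict[str, str]  # hypothetical: field name -> type name


def keeps_existing_fields(old: Schema, new: Schema) -> bool:
    """Assumed policy: every old field must survive with the same type."""
    return all(new.get(name) == type_name for name, type_name in old.items())


class InMemorySchemaRegistry:
    """Toy registry: keeps an ordered list of schema versions per subject."""

    def __init__(self, is_compatible: Callable[[Schema, Schema], bool]) -> None:
        self._versions: Dict[str, List[Schema]] = {}
        self._is_compatible = is_compatible

    def register(self, subject: str, schema: Schema) -> int:
        """Store a schema and return its 1-based version number."""
        versions = self._versions.setdefault(subject, [])
        if versions and not self._is_compatible(versions[-1], schema):
            raise ValueError(f"incompatible schema for subject {subject!r}")
        versions.append(dict(schema))
        return len(versions)

    def get(self, subject: str, version: int) -> Schema:
        """Look up a specific version; versions start at 1."""
        if version < 1:
            raise KeyError(version)
        return self._versions[subject][version - 1]


registry = InMemorySchemaRegistry(keeps_existing_fields)
v1 = registry.register("order", {"id": "string", "amount": "number"})
v2 = registry.register(
    "order", {"id": "string", "amount": "number", "currency": "string"}
)
print(v1, v2)  # 1 2
```

Producers would register a schema before publishing and consumers would fetch it by subject and version, so both sides agree on the structure without sharing code.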
⚠️ Technical debt & bottlenecks
Technical debt
- Outdated format versions without migration path
- Missing central registry for schemas
- Ad-hoc serialization libraries across services
Known bottlenecks
Misuse examples
- Using CSV for complex nested structures
- Persisting binary JSON (BSON) without compatibility rules
- Directly changing production schemas without tests
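The first misuse is easy to reproduce. In the sketch below, which uses only Python's standard `csv` and `json` modules and a made-up order record, the nested list is written to CSV as its string representation and comes back as plain text, while the JSON round trip preserves the structure.

```python
import csv
import io
import json

# Made-up record with a nested list of line items.
record = {"id": "A-1", "items": [{"sku": "X", "qty": 2}, {"sku": "Y", "qty": 1}]}

# CSV: the writer stringifies non-string values, so the nested list becomes text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "items"])
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON: the nested structure survives the round trip.
json_row = json.loads(json.dumps(record))

print(type(csv_row["items"]).__name__)   # str
print(type(json_row["items"]).__name__)  # list
```

Every consumer of the CSV file now has to guess how to parse the `items` column, which is exactly the kind of implicit contract a data format is meant to avoid.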
Typical traps
- Assuming text formats are always sufficient
- Underestimating schema migration costs
- Missing monitoring of incompatibilities
Required skills
Architectural drivers
Constraints
- Existing consumers expect a fixed schema
- Regulatory requirements for formats and metadata
- Legacy systems support only a limited set of formats