Concept | Tags: Data, Integration, Architecture, Software Engineering

Data Format

A data format defines the structured representation of data for storage, transmission, and interpretation across systems.

Maturity

Established

Cognitive load: Medium

Classification

Complexity: Medium
Impact area: Technical
Decision type: Architectural
Organizational maturity: Intermediate

Technical context

Integrations

Message brokers (e.g., Kafka, RabbitMQ)
REST/HTTP APIs
Data lake / object store (e.g., S3)

Principles & goals

Principles

Explicit schema definitions for interfaces
Versioning and backward-compatible changes
Minimal necessary semantics instead of implicit conventions

Value stream stage

Build

Organizational level

Enterprise, Domain, Team

Use cases & scenarios

Use cases

Scenarios

Compromises

Risks

Incompatible schema changes can cause outages
Wrong choice of binary format hinders debugging
Lack of governance leads to proliferation of formats

Best practices

Use formal schema definitions and CI validation
Limit breaking changes via compatible extensions
Document semantics and example payloads
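
The practice of limiting breaking changes via compatible extensions can be illustrated with a tolerant reader: a consumer that ignores fields it does not know and applies defaults for optional fields, so producers can add fields without breaking it. The following is a minimal sketch in Python; the `read_order` helper and the field names (`order_id`, `amount`, `currency`, `channel`) are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical consumer-side view of an "order" payload.
# Required fields have no default; optional fields carry one.
REQUIRED_FIELDS = ("order_id", "amount")
OPTIONAL_DEFAULTS = {"currency": "EUR"}


def read_order(raw: str) -> dict:
    """Parse an order, ignoring unknown fields and defaulting optional ones."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    order = {name: data[name] for name in REQUIRED_FIELDS}
    for name, default in OPTIONAL_DEFAULTS.items():
        order[name] = data.get(name, default)
    return order


# An older producer omits "currency"; a newer one adds "channel".
old_payload = '{"order_id": "A1", "amount": 10}'
new_payload = '{"order_id": "A2", "amount": 5, "currency": "USD", "channel": "web"}'

old_order = read_order(old_payload)  # "currency" falls back to the default
new_order = read_order(new_payload)  # unknown "channel" is ignored
```

Both payloads are accepted by the same consumer code, which is the property a compatible extension is meant to preserve. Removing or renaming a required field would still fail, and should be treated as a breaking change.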

I/O & resources

Inputs

Source systems and data models
Compatibility and performance requirements
Available libraries and runtime environments

Outputs

Defined schema and serialization rules
Implemented validation and tests
Governance and migration policies

Resources

Description

A data format specifies the syntax, semantics, data types, and serialization conventions of data, typically by following a common standard (e.g., JSON, XML, Avro). The concept also covers versioning, schema evolution, and rules for compatibility, validation, performance, and extensibility.

✔ Benefits

Improved interoperability between systems
Easier validation and error detection
Better compression and performance with suitable formats

✖ Limitations

Format decisions can complicate later migrations
Not all formats support complex data types equally well
Overhead from metadata or the serialization step

Trade-offs

Metrics

Schema compatibility rate
Percentage of changes that remain compatible with existing consumers.
Payload size
Average bytes per message/record to estimate bandwidth and storage.
Parsing latency
Time required for serialization/deserialization in the data processing path.
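
The three metrics above could be computed roughly as follows. This is a sketch using JSON and the Python standard library only; the function names are hypothetical, and treating an empty change list as 100 % compatible is a convention assumed here, not something the metric definition prescribes.

```python
import json
import time
from statistics import mean


def compatibility_rate(changes: list[bool]) -> float:
    """Percentage of schema changes flagged as compatible (True)."""
    if not changes:
        return 100.0  # assumed convention: no changes, nothing broken
    return 100.0 * sum(changes) / len(changes)


def average_payload_size(records: list[dict]) -> float:
    """Average size in bytes of each record encoded as UTF-8 JSON."""
    return mean(len(json.dumps(r).encode("utf-8")) for r in records)


def parsing_latency_seconds(payload: str, repeats: int = 1000) -> float:
    """Average wall-clock time of one json.loads call over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        json.loads(payload)
    return (time.perf_counter() - start) / repeats


rate = compatibility_rate([True, True, False, True])
size = average_payload_size([{"id": 1}, {"id": 22}])
latency = parsing_latency_seconds('{"id": 1}', repeats=100)
```

Latency measured this way depends on the machine and the payload, so it is more useful as a trend over time or as a comparison between candidate formats than as an absolute number.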

Examples & implementations

REST API with JSON Schema

A service uses JSON Schema to validate and document its API payloads.
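
To make the idea concrete, the sketch below checks a payload against a schema that uses only a tiny subset of JSON Schema keywords (`required`, `properties`, `type`). The `validate` function and `USER_SCHEMA` are hypothetical and hand-rolled for illustration; a real service would more likely rely on an established JSON Schema validator library than on code like this.

```python
import json

# A schema using only a small subset of JSON Schema keywords.
USER_SCHEMA = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "tags": {"type": "array"},
    },
}

# Mapping from JSON Schema type names to Python types.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "boolean": bool,
}


def validate(instance, schema: dict) -> list[str]:
    """Return a list of error messages; an empty list means valid."""
    if not isinstance(instance, dict):
        return ["payload must be an object"]
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property '{name}'")
    for name, rule in schema.get("properties", {}).items():
        if name not in instance:
            continue
        expected = _TYPES[rule["type"]]
        value = instance[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            errors.append(f"property '{name}' must be of type {rule['type']}")
    return errors


ok = validate(json.loads('{"id": 7, "email": "a@example.com"}'), USER_SCHEMA)
bad = validate(json.loads('{"id": "7", "tags": []}'), USER_SCHEMA)
```

Returning a list of messages rather than raising on the first problem lets an API report all violations in a single error response.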

Event streaming with Avro and Schema Registry

Events are serialized in Avro and versioned via a schema registry.
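
A registry typically rejects schema versions that would break existing readers. The sketch below approximates a backward-compatibility check on Avro-style record schemas written as plain dictionaries: a field added in the new schema needs a default, and removed fields are tolerated. It is a simplification with assumed names (`backward_compatibility_problems`, `V1` to `V3`); it treats every type change as incompatible, whereas Avro's own resolution rules permit some type promotions, and it does not use any Avro library.

```python
def backward_compatibility_problems(old: dict, new: dict) -> list[str]:
    """List reasons a reader using `new` cannot read data written with `old`."""
    old_fields = {f["name"]: f for f in old["fields"]}
    problems = []
    for field in new["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old data lacks this field, so the reader needs a default.
            if "default" not in field:
                problems.append(f"new field '{name}' has no default")
        elif field["type"] != old_fields[name]["type"]:
            # Simplification: any type change counts as incompatible.
            problems.append(f"field '{name}' changed type")
    return problems


V1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
# Compatible: the added field carries a default.
V2 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "EUR"},
]}
# Incompatible: a type change and a new field without a default.
V3 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "string"},
    {"name": "channel", "type": "string"},
]}

v2_problems = backward_compatibility_problems(V1, V2)
v3_problems = backward_compatibility_problems(V1, V3)
```

A check of this kind can run in CI before a schema is published, so incompatible changes are caught before they reach consumers.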

Analytical data lake with Parquet

Batch data is stored as Parquet to optimize queries and compression.
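
Parquet is a columnar format: values of the same column are stored together, so a query that needs one column does not have to read the others. The sketch below illustrates only that layout idea in plain Python. It does not read or write Parquet files, the `to_columns` helper is hypothetical, and it assumes every row has the same keys.

```python
def to_columns(rows: list[dict]) -> dict[str, list]:
    """Pivot row-oriented records into one list of values per column."""
    columns: dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 20},
    {"country": "FR", "amount": 5},
]

columns = to_columns(rows)

# An aggregation touches a single column instead of every full row.
total_amount = sum(columns["amount"])
```

Grouping similar values together, such as the repeated country codes here, is also what tends to make column-wise compression effective.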

Implementation steps

Analyze requirements and existing formats

Select a suitable format and define schemas

Introduce validation, registry and documentation
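
As a rough picture of what the registry in the last step does, the sketch below keeps schema versions per subject in memory and hands out version numbers. The class `InMemorySchemaRegistry` and its methods are hypothetical and do not mirror the API of any real schema registry product; persistence, compatibility checks, and access control are left out.

```python
import hashlib
import json


class InMemorySchemaRegistry:
    """Toy registry: stores schema versions per subject in memory."""

    def __init__(self) -> None:
        self._versions: dict[str, list[dict]] = {}

    @staticmethod
    def _fingerprint(schema: dict) -> str:
        # Canonical JSON (sorted keys) so equal schemas hash identically.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def register(self, subject: str, schema: dict) -> int:
        """Register a schema and return its 1-based version number."""
        versions = self._versions.setdefault(subject, [])
        fingerprint = self._fingerprint(schema)
        for number, existing in enumerate(versions, start=1):
            if self._fingerprint(existing) == fingerprint:
                return number  # already registered, keep the old version
        versions.append(schema)
        return len(versions)

    def latest(self, subject: str) -> dict:
        """Return the most recently registered schema for a subject."""
        return self._versions[subject][-1]


registry = InMemorySchemaRegistry()
v1 = registry.register("orders", {"fields": ["id"]})
v2 = registry.register("orders", {"fields": ["id", "amount"]})
again = registry.register("orders", {"fields": ["id"]})  # same as v1
```

Making registration idempotent means producers can register their schema on every start-up without creating spurious versions.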

⚠️ Technical debt & bottlenecks

Technical debt

Outdated format versions without migration path
Missing central registry for schemas
Ad-hoc serialization libraries across services

Known bottlenecks

Serialization/deserialization
Schema registry latency
Network bandwidth

Misuse examples

Using CSV for complex nested structures
Persisting binary JSON (BSON) without compatibility rules
Directly changing production schemas without tests
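
The first misuse can be demonstrated with the Python standard library. In this sketch (the record contents are made up), a nested value written through `csv.DictWriter` is flattened to its string representation and every value comes back as a string, while a JSON round trip preserves structure and basic types.

```python
import csv
import io
import json

record = {"id": 1, "customer": {"name": "Ada", "tags": ["vip", "beta"]}}

# CSV: the nested value is written as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "customer"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON: structure and basic types survive the round trip.
json_row = json.loads(json.dumps(record))

csv_types = {name: type(value).__name__ for name, value in csv_row.items()}
```

After the CSV round trip the consumer has to guess how to parse the `customer` column, which is exactly the kind of implicit convention that an explicit schema is meant to avoid.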

Typical traps

Assuming text formats are always sufficient
Underestimating schema migration costs
Missing monitoring of incompatibilities

Required skills

Data modeling and schema design
Knowledge of serialization formats (JSON, Avro, Parquet)
Experience with integration patterns and governance

Architectural drivers

Interoperability between heterogeneous systems
Performance and latency requirements
Long-term maintainability and schema evolution

Constraints

• Existing consumers expect a fixed schema
• Regulatory requirements for formats and metadata
• Legacy systems support only a limited set of formats