concept #Data #Integration #Architecture #Software Engineering

Data Format

A data format defines the structured representation of information for storage and transmission between systems.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Message brokers (e.g., Kafka, RabbitMQ)
  • REST/HTTP APIs
  • Data lake / object store (e.g., S3)

Principles & goals

  • Explicit schema definitions for interfaces
  • Versioning and backward-compatible changes
  • Minimal necessary semantics instead of implicit conventions
Build
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Incompatible schema changes can cause outages
  • Wrong choice of binary format hinders debugging
  • Lack of governance leads to proliferation of formats

  • Use formal schema definitions and CI validation
  • Limit breaking changes via compatible extensions
  • Document semantics and example payloads
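
The idea of limiting breaking changes through compatible extensions can be sketched as a small check that could run in CI. This is a minimal illustration only: the dictionary-based schema model and the `breaking_changes` helper are hypothetical and not tied to any specific schema language or tool.

```python
from typing import Dict, List

# Hypothetical, simplified schema model: field name -> {"type": str, "required": bool}
Schema = Dict[str, Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes in `new` that could break code written against `old`."""
    problems: List[str] = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        # A new optional field is a compatible extension; a new required one is not.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


v1: Schema = {
    "id": {"type": "string", "required": True},
    "amount": {"type": "number", "required": True},
}
# Compatible extension: only adds an optional field.
v2: Schema = {**v1, "currency": {"type": "string", "required": False}}
# Breaking change: retypes an existing field and adds a required one.
v3: Schema = {
    "id": {"type": "integer", "required": True},
    "amount": {"type": "number", "required": True},
    "tenant": {"type": "string", "required": True},
}

if __name__ == "__main__":
    print(breaking_changes(v1, v2))
    print(breaking_changes(v1, v3))
```

A CI job could fail the build whenever the returned list is non-empty, so incompatible changes are caught before they reach consumers.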

I/O & resources

  • Source systems and data models
  • Compatibility and performance requirements
  • Available libraries and runtime environments

  • Defined schema and serialization rules
  • Implemented validation and tests
  • Governance and migration policies

Description

A data format defines the structured representation of data for storage, transmission, and interpretation across systems. It specifies syntax, semantics, data types, serialization conventions and common standards (e.g., JSON, XML, Avro). It also covers versioning, schema evolution and rules for compatibility, validation, performance and extensibility.

  • Improved interoperability between systems
  • Easier validation and error detection
  • Better compression and performance with suitable formats

  • Format decisions can complicate later migrations
  • Not all formats support complex data types equally well
  • Overhead from metadata or serialization steps

  • Schema compatibility rate

    Percentage of changes that remain compatible with existing consumers.

  • Payload size

    Average bytes per message/record to estimate bandwidth and storage.

  • Parsing latency

    Time required for serialization/deserialization in the data processing path.
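
As a rough sketch, the three metrics above could be computed as follows. JSON from the standard library is used purely as an example format, and the function names are illustrative assumptions rather than part of any standard tooling.

```python
import json
import time
from typing import Iterable, List


def average_payload_size(records: Iterable[dict]) -> float:
    """Average size in bytes of each record when serialized as UTF-8 JSON."""
    sizes = [len(json.dumps(r).encode("utf-8")) for r in records]
    if not sizes:
        raise ValueError("at least one record is required")
    return sum(sizes) / len(sizes)


def round_trip_seconds(record: dict, repetitions: int = 1000) -> float:
    """Average wall-clock time of one serialize + deserialize cycle."""
    start = time.perf_counter()
    for _ in range(repetitions):
        json.loads(json.dumps(record))
    return (time.perf_counter() - start) / repetitions


def compatibility_rate(change_results: List[bool]) -> float:
    """Percentage of schema changes that were judged compatible (True)."""
    if not change_results:
        raise ValueError("at least one change is required")
    return 100.0 * sum(change_results) / len(change_results)


if __name__ == "__main__":
    sample = [{"id": "A-1", "amount": 12.5}, {"id": "A-2", "amount": 7}]
    print(average_payload_size(sample))
    print(round_trip_seconds(sample[0]))
    print(compatibility_rate([True, True, False, True]))
```

Latency figures from such a micro-benchmark depend heavily on hardware and payload shape, so they are best compared between candidate formats on the same machine rather than read as absolute values.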

REST API with JSON Schema

A service uses JSON Schema to validate and document its API payloads.
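
To illustrate the kind of check involved, the sketch below validates a payload against a schema written with real JSON Schema keywords. The hand-rolled `validate` function covers only `type`, `required` and `properties`; it is a deliberately tiny stand-in, and a production service would more likely rely on a full JSON Schema library. `ORDER_SCHEMA` and its field names are made up for the example.

```python
import json
from typing import Any, List

# Example schema; only the keywords "type", "required", "properties" are handled below.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "amount"],
    "properties": {
        "id": {"type": "string"},
        "amount": {"type": "number"},
        "note": {"type": "string"},
    },
}

_PY_TYPES = {
    "object": dict,
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
}


def validate(instance: Any, schema: dict, path: str = "$") -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        bool_as_number = expected == "number" and isinstance(instance, bool)
        if not isinstance(instance, _PY_TYPES[expected]) or bool_as_number:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}.{name}: required field missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], sub_schema, f"{path}.{name}"))
    return errors


if __name__ == "__main__":
    good = json.loads('{"id": "A-1", "amount": 12.5}')
    bad = json.loads('{"amount": "12"}')
    print(validate(good, ORDER_SCHEMA))
    print(validate(bad, ORDER_SCHEMA))
```

Because the same schema document can drive both validation and API documentation, the payload contract stays in one place.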

Event streaming with Avro and Schema Registry

Events are serialized in Avro and versioned via a schema registry.
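
The mechanism can be sketched with a toy in-memory registry: each schema version gets an id, the producer prefixes the message with that id, and the consumer uses it to look up the writer's schema. Everything here is an assumption made for illustration: `InMemorySchemaRegistry`, `encode` and `decode` are hypothetical names, the 4-byte header is a simplification, and the body is JSON rather than Avro binary so that the example runs with the standard library only.

```python
import json
import struct
from typing import Dict, List, Tuple


class InMemorySchemaRegistry:
    """Hypothetical stand-in for a schema registry; not a real client API."""

    def __init__(self) -> None:
        self._schemas: Dict[int, dict] = {}
        self._subjects: Dict[str, List[int]] = {}

    def register(self, subject: str, schema: dict) -> int:
        """Store a new schema version for a subject and return its global id."""
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = schema
        self._subjects.setdefault(subject, []).append(schema_id)
        return schema_id

    def get(self, schema_id: int) -> dict:
        return self._schemas[schema_id]

    def latest_version(self, subject: str) -> int:
        return len(self._subjects[subject])


def encode(schema_id: int, record: dict) -> bytes:
    # 4-byte big-endian schema id, then the body (JSON here, Avro binary in practice).
    return struct.pack(">I", schema_id) + json.dumps(record).encode("utf-8")


def decode(registry: InMemorySchemaRegistry, message: bytes) -> Tuple[dict, dict]:
    """Return (writer schema, record) for a framed message."""
    (schema_id,) = struct.unpack(">I", message[:4])
    return registry.get(schema_id), json.loads(message[4:].decode("utf-8"))


if __name__ == "__main__":
    registry = InMemorySchemaRegistry()
    v1 = {"type": "record", "name": "OrderCreated",
          "fields": [{"name": "id", "type": "string"}]}
    # Version 2 adds a field with a default, a typical compatible extension.
    v2 = {"type": "record", "name": "OrderCreated",
          "fields": [{"name": "id", "type": "string"},
                     {"name": "currency", "type": "string", "default": "EUR"}]}
    registry.register("order-created", v1)
    v2_id = registry.register("order-created", v2)
    schema, record = decode(registry, encode(v2_id, {"id": "A-1", "currency": "USD"}))
    print(schema["name"], record)
```

Carrying only a schema id in each message keeps payloads small while still letting consumers resolve the exact version that was used to write them.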

Analytical data lake with Parquet

Batch data is stored as Parquet to optimize queries and compression.

  1. Analyze requirements and existing formats
  2. Select a suitable format and define schemas
  3. Introduce validation, registry and documentation

⚠️ Technical debt & bottlenecks

  • Outdated format versions without migration path
  • Missing central registry for schemas
  • Ad-hoc serialization libraries across services

  • Serialization/deserialization
  • Schema registry latency
  • Network bandwidth

  • Using CSV for complex nested structures
  • Persisting binary JSON (BSON) without compatibility rules
  • Directly changing production schemas without tests

  • Assuming text formats are always sufficient
  • Underestimating schema migration costs
  • Missing monitoring of incompatibilities

  • Data modeling and schema design
  • Knowledge of serialization formats (JSON, Avro, Parquet)
  • Experience with integration patterns and governance

  • Interoperability between heterogeneous systems
  • Performance and latency requirements
  • Long-term maintainability and schema evolution

  • Existing consumers expect a fixed schema
  • Regulatory requirements for formats and metadata
  • Legacy systems support only a limited set of formats