concept #Observability #Reliability #Architecture #DevOps

Distributed Tracing

Technique for tracking and correlating requests across services to make performance issues and root causes in distributed systems visible.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • OpenTelemetry Collector
  • Jaeger/Zipkin for visualization
  • Grafana for trace and metric correlation

Principles & goals

  • Propagate trace context reliably across boundaries
  • Choose pragmatic granularity: enough visibility, acceptable overhead
  • Prefer standardized formats and open standards
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Missing or inconsistent instrumentation leads to misleading results
  • Privacy risks from unintentionally captured sensitive data in traces
  • High storage costs with uncontrolled trace retention

  • Use standardized trace IDs and context propagation
  • Define and document sampling strategies
  • Mask or filter sensitive data before capturing traces
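
The masking practice above can be sketched as a small attribute scrubber that runs before a span is exported. This is a minimal illustration using only the Python standard library; the key list `SENSITIVE_KEYS`, the placeholder strings, and the function name `scrub_attributes` are assumptions made for this example, not part of any tracing library.

```python
import re

# Assumed deny-list of attribute names; a real deployment would maintain its own.
SENSITIVE_KEYS = {"password", "authorization", "cookie", "set-cookie", "api_key"}

# Simple e-mail pattern, good enough for a sketch but not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy of span attributes with sensitive values masked."""
    cleaned = {}
    for key, value in attributes.items():
        # Match on the last dotted segment so that
        # "http.request.header.authorization" is caught as well.
        if key.lower().split(".")[-1] in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Mask e-mail addresses embedded in free-text values.
            cleaned[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            cleaned[key] = value
    return cleaned
```

Scrubbing at the point of capture, or at the latest in the collector, keeps sensitive values out of the storage backend entirely, which is usually easier than deleting them afterwards.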

I/O & resources

  • Instrumented application libraries
  • Trace export/collector infrastructure
  • Storage and analysis backend

  • Trace history and visualizations
  • Root causes and critical paths
  • Metrics for latency and error trends

Description

Distributed tracing is a technique to record and correlate requests across services to analyze performance and diagnose failures in distributed systems. It captures spans and trace context across process and network boundaries, enabling root-cause analysis, latency breakdowns, and dependency mapping. It is widely used for observability and operational debugging.
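
To make "trace context across process and network boundaries" concrete, the sketch below builds and parses the `traceparent` HTTP header defined by the W3C Trace Context standard (`version-traceid-parentid-flags`). It is a stdlib-only illustration that handles version `00` only; the function names are chosen for this example, and production code would normally rely on an OpenTelemetry propagator instead.

```python
import re
import secrets

# Version 00 layout: "00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or span IDs are invalid in the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: str) -> str:
    """Continue the caller's trace: same trace ID, fresh span ID for this hop."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # No usable context arrived, so this service becomes the trace root.
        return new_traceparent()
    trace_id, _parent_id, sampled = parsed
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A service would call `child_traceparent` on the incoming header and attach the result to every outgoing request, so that all spans of one request share the same trace ID.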

  • Faster fault localization and root-cause analysis
  • Finer performance insights across service boundaries
  • Improved cross-team communication through shared traces

  • Only captures instrumented paths; blind spots remain
  • Additional runtime and storage overhead
  • Correlation challenges for asynchronous or batch jobs
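
One common way to ease the correlation problem for asynchronous work is to carry the trace context inside the message itself rather than relying on an open connection. The sketch below assumes a JSON message envelope with a `headers` field and uses an in-process `queue.Queue` as a stand-in for a real broker; the envelope layout and the function names are hypothetical.

```python
import json
import queue
import secrets


def publish(q: queue.Queue, payload: dict, traceparent: str) -> None:
    """Producer side: store the current trace context in the message headers."""
    q.put(json.dumps({"headers": {"traceparent": traceparent}, "body": payload}))


def consume(q: queue.Queue):
    """Consumer side: continue the producer's trace, or start a new one."""
    message = json.loads(q.get())
    parts = message["headers"].get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_span_id = parts[1], parts[2]
    else:
        # Context is missing or malformed: this span becomes a new trace root.
        trace_id, parent_span_id = secrets.token_hex(16), None
    span = {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
    }
    return span, message["body"]
```

Because the context travels with the message, the consumer's span can be linked to the producer's trace even if the job runs hours later or on a different host.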

  • Average request latency (trace)

    Mean duration of a distributed request measured via traces.

  • Share of error traces

    Percentage of traces containing errors or exceptions.

  • Spans per trace

    Average number of spans in a trace as a granularity indicator.
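
The three metrics above can be derived directly from exported span records. The following sketch assumes each span is a dictionary with the fields `trace_id`, `start_ms`, `end_ms`, and `error`; this record shape is an assumption for illustration and does not match any particular backend's schema.

```python
from collections import defaultdict
from statistics import mean


def trace_metrics(spans):
    """Compute average latency, error-trace share, and spans per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    if not by_trace:
        return {"avg_latency_ms": 0.0, "error_trace_share": 0.0, "spans_per_trace": 0.0}

    durations, span_counts, error_traces = [], [], 0
    for trace_spans in by_trace.values():
        # Trace duration: earliest span start to latest span end.
        start = min(s["start_ms"] for s in trace_spans)
        end = max(s["end_ms"] for s in trace_spans)
        durations.append(end - start)
        span_counts.append(len(trace_spans))
        # A trace counts as an error trace if any of its spans failed.
        if any(s["error"] for s in trace_spans):
            error_traces += 1

    return {
        "avg_latency_ms": mean(durations),
        "error_trace_share": error_traces / len(by_trace),
        "spans_per_trace": mean(span_counts),
    }
```

Note that these figures describe only the traces that were actually collected; under sampling they are estimates of the underlying traffic, not exact values.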

Jaeger for distributed trace analysis

Use of Jaeger to collect, visualize, and analyze traces in a microservice environment.

OpenTelemetry instrumentation for HTTP services

Use libraries to automatically instrument HTTP requests and propagate trace context.

End-to-end trace for reproducing failures

Store complete request traces to follow a failure path across multiple services.

1. Instrument application libraries with OpenTelemetry and propagate trace context.

2. Set up OpenTelemetry Collector and configure exporters.

3. Tune trace sampling, retention, and storage layer to SLAs and costs.
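
As an illustration of the sampling part of the last step, the sketch below shows a deterministic head-sampling decision based on the trace ID. It assumes trace IDs are 32-character hex strings whose low 64 bits are uniformly random; the function name `should_sample` is chosen for this example. OpenTelemetry SDKs ship a ratio-based sampler built on a similar idea, but the code here is a standalone sketch, not that implementation.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, decided only from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Interpret the low 64 bits of the trace ID as a number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    # The trace is kept when that number falls below the ratio threshold.
    return value < int(ratio * 2**64)
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which avoids traces with some spans kept and others dropped.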

⚠️ Technical debt & bottlenecks

  • Outdated instrumentation in legacy components
  • Missing trace context propagation in batch jobs
  • Monolithic exporters with poor scalability

  • Incomplete instrumentation
  • Network or serialization overhead
  • Scaling of collection and storage backend

  • Collect traces containing full user data and store them in violation of privacy requirements
  • Aggressive sampling that hides rare errors
  • Instrument only parts of the system and draw wrong conclusions
  • Confusing trace latency with system latency without context
  • Ignoring asynchronous workflows when correlating
  • Insufficient sampling design leads to missing insights
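
One way to limit the risk that sampling hides rare errors is to decide after a trace has completed (tail-based sampling) and to always keep traces that failed or were unusually slow. The sketch below assumes spans are dictionaries with `trace_id`, `start_ms`, `end_ms`, and `error` fields; the thresholds and the function name `keep_trace` are illustrative assumptions.

```python
def keep_trace(spans, baseline_ratio: float = 0.05, slow_ms: int = 1000) -> bool:
    """Tail-sampling decision for one completed trace."""
    if not spans:
        return False
    # Rule 1: never drop a trace that contains an error.
    if any(s.get("error") for s in spans):
        return True
    # Rule 2: keep unusually slow traces for latency analysis.
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= slow_ms:
        return True
    # Rule 3: keep a small deterministic share of ordinary traces as a baseline.
    low_bits = int(spans[0]["trace_id"][-16:], 16)
    return low_bits < int(baseline_ratio * 2**64)
```

The trade-off is that all spans of a trace must be buffered until the decision is made, which shifts cost from storage to the collection tier.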

  • Understanding of distributed systems and context propagation
  • Experience with observability tooling
  • Knowledge of performance analysis and root-cause techniques

  • Visibility across service boundaries
  • Reduction of mean time to repair
  • Standardized context propagation

  • Performance overhead must not violate SLAs
  • Security and privacy requirements for trace data
  • Compatibility with existing monitoring systems