concept#Software Engineering#Reliability#Observability#Security

Error Handling

Core strategies for detecting, classifying and handling errors in software systems.

Established
Medium

Classification

  • Medium
  • Technical
  • Design
  • Intermediate

Technical context

  • Central logging solution (e.g. ELK/OpenSearch)
  • Monitoring and alerts (e.g. Prometheus)
  • Distributed tracing (e.g. Jaeger, OpenTelemetry)

Principles & goals

  • Validate inputs early and handle negative paths explicitly.
  • Provide consistent, machine-readable error formats.
  • Prefer recovery and fallback over silent failures.
Build
Team, Domain

Use cases & scenarios

Compromises

  • Incorrect classification leads to false alerts.
  • Overly coarse fallbacks can create inconsistent states.
  • Insufficient logs greatly hinder debugging.
  • Validate inputs early and use clear HTTP status codes.
  • Propagate correlation IDs across components.
  • Log errors with rich context and structured format (JSON).
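
The last two practices above (correlation IDs and structured JSON logs) can be sketched with Python's standard `logging` module. This is a minimal sketch, not a definitive implementation: the `correlation_id` context variable, the `JsonFormatter` class, the `build_logger` and `handle_request` helpers, and the `extra={"context": ...}` convention are all assumptions made for illustration.

```python
import json
import logging
import sys
import uuid
from contextvars import ContextVar

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Extra context passed via extra={"context": {...}} (a convention of this sketch).
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's correlation ID if present, otherwise create one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        try:
            raise ValueError("quantity must be positive")  # simulated validation failure
        except ValueError:
            logger.error(
                "order validation failed",
                exc_info=True,
                extra={"context": {"order_id": "o-123", "http_status": 400}},
            )
        return cid
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(sys.stderr), incoming_id="req-42")
```

In a real service the incoming ID would typically be read from a request header and forwarded on every outgoing call, so that log lines from different components can be joined on the same value.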

I/O & resources

  • Requirements and SLO definitions
  • Error and exception taxonomy
  • Logging and tracing infrastructure
  • Standardized error responses
  • Alerts, metrics and dashboards
  • Documentation of recovery strategies

Description

Error handling defines strategies and mechanisms to detect, classify and respond to faults in software systems. It covers input validation, consistent error responses, fallback paths and recovery tactics to ensure resilient runtime behavior. Effective error handling reduces downtime, improves debugging and supports clear user communication.

  • Reduces downtime through clear recovery strategies.
  • Facilitates root-cause analysis via structured logs and traces.
  • Improves API interoperability via standardized error formats.

  • Not all errors can be fully auto-repaired.
  • Requires additional effort for design, logging and tests.
  • Standardized formats may hide domain-specific details.

  • Error rate

    Proportion of failed requests per time window.

  • Mean Time To Recover

    Average time to recover after an error.

  • Number of critical alerts

    Frequency of alerts requiring immediate action.
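
As an illustration of the first two metrics, the following sketch computes an error rate for one time window and a mean time to recover from a list of incidents. The function names and the `(detected_at, recovered_at)` tuple layout are assumptions chosen for this example; a monitoring system would normally derive these values from its own counters and alert history.

```python
from datetime import datetime, timedelta


def error_rate(failed: int, total: int) -> float:
    """Proportion of failed requests in one time window (0.0 if there was no traffic)."""
    return failed / total if total else 0.0


def mean_time_to_recover(incidents) -> timedelta:
    """Average of (recovered_at - detected_at) over all incidents."""
    durations = [recovered - detected for detected, recovered in incidents]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10)),  # 10 minutes
        (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 20)),  # 20 minutes
    ]
    print(error_rate(failed=5, total=200))       # 0.025
    print(mean_time_to_recover(incidents))       # 0:15:00
```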

HTTP API using RFC 7807 error format

A REST API returns standardized problem JSON for machine-readable error handling.
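
A problem document of this kind could be built as follows. The member names `type`, `title`, `status`, `detail` and `instance` and the media type `application/problem+json` come from RFC 7807 (since obsoleted by RFC 9457, which keeps the same members). The `problem_response` helper, the example URI and the `errors` extension member are hypothetical and framework-independent.

```python
import json


def problem_response(status, title, detail, type_uri="about:blank",
                     instance=None, **extensions):
    """Return (status, headers, body) for an RFC 7807 style error response."""
    body = {"type": type_uri, "title": title, "status": status, "detail": detail}
    if instance is not None:
        body["instance"] = instance
    # Extension members carry domain-specific, machine-readable details.
    body.update(extensions)
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)


if __name__ == "__main__":
    status, headers, body = problem_response(
        422,
        "Validation failed",
        "Field 'quantity' must be a positive integer.",
        type_uri="https://example.com/problems/validation-error",  # hypothetical URI
        instance="/orders/o-123",
        errors=[{"field": "quantity", "reason": "must be > 0"}],
    )
    print(status, headers["Content-Type"])
    print(body)
```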

Compensating transaction as recovery

On partial failure, compensating steps are executed to restore consistency.
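
One way to sketch this pattern: each step is paired with a compensation, and when a step fails, the compensations of the steps that already completed run in reverse order. The `run_with_compensation` function and the step names are assumptions for illustration. The sketch does not handle a compensation that itself fails, which a production design would need to address.

```python
def run_with_compensation(steps):
    """Run (action, compensate) pairs; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise  # the caller still learns that the operation failed


if __name__ == "__main__":
    log = []

    def charge_card():
        raise RuntimeError("payment provider unavailable")

    steps = [
        (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
        (lambda: log.append("create invoice"), lambda: log.append("cancel invoice")),
        (charge_card, lambda: log.append("refund card")),
    ]
    try:
        run_with_compensation(steps)
    except RuntimeError as exc:
        print("failed:", exc)
    print(log)
    # ['reserve stock', 'create invoice', 'cancel invoice', 'release stock']
```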

Circuit breaker on overload

After repeated failures, stops sending further requests to the failing dependency so that it can recover and callers fail fast.
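
A minimal, single-threaded sketch of the idea follows. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions for illustration; production libraries usually add thread safety, metrics and finer-grained half-open handling.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def flaky():
        raise ConnectionError("dependency down")

    for _ in range(3):
        try:
            breaker.call(flaky)
        except (ConnectionError, CircuitOpenError) as exc:
            print(type(exc).__name__)
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting for a timeout, which is also a natural place to trigger a fallback.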

  1. Define and prioritize an error taxonomy.
  2. Establish standardized error formats and codes.
  3. Implement logging and tracing conventions.
  4. Introduce and test fallbacks, retries and circuit breakers.
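
The first and last steps can be connected in code: a small taxonomy tells the retry logic which errors are worth retrying. The sketch below is illustrative only. The exception classes, the `retryable` flag and the `retry` helper are assumptions, and the backoff parameters are arbitrary. Retried operations should be idempotent, otherwise a retry may duplicate effects.

```python
import random
import time


class AppError(Exception):
    """Base class of a hypothetical error taxonomy."""
    retryable = False


class ValidationError(AppError):
    """Caller sent bad input; retrying cannot help."""


class TransientError(AppError):
    """Temporary fault such as a timeout; retrying may succeed."""
    retryable = True


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry retryable errors with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except AppError as exc:
            if not exc.retryable or attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries


if __name__ == "__main__":
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientError("timeout")
        return "payload"

    print(retry(fetch, base_delay=0.01), "after", calls["n"], "calls")
```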

⚠️ Technical debt & bottlenecks

  • Legacy components without standardized error format.
  • Incomplete tests for error cases and timeouts.
  • Missing retention strategy for structured logs.
External dependencies, Insufficient test coverage, Unstructured logs
  • Production logs lack correlation IDs, making it very hard to trace a request across components.
  • Client displays internal error messages to end users.
  • Automatic retries without idempotency cause duplicated effects.
  • Logging error details that contain sensitive data.
  • Too many alerts (alert fatigue) due to broad rules.
  • Fallbacks that create inconsistent data states.
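
The sensitive-data trap can be reduced by redacting known fields before error context reaches the logs. The following is a minimal sketch: the `SENSITIVE_KEYS` set and the `redact` helper are assumptions for illustration, and key-based redaction does not catch secrets embedded in free-text messages.

```python
# Assumed list of field names to mask; adapt to the actual data model.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(value):
    """Return a copy of nested dicts/lists with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    context = {
        "user": "alice",
        "password": "hunter2",
        "payment": {"card_number": "4111111111111111", "amount": 42},
    }
    print(redact(context))
    # {'user': 'alice', 'password': '[REDACTED]',
    #  'payment': {'card_number': '[REDACTED]', 'amount': 42}}
```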
Software architecture and error-handling principles, Experience with logging, tracing and monitoring, Knowledge of API design and protocols
Availability and fault tolerance, Observability and error diagnosis, API and user communication
  • Legal constraints for log handling
  • Limited resources for retention and storage
  • Legacy systems with incompatible error formats