Concept. Tags: Software Engineering, Reliability, Observability, Security

Error Handling

Core strategies for detecting, classifying and handling errors in software systems.

Maturity

Established

Cognitive load

Medium

Classification

Complexity: Medium
Impact area: Technical
Decision type: Design
Organizational maturity: Intermediate

Technical context

Integrations

Central logging solution (e.g. ELK/OpenSearch)
Monitoring and alerts (e.g. Prometheus)
Distributed tracing (e.g. Jaeger, OpenTelemetry)

Principles & goals

Principles

Validate inputs early and handle negative paths explicitly.
Provide consistent, machine-readable error formats.
Prefer recovery and fallback over silent failures.
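
The last principle can be sketched as a fallback that stays visible to callers and operators. This is a minimal illustration, not a prescribed design: `get_price`, the `fetch_live` callable, the dict-based `cache` and the "pricing" logger name are all hypothetical.

```python
import logging

logger = logging.getLogger("pricing")  # hypothetical logger name

def get_price(sku, fetch_live, cache):
    """Return (price, is_stale). Falls back to the cache, but never silently."""
    try:
        price = fetch_live(sku)
        cache[sku] = price
        return price, False
    except ConnectionError as exc:
        if sku in cache:
            # The fallback is logged and flagged so callers can react to stale data.
            logger.warning("live price unavailable for %s, serving cached value: %s", sku, exc)
            return cache[sku], True
        # No safe fallback exists: surface the error instead of inventing a value.
        raise
```

Returning the staleness flag alongside the value lets the caller decide whether stale data is acceptable for its use case.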

Value stream stage

Build

Organizational level

Team, Domain

Use cases & scenarios

Use cases

Scenarios

Compromises

Risks

Incorrect classification leads to false alerts.
Overly coarse fallbacks can create inconsistent states.
Insufficient logs greatly hinder debugging.

Best practices

Validate inputs early and return clear HTTP status codes.
Propagate correlation IDs across components.
Log errors with rich context and structured format (JSON).
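
One way to combine correlation IDs with structured JSON logs is sketched below using only the Python standard library. The names `JsonFormatter`, `build_logger`, `handle_request`, the "orders" logger and the `extra={"context": ...}` convention are assumptions made for this example, not an established API.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID of the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Extra context passed via extra={"context": {...}} (convention of this sketch).
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical component name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID when present so log lines match up across components.
    token = correlation_id.set(incoming_id or str(uuid.uuid4()))
    try:
        try:
            raise TimeoutError("payment service timed out")  # simulated failure
        except TimeoutError:
            logger.error("charge failed", exc_info=True,
                         extra={"context": {"order_id": "o-1"}})
    finally:
        correlation_id.reset(token)
```

In a real service the incoming ID would typically be read from a request header and forwarded on every outgoing call.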

I/O & resources

Inputs

Requirements and SLO definitions
Error and exception taxonomy
Logging and tracing infrastructure

Outputs

Standardized error responses
Alerts, metrics and dashboards
Documentation of recovery strategies

Resources

Description

Error handling defines strategies and mechanisms to detect, classify and respond to faults in software systems. It covers input validation, consistent error responses, fallback paths and recovery tactics to ensure resilient runtime behavior. Effective error handling reduces downtime, improves debugging and supports clear user communication.

✔ Benefits

Reduces downtime through clear recovery strategies.
Facilitates root-cause analysis via structured logs and traces.
Improves API interoperability via standardized error formats.

✖ Limitations

Not all errors can be fully auto-repaired.
Requires additional effort for design, logging and tests.
Standardized formats may hide domain-specific details.

Trade-offs

Metrics

Error rate
Proportion of failed requests per time window.
Mean Time To Recover
Average time to recover after an error.
Number of critical alerts
Frequency of alerts requiring immediate action.
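
The first two metrics reduce to simple arithmetic. The sketch below assumes incidents are recorded with detection and recovery timestamps in seconds; the `Incident` type and both function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    detected_at: float   # seconds since some reference point
    recovered_at: float

def error_rate(failed, total):
    """Proportion of failed requests within one time window."""
    return failed / total if total else 0.0

def mean_time_to_recover(incidents):
    """Average recovery duration in seconds across incidents."""
    if not incidents:
        return 0.0
    durations = [i.recovered_at - i.detected_at for i in incidents]
    return sum(durations) / len(durations)
```

For example, 5 failures out of 200 requests is an error rate of 2.5 %, and two incidents lasting 120 s and 240 s give a mean time to recover of 180 s.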

Examples & implementations

HTTP API using RFC 7807 error format

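A REST API returns standardized problem details JSON (media type application/problem+json) so that clients can handle errors programmatically. RFC 7807 has since been obsoleted by RFC 9457, which keeps the same core members.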
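
A minimal sketch of building such a response body follows. The member names (`type`, `title`, `status`, `detail`, `instance`) and the media type come from RFC 7807; the helper `problem_response`, its tuple return shape and the example URI are assumptions for illustration.

```python
import json

def problem_response(status, title, detail, type_uri="about:blank",
                     instance=None, **extensions):
    """Build (status, headers, body) for an RFC 7807 problem details response."""
    body = {"type": type_uri, "title": title, "status": status, "detail": detail}
    if instance is not None:
        body["instance"] = instance
    # Extension members carry problem-specific data alongside the standard ones.
    body.update(extensions)
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)

# Usage: a rejected charge, with a hypothetical problem type URI.
status, headers, body = problem_response(
    403,
    "Insufficient funds",
    "Balance is 30, but the charge costs 50.",
    type_uri="https://example.com/problems/insufficient-funds",
    instance="/accounts/12345/charges/abc",
    balance=30,
)
```

Extension members such as `balance` are one way to keep domain-specific details that a standardized format might otherwise hide.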

Compensating transaction as recovery

On partial failure, compensating steps are executed to restore consistency.
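
The idea can be sketched as a list of (action, compensation) pairs, where the compensations of already completed steps run in reverse order once a later step fails. `run_saga` and the step names in the usage are hypothetical, and a production version would also need to handle failures inside the compensations themselves.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps if one fails."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo in reverse order so later effects are reverted first.
        for compensate in reversed(completed):
            compensate()
        raise

# Usage with stand-in steps that only record what happened.
events = []

def ship():
    raise RuntimeError("shipping unavailable")

steps = [
    (lambda: events.append("reserve stock"), lambda: events.append("release stock")),
    (lambda: events.append("charge card"), lambda: events.append("refund card")),
    (ship, lambda: events.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass  # events now ends with "refund card", "release stock"
```

The failed step's own compensation is not run, because that step never completed.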

Circuit breaker on overload

Stops sending requests to a failing dependency after repeated failures, which protects both the caller and the overloaded service while it recovers.
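
A simplified circuit breaker might look like the sketch below. The class name, the thresholds and the injectable `clock` parameter are assumptions for this example; it is single-threaded and omits the metrics and locking that a production implementation would need.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a dependency that is unlikely to respond.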

Implementation steps

Define and prioritize an error taxonomy.

Establish standardized error formats and codes.

Implement logging and tracing conventions.

Introduce and test fallbacks, retries and circuit breakers.
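
As a rough illustration of how the first and last steps connect, the sketch below classifies exceptions into a small taxonomy and retries only the transient category with exponential backoff. The categories, the exception-to-category mapping and the function names are assumptions; a real taxonomy would be derived from the system's own failure modes.

```python
import time
from enum import Enum

class ErrorCategory(Enum):
    VALIDATION = "validation"  # caller error; retrying cannot help
    TRANSIENT = "transient"    # may succeed on a later attempt
    INTERNAL = "internal"      # likely a bug; surface and alert

def classify(exc):
    """Map an exception to a category (illustrative mapping)."""
    if isinstance(exc, (ValueError, KeyError)):
        return ErrorCategory.VALIDATION
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.TRANSIENT
    return ErrorCategory.INTERNAL

def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff; re-raise everything else."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if classify(exc) is not ErrorCategory.TRANSIENT or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so that tests can exercise the retry path without real delays.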

⚠️ Technical debt & bottlenecks

Technical debt

Legacy components without standardized error format.
Incomplete tests for error cases and timeouts.
Missing retention strategy for structured logs.

Known bottlenecks

External dependencies
Insufficient test coverage
Unstructured logs

Misuse examples

Production logs lack correlation IDs, so a request cannot be traced across components during debugging.
Client displays internal error messages to end users.
Automatic retries without idempotency cause duplicate side effects.
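
The retry problem in the last item is commonly addressed with idempotency keys. The sketch below is a hypothetical in-memory `PaymentService`: a retried request carrying the same key receives the stored result instead of triggering a second charge. A real service would persist the keys and expire them.

```python
class PaymentService:
    """In-memory stand-in for a service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = []    # side effects that actually happened
        self._results = {}   # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        # A repeated key means a retry: return the original result, do nothing else.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount)
        result = {"charge_id": len(self.charges), "amount": amount}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical operation and reuses it on every retry of that operation.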

Typical traps

Logging error details that contain sensitive data.
Too many alerts (alert fatigue) due to broad rules.
Fallbacks that create inconsistent data states.
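
For the first trap, one possible mitigation is to redact known sensitive fields before an event reaches the log. The key list and the `redact` helper below are assumptions for illustration; key-based redaction does not catch secrets embedded in free-text messages.

```python
# Hypothetical list of field names that must never be logged in clear text.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}

def redact(value):
    """Return a copy of value with sensitive fields masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Because the function builds a copy, the original event stays intact for the code path that actually needs the values.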

Required skills

Software architecture and error-handling principles
Experience with logging, tracing and monitoring
Knowledge of API design and protocols

Architectural drivers

Availability and fault tolerance
Observability and error diagnosis
API and user communication

Constraints

Legal constraints for log handling
Limited resources for retention and storage
Legacy systems with incompatible error formats