Error Handling
Core strategies for detecting, classifying and handling errors in software systems.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Design
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
- Incorrect classification leads to false alerts.
- Overly coarse fallbacks can create inconsistent states.
- Insufficient logs greatly hinder debugging.
- Validate errors early and use clear HTTP status codes.
- Propagate correlation IDs across components.
- Log errors with rich context and structured format (JSON).
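The correlation-ID and structured-logging practices above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed implementation: the `correlation_id` context variable, the `ensure_correlation_id` helper and the `context` attribute on log records are hypothetical names chosen for this example.

```python
import json
import logging
import uuid
from contextvars import ContextVar
from typing import Optional

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Assumed convention: callers attach extra fields via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def ensure_correlation_id(incoming: Optional[str]) -> str:
    """Reuse the caller's ID if one was sent, otherwise create a new one."""
    cid = incoming or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid
```

A service would typically call `ensure_correlation_id` once per incoming request, forward the same value on outgoing calls, and attach `JsonFormatter` to its log handler so every entry can be joined across components.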
I/O & resources
- Requirements and SLO definitions
- Error and exception taxonomy
- Logging and tracing infrastructure
- Standardized error responses
- Alerts, metrics and dashboards
- Documentation of recovery strategies
Description
Error handling defines strategies and mechanisms to detect, classify and respond to faults in software systems. It covers input validation, consistent error responses, fallback paths and recovery tactics to ensure resilient runtime behavior. Effective error handling reduces downtime, improves debugging and supports clear user communication.
✔ Benefits
- Reduces downtime through clear recovery strategies.
- Facilitates root-cause analysis via structured logs and traces.
- Improves API interoperability via standardized error formats.
✖ Limitations
- Not all errors can be fully auto-repaired.
- Requires additional effort for design, logging and tests.
- Standardized formats may hide domain-specific details.
Trade-offs
Metrics
- Error rate
Proportion of failed requests per time window.
- Mean Time To Recover
Average time to recover after an error.
- Number of critical alerts
Frequency of alerts requiring immediate action.
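As a rough sketch of how the first two metrics could be computed, assuming request counts per window and incidents recorded as (detected, recovered) timestamp pairs. The function names and input shapes are illustrative assumptions, not part of any specific monitoring tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Proportion of failed requests in one time window (0.0 when idle)."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def mean_time_to_recover(
    incidents: Iterable[Tuple[datetime, datetime]],
) -> timedelta:
    """Average of (recovered - detected) over all recorded incidents."""
    durations = [recovered - detected for detected, recovered in incidents]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)
```

For example, 5 failures out of 200 requests gives an error rate of 0.025, and two incidents lasting 10 and 20 minutes give a Mean Time To Recover of 15 minutes.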
Examples & implementations
HTTP API using RFC 7807 error format
A REST API returns standardized problem JSON for machine-readable error handling.
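A minimal sketch of building such a response follows. The member names `type`, `title`, `status`, `detail` and `instance` and the media type `application/problem+json` come from RFC 7807; the `problem_response` helper, its return shape and the example URI are assumptions made for illustration and are not tied to any web framework.

```python
import json
from typing import Any, Dict, Optional, Tuple

PROBLEM_CONTENT_TYPE = "application/problem+json"


def problem_response(
    status: int,
    title: str,
    detail: Optional[str] = None,
    type_uri: str = "about:blank",
    instance: Optional[str] = None,
    **extensions: Any,
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for an RFC 7807 problem document."""
    # Extension members go in first so they can never override the
    # standard members written afterwards.
    body: Dict[str, Any] = dict(extensions)
    body.update({"type": type_uri, "title": title, "status": status})
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    headers = {"Content-Type": PROBLEM_CONTENT_TYPE}
    return status, headers, json.dumps(body)


# Usage: a validation failure with a hypothetical problem type URI.
status, headers, body = problem_response(
    422,
    "Validation failed",
    detail="Field 'email' is not a valid address.",
    type_uri="https://example.com/problems/validation",
    correlation_id="abc-123",
)
```

Carrying the correlation ID as an extension member lets a client quote it in a support request without the API exposing internal error messages.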
Compensating transaction as recovery
On partial failure, compensating steps are executed to restore consistency.
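One way to picture this is a runner that records a compensation for every completed step and, on failure, executes them in reverse order. The sketch below is a simplified illustration: `run_with_compensation` and the step representation are hypothetical, and a production version would also have to deal with compensations that fail themselves.

```python
from typing import Callable, List, Sequence, Tuple

# Each step is an (action, compensation) pair of zero-argument callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: Sequence[Step]) -> None:
    """Run steps in order; on failure undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            # Only steps that finished successfully need to be undone.
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise  # Surface the original failure to the caller.
```

The failing step itself is not compensated, since it never completed. The original exception is re-raised so the caller can still classify and report it.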
Circuit breaker on overload
Protects dependent services from further requests after repeated failures.
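The following is a deliberately small sketch of the pattern, assuming a consecutive-failure threshold and a fixed reset timeout. The class and parameter names are illustrative, it is not thread-safe, and real deployments usually rely on an established resilience library instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # Injectable so tests can control time.
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half_open"  # One trial call is allowed through.
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of adding load to the struggling dependency. After the timeout, a single successful trial call closes the circuit again.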
Implementation steps
Define and prioritize an error taxonomy.
Establish standardized error formats and codes.
Implement logging and tracing conventions.
Introduce and test fallbacks, retries and circuit breakers.
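The first two steps can be sketched as a small taxonomy that maps each error category to a handling policy. The categories, status codes and flags below are example choices made for illustration, not a recommended standard; a real taxonomy would be derived from the system's own requirements and SLOs.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(Enum):
    VALIDATION = "validation"
    NOT_FOUND = "not_found"
    DEPENDENCY_UNAVAILABLE = "dependency_unavailable"
    INTERNAL = "internal"


@dataclass(frozen=True)
class Policy:
    http_status: int
    retryable: bool  # May the caller safely retry?
    alert: bool      # Should this page someone?


# Example policy table; the values are assumptions for this sketch.
POLICIES = {
    ErrorCategory.VALIDATION: Policy(400, retryable=False, alert=False),
    ErrorCategory.NOT_FOUND: Policy(404, retryable=False, alert=False),
    ErrorCategory.DEPENDENCY_UNAVAILABLE: Policy(503, retryable=True, alert=True),
    ErrorCategory.INTERNAL: Policy(500, retryable=False, alert=True),
}


class AppError(Exception):
    """Application error that carries its taxonomy category."""

    def __init__(self, category: ErrorCategory, message: str) -> None:
        super().__init__(message)
        self.category = category


def classify(exc: Exception) -> Policy:
    """Unclassified exceptions fall back to the INTERNAL policy."""
    if isinstance(exc, AppError):
        return POLICIES[exc.category]
    return POLICIES[ErrorCategory.INTERNAL]
```

Keeping the policy in one table makes the classification reviewable in a single place, which helps against the false alerts that incorrect classification produces.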
⚠️ Technical debt & bottlenecks
Technical debt
- Legacy components without standardized error format.
- Incomplete tests for error cases and timeouts.
- Missing retention strategy for structured logs.
Known bottlenecks
Misuse examples
- Production logs lack correlation IDs, so a request cannot be traced across components during debugging.
- Client displays internal error messages to end users.
- Automatic retries without idempotency cause duplicated effects.
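The retry misuse above can be avoided by sending the same idempotency key on every attempt and having the receiver deduplicate on it. The sketch below assumes a hypothetical `PaymentService` and a `TransientError` type; both are invented for illustration.

```python
import time
import uuid
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that is assumed to be safe to retry."""


def retry_idempotent(
    operation: Callable[[str], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry with exponential backoff, reusing one idempotency key."""
    key = str(uuid.uuid4())  # Generated once, shared by all attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(key)
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


class PaymentService:
    """Hypothetical receiver that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges = 0

    def charge(self, key: str, amount: int) -> str:
        if key in self._results:
            # Replay of an earlier request: return the stored result.
            return self._results[key]
        self.charges += 1
        self._results[key] = f"charged {amount}"
        return self._results[key]
```

If the first attempt succeeds on the server but its response is lost, the retry carries the same key, so the service returns the stored result instead of charging twice.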
Typical traps
- Logging error details that contain sensitive data.
- Too many alerts (alert fatigue) due to broad rules.
- Fallbacks that create inconsistent data states.
Required skills
Architectural drivers
Constraints
- Legal constraints for log handling
- Limited resources for retention and storage
- Legacy systems with incompatible error formats