Troubleshooting
A structured process for rapid detection, diagnosis and remediation of technical faults in systems and workflows.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Compromises
- Wrong hypotheses delay resolution
- Temporary workarounds instead of sustainable fixes
- Knowledge remains in silos instead of being documented

Mitigations
- Reduce alert noise; define SLO-oriented alerts
- Document reproduction steps and context in tickets
- Use automated diagnostic tools and checklists
I/O & resources
Inputs:
- Monitoring metrics and dashboards
- Structured logs and traces
- Runbooks, playbooks and system documentation
Outputs:
- Restored functionality
- Root‑cause analysis and permanent fixes
- Updated runbooks and preventive measures
Description
Troubleshooting is the structured process for identifying, diagnosing, and resolving technical failures in systems and workflows. It uses hypothesis-driven investigation, analysis of logs, metrics and traces, and targeted remediation to restore service. The practice also captures root causes and improves system observability and resilience.
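The hypothesis-driven investigation described above can be sketched as a loop that tests ranked hypotheses against observed data. The symptom fields and check functions below are invented for illustration, not a real diagnostic API:

```python
# Hypothetical sketch of a hypothesis-driven troubleshooting loop.
# Symptom fields and checks are illustrative assumptions.

def diagnose(symptom, hypotheses):
    """Test hypotheses in priority order; return the first confirmed one.

    `hypotheses` is a list of (name, check) pairs, where `check` inspects
    the symptom data and returns True if the hypothesis is confirmed.
    """
    for name, check in hypotheses:
        if check(symptom):
            return name           # confirmed root-cause candidate
    return None                   # nothing confirmed; gather more data

# Example: high API latency (values are invented).
symptom = {"p95_latency_ms": 900, "db_query_ms": 750, "cpu_pct": 35}

hypotheses = [
    ("cpu_saturation", lambda s: s["cpu_pct"] > 90),
    ("slow_db_query",  lambda s: s["db_query_ms"] > 0.5 * s["p95_latency_ms"]),
]

diagnose(symptom, hypotheses)   # -> "slow_db_query"
```

The key property is that each hypothesis is checked against data before acting, rather than guessed at, and a miss sends the responder back to data collection instead of to a blind fix.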
✔ Benefits
- Faster service restoration
- Improved system understanding and prevention
- Documented knowledge and runbooks
✖ Limitations
- Not always immediately reproducible
- Dependence on metric and log quality
- May require organizational coordination
Metrics
- Mean Time To Resolve (MTTR)
Average time from incident onset to full resolution.
- Time To Detect (TTD)
Average time until a problem is detected by monitoring or users.
- Number of reopened incidents
Count or rate of incidents that reoccur after an apparent fix.
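The three metrics above can be computed from simple incident records. A minimal sketch; the record fields (`started`, `detected`, `resolved`, `reopened`) are assumed names, and the sample data is invented:

```python
from datetime import datetime, timedelta
from statistics import mean

# Sketch computing TTD, MTTR and reopen rate from incident records.
# Field names and sample values are illustrative assumptions.

def incident_metrics(incidents):
    ttd  = mean((i["detected"] - i["started"]).total_seconds() for i in incidents)
    mttr = mean((i["resolved"] - i["started"]).total_seconds() for i in incidents)
    reopen_rate = sum(i["reopened"] for i in incidents) / len(incidents)
    return {"ttd_s": ttd, "mttr_s": mttr, "reopen_rate": reopen_rate}

t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=5),
     "resolved": t0 + timedelta(minutes=45), "reopened": False},
    {"started": t0, "detected": t0 + timedelta(minutes=15),
     "resolved": t0 + timedelta(minutes=75), "reopened": True},
]
incident_metrics(incidents)
# -> average TTD 10 min, average MTTR 60 min, reopen rate 0.5
```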
Examples & implementations
Latency spike after feature rollout
After a feature launch, API latencies rose; the cause was an unoptimized query in a new code path.
Worker crashes sporadically
Intermittent null pointer in a third-party library caused restarts; reproducible with specific input.
CI tests fail after dependency upgrade
An upgraded test library changed assumptions; tests had to be adapted or the library rolled back.
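For a regression like the latency-spike example above, the first diagnostic step is often a before/after comparison of a tail percentile. A minimal sketch; the samples and the 20% tolerance are invented for illustration:

```python
from statistics import quantiles

# Sketch of a pre-/post-rollout p95 latency comparison.
# Sample data and the tolerance factor are illustrative assumptions.

def p95(samples_ms):
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(samples_ms, n=20)[18]

def latency_regressed(before_ms, after_ms, tolerance=1.2):
    """Flag a regression if post-rollout p95 exceeds pre-rollout p95 by >20%."""
    return p95(after_ms) > tolerance * p95(before_ms)

before = [100, 110, 105, 98, 120, 115, 102, 108, 111, 99] * 2
after  = [100, 110, 105, 98, 900, 880, 102, 108, 860, 99] * 2  # spike in new path
latency_regressed(before, after)   # True
```

Comparing a tail percentile rather than the mean matters here: a spike in one new code path can leave the average nearly unchanged while p95 clearly regresses.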
Implementation steps
Establish visibility: instrument relevant metrics, logs and traces
Create runbooks and playbooks and make them accessible
Define escalation and communication paths
Establish regular post-mortems and knowledge transfer
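The alert context required by the steps above (and called out as a trap when missing) can be enforced at alert-creation time. A minimal sketch; the field names and the validation rule are assumptions, not a specific monitoring tool's API:

```python
import json

# Sketch: every alert must carry an owner and a runbook link so the
# responder knows whom to escalate to and where to start diagnosis.
# Field names are illustrative assumptions.

def make_alert(name, severity, owner, runbook_url, summary):
    alert = {
        "name": name,
        "severity": severity,
        "owner": owner,           # team to escalate to
        "runbook": runbook_url,   # first diagnostic steps
        "summary": summary,
    }
    # Refuse to emit alerts that responders cannot act on.
    missing = [k for k in ("owner", "runbook") if not alert[k]]
    if missing:
        raise ValueError(f"alert {name!r} missing context: {missing}")
    return json.dumps(alert)
```

Rejecting context-free alerts at definition time is cheaper than discovering mid-incident that nobody knows who owns the alert or where its runbook lives.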
⚠️ Technical debt & bottlenecks
Technical debt
- Incomplete or inconsistent logs
- Missing tests for reproducing failures
- Outdated runbooks and obsolete playbooks
Known bottlenecks
Misuse examples
- Only applying quick fixes repeatedly without addressing root cause
- Ignoring monitoring data and relying solely on user reports
- Withholding privileged access, making diagnosis impossible
Typical traps
- Forming hypotheses before collecting data
- Firefighting mentality: applying only short-term fixes instead of documenting them
- Missing context in alerts (owner, runbook)
Architectural drivers
Constraints
- Limited access rights to production data
- Costs of long-term metric and log retention
- Regulatory constraints on data privacy