Catalog
Concept · Observability · Reliability · DevOps · Software Engineering

Troubleshooting

A structured process for rapid detection, diagnosis and remediation of technical faults in systems and workflows.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Monitoring systems (e.g. Prometheus, Datadog)
  • Logging/tracing platforms (ELK, Jaeger)
  • Incident management (PagerDuty, Opsgenie)
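
A minimal sketch of how such tooling is consumed during diagnosis, assuming a Prometheus server at a hypothetical URL and hypothetical metric names: it runs an instant query against the Prometheus HTTP API and prints per-service error ratios.

```python
# Sketch: run an instant query against the Prometheus HTTP API.
# The server URL and metric/label names are illustrative assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

def instant_query(expr: str) -> list[dict]:
    """Return the result vectors of an instant query."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": expr}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    # 5xx ratio per service over the last 5 minutes (metric names are hypothetical).
    expr = ('sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
            ' / sum by (service) (rate(http_requests_total[5m]))')
    for sample in instant_query(expr):
        print(sample["metric"].get("service", "unknown"), sample["value"][1])
```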

Principles & goals

  • Observability as prerequisite
  • Hypothesis-driven approach
  • Reproduce before fix (see the sketch below)
Run
Team, Domain, Enterprise
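
A minimal sketch of the "reproduce before fix" principle, assuming a hypothetical parse_order function and a failing input captured from an incident: the failure is first pinned down as an automated test, and the expected behavior is recorded as an expected-to-fail test until the fix lands.

```python
# Sketch of "reproduce before fix": capture the failure as a test before changing code.
# parse_order and its inputs are hypothetical examples, not part of this entry.
import pytest

def parse_order(raw: dict) -> float:
    """Buggy example: crashes when the optional 'discount' field is missing."""
    return raw["amount"] - raw["discount"]   # KeyError for orders without a discount

def test_reproduces_missing_discount_crash():
    """Minimal reproduction taken from the incident; kept as a regression test."""
    with pytest.raises(KeyError):
        parse_order({"amount": 100.0})

@pytest.mark.xfail(reason="bug not fixed yet: a missing discount should mean no discount")
def test_missing_discount_defaults_to_zero():
    assert parse_order({"amount": 100.0}) == 100.0
```

Once the fix lands, the xfail marker is removed and both tests continue to guard against regressions.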

Use cases & scenarios

Compromises

  • Wrong hypotheses delay resolution
  • Temporary workarounds instead of sustainable fixes
  • Knowledge remains in silos instead of documentation

Mitigations

  • Reduce alert noise, define SLO-oriented alerts (see the burn-rate sketch after this list)
  • Document reproduction steps and context in tickets
  • Use automated diagnostic tools and checklists
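
As a sketch of what "SLO-oriented alerts" can mean in practice, the snippet below pages on error-budget burn rate over two windows instead of on raw error spikes; the 99.9% objective and the 14.4 fast-burn threshold are illustrative assumptions.

```python
# Sketch: page on error-budget burn rate rather than raw error counts.
# The SLO target and thresholds are illustrative assumptions.
SLO_TARGET = 0.999               # assumed 99.9% success objective
ERROR_BUDGET = 1.0 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(errors: int, total: int) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return 0.0 if total == 0 else (errors / total) / ERROR_BUDGET

def should_page(errors_1h: int, total_1h: int, errors_6h: int, total_6h: int) -> bool:
    """Multi-window rule: page only if both the short and the long window burn fast."""
    return burn_rate(errors_1h, total_1h) > 14.4 and burn_rate(errors_6h, total_6h) > 14.4

# Example: 90 of 50,000 requests failed in the last hour, 300 of 280,000 over six hours.
print(should_page(90, 50_000, 300, 280_000))   # False: noisy, but the budget is safe
```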

I/O & resources

Inputs

  • Monitoring metrics and dashboards
  • Structured logs and traces (see the logging sketch after this list)
  • Runbooks, playbooks and system documentation

Outputs

  • Restored functionality
  • Root-cause analysis and permanent fixes
  • Updated runbooks and preventive measures
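
A minimal sketch of producing the structured logs listed above, assuming a JSON-per-line format and hypothetical context fields (request_id, duration_ms) so that later diagnosis can filter and correlate entries.

```python
# Sketch: emit JSON-structured log lines that can be parsed and correlated later.
# The logger name and context fields are illustrative assumptions.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "context", {}))  # structured context from `extra=`
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment processed",
         extra={"context": {"request_id": "abc-123", "duration_ms": 87, "status": 200}})
```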

Description

Troubleshooting is the structured process for identifying, diagnosing, and resolving technical failures in systems and workflows. It uses hypothesis-driven investigation, analysis of logs, metrics and traces, and targeted remediation to restore service. The practice also captures root causes and improves system observability and resilience.

  • Faster service restoration
  • Improved system understanding and prevention
  • Documented knowledge and runbooks

  • Not always immediately reproducible
  • Dependence on metric and log quality
  • May require organizational coordination

  • Mean Time To Resolve (MTTR)

    Average time from incident onset to full resolution.

  • Time To Detect (TTD)

    Average time until a problem is detected by monitoring or users.

  • Number of reopened incidents

    Count or rate of incidents that reoccur after an apparent fix.
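
A minimal sketch of computing these three metrics from incident records; the record fields (started_at, detected_at, resolved_at, reopened) are assumptions about how incidents are tracked.

```python
# Sketch: compute MTTR, TTD and the reopen rate from incident records.
# The record structure is an assumption about the incident tracker.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    started_at: datetime    # when the fault actually began
    detected_at: datetime   # when monitoring or a user reported it
    resolved_at: datetime   # when service was fully restored
    reopened: bool = False  # True if the incident recurred after an apparent fix

def mttr(incidents: list[Incident]) -> timedelta:
    return timedelta(seconds=mean(
        (i.resolved_at - i.started_at).total_seconds() for i in incidents))

def ttd(incidents: list[Incident]) -> timedelta:
    return timedelta(seconds=mean(
        (i.detected_at - i.started_at).total_seconds() for i in incidents))

def reopen_rate(incidents: list[Incident]) -> float:
    return sum(i.reopened for i in incidents) / len(incidents)
```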

Latency spike after feature rollout

After a feature launch, API latencies rose; the cause was an unoptimized query in a new code path.
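
A minimal sketch of narrowing such a spike down to a code path: a timing decorator that logs whenever a call exceeds its latency budget; the function name and the 200 ms budget are illustrative assumptions.

```python
# Sketch: coarse timing instrumentation to locate the slow code path.
# The wrapped function and the 200 ms budget are illustrative assumptions.
import functools
import logging
import time

log = logging.getLogger("latency")

def timed(budget_ms: float):
    """Log a warning whenever the wrapped call exceeds its latency budget."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > budget_ms:
                    log.warning("%s took %.1f ms (budget %.0f ms)",
                                func.__name__, elapsed_ms, budget_ms)
        return wrapper
    return decorator

@timed(budget_ms=200)
def load_recommendations(user_id: int):
    ...  # the new code path suspected of running the unoptimized query
```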

Worker crashes sporadically

An intermittent null-pointer error in a third-party library caused restarts; it was reproducible with a specific input.
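
A minimal sketch of making such an intermittent failure deterministic: replay captured inputs against the suspect call and report which one crashes. process_payload and the sample inputs are hypothetical stand-ins for the third-party call.

```python
# Sketch: replay captured inputs against the suspect call to isolate the crashing one.
# process_payload and the sample inputs are hypothetical stand-ins.
import traceback

def process_payload(payload: dict) -> str:
    # stands in for the third-party call that intermittently fails
    return payload["user"]["name"].upper()   # TypeError when "user" is None

captured_inputs = [
    {"user": {"name": "alice"}},
    {"user": None},                          # the specific input that triggers the crash
    {"user": {"name": "bob"}},
]

for i, payload in enumerate(captured_inputs):
    try:
        process_payload(payload)
    except Exception:
        print(f"input #{i} reproduces the crash:")
        traceback.print_exc()
```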

CI tests fail after dependency upgrade

An upgraded test library changed behavior the tests relied on; the tests had to be adapted or the library rolled back.
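
A minimal sketch of guarding against this kind of drift: fail the suite early when the installed test library is outside the version range the assertions were written for. The package name and bounds are illustrative assumptions; the comparison uses the third-party packaging library.

```python
# Sketch: fail fast when a dependency upgrade leaves the tested version range.
# Package name and version bounds are illustrative assumptions.
from importlib.metadata import version as installed_version
from packaging.version import Version   # third-party "packaging" library

TESTED_LOWER = Version("7.0")
TESTED_UPPER = Version("8.0")

def check_test_library(name: str = "pytest") -> None:
    current = Version(installed_version(name))
    if not (TESTED_LOWER <= current < TESTED_UPPER):
        raise RuntimeError(
            f"{name} {current} is outside the tested range "
            f"[{TESTED_LOWER}, {TESTED_UPPER}): adapt the tests or roll the upgrade back."
        )
```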

  1. Establish visibility: instrument relevant metrics, logs and traces (see the sketch after this list)
  2. Create runbooks and playbooks and make them accessible
  3. Define escalation and communication paths
  4. Establish regular post-mortems and knowledge transfer
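
For step 1, a minimal sketch of instrumenting a service with the prometheus_client library so request counts, status codes and latency are visible before users report problems; the metric and label names are illustrative assumptions.

```python
# Sketch for step 1: expose request metrics as a Prometheus scrape target.
# Metric and label names are illustrative assumptions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint", "status"])
LATENCY = Histogram("app_request_duration_seconds", "Request duration", ["endpoint"])

def handle_request(endpoint: str) -> None:
    start = time.perf_counter()
    status = "200"
    try:
        ...  # actual request handling would go here
    except Exception:
        status = "500"
        raise
    finally:
        LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)
        REQUESTS.labels(endpoint=endpoint, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # metrics served on :8000/metrics for Prometheus to scrape
    while True:
        handle_request("/health")
        time.sleep(5)
```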

⚠️ Technical debt & bottlenecks

  • Incomplete or inconsistent logs
  • Missing tests for reproducing failures
  • Outdated runbooks and obsolete playbooks

  • Insufficient telemetry
  • Knowledge silos in the team
  • Complex inter-service dependencies

  • Repeatedly applying quick fixes without addressing the root cause
  • Ignoring monitoring data and relying solely on user reports
  • Withholding privileged access, making diagnosis impossible
  • Making assumptions before collecting data
  • Firefighter mentality: fixing only short-term instead of documenting
  • Missing context in alerts (owner, runbook)

  • Log and metrics analysis
  • System and network diagnosis
  • Hypothesis formation and experimental testing

  • Observability and telemetry
  • SLOs and availability
  • Fast feedback and release cycles

  • Limited access rights to production data
  • Costs for long-term metrics and storage
  • Regulatory constraints for data privacy