concept #Reliability #Observability #DevOps #Software Engineering

Severity Levels

Categorizes impact and urgency of incidents to drive prioritization, escalation and response times in operations.

Established
Medium

Classification

  • Medium
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • Alerting systems (e.g., Prometheus, Datadog)
  • Incident management tools (e.g., PagerDuty)
  • Ticketing and communication platforms (e.g., Jira, Slack)

Principles & goals

  • Clear, measurable criteria for each severity tier
  • Define short- and mid-term response objectives per tier
  • Transparent communication and responsibilities
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Misclassification leads to incorrect prioritization
  • Overuse of high severity tiers reduces their effectiveness
  • Unclear criteria create conflicts between teams
  • Use clear, measurable criteria instead of vague descriptions
  • Automatic initial classification, with manual review when in doubt
  • Regular training and drills for on-call teams
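
The practice of automatic initial classification with manual review in doubtful cases can be sketched as follows. This is a minimal illustration, not a prescribed scheme: the tier names (SEV1 to SEV4), the `Signal` fields, every threshold, and the 20% "borderline" margin are assumptions chosen for the example and would need to be replaced by an organization's own measurable criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical four-tier scheme; lower number means higher severity."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass
class Signal:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    users_affected_pct: float  # share of users affected, 0 to 100
    critical_path: bool        # True if a critical service path is hit


# Assumed error-rate thresholds, for illustration only.
SEV1_RATE, SEV2_RATE, SEV3_RATE = 0.50, 0.10, 0.01


def classify(signal: Signal) -> tuple[Severity, bool]:
    """Return (initial severity, needs_manual_review)."""
    if signal.critical_path and signal.error_rate >= SEV1_RATE:
        severity = Severity.SEV1
    elif signal.error_rate >= SEV2_RATE or signal.users_affected_pct >= 25:
        severity = Severity.SEV2
    elif signal.error_rate >= SEV3_RATE:
        severity = Severity.SEV3
    else:
        severity = Severity.SEV4

    # Flag borderline cases: error rate within 20% (relative) of any threshold.
    needs_review = any(
        abs(signal.error_rate - t) <= 0.2 * t
        for t in (SEV1_RATE, SEV2_RATE, SEV3_RATE)
    )
    return severity, needs_review
```

A signal just above a threshold, such as an error rate of 0.11, receives a tier automatically but is flagged so that a human confirms or corrects the classification.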

I/O & resources

  • Monitoring and alert data
  • SLA and contractual information
  • Service topology and dependencies
  • Assigned severity tier
  • Escalation and communication plan
  • Post-incident report with learnings

Description

Severity levels classify impact and urgency of incidents, outages or defects using defined criteria. They provide clear escalation paths, prioritization and response time objectives across teams and systems. This enables coordinated incident management, efficient allocation of resources and traceable communication during operational incidents.

  • Faster decision-making during incidents
  • Consistent escalation processes and clearer ownership
  • Improved SLA adherence and resource prioritization

  • Overly rigid or numerous tiers make the scheme cumbersome
  • Subjective classification without clear criteria
  • Maintenance effort when operational conditions change

  • MTTR (Mean Time to Recovery)

    Average time from incident detection to restoration.

  • Number of incidents per severity tier

    Distribution of incidents across defined severity levels.

  • SLA compliance rate

    Percentage of incidents resolved within SLA timeframes.
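
The three metrics above can be computed from a list of incident records. The sketch below is one possible way to do so; the `Incident` fields, the tier names, and the SLA targets in `SLA_TARGETS` are hypothetical values chosen for illustration, and the functions assume a non-empty list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    severity: str
    detected_at: datetime
    restored_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored_at - self.detected_at


# Hypothetical restoration targets per tier (assumption, not from any standard).
SLA_TARGETS = {
    "SEV1": timedelta(hours=1),
    "SEV2": timedelta(hours=4),
    "SEV3": timedelta(hours=24),
}


def mttr(incidents: list[Incident]) -> timedelta:
    """Average time from detection to restoration."""
    total = sum((i.duration for i in incidents), timedelta())
    return total / len(incidents)


def incidents_per_tier(incidents: list[Incident]) -> Counter:
    """Distribution of incidents across severity tiers."""
    return Counter(i.severity for i in incidents)


def sla_compliance_rate(incidents: list[Incident]) -> float:
    """Percentage of incidents restored within their tier's SLA target."""
    within = sum(1 for i in incidents if i.duration <= SLA_TARGETS[i.severity])
    return 100.0 * within / len(incidents)
```

Tracking these values per tier rather than only in aggregate makes it easier to notice when high tiers are overused or when a tier's target is routinely missed.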

SLA-driven prioritization at payment providers

A payment provider uses severity levels to standardize escalation and SLA reporting.

On-call routing based on severity

Severity tiers determine which on-call role is assigned an incident.
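
Severity-based routing can be as simple as a lookup table from tier to on-call role. The following sketch assumes hypothetical tier names, role names, and acknowledgement deadlines; in practice this mapping would usually live in the incident management tool's own configuration rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    role: str              # on-call role that receives the incident
    page_immediately: bool  # page (True) or create a ticket only (False)
    ack_minutes: int       # assumed acknowledgement deadline in minutes


# Hypothetical routing table, for illustration only.
ROUTING = {
    "SEV1": Route("incident-commander", True, 5),
    "SEV2": Route("primary-on-call", True, 15),
    "SEV3": Route("service-owner", False, 240),
    "SEV4": Route("backlog-triage", False, 1440),
}


def route(severity: str) -> Route:
    """Return the routing rule for a severity tier, rejecting unknown tiers."""
    try:
        return ROUTING[severity]
    except KeyError:
        raise ValueError(f"unknown severity tier: {severity!r}") from None
```

Rejecting unknown tiers explicitly, instead of falling back to a default, keeps a misconfigured alert from being routed silently to the wrong role.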

Prioritization in release planning

Bugs are classified by severity to set fix priorities in releases.

1. Inventory: capture services, SLAs and monitoring coverage.

2. Definition: establish clear criteria and escalation paths for each tier.

3. Integration: adjust alerts, on-call routing and ticketing.

4. Review: regularly review and adjust based on post-incident analyses.

⚠️ Technical debt & bottlenecks

  • Insufficient observability hinders correct classification
  • Outdated SLA documentation
  • Missing integrations between alerting and ticketing
  • Incomplete telemetry
  • Slow communication channels
  • Unclear ownership
  • Classifying marketing bugs as high severity although there is no production impact
  • Using severity to bypass change processes
  • Not documenting severity levels and applying them inconsistently
  • Confusing impact with frequency when rating
  • Over-automation without escalation checks
  • Failure to adapt to changed business priorities
  • Incident triage and prioritization
  • Basics of observability and monitoring
  • Crisis communication
  • Availability of critical paths
  • SLA and contractual requirements
  • Observability and monitoring coverage
  • Technical limitations in monitoring
  • Organizational escalation boundaries
  • Contractual SLA constraints