concept #Reliability #Observability #DevOps #Governance

On-Call

An organized team duty to respond to incidents and operational disruptions outside regular hours. Its purpose is rapid recovery, minimal downtime, and clear escalation paths.

Established
Medium

Classification

  • Medium
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • Incident management and pager systems (e.g. PagerDuty)
  • Monitoring and observability platforms
  • ChatOps tools for escalation and communication

Principles & goals

  • Clear responsibilities and rotation to avoid single-person risks.
  • Automated alert filtering and prioritization reduce noise.
  • Documented runbooks and post-incident reviews support learning.
Run
Team, Domain
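
The principle of automated alert filtering and prioritization can be illustrated with a minimal sketch. It assumes a hypothetical `Alert` record and a three-level severity scale (`critical`, `warning`, `info`); neither comes from a specific pager or monitoring product. Duplicate alerts for the same service and check are collapsed to the most urgent one, alerts below the paging threshold are dropped, and the rest are sorted so the most urgent comes first.

```python
from dataclasses import dataclass

# Hypothetical severity scale for illustration; lower rank = more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str


def filter_and_prioritize(alerts, page_at_or_above="warning"):
    """Drop low-severity alerts, collapse duplicates, sort most urgent first."""
    threshold = SEVERITY_RANK[page_at_or_above]
    most_urgent = {}
    for alert in alerts:
        rank = SEVERITY_RANK[alert.severity]
        if rank > threshold:
            continue  # below the paging threshold: treated as noise
        key = (alert.service, alert.check)
        current = most_urgent.get(key)
        # Keep only the most urgent alert per (service, check) pair.
        if current is None or rank < SEVERITY_RANK[current.severity]:
            most_urgent[key] = alert
    return sorted(most_urgent.values(), key=lambda a: SEVERITY_RANK[a.severity])


incoming = [
    Alert("api", "latency", "warning"),
    Alert("api", "latency", "critical"),
    Alert("db", "disk", "info"),
    Alert("db", "replication", "warning"),
]
to_page = filter_and_prioritize(incoming)
```

In practice this kind of grouping and thresholding is usually configured in the alerting platform itself rather than written by hand; the sketch only shows the idea.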

Use cases & scenarios

Compromises

  • Burnout from sustained high on-call load.
  • Unclear escalation paths lead to delayed responses.
  • Insufficient documentation worsens resolution times.
  • Schedule on-call so load is distributed fairly.
  • Automate routine responses and reduce manual steps.
  • Hold regular post-incident reviews with concrete improvements.

I/O & resources

  • Monitoring and alerting configuration
  • On-call roster and escalation matrix
  • Runbooks, playbooks and contact details
  • Detected and handled incidents with timeline
  • Post-incident reports and action items
  • Improved runbooks and reduced alert noise

Description

On-call describes the structured rotation of responsibilities for operations teams to quickly detect, escalate, and remediate incidents. It includes scheduling, alerting, runbooks and post-incident reviews for continuous improvement. Well-designed on-call processes reduce downtime and distribute operational knowledge across the organization.

  • Faster recovery and reduced downtime.
  • Knowledge is distributed across the team and not tied to individuals.
  • Better measurability of operational effects and improvement cycles.

  • Increased stress for involved people, especially with poor planning.
  • Scheduling and administration can require significant effort.
  • Without good automation, irrelevant alerts create a high level of noise.

  • Mean Time to Acknowledge (MTTA)

    Average time until an alert is acknowledged by on-call.

  • Mean Time to Resolve (MTTR)

    Average time until normal operation is restored after an incident.

  • Alert noise ratio

    Share of irrelevant or false-positive alerts relative to total alerts.
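
The three metrics above can be computed from a simple alert log. The following is a sketch under stated assumptions: the `AlertRecord` fields are hypothetical, an alert counts as noise when it is marked not actionable, and the start of an incident is approximated by the time the alert was triggered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    triggered_at: datetime
    acknowledged_at: Optional[datetime]  # None if never acknowledged
    resolved_at: Optional[datetime]      # None if not (yet) resolved
    actionable: bool                     # False for irrelevant or false-positive alerts


def _mean(deltas):
    # Average a list of timedeltas; None when there is nothing to average.
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def mean_time_to_acknowledge(records):
    return _mean([r.acknowledged_at - r.triggered_at
                  for r in records if r.acknowledged_at is not None])


def mean_time_to_resolve(records):
    return _mean([r.resolved_at - r.triggered_at
                  for r in records if r.resolved_at is not None])


def alert_noise_ratio(records):
    if not records:
        return None
    return sum(1 for r in records if not r.actionable) / len(records)


t0 = datetime(2024, 1, 1, 2, 0)
log = [
    AlertRecord(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30), True),
    AlertRecord(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=50), True),
    AlertRecord(t0, t0 + timedelta(minutes=5), None, False),
    AlertRecord(t0, None, None, False),
]
mtta = mean_time_to_acknowledge(log)   # 5 minutes
mttr = mean_time_to_resolve(log)       # 40 minutes
noise = alert_noise_ratio(log)         # 0.5
```

Alerts that were never acknowledged or resolved are left out of the respective averages here; a team may prefer to count them differently, for example by capping them at the time of the review.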

Established SRE on-call rotation

An SRE team runs rotating on-call, combining alert prioritization and runbooks for remediation.

Small team with shared on-call shifts

A small product team shares on-call duties among few people and uses clear escalation paths.

Externally supported on-call via pager service

A team uses a pager/incident service for alerting, complemented by automated playbooks.
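
The scenarios above all rely on clear escalation paths. A minimal sketch of a time-based escalation policy could look as follows; the step names and delays are hypothetical and do not reflect any particular pager service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    notify: str         # who is paged at this step
    after_minutes: int  # minutes the alert stays unacknowledged before this step fires


# Hypothetical policy: escalate to technical experts, not straight to management.
POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 10),
    EscalationStep("service expert", 25),
]


def who_to_notify(policy, minutes_unacknowledged):
    """Return everyone who should have been paged so far, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [step.notify for step in ordered
            if step.after_minutes <= minutes_unacknowledged]


at_start = who_to_notify(POLICY, 0)
after_12 = who_to_notify(POLICY, 12)
after_30 = who_to_notify(POLICY, 30)
```

Writing the policy down as data makes the escalation matrix reviewable and testable, whichever tool eventually enforces it.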

  1. Define goals, SLAs and compensation rules for on-call.
  2. Introduce a rotating roster and clear escalation paths.
  3. Create and test runbooks; introduce and monitor metrics.
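
The rotating roster from the steps above can be sketched as a simple weekly round-robin. This is an illustration only: the engineer names, the weekly shift length and the primary/secondary split are assumptions, and a real roster would also need to account for leave, time zones and working time regulations.

```python
from datetime import date, timedelta


def build_roster(engineers, start, weeks):
    """Assign one primary and one secondary per week, round-robin."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to avoid single-person risk")
    roster = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The secondary of this week becomes the primary of the next week.
        secondary = engineers[(week + 1) % len(engineers)]
        shift_start = start + timedelta(weeks=week)
        roster.append((shift_start, primary, secondary))
    return roster


# Hypothetical team of three, starting on a Monday.
roster = build_roster(["ana", "ben", "chen"], date(2024, 1, 1), weeks=4)
for shift_start, primary, secondary in roster:
    print(shift_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Having the secondary take over as primary the following week helps with handover, but it also means two consecutive weeks of duty per person. Teams that need more recovery time between shifts may prefer a different offset.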

⚠️ Technical debt & bottlenecks

  • Missing automation for frequent recovery steps.
  • Outdated or untested runbooks and playbooks.
  • Insufficient integration between monitoring and pager systems.

  • Knowledge bottleneck (key-person risk)
  • Alert noise from insufficient filtering
  • Lack of automation for routine responses

  • Continuous developer availability as a substitute for structured on-call processes.
  • Escalating directly to management instead of technical experts.
  • On-call duty without access to logs or runbooks and without clear instructions for action.
  • Insufficient handover between shifts leads to information loss.
  • Lack of measurement prevents improvement of on-call quality.
  • A rotation that is too tight, leaving no recovery time, increases the risk of errors.

  • Basic troubleshooting and log analysis
  • Knowledge of system architecture and dependencies
  • Communication and incident management skills

  • Availability of critical systems around the clock
  • Fast traceability of incidents through telemetry
  • Clear escalation and communication paths

  • Working time regulations and compensation obligations
  • Availability of qualified personnel for rotation
  • Integration into existing alerting and monitoring systems