On-Call
An organized team duty to respond to incidents and operational disruptions outside regular working hours. Its purpose is rapid recovery, minimal downtime, and clear escalation paths.
Classification
- Complexity: Medium
- Impact area: Organizational
- Decision type: Organizational
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
Risks
- Burnout from sustained high on-call load.
- Unclear escalation paths lead to delayed responses.
- Insufficient documentation worsens resolution times.
Mitigations
- Schedule on-call so load is distributed fairly.
- Automate routine responses and reduce manual steps.
- Hold regular post-incident reviews with concrete improvements.
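One way to approach fair load distribution is to track a weighted load per person and give each shift to whoever currently carries the least. The following is a minimal sketch under assumptions that are not part of the source: the names, the shift weights (weekend shifts counted double), and the rule that nobody covers two consecutive shifts are all hypothetical.

```python
from collections import defaultdict


def build_roster(engineers, shifts):
    """Assign each shift to the least-loaded engineer.

    engineers: list of names (hypothetical).
    shifts: list of (label, weight) tuples; the weight is an assumed
            measure of how burdensome a shift is (e.g. weekend = 2).
    """
    load = defaultdict(int)
    roster = []
    previous = None
    for label, weight in shifts:
        # Skip whoever covered the previous shift so nobody works back to back.
        # With a single engineer there is no alternative, so fall back to all.
        candidates = [e for e in engineers if e != previous] or list(engineers)
        # Lowest accumulated load wins; ties fall back to the order in `engineers`.
        chosen = min(candidates, key=lambda e: load[e])
        load[chosen] += weight
        roster.append((label, chosen))
        previous = chosen
    return roster, dict(load)


engineers = ["alice", "bob", "carol"]
shifts = [("Mon", 1), ("Tue", 1), ("Wed", 1), ("Thu", 1),
          ("Fri", 1), ("Sat", 2), ("Sun", 2)]
roster, load = build_roster(engineers, shifts)
for label, name in roster:
    print(label, name)
print(load)
```

A greedy assignment like this does not guarantee a perfectly even split over a single week, but the imbalance tends to even out when the accumulated load is carried over into the next planning period.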
I/O & resources
Inputs
- Monitoring and alerting configuration
- On-call roster and escalation matrix
- Runbooks, playbooks and contact details
Outputs
- Detected and handled incidents with timeline
- Post-incident reports and action items
- Improved runbooks and fewer alert floods
Description
On-call describes the structured rotation of responsibilities for operations teams to quickly detect, escalate, and remediate incidents. It includes scheduling, alerting, runbooks and post-incident reviews for continuous improvement. Well-designed on-call processes reduce downtime and distribute operational knowledge across the organization.
✔ Benefits
- Faster recovery and reduced downtime.
- Knowledge is distributed across the team and not tied to individuals.
- Operational impact and improvement cycles become easier to measure.
✖ Limitations
- Increased stress for involved people, especially with poor planning.
- Effort for scheduling and administration can be significant.
- Without good automation, irrelevant alerts generate a lot of noise.
Trade-offs
Metrics
- Mean Time to Acknowledge (MTTA)
Average time until an alert is acknowledged by on-call.
- Mean Time to Resolve (MTTR)
Average time until normal operation is restored after an incident.
- Alert noise ratio
Share of irrelevant or false-positive alerts relative to total alerts.
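The three metrics can be derived from a simple alert log. The sketch below assumes a hypothetical `Alert` record with `fired_at`, `acknowledged_at`, `resolved_at` and an `actionable` flag; it also assumes that MTTR is measured from the moment the alert fired and only over actionable alerts. Teams define these boundaries differently, so treat the choices here as illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Alert:
    fired_at: datetime
    acknowledged_at: Optional[datetime]
    resolved_at: Optional[datetime]
    actionable: bool  # False for irrelevant or false-positive alerts


def mean_minutes(deltas):
    """Average a sequence of timedeltas in minutes; None if it is empty."""
    deltas = list(deltas)
    if not deltas:
        return None
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def on_call_metrics(alerts):
    # MTTA: time from firing to acknowledgement, over acknowledged alerts.
    mtta = mean_minutes(a.acknowledged_at - a.fired_at
                        for a in alerts if a.acknowledged_at)
    # MTTR: time from firing to resolution, over actionable, resolved alerts.
    mttr = mean_minutes(a.resolved_at - a.fired_at
                        for a in alerts if a.actionable and a.resolved_at)
    # Noise ratio: share of non-actionable alerts among all alerts.
    noise = sum(not a.actionable for a in alerts) / len(alerts) if alerts else None
    return {"mtta_min": mtta, "mttr_min": mttr, "noise_ratio": noise}


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


alerts = [
    Alert(at(2, 0), at(2, 4), at(2, 40), actionable=True),
    Alert(at(3, 0), at(3, 2), None, actionable=False),
    Alert(at(5, 0), at(5, 6), at(5, 50), actionable=True),
    Alert(at(6, 0), at(6, 8), None, actionable=False),
]
metrics = on_call_metrics(alerts)
print(metrics)  # {'mtta_min': 5.0, 'mttr_min': 45.0, 'noise_ratio': 0.5}
```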
Examples & implementations
Established SRE on-call rotation
An SRE team runs rotating on-call, combining alert prioritization and runbooks for remediation.
Small team with shared on-call shifts
A small product team shares on-call duties among a few people and uses clear escalation paths.
Externally supported on-call via pager service
A team uses a pager/incident service for alerting, complemented by automated playbooks.
Implementation steps
Define goals, SLAs and compensation rules for on-call.
Introduce a rotating roster and clear escalation paths.
Create and test runbooks; introduce and monitor metrics.
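A clear escalation path can be written down as ordered levels, each with a waiting time before the next level is paged. The following sketch assumes a hypothetical three-level policy and hypothetical timeouts; real pager services offer their own configuration formats, which are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


def contacts_to_page(policy, minutes_unacknowledged):
    """Return every contact that should have been paged so far."""
    paged = []
    threshold = 0  # minutes after which the current level is paged
    for level in policy:
        if minutes_unacknowledged < threshold:
            break
        paged.append(level.contact)
        threshold += level.wait_minutes
    return paged


# Hypothetical policy: the chain ends with a technical expert.
policy = [
    EscalationLevel("primary on-call", 10),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("subject-matter expert", 20),
]

print(contacts_to_page(policy, 0))   # ['primary on-call']
print(contacts_to_page(policy, 12))  # ['primary on-call', 'secondary on-call']
print(contacts_to_page(policy, 30))  # all three levels
```

Keeping the policy as plain data makes it easy to review alongside the roster and to test before it is relied on during an incident.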
⚠️ Technical debt & bottlenecks
Technical debt
- Missing automation for frequent recovery steps.
- Outdated or untested runbooks and playbooks.
- Insufficient integration between monitoring and pager systems.
Known bottlenecks
Misuse examples
- Continuous developer availability as a substitute for structured on-call processes.
- Escalating directly to management instead of technical experts.
- On-call without access to logs and runbooks, and without actionable instructions.
Typical traps
- Insufficient handover between shifts leads to information loss.
- Lack of measurement prevents improvement of on-call quality.
- An overly tight rotation without recovery time increases the risk of errors.
Required skills
Architectural drivers
Constraints
- Working time regulations and compensation obligations
- Availability of qualified personnel for rotation
- Integration into existing alerting and monitoring systems