concept #Reliability #Observability #DevOps #Governance

On-Call

An organized team duty to respond to incidents and operational disruptions, including outside regular hours. Its purpose is to recover quickly, minimize downtime, and provide clear escalation paths.

Maturity

Established

Cognitive load: Medium

Classification

Complexity: Medium
Impact area: Organizational
Decision type: Organizational
Organizational maturity: Intermediate

Technical context

Integrations

Incident management and pager systems (e.g. PagerDuty)
Monitoring and observability platforms
ChatOps tools for escalation and communication

Principles & goals

Principles

Clear responsibilities and rotation to avoid single-person risks.
Automated alert filtering and prioritization reduce noise.
Documented runbooks and post-incident reviews support learning.
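
The filtering and prioritization principle can be illustrated with a minimal sketch. It assumes alerts arrive as dictionaries with hypothetical `key`, `severity`, and `time` fields; the severity levels, the paging threshold, and the 10-minute deduplication window are illustrative choices, not values from any particular pager system.

```python
from datetime import datetime, timedelta

# Hypothetical severity scale; real systems define their own levels.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def filter_alerts(alerts, page_threshold="critical", dedup_window=timedelta(minutes=10)):
    """Split alerts into those that page on-call and those that are only logged.

    An alert is logged instead of paged when its severity is below the
    threshold, or when the same key already paged within dedup_window.
    """
    last_paged = {}  # alert key -> time of the most recent page for that key
    page, log_only = [], []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        below = SEVERITY_ORDER[alert["severity"]] < SEVERITY_ORDER[page_threshold]
        previous = last_paged.get(alert["key"])
        duplicate = previous is not None and alert["time"] - previous < dedup_window
        if below or duplicate:
            log_only.append(alert)
        else:
            page.append(alert)
            last_paged[alert["key"]] = alert["time"]
    return page, log_only
```

Suppressed alerts are kept in `log_only` rather than dropped, so they remain available for post-incident reviews.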

Value stream stage

Run

Organizational level

Team, Domain

Use cases & scenarios

Use cases

Scenarios

Compromises

Risks

Burnout from sustained high on-call load.
Unclear escalation paths lead to delayed responses.
Insufficient documentation worsens resolution times.

Best practices

Schedule on-call so load is distributed fairly.
Automate routine responses and reduce manual steps.
Hold regular post-incident reviews with concrete improvements.

I/O & resources

Inputs

Monitoring and alerting configuration
On-call roster and escalation matrix
Runbooks, playbooks and contact details
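
As a rough illustration of an escalation matrix, the sketch below maps how long an alert has gone unacknowledged to the roles that should have been notified by then. The role names and the 15- and 30-minute delays are assumptions for the example; an actual matrix comes from the team's own roster and policy.

```python
from datetime import timedelta

# Hypothetical escalation matrix: (delay since the alert fired, role to notify).
ESCALATION_MATRIX = [
    (timedelta(minutes=0), "primary on-call"),
    (timedelta(minutes=15), "secondary on-call"),
    (timedelta(minutes=30), "domain expert"),
]


def targets_to_notify(unacknowledged_for, matrix=ESCALATION_MATRIX):
    """Return every role that should have been notified after the given wait."""
    return [target for delay, target in matrix if unacknowledged_for >= delay]
```

The last level in this sketch is a technical expert rather than a manager, so that escalation reaches someone who can act on the incident.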

Outputs

Detected and handled incidents, each with a timeline
Post-incident reports and action items
Improved runbooks and reduced alert noise

Resources

Description

On-call describes the structured rotation of responsibilities for operations teams to quickly detect, escalate, and remediate incidents. It includes scheduling, alerting, runbooks and post-incident reviews for continuous improvement. Well-designed on-call processes reduce downtime and distribute operational knowledge across the organization.

✔ Benefits

Faster recovery and reduced downtime.
Knowledge is distributed across the team and not tied to individuals.
Operational performance and the effect of improvement cycles become measurable.

✖ Limitations

Increased stress for the people involved, especially with poor planning.
Effort for scheduling and administration can be significant.
Without good automation and filtering, irrelevant alerts create a high level of noise.

Trade-offs

Metrics

Mean Time to Acknowledge (MTTA)
Average time until an alert is acknowledged by on-call.
Mean Time to Resolve (MTTR)
Average time until normal operation is restored after an incident.
Alert noise ratio
Share of irrelevant or false-positive alerts relative to total alerts.
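
The three metrics can be computed from alert records roughly as follows. This is a sketch under stated assumptions: the `Alert` record and its field names are hypothetical, time to resolve is measured from the moment the alert fired, and alerts that were never acknowledged or resolved are excluded from the respective average.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    fired_at: datetime
    acknowledged_at: Optional[datetime]  # None if never acknowledged
    resolved_at: Optional[datetime]      # None if still open
    actionable: bool                     # False for irrelevant or false-positive alerts


def mean_delta(deltas):
    """Average an iterable of timedeltas; return None when it is empty."""
    deltas = list(deltas)
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def mtta(alerts):
    """Mean time from firing to acknowledgement, over acknowledged alerts."""
    return mean_delta(a.acknowledged_at - a.fired_at for a in alerts if a.acknowledged_at)


def mttr(alerts):
    """Mean time from firing to resolution, over resolved alerts."""
    return mean_delta(a.resolved_at - a.fired_at for a in alerts if a.resolved_at)


def alert_noise_ratio(alerts):
    """Share of non-actionable alerts among all alerts; None when there are none."""
    if not alerts:
        return None
    return sum(1 for a in alerts if not a.actionable) / len(alerts)
```

Whether unacknowledged alerts should be excluded or counted with a penalty is a design choice; excluding them, as here, can make MTTA look better than the on-call experience actually was.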

Examples & implementations

Established SRE on-call rotation

An SRE team runs rotating on-call, combining alert prioritization and runbooks for remediation.

Small team with shared on-call shifts

A small product team shares on-call duties among a few people and uses clear escalation paths.

Externally supported on-call via pager service

A team uses a pager/incident service for alerting, complemented by automated playbooks.

Implementation steps

Define goals, SLAs and compensation rules for on-call.

Introduce a rotating roster and clear escalation paths.

Create and test runbooks; introduce and monitor metrics.
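
A rotating roster like the one in the second step could be generated along these lines. This is a minimal sketch: the weekly shift length, the primary/secondary split, and the function name `build_roster` are assumptions, and it ignores holidays, time zones, and working-time rules.

```python
from datetime import date, timedelta


def build_roster(engineers, start, weeks, shift_days=7):
    """Return a list of (shift_start, shift_end, primary, secondary) tuples.

    Engineers rotate round-robin; the secondary of each shift becomes the
    primary of the next one, which gives a natural handover.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to avoid single-person risk")
    roster = []
    for i in range(weeks):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days)
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        roster.append((shift_start, shift_end, primary, secondary))
    return roster
```

For example, `build_roster(["ana", "ben", "chen"], date(2024, 1, 1), weeks=4)` assigns each person one primary week before the cycle repeats in week four.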

⚠️ Technical debt & bottlenecks

Technical debt

Missing automation for frequent recovery steps.
Outdated or untested runbooks and playbooks.
Insufficient integration between monitoring and pager systems.

Known bottlenecks

Knowledge bottleneck (key-person risk)
Alert noise from insufficient filtering
Lack of automation for routine responses

Misuse examples

Continuous developer availability as a substitute for structured on-call processes.
Escalating directly to management instead of technical experts.
On-call without access to logs or runbooks and without clear instructions for action.

Typical traps

Insufficient handover between shifts leads to information loss.
Lack of measurement prevents improvement of on-call quality.
Rotations that are too tight, without recovery time, increase the risk of errors.

Required skills

Basic troubleshooting and log analysis
Knowledge of system architecture and dependencies
Communication and incident management skills

Architectural drivers

Availability of critical systems around the clock
Fast traceability of incidents through telemetry
Clear escalation and communication paths

Constraints

Working time regulations and compensation obligations
Availability of qualified personnel for rotation
Integration into existing alerting and monitoring systems