concept · #Reliability #Governance #DevOps #Observability

Error Budget Policy

A policy that defines a service's tolerable error budget and the organizational actions triggered when that budget is exceeded.

Established
Medium

Classification

  • Medium
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • Monitoring systems (e.g. Prometheus, Grafana)
  • Incident management (e.g. PagerDuty, Opsgenie)
  • CI/CD pipelines for release gating

Principles & goals

  • SLOs are the foundation for every error budget decision.
  • Clear escalation and approval rules link engineering and governance.
  • Measurability and transparency enable trust and actionability.
Run
Domain, Team

Use cases & scenarios

Compromises

  • Incorrect or poorly defined SLIs distort decisions
  • Overly rigid policies create excessive bureaucracy
  • Teams may circumvent the policy by measuring inadequately

  • Start with a small set of easily measurable SLOs
  • Use automated alerts based on burn rate
  • Conduct regular reviews and policy adjustments
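
The burn-rate alerting mentioned above is often implemented with two windows, so that an alert fires only when both a longer and a shorter window show elevated consumption. The following is a minimal sketch of that idea, not a definitive implementation. The class `BurnRateAlert`, the helper `budget_fraction_consumed`, and the 30-day (720-hour) SLO period are assumptions made for illustration; thresholds would need to be agreed per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    """Hypothetical two-window burn-rate alert rule."""
    threshold: float           # burn rate at or above which the rule matches
    long_window_hours: float   # e.g. 1 hour
    short_window_hours: float  # e.g. 5 minutes, used to confirm the issue is ongoing

    def fires(self, long_window_burn: float, short_window_burn: float) -> bool:
        # Require both windows to exceed the threshold so that an alert
        # stops firing soon after the error rate has recovered.
        return (long_window_burn >= self.threshold
                and short_window_burn >= self.threshold)


def budget_fraction_consumed(burn_rate: float, window_hours: float,
                             slo_period_hours: float = 720.0) -> float:
    """Fraction of the whole error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / slo_period_hours
```

Under these assumptions, a burn rate of 14.4 sustained for one hour corresponds to 2% of a 30-day budget, which is one way to reason about where to place a threshold.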

I/O & resources

  • Definition of SLOs, SLIs and time windows
  • Reliable monitoring data and dashboards
  • Governance and escalation policies

  • Decisions on releases, rollbacks or throttling
  • Prioritized stability measures
  • Transparent reports for stakeholders

Description

An Error Budget Policy specifies how much unreliability a service may tolerate over a defined period and which organizational actions trigger on breach. It ties SLOs to release, prioritization, and incident-response rules. The policy makes risk measurable and embeds accountability into operational governance.
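
To make "how much unreliability a service may tolerate" concrete, the sketch below computes the budget for a time-based availability SLO as (1 - target) × window. It is an illustration under stated assumptions: the function names `error_budget` and `budget_remaining` are hypothetical, and request-based SLOs would count failed requests instead of downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed unavailability for a time-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> float:
    """Fraction of the budget still unspent; negative means the budget is exceeded."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime / budget
```

For example, a 99.9% target over a 30-day window leaves a budget of roughly 43 minutes; 10.8 minutes of downtime would leave about three quarters of it.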

  • Balance between reliability and velocity
  • Clearer decision basis for releases
  • Promotes data-driven governance and accountability

  • Depends on high-quality monitoring and reliable SLIs
  • Short time windows may lead to overly conservative decisions
  • Requires disciplined maintenance of SLO definitions

  • SLO compliance rate

    Proportion of time the SLO was met; the primary measure of success.

  • Error burn rate

    Speed at which the error budget is consumed; early warning indicator.

  • MTTR (Mean Time To Repair)

    Average recovery time after an incident; measures responsiveness.
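
The three metrics above can be sketched in a few lines. This is a minimal illustration, assuming the error ratio, the per-period compliance results, and the repair durations are already available from monitoring; the function names are hypothetical.

```python
from datetime import timedelta


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO window; higher values exhaust it sooner.
    """
    return error_ratio / (1.0 - slo_target)


def slo_compliance_rate(periods_met: list[bool]) -> float:
    """Share of evaluation periods in which the SLO was met."""
    if not periods_met:
        raise ValueError("at least one evaluation period is required")
    return sum(periods_met) / len(periods_met)


def mttr(repair_durations: list[timedelta]) -> timedelta:
    """Mean time to repair across incidents."""
    if not repair_durations:
        raise ValueError("at least one incident is required")
    return sum(repair_durations, timedelta()) / len(repair_durations)
```

With a 99.9% target, an observed error ratio of 0.5% gives a burn rate of about 5, meaning the budget is being consumed five times faster than the SLO allows.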

Google SRE example

Google uses error budgets to balance reliability and feature velocity across services.

Small product team

A startup defines simple SLOs and blocks releases when budgets are exceeded during peak periods.

E-commerce platform

The team prioritizes bug fixes over new features when the error burn rate reaches a critical level.

  1. Define SLOs and SLIs for critical user journeys
  2. Build monitoring pipelines and dashboards
  3. Set policy rules for burn-rate thresholds and actions
  4. Configure integrations to CI/CD and incident management
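
Steps 3 and 4 can be sketched as a small rule table plus a release gate that a pipeline step could call. Everything here is an assumption for illustration: the `Action` names, the `ErrorBudgetPolicy` class, and the default thresholds are hypothetical, and a real policy would define them through the agreed governance process.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    NOTIFY_TEAM = "notify_team"
    PRIORITIZE_FIXES = "prioritize_fixes"
    FREEZE_RELEASES = "freeze_releases"


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    # Illustrative thresholds, not recommendations.
    warn_burn_rate: float = 2.0
    critical_burn_rate: float = 10.0

    def decide(self, budget_consumed: float, burn_rate: float) -> Action:
        """Map the current budget state to the strictest applicable action."""
        if budget_consumed >= 1.0:  # budget fully spent or exceeded
            return Action.FREEZE_RELEASES
        if burn_rate >= self.critical_burn_rate:
            return Action.PRIORITIZE_FIXES
        if burn_rate >= self.warn_burn_rate:
            return Action.NOTIFY_TEAM
        return Action.PROCEED


def release_gate_exit_code(policy: ErrorBudgetPolicy,
                           budget_consumed: float, burn_rate: float) -> int:
    """Return 0 to let a pipeline step pass, 1 to block the release."""
    action = policy.decide(budget_consumed, burn_rate)
    return 1 if action is Action.FREEZE_RELEASES else 0
```

Keeping the decision in one pure function makes the policy easy to review and test independently of the monitoring and CI/CD tooling that supplies its inputs.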

⚠️ Technical debt & bottlenecks

  • Incomplete instrumentation leads to uncertain SLI values
  • Manual reports instead of automated dashboards
  • Outdated runbooks and escalation paths

  • Insufficient SLI quality
  • Missing dashboard or reporting pipelines
  • Unclear escalation processes

  • Setting SLO targets too loosely in order to mask disruptions
  • Forcing releases through while ignoring the error budget
  • Deriving SLIs from non-representative test data
  • Confusing SLA (contractual) with SLO (internal)
  • Choosing too many or too complex SLIs
  • No automated measurements in place

  • Knowledge of SRE principles and SLO design
  • Monitoring and observability skills
  • Organizational communication and governance experience

  • Availability of monitoring and observability tools
  • Clear service ownership and responsibilities
  • Automatability of release and gating decisions

  • Technical ability to measure relevant SLIs
  • Organizational agreement on SLO targets
  • Legal or compliance requirements for availability