Error Budget Policy
A policy that defines a service's tolerable error budget and the organizational actions triggered when that budget is exceeded.
Classification
- Complexity: Medium
- Impact area: Organizational
- Decision type: Organizational
- Organizational maturity: Intermediate
Compromises
Risks
- Incorrect or poorly defined SLIs distort decisions
- Overly rigid policies create excessive bureaucracy
- Teams may circumvent the policy by measuring inadequately
Mitigations
- Choose a small set of easily measurable SLOs
- Use automated alerts based on burn rate
- Review and adjust the policy regularly
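Burn-rate alerting is often implemented as a multi-window check: a rule fires only when both a long and a short window burn the budget faster than a threshold, so alerts track sustained problems and clear quickly after recovery. The following is a minimal sketch in Python under stated assumptions: burn rates per window are computed elsewhere and passed in, and the names `BurnRateRule`, `RULES`, and `firing_severity` as well as the window and threshold values are illustrative, loosely following commonly cited SRE guidance for a 30-day SLO window rather than anything this policy prescribes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # label of the long look-back window, e.g. "1h"
    short_window: str   # label of the short confirmation window, e.g. "5m"
    threshold: float    # burn-rate multiple both windows must reach
    severity: str       # what to do when the rule fires


# Hypothetical rule set; tune windows and thresholds to your own SLO window.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def firing_severity(
    burn_by_window: Dict[str, float],
    rules: Sequence[BurnRateRule] = RULES,
) -> Optional[str]:
    """Return the severity of the first rule whose two windows both exceed
    its threshold, or None if no rule fires. Missing windows count as 0."""
    for rule in rules:
        long_burn = burn_by_window.get(rule.long_window, 0.0)
        short_burn = burn_by_window.get(rule.short_window, 0.0)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.severity
    return None
```

Requiring the short window as well keeps a rule from firing long after the incident has ended, when only the long window still shows an elevated burn rate.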
I/O & resources
Inputs
- Definition of SLOs, SLIs and time windows
- Reliable monitoring data and dashboards
- Governance and escalation policies
Outputs
- Decisions on releases, rollbacks or throttling
- Prioritized stability measures
- Transparent reports for stakeholders
Description
An Error Budget Policy specifies how much unreliability a service may accumulate over a defined period and which organizational actions are triggered once that budget is exhausted. It ties SLOs to release, prioritization, and incident-response rules, which makes risk measurable and embeds accountability in operational governance.
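To make "how much unreliability" concrete: the budget is the complement of the SLO target applied to the measurement window. The sketch below shows the arithmetic in Python for a time-based and a request-based budget. The function names and the example figures (a 99.9% target over 30 days) are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for an availability SLO over the given window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of a request-based budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of budget.
    print(error_budget(0.999, timedelta(days=30)))
    # 250 failed requests out of 1,000,000 spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 3))
```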
✔ Benefits
- Balance between reliability and velocity
- Clearer decision basis for releases
- Promotes data-driven governance and accountability
✖ Limitations
- Depends on high-quality monitoring and reliable SLIs
- Short time windows may lead to overly conservative decisions
- Requires disciplined maintenance of SLO definitions
Trade-offs
Metrics
- SLO compliance rate
Share of time the SLO was met; the primary measure of success.
- Error burn rate
Speed at which the error budget is consumed; early warning indicator.
- MTTR (Mean Time To Repair)
Average recovery time after an incident; measures responsiveness.
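The three metrics above reduce to simple ratios and averages. The following Python sketch shows one way to compute them, assuming that compliance is tracked per evaluation period, that the SLI is request-based, and that recovery times are recorded in minutes. All function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def slo_compliance_rate(periods_met: Sequence[bool]) -> float:
    """Share of evaluation periods in which the SLO was met."""
    if not periods_met:
        raise ValueError("at least one evaluation period is required")
    return sum(periods_met) / len(periods_met)


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would last exactly the SLO window;
    2.0 means it would be gone in half that time.
    """
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)


def mttr_minutes(recovery_minutes: Sequence[float]) -> float:
    """Mean time to repair across the recorded incidents."""
    return mean(recovery_minutes)
```

For example, 20 failed requests out of 10,000 against a 99.9% target is an error rate of 0.2% against an allowed 0.1%, that is, a burn rate of about 2.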
Examples & implementations
Google SRE example
Google uses error budgets to balance reliability and feature velocity across services.
Small product team
A startup defines simple SLOs and blocks releases when budgets are exceeded during peak periods.
E-commerce platform
Team prioritizes bug fixes over new features when the error burn rate reaches a critical level.
Implementation steps
Define SLOs and SLIs for critical user journeys
Build monitoring pipelines and dashboards
Set policy rules for burn-rate thresholds and actions
Configure integrations to CI/CD and incident management
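The last two steps can be sketched as a small decision function whose result a CI/CD job turns into a pass or fail signal. This is a minimal illustration in Python, not a prescribed implementation: the `Action` values, the thresholds `LOW_BUDGET` and `FAST_BURN`, and the function names are assumptions, and a real policy would take them from the governance rules agreed for the service.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed with releases"
    PRIORITIZE_RELIABILITY = "prioritize reliability work"
    FREEZE_RELEASES = "freeze feature releases"


# Hypothetical thresholds; the actual values belong in the written policy.
LOW_BUDGET = 0.25   # less than 25% of the budget left
FAST_BURN = 2.0     # budget is burning at twice the sustainable rate


def decide_action(budget_remaining: float, burn_rate: float) -> Action:
    """Map the current budget state to the action the policy calls for."""
    if budget_remaining <= 0.0:
        return Action.FREEZE_RELEASES
    if burn_rate >= FAST_BURN or budget_remaining < LOW_BUDGET:
        return Action.PRIORITIZE_RELIABILITY
    return Action.PROCEED


def release_gate_exit_code(action: Action) -> int:
    """Exit code for a pipeline step: non-zero blocks the deployment."""
    return 1 if action is Action.FREEZE_RELEASES else 0
```

A pipeline step could call `decide_action` with values read from the monitoring system and exit with `release_gate_exit_code(...)`, so that an exhausted budget blocks the deploy without manual intervention.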
⚠️ Technical debt & bottlenecks
Technical debt
- Incomplete instrumentation leads to uncertain SLI values
- Manual reports instead of automated dashboards
- Outdated runbooks and escalation paths
Known bottlenecks
Misuse examples
- Setting SLO targets too leniently so that disruptions never register as breaches
- Ignoring error budget while forcing releases
- Deriving SLIs from non-representative test data
Typical traps
- Confusing SLA (contractual) with SLO (internal)
- Choosing too many or too complex SLIs
- No automated measurements in place
Required skills
Architectural drivers
Constraints
- Technical ability to measure relevant SLIs
- Organizational agreement on SLO targets
- Legal or compliance requirements for availability