Operations
Overview of activities and practices for maintaining, monitoring, and evolving IT services and infrastructure.
Classification
- Complexity: Medium
- Impact area: Organizational
- Decision type: Organizational
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
- Over‑automation can create complexity and opacity
- Missing SLOs lead to inconsistent priorities
- Insufficient documentation increases MTTD/MTTR
- Small, tested releases (canary/blue‑green)
- Blameless postmortems with clear action items
- SLO‑driven prioritization of work
I/O & resources
- Monitoring data and telemetry
- Service level objectives (SLOs) and SLAs
- Automated CI/CD pipelines
- Operational documentation and runbooks
- Monitoring dashboards and alerts
- Postmortems and improvement actions
Description
Operations encompasses the organizational, technical, and procedural activities to maintain, monitor, and evolve IT services. It covers incident response, release management, capacity planning, and infrastructure automation to ensure availability and stability. Operations emphasizes automation, measurable service levels, and continuous improvement across development and operations boundaries.
✔ Benefits
- Higher availability and stability of services
- Faster incident response and reduced downtime
- Better predictability through capacity and cost control
✖ Limitations
- Requires organizational alignment and clearly assigned responsibilities
- Requires upfront effort to build automation and observability
- Not all legacy systems are easy to automate
Trade-offs
Metrics
- MTTR
Mean time to restore service after failures.
- Availability (uptime)
Percentage of time a service is available.
- SLO attainment rate
Proportion of time during which the defined SLOs are met.
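The three metrics above can be computed from a simple incident log. The following is a minimal sketch, not a reference implementation: the `Incident` record shape and the function names are assumptions made for illustration, incidents are assumed not to overlap, and SLO attainment is approximated as the share of equal-length measurement periods that met the target.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage (hypothetical record shape): start and restore timestamps."""
    started: datetime
    restored: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored - self.started


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to restore: average outage duration across incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.duration for i in incidents), timedelta(0))
    return total / len(incidents)


def availability(incidents: list[Incident], window: timedelta) -> float:
    """Fraction of the window the service was up (assumes non-overlapping incidents)."""
    downtime = sum((i.duration for i in incidents), timedelta(0))
    return 1.0 - downtime / window


def slo_attainment(period_availabilities: list[float], target: float) -> float:
    """Share of measurement periods whose availability met the target."""
    if not period_availabilities:
        return 0.0
    met = sum(1 for a in period_availabilities if a >= target)
    return met / len(period_availabilities)
```

For example, two outages of 30 and 90 minutes give an MTTR of 60 minutes; over a 30-day window they amount to 120 minutes of downtime out of 43,200.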
Examples & implementations
SRE approach at a payment provider
Establishing SLOs, error budget policies and on‑call rotations to improve availability.
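As an illustration of how an error budget policy of this kind might be expressed, here is a small sketch for a request-based SLO. The names `ErrorBudget` and `error_budget`, and the rule "pause releases once the budget is spent", are assumptions for this example and do not describe any particular provider's policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float   # failures the SLO tolerates in the period
    remaining: float          # allowed failures not yet used (negative if overspent)
    consumed_fraction: float  # share of the budget already used

    @property
    def releases_allowed(self) -> bool:
        # Hypothetical policy: ship changes only while budget remains.
        return self.remaining > 0


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute the error budget for a request-based SLO such as 0.999."""
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # A 100% target leaves no budget; any failure overspends it.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return ErrorBudget(allowed, remaining, consumed)
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated; 400 failures leave budget for further releases, while 1,200 would exhaust it.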
Automated rollout on Kubernetes
CI/CD pipeline with canary deployments, automated health checks and rollbacks.
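The decision logic behind such a rollout can be sketched independently of any cluster API. In the example below, the traffic steps, the thresholds, and the `observe` callback (which would query a metrics backend in practice) are all hypothetical; it shows one possible health gate, not how a specific pipeline or Kubernetes tool implements it.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


def evaluate_canary(baseline_error_rate: float, canary_error_rate: float,
                    tolerance: float = 0.01, hard_limit: float = 0.05) -> Decision:
    """Health gate: roll back if the canary is clearly worse than the baseline."""
    if canary_error_rate > hard_limit:
        return Decision.ROLLBACK
    if canary_error_rate > baseline_error_rate + tolerance:
        return Decision.ROLLBACK
    return Decision.PROMOTE


def run_rollout(steps: list[int], observe: Callable[[int], float],
                baseline_error_rate: float) -> tuple[Decision, int]:
    """Shift traffic step by step; stop and roll back at the first failed check.

    `steps` are canary traffic percentages, e.g. [5, 25, 50, 100].
    `observe(pct)` returns the canary error rate measured at that percentage.
    Returns the final decision and the traffic percentage the canary ends at.
    """
    for pct in steps:
        rate = observe(pct)
        if evaluate_canary(baseline_error_rate, rate) is Decision.ROLLBACK:
            return Decision.ROLLBACK, 0  # all traffic returns to the stable version
    return Decision.PROMOTE, steps[-1]
```

Keeping the gate a pure function makes it easy to unit-test the rollback conditions before wiring it to real deployment tooling.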
Incident postmortem in a SaaS startup
Establish a blameless postmortem culture and derive preventive actions.
Implementation steps
Introduce basic monitoring and telemetry
Define runbooks, SLAs/SLOs and on‑call processes
Gradually automate critical operations
⚠️ Technical debt & bottlenecks
Technical debt
- Non‑automated deployments
- Missing structured logs and traces
- Outdated operational documentation
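To illustrate what addressing the structured-logging item could look like, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `context` key passed via `extra` are conventions invented for this example; field names such as `service` and `trace_id` are placeholders.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied fields (e.g. service, trace_id) if present.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("ops-example")
    log.info("deploy finished",
             extra={"context": {"service": "checkout", "trace_id": "abc123"}})
```

Emitting one JSON object per line lets a log pipeline filter and correlate events by field, for instance by `trace_id`, instead of parsing free text.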
Known bottlenecks
Misuse examples
- Introducing automation without monitoring
- Setting SLOs but not measuring them
- On‑call roles without sufficient training
Typical traps
- Focusing only on tools instead of processes and culture
- Excessive optimization pressure without error budgets
- Ignoring costs in scaling decisions
Required skills
Architectural drivers
Constraints
- Regulatory requirements and compliance
- Budget and staffing limits
- Technical legacy constraints