concept · #DevOps #Observability #Platform #Reliability

Operations

Overview of activities and practices for maintaining, monitoring, and evolving IT services and infrastructure.

Established
High

Classification

  • Medium
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • CI/CD systems (e.g. GitHub Actions, GitLab CI)
  • Monitoring tools (e.g. Prometheus, Grafana)
  • Cloud providers and platforms (e.g. Kubernetes)

Principles & goals

  • Automate repeatable processes
  • Define measurable service levels (SLOs)
  • Blameless postmortems and continuous improvement
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Over‑automation can create complexity and opacity
  • Missing SLOs lead to inconsistent priorities
  • Insufficient documentation increases MTTD/MTTR

  • Small, tested releases (canary/blue‑green)
  • Blameless postmortems with clear action items
  • SLO‑driven prioritization of work

I/O & resources

  • Monitoring data and telemetry
  • Service level objectives (SLOs) and SLAs
  • Automated CI/CD pipelines
  • Operational documentation and runbooks
  • Monitoring dashboards and alerts
  • Postmortems and improvement actions

Description

Operations encompasses the organizational, technical, and procedural activities to maintain, monitor, and evolve IT services. It covers incident response, release management, capacity planning, and infrastructure automation to ensure availability and stability. Operations emphasizes automation, measurable service levels, and continuous improvement across development and operations boundaries.

  • Higher availability and stability of services
  • Faster incident response and reduced downtime
  • Better predictability through capacity and cost control

  • Requires organizational alignment and clearly assigned responsibilities
  • Initial effort for automation and observability
  • Not all legacy systems are easy to automate

  • MTTR

    Mean time to restore service after failures.

  • Availability (uptime)

    Percentage of time a service is available.

  • SLO attainment rate

    Proportion of time defined SLOs are met.
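
The three metrics above can be computed from basic incident records. The following is a minimal sketch, not a prescribed method: the `Incident` record, the function names, and the choice to measure SLO attainment per measurement window (for example, per week) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage, from the moment service failed until it was restored."""
    started: datetime
    restored: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored - self.started


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to restore: the average outage duration."""
    if not incidents:
        return timedelta(0)
    total = sum((i.duration for i in incidents), timedelta(0))
    return total / len(incidents)


def availability(incidents: list[Incident], period: timedelta) -> float:
    """Percentage of the period during which the service was up.

    Assumes incidents do not overlap and fall entirely inside the period.
    """
    downtime = sum((i.duration for i in incidents), timedelta(0))
    return 100.0 * (1 - downtime / period)


def slo_attainment(window_availabilities: list[float], target: float) -> float:
    """Fraction of measurement windows whose availability met the target."""
    if not window_availabilities:
        return 0.0
    met = sum(1 for a in window_availabilities if a >= target)
    return met / len(window_availabilities)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    incidents = [
        Incident(t0, t0 + timedelta(minutes=30)),
        Incident(t0 + timedelta(days=10), t0 + timedelta(days=10, minutes=90)),
    ]
    print("MTTR:", mttr(incidents))
    print("Availability (%):", availability(incidents, timedelta(days=30)))
    print("SLO attainment:", slo_attainment([99.95, 99.80, 99.99, 99.91], 99.9))
```

Measuring attainment per window rather than per second keeps the metric easy to explain in reviews; other definitions (for example, request-based) are equally valid.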

SRE approach at a payment provider

Establishing SLOs, error budget policies and on‑call rotations to improve availability.
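
To make the idea of an error budget policy concrete, here is a minimal sketch assuming a request-based availability SLO. The function names, the 25% threshold, and the action labels are hypothetical; a real policy would be agreed between the teams involved.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests == 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def policy_action(remaining: float) -> str:
    """Map the remaining budget to an action (thresholds are illustrative)."""
    if remaining <= 0:
        return "freeze feature releases"
    if remaining < 0.25:
        return "prioritize reliability work"
    return "ship normally"


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"budget remaining: {remaining:.2f} -> {policy_action(remaining)}")
```

With a 99.9% target and one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.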

Automated rollout on Kubernetes

CI/CD pipeline with canary deployments, automated health checks and rollbacks.
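
The control flow of such a rollout can be sketched independently of any particular tool. This is an illustrative model only: `run_canary`, `error_rate_check`, the traffic steps, and the tolerance value are assumptions, and the code does not call Kubernetes or any CI/CD API.

```python
from typing import Callable


def error_rate_check(canary_errors: int, canary_total: int,
                     baseline_rate: float, tolerance: float = 0.01) -> bool:
    """Health check: the canary may not exceed the baseline error rate
    by more than `tolerance` (absolute difference)."""
    if canary_total == 0:
        return False  # no traffic observed, so health cannot be confirmed
    return canary_errors / canary_total <= baseline_rate + tolerance


def run_canary(steps: list[int],
               is_healthy: Callable[[int], bool]) -> tuple[str, int]:
    """Shift traffic to the new version step by step.

    Returns ("promoted", 100) if every step passes its health check,
    otherwise ("rolled_back", 0) at the first failing step. The second
    element is the final traffic share (%) served by the new version.
    """
    for percent in steps:
        # A real pipeline would update traffic weights here, wait for
        # metrics to accumulate, and then evaluate the health check.
        if not is_healthy(percent):
            return ("rolled_back", 0)
    return ("promoted", 100)


if __name__ == "__main__":
    # Simulated check: the new version starts failing at 50% traffic.
    print(run_canary([5, 25, 50, 100], lambda percent: percent < 50))
```

Keeping the decision logic as a pure function makes it straightforward to unit-test the rollback rule before wiring it to real metrics.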

Incident postmortem in a SaaS startup

Establish a blameless postmortem culture and derive preventive actions.

  1. Introduce basic monitoring and telemetry
  2. Define runbooks, SLAs/SLOs and on‑call processes
  3. Gradually automate critical operations

⚠️ Technical debt & bottlenecks

  • Non‑automated deployments
  • Missing structured logs and traces
  • Outdated operational documentation
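
As an illustration of what structured logs can look like, the sketch below emits one JSON object per log line using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the `trace_id` and `service` field names are hypothetical choices, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("trace_id", "service"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stderr) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("payment failed",
             extra={"trace_id": "abc123", "service": "checkout"})
```

Carrying a trace identifier in every line lets log entries be correlated with traces during incident analysis.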

  • Legacy infrastructure
  • Insufficient observability
  • Missing automation

  • Introducing automation without monitoring
  • Setting SLOs but not measuring them
  • On‑call roles without sufficient training
  • Focusing only on tools instead of processes and culture
  • Excessive optimization pressure without error budgets
  • Ignoring costs in scaling decisions

  • Systems and infrastructure knowledge
  • Monitoring and observability skills
  • Automation and scripting

  • Availability and resilience
  • Fast recoverability (MTTR)
  • Scalability and capacity planning

  • Regulatory requirements and compliance
  • Budget and staffing limits
  • Technical legacy constraints