concept · #Delivery #Governance #Observability #Reliability

Operating Processes

Operating processes describe recurring workflows, roles, and responsibilities for operating products and systems.

Established
Medium

Classification

  • Medium
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • Monitoring tools (e.g. Prometheus, Datadog)
  • Incident management platforms (e.g. PagerDuty)
  • CI/CD pipelines and orchestration tools

Principles & goals

  • Define clear roles and responsibilities.
  • Favor standardization and aim for automation.
  • Use metrics and feedback loops for continuous improvement.
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Processes become bureaucratic and slow down responses.
  • Unclear ownership leads to delays.
  • Metric fixation can lead to wrong optimizations.

  • Keep runbooks concise, current, and versioned.
  • Foster blameless postmortems.
  • Automate repetitive tasks and measure effects.
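
The advice to keep runbooks current and versioned can be turned into a small automated check. The following is a minimal sketch, not a prescribed tool: the `Runbook` record, its fields, and the 180-day review threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Runbook:
    # Hypothetical record; field names are illustrative, not from any real tool.
    name: str
    version: str
    last_reviewed: date


def stale_runbooks(runbooks, today, max_age=timedelta(days=180)):
    """Return the names of runbooks whose last review is older than max_age.

    The 180-day default is an assumed review interval, not a standard.
    """
    return [r.name for r in runbooks if today - r.last_reviewed > max_age]


if __name__ == "__main__":
    catalog = [
        Runbook("database-failover", "1.4.0", date(2024, 1, 10)),
        Runbook("cache-flush", "2.0.1", date(2024, 9, 1)),
    ]
    print(stale_runbooks(catalog, today=date(2024, 10, 1)))
```

A check like this could run on a schedule and open a review task for each stale runbook, which keeps the review itself a human decision.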

I/O & resources

  • Runbooks and operations documentation
  • Monitoring and logging data
  • Escalation and communication plans

  • Stable operating procedures and validated runbooks
  • Incident reports and postmortems
  • Metrics, dashboards, and improvement plans

Description

Operating processes provide stable, repeatable workflows for operating systems, services, and products. They define roles, responsibilities, escalation paths, and metrics for monitoring. They include procedures for deployments, monitoring, incident response, and change coordination aligned with business goals.

  • Higher operational consistency and predictability.
  • Faster incident response due to clear procedures.
  • Better alignment between operations and business goals.

  • Over-generalization can neglect local requirements.
  • Adoption requires time and cultural adjustment.
  • Excessive standardization reduces flexibility for innovation.

  • Mean Time To Repair (MTTR)

    Time from detection to restoration of a service; measures response and recovery capability.

  • Change failure rate

    Share of deployments that cause failures or rollbacks; indicates release stability.

  • Availability/Uptime

    Percentage of time a service is available; relates to SLAs and SLOs.
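
The three metrics above reduce to simple arithmetic once incident and deployment records are available. The sketch below assumes a hypothetical `Incident` record with detection and restoration timestamps; the function names are illustrative and not taken from any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; real incident tools expose richer data.
    detected_at: datetime
    restored_at: datetime


def mttr(incidents):
    """Mean time from detection to restoration across the given incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.restored_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(deployments, failed_deployments):
    """Share of deployments that caused a failure or rollback (0.0 to 1.0)."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return failed_deployments / deployments


def availability(period, downtime):
    """Percentage of the period during which the service was available."""
    if period <= timedelta():
        raise ValueError("period must be positive")
    return 100.0 * (1 - downtime / period)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    incidents = [
        Incident(start, start + timedelta(minutes=30)),
        Incident(start, start + timedelta(minutes=90)),
    ]
    print(mttr(incidents))                    # 1:00:00
    print(change_failure_rate(40, 3))         # 0.075
    print(round(availability(timedelta(days=30), timedelta(hours=2)), 2))
```

Definitions vary between organizations (for example, whether the MTTR clock starts at detection or at the start of impact), so the record fields should follow whatever definition the team has agreed on.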

Established runbooks at SaaS providers

SaaS companies use standardized runbooks for incident response and maintenance windows.

ITIL process map in mid-sized companies

Mid-sized companies adopt ITIL elements for change and incident management to harmonize processes.

SRE implementation for service stability

Teams adapt SRE principles for SLIs, SLOs, and error budgets to govern operating processes.
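
As an illustration of how an error budget can govern an operating process, the sketch below computes the unspent share of a budget from an SLO target and event counts. It assumes a request-based SLI (good events divided by total events); the function name and the simple release rule are hypothetical, not a fixed SRE policy.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


def may_release(slo_target, total_events, bad_events):
    """Assumed policy for illustration: allow releases while budget remains."""
    return error_budget_remaining(slo_target, total_events, bad_events) > 0


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows roughly 1,000 failed requests.
    print(round(error_budget_remaining(0.999, 1_000_000, 250), 3))
    print(may_release(0.999, 1_000_000, 1_200))
```

Teams typically attach different consequences to a depleted budget, such as pausing feature releases or prioritizing reliability work; the boolean rule here is only the simplest possible form.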

  1. Take inventory of existing processes and tools.
  2. Define roles, responsibilities, and escalation paths.
  3. Create and validate runbooks for critical paths.
  4. Automate repeatable steps and integrate into CI/CD.
  5. Introduce metrics, dashboards, and regular reviews.
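
The step of defining roles and escalation paths can be captured as data rather than prose, which makes it reviewable and testable. The following is a minimal sketch under stated assumptions: the role names, the time thresholds, and the `EscalationLevel` type are all hypothetical and would be replaced by an organization's own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    role: str
    after_minutes: int  # minutes without acknowledgement before this role is paged


# Assumed example policy; roles and thresholds are illustrative only.
ESCALATION_PATH = [
    EscalationLevel("on-call engineer", 0),
    EscalationLevel("team lead", 15),
    EscalationLevel("incident manager", 45),
]


def roles_to_page(minutes_unacknowledged, path=ESCALATION_PATH):
    """Return every role that should have been paged by now, in path order."""
    return [
        level.role
        for level in path
        if minutes_unacknowledged >= level.after_minutes
    ]


if __name__ == "__main__":
    print(roles_to_page(20))  # ['on-call engineer', 'team lead']
```

Incident management platforms usually offer their own escalation configuration; a plain data structure like this is mainly useful for documenting and reviewing the intended path before it is configured in such a tool.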

⚠️ Technical debt & bottlenecks

  • Outdated runbooks requiring manual interventions.
  • Non-automated deployments as a recurring bottleneck.
  • Lack of observability in critical service paths.

  • Manual steps
  • Unclear escalation paths
  • Resource constraints

  • Processes implemented only to satisfy audits, not to improve efficiency.
  • Runbooks that are outdated and provide incorrect guidance.
  • Overly rigid change gates that block rapid security fixes.
  • Allowing too many exceptions and diluting processes.
  • Measuring metrics without clear actionability.
  • Introducing governance without operational support.

  • Operations and monitoring skills
  • Incident management and communication
  • Basic automation and scripting knowledge

  • Availability and fault tolerance
  • Automatability and repeatability
  • Transparency through metrics and monitoring

  • Limited personnel resources for 24/7 operations.
  • Regulatory requirements for processes and audits.
  • Technical dependencies between services.