Observability & Monitoring
Observability and monitoring make the internal state of complex systems visible through telemetry, so that teams can detect, diagnose and resolve problems.
Alerting
A process for detecting critical events and notifying the people responsible for responding to them.
Incident Management
A systematic approach to identifying and resolving incidents in IT environments.
On-Call
Organized team duty to respond to incidents and operational disruptions outside regular hours. The purpose is rapid recovery, minimal downtime, and clear escalation paths.
Error Budget Policy
A policy that defines a service's error budget (the amount of unreliability its SLO tolerates) and the organizational actions triggered when that budget is exhausted.
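The arithmetic behind such a policy can be sketched as follows. This is a minimal illustration for a request-based SLO; the function name `evaluate_error_budget`, the 75% warning threshold and the action labels are all assumptions chosen for the example, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    consumed: float          # fraction of the budget used; 1.0 means fully spent
    action: str              # hypothetical policy action


def evaluate_error_budget(slo_target: float, total_requests: int,
                          failed_requests: int) -> BudgetStatus:
    """Compute error budget consumption for one SLO window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The error budget is the share of requests that may fail: 1 - SLO.
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed

    # Assumed thresholds; a real policy would define its own.
    if consumed >= 1.0:
        action = "freeze feature releases"
    elif consumed >= 0.75:
        action = "review risky changes"
    else:
        action = "normal operations"
    return BudgetStatus(allowed, consumed, action)


status = evaluate_error_budget(0.999, 1_000_000, 250)
print(status.action, round(status.consumed, 2))
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated, so 250 failures consume about a quarter of the budget.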
Observability Practice
A conceptual guide for systematically capturing, correlating and analysing telemetry (metrics, traces, logs) to enable fast debugging and performance optimisation.
Service Level Objective (SLO)
A Service Level Objective (SLO) defines a measurable target for a service's reliability or performance over a time window, for example 99.9% of requests succeeding over 30 days.
Instrumentation
Strategic collection of telemetry from software and infrastructure to make behavior, performance and operational state measurable.
Telemetry Collection
Concept for systematically collecting and forwarding metrics, logs and traces to support observability and operations.
OpenTelemetry
Open standard and toolkit for instrumenting and collecting traces, metrics and logs via SDKs, collectors and exporters.
Distributed Tracing
Technique for tracking and correlating requests across services to make performance issues and root causes in distributed systems visible.
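Correlation across services relies on every hop carrying the same trace identifier. The sketch below shows the idea using a header string shaped like the W3C Trace Context `traceparent` value (version, trace ID, span ID, flags). The helper names are hypothetical, and the code skips the validation a real tracing library would perform.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(parent: str) -> str:
    """Continue a trace in a downstream service.

    The trace ID is kept so spans can be correlated; only the
    span ID changes to identify the new unit of work.
    """
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # created at the edge service
outgoing = child_traceparent(incoming)  # sent with the next outbound call
print(incoming)
print(outgoing)
```

In practice an instrumentation library such as OpenTelemetry handles this propagation; the sketch only illustrates why spans from different services can be joined into a single trace.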
Logs
Time-ordered records of events and state changes used for debugging, monitoring, and forensic analysis.
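Logs are easier to search and correlate when each record is structured. The following is a minimal sketch using Python's standard `logging` module to emit one JSON object per event; the logger name `checkout`, the field names and the `trace_id` attribute are assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Optional correlation field, supplied via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        })


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("payment failed", extra={"trace_id": "abc123"})
print(buffer.getvalue().strip())
```

Including a trace identifier in each record is one way to link a log line back to the request that produced it.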
Metrics
Numeric measurements recorded over time, such as request rate, error rate and latency, used to measure and analyze the performance and efficiency of systems and processes.
Dependency Mapping
Systematic capture and visualization of dependencies between components, services and teams to support architecture and decision-making processes.
Service Map
Visual representation of services and their runtime dependencies to analyze communication, impact and failure sources.
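One common way to obtain the data behind a service map is to derive it from trace spans: whenever a span's parent belongs to a different service, that pair is a runtime dependency. The sketch below assumes a simplified span shape (dictionaries with `span_id`, `parent_id` and `service` keys), which is an illustrative format rather than the schema of any particular tracing backend.

```python
from collections import Counter


def build_service_map(spans: list[dict]) -> Counter:
    """Count caller -> callee edges between services from parent/child spans."""
    by_id = {span["span_id"]: span for span in spans}
    edges: Counter = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only cross-service calls become edges; in-process child spans are skipped.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "checkout"},
    {"span_id": "c", "parent_id": "b", "service": "payments"},
    {"span_id": "d", "parent_id": "b", "service": "checkout"},
    {"span_id": "e", "parent_id": "b", "service": "payments"},
]

for (caller, callee), calls in build_service_map(spans).items():
    print(f"{caller} -> {callee}: {calls} call(s)")
```

The resulting edge counts can feed a graph visualization, with the call count as an edge weight.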
Data Visualization
Data visualization is the graphical representation of data to make patterns, trends, and insights visible.
Observability Dashboard
Central dashboard for visualizing and analyzing telemetry (metrics, logs, traces) to enable rapid incident diagnosis and performance monitoring.
Grafana
Grafana is an open-source platform for querying, visualizing and alerting on data, such as metrics, logs and traces, from multiple data sources.