Incident Management & Resilience
Incident Management and Resilience covers how incidents are identified and managed, and how systems recover afterwards. It includes strategies to minimize risk and keep operations running during a crisis.
Alerting
A process for monitoring systems and notifying responders of critical events.
Incident Detection
Concept for systematically detecting operational outages, performance deviations, and security incidents based on observability signals and defined alerting criteria.
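As an illustration of "defined alerting criteria", the sketch below flags a metric only when it stays above a threshold for several consecutive samples, which is one common way to avoid alerting on a single spike. The function name, the threshold, and the sample values are assumptions made for this example and do not come from any particular monitoring tool.

```python
from typing import List


def detect_sustained_breach(
    samples: List[float], threshold: float, min_consecutive: int
) -> List[int]:
    """Return the start index of every run in which `samples` stays above
    `threshold` for at least `min_consecutive` consecutive points.

    Hypothetical criterion for illustration; real criteria depend on the signal.
    """
    starts: List[int] = []
    run_start = 0
    run_length = 0
    for index, value in enumerate(samples):
        if value > threshold:
            if run_length == 0:
                run_start = index  # a new run of breaching samples begins here
            run_length += 1
            if run_length == min_consecutive:
                starts.append(run_start)  # report each run once, when it qualifies
        else:
            run_length = 0  # the run is broken by a healthy sample
    return starts


# Assumed error-rate samples; a breach must last two samples to count.
error_rate = [0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.95, 0.1]
incidents = detect_sustained_breach(error_rate, threshold=0.5, min_consecutive=2)
```

Requiring a minimum run length trades detection speed for fewer false alarms; the right balance depends on the service.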
On-Call
Organized team duty to respond to incidents and operational disruptions outside regular hours. Its purpose is rapid recovery, minimal downtime, and clear escalation paths.
Incident Classification
Systematic rules to categorize and prioritize operational incidents to drive escalation and resource allocation.
Service Impact
Analysis and assessment of how incidents or performance issues affect a service's functionality and availability.
Severity Levels
Categories that rank the impact and urgency of incidents to drive prioritization, escalation, and response times in operations.
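One way to combine impact and urgency into a severity level is a small lookup rule. The sketch below is a hypothetical scheme: the four-level scale (SEV1 to SEV4), the 1-to-3 rating for impact and urgency, and the summing rule are assumptions for illustration, not a standard.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # least severe


def classify(impact: int, urgency: int) -> Severity:
    """Map impact and urgency (each 1 = high, 2 = medium, 3 = low) to a severity.

    Assumed rule: add the two ratings; the lower the sum, the higher the severity.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must each be 1, 2, or 3")
    score = impact + urgency
    if score == 2:
        return Severity.SEV1
    if score == 3:
        return Severity.SEV2
    if score == 4:
        return Severity.SEV3
    return Severity.SEV4


# A high-impact, medium-urgency incident under this assumed scheme.
level = classify(impact=1, urgency=2)
```

Each severity level would then be tied to its own escalation path and target response time.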
Continuous Improvement
An ongoing, systematic approach to identify and implement improvements in products, processes and organizations. Focuses on iterative cycles, data-informed decisions and team-led actions.
Postmortem
A formal review after an incident to determine root causes, document findings, and derive improvements.
Root Cause Analysis (RCA)
A structured approach to identify the root causes of problems.
Business Continuity Management (BCM)
BCM is a strategic approach that ensures continuity of critical business processes during disruptions. It combines risk assessment, contingency planning and recovery with governance and testing.
Disaster Recovery
Strategies, processes and technical measures to restore IT systems and data after major outages or disasters.
Recovery Point Objective (RPO)
RPO defines the maximum tolerable amount of data loss measured in time and serves as a target for backup and replication strategies.
Recovery Time Objective (RTO)
RTO defines the maximum tolerable time within which an IT service must be restored after an outage to limit business impact.
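RPO and RTO can both be checked against the timeline of an outage: the data loss window runs from the last usable recovery point to the failure, and the downtime runs from the failure to the restoration. The sketch below works through one such check; the function name, the timestamps, and the one-hour targets are assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Dict


def check_recovery_objectives(
    last_recovery_point: datetime,
    failure_time: datetime,
    restored_time: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> Dict[str, object]:
    """Compare an outage timeline against RPO and RTO targets."""
    # Data written after the last backup or replica sync is lost.
    data_loss_window = failure_time - last_recovery_point
    # The service is unavailable from the failure until it is restored.
    downtime = restored_time - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Assumed timeline: backup at 10:00, failure at 10:45, service restored at 12:15.
result = check_recovery_objectives(
    last_recovery_point=datetime(2024, 1, 1, 10, 0),
    failure_time=datetime(2024, 1, 1, 10, 45),
    restored_time=datetime(2024, 1, 1, 12, 15),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=1),
)
```

In this assumed timeline the 45-minute data loss window stays within the one-hour RPO, while the 90-minute downtime exceeds the one-hour RTO.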
Graceful Degradation
An architectural principle that preserves core functionality under partial failure by sacrificing less critical features.
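A minimal sketch of this principle, assuming a hypothetical product page whose core data is essential and whose recommendations are optional. The function and field names are invented for the example.

```python
from typing import Callable, Dict, List


def render_product_page(
    product_id: str,
    load_product: Callable[[str], Dict[str, str]],
    load_recommendations: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Build a page that survives the loss of its non-critical dependency."""
    # Core feature: without product data there is no page, so errors propagate.
    product = load_product(product_id)
    try:
        recommendations = load_recommendations(product_id)
        degraded = False
    except Exception:
        # Less critical feature: drop it and serve the rest of the page.
        recommendations = []
        degraded = True
    return {
        "product": product,
        "recommendations": recommendations,
        "degraded": degraded,
    }


def working_product_loader(product_id: str) -> Dict[str, str]:
    return {"id": product_id, "name": "Example product"}


def failing_recommender(product_id: str) -> List[str]:
    raise TimeoutError("recommendation service unavailable")


page = render_product_page("p-1", working_product_loader, failing_recommender)
```

The `degraded` flag lets callers and dashboards see that the page was served in a reduced state rather than hiding the failure.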
Redundancy
Strategy to increase availability and fault tolerance by provisioning additional components, replication, and failover.
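Failover across redundant components can be sketched as trying each replica in turn until one succeeds. The helper below is a simplified illustration with hypothetical names; production failover usually adds health checks, timeouts, and retry limits.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Return the result of the first replica that succeeds.

    Raises RuntimeError if every replica fails (or if none are given).
    """
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as error:
            last_error = error  # remember the failure and try the next replica
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    raise ConnectionError("primary is down")


def secondary() -> str:
    return "served by secondary"


response = call_with_failover([primary, secondary])
```

The order of `replicas` encodes the preference: the primary is tried first and the secondary only takes over when the primary fails.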
Resilience Engineering
A systems-focused concept for designing and governing robust, adaptive systems to preserve service quality under disruption.
Incident Management Process
Process for structured detection, escalation and resolution of IT incidents with defined roles, communication channels and post-incident reviews.
Incident Command System (ICS)
The Incident Command System (ICS) is a standardized leadership and coordination framework for managing emergencies and complex incidents across agency and organizational boundaries.
Incident Response
Structured process for detecting, analysing and containing security incidents and restoring normal operations.