Redundancy
Strategy to increase availability and fault tolerance through additional components, replication, and failover.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
Risks
- Split-brain situations with insufficient coordination
- Cost escalation from uncontrolled overprovisioning
- Untested failover paths leading to outages
Mitigations
- Automated testing of failover scenarios
- Documented recovery runbooks and responsibilities
- Measurable SLAs and continuous monitoring
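The split-brain risk listed above is commonly reduced by letting a partition accept writes only while it can reach a strict majority of the cluster. The following is a minimal sketch of that rule; the function name `has_quorum` and its parameters are illustrative assumptions, not part of any specific product.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this partition sees a strict majority of the cluster.

    reachable_nodes counts the local node plus every peer it can reach.
    Hypothetical helper for illustration only.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    # Strict majority: with an even split, neither side may act as primary.
    return reachable_nodes * 2 > cluster_size


if __name__ == "__main__":
    print(has_quorum(3, 5))
    print(has_quorum(2, 4))
```

With an even-sized cluster, an exact half/half partition leaves neither side with quorum, which is one reason odd cluster sizes are often preferred.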
I/O & resources
Inputs
- Availability requirements and RTO/RPO
- Inventory of critical components and dependencies
- Budget and operational constraints
Outputs
- Redundancy architecture design with fallback paths
- Test and monitoring plans for failover scenarios
- Metrics and SLAs for availability measurement
Description
Redundancy is the deliberate provisioning of additional components or capacity to tolerate failures and increase availability. It includes active and passive replication, geographic distribution, and failover strategies. The granularity and placement of redundant components affect cost, consistency, and recovery time. Planning, monitoring, and regular testing are essential for redundancy to deliver the resilience it is meant to provide.
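To make the availability gain concrete, the sketch below estimates the combined availability of several identical replicas where one working replica is enough. It assumes replica failures are statistically independent, which real systems often violate (shared power, network, or software bugs), so the result is an optimistic upper bound. The function name `parallel_availability` is a hypothetical helper.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical components in parallel.

    Assumes independent failures: the system is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be within [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under this assumption, each added replica multiplies the remaining unavailability by the single-component unavailability, while cost grows roughly linearly.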
✔ Benefits
- Increased availability and reduced downtime
- Improved fault tolerance and business continuity
- Predictable recovery times through deterministic fallbacks
✖ Limitations
- Increased cost due to additional hardware/instances
- Complexity around consistency and synchronization
- Misconfiguration can create a false sense of security
Trade-offs
Metrics
- Availability (uptime)
Percentage of time the system is operational.
- Mean Time To Recover (MTTR)
Average time to recover after a failure.
- Failover success rate
Share of successful automatic or manual failover operations.
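The three metrics above could be derived from incident records roughly as follows. This is a sketch under assumptions: the `Incident` record, its field names, and the choice to measure in minutes are all illustrative and not taken from any monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record used only for this illustration."""
    downtime_minutes: float
    failover_attempted: bool
    failover_succeeded: bool


def availability_pct(incidents: Sequence[Incident], period_minutes: float) -> float:
    """Percentage of the observation period during which the system was up."""
    downtime = sum(i.downtime_minutes for i in incidents)
    return 100.0 * (period_minutes - downtime) / period_minutes


def mttr_minutes(incidents: Sequence[Incident]) -> Optional[float]:
    """Mean downtime per incident, or None when there were no incidents."""
    if not incidents:
        return None
    return sum(i.downtime_minutes for i in incidents) / len(incidents)


def failover_success_rate(incidents: Sequence[Incident]) -> Optional[float]:
    """Share of attempted failovers that succeeded, or None without attempts."""
    attempts = [i for i in incidents if i.failover_attempted]
    if not attempts:
        return None
    return sum(1 for i in attempts if i.failover_succeeded) / len(attempts)


if __name__ == "__main__":
    log = [Incident(30, True, True), Incident(90, True, False)]
    print(availability_pct(log, period_minutes=30 * 24 * 60))
    print(mttr_minutes(log), failover_success_rate(log))
```

Returning `None` instead of `0` when there is no data keeps "no incidents" distinguishable from "instant recovery" or "every failover failed".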
Examples & implementations
Database replica cluster
Primary/secondary replication to minimize downtime and enable quick recovery.
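One way to picture the failover decision in a primary/secondary setup is the sketch below: keep the current primary while it is healthy, otherwise promote the healthy secondary with the smallest replication lag. The `Node` type, its fields, and `elect_primary` are assumptions made for illustration; production databases use their own election and fencing mechanisms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a database node as seen by a failover controller."""
    name: str
    healthy: bool
    replication_lag_s: float = 0.0  # seconds behind the primary


def elect_primary(current_primary: Node, secondaries: List[Node]) -> Optional[Node]:
    """Pick the node that should serve as primary.

    Returns None when no healthy node exists, so the caller can alert
    instead of promoting a broken replica.
    """
    if current_primary.healthy:
        return current_primary
    candidates = [n for n in secondaries if n.healthy]
    if not candidates:
        return None
    # Least lag means the least data at risk of being lost on promotion.
    return min(candidates, key=lambda n: n.replication_lag_s)


if __name__ == "__main__":
    primary = Node("db-1", healthy=False)
    replicas = [Node("db-2", True, 5.0), Node("db-3", True, 1.0)]
    chosen = elect_primary(primary, replicas)
    print(chosen.name if chosen else "no healthy node")
```

Choosing by replication lag ties the promotion decision to the recovery point objective: the lag of the promoted replica approximates the data that may be lost.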
Load-balanced microservice farm
Multiple stateless service instances behind a load balancer for horizontal scaling and redundancy.
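The idea of routing around failed instances can be sketched with a small round-robin balancer that skips instances marked as down. The class `RoundRobinBalancer` and its methods are hypothetical; real load balancers add connection draining, weighting, and active health probes.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Illustrative round-robin selection over healthy instances only."""

    def __init__(self, instances: Iterable[str]) -> None:
        self._instances: List[str] = list(instances)
        self._healthy: Set[str] = set(self._instances)
        self._next = 0

    def mark_down(self, instance: str) -> None:
        self._healthy.discard(instance)

    def mark_up(self, instance: str) -> None:
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self) -> str:
        """Return the next healthy instance, visiting each at most once."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["svc-a", "svc-b", "svc-c"])
    lb.mark_down("svc-b")
    print([lb.pick() for _ in range(4)])
```

Because the instances are stateless, any healthy instance can serve any request, which is what makes this simple skipping strategy sufficient.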
Geo-redundant storage archives
Data replicated across regions to prevent loss during regional outages.
Implementation steps
Analyze requirements and identify critical paths
Design redundant topologies and failover strategies
Implement replication, load balancing, and health checks
Test regularly, set up monitoring, and document procedures
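For the health-check part of these steps, a common pattern is to declare a component unhealthy only after several consecutive failed probes, so that a single dropped probe does not trigger a failover. The sketch below shows that bookkeeping; `HealthTracker` and its default threshold of 3 are illustrative assumptions.

```python
class HealthTracker:
    """Tracks probe results and flags a component after repeated failures."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self._threshold = failure_threshold
        self._consecutive_failures = 0

    @property
    def healthy(self) -> bool:
        return self._consecutive_failures < self._threshold

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            # Any successful probe resets the failure streak.
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
        return self.healthy


if __name__ == "__main__":
    tracker = HealthTracker(failure_threshold=2)
    for result in (False, False, True):
        print(result, tracker.record(result))
```

The threshold trades detection speed against false positives: a lower value shortens recovery time but makes spurious failovers more likely.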
⚠️ Technical debt & bottlenecks
Technical debt
- Untested or manual failover mechanisms
- Legacy replication solutions with poor observability
- Unclear ownership for backup and recovery processes
Known bottlenecks
Misuse examples
- Replicating sensitive data without privacy checks
- Using redundant hardware without monitoring
- Multiple failover layers without a clear ownership process
Typical traps
- Unconsidered latency in geo-replicated setups
- Complex synchronization logic introduces failure sources
- Missing tests for rare failure cases
Required skills
Architectural drivers
Constraints
- Budget restrictions for additional resources
- Network latency between replication sites
- Regulatory requirements on data location