concept #Architecture #Reliability #Observability #Software Engineering

Redundancy

Strategy to increase availability and fault tolerance by provisioning additional components, replication, and failover.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Load balancers and service discovery systems
  • Backup and replication mechanisms
  • Observability tools (logging, metrics, tracing)

Principles & goals

  • Favor simplicity: minimal necessary redundancy
  • Defined failure-mode handling
  • Regular testing and validation
Build
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Split-brain situations with insufficient coordination
  • Cost escalation from uncontrolled overprovisioning
  • Untested failover paths leading to outages

  • Automated testing of failover scenarios
  • Documented recovery runbooks and responsibilities
  • Measurable SLAs and continuous monitoring
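
The split-brain risk listed above is commonly mitigated by requiring a strict majority (quorum) before a partition may act as the active side. The following is a minimal sketch under that assumption; the function name `has_quorum` and its parameters are hypothetical and not taken from any specific cluster manager.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """Return True if this partition sees a strict majority of the cluster.

    reachable_members counts the local node plus every peer it can reach.
    Only a partition with a strict majority may keep (or take over) the
    active role, so two disjoint partitions can never both qualify.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_members <= cluster_size:
        raise ValueError("reachable_members must be between 0 and cluster_size")
    return reachable_members > cluster_size // 2
```

Under this rule an even split of a four-node cluster leaves neither side with quorum, which is one reason odd cluster sizes are often preferred.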

I/O & resources

  • Availability requirements and RTO/RPO
  • Inventory of critical components and dependencies
  • Budget and operational constraints

  • Redundancy architecture design with fallback paths
  • Test and monitoring plans for failover scenarios
  • Metrics and SLAs for availability measurement
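
RTO and RPO targets become testable once a failover drill records how long recovery took and how much data was at risk. The sketch below compares one drill against the targets; the class and field names (`Objectives`, `DrillResult`, `meets_objectives`) are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    rto_seconds: float  # Recovery Time Objective: maximum tolerated time to restore service
    rpo_seconds: float  # Recovery Point Objective: maximum tolerated window of lost data


@dataclass(frozen=True)
class DrillResult:
    recovery_seconds: float          # measured time until service was restored
    data_loss_window_seconds: float  # age of the newest data that survived the failover


def meets_objectives(result: DrillResult, objectives: Objectives) -> bool:
    """Return True only if the drill stayed within both the RTO and the RPO."""
    return (
        result.recovery_seconds <= objectives.rto_seconds
        and result.data_loss_window_seconds <= objectives.rpo_seconds
    )
```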

Description

Redundancy is the deliberate provisioning of additional components or capacity to tolerate failures and increase availability. It includes active and passive replication, geographic distribution, and failover strategies. The granularity and placement of redundant components affect cost, consistency, and recovery time. Planning, monitoring, and regular testing are essential to ensure that redundant systems actually deliver the intended resilience.
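
To illustrate how additional components raise availability, the sketch below uses the standard parallel-availability formula 1 - (1 - a)^n. It assumes that replicas fail independently and that a single working replica is enough to serve traffic; both are simplifying assumptions, and the function names are hypothetical.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent components when any one suffices.

    The system is down only if every replica is down at the same time,
    which under independence has probability (1 - a) ** n.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60
```

With two independent replicas that are each available 99% of the time, the sketch yields roughly 99.99%. Correlated failures (shared power, network, or deployment pipeline) make real gains smaller.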

  • Increased availability and reduced downtime
  • Improved fault tolerance and business continuity
  • Predictable recovery times through deterministic fallbacks

  • Increased cost due to additional hardware/instances
  • Complexity around consistency and synchronization
  • Misconfiguration can create a false sense of security

  • Availability (uptime)

    Percentage of time the system is operational.

  • Mean Time To Recover (MTTR)

    Average time to recover after a failure.

  • Failover success rate

    Share of successful automatic or manual failover operations.
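
The three metrics above can be derived from a simple incident log. The following is a minimal sketch, assuming each incident records its downtime and whether a failover was attempted and succeeded; the `Incident` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Incident:
    downtime_seconds: float
    failover_attempted: bool
    failover_succeeded: bool


def availability(period_seconds: float, incidents: Sequence[Incident]) -> float:
    """Fraction of the observation period during which the system was operational."""
    downtime = sum(i.downtime_seconds for i in incidents)
    return (period_seconds - downtime) / period_seconds


def mttr_seconds(incidents: Sequence[Incident]) -> float:
    """Mean time to recover: average downtime per incident (0.0 if there were none)."""
    if not incidents:
        return 0.0
    return sum(i.downtime_seconds for i in incidents) / len(incidents)


def failover_success_rate(incidents: Sequence[Incident]) -> Optional[float]:
    """Share of attempted failovers that succeeded, or None if none were attempted."""
    attempts = [i for i in incidents if i.failover_attempted]
    if not attempts:
        return None
    return sum(1 for i in attempts if i.failover_succeeded) / len(attempts)
```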

Database replica cluster

Primary/secondary replication to minimize downtime and enable quick recovery.
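
One way to picture failover in such a cluster is a selection rule that keeps a healthy primary and otherwise promotes the healthy secondary that has replicated the most data. This is a sketch under stated assumptions: the `Node` type and the `replicated_position` field (a stand-in for a replication log offset) are hypothetical, and real database systems add fencing and coordination that are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool
    replicated_position: int  # hypothetical replication log offset; higher means more up to date


def choose_primary(current_primary: Node, secondaries: List[Node]) -> Optional[Node]:
    """Keep a healthy primary; otherwise promote the most up-to-date healthy secondary.

    Returns None when no healthy node is available.
    """
    if current_primary.healthy:
        return current_primary
    candidates = [node for node in secondaries if node.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda node: node.replicated_position)
```

Preferring the most up-to-date secondary keeps the amount of lost data small, which ties the failover rule back to the recovery point objective.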

Load-balanced microservice farm

Multiple stateless service instances behind a load balancer for horizontal scaling and redundancy.
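
The idea can be sketched as a round-robin selector that skips instances failing a health check. The class name `RoundRobinBalancer` and the `is_healthy` callback are assumptions made for illustration; production load balancers typically cache health state rather than probing on every request.

```python
from typing import Callable, List


class RoundRobinBalancer:
    """Rotate through instances, skipping any that the health check rejects."""

    def __init__(self, instances: List[str], is_healthy: Callable[[str], bool]):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._is_healthy = is_healthy
        self._next = 0

    def pick(self) -> str:
        # Try each instance at most once per call, starting after the last pick.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy instance available")
```

Because the instances are stateless, any healthy one can serve a request, so a failed instance costs capacity but not correctness.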

Geo-redundant storage archives

Data replicated across regions to prevent loss during regional outages.

1. Analyze requirements and identify critical paths
2. Design redundant topologies and failover strategies
3. Implement replication, load balancing, and health checks
4. Regular testing, monitoring setup, and documentation

⚠️ Technical debt & bottlenecks

  • Untested or manual failover mechanisms
  • Legacy replication solutions with poor observability
  • Unclear ownership for backup and recovery processes

  • Single point of failure
  • State synchronization
  • Capacity planning

  • Replicating sensitive data without privacy checks
  • Using redundant hardware without monitoring
  • Multiple failover layers without clear ownership process
  • Unconsidered latency in geo-replicated setups
  • Complex synchronization logic introduces failure sources
  • Missing tests for rare failure cases

  • System architecture and availability planning
  • Operational experience with failover and backup processes
  • Knowledge of networking and data replication

  • Availability
  • Fault tolerance
  • Business continuity

  • Budget restrictions for additional resources
  • Network latency between replication sites
  • Regulatory requirements on data location
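
The latency and data-location constraints above can be applied as a filter when choosing replication sites. The sketch below is illustrative only: the `Site` type, its fields, and the thresholds passed in are assumptions, and the latency values are expected to come from real measurements.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Site:
    name: str
    region: str
    latency_ms: float  # assumed input: measured round-trip latency to the primary site


def eligible_replica_sites(
    candidates: List[Site], allowed_regions: Set[str], max_latency_ms: float
) -> List[Site]:
    """Keep sites that are in a permitted region and within the latency limit."""
    return [
        site
        for site in candidates
        if site.region in allowed_regions and site.latency_ms <= max_latency_ms
    ]
```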