concept#Cloud#Architecture#Platform#Reliability

Cloud Design Pattern

Reusable architectural patterns for building scalable, resilient cloud systems.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • Kubernetes / container orchestration
  • Managed cloud services (storage, messaging, DB)
  • Observability toolchain (Prometheus, Grafana, Jaeger)

Principles & goals

  • Decouple components to limit failure propagation
  • Design for failure and recovery rather than perfect availability
  • Limit blast radius using isolation and quotas
Build
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Misconfiguration (e.g., overly tight thresholds) reduces effectiveness
  • Vendor lock-in due to platform-specific implementations
  • Insufficient monitoring prevents early detection of side effects

  • Start with simple, well-understood patterns and iterate
  • Validate parameters and thresholds empirically
  • Document boundaries, assumptions and operational needs

I/O & resources

  • Non-functional requirements (SLA, RTO/RPO)
  • Architecture and operational metrics
  • Platform capabilities and constraints

  • Recommended pattern set and design decisions
  • Configuration and operational guidelines
  • Monitoring and test requirements

Description

Cloud design patterns are reusable architectural solutions for common challenges when building scalable, resilient, and maintainable cloud-native systems. They describe proven structures and practices (such as circuit breakers, bulkheads, and autoscaling) for managing failure, latency, state, and tenancy across cloud platforms. They give architects and engineering teams a decision framework and reference that covers technology, operations, and organizational concerns.

  • Faster architectural decisions based on proven solutions
  • Increased resilience and better handling of partial failures
  • Improved scalability through repeatable patterns

  • Patterns are not full implementation instructions
  • Not all patterns fit every platform or use case
  • Excessive use can increase complexity and cost

  • Availability (SLA)

    Percentage uptime of the service function over the observation period.

  • Failure propagation rate

    Share of failures that propagate across system boundaries.

  • Response time p95

    95th percentile of end-user request latency.
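The three metrics above can be computed from raw operational data. The following is a minimal sketch, not a reference implementation: the function names and input shapes are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
def availability(uptime_seconds: float, period_seconds: float) -> float:
    """Percentage of the observation period during which the service was up."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100.0 * uptime_seconds / period_seconds


def failure_propagation_rate(total_failures: int, propagated_failures: int) -> float:
    """Share of failures (0.0 to 1.0) that crossed a system boundary."""
    if total_failures == 0:
        return 0.0  # assumption: no failures means nothing propagated
    return propagated_failures / total_failures


def p95(latencies_ms: list[float]) -> float:
    """95th percentile of request latency, nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Monitoring stacks often interpolate percentiles from histogram buckets instead, so values from a dashboard may differ slightly from this exact-sample calculation.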

Auto-scaling an e-commerce platform

Use of load-based auto-scaling combined with circuit breakers to stabilize checkout processes during traffic spikes.
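A load-based scaling decision for this scenario might look like the sketch below. It uses the ratio rule that Kubernetes documents for its Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)); the function name, the metric, and the `min_replicas` / `max_replicas` bounds are illustrative assumptions, with the upper bound acting as a simple cost guard.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_load: float,   # e.g. average requests per second per replica
    target_load: float,    # load per replica the service should run at
    min_replicas: int = 2,   # hypothetical floor for availability
    max_replicas: int = 20,  # hypothetical ceiling to cap cost
) -> int:
    """Ratio-based replica count, clamped to [min_replicas, max_replicas]."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas running at 90 requests per second each against a target of 60 yields six replicas, while a much larger spike is held at the configured ceiling.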

Bulkheads in payment processing

Segmentation of resources for payment services to isolate failures from other subsystems.
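One common way to segment resources is to give each subsystem its own bounded pool of concurrent calls. The sketch below assumes a thread-based service and uses a semaphore as the pool; the `Bulkhead` and `BulkheadFullError` names are hypothetical.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free capacity."""


class Bulkhead:
    """Caps concurrent calls into one subsystem so it cannot starve others."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast rather than queue: a saturated compartment rejects work
        # immediately, which keeps latency from spreading to callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

A payment service would get its own instance, for example `Bulkhead("payments", 10)`, sized separately from the compartments used by other subsystems.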

CQRS for high write and read demands

Separation of read and write paths to optimize performance and scalability in a cloud environment.
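The separation can be illustrated with a deliberately small in-memory sketch. All class and field names are hypothetical; in a real system the write side and the read side would typically use separate stores, with events delivered asynchronously between them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlaceOrder:
    """Command: expresses intent to change state."""
    order_id: str
    customer: str
    total_cents: int


@dataclass
class OrderWriteModel:
    """Write path: validates commands and records the resulting events."""
    events: list = field(default_factory=list)

    def handle(self, cmd: PlaceOrder) -> dict:
        if cmd.total_cents <= 0:
            raise ValueError("order total must be positive")
        event = {
            "type": "OrderPlaced",
            "order_id": cmd.order_id,
            "customer": cmd.customer,
            "total_cents": cmd.total_cents,
        }
        self.events.append(event)
        return event


@dataclass
class CustomerTotalsReadModel:
    """Read path: a denormalized view optimized for one query."""
    totals: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            customer = event["customer"]
            self.totals[customer] = self.totals.get(customer, 0) + event["total_cents"]

    def total_for(self, customer: str) -> int:
        return self.totals.get(customer, 0)
```

Because the read model is built only from events, it can be scaled, rebuilt, or reshaped independently of the write path.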

  1. Assess requirements and select relevant patterns
  2. Create a proof of concept for critical patterns
  3. Integrate with platform tools and automation
  4. Add observability, test, and roll out in phases

⚠️ Technical debt & bottlenecks

  • Temporary workarounds instead of stable isolation create long-term complexity
  • Incomplete implementation of retry and backoff strategies
  • Missing tests and chaos engineering experiments to validate patterns
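For the retry and backoff item, a more complete strategy usually bounds the number of attempts, grows the delay exponentially, caps it, and adds jitter. The sketch below is one way to do that under those assumptions; the function name, parameters, and defaults are hypothetical, and `sleep` and `rng` are injectable only to make the behavior testable.

```python
import random
import time


def retry_with_backoff(
    func,
    max_attempts: int = 5,
    base_delay: float = 0.1,   # seconds before the first retry (upper bound)
    max_delay: float = 5.0,    # cap on any single delay
    retry_on: tuple = (Exception,),  # narrow this to transient errors in practice
    sleep=time.sleep,
    rng=random.random,
):
    """Call func, retrying on failure with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay in [0, ceiling) spreads out retries
            # from many clients so they do not hit the dependency in lockstep.
            sleep(rng() * ceiling)
```

Retries should only wrap idempotent operations, and they combine poorly with an unbounded attempt count, since that turns a brief outage into sustained extra load.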

  • Network bandwidth
  • Database throughput
  • Configuration complexity

  • A circuit breaker with too short a reset time flaps constantly
  • Bulkheads at the wrong granularity waste resources
  • Auto-scaling without cost controls causes unexpectedly high cloud bills

  • Ignoring observability requirements before rollout
  • Unclear responsibilities for pattern-related operations
  • Overreliance on platform features without fallback strategies
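The flapping problem mentioned above is easier to see with a concrete circuit breaker. The sketch below is a simplified, single-threaded version with hypothetical names and defaults: `reset_timeout` is the time the breaker stays open before it allows a trial call, so a value shorter than the dependency's real recovery time makes it cycle between open and half-open.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func):
        if self.state == "open":
            raise CircuitOpenError("dependency is considered unavailable")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call in half-open, or too many failures while
            # closed, (re)opens the circuit and restarts the reset timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Validating `failure_threshold` and `reset_timeout` against observed recovery times, rather than guessing, is what keeps the breaker from oscillating.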

  • Cloud architecture and infrastructure knowledge
  • Experience with distributed systems and fault handling
  • Monitoring, SLI/SLO management and performance analysis

  • Scalability under variable load
  • Availability and fault tolerance
  • Operational observability and diagnostics

  • Budget and cost constraints for cloud resources
  • Compliance and data protection requirements
  • Platform dependencies (managed services)