Autoscaling
Automatically adjusts instance counts and resources to match load, improving availability and cost efficiency.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Technical context
Principles & goals
Use cases & scenarios
Compromises
Risks
- Oscillation due to aggressive scaling rates and missing cooldowns.
- Cost explosion from faulty rules or load prediction.
- Cascading failures if dependent systems are not scaled in tandem.
Mitigations
- Start with conservative default bounds and tune iteratively instead of using aggressive rules.
- Combine multiple metrics (e.g., CPU + latency) instead of relying on a single metric.
- Use cooldowns and hysteresis to avoid oscillations.
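The points about cooldowns, hysteresis, and combined metrics can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `HysteresisScaler`, the threshold values, and the one-replica step size are hypothetical and not taken from any specific platform.

```python
from dataclasses import dataclass


@dataclass
class HysteresisScaler:
    """Toy scaler: separate up/down thresholds plus a cooldown window."""
    scale_up_cpu: float = 0.75      # scale out above this CPU utilization (assumed value)
    scale_down_cpu: float = 0.40    # scale in below this CPU utilization (assumed value)
    max_latency_ms: float = 300.0   # latency ceiling that also triggers scale-out
    cooldown_s: float = 120.0       # minimum time between two scaling actions
    min_replicas: int = 2
    max_replicas: int = 10
    last_action_at: float = float("-inf")

    def decide(self, now: float, replicas: int, cpu: float, latency_ms: float) -> int:
        """Return the replica count to run after this evaluation."""
        if now - self.last_action_at < self.cooldown_s:
            return replicas  # still cooling down: hold steady
        if cpu > self.scale_up_cpu or latency_ms > self.max_latency_ms:
            target = replicas + 1  # either metric can trigger scale-out
        elif cpu < self.scale_down_cpu and latency_ms < self.max_latency_ms:
            target = replicas - 1  # scale in only when both metrics are healthy
        else:
            target = replicas  # inside the hysteresis band: do nothing
        target = max(self.min_replicas, min(self.max_replicas, target))
        if target != replicas:
            self.last_action_at = now  # start a new cooldown window
        return target
```

The gap between the two CPU thresholds is the hysteresis band: a load that hovers around one threshold cannot flip the decision back and forth, and the cooldown blocks any second action for `cooldown_s` seconds after a change.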
I/O & resources
Inputs
- Metrics (CPU, memory, custom metrics, queue depth)
- Scaling rules and policies (min/max, targets, cooldown)
- Monitoring, alerting and observability data
Outputs
- Automatically changed capacity (scale up/down)
- Measurable effects on latency and throughput
- Auditable scaling events and logs
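To make the relationship between policy inputs and auditable outputs concrete, here is a minimal sketch. The names `ScalingPolicy` and `scaling_event`, and the fields of the event record, are assumptions chosen for illustration; real platforms define their own schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy record: bounds, target, and cooldown."""
    min_instances: int
    max_instances: int
    target_cpu: float   # desired average CPU utilization, between 0 and 1
    cooldown_s: int     # seconds to wait between scaling actions

    def __post_init__(self) -> None:
        # Reject inconsistent policies early instead of at scaling time.
        if not 1 <= self.min_instances <= self.max_instances:
            raise ValueError("need 1 <= min_instances <= max_instances")
        if not 0.0 < self.target_cpu <= 1.0:
            raise ValueError("target_cpu must be in (0, 1]")
        if self.cooldown_s < 0:
            raise ValueError("cooldown_s must be non-negative")


def scaling_event(policy, current, requested, reason, timestamp):
    """Clamp a requested capacity to the policy bounds and describe the change."""
    applied = max(policy.min_instances, min(policy.max_instances, requested))
    return {
        "timestamp": timestamp,
        "from": current,
        "requested": requested,
        "to": applied,
        "clamped": applied != requested,  # True when a safety bound was hit
        "reason": reason,
    }
```

Recording both the requested and the applied capacity makes it visible in the audit trail when a min/max bound, rather than the metric itself, decided the outcome.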
Description
Autoscaling automatically adjusts an application's instance count and resource allocation to match actual load. It improves availability and cost efficiency by scaling capacity dynamically, whether in cloud platforms, container orchestrators, or hybrid environments. Policies, metrics, and thresholds determine scaling behaviour and safety limits.
✔ Benefits
- Automatic adjustment reduces manual intervention and operational effort.
- Cost efficiency through demand-driven resource provisioning.
- Improved availability and responsiveness during load spikes.
✖ Limitations
- Scaling reacts with a delay, so very short peaks may pass before new capacity is ready.
- Incorrect metrics or thresholds can lead to misbehaviour.
- Not all components are easy to scale horizontally (e.g., stateful services).
Trade-offs
Metrics
- Replicas/Instances
Number of running instances or pods to measure capacity.
- CPU/Memory utilization
Percentage resource utilization to trigger scaling actions.
- Request rate / latency
Throughput and response time as indicators for needed scaling.
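As a rough illustration of how the metrics above might be derived from raw samples, the following sketch computes average utilization, a nearest-rank latency percentile, and a request rate. The function names and the choice of the nearest-rank method are assumptions; monitoring systems often use other percentile estimators.

```python
def mean_utilization(samples):
    """Average of per-instance utilization samples (values between 0 and 1)."""
    return sum(samples) / len(samples)


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need at least one value and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[rank - 1]


def request_rate(request_count, window_s):
    """Requests per second over an observation window."""
    return request_count / window_s
```

For example, `percentile_nearest_rank([120, 80, 400, 95, 110], 95)` returns `400`, which shows why tail latency can signal a need to scale even when the median (`110` here) looks healthy.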
Examples & implementations
Kubernetes Horizontal Pod Autoscaler
Widely used implementation of pod-based autoscaling based on CPU and custom metrics.
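The Kubernetes documentation describes the core calculation of the Horizontal Pod Autoscaler as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no action taken when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below restates that formula in Python for illustration only; it is not controller code, and the function name and the `min_replicas`/`max_replicas` defaults are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Target-tracking replica count in the style of the HPA formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: skip scaling
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 100m CPU per pod, a measured 200m doubles the replica count (`desired_replicas(3, 200, 100)` returns `6`), while a measured 50m halves it (`desired_replicas(4, 50, 100)` returns `2`).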
AWS Auto Scaling Groups
Cloud provider mechanism to autoscale EC2 instances using policies and target metrics.
Serverless concurrency/provisioned scaling
Examples from functions-as-a-service (e.g., provisioned concurrency) for controlling cold starts and throughput.
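A common starting point for sizing provisioned concurrency is Little's law: average concurrency equals request rate multiplied by average request duration. The sketch below applies it with an optional headroom percentage. The function name and the headroom parameter are hypothetical, and the result is only an estimate; actual settings depend on the provider's limits and the workload's burstiness.

```python
import math


def estimate_concurrency(requests_per_s, avg_duration_ms, headroom_pct=0):
    """Estimate concurrent executions via Little's law, rounded up.

    All arguments are integers so the arithmetic stays exact before rounding.
    """
    # rate * duration (in seconds) * (1 + headroom), kept in integer units
    scaled = requests_per_s * avg_duration_ms * (100 + headroom_pct)
    return math.ceil(scaled / 100_000)
```

For instance, 100 requests per second at 200 ms each keep about 20 executions busy, so `estimate_concurrency(100, 200, headroom_pct=25)` returns `25`.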
Implementation steps
Set up metrics and observability; validate relevant metrics.
Define scaling rules with min/max and cooldown parameters.
Test autoscaling in stages (staging → canary → prod) and monitor.
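Before the staged rollout, scaling rules can be exercised offline by replaying a load trace through a simplified model. The following sketch is one possible way to do that; the function names, the per-instance capacity model, and the step-based cooldown are assumptions, not a description of any real autoscaler.

```python
import math


def simulate(trace_rps, per_instance_rps, min_n, max_n, cooldown_steps):
    """Replay a load trace and return the replica count chosen at each step."""
    replicas = min_n
    steps_since_change = cooldown_steps  # allow an action on the first step
    history = []
    for rps in trace_rps:
        wanted = max(min_n, min(max_n, math.ceil(rps / per_instance_rps)))
        if wanted != replicas and steps_since_change >= cooldown_steps:
            replicas = wanted
            steps_since_change = 0
        else:
            steps_since_change += 1
        history.append(replicas)
    return history


def underprovisioned_steps(trace_rps, history, per_instance_rps):
    """Count steps where the running capacity was below the offered load."""
    return sum(1 for rps, n in zip(trace_rps, history) if n * per_instance_rps < rps)


trace = [100, 450, 900, 2000, 300, 100]
history = simulate(trace, per_instance_rps=100, min_n=2, max_n=8, cooldown_steps=2)
print(history)  # [2, 5, 5, 5, 3, 3]
print(underprovisioned_steps(trace, history, 100))  # 2
```

In this trace the cooldown holds capacity at 5 replicas while load climbs to 900 and 2000 requests per second, so two steps are underprovisioned. Checks like this make the trade-off between stability and reaction time visible before rules reach staging.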
⚠️ Technical debt & bottlenecks
Technical debt
- Manual tuning of thresholds without automation or tests.
- Missing capacity models for templates, permissions and quotas.
- Incomplete telemetry that obscures reasons for scaling decisions.
Known bottlenecks
Misuse examples
- Autoscaling stateful databases without proper replication strategy.
- Aggressive scaling rules that increase costs uncontrollably.
- Scaling only up, without automatic scale-down when load decreases.
Typical traps
- Ignoring dependencies that remain bottlenecks (DB, network).
- Missing tests for combined load scenarios lead to surprises in prod.
- Blind trust in cloud defaults without adapting to own workloads.
Required skills
Architectural drivers
Constraints
- Dependencies that cannot be horizontally scaled (e.g., monolithic DB).
- Provider limits or quotas in the cloud.
- Network or storage throughput limitations.
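Constraints like these can be folded into the effective upper bound of a scaling policy. The sketch below is a simplified illustration: the function name, the idea of a fixed connection pool per instance, and the constraint labels are assumptions made for the example.

```python
def effective_max_instances(policy_max, quota_instances,
                            db_max_connections, connections_per_instance):
    """Return (limit, binding_constraint) for the tightest upper bound."""
    bounds = {
        "policy": policy_max,
        "provider_quota": quota_instances,
        # Instances the database can serve if each one holds a fixed pool.
        "database_connections": db_max_connections // connections_per_instance,
    }
    binding = min(bounds, key=bounds.get)  # first key wins on ties
    return bounds[binding], binding
```

For a policy maximum of 20, a quota of 16 instances, and a database that accepts 200 connections with 20 per instance, `effective_max_instances(20, 16, 200, 20)` returns `(10, "database_connections")`: the database, not the policy, is the real ceiling.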