Autoscaling in Kubernetes is the process of automatically adjusting your resources to match the current demand. Instead of guessing how many servers or how much memory you need, Kubernetes monitors your traffic and “flexes” the infrastructure in real-time.
There are three main “layers” of autoscaling. Think of them as a chain: if one layer can’t handle the load, the next one kicks in.
1. Horizontal Pod Autoscaler (HPA)
The Concept: Adding more “lanes” to the highway.
HPA is the most common form of scaling. It increases or decreases the number of pod replicas based on metrics like CPU usage, memory, or custom traffic data.
- How it works: It checks your pods every 15 seconds. If the average CPU across all pods is above your target (e.g., 70%), it tells the Deployment to spin up more pods.
- Best for: Stateless services like web APIs or microservices that can handle traffic by simply having more copies running.
2. Vertical Pod Autoscaler (VPA)
The Concept: Making the “cars” bigger.
VPA doesn’t add more pods; instead, it looks at a single pod and decides if it needs more CPU or Memory. It “right-sizes” your containers.
- How it works: It observes your app’s actual usage over time. If a pod is constantly hitting its memory limit, VPA will recommend (or automatically apply) a higher limit.
- The Catch: Changing a running pod’s resources has historically required recreating the pod, so when VPA applies a new size it evicts and restarts the pod (in-place resizing only began arriving in recent Kubernetes releases).
- Best for: Stateful apps (like databases) that can’t easily be “split” into multiple copies, or apps where you aren’t sure what the resource limits should be.
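As a sketch of what a VPA object looks like (the workload name here is illustrative, and the VPA controller is a separate add-on that must be installed in the cluster):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-database-vpa        # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-database          # hypothetical stateful workload
  updatePolicy:
    updateMode: "Auto"         # "Off" = only publish recommendations, never restart pods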
3. Cluster Autoscaler (CA)
The Concept: Adding more “pavement” to the highway.
HPA and VPA scale Pods, but eventually, you will run out of physical space on your worker nodes (VMs). This is where the Cluster Autoscaler comes in.
- How it works: It watches for “Pending” pods—pods that want to run but can’t because no node has enough free CPU/RAM. When it sees this, it calls your cloud provider (AWS, Azure, GCP) and asks for a new VM to be added to the cluster.
- Downscaling: It also watches for underutilized nodes. If a node is mostly empty, it will move those pods elsewhere and delete the node to save money.
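Because downscaling can be disruptive, the Cluster Autoscaler honors a pod-level annotation that blocks eviction. A sketch of it on a pod template:

```yaml
metadata:
  annotations:
    # Tells the Cluster Autoscaler it may not evict this pod to drain an underutilized node
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

This is useful for pods with local state or long-running jobs that would be expensive to interrupt.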
The “Scaling Chain” in Action
Imagine a sudden surge of users hits your website:
- HPA sees high CPU usage and creates 10 new Pods.
- The cluster is full, so those 10 Pods stay in Pending status.
- Cluster Autoscaler sees the Pending pods and provisions 2 new Worker Nodes.
- The Pods finally land on the new nodes, and your website stays online.
Comparison Summary
| Feature | HPA | VPA | Cluster Autoscaler |
| --- | --- | --- | --- |
| What it scales | Number of Pods | Size of Pods (CPU/RAM) | Number of Nodes (VMs) |
| Primary Goal | Handle traffic spikes | Optimize resource efficiency | Provide hardware capacity |
| Impact | Fast, no downtime | Usually requires a pod restart | Slower (minutes to boot a VM) |
Pro-Tip: Never run HPA and VPA on the same metric (like CPU) for the same app. They will “fight” each other—HPA will try to add pods while VPA tries to make them bigger, leading to a “flapping” state where your app is constantly restarting.
To set up a Horizontal Pod Autoscaler (HPA), you need two things: a Deployment (your app) and an HPA resource that watches it.
Here is a breakdown of how to configure this in a way that actually works.
1. The Deployment
First, your pods must have resources.requests defined. If the HPA doesn’t know how much CPU a pod should use, it can’t calculate the percentage.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m  # HPA uses this as the baseline
```
2. The HPA Resource
This YAML tells Kubernetes: “Keep the average CPU usage of these pods at 50%. If it goes higher, spin up more pods (up to 10). If it goes lower, scale back down to 1.”
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
3. How to Apply and Test
You can apply these using `oc apply -f <filename>.yaml` (OpenShift) or `kubectl apply -f <filename>.yaml` (vanilla Kubernetes).
Once applied, you can watch the autoscaler in real-time:
- View status: `oc get hpa`
- Watch it live: `oc get hpa php-apache-hpa --watch`
The Calculation Logic:
The HPA uses a specific formula to decide how many replicas to run:
$$\text{Desired Replicas} = \lceil \text{Current Replicas} \times \frac{\text{Current Metric Value}}{\text{Desired Metric Value}} \rceil$$
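Plugging numbers into the formula: if 4 replicas are averaging 100% CPU against a 50% target, then

$$\text{Desired Replicas} = \lceil 4 \times \tfrac{100}{50} \rceil = 8$$

The ceiling means fractional results always round up (1.2 becomes 2 pods), so the HPA errs on the side of extra capacity.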
Quick Tip: If you are using OpenShift, you can also do this instantly via the CLI without a YAML file:
oc autoscale deployment/php-apache --cpu-percent=50 --min=1 --max=10
To make your autoscaling more robust, you can combine CPU and Memory metrics in a single HPA. Kubernetes will look at both and scale based on whichever one hits the limit first.
Here is the updated YAML including both resource types and a “Scale Down” stabilization period to prevent your cluster from “flapping” (rapidly adding and removing pods).
1. Advanced HPA YAML (CPU + Memory)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: advanced-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: advanced-app
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 mins before scaling down to ensure traffic is actually gone
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
2. Scaling on Custom Metrics (e.g., HTTP Requests)
Sometimes CPU doesn’t tell the whole story. If your app is waiting on a database, CPU might stay low while users experience lag. In these cases, you can scale based on Requests Per Second (RPS).
To use this, you need a custom metrics adapter such as the Prometheus Adapter. OpenShift ships Prometheus as part of its monitoring stack, though exposing your application’s metrics to the HPA may require additional configuration (e.g., enabling user-workload monitoring).
```yaml
# Goes inside spec.metrics of the HPA
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "500"  # Scale up if pods average more than 500 requests/sec
```
Pro-Tips for Memory Scaling
- Memory is “Sticky”: Unlike CPU, which drops the moment a process finishes, many runtimes (like Java/JVM or Node.js) do not immediately release memory back to the OS.
- The Danger: If your app doesn’t have a good Garbage Collector configuration, the HPA might see high memory usage, spin up 10 pods, and never scale back down because the memory stays “reserved” by the app.
- The Fix: Always ensure the memory value under `resources.requests` in the Deployment is set to what the app actually needs to start, not its peak limit.
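As a minimal sketch of that request/limit split (the numbers are illustrative, not recommendations):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the app needs at startup; the HPA's 100% utilization baseline
  limits:
    memory: "1Gi"     # hard ceiling; exceeding it gets the container OOM-killed
```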
Summary Table: Which metric to use?
| Scenario | Recommended Metric | Why? |
| --- | --- | --- |
| Calculation heavy | CPU | Directly maps to processing power. |
| Caching/Large Data | Memory | Prevents OOM (Out of Memory) kills. |
| Web APIs | Requests Per Second | Scales based on actual user load. |
| Message Queue | Queue Depth | Scales based on “work to be done.” |
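For the queue-depth case, the autoscaling/v2 API has an External metric type. A sketch, assuming your metrics adapter (e.g., Prometheus Adapter or KEDA) exposes a queue-length metric under the hypothetical name shown:

```yaml
# Goes inside spec.metrics of the HPA
metrics:
- type: External
  external:
    metric:
      name: rabbitmq_queue_messages_ready  # hypothetical metric name from your adapter
    target:
      type: AverageValue
      averageValue: "30"                   # aim for roughly 30 waiting messages per pod
```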
When an HPA isn’t behaving as expected—maybe it’s not scaling up during a spike, or it’s “stuck” at the minimum replicas—you need to look at the Controller Manager’s internal logic.
Here is how you can perform a “health check” on your HPA’s decision-making process.
1. The “Describe” Command (Most Useful)
The describe command provides a chronological log of every scaling action and, more importantly, why a request failed.
```bash
oc describe hpa advanced-app-hpa
```
What to look for in the “Events” section:
- SuccessfulRescale: The HPA successfully changed the replica count.
- FailedComputeMetricsReplicas: Usually means the HPA can’t talk to the Metrics Server (check if your pods have `resources.requests` defined!).
- FailedGetResourceMetric: The pods might be crashing or “Unready,” so the HPA can’t pull their CPU/Memory usage.
2. Checking the “Conditions”
In the output of the describe command, look for the Conditions section. It tells you the current “brain state” of the autoscaler:
| Condition | Status | Meaning |
| --- | --- | --- |
| AbleToScale | True | The HPA is healthy and can talk to the Deployment. |
| ScalingActive | True | Metrics are being received and scaling logic is running. |
| ScalingLimited | True | Warning: you’ve hit your maxReplicas or minReplicas. The HPA wants to scale further but you’ve capped it. |
3. Real-time Metric Monitoring
If you want to see exactly what numbers the HPA is seeing right now compared to your target, use:
```bash
oc get hpa advanced-app-hpa -w
```
Example Output:
```text
NAME               REFERENCE                 TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
advanced-app-hpa   Deployment/advanced-app   75%/60%, 40%/80%   2         15        5          10m
```
In this example, CPU is at 75% (above the 60% target), so it has already scaled to 5 replicas.
4. Debugging Common “Stuck” Scenarios
Scenario A: Targets show `<unknown>`
If the TARGETS column shows `<unknown>`, it almost always means:
- Missing Requests: You forgot to set `resources.requests` in your Deployment YAML.
- Metrics Server Down: The cluster-wide metrics service is having issues.
- Labels Mismatch: The HPA selector doesn’t match the Deployment labels.
Scenario B: High CPU but No Scaling
Check if the pods are in a Ready state. HPA ignores “Unready” pods to prevent scaling up based on the high CPU usage often seen during a container’s startup/boot phase.
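That makes readiness probes part of your scaling story: without one, a pod is counted as Ready the moment its container starts. A minimal sketch (the path and timings are illustrative):

```yaml
# Goes inside the container spec of the Deployment
readinessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 80
  initialDelaySeconds: 5  # give the app time to boot before counting it as Ready
  periodSeconds: 10
```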
Pro-Tip: The “Cooldown” Period
If you just stopped a load test and the pods are still running, don’t panic! By default, Kubernetes has a 5-minute stabilization window for scaling down. This prevents the “Flapping” effect where pods are deleted and then immediately recreated because of a small traffic blip.