Autoscaling in Kubernetes

Autoscaling in Kubernetes is the process of automatically adjusting your resources to match the current demand. Instead of guessing how many servers or how much memory you need, Kubernetes monitors your traffic and “flexes” the infrastructure in real-time.

There are three main “layers” of autoscaling. Think of them as a chain: if one layer can’t handle the load, the next one kicks in.


1. Horizontal Pod Autoscaler (HPA)

The Concept: Adding more “lanes” to the highway.

HPA is the most common form of scaling. It increases or decreases the number of pod replicas based on metrics like CPU usage, memory, or custom traffic data.

  • How it works: It checks your pods every 15 seconds. If the average CPU across all pods is above your target (e.g., 70%), it tells the Deployment to spin up more pods.
  • Best for: Stateless services like web APIs or microservices that can handle traffic by simply having more copies running.

2. Vertical Pod Autoscaler (VPA)

The Concept: Making the “cars” bigger.

VPA doesn’t add more pods; instead, it looks at a single pod and decides if it needs more CPU or Memory. It “right-sizes” your containers.

  • How it works: It observes your app’s actual usage over time. If a pod is constantly hitting its memory limit, VPA will recommend (or automatically apply) a higher limit.
  • The Catch: Currently, in most versions of Kubernetes, changing a pod’s resource allocation requires restarting the pod (in-place resizing exists only as a newer, still-maturing feature).
  • Best for: Stateful apps (like databases) that can’t easily be “split” into multiple copies, or apps where you aren’t sure what the resource limits should be.
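As a sketch, a minimal VPA object might look like the following. This assumes the VPA components (CRDs and controllers) are installed in your cluster, and the Deployment name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-db-vpa           # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment        # could also target a StatefulSet
    name: my-db             # illustrative workload name
  updatePolicy:
    updateMode: "Auto"      # "Off" = recommend only; "Auto" = apply (restarts pods)
```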

3. Cluster Autoscaler (CA)

The Concept: Adding more “pavement” to the highway.

HPA and VPA scale Pods, but eventually, you will run out of physical space on your worker nodes (VMs). This is where the Cluster Autoscaler comes in.

  • How it works: It watches for “Pending” pods—pods that want to run but can’t because no node has enough free CPU/RAM. When it sees this, it calls your cloud provider (AWS, Azure, GCP) and asks for a new VM to be added to the cluster.
  • Downscaling: It also watches for underutilized nodes. If a node is mostly empty, it will move those pods elsewhere and delete the node to save money.
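On managed platforms the node limits are usually configured on the provider side rather than inside a Deployment. As one hedged example, on OpenShift the per-machine-pool bounds are set with a MachineAutoscaler resource (the names and the MachineSet reference below are illustrative):

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler              # illustrative name
  namespace: openshift-machine-api
spec:
  minReplicas: 2                       # never shrink this pool below 2 nodes
  maxReplicas: 8                       # never grow it beyond 8 nodes
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: my-cluster-worker-us-east-1a # illustrative MachineSet name
```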

The “Scaling Chain” in Action

Imagine a sudden surge of users hits your website:

  1. HPA sees high CPU usage and creates 10 new Pods.
  2. The cluster is full, so those 10 Pods stay in Pending status.
  3. Cluster Autoscaler sees the Pending pods and provisions 2 new Worker Nodes.
  4. The Pods finally land on the new nodes, and your website stays online.

Comparison Summary

| Feature | HPA | VPA | Cluster Autoscaler |
| --- | --- | --- | --- |
| What it scales | Number of Pods | Size of Pods (CPU/RAM) | Number of Nodes (VMs) |
| Primary Goal | Handle traffic spikes | Optimize resource efficiency | Provide hardware capacity |
| Impact | Fast, no downtime | Usually requires pod restart | Slower (minutes to boot VM) |

Pro-Tip: Never run HPA and VPA on the same metric (like CPU) for the same app. They will “fight” each other—HPA will try to add pods while VPA tries to make them bigger, leading to a “flapping” state where your app is constantly restarting.
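If you do want both on the same workload, a common pattern (sketched below, assuming the VPA is installed; names are illustrative) is to run the VPA in recommendation-only mode, so it publishes suggested sizes without ever acting on them:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # illustrative workload name
  updatePolicy:
    updateMode: "Off"       # only publish recommendations; the HPA stays in charge
```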

To set up a Horizontal Pod Autoscaler (HPA), you need two things: a Deployment (your app) and an HPA resource that watches it.

Here is a breakdown of how to configure this in a way that actually works.

1. The Deployment

First, your pods must have resources.requests defined. If the HPA doesn’t know how much CPU a pod should use, it can’t calculate the percentage.

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m # HPA uses this as the baseline

2. The HPA Resource

This YAML tells Kubernetes: “Keep the average CPU usage of these pods at 50%. If it goes higher, spin up more pods (up to 10). If it goes lower, scale back down to 1.”

YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

3. How to Apply and Test

You can apply these using oc apply -f <filename>.yaml (in OpenShift) or kubectl apply.

Once applied, you can watch the autoscaler in real-time:

  • View status: oc get hpa
  • Watch it live: oc get hpa php-apache-hpa --watch

The Calculation Logic:

The HPA uses a specific formula to decide how many replicas to run:

$$\text{Desired Replicas} = \lceil \text{Current Replicas} \times \frac{\text{Current Metric Value}}{\text{Desired Metric Value}} \rceil$$
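Plugging in some hypothetical numbers: if 3 replicas are averaging 90% CPU against a 50% target, the HPA computes

$$\text{Desired Replicas} = \lceil 3 \times \tfrac{90}{50} \rceil = \lceil 5.4 \rceil = 6$$

so it scales the Deployment to 6 replicas.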

Quick Tip: If you are using OpenShift, you can also do this instantly via the CLI without a YAML file:

oc autoscale deployment/php-apache --cpu-percent=50 --min=1 --max=10

To make your autoscaling more robust, you can combine CPU and Memory metrics in a single HPA. Kubernetes evaluates each metric independently and scales to whichever one demands the higher replica count.

Here is the updated YAML including both resource types and a “Scale Down” stabilization period to prevent your cluster from “flapping” (rapidly adding and removing pods).

1. Advanced HPA YAML (CPU + Memory)

YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: advanced-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: advanced-app
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 mins before scaling down to ensure traffic is actually gone
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
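Scale-up behavior can be tuned the same way. As a sketch (the values here are illustrative, not recommendations), you could cap how quickly the HPA adds pods:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # react to spikes immediately
    policies:
    - type: Pods
      value: 4            # add at most 4 pods...
      periodSeconds: 60   # ...per minute
    selectPolicy: Max     # when several policies apply, use the most permissive
```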

2. Scaling on Custom Metrics (e.g., HTTP Requests)

Sometimes CPU doesn’t tell the whole story. If your app is waiting on a database, CPU might stay low while users experience lag. In these cases, you can scale based on Requests Per Second (RPS).

To use this, you must have the Prometheus Adapter installed (which comes standard in OpenShift’s monitoring stack).

YAML

  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 500 # Scale up if pods average more than 500 requests/sec
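For that metric to exist, something has to translate raw Prometheus series into the custom metrics API. With the Prometheus Adapter, the mapping is a rule roughly like the one below; the series name `http_requests_total` is an assumption about how your app is instrumented, so substitute your own:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"   # exposes http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```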


Pro-Tips for Memory Scaling

  1. Memory is “Sticky”: Unlike CPU, which drops the moment a process finishes, many runtimes (like Java/JVM or Node.js) do not immediately release memory back to the OS.
  2. The Danger: If your app doesn’t have a good Garbage Collector configuration, the HPA might see high memory usage, spin up 10 pods, and never scale back down because the memory stays “reserved” by the app.
  3. The Fix: Always ensure your memory.requests in the Deployment are set to what the app actually needs to start, not its peak limit.
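In practice that means a container spec along these lines (the numbers are placeholders; measure your own app’s steady state first):

```yaml
resources:
  requests:
    memory: "256Mi"   # typical steady-state usage; the HPA treats this as the 100% baseline
  limits:
    memory: "512Mi"   # headroom for peaks; exceeding this gets the pod OOM-killed
```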

Summary Table: Which metric to use?

| Scenario | Recommended Metric | Why? |
| --- | --- | --- |
| Calculation heavy | CPU | Directly maps to processing power. |
| Caching/Large Data | Memory | Prevents OOM (Out of Memory) kills. |
| Web APIs | Requests Per Second | Scales based on actual user load. |
| Message Queue | Queue Depth | Scales based on “work to be done.” |

When an HPA isn’t behaving as expected—maybe it’s not scaling up during a spike, or it’s “stuck” at the minimum replicas—you need to look at the Controller Manager’s internal logic.

Here is how you can perform a “health check” on your HPA’s decision-making process.


1. The “Describe” Command (Most Useful)

The describe command provides a chronological log of every scaling action and, more importantly, why a request failed.

Bash

oc describe hpa advanced-app-hpa

What to look for in the “Events” section:

  • SuccessfulRescale: The HPA successfully changed the replica count.
  • FailedComputeMetricsReplicas: Usually means the HPA can’t talk to the Metrics Server (check if your pods have resources.requests defined!).
  • FailedGetResourceMetric: The pods might be crashing or “Unready,” so the HPA can’t pull their CPU/Memory usage.

2. Checking the “Conditions”

In the output of the describe command, look for the Conditions section. It tells you the current “brain state” of the autoscaler:

| Condition | Status | Meaning |
| --- | --- | --- |
| AbleToScale | True | The HPA is healthy and can talk to the Deployment. |
| ScalingActive | True | Metrics are being received and scaling logic is running. |
| ScalingLimited | True | Warning: you’ve hit your maxReplicas or minReplicas. It wants to scale further but you’ve capped it. |

3. Real-time Metric Monitoring

If you want to see exactly what numbers the HPA is seeing right now compared to your target, use:

Bash

oc get hpa advanced-app-hpa -w

Example Output:

Plaintext

NAME               REFERENCE                 TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
advanced-app-hpa   Deployment/advanced-app   75%/60%, 40%/80%   2         15        5          10m

In this example, CPU is at 75% (above the 60% target), so it has already scaled to 5 replicas.


4. Debugging Common “Stuck” Scenarios

Scenario A: TARGETS shows “&lt;unknown&gt;”

If the TARGETS column shows <unknown>, it almost always means:

  1. Missing Requests: You forgot to set resources.requests in your Deployment YAML.
  2. Metrics Server Down: The cluster-wide metrics service is having issues.
  3. Labels Mismatch: The HPA selector doesn’t match the Deployment labels.

Scenario B: High CPU but No Scaling

Check if the pods are in a Ready state. HPA ignores “Unready” pods to prevent scaling up based on the high CPU usage often seen during a container’s startup/boot phase.


Pro-Tip: The “Cooldown” Period

If you just stopped a load test and the pods are still running, don’t panic! By default, Kubernetes has a 5-minute stabilization window for scaling down. This prevents the “Flapping” effect where pods are deleted and then immediately recreated because of a small traffic blip.
