Chafik Belhaoues
There are times when the load on an application spikes suddenly. And at those moments, the last thing you want to do is manually add pods and monitor metrics. Fortunately, there is Kubernetes HPA (Horizontal Pod Autoscaler), which solves this problem automatically. It monitors the current load and independently adjusts the number of running pods. More traffic means more pods; when the load drops, the number of pods decreases. This is the foundation for scalable, fault-tolerant applications in Kubernetes.
Kubernetes Horizontal Pod Autoscaler is a built-in Kubernetes mechanism that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.
Simply put, you set the rules - how many pods should run at what load - and the HPA ensures they are followed without your intervention.
In the Kubernetes architecture, the Kubernetes Horizontal Pod Autoscaler exists as a separate controller that operates in a cycle: it collects metrics, compares them to target values, and decides whether to change the number of replicas. This cycle repeats every 15 seconds by default.
It’s important to understand how HPA differs from vertical scaling (VPA). VPA increases the resources of a single pod - CPU and memory. HPA, on the other hand, adds new pods to distribute the load horizontally. This makes it an ideal tool for stateless applications that scale well horizontally: web servers, APIs, microservices.
Kubernetes Horizontal Pod Autoscaler has been supported by Kubernetes since version 1.2, and more advanced features with custom metrics were introduced in version 1.6+.
Let’s break down the mechanics using a specific K8s HPA example. Suppose you have an API service running with 2 pods, and you’ve configured the HPA with a target CPU utilization of 50%.
Here’s what happens step by step:

1. The HPA queries the metrics API and sees that average CPU utilization across the pods has risen to, say, 80% - well above the 50% target.
2. It computes the desired replica count with the standard formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue): ceil(2 × 80 / 50) = 4.
3. The controller updates the Deployment’s replica count, and the scheduler starts two additional pods.
4. When traffic subsides and average utilization falls back below the target, the HPA waits out its stabilization window and scales back down.
This is the basic K8s HPA example in action: no manual intervention, the system adapts on its own. All you have to do is enjoy the result.
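The controller’s decision boils down to the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal sketch of that arithmetic (the 80% figure is an illustrative value, not something read from a real cluster):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Core HPA formula: desiredReplicas = ceil(current * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_metric / target_metric)

# 2 pods averaging 80% CPU against a 50% target -> scale out to 4 pods
print(desired_replicas(2, 80, 50))  # 4

# Once traffic drops and utilization sits below target, the desired count shrinks
print(desired_replicas(4, 25, 50))  # 2
```

Note that the formula works on the ratio of observed to target values, which is why it applies unchanged to CPU, memory, or custom metrics.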
Kubernetes HPA metrics fall into three categories, and choosing the right one is half the battle in configuring autoscaling:

- Resource metrics - CPU and memory of the pods themselves, served by metrics-server. The simplest option and the right starting point for most workloads.
- Custom metrics - application-level metrics such as requests per second or queue depth, exposed through the custom metrics API, usually via an adapter like the Prometheus Adapter.
- External metrics - metrics that live outside the cluster, such as the length of a cloud message queue.
The choice of metric directly affects the quality of scaling. An incorrectly chosen metric causes the HPA to react too late or too early.
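As an illustration of a non-CPU metric, here is a hypothetical Pods-type metric block for the same HPA spec. The metric name http_requests_per_second is an assumption for the example and requires a custom metrics adapter (such as the Prometheus Adapter) to be installed in the cluster:

```yaml
# Sketch: scale on average requests per second per pod.
# The metric name is hypothetical and must be exposed by a custom metrics adapter.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # keep each pod at ~100 req/s on average
```

A request-rate metric like this often tracks real user load more closely than CPU, which is exactly the point made above about metric choice.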
To view the current status of the autoscaler, run the kubectl get hpa command. The output will look something like this:

NAME          REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
api-service   Deployment/api   48%/50%   2         10        3          5d
Here you can see all the most important information: which Deployment the HPA is attached to, the current metric value versus the target (48% against a 50% target), the minimum and maximum number of pods, and how many replicas are currently running.
For more detailed information, use: kubectl describe hpa api-service. This command will show the scaling event history, the reasons for recent changes, and the current conditions. It is indispensable for debugging when the HPA behaves unexpectedly.
Understanding how the HPA makes decisions helps avoid unstable or unexpected behavior.
Thresholds and stabilization windows. HPA does not react to every metric fluctuation. Scaling up happens quickly - in current Kubernetes versions it can trigger almost immediately, since the default scale-up stabilization window is 0 seconds. Scaling down is deliberately slower, with a default 5-minute (300-second) stabilization window. This protects against "flapping", where pods are constantly added and removed due to short-term spikes.
The kubectl get hpa command, when used in conjunction with kubectl describe hpa, allows you to track these specific events: when scaling last triggered and for what reason.
Limits. You always specify minReplicas and maxReplicas. The HPA will never exceed these limits, even if metrics require more. This is an important lever for controlling costs and stability.
In Kubernetes 1.18+, it became possible to fine-tune behavior via the ‘behavior’ field in the HPA specification - separately for scale-up and scale-down. This allows, for example, aggressive scaling up and smooth scaling down.
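A behavior stanza along these lines (a sketch using the autoscaling/v2 API; the concrete numbers are illustrative, not recommendations) implements exactly that pattern - aggressive scale-up, gradual scale-down:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react immediately to rising load
    policies:
    - type: Percent
      value: 100                       # at most double the pod count per period
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 minutes before shrinking
    policies:
    - type: Pods
      value: 1                         # remove at most one pod per minute
      periodSeconds: 60
```

When multiple policies are listed for the same direction, the HPA picks the one allowing the largest change by default, which is worth keeping in mind when combining Percent and Pods policies.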
The basic horizontal pod autoscaler YAML looks like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
The scaleTargetRef field specifies which Deployment will be managed. minReplicas and maxReplicas set the lower and upper bounds for the HPA scale. The metrics field specifies which metric is used to make decisions and its target value. This minimal configuration is already fully functional and covers most basic scenarios.
Why has Kubernetes HPA become the standard for production environments? There are several compelling reasons: it saves money by running only as many pods as the load actually requires; it keeps applications available during traffic spikes without on-call intervention; and it is built into Kubernetes itself, so the basic CPU and memory cases need no third-party tooling.
If you’re designing a cloud infrastructure and want to see the entire stack - from Kubernetes configurations to network dependencies - in a single visual space, Brainboard will help you build a clear architecture with automatic Terraform code generation.
Kubernetes HPA is a powerful tool. But nothing is perfect, and it also has weaknesses that are important to know in advance: it is reactive, scaling only after load has already risen rather than ahead of a spike; it depends on a working metrics pipeline (metrics-server or a custom adapter), and stale or missing metrics mean no scaling; it cannot help applications that do not scale horizontally, such as single-writer stateful services; and a poorly chosen metric or threshold leads to flapping or late reactions.
Practical recommendations: Start with conservative thresholds and adjust them based on the application’s actual behavior. Test load scenarios before production. Set up alerts in case HPA reaches maxReplicas - this is a signal that limits need to be reviewed.
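One way to implement the maxReplicas alert, assuming kube-state-metrics and the Prometheus Operator are running in the cluster (a sketch - the metric and label names should be checked against your monitoring setup):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-at-max
spec:
  groups:
  - name: hpa
    rules:
    - alert: HPAAtMaxReplicas
      # Fires when an HPA has been pinned at its ceiling for 15 minutes
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas
          >= kube_horizontalpodautoscaler_spec_max_replicas
      for: 15m
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} is at maxReplicas - review the limit."
```

An alert like this turns "we silently hit the ceiling" into an explicit capacity-planning signal.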
Brainboard helps systematically build such processes - a platform where infrastructure decisions are made with full context and built-in security checks.
Kubernetes HPA is not an optional feature, but an essential element of any production environment with dynamic load. It allows applications to be both cost-effective during quiet periods and resilient during peak times - exactly what modern teams need.