
Kubernetes HPA Scale to Zero Without KEDA: Native Autoscaling for Idle Workloads

If you run queue processors, batch workers, or event-driven workloads that sit idle for hours between bursts, you're paying for compute you don't need. Kubernetes HPA can scale these deployments to zero replicas — no KEDA, no Knative, no external controllers required. You need one feature gate, an external metrics source, and about twenty minutes of setup. When the queue is empty your pods disappear, and if you pair this with cluster autoscaler, the nodes disappear too. Real scale-to-zero, using nothing but native Kubernetes primitives.
Quick Reference
| Requirement | Detail |
|---|---|
| Feature gate | HPAScaleToZero=true (available since Kubernetes 1.16) |
| Minimum replicas | minReplicas: 0 in the HPA spec |
| Metrics source | Must use external or custom metrics (not CPU/memory) |
| Scale-to-zero trigger | Metric value drops to zero |
| Scale-from-zero trigger | Metric value rises above zero |
| Why not CPU/memory? | No pods means no resource metrics to observe — the HPA controller needs a signal that exists independently of the pods |
Local Setup with kind
Since HPAScaleToZero requires an explicit feature gate, we need a cluster that has it enabled on both the API server and the controller manager. kind makes this straightforward — especially on Kubernetes 1.36, which at the time of writing is too new for most managed providers.
If you haven't already built the node image:
kind build node-image --type release --image kindest/node:v1.36.0 v1.36.0
Create a cluster config that enables the feature gate:
```yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  HPAScaleToZero: true
nodes:
  - role: control-plane
  - role: worker
```
kind create cluster --name hpa-scale-to-zero --config kind-config.yaml --image kindest/node:v1.36.0
The featureGates field at the cluster level propagates the flag to all relevant components (kube-apiserver, kube-controller-manager, kubelet). No need to manually patch component configs.
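If you want to confirm the gate actually reached the control plane, one quick check is to inspect the controller manager's static pod arguments (the component=kube-controller-manager label is what kubeadm, and therefore kind, puts on it; repeat with component=kube-apiserver if you're thorough):
```bash
kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n' | grep feature-gates
# Expect to see: "--feature-gates=HPAScaleToZero=true"
```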
The Metrics Pipeline
The HPA controller can't scale from zero based on CPU or memory — there are no pods to measure. You need a metric that exists outside the workload itself. For queue-based workloads, the natural choice is the queue length: how many items are waiting to be processed.
We'll use Redis as the queue (specifically a Redis list), expose its length via a redis-exporter sidecar, scrape it with Prometheus, and surface it to the Kubernetes metrics API through prometheus-adapter. This is the same pipeline pattern you'd use in production — the only difference is we're running everything in a single kind cluster.
Deploy Redis with an Exporter
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --set architecture=standalone \
  --set auth.enabled=false \
  --set metrics.enabled=true \
  --set metrics.extraArgs.check-keys=work-queue
```
Setting metrics.enabled=true deploys a redis-exporter sidecar alongside Redis. The --check-keys work-queue argument tells the exporter to emit redis_key_size{key="work-queue"}, the length of our Redis list. That's the metric we'll use to drive the HPA. The chart also sets up the correct Prometheus scrape annotations automatically.
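Before wiring up Prometheus, it's worth confirming the exporter is emitting the metric. A quick port-forward does it; the service name below is the Bitnami chart's default metrics service, so check kubectl get svc if yours differs:
```bash
# In one terminal:
kubectl port-forward svc/redis-metrics 9121:9121

# In another; you should see something like
#   redis_key_size{db="db0",key="work-queue"} 0
curl -s localhost:9121/metrics | grep redis_key_size
```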
Deploy Prometheus and the Adapter
The prometheus-community Helm charts get us running in two commands:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --set server.service.type=ClusterIP \
  --set alertmanager.enabled=false \
  --set kube-state-metrics.enabled=false \
  --set prometheus-node-exporter.enabled=false \
  --set prometheus-pushgateway.enabled=false
```
This gives us a minimal Prometheus that auto-discovers pods annotated with prometheus.io/scrape: "true" — which the Bitnami Redis chart already configures.
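To confirm the scrape is working end to end, query Prometheus directly before involving the adapter; an empty result usually means the scrape annotations haven't been picked up yet:
```bash
# In one terminal:
kubectl port-forward svc/prometheus-server 9090:80

# In another:
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=redis_key_size{key="work-queue"}' | jq '.data.result'
```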
Now deploy the prometheus-adapter with a custom external metrics rule:
```bash
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url="http://prometheus-server" \
  --set prometheus.port=80 \
  --set-json 'rules.external=[{"seriesQuery":"{__name__=\"redis_key_size\",key=\"work-queue\"}","metricsQuery":"sum(<<.Series>>{<<.LabelMatchers>>})","name":{"as":"redis_queue_length"},"resources":{"namespaced":false}}]'
```
This rule tells the adapter: take the redis_key_size metric where key="work-queue", expose it as an external metric called redis_queue_length, and don't filter by namespace (since the queue is a cluster-wide resource — there's only one Redis).
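If the --set-json flag is hard to read, the same rule can live in a values file; this is just the flags above restated, installed with -f adapter-values.yaml instead:
```yaml
# adapter-values.yaml
prometheus:
  url: http://prometheus-server
  port: 80
rules:
  external:
    - seriesQuery: '{__name__="redis_key_size",key="work-queue"}'
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})
      name:
        as: redis_queue_length
      resources:
        namespaced: false
```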
After a minute or so, verify the external metric is available:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/redis_queue_length" | jq .
You should see a response with a value of 0 (since the queue is empty).
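The response is an ExternalMetricValueList; the payload should be roughly this shape (timestamp and labels will differ in your cluster):
```json
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "redis_queue_length",
      "metricLabels": {},
      "timestamp": "2024-01-01T00:00:00Z",
      "value": "0"
    }
  ]
}
```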
The Worker and HPA
The worker is deliberately simple — a shell script that pops items from the Redis list and "processes" them by sleeping for a second. In production this would be your actual consumer code.
Important: start the deployment with replicas: 1, not zero. The HPA controller uses a ScaledToZero condition internally — it only scales from zero if it was the one that previously scaled the workload to zero. A deployment that starts at zero replicas with a fresh HPA will never scale up. Let the HPA handle the initial scale-down to zero on its own.
```yaml
# worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker
spec:
  replicas: 1  # start at 1 — the HPA will scale to zero once metrics are available
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      containers:
        - name: worker
          image: redis:7
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                item=$(redis-cli -h redis-master BRPOP work-queue 30)
                if [ -n "$item" ]; then
                  echo "Processing: $item"
                  sleep 1  # simulate work
                fi
              done
```
Now the HPA that scales this deployment between 0 and 10 replicas based on the queue length:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 0  # this is what HPAScaleToZero enables
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_length
        target:
          type: AverageValue
          averageValue: "5"  # one pod per 5 queue items
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
```
The target.averageValue: "5" tells the HPA to maintain one pod per five pending items: the desired replica count is the total queue length divided by five, rounded up. When the queue has 20 items, you get 4 pods. When it drops to zero, you get zero pods. The behavior.scaleDown section shortens the stabilization window from the default 5 minutes to 60 seconds, which is useful for demos and for workloads where you want faster scale-to-zero.
Seeing It in Action
Apply the worker and HPA:
kubectl apply -f worker.yaml -f hpa.yaml
Open a watch in one terminal:
kubectl get hpa queue-worker -w
Since the queue is empty and the deployment starts at 1 replica, the HPA will first scale it down to zero. This initial scale-down is critical — it sets the internal ScaledToZero condition that enables future scale-from-zero behavior. After about a minute (our shortened stabilization window), you'll see replicas drop to 0.
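Optionally, keep a second watch on the pods themselves; you'll see the original worker terminate now, and new ones appear later when work arrives:
```bash
kubectl get pods -l app=queue-worker -w
```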
Now push 20 items into the queue:
```bash
kubectl exec redis-master-0 -c redis -- sh -c '
  for i in $(seq 1 20); do
    redis-cli LPUSH work-queue "job-$i"
  done
'
```
Within 30-60 seconds (Prometheus scrape interval + adapter refresh + HPA sync period), you'll see the HPA scale the deployment from 0 to 4 replicas. The workers will process the items — one per second per worker — and as the queue drains, the replica count drops. Once the queue is empty and the metric reads zero, the HPA scales the deployment back to zero replicas. Pods gone.
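At any point you can compare what the HPA reports against the queue itself (same pod and container names as the exec command above):
```bash
kubectl exec redis-master-0 -c redis -- redis-cli LLEN work-queue
```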
Here's what the kubectl get hpa queue-worker -w output looks like over the full cycle:
```
NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS
queue-worker   Deployment/queue-worker   0/5       0         10        1
queue-worker   Deployment/queue-worker   0/5       0         10        0
queue-worker   Deployment/queue-worker   20/5      0         10        0
queue-worker   Deployment/queue-worker   20/5      0         10        4
queue-worker   Deployment/queue-worker   8/5       0         10        4
queue-worker   Deployment/queue-worker   0/5       0         10        2
queue-worker   Deployment/queue-worker   0/5       0         10        0
```
Push items again — the HPA scales from zero immediately, no manual intervention needed. The cycle repeats indefinitely.
Cluster Autoscaler Integration
When a deployment scales to zero, its pods vanish. If those pods were the only workload on a node, the node becomes empty. Cluster autoscaler (or any node lifecycle manager) will notice the idle node and remove it after its configurable cool-down period — typically 10 minutes.
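If you run the stock cluster-autoscaler, these are the flags (shown with their upstream defaults) that govern how quickly an empty node goes away; how you set them depends on how the autoscaler is deployed, for example Helm values or managed-provider settings:
```
--scale-down-unneeded-time=10m          # how long a node must be unneeded before removal
--scale-down-delay-after-add=10m        # cool-down after a scale-up before scale-down resumes
--scale-down-utilization-threshold=0.5  # nodes below this utilization are removal candidates
```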
This is where the real cost savings live. You're not just saving pod-level resources; you're saving entire node-hours. For a workload that processes a queue for two hours a day and sits idle for twenty-two, you go from paying for 24 node-hours to paying for roughly 2.5 (accounting for scale-up overhead).
The cold-start chain matters. When an item hits the queue after a period of silence, here's what happens in sequence:
- Prometheus scrapes the redis-exporter (scrape interval, typically 15s)
- Prometheus-adapter picks up the new value (adapter refresh, typically 30s)
- HPA sync loop fires (default 15s) and decides to scale up
- Scheduler assigns the pod to a node — if no nodes are available, cluster autoscaler provisions one (60-120s for cloud providers)
- Kubelet pulls the container image and starts the pod
- Pod becomes Ready
In a warm cluster (nodes already present), the whole chain takes about 30-60 seconds. When a new node must be provisioned, add 1-2 minutes. For workloads where a minute or two of latency is unacceptable, consider these mitigations:
- Smaller container images reduce pull time. Distroless or scratch-based images start faster.
- Priority-based overprovisioning: deploy a low-priority "placeholder" pod that reserves a node. When real work arrives, the placeholder gets preempted and the worker starts immediately — no node provisioning needed. A sketch of this pattern follows the list.
- Shorter scrape intervals on Prometheus reduce the detection lag.
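Here's a minimal sketch of the placeholder pattern. The names and resource sizes are illustrative; the load-bearing details are a slightly negative priority (above cluster autoscaler's default expendable cutoff of -10, so the placeholder still holds a node) and the pause image, which does nothing but occupy its requests:
```yaml
# placeholder.yaml (illustrative; size the requests to roughly one worker's footprint)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: capacity-placeholder
value: -1            # lower than any real workload, so anything can preempt it
globalDefault: false
description: Reserves capacity; preempted as soon as real pods need the room
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: capacity-placeholder
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```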
For batch and queue workloads, the 30-60 second cold start is typically fine. You're processing a backlog, not serving interactive requests.
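If you want a number for your own setup, here's a rough way to time the cold start end to end, assuming the deployment is currently at zero replicas and you're running this in bash (SECONDS is a bash builtin):
```bash
# Push a single probe item, then poll until a worker pod reports Ready
kubectl exec redis-master-0 -c redis -- redis-cli LPUSH work-queue "cold-start-probe"
SECONDS=0
until [ "$(kubectl get deploy queue-worker -o jsonpath='{.status.readyReplicas}')" -ge 1 ] 2>/dev/null; do
  sleep 2
done
echo "cold start: ${SECONDS}s"
```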
When to Use This vs. KEDA
KEDA is the most popular alternative for event-driven autoscaling in Kubernetes. It supports 60+ scalers (Redis, SQS, RabbitMQ, Kafka, Azure Queue, and many more), handles the metrics plumbing internally, and can scale to zero without any feature gate.
Use native HPA scale-to-zero when:
- You already have Prometheus and a metrics adapter deployed
- You want fewer moving parts and no additional CRDs
- Your scaling logic is simple (one metric, linear scaling)
- You prefer to stay within the boundaries of native Kubernetes APIs
Use KEDA when:
- You don't have Prometheus or don't want to maintain the adapter
- You need advanced scaling logic (multiple triggers, complex cooldowns, cron-based scaling)
- You want out-of-the-box support for dozens of event sources without manual adapter configuration
- You need ScaledJobs (scaling Kubernetes Jobs rather than Deployments)
There's no shame in either choice. KEDA is excellent software. The native approach is simpler when your infrastructure already includes Prometheus — you're just connecting pieces that already exist.
Gotchas
You cannot use CPU or memory metrics to scale from zero. This is the most common mistake. When replica count is zero, there are no pods, so there's no CPU or memory to measure. The HPA controller requires at least one non-resource metric (external or custom) to make scaling decisions at zero replicas. If you configure only resource metrics, the HPA will refuse to scale from zero.
The deployment must not start at zero replicas. This is the subtlest gotcha and I haven't seen it documented clearly anywhere. The HPA controller maintains a ScaledToZero condition internally. It only scales FROM zero if it was the one that previously scaled the workload TO zero. If you deploy with replicas: 0 and create a fresh HPA, the controller sees zero replicas but no ScaledToZero condition, and sets ScalingActive: False with reason ScalingDisabled. The fix is simple: start with replicas: 1 and let the HPA scale it down naturally. Once the HPA has performed that initial scale-to-zero, the ScaledToZero condition is set to True and all subsequent scale-from-zero operations work correctly.
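If scale-from-zero ever seems stuck, the HPA's status conditions are where this shows up; per the behavior described above, ScalingActive: False with reason ScalingDisabled is the symptom of the started-at-zero trap:
```bash
kubectl describe hpa queue-worker | grep -A8 Conditions
```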
The stabilization window still applies at zero. By default, the HPA waits 5 minutes of sustained low metrics before scaling down. This applies to the transition from 1 to 0 as well. If you want faster scale-to-zero, configure the behavior.scaleDown section with a shorter stabilizationWindowSeconds.
The feature gate is alpha — but don't let that scare you. HPAScaleToZero has been available since Kubernetes 1.16 without breaking changes. It hasn't graduated to beta primarily due to KEP process inertia, not because of instability. The implementation is a small conditional in the HPA controller that allows minReplicas: 0 when the feature gate is enabled. It's been running in production clusters for years.
PodDisruptionBudgets at zero replicas. A PDB with minAvailable: 1 on a deployment that's at zero replicas won't prevent the scale-to-zero — PDBs apply to voluntary eviction of running pods, and scaling down is a different path. However, consider what happens if you have a PDB and scale from zero to one: the single pod is protected by the PDB immediately upon becoming ready.
Wrapping Up
The combination of native HPA scale-to-zero, external metrics from Prometheus, and cluster autoscaler gives you genuine pay-for-what-you-use economics on idle workloads. No third-party controllers, no additional CRDs, no vendor lock-in. The feature gate has been stable for twenty releases. Enable it, point your HPA at a queue metric, set minReplicas: 0, and stop paying for compute that's doing nothing.
The KEP tracking this feature is KEP-2021: HPA Scale to Zero — if you want to see it graduate to beta and become the default, that's the issue to watch and engage with.





