
Kubernetes HPA Scale to Zero Without KEDA: Native Autoscaling for Idle Workloads

If you run queue processors, batch workers, or event-driven workloads that sit idle for hours between bursts, you're paying for compute you don't need. Kubernetes HPA can scale these deployments to zero replicas — no KEDA, no Knative, no external controllers required. You need one feature gate, an external metrics source, and about twenty minutes of setup. When the queue is empty your pods disappear, and if you pair this with cluster autoscaler, the nodes disappear too. Real scale-to-zero, using nothing but native Kubernetes primitives.
Quick Reference
| Requirement | Detail |
|---|---|
| Feature gate | HPAScaleToZero=true (available since Kubernetes 1.16) |
| Minimum replicas | minReplicas: 0 in the HPA spec |
| Metrics source | Must use external or custom metrics (not CPU/memory) |
| Scale-to-zero trigger | Metric value drops to zero |
| Scale-from-zero trigger | Metric value rises above zero |
| Why not CPU/memory? | No pods means no resource metrics to observe — the HPA controller needs a signal that exists independently of the pods |
Local Setup with kind
Since HPAScaleToZero requires an explicit feature gate, we need a cluster that has it enabled on both the API server and the controller manager. kind makes this straightforward — especially on Kubernetes 1.36, which at the time of writing is too new for most managed providers.
If you haven't already built the node image:
kind build node-image --type release --image kindest/node:v1.36.0 v1.36.0
Create a cluster config that enables the feature gate:
```yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  HPAScaleToZero: true
nodes:
  - role: control-plane
  - role: worker
```
kind create cluster --name hpa-scale-to-zero --config kind-config.yaml --image kindest/node:v1.36.0
The featureGates field at the cluster level propagates the flag to all relevant components (kube-apiserver, kube-controller-manager, kubelet). No need to manually patch component configs.
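If you want to confirm the gate actually reached the control plane, one quick check is to inspect the controller manager's static pod arguments (the component=kube-controller-manager label is what kubeadm, and therefore kind, puts on it; repeat with component=kube-apiserver if you're thorough):
```bash
kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n' | grep feature-gates
# Expect to see: "--feature-gates=HPAScaleToZero=true"
```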
The Metrics Pipeline
The HPA controller can't scale from zero based on CPU or memory — there are no pods to measure. You need a metric that exists outside the workload itself. For queue-based workloads, the natural choice is the queue length: how many items are waiting to be processed.
We'll use Redis as the queue (specifically a Redis list), expose its length via a redis-exporter sidecar, scrape it with Prometheus, and surface it to the Kubernetes metrics API through prometheus-adapter. This is the same pipeline pattern you'd use in production — the only difference is we're running everything in a single kind cluster.
Deploy Redis with an Exporter
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --set architecture=standalone \
  --set auth.enabled=false \
  --set metrics.enabled=true \
  --set metrics.extraArgs.check-keys=work-queue
```
Setting metrics.enabled=true deploys a redis-exporter sidecar alongside Redis. The --check-keys work-queue argument tells the exporter to emit redis_key_size{key="work-queue"}, the length of our Redis list. That's the metric we'll use to drive the HPA. The chart also sets up the correct Prometheus scrape annotations automatically.
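Before wiring up Prometheus, it's worth confirming the exporter is emitting the metric. A quick port-forward does it; the service name below is the Bitnami chart's default metrics service, so check kubectl get svc if yours differs:
```bash
# In one terminal:
kubectl port-forward svc/redis-metrics 9121:9121

# In another; you should see something like
#   redis_key_size{db="db0",key="work-queue"} 0
curl -s localhost:9121/metrics | grep redis_key_size
```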
Deploy Prometheus and the Adapter
The prometheus-community Helm charts get us running in two commands:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --set server.service.type=ClusterIP \
  --set alertmanager.enabled=false \
  --set kube-state-metrics.enabled=false \
  --set prometheus-node-exporter.enabled=false \
  --set prometheus-pushgateway.enabled=false
```
This gives us a minimal Prometheus that auto-discovers pods annotated with prometheus.io/scrape: "true" — which the Bitnami Redis chart already configures.
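To confirm the scrape is working end to end, query Prometheus directly before involving the adapter; an empty result usually means the scrape annotations haven't been picked up yet:
```bash
# In one terminal:
kubectl port-forward svc/prometheus-server 9090:80

# In another:
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=redis_key_size{key="work-queue"}' | jq '.data.result'
```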
Now deploy the prometheus-adapter with a custom external metrics rule:
```bash
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url="http://prometheus-server" \
  --set prometheus.port=80 \
  --set-json 'rules.external=[{"seriesQuery":"{__name__=\"redis_key_size\",key=\"work-queue\"}","metricsQuery":"sum(<<.Series>>{<<.LabelMatchers>>})","name":{"as":"redis_queue_length"},"resources":{"namespaced":false}}]'
```
This rule tells the adapter: take the redis_key_size metric where key="work-queue", expose it as an external metric called redis_queue_length, and don't filter by namespace (since the queue is a cluster-wide resource — there's only one Redis).
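If the --set-json flag is hard to read, the same rule can live in a values file; this is just the flags above restated, installed with -f adapter-values.yaml instead:
```yaml
# adapter-values.yaml
prometheus:
  url: http://prometheus-server
  port: 80
rules:
  external:
    - seriesQuery: '{__name__="redis_key_size",key="work-queue"}'
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})
      name:
        as: redis_queue_length
      resources:
        namespaced: false
```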
After a minute or so, verify the external metric is available:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/redis_queue_length" | jq .
You should see a response with a value of 0 (since the queue is empty).
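The response is an ExternalMetricValueList; the payload should be roughly this shape (timestamp and labels will differ in your cluster):
```json
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "redis_queue_length",
      "metricLabels": {},
      "timestamp": "2024-01-01T00:00:00Z",
      "value": "0"
    }
  ]
}
```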
The Worker and HPA
The worker is deliberately simple — a shell script that pops items from the Redis list and "processes" them by sleeping for a second. In production this would be your actual consumer code.
Important: start the deployment with replicas: 1, not zero. The HPA controller uses a ScaledToZero condition internally — it only scales from zero if it was the one that previously scaled the workload to zero. A deployment that starts at zero replicas with a fresh HPA will never scale up. Let the HPA handle the initial scale-down to zero on its own.
```yaml
# worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker
spec:
  replicas: 1  # start at 1 — the HPA will scale to zero once metrics are available
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      containers:
        - name: worker
          image: redis:7
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                item=$(redis-cli -h redis-master BRPOP work-queue 30)
                if [ -n "$item" ]; then
                  echo "Processing: $item"
                  sleep 1  # simulate work
                fi
              done
```
Now the HPA that scales this deployment between 0 and 10 replicas based on the queue length:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 0  # this is what HPAScaleToZero enables
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_length
        target:
          type: AverageValue
          averageValue: "5"  # one pod per 5 queue items
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
```
The target.averageValue: "5" tells the HPA to maintain one pod per five pending items: the desired replica count is the total queue length divided by five, rounded up. When the queue has 20 items, you get 4 pods. When it drops to zero, you get zero pods. The behavior.scaleDown section shortens the stabilization window from the default 5 minutes to 60 seconds, which is useful for demos and for workloads where you want faster scale-to-zero.
Seeing It in Action
Apply the worker and HPA:
kubectl apply -f worker.yaml -f hpa.yaml
Open a watch in one terminal:
kubectl get hpa queue-worker -w
Since the queue is empty and the deployment starts at 1 replica, the HPA will first scale it down to zero. This initial scale-down is critical — it sets the internal ScaledToZero condition that enables future scale-from-zero behavior. After about a minute (our shortened stabilization window), you'll see replicas drop to 0.
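Optionally, keep a second watch on the pods themselves; you'll see the original worker terminate now, and new ones appear later when work arrives:
```bash
kubectl get pods -l app=queue-worker -w
```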
Now push 20 items into the queue:
```bash
kubectl exec redis-master-0 -c redis -- sh -c '
  for i in $(seq 1 20); do
    redis-cli LPUSH work-queue "job-$i"
  done
'
```
Within 30-60 seconds (Prometheus scrape interval + adapter refresh + HPA sync period), you'll see the HPA scale the deployment from 0 to 4 replicas. The workers will process the items — one per second per worker — and as the queue drains, the replica count drops. Once the queue is empty and the metric reads zero, the HPA scales the deployment back to zero replicas. Pods gone.
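At any point you can compare what the HPA reports against the queue itself (same pod and container names as the exec command above):
```bash
kubectl exec redis-master-0 -c redis -- redis-cli LLEN work-queue
```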
Here's what the kubectl get hpa queue-worker -w output looks like over the full cycle:
```
NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS
queue-worker   Deployment/queue-worker   0/5       0         10        1
queue-worker   Deployment/queue-worker   0/5       0         10        0
queue-worker   Deployment/queue-worker   20/5      0         10        0
queue-worker   Deployment/queue-worker   20/5      0         10        4
queue-worker   Deployment/queue-worker   8/5       0         10        4
queue-worker   Deployment/queue-worker   0/5       0         10        2
queue-worker   Deployment/queue-worker   0/5       0         10        0
```
Push items again — the HPA scales from zero immediately, no manual intervention needed. The cycle repeats indefinitely.
Cluster Autoscaler Integration
When a deployment scales to zero, its pods vanish. If those pods were the only workload on a node, the node becomes empty. Cluster autoscaler (or any node lifecycle manager) will notice the idle node and remove it after its configurable cool-down period — typically 10 minutes.
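If you run the stock cluster-autoscaler, these are the flags (shown with their upstream defaults) that govern how quickly an empty node goes away; how you set them depends on how the autoscaler is deployed, for example Helm values or managed-provider settings:
```
--scale-down-unneeded-time=10m          # how long a node must be unneeded before removal
--scale-down-delay-after-add=10m        # cool-down after a scale-up before scale-down resumes
--scale-down-utilization-threshold=0.5  # nodes below this utilization are removal candidates
```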
This is where the real cost savings live. You're not just saving pod-level resources; you're saving entire node-hours. For a workload that processes a queue for two hours a day and sits idle for twenty-two, you go from paying for 24 node-hours to paying for roughly 2.5 (accounting for scale-up overhead).
The cold-start chain matters. When an item hits the queue after a period of silence, here's what happens in sequence:
- Prometheus scrapes the redis-exporter (scrape interval, typically 15s)
- Prometheus-adapter picks up the new value (adapter refresh, typically 30s)
- HPA sync loop fires (default 15s) and decides to scale up
- Scheduler assigns the pod to a node — if no nodes are available, cluster autoscaler provisions one (60-120s for cloud providers)
- Kubelet pulls the container image and starts the pod
- Pod becomes Ready
In a warm cluster (nodes already present), the whole chain takes about 30-60 seconds. When a new node must be provisioned, add 1-2 minutes. For workloads where a minute or two of latency is unacceptable, consider these mitigations:
- Smaller container images reduce pull time. Distroless or scratch-based images start faster.
- Priority-based overprovisioning: deploy a low-priority "placeholder" pod that reserves a node. When real work arrives, the placeholder gets preempted and the worker starts immediately — no node provisioning needed. A sketch of this pattern follows the list.
- Shorter scrape intervals on Prometheus reduce the detection lag.
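Here's a minimal sketch of the placeholder pattern. The names and resource sizes are illustrative; the load-bearing details are a slightly negative priority (above cluster autoscaler's default expendable cutoff of -10, so the placeholder still holds a node) and the pause image, which does nothing but occupy its requests:
```yaml
# placeholder.yaml (illustrative; size the requests to roughly one worker's footprint)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: capacity-placeholder
value: -1            # lower than any real workload, so anything can preempt it
globalDefault: false
description: Reserves capacity; preempted as soon as real pods need the room
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: capacity-placeholder
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```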
For batch and queue workloads, the 30-60 second cold start is typically fine. You're processing a backlog, not serving interactive requests.
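If you want a number for your own setup, here's a rough way to time the cold start end to end, assuming the deployment is currently at zero replicas and you're running this in bash (SECONDS is a bash builtin):
```bash
# Push a single probe item, then poll until a worker pod reports Ready
kubectl exec redis-master-0 -c redis -- redis-cli LPUSH work-queue "cold-start-probe"
SECONDS=0
until [ "$(kubectl get deploy queue-worker -o jsonpath='{.status.readyReplicas}')" -ge 1 ] 2>/dev/null; do
  sleep 2
done
echo "cold start: ${SECONDS}s"
```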
When to Use This vs. KEDA
KEDA is the most popular alternative for event-driven autoscaling in Kubernetes. It supports 60+ scalers (Redis, SQS, RabbitMQ, Kafka, Azure Queue, and many more), handles the metrics plumbing internally, and can scale to zero without any feature gate.
Use native HPA scale-to-zero when:
- You already have Prometheus and a metrics adapter deployed
- You want fewer moving parts and no additional CRDs
- Your scaling logic is simple (one metric, linear scaling)
- You prefer to stay within the boundaries of native Kubernetes APIs
Use KEDA when:
- You don't have Prometheus or don't want to maintain the adapter
- You need advanced scaling logic (multiple triggers, complex cooldowns, cron-based scaling)
- You want out-of-the-box support for dozens of event sources without manual adapter configuration
- You need ScaledJobs (scaling Kubernetes Jobs rather than Deployments)
There's no shame in either choice. KEDA is excellent software. The native approach is simpler when your infrastructure already includes Prometheus — you're just connecting pieces that already exist.
Gotchas
You cannot use CPU or memory metrics to scale from zero. This is the most common mistake. When replica count is zero, there are no pods, so there's no CPU or memory to measure. The HPA controller requires at least one non-resource metric (external or custom) to make scaling decisions at zero replicas. If you configure only resource metrics, the HPA will refuse to scale from zero.
The deployment must not start at zero replicas. This is the subtlest gotcha and I haven't seen it documented clearly anywhere. The HPA controller maintains a ScaledToZero condition internally. It only scales FROM zero if it was the one that previously scaled the workload TO zero. If you deploy with replicas: 0 and create a fresh HPA, the controller sees zero replicas but no ScaledToZero condition, and sets ScalingActive: False with reason ScalingDisabled. The fix is simple: start with replicas: 1 and let the HPA scale it down naturally. Once the HPA has performed that initial scale-to-zero, the ScaledToZero condition is set to True and all subsequent scale-from-zero operations work correctly.
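If scale-from-zero ever seems stuck, the HPA's status conditions are where this shows up; per the behavior described above, ScalingActive: False with reason ScalingDisabled is the symptom of the started-at-zero trap:
```bash
kubectl describe hpa queue-worker | grep -A8 Conditions
```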
The stabilization window still applies at zero. By default, the HPA waits 5 minutes of sustained low metrics before scaling down. This applies to the transition from 1 to 0 as well. If you want faster scale-to-zero, configure the behavior.scaleDown section with a shorter stabilizationWindowSeconds.
The feature gate is alpha — but don't let that scare you. HPAScaleToZero has been available since Kubernetes 1.16 without breaking changes. It hasn't graduated to beta primarily due to KEP process inertia, not because of instability. The implementation is a small conditional in the HPA controller that allows minReplicas: 0 when the feature gate is enabled. It's been running in production clusters for years.
PodDisruptionBudgets at zero replicas. A PDB with minAvailable: 1 on a deployment that's at zero replicas won't prevent the scale-to-zero — PDBs apply to voluntary eviction of running pods, and scaling down is a different path. However, consider what happens if you have a PDB and scale from zero to one: the single pod is protected by the PDB immediately upon becoming ready.
Wrapping Up
The combination of native HPA scale-to-zero, external metrics from Prometheus, and cluster autoscaler gives you genuine pay-for-what-you-use economics on idle workloads. No third-party controllers, no additional CRDs, no vendor lock-in. The feature gate has been stable for twenty releases. Enable it, point your HPA at a queue metric, set minReplicas: 0, and stop paying for compute that's doing nothing.
The KEP tracking this feature is KEP-2021: HPA Scale to Zero — if you want to see it graduate to beta and become the default, that's the issue to watch and engage with.





