Maciej Łopalewski
Senior Software Engineer

Scaling Dagster on Kubernetes: Best Practices for 50+ Code Locations

Sep 24, 2025 · 7 min read

As organizations scale their data platforms, many teams turn to Dagster on Kubernetes for reliable orchestration. What begins as a single, manageable code location often grows into multi-code location deployments, with dozens of pipelines owned by different teams or domains. While this modular approach is one of Dagster’s biggest strengths, it introduces new challenges in scaling, resource efficiency, and maintainability - especially once you’re managing 50 or more code locations in Kubernetes.

Managing a significant number of code locations is not just a matter of adding more entries to a YAML file. It has profound implications for resource consumption, deployment speed, and overall maintainability.

This article dives deep into the trade-offs of managing Dagster at scale on Kubernetes. We’ll explore different deployment models, discuss performance and observability, and share real-world patterns (and anti-patterns) to help you design a data platform that is both powerful and efficient.

First, What Are Dagster Code Locations?

In Dagster, a code location is a collection of Dagster definitions (like assets, jobs, schedules, and sensors) that are loaded in a single Python environment. Think of it as a self-contained package of data pipelines. By isolating code into distinct locations, you achieve several benefits:

  • Fault Tolerance: An error in one code location (e.g., a missing Python dependency) won’t prevent other code locations from loading.
  • Independent Deployments: Team A can update their pipelines without forcing Team B to redeploy.
  • Dependency Management: Each code location can have its own requirements.txt or pyproject.toml, avoiding conflicts between teams that need different library versions.

Dagster’s central components, like the webserver and the daemon, communicate with these code locations via a gRPC API to fetch definitions and launch runs. This separation is key to its scalability.
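
To make that concrete, here is a minimal sketch of what a single code location might contain. The asset, job, and schedule names are hypothetical; the structure assumes the standard Definitions-based layout from recent Dagster versions.

# definitions.py -- a minimal, hypothetical code location
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def raw_orders():
    # In a real pipeline this would pull data from a source system.
    return [{"order_id": 1, "amount": 42.0}]


@asset
def order_totals(raw_orders):
    # A simple downstream transformation of the upstream asset.
    return sum(order["amount"] for order in raw_orders)


# A job and schedule that materialize both assets daily.
daily_orders_job = define_asset_job("daily_orders_job", selection="*")
daily_orders_schedule = ScheduleDefinition(job=daily_orders_job, cron_schedule="0 6 * * *")

# The Definitions object is what the code location's gRPC server exposes
# to the webserver and daemon.
defs = Definitions(
    assets=[raw_orders, order_totals],
    jobs=[daily_orders_job],
    schedules=[daily_orders_schedule],
)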

The Hidden Costs of Managing 50+ Code Locations

When you deploy Dagster on Kubernetes using the official Helm chart, the standard approach is to use user code deployments. This feature creates a dedicated Kubernetes Deployment and Service for each code location you define.

A typical Dagster architecture on Kubernetes, where each code location runs in its own pod.

This model works perfectly for a handful of locations. But as you scale past 50, you start to feel the pain points:

  1. Resource Overhead: Each code location pod consumes resources just by running. A baseline Python process, the gRPC server, and health checks require a certain amount of CPU and memory. While a single pod might only need 100MB of RAM, 50 of them instantly consume 5GB - and that’s before they even load your code. This idle resource consumption can become a significant cost.
  2. Deployment Bottlenecks: If you need to update a shared library or a base Docker image used by all code locations, you trigger a massive rollout. Kubernetes must terminate 50+ old pods and schedule 50+ new ones. In a resource-constrained cluster, this can lead to long deployment times, "Pod unschedulable" events, and service degradation.
  3. The "Launchpad" Problem: It's crucial to remember that these code location pods do not run your data pipelines. Their primary role is to serve metadata to the webserver and provide the necessary code to the Dagster daemon, which then launches another pod (the "run pod" or "job pod") to actually execute the pipeline. This means your infrastructure must support both the standing army of code location pods and the transient pods for active runs, further compounding resource pressure.

Kubernetes Deployment Models: Trade-Offs and Strategies

Given the challenges, let's analyze the two primary architectural models for deploying Dagster code locations on Kubernetes.

Model 1: The Standard "Pod per Code Location"

This is the default and recommended approach using user-code-deployments in the Dagster Helm chart.

How it works: You define each code location in your values.yaml file, and Helm creates a separate Kubernetes Deployment for each.

# values.yaml
userCodeDeployments:
  enabled: true
  deployments:
    - name: "sales-analytics"
      image:
        repository: "my-registry/sales-analytics"
        tag: "0.2.1"
      # ... resources, env vars, etc.
    - name: "marketing-etl"
      image:
        repository: "my-registry/marketing-etl"
        tag: "1.5.0"
      # ...
    # ... 50 more entries

Pros:

  • Full Isolation: The best model for fault tolerance and dependency management.
  • Clear Ownership: Easy to map a code location pod to a specific team or project.
  • Granular Updates: An update to the sales-analytics image only triggers a rollout for that single deployment.

Cons:

  • High Resource Overhead: The primary driver of idle resource consumption.
  • Slow Global Deployments: Updating all locations at once is slow and resource-intensive.
  • Cluster Limits: Can strain clusters that have a low limit on the total number of pods.

Model 2: The Monolithic "Single Pod" Approach (A Workaround)

For teams struggling with the overhead of the standard model, an alternative is to consolidate all code locations into a single process. This is not officially recommended as it moves away from Dagster's core isolation principles, but it can be a pragmatic solution in specific, resource-constrained scenarios.

How it works: You can "hack" the official Helm chart to run all your code locations within the main Dagster webserver and daemon pods. This involves building a single, monolithic Docker image containing the code for all pipelines and providing a workspace.yaml that loads them from the local filesystem.

# In your monolithic Dockerfile
COPY ./pipelines/sales_analytics /opt/dagster/app/sales_analytics
COPY ./pipelines/marketing_etl /opt/dagster/app/marketing_etl
# ...

# workspace.yaml loaded into the webserver/daemon
load_from:
  - python_module:
      module_name: sales_analytics.definitions
      working_directory: /opt/dagster/app/sales_analytics
  - python_module:
      module_name: marketing_etl.definitions
      working_directory: /opt/dagster/app/marketing_etl
  # ... all other locations

You would disable userCodeDeployments and ensure this workspace file is used by the main Dagster pods.

Pros:

  • Minimal Resource Footprint: Dramatically reduces the number of standing pods, saving significant idle resources.
  • Fast Deployments: An update involves rolling out just a few pods (webserver, daemon), which is much faster than 50+.

Cons:

  • No Fault Tolerance: A single broken dependency or syntax error in one code location can bring down the entire system.
  • Dependency Hell: All teams must agree on a single, shared set of Python dependencies.
  • Massive Pods: The webserver and daemon pods become huge, potentially requiring very large and expensive Kubernetes nodes to run.
  • Coupled Deployments: Any change requires rebuilding and redeploying the entire monolithic image.

Strategies for Maintainability and Scaling

Instead of choosing one extreme, the best strategy often lies in intelligent application of the standard model.

  1. Use a Separate Image Per Code Location: Avoid using a single base image for all your code locations. While it seems efficient, it creates tight coupling. Instead, build and version a Docker image for each code location independently. This ensures that only the code locations that have actually changed will be redeployed during an update.
  2. Aggressively Monitor Resources: Use tools like Prometheus and Grafana to monitor the CPU and memory usage of your code location pods. Are they constantly sitting at 5% of their requested resources? You are likely overprovisioning. Adjust their resources.requests in your Helm chart to free up capacity for run pods.
  3. Optimize Deployment Times: Keep your Docker images lean. A smaller image pulls faster, leading to quicker pod startup times. Use multi-stage builds and avoid including unnecessary build-time dependencies in your final image.

Real-World Patterns and Anti-Patterns

Theory is one thing, but production issues are the best teacher. Here are some patterns to emulate and anti-patterns to avoid.

Anti-Pattern: The Heavyweight Code Location

A common mistake is to load large models or initialize expensive clients at the module level of your Dagster code. Remember: everything you import and define globally in your code location gets loaded into memory the moment the pod starts.

Real-world example: A team was using the libpostal library for address parsing. Simply adding import postal to their asset definitions caused the memory footprint of their code location pod to jump by 2GB. When several other teams copied this pattern, the cluster's memory usage skyrocketed, causing widespread performance issues.

# assets/address_parsing.py
from postal.parser import parse_address  # <-- This import loads a large model into memory!

from dagster import asset


@asset
def parsed_addresses(raw_addresses):
    # This asset's code location pod now holds a 2GB model in memory,
    # even when the asset is not running.
    return [parse_address(addr) for addr in raw_addresses]

The Fix: There are two great ways to solve this problem.

  1. Lazy Loading: The simplest fix is to lazily import or load expensive resources inside your asset or op functions. This ensures the resource is only loaded into memory in the short-lived run pod, not the long-running code location pod.
# A better approach
from dagster import asset


@asset
def parsed_addresses(raw_addresses):
    from postal.parser import parse_address  # <-- Import inside the function
    return [parse_address(addr) for addr in raw_addresses]
  2. Externalize as a Microservice: For an even more robust and scalable solution, you can externalize the heavy dependency entirely. You can deploy libpostal as a microservice (e.g., using a wrapper like libpostal-rest) to gain more control over its resources. This centralizes the resource-intensive component into a single, dedicated instance that you can manage and scale independently, serving all your Dagster pipelines via a simple network call, as sketched after this list.
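
A sketch of what that could look like, assuming a libpostal-rest-style service reachable at a hypothetical in-cluster URL with a /parser endpoint; adjust the URL and payload to whatever wrapper you actually deploy.

# Sketch: calling an externalized address-parsing service instead of
# loading libpostal into the code location process. The service URL and
# endpoint are assumptions based on a libpostal-rest-style deployment.
import requests

from dagster import asset

PARSER_URL = "http://libpostal-rest.data-platform.svc.cluster.local:8080/parser"


@asset
def parsed_addresses(raw_addresses):
    parsed = []
    for addr in raw_addresses:
        resp = requests.post(PARSER_URL, json={"query": addr}, timeout=10)
        resp.raise_for_status()
        parsed.append(resp.json())
    return parsed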

Pattern: Domain-Driven Consolidation

If you have many small, related code locations owned by the same team, consider consolidating them. Instead of having sales-team-daily, sales-team-weekly, and sales-team-hourly, merge them into a single sales-team code location. This reduces pod sprawl without creating a true monolith.
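
A minimal sketch of that consolidation, assuming each former location still exposes its own Definitions object and that you are on a Dagster version that provides Definitions.merge; the module names are hypothetical.

# definitions.py for the consolidated "sales-team" code location.
# Assumes each former location ships its own Definitions object in a submodule.
from dagster import Definitions

from sales_team import daily, hourly, weekly  # hypothetical submodules

defs = Definitions.merge(daily.defs, weekly.defs, hourly.defs)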

Conclusion: When to Split and When to Consolidate

Scaling Dagster on Kubernetes is ultimately about balance. The “Pod per Code Location” model should remain your default, as it offers the strongest isolation and aligns with Dagster’s design. To control resource usage, consolidate code locations that share ownership and dependencies, and apply best practices for monitoring and deployment optimization. Reserve the monolithic single-pod model only as a temporary workaround in resource-constrained environments. By making smart choices about when to split and when to consolidate, you can build a scalable, efficient Dagster deployment on Kubernetes that supports both rapid growth and long-term maintainability.

FAQ: Scaling Dagster on Kubernetes

How many code locations can Dagster handle on Kubernetes?

Dagster can handle dozens or even hundreds of code locations on Kubernetes, but resource overhead grows quickly. Beyond 50 code locations, you may face deployment delays, idle resource costs, and cluster limits, requiring careful optimization.

What is the best deployment model for Dagster code locations?

The recommended approach is the “Pod per Code Location” model, which offers strong isolation, fault tolerance, and team-level independence. It’s the default strategy in Dagster’s official Helm chart.

Can I run all Dagster code locations in one pod?

Yes, you can consolidate code locations into a monolithic single pod, but this approach sacrifices fault tolerance and dependency isolation. It should only be used in resource-constrained environments or as a short-term workaround.

How do I optimize resource usage in Dagster on Kubernetes?

Use separate Docker images per code location, monitor pod CPU and memory with Prometheus/Grafana, and keep Docker images lean. Consolidating small, related code locations also helps reduce pod sprawl.

When should I consolidate Dagster code locations?

Consolidation makes sense when multiple code locations are owned by the same team, share dependencies, and are deployed together. This reduces Kubernetes pod overhead while maintaining logical boundaries.
