Kubernetes-Native AI Agent Orchestration: Pods, Queues & Worker Pools

April 13, 2026

Agent workflows map naturally to Kubernetes concepts: workers become pods, tasks become queue messages, and coordination becomes service discovery. Yet many teams start with external orchestrators like Temporal or Prefect before realizing Kubernetes can handle much of agent coordination natively. Using Kubernetes primitives for agent orchestration removes an external dependency while reusing the infrastructure that's already running your inference workloads. This article shows how to build agent systems from Kubernetes resources, explains when native orchestration makes sense, and demonstrates how GPU scheduling integrates with agent workload patterns.

Why Kubernetes-Native Agent Orchestration

External workflow orchestrators add operational complexity: another service to deploy, monitor, and scale. For many agent workloads, Kubernetes already provides the primitives you need.

Kubernetes Primitives for Agent Systems

  • Pods: Individual agent workers that process tasks
  • Jobs and CronJobs: Batch agent workflows with scheduling
  • Services: Load balancing across worker pools
  • ConfigMaps and Secrets: Configuration and credentials for workers
  • Persistent Volumes: Shared state and task queues
  • Custom Resources: Application-specific workflow definitions

When Native Orchestration Works

Kubernetes-native orchestration fits when:

  • Agent workflows are task-based rather than long-running state machines
  • You need tight integration with GPU resource scheduling
  • Operational simplicity matters more than advanced workflow features
  • Your team already has strong Kubernetes expertise

When External Orchestrators Are Better

Complex workflows with branching logic, human approval steps, or multi-day executions often require dedicated workflow engines like Temporal or Prefect.

Pod-Based Agent Workers

The fundamental unit of agent execution in Kubernetes is the pod. Each worker pod pulls tasks from a queue, processes them with LLM inference calls, and reports results.

Basic Agent Worker Pod

apiVersion: v1
kind: Pod
metadata:
  name: agent-worker
  labels:
    app: agent-worker
    version: v1
spec:
  containers:
  - name: worker
    image: agent-worker:latest
    env:
    - name: QUEUE_URL
      valueFrom:
        configMapKeyRef:
          name: agent-config
          key: queue-url
    - name: INFERENCE_ENDPOINT
      value: "https://api.gmicloud.ai/v1"
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

Deployment for Scalable Worker Pools

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker-pool
spec:
  replicas: 5
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
      - name: worker
        image: agent-worker:latest
        env:
        - name: WORKER_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name

This Deployment creates 5 identical worker pods. The replica count can be changed manually or driven by an autoscaler that reacts to resource utilization or queue depth.
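
Scaling the pool on resource utilization can be sketched with a HorizontalPodAutoscaler. This is a minimal example under stated assumptions: the name agent-worker-pool-hpa, the replica bounds, and the 70% target are illustrative, metrics-server (or an equivalent Metrics API provider) is installed, and the worker container declares a CPU request, since utilization is calculated against requests. Scaling on queue depth is not covered by this manifest; it would need an external metrics adapter or an event-driven autoscaler such as KEDA.

```yaml
# Sketch only: CPU-based autoscaling for the agent-worker-pool Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-worker-pool-hpa    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker-pool      # the Deployment defined above
  minReplicas: 2                 # assumed lower bound
  maxReplicas: 20                # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # assumed target, as a percentage of the CPU request
```

Once an autoscaler owns the replica count, the replicas field in the Deployment acts only as a starting value.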

Queue Integration Patterns

Kubernetes doesn't include a message queue. Workers either consume from an external queue service or from a queue that runs inside the cluster.

Redis-Based Task Queue

Deploy Redis as a queue service within the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-queue
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-queue
  template:
    metadata:
      labels:
        app: redis-queue  # Must match spec.selector and the Service selector
    spec:
      containers:
      - name: redis
        image: redis:6-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-storage
          mountPath: /data
      volumes:
      - name: redis-storage
        persistentVolumeClaim:
          claimName: redis-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: redis-queue-service
spec:
  selector:
    app: redis-queue
  ports:
  - port: 6379
    targetPort: 6379

Workers connect to redis-queue-service:6379 to pull tasks and report results.
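
The Redis Deployment mounts a claim named redis-pvc, which has to exist before the pod can start. A minimal sketch of that claim follows; the 1Gi size is an assumption, and omitting storageClassName assumes the cluster has a default StorageClass.

```yaml
# Sketch only: the claim referenced by the redis-queue Deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc
spec:
  accessModes:
  - ReadWriteOnce        # a single node mounts the volume, matching replicas: 1
  resources:
    requests:
      storage: 1Gi       # assumed size; choose based on expected queue backlog
```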

External Queue Integration

For production systems, external managed queues often provide better reliability:

# ConfigMap for external queue configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
data:
  queue-type: "aws-sqs"
  queue-url: "https://sqs.us-west-2.amazonaws.com/123456789/agent-tasks"
  dead-letter-queue: "https://sqs.us-west-2.amazonaws.com/123456789/agent-dlq"

Workers use this configuration to connect to AWS SQS, Google Cloud Tasks, or other external queue services.

GPU Scheduling for Agent Workloads

Agent workers that need GPU access for local inference require careful resource scheduling. Kubernetes provides several mechanisms for GPU allocation.

Node Selectors for GPU Instances

Target specific GPU-enabled node pools:

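apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-agent-workers
spec:
  selector:
    matchLabels:
      app: gpu-agent-worker  # Illustrative label; a Deployment requires a selector
  template:
    metadata:
      labels:
        app: gpu-agent-worker  # Must match the selector above
    spec:
      nodeSelector:
        accelerator: nvidia-h100  # Node label; key and value depend on how your nodes are labeled
      containers:
      - name: worker
        image: agent-worker-gpu:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Requires the NVIDIA device plugin on the node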
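
GPU node pools are often tainted so that only GPU workloads land on them, in which case a node selector alone is not enough and the pod also needs a matching toleration. The fragment below is a sketch: the taint key nvidia.com/gpu is a common convention but an assumption here, so check the taints on your nodes.

```yaml
# Sketch only: pod template fragment, merged into spec.template.spec
# of the GPU worker Deployment alongside nodeSelector.
tolerations:
- key: nvidia.com/gpu     # assumed taint key; verify against your node pool
  operator: Exists        # tolerate the taint regardless of its value
  effect: NoSchedule
```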

Kubernetes Resource Requirements for AI Agent Workloads

Different agent deployment patterns have different resource and scheduling requirements:

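| Deployment Type | CPU Request | Memory Request | GPU Requirement | Scaling Pattern | Cost Efficiency |
|---|---|---|---|---|---|
| Basic Agent Pod | 100m | 256Mi | None | Manual scaling | ⭐⭐☆☆☆ |
| GPU Agent Worker | 200m | 512Mi | 1 nvidia.com/gpu | Node selector | ⭐⭐⭐☆☆ |
| Batch Job Agent | 500m | 1Gi | Optional | Completion-based | ⭐⭐⭐⭐☆ |
| CronJob Agent | 200m | 512Mi | None | Scheduled | ⭐⭐⭐⭐⭐ |
| Shared GPU Worker | 100m | 256Mi | 1 nvidia.com/gpu (time-sliced replica, about 1/4 of a physical GPU) | Time-slicing | ⭐⭐⭐⭐⭐ |

GPU requests must be whole numbers. A shared worker requests one time-sliced replica rather than a fraction of a GPU.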

Resource Scheduling Considerations

Control GPU resource consumption across agent workloads:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "10"  # Quotas on extended resources support only the requests. prefix

This quota caps GPU requests at 10 in total across all pods in the namespace where it is applied.

GPU Sharing Strategies

For agent workloads that don't saturate full GPU capacity:

Time-Slicing GPU Access

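Multiple pods share the same physical GPU through time-slicing. Each pod still requests a whole nvidia.com/gpu; with the configuration below, the NVIDIA device plugin advertises each physical GPU as 4 schedulable replicas. Time-slicing provides no memory or fault isolation between the pods sharing a GPU.

# GPU sharing ConfigMap (consumed by the NVIDIA device plugin or GPU Operator)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
data:
  sharing.policy: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods share 1 physical GPU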

GPU Memory Partitioning

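Use MIG (Multi-Instance GPU) on H100/H200 instances to create dedicated GPU slices with isolated memory and compute. Profile names depend on the GPU's memory size; this example assumes an 80GB H100 exposed through the device plugin's mixed MIG strategy:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # Request one of up to 7 slices on an 80GB H100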

Job-Based Agent Patterns

For agent workloads with clear start and end points, Kubernetes Jobs provide better lifecycle management than long-running Deployments.

Batch Agent Processing

Process a batch of documents with automatic completion:

apiVersion: batch/v1
kind: Job
metadata:
  name: document-analysis-batch
spec:
  completions: 100        # Job succeeds after 100 successful pod completions
  parallelism: 10         # Run up to 10 pods in parallel
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: analyzer
        image: document-agent:latest
        env:
        - name: BATCH_SIZE
          value: "10"
        - name: INFERENCE_MODEL
          value: "deepseek-v4-pro"
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
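
In the Job above, pods are interchangeable, so each one has to discover its own work, for example by pulling from a queue. When the work can be split up front, an Indexed Job is an alternative: Kubernetes gives every pod a unique completion index and exposes it as the JOB_COMPLETION_INDEX environment variable. The sketch below assumes the worker image reads that variable to pick its shard; the Job name and the SHARD_COUNT variable are hypothetical.

```yaml
# Sketch only: Indexed variant of the batch Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: document-analysis-indexed   # hypothetical name
spec:
  completions: 100
  parallelism: 10
  completionMode: Indexed           # pods receive indexes 0 through 99
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: analyzer
        image: document-agent:latest
        env:
        - name: SHARD_COUNT         # hypothetical variable read by the worker
          value: "100"
        # JOB_COMPLETION_INDEX is injected automatically in Indexed mode.
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
```

A failed index is retried under the same index, so the worker should make processing of a shard idempotent.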

Scheduled Agent Workflows

Run agent tasks on a schedule with CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report-agent
spec:
  schedule: "0 6 * * *"  # Run at 6 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report-generator
            image: report-agent:latest
            env:
            - name: INFERENCE_ENDPOINT
              value: "https://api.gmicloud.ai/v1"
          restartPolicy: OnFailure
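
A few optional CronJob fields are worth considering for agent runs that can overlap or start late. The fragment below is a sketch with assumed values. Without timeZone, the schedule is interpreted in the time zone of the kube-controller-manager, so "6 AM" may not be local time; the timeZone field needs a Kubernetes version that supports it.

```yaml
# Sketch only: CronJob spec fragment; these fields sit alongside schedule.
spec:
  schedule: "0 6 * * *"
  timeZone: "Etc/UTC"              # assumed zone; replace with your own
  concurrencyPolicy: Forbid        # skip a run while the previous one is still active
  startingDeadlineSeconds: 600     # assumed: drop a run that cannot start within 10 minutes
  successfulJobsHistoryLimit: 3    # finished Jobs kept for inspection
  failedJobsHistoryLimit: 3
```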

Service Discovery and Load Balancing

Agent workers often need to discover and connect to inference endpoints, databases, and other services within the cluster.

Service Configuration

apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: local-inference-server
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
---
# For external inference services
apiVersion: v1
kind: Service
metadata:
  name: gmi-cloud-inference
spec:
  type: ExternalName
  externalName: api.gmicloud.ai
  ports:
  - port: 443

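Workers can connect to inference-service:8000 for local inference through a stable service name. The gmi-cloud-inference ExternalName Service resolves as a DNS CNAME to api.gmicloud.ai, but HTTPS clients that connect using the in-cluster name will typically fail TLS hostname verification, because the server's certificate is issued for api.gmicloud.ai rather than for the Service name. For HTTPS endpoints, pointing workers at the real hostname (as INFERENCE_ENDPOINT does in the earlier examples) is usually simpler.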

Load Balancing Across Worker Pools

Distribute incoming agent requests across worker pools:

apiVersion: v1
kind: Service
metadata:
  name: agent-worker-service
spec:
  selector:
    app: agent-worker
  ports:
  - port: 8080
    targetPort: 8080
  sessionAffinity: None  # No client stickiness; connections are spread across ready pods

Custom Resource Definitions for Agent Workflows

For complex agent workflows, Custom Resource Definitions (CRDs) let you define application-specific resources that Kubernetes can manage natively.

Agent Workflow CRD

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agentworkflows.ai.example.com
spec:
  group: ai.example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              tasks:
                type: array
                items:
                  type: object
                  properties:
                    name:
                      type: string
                    model:
                      type: string
                    prompt:
                      type: string
              workers:
                type: integer
              gpu_required:
                type: boolean
  scope: Namespaced
  names:
    plural: agentworkflows
    singular: agentworkflow
    kind: AgentWorkflow

Using the Custom Resource

apiVersion: ai.example.com/v1
kind: AgentWorkflow
metadata:
  name: customer-analysis
spec:
  tasks:
  - name: sentiment-analysis
    model: "gpt-5.4-mini"
    prompt: "Analyze sentiment of customer feedback"
  - name: action-recommendation
    model: "deepseek-v4-pro"
    prompt: "Recommend actions based on analysis"
  workers: 5
  gpu_required: false

The CRD only teaches the API server to store and validate AgentWorkflow objects. To act on them, you deploy a custom controller that watches for AgentWorkflow resources and creates the necessary Deployments, Services, and ConfigMaps.
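
Whatever framework the controller is written in, it needs permission to watch AgentWorkflow objects and to manage the resources it creates. A minimal RBAC sketch follows; the agentworkflow-controller name and the default namespace are assumptions, and the verb lists should be trimmed to what the controller actually does.

```yaml
# Sketch only: namespaced permissions for a hypothetical AgentWorkflow controller.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agentworkflow-controller    # hypothetical name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agentworkflow-controller
rules:
- apiGroups: ["ai.example.com"]     # group from the CRD above
  resources: ["agentworkflows"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]                   # core API group
  resources: ["services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agentworkflow-controller
subjects:
- kind: ServiceAccount
  name: agentworkflow-controller
  namespace: default                # assumed namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: agentworkflow-controller
```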

Integrating with Managed Inference Platforms

While Kubernetes can orchestrate agent workers natively, many production systems still use external inference providers for model serving.

GMI Cloud Integration Pattern

apiVersion: v1
kind: Secret
metadata:
  name: gmi-cloud-credentials
type: Opaque
data:
  api-key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-workers
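spec:
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker  # Must match the selector above
    spec: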
      containers:
      - name: worker
        image: agent-worker:latest
        env:
        - name: GMI_CLOUD_API_KEY
          valueFrom:
            secretKeyRef:
              name: gmi-cloud-credentials
              key: api-key
        - name: INFERENCE_MODELS
          value: "deepseek-v4-pro,gpt-5.4-mini"

This pattern lets agent workers use GMI Cloud's serverless inference while Kubernetes handles worker lifecycle, scaling, and resource management.

Cost Optimization Through Native Scheduling

Kubernetes-native orchestration enables cost optimization through precise resource scheduling:

A worked example: A document processing agent system running 50 analysis jobs per hour. Using Kubernetes Jobs with completion-based scaling:

  • Job-based workers: Start only when documents are uploaded, complete and terminate when done
  • Resource requests: Each worker requests 200m CPU, 512Mi memory
  • GPU sharing: 4 workers share 1 H100 instance ($2.00/hr) through time-slicing
  • Inference calls: Pay per request through GMI Cloud serverless endpoints

Total cost: ~$150/month for infrastructure + ~$300/month for inference = $450/month

Compare to an always-on approach: a continuously running H100 at $2.00/hr costs $2.00 × 24 × 30 = $1,440/month, plus worker infrastructure.

Configuration Management and Secrets

Agent workflows require careful management of configuration and sensitive data:

ConfigMap for Application Settings

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-workflow-config
data:
  max-retries: "3"
  timeout-seconds: "300"
  batch-size: "10"
  models.json: |
    {
      "fast": "gpt-5.4-mini",
      "cost-effective": "deepseek-v4-pro",
      "analysis": "gemini-3.5-flash"
    }

Secret Management for API Keys

apiVersion: v1
kind: Secret
metadata:
  name: inference-credentials
type: Opaque
data:
  gmi-api-key: <base64-key>
  openai-api-key: <base64-key>
stringData:
  database-url: "postgresql://user:pass@db:5432/agents"
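
Workers can consume these settings as environment variables or as mounted files. The fragment below is a sketch of both: the MAX_RETRIES and DATABASE_URL variable names and the /etc/agent mount path are assumptions, not something the worker image is known to expect.

```yaml
# Sketch only: pod template fragment, merged into a worker Deployment's
# spec.template.spec.
containers:
- name: worker
  image: agent-worker:latest
  env:
  - name: MAX_RETRIES               # hypothetical variable name
    valueFrom:
      configMapKeyRef:
        name: agent-workflow-config
        key: max-retries
  - name: DATABASE_URL              # hypothetical variable name
    valueFrom:
      secretKeyRef:
        name: inference-credentials
        key: database-url
  volumeMounts:
  - name: workflow-config
    mountPath: /etc/agent           # assumed path; file appears as /etc/agent/models.json
    readOnly: true
volumes:
- name: workflow-config
  configMap:
    name: agent-workflow-config
    items:
    - key: models.json
      path: models.json
```

Values injected as environment variables are fixed at container start, while mounted ConfigMap files are refreshed after an update, so the mounted form suits settings that change without a redeploy.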

Monitoring and Observability

Kubernetes exposes pod-level signals such as status, restarts, and (with metrics-server or a similar add-on) CPU and memory usage, but agent workflows require application-specific observability:

Pod Metrics

  • Pod restart count: Indicates worker stability issues
  • Resource utilization: CPU/memory usage patterns
  • Ready/Running status: Worker availability

Application Metrics

Expose custom metrics from agent workers:

# Service exposing the worker metrics port for Prometheus to scrape
apiVersion: v1
kind: Service
metadata:
  name: agent-worker-metrics
  labels:
    monitoring: enabled
spec:
  selector:
    app: agent-worker
  ports:
  - name: metrics
    port: 9090
    targetPort: 9090

Workers expose metrics like tasks completed, inference latency, and error rates on the metrics endpoint.
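
The Service above only exposes the port; something still has to tell Prometheus to scrape it. If the cluster runs the Prometheus Operator (an assumption), a ServiceMonitor that selects the Service by its monitoring: enabled label is one way to do that. The /metrics path and the 30s interval are also assumptions.

```yaml
# Sketch only: requires the Prometheus Operator CRDs to be installed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-worker-metrics
spec:
  selector:
    matchLabels:
      monitoring: enabled      # label on the agent-worker-metrics Service
  endpoints:
  - port: metrics              # named port from the Service
    path: /metrics             # assumed metrics path
    interval: 30s              # assumed scrape interval
```

The Prometheus resource must be configured to select this ServiceMonitor, typically through its serviceMonitorSelector.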

Best Practices for Kubernetes-Native Agents

Design for pod lifecycle: Workers should handle graceful shutdown, save progress, and resume cleanly after restart

Use resource requests and limits: Prevent resource contention between agent workers and other cluster workloads

Implement health checks: Liveness and readiness probes ensure Kubernetes can detect and recover from worker failures

Separate configuration from code: Use ConfigMaps and Secrets for environment-specific settings

Monitor resource quotas: Track GPU and CPU usage to prevent agent workloads from overwhelming the cluster

Best for task-based workflows: Kubernetes orchestration works well for discrete tasks, less well for long-running stateful processes

Not ideal for complex branching: Advanced workflow patterns may require external orchestrators
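
The lifecycle, resource, and health-check practices above can be sketched as a pod template fragment. The probe paths, timings, and the 120-second grace period are assumptions; the worker is assumed to serve HTTP health endpoints on port 8080 and to stop pulling new tasks when it receives SIGTERM.

```yaml
# Sketch only: pod template fragment, merged into a worker Deployment's
# spec.template.spec.
terminationGracePeriodSeconds: 120   # assumed: time for an in-flight task to finish after SIGTERM
containers:
- name: worker
  image: agent-worker:latest
  ports:
  - name: http
    containerPort: 8080
  readinessProbe:                    # gates Service traffic until the worker is ready
    httpGet:
      path: /readyz                  # hypothetical endpoint
      port: http
    periodSeconds: 10
  livenessProbe:                     # restarts the container if the worker hangs
    httpGet:
      path: /healthz                 # hypothetical endpoint
      port: http
    initialDelaySeconds: 15
    periodSeconds: 20
    failureThreshold: 3
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```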

When Kubernetes Orchestration Is the Right Choice

Kubernetes-native agent orchestration reduces operational complexity when your infrastructure and team are already Kubernetes-centric. It works best for task-based agent workflows that can be broken into discrete jobs, need tight integration with GPU scheduling, or benefit from the same monitoring and deployment tools used for the rest of your infrastructure. The approach scales from simple job processing to complex multi-stage agent pipelines while leveraging the platform you're already running for inference and other services.

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. For Kubernetes-orchestrated agent workers, GMI Cloud provides inference endpoints that integrate cleanly with your existing cluster networking and service discovery. You can explore integration patterns and pricing at console.gmicloud.ai and docs.gmicloud.ai.

Colin Mo
