Kubernetes-Native AI Agent Orchestration: Pods, Queues & Worker Pools
April 13, 2026
Agent workflows map naturally to Kubernetes concepts: workers become pods, tasks become queue messages, and coordination becomes service discovery. But most teams start with external orchestrators like Temporal or Prefect before realizing Kubernetes can handle agent coordination natively. Using Kubernetes primitives for agent orchestration eliminates external dependencies while leveraging the same infrastructure that's already running your inference workloads. This article shows how to build agent systems using only Kubernetes resources, explains when native orchestration makes sense, and demonstrates how GPU scheduling integrates with agent workload patterns.
Why Kubernetes-Native Agent Orchestration
External workflow orchestrators add operational complexity: another service to deploy, monitor, and scale. For many agent workloads, Kubernetes already provides the primitives you need.
Kubernetes Primitives for Agent Systems
- Pods: Individual agent workers that process tasks
- Jobs and CronJobs: Batch agent workflows with scheduling
- Services: Load balancing across worker pools
- ConfigMaps and Secrets: Configuration and credentials for workers
- Persistent Volumes: Shared state and task queues
- Custom Resources: Application-specific workflow definitions
When Native Orchestration Works
Kubernetes-native orchestration fits when:
- Agent workflows are task-based rather than long-running state machines
- You need tight integration with GPU resource scheduling
- Operational simplicity matters more than advanced workflow features
- Your team already has strong Kubernetes expertise
When External Orchestrators Are Better
Complex workflows with branching logic, human approval steps, or multi-day executions often require dedicated workflow engines like Temporal or Prefect.
Pod-Based Agent Workers
The fundamental unit of agent execution in Kubernetes is the pod. Each worker pod pulls tasks from a queue, processes them with LLM inference calls, and reports results.
Basic Agent Worker Pod
apiVersion: v1
kind: Pod
metadata:
  name: agent-worker
  labels:
    app: agent-worker
    version: v1
spec:
  containers:
    - name: worker
      image: agent-worker:latest
      env:
        - name: QUEUE_URL
          valueFrom:
            configMapKeyRef:
              name: agent-config
              key: queue-url
        - name: INFERENCE_ENDPOINT
          value: "https://api.gmicloud.ai/v1"
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
Deployment for Scalable Worker Pools
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker-pool
spec:
  replicas: 5
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
        - name: worker
          image: agent-worker:latest
          env:
            - name: WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
This Deployment runs five identical worker pods. The replica count can be changed by hand or by an autoscaler that reacts to resource utilization or queue depth.
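Scaling on resource utilization can be sketched with a HorizontalPodAutoscaler. This is a minimal example under stated assumptions: the name `agent-worker-pool-hpa`, the replica bounds, and the 70% target are illustrative values that do not come from the article. It also assumes the cluster runs a metrics server and that the worker container declares a CPU request, which the Deployment above does not yet do. Scaling on queue depth would additionally need an external metrics source.

```yaml
# Sketch: CPU-based autoscaling for the agent-worker-pool Deployment.
# Name, replica bounds, and target utilization are assumed values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-worker-pool-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker-pool
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of the container's CPU request
```

Once an autoscaler manages the Deployment, it is usually best to stop setting `replicas` in the Deployment manifest so that re-applying it does not override the autoscaler's decision.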
Queue Integration Patterns
Kubernetes doesn't include a message queue, but workers can connect to an external queue or to a queue service that runs inside the cluster.
Redis-Based Task Queue
Deploy Redis as a queue service within the cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-queue
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-queue
  template:
    metadata:
      labels:
        app: redis-queue  # must match the selector above and the Service selector below
    spec:
      containers:
        - name: redis
          image: redis:6-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-storage
              mountPath: /data
      volumes:
        - name: redis-storage
          persistentVolumeClaim:
            claimName: redis-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: redis-queue-service
spec:
  selector:
    app: redis-queue
  ports:
    - port: 6379
      targetPort: 6379
Workers connect to redis-queue-service:6379 to pull tasks and report results.
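The Redis Deployment references a claim named `redis-pvc` that the manifest does not define. A minimal sketch of that claim might look like the following. The 1Gi size and the reliance on the cluster's default StorageClass are assumptions, not values from the article.

```yaml
# Sketch: the PersistentVolumeClaim the Redis Deployment mounts at /data.
# Storage size is an assumed value; no storageClassName is set, so the
# cluster's default StorageClass (if one exists) is used.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc
spec:
  accessModes:
    - ReadWriteOnce  # a single Redis replica mounts the volume
  resources:
    requests:
      storage: 1Gi
```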
External Queue Integration
For production systems, external managed queues often provide better reliability:
# ConfigMap for external queue configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
data:
  queue-type: "aws-sqs"
  queue-url: "https://sqs.us-west-2.amazonaws.com/123456789/agent-tasks"
  dead-letter-queue: "https://sqs.us-west-2.amazonaws.com/123456789/agent-dlq"
Workers use this configuration to connect to AWS SQS, Google Cloud Tasks, or other external queue services.
GPU Scheduling for Agent Workloads
Agent workers that need GPU access for local inference require careful resource scheduling. Kubernetes provides several mechanisms for GPU allocation.
Node Selectors for GPU Instances
Target specific GPU-enabled node pools:
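apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-agent-workers
spec:
  selector:  # required for a Deployment; must match the template labels
    matchLabels:
      app: gpu-agent-worker  # label value chosen for this example
  template:
    metadata:
      labels:
        app: gpu-agent-worker
    spec:
      nodeSelector:
        accelerator: nvidia-h100
      containers:
        - name: worker
          image: agent-worker-gpu:latest
          resources:
            limits:
              nvidia.com/gpu: 1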
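Some clusters also taint their GPU nodes so that only GPU workloads land on them. Where that is the case, the pod template needs a matching toleration in addition to the node selector. The fragment below is a sketch that assumes the taint key is `nvidia.com/gpu` with the `NoSchedule` effect. The article does not specify a taint, so check the actual key and effect on your node pool.

```yaml
# Sketch: pod-template fragment for a GPU Deployment on tainted GPU nodes.
# The taint key and effect are assumptions about the node pool.
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists   # tolerate the taint regardless of its value
          effect: NoSchedule
```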
Kubernetes Resource Requirements for AI Agent Workloads
Different agent deployment patterns have different resource and scheduling requirements:
| Deployment Type | CPU Request | Memory Request | GPU Requirement | Scaling Pattern | Cost Efficiency |
|---|---|---|---|---|---|
| Basic Agent Pod | 100m | 256Mi | None | Manual scaling | ⭐⭐☆☆☆ |
| GPU Agent Worker | 200m | 512Mi | 1 nvidia.com/gpu | Node selector | ⭐⭐⭐☆☆ |
| Batch Job Agent | 500m | 1Gi | Optional | Completion-based | ⭐⭐⭐⭐☆ |
| CronJob Agent | 200m | 512Mi | None | Scheduled | ⭐⭐⭐⭐⭐ |
| Shared GPU Worker | 100m | 256Mi | 1 time-sliced nvidia.com/gpu replica (about 1/4 of a GPU) | Time-slicing | ⭐⭐⭐⭐⭐ |
Resource Scheduling Considerations
Control GPU resource consumption across agent workloads:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "10"
This quota caps the worker pods in its namespace at 10 GPUs in total. Extended resources such as nvidia.com/gpu cannot be overcommitted, so a quota accepts only the requests. prefix for them, not limits.
GPU Sharing Strategies
For agent workloads that don't saturate full GPU capacity:
Time-Slicing GPU Access
Multiple pods share the same physical GPU through time-slicing, configured here in the NVIDIA device plugin's configuration format:
# GPU sharing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
data:
  sharing.policy: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # each physical GPU is advertised as 4 schedulable replicas
GPU Memory Partitioning
Use MIG (Multi-Instance GPU) on H100/H200 GPUs to create dedicated, hardware-isolated GPU slices:
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # One 1g.10gb slice, 1/7 of an H100 80GB's compute; profile names vary by GPU model
Job-Based Agent Patterns
For agent workloads with clear start and end points, Kubernetes Jobs provide better lifecycle management than long-running Deployments.
Batch Agent Processing
Process a batch of documents with automatic completion:
apiVersion: batch/v1
kind: Job
metadata:
  name: document-analysis-batch
spec:
  completions: 100  # Process 100 documents
  parallelism: 10   # Use 10 workers in parallel
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: analyzer
          image: document-agent:latest
          env:
            - name: BATCH_SIZE
              value: "10"
            - name: INFERENCE_MODEL
              value: "deepseek-v4-pro"
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
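With plain `completions`, every pod is identical, so each worker has to find its own share of the batch, typically by pulling from a queue. One alternative is an Indexed Job, where Kubernetes hands each pod a unique completion index. The sketch below assumes the worker image maps that index to a document. That mapping and the Job name are hypothetical and not part of the article.

```yaml
# Sketch: Indexed variant of the batch Job. The Job name is hypothetical,
# and the worker is assumed to translate its index into a document ID.
apiVersion: batch/v1
kind: Job
metadata:
  name: document-analysis-indexed
spec:
  completionMode: Indexed  # pods receive indexes 0..99, one per completion
  completions: 100
  parallelism: 10
  backoffLimit: 6          # failed-pod retries before the Job is marked failed (6 is the default)
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: analyzer
          image: document-agent:latest
          # In Indexed mode Kubernetes sets JOB_COMPLETION_INDEX in the
          # container environment automatically; no env entry is needed.
```

Indexed mode suits batches whose size is known up front. A queue-driven Job is the better fit when documents arrive continuously.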
Scheduled Agent Workflows
Run agent tasks on a schedule with CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report-agent
spec:
  schedule: "0 6 * * *"  # Run at 6 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: report-generator
              image: report-agent:latest
              env:
                - name: INFERENCE_ENDPOINT
                  value: "https://api.gmicloud.ai/v1"
          restartPolicy: OnFailure
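A few optional CronJob fields are worth considering for agent runs that may take longer than expected. The fragment below is a sketch with assumed values: the time zone, the deadline, and the history limits are illustrative choices, not settings from the article.

```yaml
# Sketch: optional scheduling controls for the daily-report-agent CronJob.
# All values are assumptions chosen for illustration.
spec:
  schedule: "0 6 * * *"
  timeZone: "Etc/UTC"            # without this, the schedule follows the controller manager's local time zone
  concurrencyPolicy: Forbid      # skip a run if the previous one is still active
  startingDeadlineSeconds: 600   # give up on a run that cannot start within 10 minutes
  successfulJobsHistoryLimit: 3  # finished Jobs kept for inspection (3 is the default)
  failedJobsHistoryLimit: 1      # failed Jobs kept for inspection (1 is the default)
```

`concurrencyPolicy: Forbid` is the conservative choice when two overlapping report runs would duplicate inference spend.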
Service Discovery and Load Balancing
Agent workers often need to discover and connect to inference endpoints, databases, and other services within the cluster.
Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: local-inference-server
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
# For external inference services
apiVersion: v1
kind: Service
metadata:
  name: gmi-cloud-inference
spec:
  type: ExternalName
  externalName: api.gmicloud.ai
  ports:
    - port: 443
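Workers can reach local inference at inference-service:8000 and managed inference at gmi-cloud-inference:443 through consistent service names. An ExternalName Service is only a DNS CNAME, so HTTPS clients must still present api.gmicloud.ai as the TLS server name and Host header.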
Load Balancing Across Worker Pools
Distribute incoming agent requests across worker pools:
apiVersion: v1
kind: Service
metadata:
  name: agent-worker-service
spec:
  selector:
    app: agent-worker
  ports:
    - port: 8080
      targetPort: 8080
  sessionAffinity: None  # No client stickiness; connections are spread across ready pods
Custom Resource Definitions for Agent Workflows
For complex agent workflows, Custom Resource Definitions (CRDs) let you define application-specific resources that Kubernetes can manage natively.
Agent Workflow CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agentworkflows.ai.example.com
spec:
  group: ai.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                tasks:
                  type: array
                  items:
                    type: object
                    properties:
                      name:
                        type: string
                      model:
                        type: string
                      prompt:
                        type: string
                workers:
                  type: integer
                gpu_required:
                  type: boolean
  scope: Namespaced
  names:
    plural: agentworkflows
    singular: agentworkflow
    kind: AgentWorkflow
Using the Custom Resource
apiVersion: ai.example.com/v1
kind: AgentWorkflow
metadata:
  name: customer-analysis
spec:
  tasks:
    - name: sentiment-analysis
      model: "gpt-5.4-mini"
      prompt: "Analyze sentiment of customer feedback"
    - name: action-recommendation
      model: "deepseek-v4-pro"
      prompt: "Recommend actions based on analysis"
  workers: 5
  gpu_required: false
The CRD only registers the resource type. A custom controller, which you write and deploy yourself, watches for AgentWorkflow resources and creates the necessary Deployments, Services, and ConfigMaps automatically.
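Such a controller needs permission to watch the custom resource and to manage the objects it creates. The ClusterRole below is a sketch of those permissions. The role name is hypothetical, and the verb lists are an assumption about what a typical controller would need, since the article does not describe one.

```yaml
# Sketch: RBAC permissions for a hypothetical AgentWorkflow controller.
# The role name and verb lists are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentworkflow-controller
rules:
  - apiGroups: ["ai.example.com"]
    resources: ["agentworkflows"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]  # core API group
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

The role takes effect only once it is bound to the controller's ServiceAccount with a ClusterRoleBinding, or with a RoleBinding to restrict it to one namespace.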
Integrating with Managed Inference Platforms
While Kubernetes can orchestrate agent workers natively, most production systems still use external inference providers for model serving.
GMI Cloud Integration Pattern
apiVersion: v1
kind: Secret
metadata:
  name: gmi-cloud-credentials
type: Opaque
data:
  api-key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-workers
spec:
  selector:  # required for a Deployment; must match the template labels
    matchLabels:
      app: agent-workers  # label value chosen for this example
  template:
    metadata:
      labels:
        app: agent-workers
    spec:
      containers:
        - name: worker
          image: agent-worker:latest
          env:
            - name: GMI_CLOUD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: gmi-cloud-credentials
                  key: api-key
            - name: INFERENCE_MODELS
              value: "deepseek-v4-pro,gpt-5.4-mini"
This pattern lets agent workers use GMI Cloud's serverless inference while Kubernetes handles worker lifecycle, scaling, and resource management.
Cost Optimization Through Native Scheduling
Kubernetes-native orchestration enables cost optimization through precise resource scheduling:
A worked example: A document processing agent system running 50 analysis jobs per hour. Using Kubernetes Jobs with completion-based scaling:
- Job-based workers: Start only when documents are uploaded, complete and terminate when done
- Resource requests: Each worker requests 200m CPU, 512Mi memory
- GPU sharing: 4 workers share 1 H100 instance ($2.00/hr) through time-slicing
- Inference calls: Pay per request through GMI Cloud serverless endpoints
Total cost: ~$150/month for infrastructure + ~$300/month for inference = $450/month
Compare to always-on approach: $1,440/month for continuous H100 + worker infrastructure.
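The figures above can be checked with simple arithmetic. The sketch below assumes a 30-day month of 720 hours, which the example does not state, and takes the two monthly totals at face value.

```latex
\begin{align*}
C_{\text{always-on}} &= \$2.00/\text{hr} \times 720\ \text{hr} = \$1{,}440 \text{ per month} \\
C_{\text{jobs}} &\approx \$150 + \$300 = \$450 \text{ per month} \\
\frac{C_{\text{jobs}}}{C_{\text{always-on}}} &\approx \frac{450}{1440} \approx 0.31
\end{align*}
```

On these numbers the job-based setup costs roughly 31% of the always-on figure, a reduction of about 69%. At $2.00 per hour, a $150 infrastructure budget corresponds to at most 75 GPU-hours per month.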
Configuration Management and Secrets
Agent workflows require careful management of configuration and sensitive data:
ConfigMap for Application Settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-workflow-config
data:
  max-retries: "3"
  timeout-seconds: "300"
  batch-size: "10"
  models.json: |
    {
      "fast": "gpt-5.4-mini",
      "cost-effective": "deepseek-v4-pro",
      "analysis": "gemini-3.5-flash"
    }
Secret Management for API Keys
apiVersion: v1
kind: Secret
metadata:
  name: inference-credentials
type: Opaque
data:
  gmi-api-key: <base64-key>
  openai-api-key: <base64-key>
stringData:
  database-url: "postgresql://user:pass@db:5432/agents"
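One way a worker pod might consume these two objects is sketched below: scalar settings and keys become environment variables, and `models.json` is mounted as a file. The environment variable names and the mount path `/etc/agent` are hypothetical. The ConfigMap, Secret, and key names come from the manifests above.

```yaml
# Sketch: pod-spec fragment wiring the ConfigMap and Secret into a worker.
# Env var names and the mount path are assumptions.
spec:
  containers:
    - name: worker
      image: agent-worker:latest
      env:
        - name: MAX_RETRIES
          valueFrom:
            configMapKeyRef:
              name: agent-workflow-config
              key: max-retries
        - name: GMI_API_KEY
          valueFrom:
            secretKeyRef:
              name: inference-credentials
              key: gmi-api-key
      volumeMounts:
        - name: model-map
          mountPath: /etc/agent  # models.json appears at /etc/agent/models.json
          readOnly: true
  volumes:
    - name: model-map
      configMap:
        name: agent-workflow-config
        items:
          - key: models.json
            path: models.json
```

Mounted ConfigMap files are refreshed when the ConfigMap changes, while environment variables are fixed at container start, so settings that change often are better read from the file.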
Monitoring and Observability
Kubernetes provides built-in monitoring for pod-level metrics, but agent workflows require application-specific observability:
Pod Metrics
- Pod restart count: Indicates worker stability issues
- Resource utilization: CPU/memory usage patterns
- Ready/Running status: Worker availability
Application Metrics
Expose custom metrics from agent workers:
# Service exposing the worker metrics port for Prometheus to scrape
apiVersion: v1
kind: Service
metadata:
  name: agent-worker-metrics
  labels:
    monitoring: enabled
spec:
  selector:
    app: agent-worker
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
Workers expose metrics like tasks completed, inference latency, and error rates on the metrics endpoint.
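If the cluster runs the Prometheus Operator, a ServiceMonitor can tell Prometheus to scrape that Service. The sketch below assumes the operator is installed, which the article does not say, and that workers serve metrics at `/metrics`. The resource name, the path, and the 30-second interval are illustrative.

```yaml
# Sketch: ServiceMonitor (a Prometheus Operator custom resource) that
# selects the metrics Service by its label. Name, path, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-worker-metrics
spec:
  selector:
    matchLabels:
      monitoring: enabled  # label on the agent-worker-metrics Service
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s
```

The Prometheus instance must itself be configured to select this ServiceMonitor, usually through a label on the ServiceMonitor that matches the instance's selector.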
Best Practices for Kubernetes-Native Agents
Design for pod lifecycle: Workers should handle graceful shutdown, save progress, and resume cleanly after restart
Use resource requests and limits: Prevent resource contention between agent workers and other cluster workloads
Implement health checks: Liveness and readiness probes ensure Kubernetes can detect and recover from worker failures
Separate configuration from code: Use ConfigMaps and Secrets for environment-specific settings
Monitor resource quotas: Track GPU and CPU usage to prevent agent workloads from overwhelming the cluster
Best for task-based workflows: Kubernetes orchestration works well for discrete tasks, less well for long-running stateful processes
Not ideal for complex branching: Advanced workflow patterns may require external orchestrators
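The lifecycle and health-check practices above can be sketched as a pod-spec fragment. The probe paths `/readyz` and `/healthz`, the timings, and the 120-second grace period are assumptions. Port 8080 matches the worker Service shown earlier.

```yaml
# Sketch: graceful shutdown window plus readiness and liveness probes.
# Probe paths and all timing values are assumed for illustration.
spec:
  terminationGracePeriodSeconds: 120  # time between SIGTERM and SIGKILL (the default is 30)
  containers:
    - name: worker
      image: agent-worker:latest
      ports:
        - containerPort: 8080
      readinessProbe:   # failing pods are removed from Service endpoints
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 10
      livenessProbe:    # failing containers are restarted
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3
```

The grace period should be longer than the slowest task a worker may need to finish or checkpoint after it receives SIGTERM.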
When Kubernetes Orchestration Is the Right Choice
Kubernetes-native agent orchestration reduces operational complexity when your infrastructure and team are already Kubernetes-centric. It works best for task-based agent workflows that can be broken into discrete jobs, need tight integration with GPU scheduling, or benefit from the same monitoring and deployment tools used for the rest of your infrastructure. The approach scales from simple job processing to complex multi-stage agent pipelines while leveraging the platform you're already running for inference and other services.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. For Kubernetes-orchestrated agent workers, GMI Cloud provides inference endpoints that integrate cleanly with your existing cluster networking and service discovery. You can explore integration patterns and pricing at console.gmicloud.ai and docs.gmicloud.ai.
Colin Mo
