KEDA Event-Driven Autoscaling for AI Agent Workers on Kubernetes
April 13, 2026
Agent workloads arrive in bursts. A document upload triggers 50 analysis tasks at once, then nothing for an hour, then another spike when the next batch arrives. Traditional Kubernetes autoscaling reacts to CPU and memory metrics that miss the real constraint: how many tasks are waiting in the queue. KEDA (Kubernetes Event-Driven Autoscaling) scales workloads based on external metrics like queue length, Redis streams, or custom application metrics, making it ideal for AI agent workers that need to scale from zero to hundreds of instances based on actual work demand.
Why CPU-Based Autoscaling Fails for Agent Workloads
The Kubernetes Horizontal Pod Autoscaler (HPA) is most often configured to scale on resource utilization: for example, when average CPU exceeds a 70% target, add more pods. This approach misses the actual bottleneck in agent systems.
Consider an agent worker making LLM inference calls:
Resource Utilization During Inference
- CPU usage: 5-15% while waiting for LLM API responses
- Memory usage: Stable, independent of queue depth
- Network I/O: Sporadic bursts during API calls
- Actual constraint: Number of pending tasks in the queue
HPA would see low CPU utilization and scale down, even when 1,000 tasks are waiting for processing. Conversely, HPA might maintain a high pod count during idle periods simply because workers are keeping persistent connections alive.
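To make the mismatch concrete, here is a minimal sketch of the replica calculation the Kubernetes documentation gives for the HPA, `desired = ceil(current * currentMetric / targetMetric)`, applied to the figures above. The function name and the sample numbers (10 running workers, 10% CPU, 1,000 queued tasks, a target of 10 tasks per worker) are illustrative assumptions, and the real controller also applies a tolerance band and stabilization windows that are ignored here.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Core HPA ratio: ceil(current_replicas * current / target)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 10 workers idling at 10% CPU against a 70% target: the HPA wants to shrink,
# regardless of how many tasks are sitting in the queue.
cpu_based = hpa_desired_replicas(10, current_metric=10, target_metric=70)

# Same formula driven by queue depth instead: 1,000 waiting tasks spread over
# 10 workers is 100 per worker, against a target of 10 per worker.
queue_based = hpa_desired_replicas(10, current_metric=100, target_metric=10)

print(cpu_based, queue_based)
```

The same formula gives opposite answers depending on which signal feeds it, which is the argument for scaling on queue depth.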
The Queue Length Signal
The metric that actually matters is how much work is waiting to be processed:
- Queue depth > 100: Scale up aggressively to reduce processing delay
- Queue depth < 10: Scale down to reduce costs
- Queue depth = 0: Scale to zero to eliminate all infrastructure costs
KEDA monitors these external metrics and scales based on the real demand signal.
How KEDA Works for AI Agent Scaling
KEDA operates as a custom controller that extends Kubernetes autoscaling with external metric sources. Instead of watching CPU percentages, it connects to message queues, databases, or HTTP endpoints to get the actual workload metrics.
Core KEDA Components
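Scaler: Connects to an external system (Redis, RabbitMQ, AWS SQS) and fetches its metrics
Operator: Watches ScaledObject resources, creates the underlying HPA, and activates or deactivates the workload when scaling from or to zero
Metrics Server: Exposes the scaler values to the Kubernetes HPA through the external metrics API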
Event-Driven Scaling Flow
- Tasks arrive in queue (Redis Stream, SQS, etc.)
- KEDA scaler queries queue length every 30 seconds
- If queue length > threshold, KEDA triggers pod creation
- Workers start, pull tasks, and begin processing
- As queue empties, KEDA scales down to minimum (often zero)
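The flow above can be sketched as a small discrete simulation, one step per polling interval. Everything here is a simplifying assumption rather than KEDA's actual control loop: the function name, the fixed per-worker throughput of 5 tasks per poll, and the absence of cooldown and stabilization delays that a real cluster applies before scaling down.

```python
import math


def simulate(arrivals, target_per_worker=10, max_replicas=50,
             tasks_per_worker_per_poll=5):
    """Step through polling intervals; return (queue_depth, replicas) per poll."""
    queue, replicas, history = 0, 0, []
    for arriving in arrivals:
        queue += arriving
        # Workers started on an earlier poll drain the queue first.
        queue -= min(queue, replicas * tasks_per_worker_per_poll)
        # The scaler then sizes the deployment from the remaining depth.
        replicas = min(max_replicas, math.ceil(queue / target_per_worker))
        history.append((queue, replicas))
    return history


# One burst of 50 tasks, then silence for the following polls.
history = simulate([50, 0, 0, 0, 0, 0])
for poll, (depth, workers) in enumerate(history):
    print(f"poll {poll}: queue={depth} workers={workers}")
```

In this toy run the deployment jumps from zero to five workers on the first poll, shrinks as the backlog drains, and returns to zero once the queue is empty.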
KEDA Configuration for AI Agent Workers
Here's a practical KEDA configuration for agent workers processing tasks from a Redis Stream:
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: agent-worker-scaler
    spec:
      scaleTargetRef:
        name: agent-worker-deployment
      minReplicaCount: 0   # Scale to zero during idle
      maxReplicaCount: 50  # Limit max workers
      triggers:
        - type: redis-streams
          metadata:
            address: "redis:6379"
            stream: "agent-tasks"
            consumerGroup: "workers"
            pendingEntriesCount: "10"  # Target 10 pending tasks per worker
KEDA Scaling Triggers and Configuration Options
Different queue systems and metrics require different KEDA trigger configurations:
| Trigger Type | Metric Source | Scaling Threshold | Response Time | Best For |
|---|---|---|---|---|
| Redis Streams | Pending entries | 5-20 per worker | 15-30 seconds | Agent task queues |
| AWS SQS | Visible messages | 10-30 per worker | 30-60 seconds | Event-driven workflows |
| RabbitMQ | Queue length | 3-15 per worker | 10-20 seconds | High-frequency tasks |
| Metrics API (HTTP) | Custom endpoint | Application-defined | 30-90 seconds | Complex business logic |
| Prometheus | Time-series metrics | Query-based thresholds | 15-45 seconds | Multi-metric scaling |
Key Configuration Decisions
minReplicaCount: 0 enables true scale-to-zero, eliminating costs during idle periods
maxReplicaCount: 50 prevents runaway scaling that could overwhelm inference endpoints
pendingEntriesCount: 10 sets the per-worker target, so KEDA aims for roughly 1 worker per 10 pending tasks, within the min and max replica bounds
This configuration scales from 0 to 50 workers based purely on how many agent tasks are waiting for processing.
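As a rough illustration of how these three settings interact, the sketch below estimates the steady-state worker count for a given backlog. It assumes the trigger behaves as a simple per-worker average target, which is an approximation: the function name is hypothetical, and actual scaling also depends on polling, cooldown, and HPA behavior settings.

```python
import math


def desired_workers(pending: int, per_worker: int = 10,
                    min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Approximate steady-state replica count for a per-worker target."""
    wanted = math.ceil(pending / per_worker)
    # Clamp to the configured minReplicaCount / maxReplicaCount bounds.
    return max(min_replicas, min(max_replicas, wanted))


for pending in (0, 7, 95, 1000):
    print(pending, "->", desired_workers(pending))
```

With the values from the ScaledObject, an empty queue yields zero workers, 95 pending tasks yield 10, and anything beyond 500 pending tasks is capped at 50.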
Integrating KEDA Workers with GPU Inference
Agent workers need access to GPU inference, either through dedicated instances or managed endpoints. KEDA scaling affects how you provision and access inference resources.
Serverless Inference Pattern
Workers make HTTP calls to managed inference endpoints that scale independently:
    # Agent worker with GMI Cloud integration
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: agent-worker-deployment
    spec:
      selector:
        matchLabels:
          app: agent-worker  # label name is illustrative; apps/v1 requires a selector
      template:
        metadata:
          labels:
            app: agent-worker
        spec:
          containers:
            - name: agent-worker
              image: agent-worker:latest
              env:
                - name: INFERENCE_ENDPOINT
                  value: "https://api.gmicloud.ai/v1"
                - name: MODEL_NAME
                  value: "deepseek-v4-pro"
              resources:
                requests:
                  cpu: 100m
                  memory: 256Mi
Advantages: Workers and inference scale independently. GMI Cloud's serverless inference automatically handles worker demand spikes.
Dedicated GPU Pool Pattern
Workers connect to persistent GPU instances that stay running:
    # Worker deployment targeting dedicated GPU nodes (fragment)
    spec:
      template:
        spec:
          containers:
            - name: agent-worker
              env:
                - name: INFERENCE_ENDPOINT
                  value: "http://h200-inference-service:8000"
                - name: GPU_POOL_SIZE
                  value: "4"  # Number of H200 instances
Considerations: GPU instances ($2.60/hr for H200) run continuously while workers scale based on demand. Cost-efficient when you have sustained base load but need worker elasticity for spikes.
Queue Metrics and Scaling Thresholds
Different queue systems provide different metrics that KEDA can use for scaling decisions:
Redis Streams
- Pending entries: Messages delivered to a consumer in the group but not yet acknowledged; entries that no consumer has read yet are not counted
- Stream length: Total messages in stream
- Consumer lag: How far behind each consumer group is
Recommended threshold: 5-20 pending entries per worker, depending on task processing time
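When tuning the threshold it helps to look at the same numbers the scaler sees. The sketch below reads stream length and the pending-entry count through a client object that is assumed to behave like redis-py's `redis.Redis` (`xlen` returning an int, `xpending` returning a summary dict with a `"pending"` key). The function name and the `FakeRedis` stand-in are hypothetical and exist only so the example runs without a Redis server.

```python
from typing import Any, Dict


def stream_metrics(client: Any, stream: str, group: str) -> Dict[str, int]:
    """Collect stream length and unacknowledged-entry count for one group.

    `client` is expected to behave like redis.Redis (redis-py): `xlen(name)`
    returns an int and `xpending(name, groupname)` returns a summary dict
    with a "pending" count.
    """
    summary = client.xpending(stream, group)
    return {
        "stream_length": client.xlen(stream),
        "pending_entries": summary["pending"],
    }


class FakeRedis:
    """Stand-in used only to exercise the function without a Redis server."""

    def xlen(self, name):
        return 120

    def xpending(self, name, groupname):
        return {"pending": 35, "min": None, "max": None, "consumers": []}


metrics = stream_metrics(FakeRedis(), "agent-tasks", "workers")
print(metrics)
```

In a real deployment you would pass something like `redis.Redis(host="redis")` instead of the stand-in and log the result alongside the current replica count.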
AWS SQS
- Approximate messages visible: Tasks waiting to be processed
- Messages in flight: Tasks currently being processed
- Age of oldest message: How long tasks have been waiting
Recommended threshold: 10-30 visible messages per worker
Custom HTTP Metrics
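KEDA's metrics-api scaler can poll a custom HTTP endpoint that returns application-specific metrics as JSON (the external scaler type, by contrast, expects a gRPC service rather than an HTTP URL):
    triggers:
      - type: metrics-api
        metadata:
          url: "http://metrics-service:8080/agent-queue-depth"
          valueLocation: "queueDepth"  # JSON field to read; field name is an assumption
          targetValue: "15"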
This approach lets you implement complex logic, like scaling based on task priority or predicted processing time.
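On the service side, such an endpoint can be very small. The following is a minimal sketch using only the Python standard library, assuming the scaler reads a single numeric JSON field. The field name `queueDepth`, the priority weights, and the in-memory counts are all hypothetical; a real service would compute the value from Redis, a database, or its scheduler.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory view of the queue, for illustration only.
PENDING_BY_PRIORITY = {"high": 4, "normal": 20, "low": 30}
PRIORITY_WEIGHT = {"high": 3.0, "normal": 1.0, "low": 0.5}


def weighted_queue_depth(pending: dict, weights: dict) -> float:
    """Weight each priority class so urgent tasks drive scaling harder."""
    return sum(count * weights.get(priority, 1.0)
               for priority, count in pending.items())


class QueueDepthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/agent-queue-depth":
            self.send_error(404)
            return
        depth = weighted_queue_depth(PENDING_BY_PRIORITY, PRIORITY_WEIGHT)
        body = json.dumps({"queueDepth": depth}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking entry point; call this from the service's main module."""
    HTTPServer(("", port), QueueDepthHandler).serve_forever()
```

Calling `serve()` starts the endpoint on port 8080. Weighting by priority means a handful of urgent tasks can trigger the same scale-up as a much larger backlog of low-priority work.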
Cost Optimization Through Event-Driven Scaling
KEDA's scale-to-zero capability eliminates infrastructure costs during idle periods. For agent workloads with variable demand, this creates significant cost savings.
Cost Comparison Example
Consider an agent system processing customer support tickets:
Traditional always-on approach:
- 10 workers running 24/7 at $0.05/hour each = $360/month
- Inference from a dedicated H200 instance = $1,872/month (24 × 30 × $2.60)
- Total: $2,232/month
KEDA + serverless inference approach:
- Workers scale from 0 to 50 based on queue depth
- Average utilization: 6 hours/day of active scaling = $36/month in worker costs (720 worker-hours at $0.05/hour, or about 4 workers on average during active hours)
- Inference via GMI Cloud serverless: $450/month (pay-per-request)
- Total: $486/month
The event-driven approach saves ~78% by eliminating idle resource consumption.
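The arithmetic behind this comparison can be checked in a few lines. This sketch only reuses the figures quoted in the example; the variable names are illustrative, and the $36 and $450 monthly figures are taken as given rather than derived.

```python
HOURS_PER_MONTH = 24 * 30

# Always-on baseline (figures from the example above)
worker_cost = 10 * 0.05 * HOURS_PER_MONTH   # 10 workers at $0.05/hour
gpu_cost = 2.60 * HOURS_PER_MONTH           # one H200 at $2.60/hour
always_on = worker_cost + gpu_cost

# Event-driven setup (figures from the example above)
event_driven = 36 + 450                     # workers + serverless inference

savings = 1 - event_driven / always_on
print(f"always-on:    ${always_on:,.0f}/month")
print(f"event-driven: ${event_driven:,.0f}/month")
print(f"savings:      {savings:.1%}")
```

Most of the difference comes from the dedicated GPU line, so the result is sensitive to how much inference traffic the serverless option actually has to absorb.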
Real-World Utilization Patterns
Agent workloads typically show:
- Business hours peaks: 8am-6pm with 5-10x higher task volume
- Overnight lulls: Scale to zero for 6-8 hours nightly
- Weekly patterns: Higher demand Monday-Friday, minimal weekend activity
- Seasonal spikes: End-of-month reports, holiday customer service peaks
KEDA automatically adapts to these patterns without manual intervention.
Worker Design for Stateless Scaling
For KEDA scaling to work effectively, agent workers must be designed for rapid startup and stateless operation:
Rapid Startup Requirements
- Container image optimization: Use multi-stage builds and minimal base images
- Dependency preloading: Include Python packages and libraries in the container
- Configuration externalization: Use environment variables and config maps instead of files
Stateless Operation
Workers should not maintain local state that would be lost during scale-down:
- Task checkpointing: Save progress to external storage (Redis, database)
- Connection pooling: Use connection pools that can be quickly recreated
- Graceful shutdown: Handle SIGTERM to complete current tasks before termination
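The graceful-shutdown point is the one most often missed, so here is a minimal sketch of one way to handle it with asyncio. The class and method names are hypothetical, and `handle` is a placeholder for real inference and checkpointing. The idea is that SIGTERM only sets a flag, and the loop checks it between tasks so the task in progress always finishes.

```python
import asyncio
import signal


class GracefulWorker:
    def __init__(self):
        self.stopping = False
        self.completed = []

    def request_stop(self) -> None:
        """Signal handler: let the current task finish, then leave the loop."""
        self.stopping = True

    async def run(self, tasks) -> None:
        for task in tasks:
            if self.stopping:
                break  # stop pulling new work; unread tasks stay queued
            await self.handle(task)

    async def handle(self, task) -> None:
        await asyncio.sleep(0)  # placeholder for inference and checkpointing
        self.completed.append(task)


async def main(tasks) -> list:
    worker = GracefulWorker()
    loop = asyncio.get_running_loop()
    try:
        # Kubernetes sends SIGTERM before it kills the pod.
        loop.add_signal_handler(signal.SIGTERM, worker.request_stop)
    except (NotImplementedError, RuntimeError, ValueError):
        pass  # signal handlers are unavailable on some platforms and threads
    await worker.run(tasks)
    return worker.completed
```

For this to help, the pod's termination grace period needs to be longer than the slowest task; otherwise the process is killed before the in-flight task completes.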
Example Worker Implementation
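    import asyncio
    import os
    import socket

    import redis.asyncio as redis
    from gmi_cloud_client import GMIClient  # client module as named in this article


    class AgentWorker:
        def __init__(self):
            # decode_responses=True returns str keys and values instead of bytes
            self.redis_client = redis.Redis(host='redis', decode_responses=True)
            self.inference_client = GMIClient(api_key=os.getenv('GMI_API_KEY'))
            # Each pod needs its own consumer name within the group
            self.consumer = os.getenv('HOSTNAME', socket.gethostname())

        async def process_tasks(self):
            while True:
                # Block for up to 1s waiting for one new entry for this group
                streams = await self.redis_client.xreadgroup(
                    'workers', self.consumer, {'agent-tasks': '>'},
                    count=1, block=1000)
                if streams:
                    # Reply shape: [(stream, [(entry_id, fields), ...])]
                    entry_id, fields = streams[0][1][0]
                    await self.handle_task(entry_id, fields)

        async def handle_task(self, entry_id, fields):
            # Process task with LLM inference
            response = await self.inference_client.inference(
                model="deepseek-v4-pro",
                prompt=fields['prompt']
            )
            # Save results (omitted here), then acknowledge the task
            await self.redis_client.xack('agent-tasks', 'workers', entry_id)


    if __name__ == '__main__':
        asyncio.run(AgentWorker().process_tasks())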
GMI Cloud Integration with KEDA Workers
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. For KEDA-scaled agent workers, GMI Cloud provides inference endpoints that scale independently of your worker autoscaling.
Key advantages for event-driven agent architectures:
- Automatic inference scaling: GMI Cloud's serverless inference scales to match your KEDA worker demands without manual provisioning
- Cost alignment: Pay only for inference requests actually made by active workers
- Consistent availability: 99.99% platform availability SLA ensures inference endpoints remain responsive during scaling events
- Model flexibility: Switch between cost-efficient models (DeepSeek-V4-Pro at $1.39/M) and faster options (GPT-5.4-mini at $0.40/$2.50/M) based on agent requirements
You can configure inference endpoints and explore model options at console.gmicloud.ai and gmicloud.ai/en/pricing.
Monitoring and Alerting for Event-Driven Agents
KEDA scaling creates new monitoring requirements since traditional pod-level metrics become less relevant:
Key Metrics to Track
- Queue depth over time: Trends in task accumulation
- Scale events frequency: How often KEDA triggers scaling up/down
- Processing latency: Time from task creation to completion
- Worker startup time: How quickly new workers become ready
- Inference endpoint latency: Performance of external LLM calls
Alerting Thresholds
- Queue depth > 1000: Potential capacity or processing issues
- Worker startup > 60 seconds: Container or infrastructure problems
- Scale-up frequency > 10/hour: Possible threshold tuning needed
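These thresholds translate directly into a rule table. The sketch below shows one way to evaluate them against a metrics snapshot. The metric key names and the `evaluate_alerts` function are hypothetical, and in practice the same rules would more likely live in your monitoring system's alert configuration.

```python
ALERT_RULES = {
    "queue_depth": (1000, "potential capacity or processing issues"),
    "worker_startup_seconds": (60, "container or infrastructure problems"),
    "scale_ups_per_hour": (10, "possible threshold tuning needed"),
}


def evaluate_alerts(metrics: dict) -> list:
    """Return a message for every metric that exceeds its threshold."""
    alerts = []
    for name, (limit, meaning) in ALERT_RULES.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds {limit}: {meaning}")
    return alerts


sample = {"queue_depth": 1450, "worker_startup_seconds": 42,
          "scale_ups_per_hour": 12}
for line in evaluate_alerts(sample):
    print(line)
```

With the sample snapshot, the queue depth and scale-up frequency rules fire while worker startup time stays within bounds.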
Best Practices for KEDA Agent Scaling
Start conservative with thresholds: Begin with higher pending task counts per worker (20-30) and tune down based on observed performance
Monitor inference endpoint capacity: Ensure your LLM provider can handle the maximum worker concurrency you configure
Implement graceful degradation: Design agents to handle inference timeouts and queue backpressure
Test scale-to-zero behavior: Verify that workers properly save state and resume cleanly after scaling down
Not ideal for stateful agents: KEDA works best with task-based workflows, not agents that maintain conversation state
Not suitable for real-time requirements: Scale-up latency (30-90 seconds) makes this inappropriate for sub-second response requirements
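For the graceful-degradation practice above, a common building block is a per-attempt timeout with exponential backoff around the inference call. This is a minimal sketch under assumed defaults (30-second timeout, 3 attempts, 1-second base delay); `call_with_retry` and `make_call` are hypothetical names, and `make_call` stands for any zero-argument function that returns a fresh coroutine for the inference request.

```python
import asyncio


async def call_with_retry(make_call, timeout_s: float = 30.0,
                          attempts: int = 3, base_delay_s: float = 1.0):
    """Run `make_call()` with a per-attempt timeout and exponential backoff.

    Re-raises the final TimeoutError so the caller can leave the task
    unacknowledged and have it retried later.
    """
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == attempts - 1:
                raise
            # Back off 1x, 2x, 4x ... the base delay between attempts.
            await asyncio.sleep(base_delay_s * 2 ** attempt)
```

Backing off between attempts matters most at high replica counts, where many workers retrying immediately would add load to an inference endpoint that is already slow.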
Event-Driven Scaling as Infrastructure Strategy
KEDA transforms agent workloads from resource optimization problems into event response systems. Instead of guessing capacity requirements, you scale precisely based on actual demand signals. This approach reduces infrastructure costs, improves responsiveness during demand spikes, and simplifies capacity planning. The key is designing agent workers that start quickly, operate statelessly, and integrate cleanly with external inference services that can match their scaling behavior.
Colin Mo
