KEDA Event-Driven Autoscaling for AI Agent Workers on Kubernetes

April 13, 2026

Agent workloads arrive in bursts. A document upload triggers 50 analysis tasks at once, then nothing for an hour, then another spike when the next batch arrives. Traditional Kubernetes autoscaling reacts to CPU and memory metrics that miss the real constraint: how many tasks are waiting in the queue. KEDA (Kubernetes Event-Driven Autoscaling) scales workloads based on external metrics like queue length, Redis streams, or custom application metrics, making it ideal for AI agent workers that need to scale from zero to hundreds of instances based on actual work demand.

Why CPU-Based Autoscaling Fails for Agent Workloads

Kubernetes Horizontal Pod Autoscaler (HPA) scales based on resource utilization: when CPU exceeds 70%, add more pods. This approach misses the actual bottleneck in agent systems.

Consider an agent worker making LLM inference calls:

Resource Utilization During Inference

  • CPU usage: 5-15% while waiting for LLM API responses
  • Memory usage: Stable, independent of queue depth
  • Network I/O: Sporadic bursts during API calls
  • Actual constraint: Number of pending tasks in the queue

HPA would see low CPU utilization and scale down, even when 1,000 tasks are waiting for processing. Conversely, HPA might maintain high pod count during idle periods simply because workers are keeping persistent connections alive.
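To see why, it helps to plug the numbers above into the scaling rule documented for the HPA, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below is a simplified illustration: the function name and the starting count of 5 workers are assumptions, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct, min_replicas=1):
    """Simplified HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # The HPA never goes below its configured minimum
    return max(min_replicas, desired)


# 5 workers sitting at 10% CPU while they wait on LLM responses, 70% target
print(hpa_desired_replicas(5, 10, 70))
```

The result does not depend on queue depth at all, so a backlog of 1,000 tasks and an empty queue lead to the same decision.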

The Queue Length Signal

The metric that actually matters is how much work is waiting to be processed:

  • Queue depth > 100: Scale up aggressively to reduce processing delay
  • Queue depth < 10: Scale down to reduce costs
  • Queue depth = 0: Scale to zero to eliminate all infrastructure costs

KEDA monitors these external metrics and scales based on the real demand signal.

How KEDA Works for AI Agent Scaling

KEDA operates as a custom controller that extends Kubernetes autoscaling with external metric sources. Instead of watching CPU percentages, it connects to message queues, databases, or HTTP endpoints to get the actual workload metrics.

Core KEDA Components

  • Scaler: Connects to external systems (Redis, RabbitMQ, AWS SQS) to fetch metrics
  • Operator: Translates external metrics into Kubernetes scaling decisions
  • Metrics Server: Provides custom metrics to Kubernetes HPA

Event-Driven Scaling Flow

  1. Tasks arrive in queue (Redis Stream, SQS, etc.)
  2. KEDA scaler queries queue length every 30 seconds (the default pollingInterval)
  3. If queue length > threshold, KEDA triggers pod creation
  4. Workers start, pull tasks, and begin processing
  5. As queue empties, KEDA scales down to minimum (often zero); the final step to zero waits for the cooldownPeriod (300 seconds by default)
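The flow above can be approximated with a small simulation. This is a toy model rather than KEDA's actual control loop: each loop iteration stands for one poll, the drain rate of 5 tasks per worker per poll is a made-up figure, and pod startup time and the cooldown before scaling to zero are ignored.

```python
import math


def simulate_polls(arrivals, target_per_worker=10, max_replicas=50, drained_per_worker=5):
    """Return (queue_length, replicas) as seen at each poll.

    arrivals[i] is the number of tasks that arrive just before poll i.
    """
    queue, replicas, history = 0, 0, []
    for arrived in arrivals:
        # Workers started at the previous poll drain part of the backlog
        queue = max(0, queue - replicas * drained_per_worker) + arrived
        # One worker per target_per_worker pending tasks, capped at max_replicas
        replicas = min(max_replicas, math.ceil(queue / target_per_worker))
        history.append((queue, replicas))
    return history


# A burst of 50 tasks followed by silence
for queue, replicas in simulate_polls([50, 0, 0, 0, 0]):
    print(f"queue={queue:3d} replicas={replicas}")
```

With these assumed numbers the worker count rises to 5 on the burst and steps back down to 0 as the queue drains.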

KEDA Configuration for AI Agent Workers

Here's a practical KEDA configuration for agent workers processing tasks from a Redis Stream:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
spec:
  scaleTargetRef:
    name: agent-worker-deployment
  minReplicaCount: 0    # Scale to zero during idle
  maxReplicaCount: 50   # Limit max workers
  triggers:
  - type: redis-streams
    metadata:
      address: "redis:6379"
      stream: "agent-tasks"
      consumerGroup: "workers"
      pendingEntriesCount: "10"  # Target 10 pending tasks per worker

KEDA Scaling Triggers and Configuration Options

Different queue systems and metrics require different KEDA trigger configurations:

| Trigger Type | Metric Source | Scaling Threshold | Response Time | Best For |
|---|---|---|---|---|
| Redis Streams | Pending entries | 5-20 per worker | 15-30 seconds | Agent task queues |
| AWS SQS | Visible messages | 10-30 per worker | 30-60 seconds | Event-driven workflows |
| RabbitMQ | Queue length | 3-15 per worker | 10-20 seconds | High-frequency tasks |
| HTTP External | Custom endpoint | Application-defined | 30-90 seconds | Complex business logic |
| Prometheus | Time-series metrics | Query-based thresholds | 15-45 seconds | Multi-metric scaling |

Key Configuration Decisions

  • minReplicaCount: 0 enables true scale-to-zero, eliminating costs during idle periods
  • maxReplicaCount: 50 prevents runaway scaling that could overwhelm inference endpoints
  • pendingEntriesCount: 10 means KEDA maintains 1 worker per 10 pending tasks

This configuration scales from 0 to 50 workers based purely on how many agent tasks are waiting for processing.
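As a rough illustration of what these three settings mean together, the sketch below maps a pending-task count to a worker count. It assumes the usual average-value rule (pending tasks divided by the per-worker target, rounded up) and clamps the result to the configured bounds; the function name is hypothetical, and the real controller adds polling delays and stabilization on top of this.

```python
import math


def desired_workers(pending, target_per_worker=10, min_replicas=0, max_replicas=50):
    """Workers needed so that each one has at most target_per_worker pending tasks."""
    wanted = math.ceil(pending / target_per_worker)
    # Clamp to minReplicaCount / maxReplicaCount
    return max(min_replicas, min(max_replicas, wanted))


for pending in (0, 1, 10, 11, 95, 1000):
    print(f"{pending:5d} pending -> {desired_workers(pending)} workers")
```

Note that a backlog of 1,000 tasks still yields 50 workers: the cap protects the inference endpoint, at the price of a longer drain time.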

Integrating KEDA Workers with GPU Inference

Agent workers need access to GPU inference, either through dedicated instances or managed endpoints. KEDA scaling affects how you provision and access inference resources.

Serverless Inference Pattern

Workers make HTTP calls to managed inference endpoints that scale independently:

# Agent worker with GMI Cloud integration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker-deployment
spec:
  selector:
    matchLabels:
      app: agent-worker  # label value is illustrative; must match the template labels
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
      - name: agent-worker
        image: agent-worker:latest
        env:
        - name: INFERENCE_ENDPOINT
          value: "https://api.gmicloud.ai/v1"
        - name: MODEL_NAME
          value: "deepseek-v4-pro"
        resources:
          requests:
            cpu: 100m
            memory: 256Mi

Advantages: Workers and inference scale independently. GMI Cloud's serverless inference automatically handles worker demand spikes.

Dedicated GPU Pool Pattern

Workers connect to persistent GPU instances that stay running:

## Worker deployment targeting dedicated GPU nodes
spec:
  template:
    spec:
      containers:
      - name: agent-worker
        env:
        - name: INFERENCE_ENDPOINT
          value: "http://h200-inference-service:8000"
        - name: GPU_POOL_SIZE
          value: "4"  # Number of H200 instances

Considerations: GPU instances ($2.60/hr for H200) run continuously while workers scale based on demand. This pattern is cost-efficient when you have a sustained base load but need worker elasticity for spikes.

Queue Metrics and Scaling Thresholds

Different queue systems provide different metrics that KEDA can use for scaling decisions:

Redis Streams

  • Pending entries: Messages in consumer group that haven't been processed
  • Stream length: Total messages in stream
  • Consumer lag: How far behind each consumer group is

Recommended threshold: 5-20 pending entries per worker, depending on task processing time
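One way to pick a value inside that range is to work backwards from how long a task may wait. The heuristic below is an assumption for illustration, not a KEDA rule: a worker with N pending tasks needs roughly N × task duration to clear them, so N should not exceed the acceptable wait divided by the task duration. The function name and the 60-second wait are hypothetical.

```python
import math


def pending_per_worker(max_wait_seconds, task_seconds):
    """Largest per-worker backlog that one worker can clear within max_wait_seconds."""
    return max(1, math.floor(max_wait_seconds / task_seconds))


# Tasks may wait up to 60 seconds in this example
for task_seconds in (3, 6, 12):
    print(f"{task_seconds:2d}s tasks -> {pending_per_worker(60, task_seconds)} pending per worker")
```

Slower tasks push the threshold toward the low end of the range, faster tasks toward the high end.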

AWS SQS

  • Approximate messages visible: Tasks waiting to be processed
  • Messages in flight: Tasks currently being processed
  • Age of oldest message: How long tasks have been waiting

Recommended threshold: 10-30 visible messages per worker

Custom HTTP Metrics

KEDA can query custom endpoints that return application-specific metrics:

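# The metrics-api trigger polls an HTTP endpoint that returns JSON;
# the external trigger type expects the address of a gRPC scaler instead.
triggers:
- type: metrics-api
  metadata:
    url: "http://metrics-service:8080/agent-queue-depth"
    valueLocation: "queue_depth"  # JSON field holding the metric; name is illustrative
    targetValue: "15"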

This approach lets you implement complex logic, like scaling based on task priority or predicted processing time.
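A metrics endpoint of this kind can be quite small. The sketch below uses only the Python standard library and assumes a JSON response with a single queue_depth field; the priority weights, the field name, and the placeholder read_queue_depths function are all hypothetical and would be replaced by real queue lookups.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical weights: urgent tasks count for more than routine ones
PRIORITY_WEIGHTS = {"high": 3, "normal": 1, "low": 0.5}


def weighted_queue_depth(depth_by_priority, weights=PRIORITY_WEIGHTS):
    """Collapse per-priority queue depths into one number to scale on."""
    return sum(weights.get(priority, 1) * depth
               for priority, depth in depth_by_priority.items())


def read_queue_depths():
    """Placeholder data; a real service would query Redis or a database here."""
    return {"high": 4, "normal": 20, "low": 10}


class QueueDepthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = {"queue_depth": weighted_queue_depth(read_queue_depths())}
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocks forever; call this from the service's entry point."""
    HTTPServer(("", port), QueueDepthHandler).serve_forever()
```

Because high-priority tasks are weighted three times as heavily, a handful of urgent tasks triggers scale-up sooner than the same number of routine ones.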

Cost Optimization Through Event-Driven Scaling

KEDA's scale-to-zero capability eliminates infrastructure costs during idle periods. For agent workloads with variable demand, this creates significant cost savings.

Cost Comparison Example

Consider an agent system processing customer support tickets:

Traditional always-on approach:

  • 10 workers running 24/7 at $0.05/hour each = $360/month
  • Inference on one dedicated H200 instance = $1,872/month (24 × 30 × $2.60)
  • Total: $2,232/month

KEDA + serverless inference approach:

  • Workers scale from 0 to 50 based on queue depth
  • Active about 6 hours/day: $36/month in worker costs (720 worker-hours at $0.05/hour, which works out to an average of 4 workers while active)
  • Inference via GMI Cloud serverless: $450/month (pay-per-request)
  • Total: $486/month

The event-driven approach saves ~78% by eliminating idle resource consumption.
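The arithmetic behind these figures is easy to re-run with your own rates. The sketch below takes the $36 worker cost and $450 serverless cost as given from the example and derives the rest; the variable names are illustrative, and the GPU figure assumes a single H200 instance.

```python
HOURS_PER_MONTH = 24 * 30
WORKER_RATE = 0.05   # $/hour per worker
H200_RATE = 2.60     # $/hour per dedicated H200 instance

# Always-on: 10 workers plus one dedicated GPU instance
always_on_workers = 10 * WORKER_RATE * HOURS_PER_MONTH
always_on_gpu = H200_RATE * HOURS_PER_MONTH
always_on_total = always_on_workers + always_on_gpu

# Event-driven: figures taken from the example
keda_workers = 36.0
keda_inference = 450.0
keda_total = keda_workers + keda_inference

savings = 1 - keda_total / always_on_total
# How many worker-hours the $36 figure buys at the same hourly rate
implied_worker_hours = keda_workers / WORKER_RATE

print(f"always-on: ${always_on_total:,.0f}/month")
print(f"event-driven: ${keda_total:,.0f}/month")
print(f"savings: {savings:.0%}")
print(f"implied worker-hours: {implied_worker_hours:.0f}")
```

Most of the saving in this example comes from replacing the always-on GPU instance with pay-per-request inference, not from the worker pods themselves.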

Real-World Utilization Patterns

Agent workloads typically show:

  • Business hours peaks: 8am-6pm with 5-10x higher task volume
  • Overnight lulls: Scale to zero for 6-8 hours nightly
  • Weekly patterns: Higher demand Monday-Friday, minimal weekend activity
  • Seasonal spikes: End-of-month reports, holiday customer service peaks

KEDA automatically adapts to these patterns without manual intervention.

Worker Design for Stateless Scaling

For KEDA scaling to work effectively, agent workers must be designed for rapid startup and stateless operation:

Rapid Startup Requirements

  • Container image optimization: Use multi-stage builds and minimal base images
  • Dependency preloading: Include Python packages and libraries in the container
  • Configuration externalization: Use environment variables and config maps instead of files

Stateless Operation

Workers should not maintain local state that would be lost during scale-down:

  • Task checkpointing: Save progress to external storage (Redis, database)
  • Connection pooling: Use connection pools that can be quickly recreated
  • Graceful shutdown: Handle SIGTERM to complete current tasks before termination
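Graceful shutdown is the item that most often goes wrong, so here is a minimal sketch of one way to handle it. Kubernetes sends SIGTERM to a pod before removing it and follows up with SIGKILL once the termination grace period expires. The class and function names below are hypothetical, and fetch_task and handle_task stand in for the real queue read and task logic.

```python
import signal


class ShutdownFlag:
    """Set when the process receives SIGTERM."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.requested = True


def run_until_shutdown(fetch_task, handle_task, flag):
    """Process tasks one at a time until shutdown is requested.

    The flag is only checked between tasks, so a task that has started
    always runs to completion before the loop exits.
    """
    completed = 0
    while not flag.requested:
        task = fetch_task()  # expected to block briefly and return None when idle
        if task is None:
            continue
        handle_task(task)
        completed += 1
    return completed
```

For this to work, the pod's terminationGracePeriodSeconds has to be longer than the slowest task, otherwise the worker is killed before it can finish and acknowledge its current task.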

Example Worker Implementation

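import asyncio
import os
import socket

import redis.asyncio as redis
from gmi_cloud_client import GMIClient  # client package name as used in this article

STREAM = "agent-tasks"
GROUP = "workers"

class AgentWorker:
    def __init__(self):
        # decode_responses=True yields str keys and values instead of bytes
        self.redis_client = redis.Redis(host="redis", decode_responses=True)
        self.inference_client = GMIClient(api_key=os.getenv("GMI_API_KEY"))
        # Each pod needs its own consumer name; the pod hostname is unique
        self.consumer = socket.gethostname()

    async def process_tasks(self):
        while True:
            # Reply shape: [[stream, [(entry_id, fields), ...]]]; empty on timeout
            batches = await self.redis_client.xreadgroup(
                GROUP, self.consumer, {STREAM: ">"}, count=1, block=1000
            )
            for _stream, entries in batches:
                for entry_id, fields in entries:
                    await self.handle_task(entry_id, fields)

    async def handle_task(self, entry_id, fields):
        # Process task with LLM inference
        response = await self.inference_client.inference(
            model="deepseek-v4-pro",
            prompt=fields["prompt"],
        )
        # Save results (not shown), then acknowledge the task
        await self.redis_client.xack(STREAM, GROUP, entry_id)

if __name__ == "__main__":
    asyncio.run(AgentWorker().process_tasks())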
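The worker reads through the "workers" consumer group, which has to exist before the first read. A minimal sketch of the producer side follows. The helper names are hypothetical, and the client argument is assumed to be a redis-py client (for example redis.Redis(host="redis")); it is passed in rather than created here so the helpers stay easy to test.

```python
STREAM = "agent-tasks"
GROUP = "workers"


def ensure_consumer_group(client, stream=STREAM, group=GROUP):
    """Create the consumer group (and the stream) if missing. Returns True if created."""
    try:
        # id="0" lets the group see entries that are already in the stream
        client.xgroup_create(stream, group, id="0", mkstream=True)
        return True
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists
        if "BUSYGROUP" in str(exc):
            return False
        raise


def enqueue_task(client, prompt, stream=STREAM):
    """Append one task to the stream; returns the entry ID assigned by Redis."""
    return client.xadd(stream, {"prompt": prompt})
```

Calling ensure_consumer_group at producer startup keeps worker pods free of setup logic, which helps when many of them start at the same moment during a scale-up.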

GMI Cloud Integration with KEDA Workers

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. For KEDA-scaled agent workers, GMI Cloud provides inference endpoints that scale independently of your worker autoscaling.

Key advantages for event-driven agent architectures:

  • Automatic inference scaling: GMI Cloud's serverless inference scales to match your KEDA worker demands without manual provisioning
  • Cost alignment: Pay only for inference requests actually made by active workers
  • Consistent availability: 99.99% platform availability SLA ensures inference endpoints remain responsive during scaling events
  • Model flexibility: Switch between cost-efficient models (DeepSeek-V4-Pro at $1.39/M) and faster options (GPT-5.4-mini at $0.40/$2.50/M) based on agent requirements

You can configure inference endpoints and explore model options at console.gmicloud.ai and gmicloud.ai/en/pricing.

Monitoring and Alerting for Event-Driven Agents

KEDA scaling creates new monitoring requirements since traditional pod-level metrics become less relevant:

Key Metrics to Track

  • Queue depth over time: Trends in task accumulation
  • Scale events frequency: How often KEDA triggers scaling up/down
  • Processing latency: Time from task creation to completion
  • Worker startup time: How quickly new workers become ready
  • Inference endpoint latency: Performance of external LLM calls

Alerting Thresholds

  • Queue depth > 1000: Potential capacity or processing issues
  • Worker startup > 60 seconds: Container or infrastructure problems
  • Scale-up frequency > 10/hour: Possible threshold tuning needed
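These thresholds translate directly into a check that can run against whatever metrics store you use. The sketch below is illustrative only: the function name and message wording are assumptions, and the three inputs are expected to come from your own monitoring.

```python
def check_alerts(queue_depth, worker_startup_seconds, scale_ups_last_hour):
    """Return a list of alert messages for the thresholds listed above."""
    alerts = []
    if queue_depth > 1000:
        alerts.append("queue depth above 1000: potential capacity or processing issue")
    if worker_startup_seconds > 60:
        alerts.append("worker startup above 60 seconds: container or infrastructure problem")
    if scale_ups_last_hour > 10:
        alerts.append("more than 10 scale-ups per hour: thresholds may need tuning")
    return alerts


for message in check_alerts(queue_depth=1500, worker_startup_seconds=20, scale_ups_last_hour=12):
    print(message)
```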

Best Practices for KEDA Agent Scaling

Start conservative with thresholds: Begin with higher pending task counts per worker (20-30) and tune down based on observed performance

Monitor inference endpoint capacity: Ensure your LLM provider can handle the maximum worker concurrency you configure

Implement graceful degradation: Design agents to handle inference timeouts and queue backpressure

Test scale-to-zero behavior: Verify that workers properly save state and resume cleanly after scaling down

Not ideal for stateful agents: KEDA works best with task-based workflows, not agents that maintain conversation state

Not suitable for real-time requirements: Scale-up latency (30-90 seconds) makes this inappropriate for sub-second response requirements
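For the graceful degradation practice above, a common approach to inference timeouts is to retry with exponential backoff and give up after a fixed number of attempts, leaving the task unacknowledged so it can be picked up again. The sketch below is one minimal way to do that; the function name, the attempt count, and the delays are assumptions to adapt to your provider's limits.

```python
import time


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on TimeoutError with delays of base_delay * 2**attempt.

    Re-raises the last TimeoutError once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... between attempts with the default base_delay
            sleep(base_delay * 2 ** attempt)
```

When 50 workers start at once after a burst, backoff also spreads their retries out instead of hammering an endpoint that is already struggling.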

Event-Driven Scaling as Infrastructure Strategy

KEDA transforms agent workloads from resource optimization problems into event response systems. Instead of guessing capacity requirements, you scale precisely based on actual demand signals. This approach reduces infrastructure costs, improves responsiveness during demand spikes, and simplifies capacity planning. The key is designing agent workers that start quickly, operate statelessly, and integrate cleanly with external inference services that can match their scaling behavior.

Colin Mo
