Learning AI

Agentic AI in Production: From Prototype to Reliable Service

Ship agentic AI that doesn't break at 2 AM. Covers orchestration patterns (ReAct, planning loops), error handling, circuit breakers, graceful degradation, observability, and scaling — with TypeScript implementations you can steal.

Lucas Dalamarta, Engineering Lead
March 10, 2026
24 min read
Watercolor illustration of an engineer monitoring a production AI agent dashboard with reliability metrics

Last October, a team I work with shipped a customer onboarding agent. Demo day went perfectly — the agent collected information, verified documents, and created accounts in under two minutes. Three weeks into production, it was silently failing on 23% of requests. Tool timeouts cascaded into infinite retry loops. The LLM occasionally generated malformed JSON that crashed the parser. On one memorable Friday night, a third-party API outage caused the agent to apologize to customers in an infinite loop, burning through $400 of tokens before anyone noticed.

The agent wasn't broken — it had never been production-ready.

Gartner predicts that over 40% of agentic AI projects will be canceled by end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. LangChain's 2025 State of Agent Engineering survey found that while 57% of organizations now have agents in production, quality remains the top barrier — 32% called it the single biggest challenge. The gap between "works in demo" and "runs reliably at scale" is where most projects die.

This article bridges that gap. We'll build production-grade patterns for orchestration, error handling, graceful degradation, observability, and scaling — all in TypeScript, all battle-tested. Not theory. Working code you can adapt to your own agent infrastructure.

What you'll build, and why it matters:

- ReAct orchestration with bounded execution: prevents agents from looping forever on ambiguous tasks
- Error classification and retry strategies: stops transient failures from becoming permanent outages
- Circuit breaker for tool calls: protects downstream services and your token budget
- Graceful degradation pipeline: keeps agents useful even when components fail
- OpenTelemetry-based tracing: makes agent behavior debuggable in production
- Queue-based horizontal scaling: handles traffic spikes without dropping requests

Prerequisites

You should be comfortable with TypeScript and async/await patterns. Familiarity with LLM APIs (OpenAI, Anthropic) helps but isn't strictly required — the patterns apply regardless of provider.

If you're new to agent tool infrastructure, start with AI Agent Tools: MCP, OpenAPI, and Tool Management. For retrieval-augmented generation patterns referenced in the degradation section, see RAG from Scratch.

bash
npm install openai zod bullmq ioredis @opentelemetry/api @opentelemetry/sdk-node

All code examples are self-contained TypeScript. They reference simplified versions of production patterns — adapt the types and error handling to your own stack.

Agent Orchestration Patterns That Survive Production

The orchestration pattern you choose determines how your agent reasons, recovers from errors, and consumes resources. ReAct and plan-and-execute are the two dominant patterns, and most production systems use elements of both. Picking the wrong one — or implementing either without execution bounds — is how agents end up in infinite loops burning tokens at 3 AM.

ReAct: Think, Act, Observe

ReAct (Reasoning + Acting) interleaves reasoning steps with tool calls. The agent thinks about what to do, does it, observes the result, and decides the next step. It's the default pattern in LangChain, LangGraph, and most agent frameworks because it maps naturally to how LLMs generate responses.

Here's a production ReAct loop with the guardrails a prototype would skip — step limits, timeouts, and structured error handling:

typescript
interface AgentState {
  messages: Message[];
  steps: AgentStep[];
  startTime: number;
  totalTokens: number;
}
 
interface AgentStep {
  thought: string;
  action?: { tool: string; input: Record<string, unknown> };
  observation?: string;
  error?: string;
  durationMs: number;
}
 
interface AgentConfig {
  maxSteps: number;        // Hard cap on reasoning cycles
  maxTokens: number;       // Budget ceiling per run
  timeoutMs: number;       // Wall-clock deadline
  tools: ToolDefinition[];
}
 
async function runReActAgent(
  prompt: string,
  config: AgentConfig
): Promise<AgentResult> {
  const state: AgentState = {
    messages: [{ role: 'user', content: prompt }],
    steps: [],
    startTime: Date.now(),
    totalTokens: 0,
  };
 
  for (let step = 0; step < config.maxSteps; step++) {
    // Guard: wall-clock timeout
    const elapsed = Date.now() - state.startTime;
    if (elapsed > config.timeoutMs) {
      return finalize(state, 'timeout',
        `Agent timed out after ${elapsed}ms (limit: ${config.timeoutMs}ms)`);
    }
 
    // Guard: token budget
    if (state.totalTokens > config.maxTokens) {
      return finalize(state, 'budget_exceeded',
        `Token budget exceeded: ${state.totalTokens}/${config.maxTokens}`);
    }
 
    const stepStart = Date.now();
 
    // Reason: ask the LLM what to do next
    const response = await callLLM(state.messages, config.tools);
    state.totalTokens += response.usage.totalTokens;
 
    // Check if the agent wants to respond (no tool call)
    if (!response.toolCall) {
      state.steps.push({
        thought: response.content,
        durationMs: Date.now() - stepStart,
      });
      return finalize(state, 'complete', response.content);
    }
 
    // Act: execute the tool
    const { tool, input } = response.toolCall;
    let observation: string;
    let error: string | undefined;
 
    try {
      observation = await executeTool(tool, input, config);
    } catch (err) {
      error = err instanceof Error ? err.message : 'Unknown tool error';
      observation = `Tool "${tool}" failed: ${error}`;
    }
 
    // Record the step
    state.steps.push({
      thought: response.content,
      action: { tool, input },
      observation,
      error,
      durationMs: Date.now() - stepStart,
    });
 
    // Observe: feed the result back to the LLM
    state.messages.push(
      { role: 'assistant', content: response.content, toolCall: response.toolCall },
      { role: 'tool', content: observation, toolCallId: response.toolCall.id }
    );
  }
 
  // Exhausted all steps without completing
  return finalize(state, 'max_steps_exceeded',
    `Agent reached ${config.maxSteps} steps without completing`);
}

Three guardrails make this production-grade, and prototypes skip all of them: the step counter prevents infinite loops, the token budget prevents runaway costs, and the wall-clock timeout prevents the agent from hanging when an LLM call takes 45 seconds instead of the usual 3. Without all three, you'll eventually hit each failure mode.
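The loop delegates to a finalize helper that isn't shown above. A minimal sketch, assuming an AgentResult shape along these lines (the exact fields are an assumption):

```typescript
// Hypothetical finalize: snapshots the run's outcome from agent state.
// The parameter is typed structurally so it accepts the AgentState above.
function finalize(
  state: { steps: unknown[]; totalTokens: number; startTime: number },
  status: 'complete' | 'timeout' | 'budget_exceeded' | 'max_steps_exceeded',
  output: string
) {
  return {
    status,
    output,                                // Final answer or failure message
    steps: state.steps,                    // Full trace for debugging
    totalTokens: state.totalTokens,
    durationMs: Date.now() - state.startTime,
  };
}
```

Returning the full step trace, not just the answer, is what makes the failure statuses actionable later in the observability section.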

Plan-and-Execute: Think First, Then Do

For complex multi-step tasks — onboarding workflows, data pipeline orchestration, multi-document analysis — ReAct's step-by-step approach can wander. Plan-and-execute generates a structured plan upfront, validates it, then executes each step methodically.

The plan phase gives you a critical checkpoint that ReAct doesn't: the ability to inspect and approve the agent's approach before it starts calling tools and spending money.

typescript
interface ExecutionPlan {
  goal: string;
  steps: PlanStep[];
  estimatedCost: number;
  estimatedDurationMs: number;
}
 
interface PlanStep {
  id: string;
  description: string;
  tool: string;
  dependencies: string[];  // IDs of steps that must complete first
  fallback?: string;       // Alternative tool if primary fails
}
 
async function planAndExecute(
  prompt: string,
  config: AgentConfig
): Promise<AgentResult> {
  // Phase 1: Generate plan
  const plan = await generatePlan(prompt, config.tools);
 
  // Phase 1b: Validate — reject plans that exceed budget or use unknown tools
  const validation = validatePlan(plan, config);
  if (!validation.valid) {
    return { status: 'plan_rejected', reason: validation.errors };
  }
 
  // Phase 2: Execute with dependency ordering
  const completed = new Map<string, StepResult>();
  const sortedSteps = topologicalSort(plan.steps);
 
  for (const step of sortedSteps) {
    // Wait for dependencies
    const depResults = step.dependencies.map(id => completed.get(id));
    if (depResults.some(r => r?.status === 'failed' && !step.fallback)) {
      completed.set(step.id, {
        status: 'skipped',
        reason: 'Dependency failed'
      });
      continue;
    }
 
    // Execute step with retry
    const result = await executeStepWithRetry(step, depResults, config);
    completed.set(step.id, result);
 
    // If step failed and has a fallback, try the alternative
    if (result.status === 'failed' && step.fallback) {
      const fallbackResult = await executeStepWithRetry(
        { ...step, tool: step.fallback }, depResults, config
      );
      completed.set(step.id, fallbackResult);
    }
  }
 
  return synthesizeResults(plan, completed);
}

Notice the dependencies array on each step — this is what makes plan-and-execute powerful for workflows. Steps without dependencies can execute in parallel (we'll cover this in the scaling section), while dependent steps wait for their inputs. The fallback field gives each step a Plan B without re-planning the entire workflow.
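The topologicalSort call in the execute phase isn't shown. A minimal sketch using Kahn's algorithm, assuming step ids are unique and every dependency references an id within the plan:

```typescript
// Minimal structural type: anything with an id and a dependency list sorts.
interface Sortable {
  id: string;
  dependencies: string[];
}

// Kahn's algorithm: repeatedly emit steps whose dependencies are satisfied.
function topologicalSort<T extends Sortable>(steps: T[]): T[] {
  const byId = new Map(steps.map(s => [s.id, s]));
  const inDegree = new Map(steps.map(s => [s.id, s.dependencies.length]));
  const queue = steps.filter(s => s.dependencies.length === 0).map(s => s.id);
  const ordered: T[] = [];

  while (queue.length > 0) {
    const id = queue.shift()!;
    ordered.push(byId.get(id)!);
    // Unblock steps that were waiting on this one
    for (const step of steps) {
      if (step.dependencies.includes(id)) {
        const remaining = inDegree.get(step.id)! - 1;
        inDegree.set(step.id, remaining);
        if (remaining === 0) queue.push(step.id);
      }
    }
  }

  // Anything left unordered is part of a cycle (or references a missing id)
  if (ordered.length !== steps.length) {
    throw new Error('Plan contains a dependency cycle');
  }
  return ordered;
}
```

Throwing on cycles matters here: a cyclic plan is exactly the kind of malformed LLM output that validatePlan should reject before execution starts.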

Which Pattern When?

Don't overthink this. Here's the decision framework we use:

For a new agent task, ask: how many tools are involved?

- 1-3 tools: ReAct with bounded execution.
- 4+ tools and the task is conversational: ReAct.
- 4+ tools and the task is a workflow whose steps have no dependencies: plan-and-execute with parallel steps.
- 4+ tools and the workflow's steps have dependencies: plan-and-execute with DAG ordering.

Google's research on scaling agent systems, published in early 2026, backs this up quantitatively. Their evaluation of 180 agent configurations found that multi-step coordination improves performance on parallelizable tasks but degrades it on sequential ones. The practical takeaway: don't use a complex orchestration pattern when a simple ReAct loop does the job.

Error Handling That Doesn't Lie to You

Most agent error handling falls into two categories: catch-all try/catch blocks that swallow errors silently, or no error handling at all. Both are dishonest — they hide failures from operators and users. Production agents need error classification, because the correct response to "the Stripe API returned a 429" is fundamentally different from "the customer's order ID doesn't exist."

Classifying Errors: Transient vs. Permanent

The first decision when an error occurs is whether to retry. Retrying a permanent error wastes time and tokens. Not retrying a transient error turns a momentary hiccup into a user-facing failure.

This classification system handles the errors we've actually seen in production agent systems, not just the textbook cases:

typescript
enum ErrorCategory {
  TRANSIENT = 'transient',     // Retry with backoff
  PERMANENT = 'permanent',     // Fail fast, don't retry
  RATE_LIMIT = 'rate_limit',   // Retry with longer backoff
  CONTEXT = 'context',         // LLM misunderstood — rephrase
  BUDGET = 'budget',           // Resource limit hit — escalate
}
 
function classifyError(error: unknown): ErrorCategory {
  if (error instanceof Error) {
    const msg = error.message.toLowerCase();
    const status = (error as any).status || (error as any).statusCode;
 
    // HTTP status-based classification
    if (status === 429) return ErrorCategory.RATE_LIMIT;
    if (status === 408 || status === 502 || status === 503 || status === 504) {
      return ErrorCategory.TRANSIENT;
    }
    if (status === 400 || status === 404 || status === 422) {
      return ErrorCategory.PERMANENT;
    }
    if (status === 402) return ErrorCategory.BUDGET;
 
    // Message-based classification for LLM-specific errors
    if (msg.includes('timeout') || msg.includes('econnreset')) {
      return ErrorCategory.TRANSIENT;
    }
    if (msg.includes('context length') || msg.includes('token limit')) {
      return ErrorCategory.CONTEXT;
    }
    if (msg.includes('invalid') || msg.includes('not found')) {
      return ErrorCategory.PERMANENT;
    }
  }
 
  return ErrorCategory.TRANSIENT; // Default to retryable
}

Exponential Backoff with Jitter

Once you've classified an error as retryable, you need a backoff strategy that doesn't hammer a recovering service. AWS's architecture blog popularized the "full jitter" approach years ago, and it remains the best default for distributed systems. The jitter prevents the thundering herd problem — without it, all your retrying agents will slam the service at the exact same intervals.

The implementation below handles both standard transient errors and rate limits (which need longer cooldowns):

typescript
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;     // 0-1, how much randomness
}
 
const RETRY_CONFIGS: Record<ErrorCategory, RetryConfig | null> = {
  [ErrorCategory.TRANSIENT]: {
    maxRetries: 3,
    baseDelayMs: 500,
    maxDelayMs: 10_000,
    jitterFactor: 0.5,
  },
  [ErrorCategory.RATE_LIMIT]: {
    maxRetries: 5,
    baseDelayMs: 2_000,
    maxDelayMs: 60_000,
    jitterFactor: 1.0,       // Full jitter to spread load
  },
  [ErrorCategory.PERMANENT]: null,  // Never retry
  [ErrorCategory.CONTEXT]: null,    // Rephrase, don't retry same call
  [ErrorCategory.BUDGET]: null,     // Escalate, don't retry
};
 
function calculateDelay(attempt: number, config: RetryConfig): number {
  // Exponential: 500, 1000, 2000, 4000...
  const exponential = config.baseDelayMs * Math.pow(2, attempt);
  const capped = Math.min(exponential, config.maxDelayMs);
 
  // Full jitter: uniform random between 0 and capped value
  const jitter = capped * config.jitterFactor * Math.random();
  const base = capped * (1 - config.jitterFactor);
 
  return base + jitter;
}
 
async function executeWithRetry<T>(
  fn: () => Promise<T>,
  label: string
): Promise<T> {
  let lastError: Error | undefined;
 
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));
      const category = classifyError(error);
      const retryConfig = RETRY_CONFIGS[category];
 
      // Non-retryable or exhausted retries
      if (!retryConfig || attempt >= retryConfig.maxRetries) {
        throw new AgentError(
          `${label} failed after ${attempt + 1} attempts: ${lastError.message}`,
          { category, attempts: attempt + 1, originalError: lastError }
        );
      }
 
      const delay = calculateDelay(attempt, retryConfig);
      console.warn(
        `[retry] ${label} attempt ${attempt + 1}/${retryConfig.maxRetries}` +
        ` (${category}), waiting ${Math.round(delay)}ms`
      );
      await sleep(delay);
    }
  }
}
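The retry helper references an AgentError class and a sleep utility that aren't defined above. One possible shape; the details payload mirrors how the catch block constructs it, and category is kept as a plain string here rather than the ErrorCategory enum:

```typescript
// Error wrapper that preserves classification and retry context.
// The details shape is an assumption based on the executeWithRetry call site.
interface AgentErrorDetails {
  category: string;       // e.g. 'transient', 'permanent'
  attempts: number;       // How many tries were made before giving up
  originalError: Error;   // The last underlying failure
}

class AgentError extends Error {
  constructor(
    message: string,
    public readonly details: AgentErrorDetails
  ) {
    super(message);
    this.name = 'AgentError';
  }
}

// Promise-based delay used between retry attempts
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```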

The Circuit Breaker: When Retries Aren't Enough

Retries handle individual failures. But what happens when a downstream service is genuinely down — not a blip, but a sustained outage? Your agents will exhaust their retry budgets on every single request, adding latency and burning tokens on error-handling conversations. A circuit breaker detects this pattern and short-circuits the calls entirely.

The state machine has three states: closed (normal operation), open (service is down, fail immediately), and half-open (tentatively testing if the service has recovered):

typescript
enum CircuitState {
  CLOSED = 'closed',       // Normal: requests flow through
  OPEN = 'open',           // Tripped: fail immediately
  HALF_OPEN = 'half_open', // Testing: allow one probe request
}
 
class CircuitBreaker {
  private state = CircuitState.CLOSED;
  private failureCount = 0;
  private lastFailureTime = 0;
  private successCount = 0;
 
  constructor(
    private readonly name: string,
    private readonly config: {
      failureThreshold: number;  // Failures before opening
      resetTimeoutMs: number;    // How long to stay open
      halfOpenSuccesses: number; // Successes needed to close
    }
  ) {}
 
  async execute<T>(fn: () => Promise<T>, fallback?: () => Promise<T>): Promise<T> {
    // Check if circuit should transition from open to half-open
    if (this.state === CircuitState.OPEN) {
      const elapsed = Date.now() - this.lastFailureTime;
      if (elapsed >= this.config.resetTimeoutMs) {
        this.state = CircuitState.HALF_OPEN;
        this.successCount = 0;
      } else if (fallback) {
        return fallback();
      } else {
        throw new CircuitOpenError(
          `Circuit "${this.name}" is open. Retry after ` +
          `${this.config.resetTimeoutMs - elapsed}ms`
        );
      }
    }
 
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
 
      if (fallback && this.state === CircuitState.OPEN) {
        return fallback();
      }
      throw error;
    }
  }
 
  private onSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.halfOpenSuccesses) {
        this.state = CircuitState.CLOSED;
        this.failureCount = 0;
      }
    } else {
      this.failureCount = 0;
    }
  }
 
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
 
    if (this.failureCount >= this.config.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }
 
  getState(): { state: CircuitState; failures: number } {
    return { state: this.state, failures: this.failureCount };
  }
}

In production, you'd have one circuit breaker per external dependency — one for OpenAI, one for your CRM API, one for the knowledge base. When the CRM circuit opens, your agent can still answer questions using the knowledge base and LLM; it just can't look up customer records until the CRM recovers.
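One breaker per dependency is easiest to manage through a small lazy registry, which is also the shape of the getCircuitBreaker helper used in the model fallback chain later. A generic sketch; the threshold values in the commented wiring are illustrative, not recommendations:

```typescript
// Generic lazy registry: creates one instance per name, then reuses it.
function createRegistry<T>(factory: (name: string) => T): (name: string) => T {
  const instances = new Map<string, T>();
  return (name) => {
    let instance = instances.get(name);
    if (instance === undefined) {
      instance = factory(name);
      instances.set(name, instance);
    }
    return instance;
  };
}

// Wiring it to the CircuitBreaker class above (illustrative thresholds):
// const getCircuitBreaker = createRegistry(
//   (name) => new CircuitBreaker(name, {
//     failureThreshold: 5,
//     resetTimeoutMs: 30_000,
//     halfOpenSuccesses: 2,
//   })
// );
```

Because the registry returns the same instance for the same name, failure counts accumulate across requests, which is what lets the breaker actually trip.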

The transitions: Closed moves to Open when failures reach the threshold; Open moves to Half-Open once the reset timeout elapses; Half-Open closes after N successes and reopens on any failure; in Closed, each success resets the failure counter.

Graceful Degradation: Useful When Broken

What does your agent do when GPT-4o is down? When the knowledge base is unreachable? When the customer's CRM integration returns errors? If the answer is "crash" or "show a generic error message," you're leaving reliability on the table.

Graceful degradation means designing your agent to provide reduced-but-still-useful functionality when components fail. It's the difference between "Sorry, I'm having trouble right now" and "I can't look up your order status right now, but I can help you with general questions about our return policy."

The Degradation Pipeline

Think of degradation as a series of fallback layers. Each layer is simpler and less capable, but more reliable. The agent drops to the highest-functioning layer available:

typescript
interface DegradationLayer {
  name: string;
  isAvailable: () => Promise<boolean>;
  execute: (input: AgentInput) => Promise<AgentOutput>;
}
 
class DegradationPipeline {
  private layers: DegradationLayer[];
 
  constructor(layers: DegradationLayer[]) {
    // Ordered from most capable to most resilient
    this.layers = layers;
  }
 
  async execute(input: AgentInput): Promise<AgentOutput & { layer: string }> {
    for (const layer of this.layers) {
      try {
        const available = await Promise.race([
          layer.isAvailable(),
          sleep(2000).then(() => false), // Health check timeout
        ]);
 
        if (!available) continue;
 
        const result = await layer.execute(input);
        return { ...result, layer: layer.name };
      } catch {
        continue; // Fall through to next layer
      }
    }
 
    // All layers exhausted — return the hardcoded safety net
    return {
      response: "I'm currently experiencing technical difficulties. " +
                "Please try again in a few minutes or contact support.",
      layer: 'safety_net',
      confidence: 0,
    };
  }
}
 
// Example: customer support agent with four degradation layers
const supportAgent = new DegradationPipeline([
  {
    name: 'full_agent',
    isAvailable: async () => {
      const llm = await checkLLMHealth();
      const kb = await checkKnowledgeBaseHealth();
      const crm = await checkCRMHealth();
      return llm && kb && crm;
    },
    execute: async (input) => {
      // Full capability: LLM + knowledge base + CRM tools
      return runReActAgent(input.query, fullConfig);
    },
  },
  {
    name: 'no_crm',
    isAvailable: async () => {
      const llm = await checkLLMHealth();
      const kb = await checkKnowledgeBaseHealth();
      return llm && kb;
    },
    execute: async (input) => {
      // LLM + knowledge base only — no customer-specific data
      const result = await runReActAgent(input.query, kbOnlyConfig);
      result.response += "\n\nNote: I couldn't access your account details. " +
        "For order-specific questions, please contact support.";
      return result;
    },
  },
  {
    name: 'cached_responses',
    isAvailable: async () => true, // Cache is always "available"
    execute: async (input) => {
      // Semantic search over cached responses for common queries
      const cached = await searchResponseCache(input.query);
      if (cached && cached.similarity > 0.85) {
        return { response: cached.response, confidence: cached.similarity };
      }
      throw new Error('No suitable cached response');
    },
  },
  {
    name: 'static_fallback',
    isAvailable: async () => true,
    execute: async (input) => {
      // Rule-based response for known intents
      const intent = classifyIntentLocally(input.query);
      const staticResponse = STATIC_RESPONSES[intent] || STATIC_RESPONSES.default;
      return { response: staticResponse, confidence: 0.3 };
    },
  },
]);
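The static_fallback layer leans on classifyIntentLocally and STATIC_RESPONSES, which aren't defined above. A hypothetical keyword-based sketch; the intents, keywords, and response text are all illustrative:

```typescript
// Canned responses for known intents (illustrative content).
const STATIC_RESPONSES: Record<string, string> = {
  returns: 'You can return most items within 30 days of delivery.',
  shipping: 'Standard shipping typically takes 3-5 business days.',
  default: 'Our support team can help with this. Please contact support.',
};

// Keyword lists per intent: deliberately simple so this layer
// works with no LLM, no network, and no external dependencies.
const INTENT_KEYWORDS: Record<string, string[]> = {
  returns: ['return', 'refund', 'exchange'],
  shipping: ['shipping', 'delivery', 'track'],
};

// First intent whose keywords appear in the query wins; otherwise 'default'.
function classifyIntentLocally(query: string): string {
  const lower = query.toLowerCase();
  for (const [intent, keywords] of Object.entries(INTENT_KEYWORDS)) {
    if (keywords.some((k) => lower.includes(k))) return intent;
  }
  return 'default';
}
```

The point of this layer isn't accuracy, it's availability: a crude answer with confidence 0.3 beats a crash when everything upstream is down.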

Model Fallback Chains

LLM provider outages happen more often than you'd expect. Claude has a bad hour, GPT-4o has a capacity crunch, Gemini returns degraded responses during a rollout. A model fallback chain lets your agent switch providers transparently.

This pattern treats models as a priority-ordered list — try the best one first, fall back progressively:

typescript
interface ModelConfig {
  provider: 'openai' | 'anthropic' | 'google';
  model: string;
  maxTokens: number;
  costPer1kTokens: number;
}
 
const MODEL_CHAIN: ModelConfig[] = [
  { provider: 'anthropic', model: 'claude-sonnet-4-20250514', maxTokens: 8192, costPer1kTokens: 0.003 },
  { provider: 'openai', model: 'gpt-4o', maxTokens: 4096, costPer1kTokens: 0.005 },
  { provider: 'google', model: 'gemini-2.0-flash', maxTokens: 8192, costPer1kTokens: 0.001 },
];
 
async function callLLMWithFallback(
  messages: Message[],
  tools: ToolDefinition[]
): Promise<LLMResponse> {
  const errors: Array<{ model: string; error: string }> = [];
 
  for (const config of MODEL_CHAIN) {
    const breaker = getCircuitBreaker(`llm:${config.provider}`);
 
    try {
      return await breaker.execute(
        () => callProvider(config, messages, tools),
      );
    } catch (error) {
      const msg = error instanceof Error ? error.message : 'Unknown';
      errors.push({ model: `${config.provider}/${config.model}`, error: msg });
    }
  }
 
  throw new AllModelsFailedError(
    `All ${MODEL_CHAIN.length} models failed`,
    errors
  );
}

Each provider has its own circuit breaker. If Anthropic has been failing consistently, the circuit opens and subsequent requests skip straight to OpenAI — no wasted latency on a known-down service. When Anthropic recovers, the half-open state lets a single probe request through, and if it succeeds, traffic resumes.

Monitoring and Observability: Seeing What Your Agent Actually Does

Have you ever debugged a production issue where the agent "did something weird" but nobody could tell you exactly what happened? Without observability, agent debugging is guesswork. With it, you can reconstruct every thought, tool call, and decision point that led to a bad outcome.

The LangChain 2025 survey found that 89% of organizations with production agents have implemented some form of observability, and 71.5% have full step-level tracing. This isn't optional infrastructure — it's the difference between "the agent sometimes gives wrong answers" and "the agent hallucinated on step 3 because the knowledge base returned zero results for this query type."

Structured Tracing with OpenTelemetry

OpenTelemetry's GenAI semantic conventions now define standard attributes for LLM calls, token usage, tool invocations, and agent steps. This means you can instrument once and export to LangSmith, Datadog, Langfuse, or any OTel-compatible backend.

The trace structure mirrors the agent's execution — a parent span for the full run, child spans for each reasoning step, and nested spans for LLM calls and tool executions:

typescript
import { trace, SpanKind, SpanStatusCode, context } from '@opentelemetry/api';
 
const tracer = trace.getTracer('agent-service', '1.0.0');
 
async function tracedAgentRun(
  prompt: string,
  config: AgentConfig
): Promise<AgentResult> {
  return tracer.startActiveSpan('agent.run', {
    kind: SpanKind.SERVER,
    attributes: {
      'gen_ai.system': 'custom',
      'gen_ai.request.model': config.model,
      'agent.max_steps': config.maxSteps,
      'agent.timeout_ms': config.timeoutMs,
    },
  }, async (runSpan) => {
    try {
      const result = await runReActAgent(prompt, config);
 
      runSpan.setAttributes({
        'agent.status': result.status,
        'agent.steps_taken': result.steps.length,
        'agent.total_tokens': result.totalTokens,
        'agent.duration_ms': result.durationMs,
      });
      runSpan.setStatus({ code: SpanStatusCode.OK });
 
      return result;
    } catch (error) {
      runSpan.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown',
      });
      runSpan.recordException(error as Error);
      throw error;
    } finally {
      runSpan.end();
    }
  });
}
 
async function tracedToolCall(
  toolName: string,
  input: Record<string, unknown>
): Promise<string> {
  return tracer.startActiveSpan(`tool.${toolName}`, {
    kind: SpanKind.CLIENT,
    attributes: {
      'tool.name': toolName,
      'tool.input_keys': Object.keys(input).join(','),
    },
  }, async (toolSpan) => {
    const start = Date.now();
    try {
      const result = await executeTool(toolName, input);
      toolSpan.setAttributes({
        'tool.duration_ms': Date.now() - start,
        'tool.output_length': result.length,
        'tool.success': true,
      });
      return result;
    } catch (error) {
      toolSpan.setAttributes({
        'tool.duration_ms': Date.now() - start,
        'tool.success': false,
        'tool.error_category': classifyError(error),
      });
      throw error;
    } finally {
      toolSpan.end();
    }
  });
}
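None of these spans leave the process unless the OTel SDK is initialized at startup. A minimal bootstrap sketch using @opentelemetry/sdk-node; the OTLP HTTP exporter package and the endpoint environment variable are assumptions, and any OTel-compatible backend works:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Initialize once, before any tracer.startActiveSpan calls run.
// Assumes an OTLP collector endpoint (Langfuse, Datadog agent, local collector, etc.).
const sdk = new NodeSDK({
  serviceName: 'agent-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
});

sdk.start();

// Flush in-flight spans on shutdown so the last trace of a crashing run survives
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```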

mermaid
gantt
    title Agent Run Trace (Waterfall)
    dateFormat X
    axisFormat %L ms

    section Agent Run
    agent.run           :0, 4200

    section Step 1
    llm.call (think)    :0, 800
    tool.search_kb      :800, 1400

    section Step 2
    llm.call (reason)   :1400, 2100
    tool.check_order    :2100, 2800

    section Step 3
    llm.call (respond)  :2800, 4200

What to Alert On

Dashboards are nice. Alerts that wake you up at the right time are essential. Here are the five alerts we've found actually matter for production agents, ranked by how often they fire:

typescript
interface AgentAlert {
  name: string;
  condition: string;
  severity: 'warning' | 'critical';
  runbook: string;
}
 
const PRODUCTION_ALERTS: AgentAlert[] = [
  {
    name: 'success_rate_drop',
    condition: 'success_rate < 0.95 over 5 minutes',
    severity: 'critical',
    runbook: 'Check LLM provider status, review recent tool errors, ' +
             'verify knowledge base connectivity',
  },
  {
    name: 'p95_latency_spike',
    condition: 'p95_duration_ms > 15000 over 5 minutes',
    severity: 'warning',
    runbook: 'Check LLM response times, look for tool timeout patterns, ' +
             'verify no infinite loop in agent steps',
  },
  {
    name: 'token_cost_anomaly',
    condition: 'hourly_token_cost > 2x rolling_7d_average',
    severity: 'warning',
    runbook: 'Check for prompt injection attempts, review agent step counts, ' +
             'look for context window overflow patterns',
  },
  {
    name: 'circuit_breaker_open',
    condition: 'any circuit breaker state == open',
    severity: 'critical',
    runbook: 'Identify which dependency tripped, check downstream service health, ' +
             'verify degradation pipeline is activating',
  },
  {
    name: 'eval_score_regression',
    condition: 'rolling_eval_score < baseline - 0.5 over 1 hour',
    severity: 'warning',
    runbook: 'Compare recent prompts to baseline, check for model version change, ' +
             'review knowledge base freshness',
  },
];

The token cost anomaly alert is the one people skip and then regret. A prompt injection attack or a stuck agent loop can burn hundreds of dollars in minutes. If your hourly cost doubles unexpectedly, something is wrong — and you want to know before the invoice arrives.
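The condition itself is cheap to compute. A sketch, assuming you can pull the current hour's spend and the last seven daily totals from your metrics store:

```typescript
// Sketch of the token_cost_anomaly condition: flag when the current hour's
// spend exceeds a multiple of the rolling 7-day hourly average.
function isCostAnomaly(
  hourlyCostUsd: number,
  dailyCostsUsd: number[],   // e.g. the last 7 daily totals
  multiplier = 2
): boolean {
  const totalHours = dailyCostsUsd.length * 24;
  if (totalHours === 0) return false;  // No baseline yet: don't alert
  const avgHourly =
    dailyCostsUsd.reduce((sum, day) => sum + day, 0) / totalHours;
  return hourlyCostUsd > multiplier * avgHourly;
}
```

Run it on a timer, emit the boolean as a metric, and let your alerting system handle deduplication and paging.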

For production monitoring of agent quality beyond raw metrics, automated scorecards catch behavioral regressions that latency alerts miss. If your agent starts being technically correct but tonally inappropriate, a scorecard catches it; a latency dashboard won't.

Illustration: a live agent-metrics dashboard (total calls, average duration, resolution rate, active calls, wait time, satisfaction)

Scaling Agent Workloads

A single agent run might take 30 seconds, make 5 LLM calls, invoke 3 tools, and hold conversation state the entire time — fundamentally different from a REST endpoint that responds in 50ms and forgets everything. When traffic spikes from 10 to 10,000 requests per minute, the scaling strategies that work for stateless APIs will fail you.

Why CPU-Based Autoscaling Doesn't Work

Here's the trap: agents spend most of their time waiting for LLM API responses. CPU utilization stays low even when the system is completely overloaded. If you autoscale on CPU, you'll never add capacity until it's too late.

Scale on queue depth instead. A BullMQ-based architecture decouples request acceptance from processing, letting you absorb traffic spikes without dropping requests:

typescript
import { Queue, Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
 
const connection = new IORedis({
  host: process.env.REDIS_HOST || 'localhost',
  port: parseInt(process.env.REDIS_PORT || '6379'),
  maxRetriesPerRequest: null,
});
 
// Producer: accepts requests and enqueues them
const agentQueue = new Queue('agent-runs', {
  connection,
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { age: 86400 },    // Keep completed jobs for 24h
    removeOnFail: { age: 604800 },       // Keep failed jobs for 7 days
  },
});
 
async function enqueueAgentRun(request: AgentRequest): Promise<string> {
  const job = await agentQueue.add('run', request, {
    priority: request.priority || 5,      // Lower number = higher priority
    // Note: BullMQ has no per-job timeout option; enforce deadlines
    // inside the worker via the agent's own timeoutMs guard
  });
  return job.id!;
}
 
// Worker: processes agent runs from the queue
const worker = new Worker('agent-runs', async (job: Job<AgentRequest>) => {
  const { prompt, config, conversationId } = job.data;
 
  // Load conversation state from external store (not process memory)
  const history = await loadConversationHistory(conversationId);
 
  const result = await tracedAgentRun(prompt, {
    ...config,
    messages: history,
  });
 
  // Persist updated state
  await saveConversationHistory(conversationId, result.messages);
 
  // Publish result for the waiting client
  await publishResult(job.id!, result);
 
  return result;
}, {
  connection,
  concurrency: parseInt(process.env.WORKER_CONCURRENCY || '10'),
  limiter: {
    max: 50,           // Max 50 jobs per duration window
    duration: 60_000,  // Per minute — prevents LLM rate limits
  },
});
 
worker.on('failed', (job, err) => {
  console.error(`Agent run ${job?.id} failed:`, err.message);
  // Emit metric for alerting
  metrics.increment('agent.run.failed', {
    error_category: classifyError(err)
  });
});
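To close the loop on queue-depth autoscaling, here's a sketch of the scaling decision itself. The signal would come from polling `agentQueue.getWaitingCount()` and feed your autoscaler (a KEDA ScaledObject, or a small custom controller); `jobsPerWorkerPerMinute` and `targetDrainMinutes` are illustrative knobs you'd tune from observed throughput.

```typescript
interface ScalingPolicy {
  minWorkers: number;
  maxWorkers: number;
  jobsPerWorkerPerMinute: number; // observed per-worker throughput
  targetDrainMinutes: number;     // how fast the backlog should clear
}

// Workers needed to drain the current backlog within the target window,
// clamped to the policy's floor and ceiling
function desiredWorkers(queueDepth: number, policy: ScalingPolicy): number {
  const needed = Math.ceil(
    queueDepth / (policy.jobsPerWorkerPerMinute * policy.targetDrainMinutes)
  );
  return Math.min(policy.maxWorkers, Math.max(policy.minWorkers, needed));
}
```

Poll the queue every few seconds and reconcile actual worker count toward this target; the clamp keeps a burst from scaling you into your LLM provider's rate limits.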

Externalizing State

The key line in that worker is loadConversationHistory — not this.conversationHistory. If conversation state lives in process memory, you can't scale horizontally. A request must be able to land on any worker instance and pick up where the previous step left off.

Redis is the natural choice for hot state (active conversations, in-progress executions). Move cold state (completed conversations, historical context) to your primary database. This two-tier approach keeps Redis lean while maintaining full conversation history:

```typescript
interface ConversationStore {
  // Hot path: Redis (active conversations)
  getActive(conversationId: string): Promise<ConversationState | null>;
  setActive(conversationId: string, state: ConversationState, ttlSeconds?: number): Promise<void>;

  // Cold path: database (historical)
  archive(conversationId: string): Promise<void>;
  getArchived(conversationId: string): Promise<ConversationState | null>;
}

class RedisConversationStore implements ConversationStore {
  constructor(private redis: IORedis) {}

  async getActive(id: string): Promise<ConversationState | null> {
    const data = await this.redis.get(`conv:${id}`);
    return data ? JSON.parse(data) : null;
  }

  async setActive(
    id: string,
    state: ConversationState,
    ttlSeconds = 3600
  ): Promise<void> {
    await this.redis.setex(
      `conv:${id}`,
      ttlSeconds,
      JSON.stringify(state)
    );
  }

  async archive(id: string): Promise<void> {
    const state = await this.getActive(id);
    if (state) {
      await db.conversations.insertOne({ ...state, archivedAt: new Date() });
      await this.redis.del(`conv:${id}`);
    }
  }

  async getArchived(id: string): Promise<ConversationState | null> {
    return db.conversations.findOne({ conversationId: id });
  }
}
```
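The read path that composes the two tiers looks like this: try the hot store, fall back to the archive, and rehydrate Redis on a cold hit so the rest of the conversation stays on the hot path. This is a sketch; the minimal `TwoTierStore` interface and `ConvState` shape below stand in for the fuller `ConversationStore` above.

```typescript
interface ConvState { conversationId: string; messages: string[] }

interface TwoTierStore {
  getActive(id: string): Promise<ConvState | null>;
  setActive(id: string, state: ConvState, ttlSeconds?: number): Promise<void>;
  getArchived(id: string): Promise<ConvState | null>;
}

async function loadConversation(
  store: TwoTierStore,
  id: string
): Promise<ConvState | null> {
  const hot = await store.getActive(id);
  if (hot) return hot;

  const cold = await store.getArchived(id);
  if (cold) {
    // Rehydrate so subsequent turns of this conversation hit Redis
    await store.setActive(id, cold, 3600);
  }
  return cold;
}
```

A worker calls `loadConversation` at the top of every run, so a conversation that went cold overnight resumes transparently when the customer returns.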

Scaling Decision Matrix

Not every agent workload needs the same scaling strategy. A low-volume internal agent is over-served by a queue architecture. A customer-facing agent handling 10K concurrent conversations needs every piece of this infrastructure.

| Traffic Pattern | Strategy | Queue Needed? | State Store |
| --- | --- | --- | --- |
| < 100 req/min | Single process, in-memory state | No | Process memory |
| 100–1K req/min | Multiple workers, external state | Yes | Redis |
| 1K–10K req/min | Auto-scaled workers, partitioned queues | Yes | Redis + database |
| > 10K req/min | Regional deployment, priority queues | Yes | Redis cluster + database |

The transition from "single process" to "queue + workers" is the most painful one. Do it before you need to: retrofitting external state into an agent built around process memory is far harder than designing for it from day one.

Putting It All Together: The Production Agent Architecture

Theory is one thing. Wiring it all together is another. Here's how the patterns from this article compose into a production deployment.

Every request flows through the same pipeline: accept, validate, enqueue, execute with retries and circuit breakers, degrade gracefully if needed, trace everything, and return a result. No component is optional — skip the circuit breaker and you'll learn why you needed it at 2 AM:

[Architecture diagram: Client Request → API Gateway → Request Validation → Agent Queue → Worker Pool → Orchestration (ReAct Loop / Plan & Execute) → Tool Execution → Circuit Breaker (Closed/Open) → External Service, with Fallback/Cache on open circuits → Degradation Check → Response; OTel Traces and Metrics & Alerts instrument every stage]

A Production Checklist

Before deploying any agent to production, run through this checklist. Every item exists because someone (often me) learned the hard way that skipping it causes real problems:

Execution Bounds

  • Maximum step count set (prevents infinite reasoning loops)
  • Wall-clock timeout configured (prevents hung executions)
  • Token budget enforced per run (prevents cost runaway)
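The three execution bounds compose naturally into a single guard object checked at the top of every loop iteration. This is a sketch; the names and limits are illustrative, not a fixed API.

```typescript
class RunBudget {
  private steps = 0;
  private tokens = 0;
  private readonly deadlineMs: number;

  constructor(
    private readonly maxSteps: number,
    wallClockMs: number,
    private readonly maxTokens: number
  ) {
    this.deadlineMs = Date.now() + wallClockMs;
  }

  recordStep(tokensUsed: number): void {
    this.steps += 1;
    this.tokens += tokensUsed;
  }

  // Returns the reason the run must stop, or null if still within budget
  exceeded(): 'max_steps' | 'wall_clock' | 'token_budget' | null {
    if (this.steps >= this.maxSteps) return 'max_steps';
    if (Date.now() > this.deadlineMs) return 'wall_clock';
    if (this.tokens >= this.maxTokens) return 'token_budget';
    return null;
  }
}
```

When `exceeded()` returns a reason, surface it in the trace and the user-facing message rather than failing silently; "stopped after 12 steps" is debuggable, a hung request is not.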

Error Handling

  • Errors classified as transient/permanent/rate-limit
  • Exponential backoff with jitter on retries
  • Circuit breakers on every external dependency
  • Dead letter queue for permanently failed jobs
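As a concrete sketch of the first two items: classify before you retry, and use "full jitter" backoff (the variant recommended in the AWS Architecture Blog) so a fleet of workers doesn't retry in lockstep. The status-code mapping here is illustrative; real classifiers also inspect provider-specific error types.

```typescript
type ErrorCategory = 'transient' | 'rate_limit' | 'permanent';

function classifyStatus(status: number): ErrorCategory {
  if (status === 429) return 'rate_limit';
  if (status >= 500) return 'transient'; // retry with backoff
  return 'permanent';                    // other 4xx: retrying won't help
}

function backoffDelayMs(
  attempt: number,                       // 0-based retry attempt
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random     // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exp);     // full jitter: uniform in [0, exp)
}
```

Only `transient` and `rate_limit` errors should ever reach `backoffDelayMs`; `permanent` failures go straight to the dead letter queue.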

Degradation

  • Model fallback chain tested (primary → secondary → tertiary)
  • Cached response layer for common queries
  • Static fallback for when all LLM providers are down
  • User-facing messages explain limitations honestly
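The fallback chain from the checklist can be sketched as an ordered list of providers with a static terminal answer. The `Provider` functions here are plain async stand-ins for real model clients, not any particular SDK.

```typescript
type Completion = { text: string; source: string };
type Provider = (prompt: string) => Promise<Completion>;

async function completeWithFallbacks(
  prompt: string,
  providers: Provider[],                 // primary first, tertiary last
  staticFallback: Completion             // always available, no LLM needed
): Promise<Completion> {
  for (const provider of providers) {
    try {
      return await provider(prompt);
    } catch {
      // Record the failure and try the next tier; in production a
      // circuit breaker wraps each provider so dead tiers are skipped
      continue;
    }
  }
  // All providers down: degrade honestly rather than erroring out
  return staticFallback;
}
```

Tagging each completion with its `source` lets you alert on fallback rates; a sudden spike in `static` responses is an outage signal even when no request technically failed.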

Observability

  • Every agent run produces a trace with step-level detail
  • Token usage, latency, and error rates exported as metrics
  • Alerts configured for success rate, latency, and cost anomalies
  • Traces link to conversation IDs for customer support debugging
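The shape of per-run instrumentation can be sketched without any dependencies: wrap every run, record duration and outcome, and never swallow the error. In production the injected `sink` would be an OpenTelemetry span plus your metrics client; here it's a plain callback so the pattern stands on its own.

```typescript
interface RunRecord {
  conversationId: string;   // links trace to customer support lookups
  durationMs: number;
  outcome: 'success' | 'error';
  error?: string;
}

async function instrumented<T>(
  conversationId: string,
  sink: (r: RunRecord) => void,
  run: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await run();
    sink({ conversationId, durationMs: Date.now() - start, outcome: 'success' });
    return result;
  } catch (err) {
    sink({
      conversationId,
      durationMs: Date.now() - start,
      outcome: 'error',
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // never swallow: callers and retry logic still see the failure
  }
}
```

The same wrapper applies one level down to individual tool calls, which is what gives you step-level detail inside a run's trace.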

Scaling

  • Conversation state externalized (Redis or database)
  • Queue-based processing with priority support
  • Autoscaling based on queue depth, not CPU
  • Rate limits prevent LLM API quota exhaustion

If you're using scenario testing to validate agent behavior, run your scenario suite against the degraded modes too — not just the happy path. An agent that handles degradation gracefully during testing but crashes in production on a mode you didn't test is not production-ready.

What Breaks Next: Patterns to Watch

Agent infrastructure is evolving fast, and three trends are reshaping what "production-ready" means.

Structured output as a reliability layer. OpenAI, Anthropic, and Google all now support constrained JSON generation. This eliminates an entire category of production failures — malformed tool call arguments, unparseable responses, schema violations. Combined with solid prompt engineering fundamentals, structured output mode is the single highest-impact change you can make if you're still parsing free-text LLM output with regex.
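Even with provider-side constrained generation, validate before the output reaches your tool layer. Real deployments pair constrained JSON with a schema library like Zod; this hand-rolled sketch just shows the failure modes the validation closes (the `ToolCall` shape is illustrative).

```typescript
interface ToolCall { name: string; args: Record<string, unknown> }

function parseToolCall(raw: string): ToolCall | { error: string } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { error: 'malformed_json' }; // the case that used to crash parsers
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj?.name !== 'string') return { error: 'missing_name' };
  if (typeof obj?.args !== 'object' || obj.args === null) {
    return { error: 'missing_args' };
  }
  return { name: obj.name, args: obj.args as Record<string, unknown> };
}
```

Returning a typed error instead of throwing lets the agent loop treat a bad generation like any other transient failure: re-prompt once, then degrade.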

Agent-native evaluation in CI. Teams are moving from "run evals manually when someone remembers" to automated eval suites that gate deployments. If your prompt management pipeline doesn't include automated quality checks, you're deploying blind. We covered the framework for this in How to Evaluate AI Agents — the pattern is maturing rapidly.

Persistent memory changing scaling assumptions. When agents have long-term memory — remembering previous conversations, customer preferences, learned procedures — the state management story gets more complex. You can't just externalize conversation history to Redis; you need a memory layer that persists across sessions, handles conflicts between concurrent updates, and degrades gracefully when the memory store is unavailable.

The teams that survive the Gartner 40% cancellation rate will be the ones that treated production reliability as a first-class concern from the start — not something to bolt on after the demo impressed the executives. Every pattern in this article exists because someone shipped without it and paid the price.

Build the boring infrastructure. Your agents will thank you at 2 AM.
