
AI Agent Observability: What to Monitor When Your Agent Goes Live

Build a production observability pipeline for AI agents. Covers latency, token usage, tool success rates, conversation quality, drift detection, structured logging, alerting strategies, and the critical difference between LLM and agent observability.

Lucas Dalamarta, Engineering Lead
March 10, 2026
28 min read
Watercolor illustration of an engineering team monitoring AI agent dashboards with data flowing across screens

Your agent passed every test in staging. The prompts are polished, the tools work, the latency looks great. You deploy to production, and for the first three days everything runs smoothly. On day four, a customer reports getting wildly incorrect pricing information. You check the logs — no errors, no exceptions, no alerts fired. The agent was technically healthy the entire time. It was just wrong.

This is the observability gap that catches most teams. Traditional monitoring answers "is it running?" Agent observability answers "is it behaving correctly?" And those are fundamentally different questions. According to Gartner, over 40% of agentic AI projects will be canceled by the end of 2027, with inadequate risk controls and unclear operational visibility among the top reasons. PwC's 2025 AI Agent Survey found that while 79% of organizations have adopted AI agents, the majority struggle to trace failures through multi-step workflows.

This guide builds a production observability pipeline from the ground up. We'll cover every metric that matters, build structured logging and tracing in TypeScript, set up alerting that doesn't drown you in noise, and address the specific challenges that make agent observability different from anything you've monitored before.

| What you'll learn | Why it matters |
| --- | --- |
| The 5 pillars of agent observability | Know exactly what to measure before writing a single line of instrumentation |
| Structured logging for agents | Build traceable, queryable logs that connect reasoning steps to outcomes |
| Tool call monitoring | Detect both hard failures and subtle misuse before users notice |
| Drift detection pipeline | Catch gradual degradation that averages and error rates miss |
| Alerting without fatigue | Composite signals that page you when it matters, stay quiet when it doesn't |

Prerequisites

You'll need Node.js 20+, TypeScript 5+, and familiarity with basic observability concepts (logs, metrics, traces). Experience with AI agents — even a simple one — will make the examples more concrete. If you're new to agent tool infrastructure, AI Agent Tools: MCP, OpenAPI, and Tool Management That Actually Scales covers the foundational patterns we'll be instrumenting here.

The code examples use TypeScript throughout and are framework-agnostic. You'll also want these dependencies installed:

bash
npm install @opentelemetry/api @opentelemetry/sdk-node prom-client zod

We'll reference OpenTelemetry's experimental GenAI semantic conventions where applicable, but the patterns work with any tracing backend.

LLM Observability vs. Agent Observability: Why the Distinction Matters

LLM observability watches a single model call — prompt in, completion out, how long it took, how many tokens it consumed. Agent observability watches an entire reasoning workflow that might span dozens of model calls, tool executions, memory lookups, and planning decisions, all connected in a causal chain.

Here's why this distinction isn't academic. Consider a customer support agent that handles a refund request. In a single conversation turn, it might:

  1. Retrieve the customer's order history (memory lookup)
  2. Check the return policy for the product category (knowledge base query)
  3. Verify the order is within the return window (tool call)
  4. Calculate the refund amount (LLM reasoning)
  5. Process the refund (tool call)
  6. Generate a confirmation message (LLM completion)

LLM observability sees steps 4 and 6 as two isolated model calls. Agent observability sees one workflow with six connected steps, where a failure in step 2 (wrong policy retrieved) cascades into step 4 (incorrect calculation) and step 6 (confidently wrong confirmation). The model calls themselves were fast and well-formed. The agent behavior was broken.

User Request → Memory Lookup → KB Query → Tool: Check Window → LLM: Calculate → Tool: Process Refund → LLM: Confirm

The table below maps the key differences across every dimension that matters for production monitoring:

| Dimension | LLM Observability | Agent Observability |
| --- | --- | --- |
| Scope | Single model call | Multi-step workflow |
| Latency | Time-to-first-token, completion time | End-to-end turn latency across all steps |
| Errors | API failures, rate limits | Cascading failures across reasoning chain |
| Quality | Output relevance for one prompt | Task completion across full conversation |
| Cost | Tokens per call | Tokens + tool calls + memory lookups per conversation |
| Traces | Request/response pairs | Directed acyclic graph of decisions |
| Drift | Output distribution shift | Behavioral pattern change over time |

As Arize AI's research puts it, agent observability provides "the connective tissue to reconstruct the agent's path from the initial prompt to the outcome." Without that connective tissue, you're debugging in the dark.

The Five Pillars of Agent Observability

Every production agent needs monitoring across five dimensions. Miss one, and you'll have a blind spot that bites you at 2 AM. Let's walk through each pillar, what to measure, and what the numbers actually mean.

Pillar 1: Latency — Beyond Averages

Track p50, p95, and p99 latency for every step in your agent's workflow, not just the total. A multi-step agent might respond in 800ms most of the time but take 15 seconds on complex queries that trigger multiple tool calls. The average looks healthy. The users hitting that tail don't agree.

Here's a latency collector that captures per-step timing across the full agent execution:

typescript
import { Histogram } from 'prom-client';
 
const agentLatency = new Histogram({
  name: 'agent_turn_duration_seconds',
  help: 'End-to-end latency for a single agent turn',
  labelNames: ['agent_id', 'step_type', 'model'],
  buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
});
 
const stepLatency = new Histogram({
  name: 'agent_step_duration_seconds',
  help: 'Latency for individual agent steps',
  labelNames: ['agent_id', 'step_type', 'step_name'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
 
interface StepTiming {
  stepType: 'llm_call' | 'tool_call' | 'memory_lookup' | 'kb_query';
  stepName: string;
  startTime: number;
  endTime?: number;
}
 
class LatencyTracker {
  private steps: StepTiming[] = [];
  private turnStart: number;
 
  constructor(private agentId: string, private model: string) {
    this.turnStart = performance.now();
  }
 
  startStep(stepType: StepTiming['stepType'], stepName: string): () => void {
    const step: StepTiming = {
      stepType,
      stepName,
      startTime: performance.now(),
    };
    this.steps.push(step);
 
    // Return a function to end this step
    return () => {
      step.endTime = performance.now();
      const durationSec = (step.endTime - step.startTime) / 1000;
      stepLatency.observe(
        { agent_id: this.agentId, step_type: stepType, step_name: stepName },
        durationSec,
      );
    };
  }
 
  finishTurn(): { totalMs: number; steps: StepTiming[] } {
    const totalMs = performance.now() - this.turnStart;
    agentLatency.observe(
      { agent_id: this.agentId, step_type: 'full_turn', model: this.model },
      totalMs / 1000,
    );
    return { totalMs, steps: this.steps };
  }
}

What matters here isn't just the numbers — it's the segmentation. A p99 spike in tool_call steps while llm_call stays flat tells you the problem is in your external integrations, not your model. A p95 increase in llm_call with stable tool latency suggests model degradation or prompt bloat.

Thresholds worth watching:

| Metric | Healthy | Investigate | Alert |
| --- | --- | --- | --- |
| p50 turn latency | < 2s | 2-5s | > 5s |
| p95 turn latency | < 5s | 5-10s | > 10s |
| p99 turn latency | < 10s | 10-20s | > 20s |
| Step-to-step variance | < 3x p50 | 3-5x p50 | > 5x p50 |

These are starting points. Your actual thresholds should come from your own baseline data after two weeks of production traffic.
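One way to derive those baselines from your own traffic: collect two weeks of raw turn latencies and compute the percentiles directly. A minimal sketch assuming linear interpolation between sorted samples (Prometheus histograms would approximate via buckets instead; the function names are mine):

```typescript
// Compute a percentile from raw latency samples using linear interpolation.
// Run this over ~2 weeks of production turn latencies to set your own thresholds.
function percentile(samples: number[], q: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = (sorted.length - 1) * q;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  if (lo === hi) return sorted[lo];
  // Interpolate between the two nearest samples
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function baselineThresholds(turnLatenciesMs: number[]) {
  return {
    p50: percentile(turnLatenciesMs, 0.5),
    p95: percentile(turnLatenciesMs, 0.95),
    p99: percentile(turnLatenciesMs, 0.99),
  };
}
```

Feed the result into your alert rules instead of hard-coding the table above.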

Pillar 2: Token Usage and Cost

Token consumption is both a cost signal and a behavioral signal. Sudden spikes in tokens per conversation often indicate the agent is hedging, repeating itself, or pulling too much context — all symptoms of deeper issues.

This tracker captures token metrics per conversation with enough granularity to pinpoint where tokens are being spent:

typescript
interface TokenMetrics {
  conversationId: string;
  agentId: string;
  turns: Array<{
    turnIndex: number;
    inputTokens: number;
    outputTokens: number;
    model: string;
    toolCallCount: number;
    memoryTokens: number;
    kbContextTokens: number;
  }>;
}
 
function analyzeTokenUsage(metrics: TokenMetrics): {
  totalCost: number;
  avgTokensPerTurn: number;
  contextRatio: number;
  anomalies: string[];
} {
  const anomalies: string[] = [];
  let totalInput = 0;
  let totalOutput = 0;
  let totalContext = 0;
 
  for (const turn of metrics.turns) {
    totalInput += turn.inputTokens;
    totalOutput += turn.outputTokens;
    totalContext += turn.memoryTokens + turn.kbContextTokens;
 
    // Flag turns where context exceeds 60% of input (guard against zero-token turns)
    const contextShare =
      turn.inputTokens > 0 ? (turn.memoryTokens + turn.kbContextTokens) / turn.inputTokens : 0;
    if (contextShare > 0.6) {
      anomalies.push(
        `Turn ${turn.turnIndex}: context is ${(contextShare * 100).toFixed(0)}% of input — consider trimming retrieval`
      );
    }
 
    // Flag unusually long outputs (hedging signal)
    if (turn.outputTokens > 800 && turn.toolCallCount === 0) {
      anomalies.push(
        `Turn ${turn.turnIndex}: ${turn.outputTokens} output tokens with no tool calls — possible hedging`
      );
    }
  }
 
  // Cost calculation (adjust rates per model)
  const MODEL_COSTS: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 2.50, output: 10.00 },        // per 1M tokens
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-sonnet-4-20250514': { input: 3.00, output: 15.00 },
  };
 
  let totalCost = 0;
  for (const turn of metrics.turns) {
    const rates = MODEL_COSTS[turn.model] ?? MODEL_COSTS['gpt-4o-mini'];
    totalCost += (turn.inputTokens * rates.input + turn.outputTokens * rates.output) / 1_000_000;
  }
 
  return {
    totalCost,
    avgTokensPerTurn: (totalInput + totalOutput) / metrics.turns.length,
    contextRatio: totalContext / totalInput,
    anomalies,
  };
}

A mid-sized deployment handling 1,000 daily conversations with multi-turn interactions can consume 5 to 10 million tokens per month. At GPT-4o rates, that's $25 to $125 per month for a light workload — but costs scale non-linearly when reasoning tokens, retrieval context, and retries stack up. Teams routinely pass four to eight full documents into context when a few targeted chunks would suffice. Monitoring token distribution by source (system prompt, memory, retrieval, user input) reveals exactly where to cut.
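As a sketch of that source-level breakdown (the systemTokens and userTokens fields are assumed additions to the turn shape above, not part of the earlier TokenMetrics interface):

```typescript
// Break input tokens down by source so you can see where context bloat lives.
interface TurnTokenBreakdown {
  systemTokens: number;
  memoryTokens: number;
  kbContextTokens: number;
  userTokens: number;
}

function tokenShareBySource(turns: TurnTokenBreakdown[]): Record<string, number> {
  const totals = { system: 0, memory: 0, retrieval: 0, user: 0 };
  for (const t of turns) {
    totals.system += t.systemTokens;
    totals.memory += t.memoryTokens;
    totals.retrieval += t.kbContextTokens;
    totals.user += t.userTokens;
  }
  const grand = totals.system + totals.memory + totals.retrieval + totals.user;
  const shares: Record<string, number> = {};
  for (const [source, count] of Object.entries(totals)) {
    // Each share is that source's fraction of all input tokens
    shares[source] = grand === 0 ? 0 : count / grand;
  }
  return shares;
}
```

If retrieval is consistently above half of your input tokens, that's usually the first place to trim.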

Pillar 3: Tool Call Success Rates

Your agent's tools are its connection to the real world, and they're the most common point of failure in production. Track both hard failures (exceptions, timeouts) and soft failures (the tool returned data, but the agent misused it or the data was stale).

Here's a tool monitoring wrapper that captures everything you need:

typescript
import { Counter, Histogram } from 'prom-client';
 
const toolCallCounter = new Counter({
  name: 'agent_tool_calls_total',
  help: 'Total tool call attempts',
  labelNames: ['agent_id', 'tool_name', 'status'],
});
 
const toolLatencyHist = new Histogram({
  name: 'agent_tool_call_duration_seconds',
  help: 'Tool call execution time',
  labelNames: ['agent_id', 'tool_name'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
 
interface ToolCallRecord {
  toolName: string;
  agentId: string;
  conversationId: string;
  input: Record<string, unknown>;
  output?: unknown;
  error?: string;
  durationMs: number;
  status: 'success' | 'error' | 'timeout' | 'invalid_input';
  timestamp: Date;
}
 
async function monitoredToolCall<T>(
  agentId: string,
  conversationId: string,
  toolName: string,
  input: Record<string, unknown>,
  executor: (input: Record<string, unknown>) => Promise<T>,
  timeoutMs = 5000,
): Promise<{ result: T | null; record: ToolCallRecord }> {
  const start = performance.now();
  let record: ToolCallRecord;
 
  try {
    const result = await Promise.race([
      executor(input),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Tool call timeout')), timeoutMs)
      ),
    ]);
 
    const durationMs = performance.now() - start;
    record = {
      toolName, agentId, conversationId, input,
      output: result,
      durationMs,
      status: 'success',
      timestamp: new Date(),
    };
 
    toolCallCounter.inc({ agent_id: agentId, tool_name: toolName, status: 'success' });
    toolLatencyHist.observe({ agent_id: agentId, tool_name: toolName }, durationMs / 1000);
 
    return { result, record };
  } catch (err) {
    const durationMs = performance.now() - start;
    const isTimeout = err instanceof Error && err.message === 'Tool call timeout';
 
    record = {
      toolName, agentId, conversationId, input,
      error: err instanceof Error ? err.message : String(err),
      durationMs,
      status: isTimeout ? 'timeout' : 'error',
      timestamp: new Date(),
    };
 
    toolCallCounter.inc({
      agent_id: agentId,
      tool_name: toolName,
      status: record.status,
    });
 
    return { result: null, record };
  }
}

The critical insight: a tool with a 99% success rate sounds great until you realize that tool is called three times per conversation on average. That's a 3% per-conversation failure rate — roughly one in every 33 customers hits a tool error. At 1,000 daily conversations, that's 30 broken interactions per day.
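That compounding math is worth encoding directly. A tiny helper (function names are mine) that turns a per-call success rate and calls-per-conversation into the numbers above, assuming independent failures:

```typescript
// Probability that at least one of n independent tool calls fails,
// given a per-call success rate: 1 - s^n.
function perConversationFailureRate(perCallSuccess: number, callsPerConversation: number): number {
  return 1 - Math.pow(perCallSuccess, callsPerConversation);
}

// Expected number of conversations per day that hit at least one tool error.
function brokenConversationsPerDay(
  perCallSuccess: number,
  callsPerConversation: number,
  dailyConversations: number,
): number {
  return dailyConversations * perConversationFailureRate(perCallSuccess, callsPerConversation);
}
```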

Tool health dashboard essentials:

| Metric | What it reveals |
| --- | --- |
| Success rate per tool (hourly) | Which tools are failing and when |
| p95 latency per tool | External dependency slowdowns |
| Call frequency per conversation | Is the agent over-calling tools? |
| Input validation failure rate | Schema mismatches between LLM output and tool expectations |
| Timeout rate vs. error rate | Network issues vs. logic errors |

If you've built your tools using the patterns from AI Agent Tools: MCP, OpenAPI, and Tool Management, each tool already has a schema. Monitoring input validation failures against that schema is one of the cheapest, highest-signal checks you can run.

Pillar 4: Conversation Quality

Latency and error rates tell you if the system is healthy. Quality scores tell you if the agent is actually helping people. This is the metric most teams add last and wish they'd added first.

Automated quality evaluation uses an LLM judge to score a sample of production conversations against a rubric. If you've already built an eval framework, you can reuse those same rubrics in production — just run them asynchronously on sampled traffic instead of blocking the response path.

Here's a production quality evaluator that runs alongside your agent without adding latency:

typescript
interface QualityEvaluation {
  conversationId: string;
  agentId: string;
  scores: {
    accuracy: number;       // 1-5: factual correctness
    completeness: number;   // 1-5: did the agent address the full request?
    tone: number;           // 1-5: appropriate, professional, empathetic
    policyAdherence: number; // 1-5: stayed within defined guardrails
    taskCompletion: number; // 1-5: did the user's goal get accomplished?
  };
  reasoning: string;
  flagged: boolean;
  evaluatedAt: Date;
}
 
async function evaluateConversation(
  conversation: { role: string; content: string }[],
  agentId: string,
  conversationId: string,
  policyContext: string,
): Promise<QualityEvaluation> {
  const judgePrompt = `You are evaluating an AI agent conversation for quality.
 
CONVERSATION:
${conversation.map(m => `${m.role}: ${m.content}`).join('\n')}
 
POLICY CONTEXT:
${policyContext}
 
Score each dimension from 1 (poor) to 5 (excellent):
 
1. ACCURACY: Is the information factually correct? Did the agent state anything false?
2. COMPLETENESS: Did the agent address all parts of the user's request?
3. TONE: Was the agent appropriately professional and empathetic?
4. POLICY ADHERENCE: Did the agent stay within the defined policies and guardrails?
5. TASK COMPLETION: Was the user's underlying goal accomplished?
 
Respond in JSON:
{
  "accuracy": <1-5>,
  "completeness": <1-5>,
  "tone": <1-5>,
  "policyAdherence": <1-5>,
  "taskCompletion": <1-5>,
  "reasoning": "<2-3 sentences explaining scores>",
  "flagged": <true if any score is 2 or below>
}`;
 
  // Call your judge model here (implementation depends on your LLM client)
  const judgeResponse = await callJudgeModel(judgePrompt);
  const scores = JSON.parse(judgeResponse);
 
  return {
    conversationId,
    agentId,
    scores: {
      accuracy: scores.accuracy,
      completeness: scores.completeness,
      tone: scores.tone,
      policyAdherence: scores.policyAdherence,
      taskCompletion: scores.taskCompletion,
    },
    reasoning: scores.reasoning,
    flagged: scores.flagged,
    evaluatedAt: new Date(),
  };
}

Sample 5-10% of conversations for evaluation. That's enough to detect trends without blowing your API budget on judge calls. For high-stakes domains — healthcare, finance, legal — bump to 20-30%. Always sample randomly; cherry-picking "interesting" conversations introduces selection bias that makes your trend data worthless.
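One simple way to guarantee unbiased, reproducible sampling is to hash the conversation ID rather than call Math.random(). A sketch using Node's built-in crypto module (the function name is mine):

```typescript
import { createHash } from 'node:crypto';

// Decide whether to evaluate a conversation, deterministically and without
// selection bias: hash the conversation ID into a uniform [0, 1) bucket and
// compare against the sample rate. The same ID always gets the same decision,
// which keeps re-runs and backfills stable.
function shouldEvaluate(conversationId: string, sampleRate: number): boolean {
  const digest = createHash('sha256').update(conversationId).digest();
  const bucket = digest.readUInt32BE(0) / 2 ** 32; // uniform in [0, 1)
  return bucket < sampleRate;
}
```

Raising the rate later (say, 5% to 20%) is a superset of the earlier sample, so trend comparisons stay valid.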

Track rolling averages per dimension. A 0.3-point decline in accuracy over two weeks is meaningful, even if every other dimension holds steady. That single-dimension drop tells you the agent is drifting on factual content while maintaining good tone — a pattern that's invisible in an aggregate quality score. If you want to go deeper on building quality rubrics, the eval framework guide covers multi-criteria scoring in detail.
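A rolling-average check like that takes only a few lines. A sketch (the window size and 0.3-point threshold are illustrative, not prescriptive):

```typescript
// Rolling mean over a sliding window of daily scores for one quality dimension.
function rollingMean(values: number[], window: number): number[] {
  const out: number[] = [];
  for (let i = 0; i + window <= values.length; i++) {
    const slice = values.slice(i, i + window);
    out.push(slice.reduce((a, b) => a + b, 0) / window);
  }
  return out;
}

// Flag a dimension whose rolling mean has declined by at least `threshold`
// points from the start of the series to the end.
function declineExceeds(dailyScores: number[], window: number, threshold: number): boolean {
  const means = rollingMean(dailyScores, window);
  if (means.length < 2) return false;
  return means[0] - means[means.length - 1] >= threshold;
}
```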

Pillar 5: Drift Detection

Drift is the silent killer of production agents. Everything looks fine day to day, but over weeks or months, the agent's behavior slowly shifts — responses get longer, certain topic areas get less accurate, tool usage patterns change. No single conversation triggers an alert. The degradation only becomes visible in aggregate.

There are two types of drift to watch for. Prompt drift happens when the model's behavior changes even though your prompts haven't — caused by model updates, input distribution shifts, or context window changes. Behavioral drift happens when the agent's overall patterns shift: it starts preferring different tools, generating longer responses, or handling certain topics differently.

Here's a drift detector that compares recent behavior against a baseline:

typescript
interface DriftWindow {
  period: string;         // "2026-03-01_to_2026-03-07"
  agentId: string;
  metrics: {
    avgTokensPerTurn: number;
    avgToolCallsPerConversation: number;
    avgQualityScore: number;
    topicDistribution: Record<string, number>;   // topic -> percentage
    avgResponseLength: number;
    escalationRate: number;
  };
}
 
interface DriftReport {
  agentId: string;
  baselineWindow: string;
  currentWindow: string;
  alerts: DriftAlert[];
  overallRisk: 'low' | 'medium' | 'high';
}
 
interface DriftAlert {
  metric: string;
  baseline: number;
  current: number;
  changePercent: number;
  severity: 'info' | 'warning' | 'critical';
}
 
function detectDrift(baseline: DriftWindow, current: DriftWindow): DriftReport {
  const alerts: DriftAlert[] = [];
 
  const checks: Array<{
    metric: string;
    baselineVal: number;
    currentVal: number;
    warnThreshold: number;
    critThreshold: number;
  }> = [
    {
      metric: 'avg_tokens_per_turn',
      baselineVal: baseline.metrics.avgTokensPerTurn,
      currentVal: current.metrics.avgTokensPerTurn,
      warnThreshold: 0.25,   // 25% change
      critThreshold: 0.50,   // 50% change
    },
    {
      metric: 'avg_quality_score',
      baselineVal: baseline.metrics.avgQualityScore,
      currentVal: current.metrics.avgQualityScore,
      warnThreshold: 0.05,   // 5% decline (quality is sensitive)
      critThreshold: 0.10,   // 10% decline
    },
    {
      metric: 'avg_tool_calls_per_conversation',
      baselineVal: baseline.metrics.avgToolCallsPerConversation,
      currentVal: current.metrics.avgToolCallsPerConversation,
      warnThreshold: 0.30,
      critThreshold: 0.60,
    },
    {
      metric: 'escalation_rate',
      baselineVal: baseline.metrics.escalationRate,
      currentVal: current.metrics.escalationRate,
      warnThreshold: 0.15,   // 15% increase
      critThreshold: 0.30,   // 30% increase
    },
    {
      metric: 'avg_response_length',
      baselineVal: baseline.metrics.avgResponseLength,
      currentVal: current.metrics.avgResponseLength,
      warnThreshold: 0.35,
      critThreshold: 0.60,
    },
  ];
 
  for (const check of checks) {
    if (check.baselineVal === 0) continue;
    const changePct = (check.currentVal - check.baselineVal) / check.baselineVal;
    const absChange = Math.abs(changePct);
 
    if (absChange >= check.critThreshold) {
      alerts.push({
        metric: check.metric,
        baseline: check.baselineVal,
        current: check.currentVal,
        changePercent: changePct * 100,
        severity: 'critical',
      });
    } else if (absChange >= check.warnThreshold) {
      alerts.push({
        metric: check.metric,
        baseline: check.baselineVal,
        current: check.currentVal,
        changePercent: changePct * 100,
        severity: 'warning',
      });
    }
  }
 
  const criticalCount = alerts.filter(a => a.severity === 'critical').length;
  const warningCount = alerts.filter(a => a.severity === 'warning').length;
 
  return {
    agentId: current.agentId,
    baselineWindow: baseline.period,
    currentWindow: current.period,
    alerts,
    overallRisk: criticalCount > 0 ? 'high' : warningCount >= 2 ? 'medium' : 'low',
  };
}

Run drift detection weekly. Compare each week against a two-week rolling baseline (not a fixed baseline from launch — your baseline should evolve as your agent legitimately improves). As IBM's research notes, agentic drift occurs as underlying models update, training data shifts, and business contexts change — it's not a one-time risk but an ongoing operational reality.
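Constructing that rolling baseline can be as simple as averaging the prior weekly windows metric-by-metric. A sketch mirroring the DriftWindow.metrics shape above, with topicDistribution omitted for brevity:

```typescript
// Numeric metrics for one weekly window; mirrors DriftWindow.metrics above.
interface BaselineMetrics {
  avgTokensPerTurn: number;
  avgToolCallsPerConversation: number;
  avgQualityScore: number;
  avgResponseLength: number;
  escalationRate: number;
}

// Merge the previous N weekly windows into one rolling baseline by averaging
// each metric. Feed the result into detectDrift as the baseline side.
function rollingBaseline(windows: BaselineMetrics[]): BaselineMetrics {
  if (windows.length === 0) throw new Error('need at least one window');
  const keys = Object.keys(windows[0]) as Array<keyof BaselineMetrics>;
  const out = {} as BaselineMetrics;
  for (const key of keys) {
    out[key] = windows.reduce((sum, w) => sum + w[key], 0) / windows.length;
  }
  return out;
}
```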


Building an Observability Pipeline

Now that you know what to measure, let's build the pipeline that collects, processes, and surfaces all of it. A production observability pipeline has four layers: instrumentation, collection, storage, and presentation.

Agent Runtime → Log / Trace / Metrics Collectors → Processing Layer → Alerting Engine, Quality Evaluation, Dashboard / Analytics → On-Call (pages) and Issue Tracker (tickets)

Instrumentation: The Agent Trace

The foundation of agent observability is the trace — a tree of spans that captures every step the agent took to produce a response. OpenTelemetry's experimental GenAI semantic conventions give us a starting vocabulary, but you'll extend it with agent-specific attributes.

Here's a trace wrapper that creates the full agent execution tree:

typescript
import { trace, SpanKind, SpanStatusCode, Span } from '@opentelemetry/api';
 
const tracer = trace.getTracer('agent-service', '1.0.0');
 
interface AgentContext {
  agentId: string;
  conversationId: string;
  turnIndex: number;
  userId?: string;
  workspaceId: string;
}
 
async function tracedAgentTurn<T>(
  ctx: AgentContext,
  handler: (span: Span) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(
    'agent.turn',
    {
      kind: SpanKind.SERVER,
      attributes: {
        'agent.id': ctx.agentId,
        'conversation.id': ctx.conversationId,
        'agent.turn.index': ctx.turnIndex,
        'user.id': ctx.userId ?? 'anonymous',
        'workspace.id': ctx.workspaceId,
      },
    },
    async (span) => {
      try {
        const result = await handler(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        span.recordException(error as Error);
        throw error;
      } finally {
        span.end();
      }
    },
  );
}
 
async function tracedLLMCall(
  model: string,
  inputTokens: number,
  handler: () => Promise<{ output: string; outputTokens: number }>,
): Promise<{ output: string; outputTokens: number }> {
  return tracer.startActiveSpan(
    'gen_ai.chat',
    {
      kind: SpanKind.CLIENT,
      attributes: {
        'gen_ai.system': 'openai',
        'gen_ai.request.model': model,
        'gen_ai.usage.input_tokens': inputTokens,
      },
    },
    async (span) => {
      try {
        const result = await handler();
        span.setAttribute('gen_ai.usage.output_tokens', result.outputTokens);
        span.setAttribute('gen_ai.response.model', model);
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        throw error;
      } finally {
        // End the span on success and failure alike so it's never leaked
        span.end();
      }
    },
  );
}
 
async function tracedToolCall(
  toolName: string,
  input: Record<string, unknown>,
  handler: () => Promise<unknown>,
): Promise<unknown> {
  return tracer.startActiveSpan(
    'agent.tool_call',
    {
      kind: SpanKind.CLIENT,
      attributes: {
        'tool.name': toolName,
        'tool.input': JSON.stringify(input),
      },
    },
    async (span) => {
      try {
        const result = await handler();
        span.setAttribute('tool.status', 'success');
        span.end();
        return result;
      } catch (error) {
        span.setAttribute('tool.status', 'error');
        span.setAttribute('tool.error', error instanceof Error ? error.message : 'Unknown');
        span.setStatus({ code: SpanStatusCode.ERROR });
        span.end();
        throw error;
      }
    },
  );
}

The key design decision here: every span in the tree shares the conversation.id and agent.id attributes. This lets you query for "every step in conversation X" or "every tool call agent Y made today" without joining across data sources. Datadog, which now natively supports OpenTelemetry GenAI semantic conventions (v1.37+), renders these traces as flame graphs where you can see exactly where time was spent and where failures originated.

Collection and Processing

Your collection layer needs to handle three signal types simultaneously: structured logs (events with context), distributed traces (span trees), and metrics (counters, histograms, gauges). Here's the processing layer that ties them together:

typescript
interface ObservabilityEvent {
  type: 'log' | 'metric' | 'eval';
  timestamp: Date;
  agentId: string;
  conversationId: string;
  data: Record<string, unknown>;
}
 
class ObservabilityPipeline {
  private buffer: ObservabilityEvent[] = [];
  private flushInterval: NodeJS.Timeout;
 
  constructor(
    private readonly batchSize = 100,
    private readonly flushMs = 5000,
    private readonly sinks: Array<(events: ObservabilityEvent[]) => Promise<void>> = [],
  ) {
    this.flushInterval = setInterval(() => void this.flush(), flushMs);
  }
 
  emit(event: ObservabilityEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.batchSize) {
      void this.flush();
    }
  }
 
  private async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
 
    const batch = this.buffer.splice(0, this.batchSize);
    await Promise.allSettled(
      this.sinks.map(sink => sink(batch))
    );
  }
 
  async shutdown(): Promise<void> {
    clearInterval(this.flushInterval);
    await this.flush();
  }
}

Batch processing is essential. Sending every event individually creates backpressure that can slow down your agent's response path. Buffer events, flush in batches every 5 seconds or every 100 events (whichever comes first), and use Promise.allSettled so a failing sink doesn't block the others.

Structured Logging for Agents

Traditional application logs are lines of text. Agent logs need to be structured, queryable, and causally connected — you need to reconstruct what the agent was thinking, not just what it did.

The difference is between this:

text
INFO: Agent processed user request in 2.3s

And this:

typescript
interface AgentLogEntry {
  level: 'debug' | 'info' | 'warn' | 'error';
  timestamp: string;
  traceId: string;
  spanId: string;
  agentId: string;
  conversationId: string;
  turnIndex: number;
  event: string;
  step: {
    type: 'planning' | 'llm_call' | 'tool_call' | 'memory' | 'response';
    name: string;
    durationMs: number;
    metadata: Record<string, unknown>;
  };
}

Every log entry carries a traceId and conversationId that lets you reconstruct the full sequence. Here's a logger that enforces this structure:

typescript
class AgentLogger {
  constructor(
    private agentId: string,
    private conversationId: string,
    private traceId: string,
    private turnIndex = 0,
  ) {}
 
  private log(level: AgentLogEntry['level'], event: string, step: AgentLogEntry['step']): void {
    const entry: AgentLogEntry = {
      level,
      timestamp: new Date().toISOString(),
      traceId: this.traceId,
      spanId: trace.getActiveSpan()?.spanContext().spanId ?? 'unknown',
      agentId: this.agentId,
      conversationId: this.conversationId,
      turnIndex: this.turnIndex,
      event,
      step,
    };
 
    // Emit as structured JSON — your log aggregator (ELK, Datadog, CloudWatch)
    // can parse, index, and query these fields directly
    process.stdout.write(JSON.stringify(entry) + '\n');
  }
 
  planningStart(plan: string): void {
    this.log('info', 'agent.planning.start', {
      type: 'planning',
      name: 'plan_generation',
      durationMs: 0,
      metadata: { plan },
    });
  }
 
  toolCallStart(toolName: string, input: Record<string, unknown>): void {
    this.log('info', 'agent.tool.start', {
      type: 'tool_call',
      name: toolName,
      durationMs: 0,
      metadata: { input },
    });
  }
 
  toolCallEnd(toolName: string, durationMs: number, status: string): void {
    this.log(status === 'error' ? 'error' : 'info', 'agent.tool.end', {
      type: 'tool_call',
      name: toolName,
      durationMs,
      metadata: { status },
    });
  }
 
  llmCallEnd(model: string, durationMs: number, tokens: { input: number; output: number }): void {
    this.log('info', 'agent.llm.end', {
      type: 'llm_call',
      name: model,
      durationMs,
      metadata: { tokens },
    });
  }
 
  responseGenerated(responseLength: number, durationMs: number): void {
    this.log('info', 'agent.response.generated', {
      type: 'response',
      name: 'final_response',
      durationMs,
      metadata: { responseLength },
    });
  }
}

This might feel like over-engineering for a single agent. It isn't. The moment you have two agents in production — or one agent with a bug you can't reproduce — you'll wish every event was structured and correlated. The structured format also enables automated analysis: you can write queries like "show me all conversations where a tool call failed and the quality score was below 3" without parsing text.

A critical rule: never log the full user message or agent response in production by default. Log a hash or a truncated preview. Full content goes to a separate, access-controlled store for quality evaluation. This keeps your primary log pipeline fast and avoids accidentally indexing PII in your search infrastructure.
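A minimal sketch of that hash-plus-preview pattern using Node's crypto module (the 16-hex-character hash and 80-character preview length are arbitrary choices, and the function name is mine):

```typescript
import { createHash } from 'node:crypto';

// Log a stable content hash plus a short preview instead of the full message.
// The hash lets you correlate log entries with the access-controlled content
// store without indexing PII in your search infrastructure.
function safeContentRef(content: string, previewChars = 80): { hash: string; preview: string } {
  const hash = createHash('sha256').update(content, 'utf8').digest('hex').slice(0, 16);
  const preview =
    content.length <= previewChars ? content : content.slice(0, previewChars) + '…';
  return { hash, preview };
}
```

Drop the result into the metadata field of your structured log entries in place of raw content.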

Alerting Strategies That Actually Work

The goal of alerting isn't to tell you about every anomaly. It's to wake you up for real incidents and send everything else to a next-business-day queue. Most teams get this backwards — they set tight thresholds on individual metrics and drown in noise.

Composite Signals Over Single Metrics

A single metric spike is usually noise. A correlated shift across multiple metrics is usually a real problem. Build your alerts around signal combinations:

typescript
interface AlertRule {
  name: string;
  severity: 'critical' | 'warning' | 'info';
  conditions: AlertCondition[];
  requiredConditions: number;    // How many conditions must fire
  windowMinutes: number;
  cooldownMinutes: number;
}
 
interface AlertCondition {
  metric: string;
  operator: 'gt' | 'lt' | 'change_pct_gt';
  threshold: number;
}
 
const alertRules: AlertRule[] = [
  {
    name: 'agent_quality_degradation',
    severity: 'critical',
    conditions: [
      { metric: 'quality_score_avg', operator: 'lt', threshold: 3.5 },
      { metric: 'token_usage_per_turn', operator: 'change_pct_gt', threshold: 35 },
      { metric: 'escalation_rate', operator: 'change_pct_gt', threshold: 20 },
    ],
    requiredConditions: 2,        // Any 2 of 3 must fire
    windowMinutes: 30,
    cooldownMinutes: 120,
  },
  {
    name: 'tool_infrastructure_failure',
    severity: 'critical',
    conditions: [
      { metric: 'tool_error_rate', operator: 'gt', threshold: 0.10 },
      { metric: 'tool_p95_latency_seconds', operator: 'gt', threshold: 8 },
    ],
    requiredConditions: 1,        // Either condition alone is concerning
    windowMinutes: 10,
    cooldownMinutes: 60,
  },
  {
    name: 'cost_anomaly',
    severity: 'warning',
    conditions: [
      { metric: 'daily_token_cost', operator: 'change_pct_gt', threshold: 50 },
      { metric: 'avg_tokens_per_conversation', operator: 'change_pct_gt', threshold: 40 },
    ],
    requiredConditions: 1,
    windowMinutes: 1440,          // Daily check
    cooldownMinutes: 1440,
  },
  {
    name: 'drift_detected',
    severity: 'warning',
    conditions: [
      { metric: 'quality_score_weekly_change', operator: 'lt', threshold: -0.3 },
      { metric: 'response_length_weekly_change_pct', operator: 'change_pct_gt', threshold: 30 },
    ],
    requiredConditions: 1,
    windowMinutes: 10080,         // Weekly check
    cooldownMinutes: 10080,
  },
];

Notice the requiredConditions field. For quality degradation, we require two of three signals to fire before paging anyone. A quality score dip alone might be a sampling artifact. A quality score dip plus a spike in token usage plus rising escalation rates — that's a real incident.
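A minimal sketch of how such a composite rule could be evaluated, assuming each metric has already been aggregated over the rule's window. `ruleFires`, `conditionFires`, and `readings` are illustrative names, and the `AlertCondition` shape is repeated from above to keep the sketch self-contained; for `change_pct_gt`, the reading is assumed to be the percentage change over the window.

```typescript
interface AlertCondition {
  metric: string;
  operator: 'gt' | 'lt' | 'change_pct_gt';
  threshold: number;
}

// Does a single condition fire for the observed value?
function conditionFires(c: AlertCondition, value: number): boolean {
  switch (c.operator) {
    case 'gt': return value > c.threshold;
    case 'lt': return value < c.threshold;
    case 'change_pct_gt': return value > c.threshold; // value is % change
  }
}

// A composite rule fires only when enough of its conditions fire together.
function ruleFires(
  conditions: AlertCondition[],
  requiredConditions: number,
  readings: Record<string, number>,
): boolean {
  const fired = conditions.filter(
    c => c.metric in readings && conditionFires(c, readings[c.metric]),
  ).length;
  return fired >= requiredConditions;
}
```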

Tiered Response

Not every alert needs the same response. Structure your alerts into three tiers:

| Tier | Response Time | Examples | Channel |
| --- | --- | --- | --- |
| Critical (P1) | Immediate, pages on-call | Quality below threshold + tool failures, complete agent outage | PagerDuty/Opsgenie |
| Warning (P2) | Same business day | Cost anomaly, single-metric drift, elevated error rate | Slack channel |
| Informational (P3) | Weekly review | Slow trends, minor drift signals, new topic clusters | Dashboard/email digest |

Threshold Calibration

Don't set thresholds from industry benchmarks. They're meaningless for your specific agent. Instead:

  1. Run for two weeks with monitoring enabled but no alerts
  2. Calculate your mean and standard deviation for each metric
  3. Set warning thresholds at 2 standard deviations from the baseline mean, critical at 3
  4. Review and adjust monthly as your baseline evolves

This approach means your first two weeks in production are "fly-by-instruments" with manual review. That's the correct tradeoff. Premature alerting causes alert fatigue, and alert fatigue causes missed incidents.
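Steps 2 and 3 above can be sketched as a hypothetical `calibrateThresholds` helper. The `direction` parameter is an assumption not spelled out in the steps: it covers metrics where high values are bad (latency, error rate) versus metrics where low values are bad (quality scores).

```typescript
// Derive warning/critical thresholds from a baseline of metric samples
// collected during the no-alerts observation period.
function calibrateThresholds(
  samples: number[],
  direction: 'upper' | 'lower',
): { warning: number; critical: number } {
  const mean = samples.reduce((s, x) => s + x, 0) / samples.length;
  const variance =
    samples.reduce((s, x) => s + (x - mean) ** 2, 0) / samples.length;
  const sd = Math.sqrt(variance);
  const sign = direction === 'upper' ? 1 : -1;
  return {
    warning: mean + sign * 2 * sd,   // alert at 2 standard deviations
    critical: mean + sign * 3 * sd,  // page at 3 standard deviations
  };
}
```

Re-running this monthly against a fresh sample window is the "review and adjust" step from the list above.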

Putting It All Together: The Observability Loop

Observability isn't a one-time setup — it's a feedback loop. Monitoring surfaces issues, evaluation quantifies them, and the insights feed back into agent improvement.

Deploy Agent → Monitor Production → Detect Anomaly → Evaluate Quality → Diagnose Root Cause → Fix (Prompt / Tool / Data) → Regression Test → back to Deploy

Here's what that loop looks like in practice for a real incident:

Day 1: Drift detector flags a 0.4-point quality decline in accuracy scores over the past week. Severity: warning. No page, shows up in the weekly review dashboard.

Day 2: Engineer reviews the flagged conversations. Quality evaluations show the agent is giving outdated pricing information — a knowledge base article was updated but the agent's retrieval is pulling cached versions.

Day 3: Fix deployed — KB cache TTL reduced, retrieval pipeline flushed. Regression tests confirm the fix. Quality scores recover within 24 hours.

Without observability, this issue would've continued for weeks. Users would've gotten wrong pricing. Support tickets would've piled up. Someone would've eventually noticed and filed a bug. The fix itself was trivial. The detection was everything.

The Monitoring Stack Decision

You don't need to build all of this from scratch. The ecosystem has matured significantly. Here's how the tooling breaks down:

| Layer | Build vs. Buy | Reasoning |
| --- | --- | --- |
| Traces | Use OpenTelemetry + existing backend | OTel is the standard; send to Datadog, Jaeger, or Grafana Tempo |
| Metrics | Use Prometheus/Grafana or your cloud provider | Commodity infrastructure |
| Structured logs | Build the schema, use existing aggregation | Your log format is custom; ELK/CloudWatch/Datadog for storage |
| Quality evaluation | Build or use specialized tooling | Rubrics are domain-specific; platforms like Chanl's scorecards or Langfuse can accelerate this |
| Drift detection | Build | Too specific to your agent's behavior for off-the-shelf tools |
| Alerting | Build composite rules on existing alerting infra | PagerDuty/Opsgenie for routing; your rules for logic |

For teams running production voice or chat agents, platforms that integrate monitoring, analytics, and memory management into a unified pipeline reduce the operational overhead of wiring these layers together manually.

Advanced Patterns: What Comes Next

Once your core observability pipeline is running, three advanced patterns become possible.

Conversation-Level Analytics

Move beyond per-turn metrics to conversation-level analysis. How many turns does it take to resolve a request? Where do conversations get stuck? Which topics have the highest escalation rates?

typescript
interface ConversationAnalytics {
  conversationId: string;
  agentId: string;
  metrics: {
    turnCount: number;
    totalDurationSeconds: number;
    totalTokens: number;
    totalCost: number;
    toolCallCount: number;
    toolFailureCount: number;
    topic: string;
    sentiment: 'positive' | 'neutral' | 'negative';
    outcome: 'resolved' | 'escalated' | 'abandoned';
    qualityScore: number;
  };
}
 
function aggregateConversationMetrics(
  conversations: ConversationAnalytics[],
): Record<string, {
  count: number;
  avgTurns: number;
  resolutionRate: number;
  avgQuality: number;
  avgCost: number;
}> {
  const byTopic: Record<string, ConversationAnalytics[]> = {};
 
  for (const conv of conversations) {
    const topic = conv.metrics.topic;
    if (!byTopic[topic]) byTopic[topic] = [];
    byTopic[topic].push(conv);
  }
 
  const result: Record<
    string,
    { count: number; avgTurns: number; resolutionRate: number; avgQuality: number; avgCost: number }
  > = {};
  for (const [topic, convs] of Object.entries(byTopic)) {
    const resolved = convs.filter(c => c.metrics.outcome === 'resolved').length;
    result[topic] = {
      count: convs.length,
      avgTurns: convs.reduce((s, c) => s + c.metrics.turnCount, 0) / convs.length,
      resolutionRate: resolved / convs.length,
      avgQuality: convs.reduce((s, c) => s + c.metrics.qualityScore, 0) / convs.length,
      avgCost: convs.reduce((s, c) => s + c.metrics.totalCost, 0) / convs.length,
    };
  }
 
  return result;
}

This analysis reveals that your agent resolves billing questions in 2.3 turns on average but takes 6.8 turns for technical troubleshooting — and troubleshooting conversations cost 4x more in tokens. That's actionable intelligence for where to invest in better tools or retrieval.

Regression Testing from Production Data

Your production observability data is a goldmine for regression testing. Export flagged conversations (quality score below threshold) as test cases, and run them through your eval framework before every deploy.

typescript
async function exportRegressionCases(
  evaluations: QualityEvaluation[],
  threshold: number,
): Promise<Array<{ input: string; expectedBehavior: string; failureReason: string }>> {
  return evaluations
    .filter(e => {
      const scores = Object.values(e.scores);
      const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;
      return avgScore < threshold;
    })
    .map(e => ({
      input: e.conversationId, // Resolve to actual conversation data
      expectedBehavior: `All quality scores >= ${threshold}`,
      failureReason: e.reasoning,
    }));
}

This creates a growing test suite that's grounded in real production failures — not hypothetical edge cases. Every bad conversation becomes a regression test that prevents the same failure from recurring.
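One possible shape for the pre-deploy gate itself, with `runAgent` and `scoreResponse` standing in for your own agent handler and evaluation rubric — both are assumed interfaces for illustration, not real APIs.

```typescript
interface RegressionCase {
  input: string;
  expectedBehavior: string;
  failureReason: string;
}

// Hypothetical pre-deploy gate: replay exported regression cases through the
// candidate build and fail the deploy if any previously-bad case regresses.
async function runRegressionGate(
  cases: RegressionCase[],
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (input: string, output: string) => Promise<number>,
  threshold: number,
): Promise<{ passed: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const c of cases) {
    const output = await runAgent(c.input);
    const score = await scoreResponse(c.input, output);
    if (score < threshold) failures.push(c.input);
  }
  return { passed: failures.length === 0, failures };
}
```

Wiring this into CI means a known-bad conversation can never silently resurface.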

Model Comparison in Production

When you're considering a model switch (or your provider releases an update), shadow-test the new model against production traffic:

typescript
interface ShadowTestResult {
  conversationId: string;
  primaryModel: { model: string; latencyMs: number; tokens: number; qualityScore: number };
  shadowModel: { model: string; latencyMs: number; tokens: number; qualityScore: number };
}
 
async function shadowTest(
  conversation: { role: string; content: string }[],
  primaryHandler: () => Promise<{ output: string; latencyMs: number; tokens: number }>,
  shadowHandler: () => Promise<{ output: string; latencyMs: number; tokens: number }>,
): Promise<ShadowTestResult> {
  // Start the shadow model first so it runs concurrently with the primary;
  // in production, only the primary result is returned to the user, so the
  // shadow call adds no user-facing latency
  const shadowPromise = shadowHandler();

  // Primary model serves the user (blocking)
  const primary = await primaryHandler();

  // Collect the shadow result for offline comparison
  const shadow = await shadowPromise;
 
  // Evaluate both
  const [primaryQuality, shadowQuality] = await Promise.all([
    evaluateResponse(conversation, primary.output),
    evaluateResponse(conversation, shadow.output),
  ]);
 
  return {
    conversationId: crypto.randomUUID(),
    primaryModel: { model: 'current', ...primary, qualityScore: primaryQuality },
    shadowModel: { model: 'candidate', ...shadow, qualityScore: shadowQuality },
  };
}

After a week of shadow testing, you'll have statistically significant quality, latency, and cost comparisons — grounded in your actual traffic patterns, not synthetic benchmarks.
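One way to summarize those comparisons is as paired deltas per conversation, which controls for conversation difficulty better than comparing unpaired averages. `summarizeShadowDeltas` is an illustrative helper operating on a simplified view of the `ShadowTestResult` records above.

```typescript
// Summarize shadow-test results as candidate-minus-current quality deltas.
function summarizeShadowDeltas(
  results: Array<{ primaryQuality: number; shadowQuality: number }>,
): { meanDelta: number; winRate: number } {
  const deltas = results.map(r => r.shadowQuality - r.primaryQuality);
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / deltas.length;
  // Fraction of conversations where the candidate scored strictly higher
  const wins = deltas.filter(d => d > 0).length;
  return { meanDelta, winRate: wins / deltas.length };
}
```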

Common Pitfalls and How to Avoid Them

Before wrapping up, here are the mistakes that catch teams most often.

Monitoring averages instead of percentiles. Average latency of 1.2 seconds feels fine. But if your p99 is 18 seconds, 1% of your users are having a terrible experience. Always track p50, p95, and p99.
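A simple nearest-rank percentile over a latency window looks like the sketch below — fine for a dashboard, though production metric stores typically use histogram-based estimates rather than sorting raw samples.

```typescript
// Nearest-rank percentile: sort the window, take the ceil(p% * n)-th value.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('empty window');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```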

Over-relying on error rates. An agent can have a 0% error rate and still be consistently wrong. Errors catch infrastructure failures. Quality evaluation catches behavioral failures. You need both.

Evaluating too late. Teams that add quality evaluation after their first production incident spend weeks building what should've been there from day one. Instrument quality scoring from the start, even if you only sample 5%.

Setting static thresholds. Your agent's behavior changes as it handles new topics, as you update prompts, and as models evolve. Re-baseline your thresholds monthly. A static threshold set at launch will either alert constantly or miss everything within six months.

Ignoring cost signals. A 40% jump in tokens per conversation isn't just a budget issue — it's often the first signal of prompt drift or retrieval degradation. Treat cost anomalies as behavioral signals, not just financial ones.

Logging everything at full fidelity. Full prompt and response logging at 100% sampling creates a storage problem, a privacy problem, and a query performance problem. Log metadata at 100%, full content at 5-10%, and keep the two streams separate.
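One way to implement the split-fidelity sampling is to hash the conversation ID, so that all turns of a sampled conversation are kept or dropped together instead of sampling individual turns at random. `shouldLogFullContent` is an illustrative helper built on Node's crypto module.

```typescript
import { createHash } from 'node:crypto';

// Deterministic sampling: the same conversation always lands in the same
// bucket, so sampled conversations are logged in full, end to end.
function shouldLogFullContent(conversationId: string, sampleRate: number): boolean {
  const hash = createHash('sha256').update(conversationId).digest();
  // Interpret the first 4 bytes as a uniform value in [0, 1)
  const bucket = hash.readUInt32BE(0) / 0x100000000;
  return bucket < sampleRate;
}
```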

Where This Goes from Here

Agent observability is still a young field. OpenTelemetry's GenAI semantic conventions are experimental. The evaluation tooling ecosystem is fragmented. Most teams are still stitching together monitoring from pieces.

But the direction is clear. The teams that invest in observability early — structured traces, automated quality evaluation, drift detection, composite alerting — are the ones whose agents improve over time instead of silently degrading. The observability pipeline isn't overhead. It's the feedback mechanism that turns a deployed agent into a continuously improving system.

If you're building agents with prompt engineering techniques and connecting them to tools, the observability layer is what makes those investments compound. Without it, you're shipping code into a black box and hoping for the best.

Start with the four core metrics: latency percentiles, tool success rates, token usage, and sampled quality scores. Add drift detection after your first two weeks. Build composite alerts after your first month. By then, you'll have enough baseline data to set thresholds that actually mean something — and enough production experience to know which signals matter most for your specific agents.

