
AI Agent Observability: What to Monitor When Your Agent Goes Live

Build a production observability pipeline for AI agents. Covers latency, token usage, tool success rates, conversation quality, drift detection, structured logging, alerting strategies, and the critical difference between LLM and agent observability.

Lucas Dalamarta, Engineering Lead
March 10, 2026
28 min read
Watercolor illustration of an engineering team monitoring AI agent dashboards with data flowing across screens

Your agent passed every test in staging. The prompts are polished, the tools work, the latency looks great. You deploy to production, and for the first three days everything runs smoothly. On day four, a customer reports getting wildly incorrect pricing information. You check the logs — no errors, no exceptions, no alerts fired. The agent was technically healthy the entire time. It was just wrong.

This is the observability gap that catches most teams. Traditional monitoring answers "is it running?" Agent observability answers "is it behaving correctly?" And those are fundamentally different questions. According to Gartner, over 40% of agentic AI projects will be canceled by the end of 2027, with inadequate risk controls and unclear operational visibility among the top reasons. PwC's 2025 AI Agent Survey found that while 79% of organizations have adopted AI agents, the majority struggle to trace failures through multi-step workflows.

This guide builds a production observability pipeline from the ground up. We'll cover every metric that matters, build structured logging and tracing in TypeScript, set up alerting that doesn't drown you in noise, and address the specific challenges that make agent observability different from anything you've monitored before.

| What you'll learn | Why it matters |
| --- | --- |
| The 5 pillars of agent observability | Know exactly what to measure before writing a single line of instrumentation |
| Structured logging for agents | Build traceable, queryable logs that connect reasoning steps to outcomes |
| Tool call monitoring | Detect both hard failures and subtle misuse before users notice |
| Drift detection pipeline | Catch gradual degradation that averages and error rates miss |
| Alerting without fatigue | Composite signals that page you when it matters, stay quiet when it doesn't |

Prerequisites

You'll need Node.js 20+, TypeScript 5+, and familiarity with basic observability concepts (logs, metrics, traces). Experience with AI agents — even a simple one — will make the examples more concrete. If you're new to agent tool infrastructure, AI Agent Tools: MCP, OpenAPI, and Tool Management That Actually Scales covers the foundational patterns we'll be instrumenting here.

The code examples use TypeScript throughout and are framework-agnostic. You'll also want these dependencies installed:

bash
npm install @opentelemetry/api @opentelemetry/sdk-node prom-client zod

We'll reference OpenTelemetry's experimental GenAI semantic conventions where applicable, but the patterns work with any tracing backend.

LLM Observability vs. Agent Observability: Why the Distinction Matters

LLM observability watches a single model call — prompt in, completion out, how long it took, how many tokens it consumed. Agent observability watches an entire reasoning workflow that might span dozens of model calls, tool executions, memory lookups, and planning decisions, all connected in a causal chain.

Here's why this distinction isn't academic. Consider a customer support agent that handles a refund request. In a single conversation turn, it might:

  1. Retrieve the customer's order history (memory lookup)
  2. Check the return policy for the product category (knowledge base query)
  3. Verify the order is within the return window (tool call)
  4. Calculate the refund amount (LLM reasoning)
  5. Process the refund (tool call)
  6. Generate a confirmation message (LLM completion)

LLM observability sees steps 4 and 6 as two isolated model calls. Agent observability sees one workflow with six connected steps, where a failure in step 2 (wrong policy retrieved) cascades into step 4 (incorrect calculation) and step 6 (confidently wrong confirmation). The model calls themselves were fast and well-formed. The agent behavior was broken.

User Request → Memory Lookup → KB Query → Tool: Check Window → LLM: Calculate → Tool: Process Refund → LLM: Confirm

The table below maps the key differences across every dimension that matters for production monitoring:

| Dimension | LLM Observability | Agent Observability |
| --- | --- | --- |
| Scope | Single model call | Multi-step workflow |
| Latency | Time-to-first-token, completion time | End-to-end turn latency across all steps |
| Errors | API failures, rate limits | Cascading failures across reasoning chain |
| Quality | Output relevance for one prompt | Task completion across full conversation |
| Cost | Tokens per call | Tokens + tool calls + memory lookups per conversation |
| Traces | Request/response pairs | Directed acyclic graph of decisions |
| Drift | Output distribution shift | Behavioral pattern change over time |

As Arize AI's research puts it, agent observability provides "the connective tissue to reconstruct the agent's path from the initial prompt to the outcome." Without that connective tissue, you're debugging in the dark.

The Five Pillars of Agent Observability

Every production agent needs monitoring across five dimensions. Miss one, and you'll have a blind spot that bites you at 2 AM. Let's walk through each pillar, what to measure, and what the numbers actually mean.

Pillar 1: Latency — Beyond Averages

Track p50, p95, and p99 latency for every step in your agent's workflow, not just the total. A multi-step agent might respond in 800ms most of the time but take 15 seconds on complex queries that trigger multiple tool calls. The average looks healthy. The users hitting that tail don't agree.

Here's a latency collector that captures per-step timing across the full agent execution:

typescript
import { Histogram } from 'prom-client';
 
const agentLatency = new Histogram({
  name: 'agent_turn_duration_seconds',
  help: 'End-to-end latency for a single agent turn',
  labelNames: ['agent_id', 'step_type', 'model'],
  buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
});
 
const stepLatency = new Histogram({
  name: 'agent_step_duration_seconds',
  help: 'Latency for individual agent steps',
  labelNames: ['agent_id', 'step_type', 'step_name'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
 
interface StepTiming {
  stepType: 'llm_call' | 'tool_call' | 'memory_lookup' | 'kb_query';
  stepName: string;
  startTime: number;
  endTime?: number;
}
 
class LatencyTracker {
  private steps: StepTiming[] = [];
  private turnStart: number;
 
  constructor(private agentId: string, private model: string) {
    this.turnStart = performance.now();
  }
 
  startStep(stepType: StepTiming['stepType'], stepName: string): () => void {
    const step: StepTiming = {
      stepType,
      stepName,
      startTime: performance.now(),
    };
    this.steps.push(step);
 
    // Return a function to end this step
    return () => {
      step.endTime = performance.now();
      const durationSec = (step.endTime - step.startTime) / 1000;
      stepLatency.observe(
        { agent_id: this.agentId, step_type: stepType, step_name: stepName },
        durationSec,
      );
    };
  }
 
  finishTurn(): { totalMs: number; steps: StepTiming[] } {
    const totalMs = performance.now() - this.turnStart;
    agentLatency.observe(
      { agent_id: this.agentId, step_type: 'full_turn', model: this.model },
      totalMs / 1000,
    );
    return { totalMs, steps: this.steps };
  }
}

What matters here isn't just the numbers — it's the segmentation. A p99 spike in tool_call steps while llm_call stays flat tells you the problem is in your external integrations, not your model. A p95 increase in llm_call with stable tool latency suggests model degradation or prompt bloat.

Thresholds worth watching:

| Metric | Healthy | Investigate | Alert |
| --- | --- | --- | --- |
| p50 turn latency | < 2s | 2-5s | > 5s |
| p95 turn latency | < 5s | 5-10s | > 10s |
| p99 turn latency | < 10s | 10-20s | > 20s |
| Step-to-step variance | < 3x p50 | 3-5x p50 | > 5x p50 |

These are starting points. Your actual thresholds should come from your own baseline data after two weeks of production traffic.
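One way to derive those baselines from your own traffic: collect two weeks of raw turn latencies and compute the percentiles directly. A minimal sketch assuming linear interpolation between sorted samples (Prometheus histograms would approximate via buckets instead; the function names are mine):

```typescript
// Compute a percentile from raw latency samples using linear interpolation.
// Run this over ~2 weeks of production turn latencies to set your own thresholds.
function percentile(samples: number[], q: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = (sorted.length - 1) * q;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  if (lo === hi) return sorted[lo];
  // Interpolate between the two nearest samples
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function baselineThresholds(turnLatenciesMs: number[]) {
  return {
    p50: percentile(turnLatenciesMs, 0.5),
    p95: percentile(turnLatenciesMs, 0.95),
    p99: percentile(turnLatenciesMs, 0.99),
  };
}
```

Feed the result into your alert rules instead of hard-coding the table above.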

Pillar 2: Token Usage and Cost

Token consumption is both a cost signal and a behavioral signal. Sudden spikes in tokens per conversation often indicate the agent is hedging, repeating itself, or pulling too much context — all symptoms of deeper issues.

This tracker captures token metrics per conversation with enough granularity to pinpoint where tokens are being spent:

typescript
interface TokenMetrics {
  conversationId: string;
  agentId: string;
  turns: Array<{
    turnIndex: number;
    inputTokens: number;
    outputTokens: number;
    model: string;
    toolCallCount: number;
    memoryTokens: number;
    kbContextTokens: number;
  }>;
}
 
function analyzeTokenUsage(metrics: TokenMetrics): {
  totalCost: number;
  avgTokensPerTurn: number;
  contextRatio: number;
  anomalies: string[];
} {
  const anomalies: string[] = [];
  let totalInput = 0;
  let totalOutput = 0;
  let totalContext = 0;
 
  for (const turn of metrics.turns) {
    totalInput += turn.inputTokens;
    totalOutput += turn.outputTokens;
    totalContext += turn.memoryTokens + turn.kbContextTokens;
 
    // Flag turns where context exceeds 60% of input (guard against zero-token turns)
    const contextShare =
      turn.inputTokens > 0 ? (turn.memoryTokens + turn.kbContextTokens) / turn.inputTokens : 0;
    if (contextShare > 0.6) {
      anomalies.push(
        `Turn ${turn.turnIndex}: context is ${(contextShare * 100).toFixed(0)}% of input — consider trimming retrieval`
      );
    }
 
    // Flag unusually long outputs (hedging signal)
    if (turn.outputTokens > 800 && turn.toolCallCount === 0) {
      anomalies.push(
        `Turn ${turn.turnIndex}: ${turn.outputTokens} output tokens with no tool calls — possible hedging`
      );
    }
  }
 
  // Cost calculation (adjust rates per model)
  const MODEL_COSTS: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 2.50, output: 10.00 },        // per 1M tokens
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-sonnet-4-20250514': { input: 3.00, output: 15.00 },
  };
 
  let totalCost = 0;
  for (const turn of metrics.turns) {
    const rates = MODEL_COSTS[turn.model] ?? MODEL_COSTS['gpt-4o-mini'];
    totalCost += (turn.inputTokens * rates.input + turn.outputTokens * rates.output) / 1_000_000;
  }
 
  return {
    totalCost,
    avgTokensPerTurn: (totalInput + totalOutput) / metrics.turns.length,
    contextRatio: totalContext / totalInput,
    anomalies,
  };
}

A mid-sized deployment handling 1,000 daily conversations with multi-turn interactions can consume 5 to 10 million tokens per month. At GPT-4o rates, that's $25 to $125 per month for a light workload — but costs scale non-linearly when reasoning tokens, retrieval context, and retries stack up. Teams routinely pass four to eight full documents into context when a few targeted chunks would suffice. Monitoring token distribution by source (system prompt, memory, retrieval, user input) reveals exactly where to cut.
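As a sketch of that source-level breakdown (the systemTokens and userTokens fields are assumed additions to the turn shape above, not part of the earlier TokenMetrics interface):

```typescript
// Break input tokens down by source so you can see where context bloat lives.
interface TurnTokenBreakdown {
  systemTokens: number;
  memoryTokens: number;
  kbContextTokens: number;
  userTokens: number;
}

function tokenShareBySource(turns: TurnTokenBreakdown[]): Record<string, number> {
  const totals = { system: 0, memory: 0, retrieval: 0, user: 0 };
  for (const t of turns) {
    totals.system += t.systemTokens;
    totals.memory += t.memoryTokens;
    totals.retrieval += t.kbContextTokens;
    totals.user += t.userTokens;
  }
  const grand = totals.system + totals.memory + totals.retrieval + totals.user;
  const shares: Record<string, number> = {};
  for (const [source, count] of Object.entries(totals)) {
    // Each share is that source's fraction of all input tokens
    shares[source] = grand === 0 ? 0 : count / grand;
  }
  return shares;
}
```

If retrieval is consistently above half of your input tokens, that's usually the first place to trim.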

Pillar 3: Tool Call Success Rates

Your agent's tools are its connection to the real world, and they're the most common point of failure in production. Track both hard failures (exceptions, timeouts) and soft failures (the tool returned data, but the agent misused it or the data was stale).

Here's a tool monitoring wrapper that captures everything you need:

typescript
import { Counter, Histogram } from 'prom-client';
 
const toolCallCounter = new Counter({
  name: 'agent_tool_calls_total',
  help: 'Total tool call attempts',
  labelNames: ['agent_id', 'tool_name', 'status'],
});
 
const toolLatencyHist = new Histogram({
  name: 'agent_tool_call_duration_seconds',
  help: 'Tool call execution time',
  labelNames: ['agent_id', 'tool_name'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
 
interface ToolCallRecord {
  toolName: string;
  agentId: string;
  conversationId: string;
  input: Record<string, unknown>;
  output?: unknown;
  error?: string;
  durationMs: number;
  status: 'success' | 'error' | 'timeout' | 'invalid_input';
  timestamp: Date;
}
 
async function monitoredToolCall<T>(
  agentId: string,
  conversationId: string,
  toolName: string,
  input: Record<string, unknown>,
  executor: (input: Record<string, unknown>) => Promise<T>,
  timeoutMs = 5000,
): Promise<{ result: T | null; record: ToolCallRecord }> {
  const start = performance.now();
  let record: ToolCallRecord;
 
  try {
    const result = await Promise.race([
      executor(input),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Tool call timeout')), timeoutMs)
      ),
    ]);
 
    const durationMs = performance.now() - start;
    record = {
      toolName, agentId, conversationId, input,
      output: result,
      durationMs,
      status: 'success',
      timestamp: new Date(),
    };
 
    toolCallCounter.inc({ agent_id: agentId, tool_name: toolName, status: 'success' });
    toolLatencyHist.observe({ agent_id: agentId, tool_name: toolName }, durationMs / 1000);
 
    return { result, record };
  } catch (err) {
    const durationMs = performance.now() - start;
    const isTimeout = err instanceof Error && err.message === 'Tool call timeout';
 
    record = {
      toolName, agentId, conversationId, input,
      error: err instanceof Error ? err.message : String(err),
      durationMs,
      status: isTimeout ? 'timeout' : 'error',
      timestamp: new Date(),
    };
 
    toolCallCounter.inc({
      agent_id: agentId,
      tool_name: toolName,
      status: record.status,
    });
 
    return { result: null, record };
  }
}

The critical insight: a tool with a 99% success rate sounds great until you realize that tool is called three times per conversation on average. That's a 3% per-conversation failure rate — roughly one in every 33 customers hits a tool error. At 1,000 daily conversations, that's 30 broken interactions per day.
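That compounding math is worth encoding directly. A tiny helper (function names are mine) that turns a per-call success rate and calls-per-conversation into the numbers above, assuming independent failures:

```typescript
// Probability that at least one of n independent tool calls fails,
// given a per-call success rate: 1 - s^n.
function perConversationFailureRate(perCallSuccess: number, callsPerConversation: number): number {
  return 1 - Math.pow(perCallSuccess, callsPerConversation);
}

// Expected number of conversations per day that hit at least one tool error.
function brokenConversationsPerDay(
  perCallSuccess: number,
  callsPerConversation: number,
  dailyConversations: number,
): number {
  return dailyConversations * perConversationFailureRate(perCallSuccess, callsPerConversation);
}
```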

Tool health dashboard essentials:

| Metric | What it reveals |
| --- | --- |
| Success rate per tool (hourly) | Which tools are failing and when |
| p95 latency per tool | External dependency slowdowns |
| Call frequency per conversation | Is the agent over-calling tools? |
| Input validation failure rate | Schema mismatches between LLM output and tool expectations |
| Timeout rate vs. error rate | Network issues vs. logic errors |

If you've built your tools using the patterns from AI Agent Tools: MCP, OpenAPI, and Tool Management, each tool already has a schema. Monitoring input validation failures against that schema is one of the cheapest, highest-signal checks you can run.

Pillar 4: Conversation Quality

Latency and error rates tell you if the system is healthy. Quality scores tell you if the agent is actually helping people. This is the metric most teams add last and wish they'd added first.

Automated quality evaluation uses an LLM judge to score a sample of production conversations against a rubric. If you've already built an eval framework, you can reuse those same rubrics in production — just run them asynchronously on sampled traffic instead of blocking the response path.

Here's a production quality evaluator that runs alongside your agent without adding latency:

typescript
interface QualityEvaluation {
  conversationId: string;
  agentId: string;
  scores: {
    accuracy: number;       // 1-5: factual correctness
    completeness: number;   // 1-5: did the agent address the full request?
    tone: number;           // 1-5: appropriate, professional, empathetic
    policyAdherence: number; // 1-5: stayed within defined guardrails
    taskCompletion: number; // 1-5: did the user's goal get accomplished?
  };
  reasoning: string;
  flagged: boolean;
  evaluatedAt: Date;
}
 
async function evaluateConversation(
  conversation: { role: string; content: string }[],
  agentId: string,
  conversationId: string,
  policyContext: string,
): Promise<QualityEvaluation> {
  const judgePrompt = `You are evaluating an AI agent conversation for quality.
 
CONVERSATION:
${conversation.map(m => `${m.role}: ${m.content}`).join('\n')}
 
POLICY CONTEXT:
${policyContext}
 
Score each dimension from 1 (poor) to 5 (excellent):
 
1. ACCURACY: Is the information factually correct? Did the agent state anything false?
2. COMPLETENESS: Did the agent address all parts of the user's request?
3. TONE: Was the agent appropriately professional and empathetic?
4. POLICY ADHERENCE: Did the agent stay within the defined policies and guardrails?
5. TASK COMPLETION: Was the user's underlying goal accomplished?
 
Respond in JSON:
{
  "accuracy": <1-5>,
  "completeness": <1-5>,
  "tone": <1-5>,
  "policyAdherence": <1-5>,
  "taskCompletion": <1-5>,
  "reasoning": "<2-3 sentences explaining scores>",
  "flagged": <true if any score is 2 or below>
}`;
 
  // Call your judge model here (implementation depends on your LLM client)
  const judgeResponse = await callJudgeModel(judgePrompt);
  const scores = JSON.parse(judgeResponse);
 
  return {
    conversationId,
    agentId,
    scores: {
      accuracy: scores.accuracy,
      completeness: scores.completeness,
      tone: scores.tone,
      policyAdherence: scores.policyAdherence,
      taskCompletion: scores.taskCompletion,
    },
    reasoning: scores.reasoning,
    flagged: scores.flagged,
    evaluatedAt: new Date(),
  };
}

Sample 5-10% of conversations for evaluation. That's enough to detect trends without blowing your API budget on judge calls. For high-stakes domains — healthcare, finance, legal — bump to 20-30%. Always sample randomly; cherry-picking "interesting" conversations introduces selection bias that makes your trend data worthless.
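One simple way to guarantee unbiased, reproducible sampling is to hash the conversation ID rather than call Math.random(). A sketch using Node's built-in crypto module (the function name is mine):

```typescript
import { createHash } from 'node:crypto';

// Decide whether to evaluate a conversation, deterministically and without
// selection bias: hash the conversation ID into a uniform [0, 1) bucket and
// compare against the sample rate. The same ID always gets the same decision,
// which keeps re-runs and backfills stable.
function shouldEvaluate(conversationId: string, sampleRate: number): boolean {
  const digest = createHash('sha256').update(conversationId).digest();
  const bucket = digest.readUInt32BE(0) / 2 ** 32; // uniform in [0, 1)
  return bucket < sampleRate;
}
```

Raising the rate later (say, 5% to 20%) is a superset of the earlier sample, so trend comparisons stay valid.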

Track rolling averages per dimension. A 0.3-point decline in accuracy over two weeks is meaningful, even if every other dimension holds steady. That single-dimension drop tells you the agent is drifting on factual content while maintaining good tone — a pattern that's invisible in an aggregate quality score. If you want to go deeper on building quality rubrics, the eval framework guide covers multi-criteria scoring in detail.
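A rolling-average check like that takes only a few lines. A sketch (the window size and 0.3-point threshold are illustrative, not prescriptive):

```typescript
// Rolling mean over a sliding window of daily scores for one quality dimension.
function rollingMean(values: number[], window: number): number[] {
  const out: number[] = [];
  for (let i = 0; i + window <= values.length; i++) {
    const slice = values.slice(i, i + window);
    out.push(slice.reduce((a, b) => a + b, 0) / window);
  }
  return out;
}

// Flag a dimension whose rolling mean has declined by at least `threshold`
// points from the start of the series to the end.
function declineExceeds(dailyScores: number[], window: number, threshold: number): boolean {
  const means = rollingMean(dailyScores, window);
  if (means.length < 2) return false;
  return means[0] - means[means.length - 1] >= threshold;
}
```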

Pillar 5: Drift Detection

Drift is the silent killer of production agents. Everything looks fine day to day, but over weeks or months, the agent's behavior slowly shifts — responses get longer, certain topic areas get less accurate, tool usage patterns change. No single conversation triggers an alert. The degradation only becomes visible in aggregate.

There are two types of drift to watch for. Prompt drift happens when the model's behavior changes even though your prompts haven't — caused by model updates, input distribution shifts, or context window changes. Behavioral drift happens when the agent's overall patterns shift: it starts preferring different tools, generating longer responses, or handling certain topics differently.

Here's a drift detector that compares recent behavior against a baseline:

typescript
interface DriftWindow {
  period: string;         // "2026-03-01_to_2026-03-07"
  agentId: string;
  metrics: {
    avgTokensPerTurn: number;
    avgToolCallsPerConversation: number;
    avgQualityScore: number;
    topicDistribution: Record<string, number>;   // topic -> percentage
    avgResponseLength: number;
    escalationRate: number;
  };
}
 
interface DriftReport {
  agentId: string;
  baselineWindow: string;
  currentWindow: string;
  alerts: DriftAlert[];
  overallRisk: 'low' | 'medium' | 'high';
}
 
interface DriftAlert {
  metric: string;
  baseline: number;
  current: number;
  changePercent: number;
  severity: 'info' | 'warning' | 'critical';
}
 
function detectDrift(baseline: DriftWindow, current: DriftWindow): DriftReport {
  const alerts: DriftAlert[] = [];
 
  const checks: Array<{
    metric: string;
    baselineVal: number;
    currentVal: number;
    warnThreshold: number;
    critThreshold: number;
  }> = [
    {
      metric: 'avg_tokens_per_turn',
      baselineVal: baseline.metrics.avgTokensPerTurn,
      currentVal: current.metrics.avgTokensPerTurn,
      warnThreshold: 0.25,   // 25% change
      critThreshold: 0.50,   // 50% change
    },
    {
      metric: 'avg_quality_score',
      baselineVal: baseline.metrics.avgQualityScore,
      currentVal: current.metrics.avgQualityScore,
      warnThreshold: 0.05,   // 5% decline (quality is sensitive)
      critThreshold: 0.10,   // 10% decline
    },
    {
      metric: 'avg_tool_calls_per_conversation',
      baselineVal: baseline.metrics.avgToolCallsPerConversation,
      currentVal: current.metrics.avgToolCallsPerConversation,
      warnThreshold: 0.30,
      critThreshold: 0.60,
    },
    {
      metric: 'escalation_rate',
      baselineVal: baseline.metrics.escalationRate,
      currentVal: current.metrics.escalationRate,
      warnThreshold: 0.15,   // 15% increase
      critThreshold: 0.30,   // 30% increase
    },
    {
      metric: 'avg_response_length',
      baselineVal: baseline.metrics.avgResponseLength,
      currentVal: current.metrics.avgResponseLength,
      warnThreshold: 0.35,
      critThreshold: 0.60,
    },
  ];
 
  for (const check of checks) {
    if (check.baselineVal === 0) continue;
    const changePct = (check.currentVal - check.baselineVal) / check.baselineVal;
    const absChange = Math.abs(changePct);
 
    if (absChange >= check.critThreshold) {
      alerts.push({
        metric: check.metric,
        baseline: check.baselineVal,
        current: check.currentVal,
        changePercent: changePct * 100,
        severity: 'critical',
      });
    } else if (absChange >= check.warnThreshold) {
      alerts.push({
        metric: check.metric,
        baseline: check.baselineVal,
        current: check.currentVal,
        changePercent: changePct * 100,
        severity: 'warning',
      });
    }
  }
 
  const criticalCount = alerts.filter(a => a.severity === 'critical').length;
  const warningCount = alerts.filter(a => a.severity === 'warning').length;
 
  return {
    agentId: current.agentId,
    baselineWindow: baseline.period,
    currentWindow: current.period,
    alerts,
    overallRisk: criticalCount > 0 ? 'high' : warningCount >= 2 ? 'medium' : 'low',
  };
}

Run drift detection weekly. Compare each week against a two-week rolling baseline (not a fixed baseline from launch — your baseline should evolve as your agent legitimately improves). As IBM's research notes, agentic drift occurs as underlying models update, training data shifts, and business contexts change — it's not a one-time risk but an ongoing operational reality.
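Constructing that rolling baseline can be as simple as averaging the prior weekly windows metric-by-metric. A sketch mirroring the DriftWindow.metrics shape above, with topicDistribution omitted for brevity:

```typescript
// Numeric metrics for one weekly window; mirrors DriftWindow.metrics above.
interface BaselineMetrics {
  avgTokensPerTurn: number;
  avgToolCallsPerConversation: number;
  avgQualityScore: number;
  avgResponseLength: number;
  escalationRate: number;
}

// Merge the previous N weekly windows into one rolling baseline by averaging
// each metric. Feed the result into detectDrift as the baseline side.
function rollingBaseline(windows: BaselineMetrics[]): BaselineMetrics {
  if (windows.length === 0) throw new Error('need at least one window');
  const keys = Object.keys(windows[0]) as Array<keyof BaselineMetrics>;
  const out = {} as BaselineMetrics;
  for (const key of keys) {
    out[key] = windows.reduce((sum, w) => sum + w[key], 0) / windows.length;
  }
  return out;
}
```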


Building an Observability Pipeline

Now that you know what to measure, let's build the pipeline that collects, processes, and surfaces all of it. A production observability pipeline has four layers: instrumentation, collection, storage, and presentation.

Agent Runtime → Log / Trace / Metrics Collectors → Processing Layer → Alerting Engine, Quality Evaluation, Dashboard / Analytics → On-Call (pages) and Issue Tracker (tickets)

Instrumentation: The Agent Trace

The foundation of agent observability is the trace — a tree of spans that captures every step the agent took to produce a response. OpenTelemetry's experimental GenAI semantic conventions give us a starting vocabulary, but you'll extend it with agent-specific attributes.

Here's a trace wrapper that creates the full agent execution tree:

typescript
import { trace, SpanKind, SpanStatusCode, Span } from '@opentelemetry/api';
 
const tracer = trace.getTracer('agent-service', '1.0.0');
 
interface AgentContext {
  agentId: string;
  conversationId: string;
  turnIndex: number;
  userId?: string;
  workspaceId: string;
}
 
async function tracedAgentTurn<T>(
  ctx: AgentContext,
  handler: (span: Span) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(
    'agent.turn',
    {
      kind: SpanKind.SERVER,
      attributes: {
        'agent.id': ctx.agentId,
        'conversation.id': ctx.conversationId,
        'agent.turn.index': ctx.turnIndex,
        'user.id': ctx.userId ?? 'anonymous',
        'workspace.id': ctx.workspaceId,
      },
    },
    async (span) => {
      try {
        const result = await handler(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        span.recordException(error as Error);
        throw error;
      } finally {
        span.end();
      }
    },
  );
}
 
async function tracedLLMCall(
  model: string,
  inputTokens: number,
  handler: () => Promise<{ output: string; outputTokens: number }>,
): Promise<{ output: string; outputTokens: number }> {
  return tracer.startActiveSpan(
    'gen_ai.chat',
    {
      kind: SpanKind.CLIENT,
      attributes: {
        'gen_ai.system': 'openai',
        'gen_ai.request.model': model,
        'gen_ai.usage.input_tokens': inputTokens,
      },
    },
    async (span) => {
      try {
        const result = await handler();
        span.setAttribute('gen_ai.usage.output_tokens', result.outputTokens);
        span.setAttribute('gen_ai.response.model', model);
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        throw error;
      } finally {
        // End the span on success and failure alike so it's never leaked
        span.end();
      }
    },
  );
}
 
async function tracedToolCall(
  toolName: string,
  input: Record<string, unknown>,
  handler: () => Promise<unknown>,
): Promise<unknown> {
  return tracer.startActiveSpan(
    'agent.tool_call',
    {
      kind: SpanKind.CLIENT,
      attributes: {
        'tool.name': toolName,
        'tool.input': JSON.stringify(input),
      },
    },
    async (span) => {
      try {
        const result = await handler();
        span.setAttribute('tool.status', 'success');
        span.end();
        return result;
      } catch (error) {
        span.setAttribute('tool.status', 'error');
        span.setAttribute('tool.error', error instanceof Error ? error.message : 'Unknown');
        span.setStatus({ code: SpanStatusCode.ERROR });
        span.end();
        throw error;
      }
    },
  );
}

The key design decision here: every span in the tree shares the conversation.id and agent.id attributes. This lets you query for "every step in conversation X" or "every tool call agent Y made today" without joining across data sources. Datadog, which now natively supports OpenTelemetry GenAI semantic conventions (v1.37+), renders these traces as flame graphs where you can see exactly where time was spent and where failures originated.

Collection and Processing

Your collection layer needs to handle three signal types simultaneously: structured logs (events with context), distributed traces (span trees), and metrics (counters, histograms, gauges). Here's the processing layer that ties them together:

typescript
interface ObservabilityEvent {
  type: 'log' | 'metric' | 'eval';
  timestamp: Date;
  agentId: string;
  conversationId: string;
  data: Record<string, unknown>;
}
 
class ObservabilityPipeline {
  private buffer: ObservabilityEvent[] = [];
  private flushInterval: NodeJS.Timeout;
 
  constructor(
    private readonly batchSize = 100,
    private readonly flushMs = 5000,
    private readonly sinks: Array<(events: ObservabilityEvent[]) => Promise<void>> = [],
  ) {
    this.flushInterval = setInterval(() => void this.flush(), flushMs);
  }
 
  emit(event: ObservabilityEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.batchSize) {
      void this.flush();
    }
  }
 
  private async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
 
    const batch = this.buffer.splice(0, this.batchSize);
    await Promise.allSettled(
      this.sinks.map(sink => sink(batch))
    );
  }
 
  async shutdown(): Promise<void> {
    clearInterval(this.flushInterval);
    await this.flush();
  }
}

Batch processing is essential. Sending every event individually creates backpressure that can slow down your agent's response path. Buffer events, flush in batches every 5 seconds or every 100 events (whichever comes first), and use Promise.allSettled so a failing sink doesn't block the others.

Structured Logging for Agents

Traditional application logs are lines of text. Agent logs need to be structured, queryable, and causally connected — you need to reconstruct what the agent was thinking, not just what it did.

The difference is between this:

text
INFO: Agent processed user request in 2.3s

And this:

typescript
interface AgentLogEntry {
  level: 'debug' | 'info' | 'warn' | 'error';
  timestamp: string;
  traceId: string;
  spanId: string;
  agentId: string;
  conversationId: string;
  turnIndex: number;
  event: string;
  step: {
    type: 'planning' | 'llm_call' | 'tool_call' | 'memory' | 'response';
    name: string;
    durationMs: number;
    metadata: Record<string, unknown>;
  };
}

Every log entry carries a traceId and conversationId that lets you reconstruct the full sequence. Here's a logger that enforces this structure:

typescript
class AgentLogger {
  constructor(
    private agentId: string,
    private conversationId: string,
    private traceId: string,
    private turnIndex = 0,
  ) {}
 
  private log(level: AgentLogEntry['level'], event: string, step: AgentLogEntry['step']): void {
    const entry: AgentLogEntry = {
      level,
      timestamp: new Date().toISOString(),
      traceId: this.traceId,
      spanId: trace.getActiveSpan()?.spanContext().spanId ?? 'unknown',
      agentId: this.agentId,
      conversationId: this.conversationId,
      turnIndex: this.turnIndex,
      event,
      step,
    };
 
    // Emit as structured JSON — your log aggregator (ELK, Datadog, CloudWatch)
    // can parse, index, and query these fields directly
    process.stdout.write(JSON.stringify(entry) + '\n');
  }
 
  planningStart(plan: string): void {
    this.log('info', 'agent.planning.start', {
      type: 'planning',
      name: 'plan_generation',
      durationMs: 0,
      metadata: { plan },
    });
  }
 
  toolCallStart(toolName: string, input: Record<string, unknown>): void {
    this.log('info', 'agent.tool.start', {
      type: 'tool_call',
      name: toolName,
      durationMs: 0,
      metadata: { input },
    });
  }
 
  toolCallEnd(toolName: string, durationMs: number, status: string): void {
    this.log(status === 'error' ? 'error' : 'info', 'agent.tool.end', {
      type: 'tool_call',
      name: toolName,
      durationMs,
      metadata: { status },
    });
  }
 
  llmCallEnd(model: string, durationMs: number, tokens: { input: number; output: number }): void {
    this.log('info', 'agent.llm.end', {
      type: 'llm_call',
      name: model,
      durationMs,
      metadata: { tokens },
    });
  }
 
  responseGenerated(responseLength: number, durationMs: number): void {
    this.log('info', 'agent.response.generated', {
      type: 'response',
      name: 'final_response',
      durationMs,
      metadata: { responseLength },
    });
  }
}

This might feel like over-engineering for a single agent. It isn't. The moment you have two agents in production — or one agent with a bug you can't reproduce — you'll wish every event was structured and correlated. The structured format also enables automated analysis: you can write queries like "show me all conversations where a tool call failed and the quality score was below 3" without parsing text.

A critical rule: never log the full user message or agent response in production by default. Log a hash or a truncated preview. Full content goes to a separate, access-controlled store for quality evaluation. This keeps your primary log pipeline fast and avoids accidentally indexing PII in your search infrastructure.
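A minimal sketch of that hash-plus-preview pattern using Node's crypto module (the 16-hex-character hash and 80-character preview length are arbitrary choices, and the function name is mine):

```typescript
import { createHash } from 'node:crypto';

// Log a stable content hash plus a short preview instead of the full message.
// The hash lets you correlate log entries with the access-controlled content
// store without indexing PII in your search infrastructure.
function safeContentRef(content: string, previewChars = 80): { hash: string; preview: string } {
  const hash = createHash('sha256').update(content, 'utf8').digest('hex').slice(0, 16);
  const preview =
    content.length <= previewChars ? content : content.slice(0, previewChars) + '…';
  return { hash, preview };
}
```

Drop the result into the metadata field of your structured log entries in place of raw content.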

Alerting Strategies That Actually Work

The goal of alerting isn't to tell you about every anomaly. It's to wake you up for real incidents and send everything else to a next-business-day queue. Most teams get this backwards — they set tight thresholds on individual metrics and drown in noise.

Composite Signals Over Single Metrics

A single metric spike is usually noise. A correlated shift across multiple metrics is usually a real problem. Build your alerts around signal combinations:

typescript
interface AlertRule {
  name: string;
  severity: 'critical' | 'warning' | 'info';
  conditions: AlertCondition[];
  requiredConditions: number;    // How many conditions must fire
  windowMinutes: number;
  cooldownMinutes: number;
}
 
interface AlertCondition {
  metric: string;
  operator: 'gt' | 'lt' | 'change_pct_gt';
  threshold: number;
}
 
const alertRules: AlertRule[] = [
  {
    name: 'agent_quality_degradation',
    severity: 'critical',
    conditions: [
      { metric: 'quality_score_avg', operator: 'lt', threshold: 3.5 },
      { metric: 'token_usage_per_turn', operator: 'change_pct_gt', threshold: 35 },
      { metric: 'escalation_rate', operator: 'change_pct_gt', threshold: 20 },
    ],
    requiredConditions: 2,        // Any 2 of 3 must fire
    windowMinutes: 30,
    cooldownMinutes: 120,
  },
  {
    name: 'tool_infrastructure_failure',
    severity: 'critical',
    conditions: [
      { metric: 'tool_error_rate', operator: 'gt', threshold: 0.10 },
      { metric: 'tool_p95_latency_seconds', operator: 'gt', threshold: 8 },
    ],
    requiredConditions: 1,        // Either condition alone is concerning
    windowMinutes: 10,
    cooldownMinutes: 60,
  },
  {
    name: 'cost_anomaly',
    severity: 'warning',
    conditions: [
      { metric: 'daily_token_cost', operator: 'change_pct_gt', threshold: 50 },
      { metric: 'avg_tokens_per_conversation', operator: 'change_pct_gt', threshold: 40 },
    ],
    requiredConditions: 1,
    windowMinutes: 1440,          // Daily check
    cooldownMinutes: 1440,
  },
  {
    name: 'drift_detected',
    severity: 'warning',
    conditions: [
      { metric: 'quality_score_weekly_change', operator: 'lt', threshold: -0.3 },
      { metric: 'response_length_weekly_change_pct', operator: 'change_pct_gt', threshold: 30 },
    ],
    requiredConditions: 1,
    windowMinutes: 10080,         // Weekly check
    cooldownMinutes: 10080,
  },
];

Notice the requiredConditions field. For quality degradation, we require two of three signals to fire before paging anyone. A quality score dip alone might be a sampling artifact. A quality score dip plus a spike in token usage plus rising escalation rates — that's a real incident.
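A minimal sketch of how such a composite rule could be evaluated, assuming each metric has already been aggregated over the rule's window. `ruleFires`, `conditionFires`, and `readings` are illustrative names, and the `AlertCondition` shape is repeated from above to keep the sketch self-contained; for `change_pct_gt`, the reading is assumed to be the percentage change over the window.

```typescript
interface AlertCondition {
  metric: string;
  operator: 'gt' | 'lt' | 'change_pct_gt';
  threshold: number;
}

// Does a single condition fire for the observed value?
function conditionFires(c: AlertCondition, value: number): boolean {
  switch (c.operator) {
    case 'gt': return value > c.threshold;
    case 'lt': return value < c.threshold;
    case 'change_pct_gt': return value > c.threshold; // value is % change
  }
}

// A composite rule fires only when enough of its conditions fire together.
function ruleFires(
  conditions: AlertCondition[],
  requiredConditions: number,
  readings: Record<string, number>,
): boolean {
  const fired = conditions.filter(
    c => c.metric in readings && conditionFires(c, readings[c.metric]),
  ).length;
  return fired >= requiredConditions;
}
```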

Tiered Response

Not every alert needs the same response. Structure your alerts into three tiers:

| Tier | Response Time | Examples | Channel |
| --- | --- | --- | --- |
| Critical (P1) | Immediate, pages on-call | Quality below threshold + tool failures, complete agent outage | PagerDuty/Opsgenie |
| Warning (P2) | Same business day | Cost anomaly, single-metric drift, elevated error rate | Slack channel |
| Informational (P3) | Weekly review | Slow trends, minor drift signals, new topic clusters | Dashboard/email digest |

Threshold Calibration

Don't set thresholds from industry benchmarks. They're meaningless for your specific agent. Instead:

  1. Run for two weeks with monitoring enabled but no alerts
  2. Calculate your mean and standard deviation for each metric
  3. Set warning thresholds at 2 standard deviations from the baseline mean, critical at 3
  4. Review and adjust monthly as your baseline evolves

This approach means your first two weeks in production are "fly-by-instruments" with manual review. That's the correct tradeoff. Premature alerting causes alert fatigue, and alert fatigue causes missed incidents.
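Steps 2 and 3 above can be sketched as a hypothetical `calibrateThresholds` helper. The `direction` parameter is an assumption not spelled out in the steps: it covers metrics where high values are bad (latency, error rate) versus metrics where low values are bad (quality scores).

```typescript
// Derive warning/critical thresholds from a baseline of metric samples
// collected during the no-alerts observation period.
function calibrateThresholds(
  samples: number[],
  direction: 'upper' | 'lower',
): { warning: number; critical: number } {
  const mean = samples.reduce((s, x) => s + x, 0) / samples.length;
  const variance =
    samples.reduce((s, x) => s + (x - mean) ** 2, 0) / samples.length;
  const sd = Math.sqrt(variance);
  const sign = direction === 'upper' ? 1 : -1;
  return {
    warning: mean + sign * 2 * sd,   // alert at 2 standard deviations
    critical: mean + sign * 3 * sd,  // page at 3 standard deviations
  };
}
```

Re-running this monthly against a fresh sample window is the "review and adjust" step from the list above.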

Putting It All Together: The Observability Loop

Observability isn't a one-time setup — it's a feedback loop. Monitoring surfaces issues, evaluation quantifies them, and the insights feed back into agent improvement.

Deploy Agent → Monitor Production → Detect Anomaly → Evaluate Quality → Diagnose Root Cause → Fix (Prompt / Tool / Data) → Regression Test → back to Deploy

Here's what that loop looks like in practice for a real incident:

Day 1: Drift detector flags a 0.4-point quality decline in accuracy scores over the past week. Severity: warning. No page, shows up in the weekly review dashboard.

Day 2: Engineer reviews the flagged conversations. Quality evaluations show the agent is giving outdated pricing information — a knowledge base article was updated but the agent's retrieval is pulling cached versions.

Day 3: Fix deployed — KB cache TTL reduced, retrieval pipeline flushed. Regression tests confirm the fix. Quality scores recover within 24 hours.

Without observability, this issue would've continued for weeks. Users would've gotten wrong pricing. Support tickets would've piled up. Someone would've eventually noticed and filed a bug. The fix itself was trivial. The detection was everything.

The Monitoring Stack Decision

You don't need to build all of this from scratch. The ecosystem has matured significantly. Here's how the tooling breaks down:

| Layer | Build vs. Buy | Reasoning |
| --- | --- | --- |
| Traces | Use OpenTelemetry + existing backend | OTel is the standard; send to Datadog, Jaeger, or Grafana Tempo |
| Metrics | Use Prometheus/Grafana or your cloud provider | Commodity infrastructure |
| Structured logs | Build the schema, use existing aggregation | Your log format is custom; ELK/CloudWatch/Datadog for storage |
| Quality evaluation | Build or use specialized tooling | Rubrics are domain-specific; platforms like Chanl's scorecards or Langfuse can accelerate this |
| Drift detection | Build | Too specific to your agent's behavior for off-the-shelf tools |
| Alerting | Build composite rules on existing alerting infra | PagerDuty/Opsgenie for routing; your rules for logic |

For teams running production voice or chat agents, platforms that integrate monitoring, analytics, and memory management into a unified pipeline reduce the operational overhead of wiring these layers together manually.

Advanced Patterns: What Comes Next

Once your core observability pipeline is running, three advanced patterns become possible.

Conversation-Level Analytics

Move beyond per-turn metrics to conversation-level analysis. How many turns does it take to resolve a request? Where do conversations get stuck? Which topics have the highest escalation rates?

typescript
interface ConversationAnalytics {
  conversationId: string;
  agentId: string;
  metrics: {
    turnCount: number;
    totalDurationSeconds: number;
    totalTokens: number;
    totalCost: number;
    toolCallCount: number;
    toolFailureCount: number;
    topic: string;
    sentiment: 'positive' | 'neutral' | 'negative';
    outcome: 'resolved' | 'escalated' | 'abandoned';
    qualityScore: number;
  };
}
 
function aggregateConversationMetrics(
  conversations: ConversationAnalytics[],
): Record<string, {
  count: number;
  avgTurns: number;
  resolutionRate: number;
  avgQuality: number;
  avgCost: number;
}> {
  const byTopic: Record<string, ConversationAnalytics[]> = {};
 
  for (const conv of conversations) {
    const topic = conv.metrics.topic;
    if (!byTopic[topic]) byTopic[topic] = [];
    byTopic[topic].push(conv);
  }
 
  const result: Record<
    string,
    { count: number; avgTurns: number; resolutionRate: number; avgQuality: number; avgCost: number }
  > = {};
  for (const [topic, convs] of Object.entries(byTopic)) {
    const resolved = convs.filter(c => c.metrics.outcome === 'resolved').length;
    result[topic] = {
      count: convs.length,
      avgTurns: convs.reduce((s, c) => s + c.metrics.turnCount, 0) / convs.length,
      resolutionRate: resolved / convs.length,
      avgQuality: convs.reduce((s, c) => s + c.metrics.qualityScore, 0) / convs.length,
      avgCost: convs.reduce((s, c) => s + c.metrics.totalCost, 0) / convs.length,
    };
  }
 
  return result;
}

This analysis reveals that your agent resolves billing questions in 2.3 turns on average but takes 6.8 turns for technical troubleshooting — and troubleshooting conversations cost 4x more in tokens. That's actionable intelligence for where to invest in better tools or retrieval.

Regression Testing from Production Data

Your production observability data is a goldmine for regression testing. Export flagged conversations (quality score below threshold) as test cases, and run them through your eval framework before every deploy.

typescript
async function exportRegressionCases(
  evaluations: QualityEvaluation[],
  threshold: number,
): Promise<Array<{ input: string; expectedBehavior: string; failureReason: string }>> {
  return evaluations
    .filter(e => {
      const scores = Object.values(e.scores);
      const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;
      return avgScore < threshold;
    })
    .map(e => ({
      input: e.conversationId, // Resolve to actual conversation data
      expectedBehavior: `All quality scores >= ${threshold}`,
      failureReason: e.reasoning,
    }));
}

This creates a growing test suite that's grounded in real production failures — not hypothetical edge cases. Every bad conversation becomes a regression test that prevents the same failure from recurring.
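One possible shape for the pre-deploy gate itself, with `runAgent` and `scoreResponse` standing in for your own agent handler and evaluation rubric — both are assumed interfaces for illustration, not real APIs.

```typescript
interface RegressionCase {
  input: string;
  expectedBehavior: string;
  failureReason: string;
}

// Hypothetical pre-deploy gate: replay exported regression cases through the
// candidate build and fail the deploy if any previously-bad case regresses.
async function runRegressionGate(
  cases: RegressionCase[],
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (input: string, output: string) => Promise<number>,
  threshold: number,
): Promise<{ passed: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const c of cases) {
    const output = await runAgent(c.input);
    const score = await scoreResponse(c.input, output);
    if (score < threshold) failures.push(c.input);
  }
  return { passed: failures.length === 0, failures };
}
```

Wiring this into CI means a known-bad conversation can never silently resurface.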

Model Comparison in Production

When you're considering a model switch (or your provider releases an update), shadow-test the new model against production traffic:

typescript
interface ShadowTestResult {
  conversationId: string;
  primaryModel: { model: string; latencyMs: number; tokens: number; qualityScore: number };
  shadowModel: { model: string; latencyMs: number; tokens: number; qualityScore: number };
}
 
async function shadowTest(
  conversation: { role: string; content: string }[],
  primaryHandler: () => Promise<{ output: string; latencyMs: number; tokens: number }>,
  shadowHandler: () => Promise<{ output: string; latencyMs: number; tokens: number }>,
): Promise<ShadowTestResult> {
  // Start the shadow model first so it runs concurrently with the primary;
  // in production, only the primary result is returned to the user, so the
  // shadow call adds no user-facing latency
  const shadowPromise = shadowHandler();

  // Primary model serves the user (blocking)
  const primary = await primaryHandler();

  // Collect the shadow result for offline comparison
  const shadow = await shadowPromise;
 
  // Evaluate both
  const [primaryQuality, shadowQuality] = await Promise.all([
    evaluateResponse(conversation, primary.output),
    evaluateResponse(conversation, shadow.output),
  ]);
 
  return {
    conversationId: crypto.randomUUID(),
    primaryModel: { model: 'current', ...primary, qualityScore: primaryQuality },
    shadowModel: { model: 'candidate', ...shadow, qualityScore: shadowQuality },
  };
}

After a week of shadow testing, you'll have statistically significant quality, latency, and cost comparisons — grounded in your actual traffic patterns, not synthetic benchmarks.
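One way to summarize those comparisons is as paired deltas per conversation, which controls for conversation difficulty better than comparing unpaired averages. `summarizeShadowDeltas` is an illustrative helper operating on a simplified view of the `ShadowTestResult` records above.

```typescript
// Summarize shadow-test results as candidate-minus-current quality deltas.
function summarizeShadowDeltas(
  results: Array<{ primaryQuality: number; shadowQuality: number }>,
): { meanDelta: number; winRate: number } {
  const deltas = results.map(r => r.shadowQuality - r.primaryQuality);
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / deltas.length;
  // Fraction of conversations where the candidate scored strictly higher
  const wins = deltas.filter(d => d > 0).length;
  return { meanDelta, winRate: wins / deltas.length };
}
```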

Common Pitfalls and How to Avoid Them

Before wrapping up, here are the mistakes that catch teams most often.

Monitoring averages instead of percentiles. Average latency of 1.2 seconds feels fine. But if your p99 is 18 seconds, 1% of your users are having a terrible experience. Always track p50, p95, and p99.
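A simple nearest-rank percentile over a latency window looks like the sketch below — fine for a dashboard, though production metric stores typically use histogram-based estimates rather than sorting raw samples.

```typescript
// Nearest-rank percentile: sort the window, take the ceil(p% * n)-th value.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('empty window');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```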

Over-relying on error rates. An agent can have a 0% error rate and still be consistently wrong. Errors catch infrastructure failures. Quality evaluation catches behavioral failures. You need both.

Evaluating too late. Teams that add quality evaluation after their first production incident spend weeks building what should've been there from day one. Instrument quality scoring from the start, even if you only sample 5%.

Setting static thresholds. Your agent's behavior changes as it handles new topics, as you update prompts, and as models evolve. Re-baseline your thresholds monthly. A static threshold set at launch will either alert constantly or miss everything within six months.

Ignoring cost signals. A 40% jump in tokens per conversation isn't just a budget issue — it's often the first signal of prompt drift or retrieval degradation. Treat cost anomalies as behavioral signals, not just financial ones.

Logging everything at full fidelity. Full prompt and response logging at 100% sampling creates a storage problem, a privacy problem, and a query performance problem. Log metadata at 100%, full content at 5-10%, and keep the two streams separate.
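One way to implement the split-fidelity sampling is to hash the conversation ID, so that all turns of a sampled conversation are kept or dropped together instead of sampling individual turns at random. `shouldLogFullContent` is an illustrative helper built on Node's crypto module.

```typescript
import { createHash } from 'node:crypto';

// Deterministic sampling: the same conversation always lands in the same
// bucket, so sampled conversations are logged in full, end to end.
function shouldLogFullContent(conversationId: string, sampleRate: number): boolean {
  const hash = createHash('sha256').update(conversationId).digest();
  // Interpret the first 4 bytes as a uniform value in [0, 1)
  const bucket = hash.readUInt32BE(0) / 0x100000000;
  return bucket < sampleRate;
}
```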

Where This Goes from Here

Agent observability is still a young field. OpenTelemetry's GenAI semantic conventions are experimental. The evaluation tooling ecosystem is fragmented. Most teams are still stitching together monitoring from pieces.

But the direction is clear. The teams that invest in observability early — structured traces, automated quality evaluation, drift detection, composite alerting — are the ones whose agents improve over time instead of silently degrading. The observability pipeline isn't overhead. It's the feedback mechanism that turns a deployed agent into a continuously improving system.

If you're building agents with prompt engineering techniques and connecting them to tools, the observability layer is what makes those investments compound. Without it, you're shipping code into a black box and hoping for the best.

Start with the four core metrics: latency percentiles, tool success rates, token usage, and sampled quality scores. Add drift detection after your first two weeks. Build composite alerts after your first month. By then, you'll have enough baseline data to set thresholds that actually mean something — and enough production experience to know which signals matter most for your specific agents.

