Chanl
Operations

What to Trace When Your AI Agent Hits Production

OpenTelemetry GenAI conventions are the production standard for agent tracing. What to instrument, what to skip, and what breaks — from a 2 AM debugging war story.

Dean Grover, Co-founder
March 20, 2026
18 min read
Watercolor illustration of distributed trace spans flowing through an AI agent pipeline with OpenTelemetry instrumentation

It was 2 AM on a Tuesday, and our production agent was hallucinating order statuses. Not for every customer. Not in a pattern we could spot from dashboards. Just 3-4% of conversations, scattered across timezones, where the agent confidently told customers their orders had shipped when they hadn't. Our error rate was zero. Our latency was nominal. Every health check was green.

We spent six hours grep-ing through unstructured logs before we found it: a tool execution was silently returning cached data from a stale connection pool, and the agent was treating stale data as fresh. The LLM wasn't wrong. The tool wasn't throwing errors. The trace was invisible because we weren't tracing tool executions as first-class spans.

That night taught us something that saved us months later: agent observability isn't about watching the model. It's about watching the entire execution graph — every LLM call, every tool invocation, every memory retrieval, every planning step — as a connected trace. And as of 2026, there's finally a standard way to do it.


The Standard That Changed Everything

OpenTelemetry's GenAI semantic conventions standardized how the industry traces LLM and agent workloads. Released in late 2025 and widely adopted throughout early 2026, they define attribute names, span kinds, and event structures specifically for generative AI. Before this, every observability vendor invented their own schema.

The conventions cover three layers:

  1. LLM call spans — model name, provider, token counts (input/output/total), finish reason, temperature, response format
  2. Tool execution spans — tool name, parameters, result status, execution duration
  3. Agent orchestration spans — planning steps, reasoning chains, memory operations, goal completion

Here's what the attribute namespace looks like in practice:

typescript
// The three namespaces that matter — everything else is optional
const GenAIAttributes = {
  // WHO made the call (swap providers without changing dashboards)
  SYSTEM: 'gen_ai.system',              // 'openai', 'anthropic', 'google'
  REQUEST_MODEL: 'gen_ai.request.model', // 'claude-sonnet-4-20250514'
 
  // WHAT it cost (the only attributes your CFO cares about)
  INPUT_TOKENS: 'gen_ai.usage.input_tokens',
  OUTPUT_TOKENS: 'gen_ai.usage.output_tokens',
 
  // WHERE in the agent graph (connects LLM calls to business logic)
  AGENT_ID: 'gen_ai.agent.id',
  TOOL_NAME: 'gen_ai.tool.name',
};

The adoption was fast. Braintrust shipped OTEL-native tracing in Q1 2026. Arize Phoenix built their entire trace viewer around the conventions. Langfuse added OTEL ingestion alongside their native SDK. Grafana Cloud released an AI observability plugin that reads GenAI spans natively. This convergence means you instrument once and export to any backend.

Before the conventions, switching from Langfuse to Grafana meant re-instrumenting every trace point. Now you swap an exporter.

The Four Spans Every Agent Needs

Every production agent trace needs four span types: LLM calls, tool executions, memory operations, and agent orchestration. Miss any one of these and you'll have blind spots that only surface as customer complaints.

Here's the span hierarchy for a single agent turn:

User: "What's the status of my order?"

agent.turn (root span for the whole turn)
├─ memory.retrieve        → last 5 messages + customer context
├─ gen_ai.chat (planning) → intent: order_status, need: order_id
├─ tool.execute           → lookup_order(customer_id="cust_123") returns {status: "shipped", tracking: "1Z999..."}
├─ gen_ai.chat (response) → "Your order shipped yesterday..."
└─ memory.store           → stores the turn for the next retrieval

That's six spans for one conversation turn. Multiply by 5-10 turns per conversation, and you understand why production agents generate 10-100x more telemetry than traditional APIs.

Instrumenting LLM Calls

The LLM span is the most straightforward. Wrap every model call with a span that captures the GenAI attributes:

typescript
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
 
const tracer = trace.getTracer('agent-service', '1.0.0');
 
async function tracedLLMCall(options: LLMCallOptions, callFn: () => Promise<LLMResult>) {
  // 'gen_ai.chat' — the standard span name. Backends key off this for AI-specific views.
  return tracer.startActiveSpan('gen_ai.chat', {
    kind: SpanKind.CLIENT, // CLIENT because we're calling an external LLM API
    attributes: {
      'gen_ai.system': options.provider,        // Makes multi-provider dashboards work
      'gen_ai.request.model': options.model,    // Track cost per model, not per request
      'gen_ai.agent.id': options.agentId,       // Links this LLM call to the agent that made it
      'app.conversation_id': options.conversationId, // YOUR join key — not in the spec, but essential
    },
  }, async (span) => {
    try {
      const result = await callFn();
      // Token counts AFTER the call — you don't know output tokens until the response arrives
      span.setAttributes({
        'gen_ai.usage.input_tokens': result.usage.inputTokens,
        'gen_ai.usage.output_tokens': result.usage.outputTokens,
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error); // Attaches stack trace to the span for debugging
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message }); // Without this, failed calls look OK in trace views
      throw error;
    } finally {
      span.end(); // ALWAYS end in finally — leaked spans poison your trace tree
    }
  });
}

Instrumenting Tool Executions

Tool spans are where most debugging value lives. A clean LLM call with a broken tool execution produces a confident, wrong answer — exactly our 2 AM incident. If we'd had this span, we would have seen app.tool.success: true with stale data in the response, and the duration_ms would have been suspiciously fast (cache hit, not a real API call). Thirty seconds of trace inspection instead of six hours of grep.

typescript
async function tracedToolExecution(options: ToolCallOptions, executeFn: () => Promise<unknown>) {
  return tracer.startActiveSpan('tool.execute', {
    kind: SpanKind.INTERNAL, // INTERNAL, not CLIENT — the agent is calling its own tool
    attributes: {
      'gen_ai.tool.name': options.toolName,       // Standard attr: backends auto-group by tool
      'gen_ai.agent.id': options.agentId,
      'app.conversation_id': options.conversationId,
      // Redact BEFORE setting — you can't un-export PII from your observability backend
      'app.tool.parameters': JSON.stringify(redactSensitiveFields(options.parameters)),
    },
  }, async (span) => {
    const start = performance.now();
    try {
      const result = await executeFn();
      // success=true is what you'll filter on: "show me all tool failures this week"
      span.setAttributes({ 'app.tool.success': true, 'app.tool.duration_ms': Math.round(performance.now() - start) });
      return result;
    } catch (error) {
      // error_type lets you distinguish timeout vs auth vs data errors in dashboards
      span.setAttributes({ 'app.tool.success': false, 'app.tool.error_type': (error as Error).constructor.name });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

The app.* attributes live outside the convention namespace. The spec gives you the structure; you fill in the domain-specific debugging fields.
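The redactSensitiveFields helper used in the tool span above isn't defined by the spec, and the article doesn't pin down its shape. A minimal sketch, assuming key-based redaction with recursion into nested parameter objects; the key list is an assumption you'd extend with your own domain's sensitive field names:

```typescript
// Keys to scrub from tool parameters before they reach a span attribute.
// This list is illustrative — add your domain's fields (MRN, account_number, etc.)
const SENSITIVE_KEYS = new Set(['ssn', 'password', 'credit_card', 'email', 'phone', 'api_key']);

function redactSensitiveFields(params: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(params)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = '[REDACTED]'; // Drop the value, keep the key so the span still shows what was passed
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = redactSensitiveFields(value as Record<string, unknown>); // Recurse into nested params
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Key-based redaction here plus pattern-based redaction in the span processor (shown later) gives you two layers: named fields are caught at the source, and anything that slips through in free-text values gets caught before export.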

Instrumenting Memory Operations

Memory retrieval spans close the last major blind spot. When an agent pulls stale context or misses relevant history, the conversation goes sideways in ways that look like model failures but aren't.

typescript
async function tracedMemoryRetrieval(agentId: string, conversationId: string, retrieveFn: () => Promise<MemoryResult[]>) {
  return tracer.startActiveSpan('memory.retrieve', {
    attributes: { 'gen_ai.agent.id': agentId, 'app.conversation_id': conversationId },
  }, async (span) => {
    try {
      const results = await retrieveFn();
      span.setAttributes({
        'app.memory.results_count': results.length,        // Zero results = agent is flying blind
        'app.memory.oldest_result_age_hours': results[0]   // Stale memory is WORSE than no memory —
          ? Math.round((Date.now() - new Date(results[0].createdAt).getTime()) / 3600000) // the agent trusts outdated facts
          : 0,
      });
      return results;
    } catch (error) {
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end(); // End in finally here too — a failed retrieval must not leak the span
    }
  });
}

The oldest_result_age_hours attribute saved us twice — the same class of bug as our 2 AM stale-data incident, but in the memory layer. Both times, retrieval was returning results from weeks ago that had drifted out of relevance. Without that attribute, those conversations just looked like bad model outputs. For deeper patterns on building persistent memory systems, the architecture matters as much as the observability.

What to Skip (and Why)

Not everything deserves a span. Trace the decision points, skip the plumbing. A single 5-step agent turn generates 15-30 spans. At 10,000 daily conversations with 8 turns each, that's 1.2-2.4 million spans per day. Teams that trace every internal function call routinely hit $2,000-5,000/month in observability costs before they realize what happened.

Here's the practical filter:

| Trace This | Skip This |
| --- | --- |
| Every LLM call (model, tokens, latency) | Internal prompt template rendering |
| Every tool execution (name, params, success/fail) | HTTP connection pooling details |
| Memory retrievals (count, age, relevance) | Individual vector similarity scores |
| Agent planning decisions (which step, why) | Token-level streaming events |
| Conversation-level metrics (turns, duration) | Per-message serialization/deserialization |
| Error and retry events | Successful health checks |
| Sampling decisions (why this trace was kept) | Cache hit/miss for static config |

The pattern: trace anything that affects the agent's output or the user's experience. Skip anything that's infrastructure plumbing. If you can't explain why a span would help you debug a customer complaint, don't create it.
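That filter can live in code as a predicate consulted before span creation. A minimal sketch, assuming the span-name prefixes used throughout this article; the skip list is illustrative and belongs in config, not hardcoded:

```typescript
// Decision points get spans; plumbing doesn't. Prefixes match this article's naming
// (gen_ai.chat, tool.execute, memory.retrieve, agent.turn) — adapt to yours.
const TRACED_PREFIXES = ['gen_ai.', 'tool.', 'memory.', 'agent.'];
const SKIPPED_SPANS = new Set(['prompt.render', 'health.check', 'http.connection_pool']);

function shouldCreateSpan(spanName: string): boolean {
  if (SKIPPED_SPANS.has(spanName)) return false;              // Known plumbing: never trace
  return TRACED_PREFIXES.some((p) => spanName.startsWith(p)); // Decision points: always trace
}
```

An allow-list of prefixes fails closed: a new internal helper doesn't silently start generating spans until someone deliberately puts it in a traced namespace.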

Sampling Strategy

At scale, you can't afford 100% trace collection. Here's the sampling configuration that balances cost and coverage:

typescript
import type { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  SamplingDecision,
  type Sampler,
  type SamplingResult,
} from '@opentelemetry/sdk-trace-base';
 
class AgentSampler implements Sampler {
  private baseSampler: ParentBasedSampler;
  constructor(sampleRate: number) {
    // ParentBased = child spans inherit the parent's sampling decision (keeps traces whole)
    this.baseSampler = new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(sampleRate) });
  }
 
  shouldSample(context: Context, traceId: string, spanName: string, spanKind: SpanKind, attrs: Attributes, links: Link[]): SamplingResult {
    // RULE 1: Never drop a tool failure. The cost of missing a failure pattern >> storage cost.
    if (attrs['app.tool.success'] === false) return { decision: SamplingDecision.RECORD_AND_SAMPLED };
 
    // RULE 2: Always keep slow conversations. These are your hardest customer problems.
    if ((attrs['app.conversation.duration_ms'] as number) > 10000) return { decision: SamplingDecision.RECORD_AND_SAMPLED };
 
    // RULE 3: Everything else at the configured rate (10-25% in prod).
    return this.baseSampler.shouldSample(context, traceId, spanName, spanKind, attrs, links);
  }
 
  toString(): string { return 'AgentSampler'; }
}

For most teams: 100% in staging, 10-25% in production, 100% on errors regardless. At 10% sampling on 10,000 daily conversations, you're looking at roughly 50-100GB of trace data per month. Adjust upward if you're under 1,000 conversations/day since you need the data density more than cost savings.
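The storage math above can be sanity-checked with a back-of-envelope estimator. The 16KB-per-span default is an assumption on my part (spans carrying prompt and response payloads typically run 10-30KB; bare metadata spans are closer to 1-2KB); measure your own average before trusting the output:

```typescript
// Rough monthly trace volume in decimal GB. All inputs are per-day figures.
function monthlyTraceGB(opts: {
  conversationsPerDay: number;
  turnsPerConversation: number;
  spansPerTurn: number;
  sampleRate: number;      // 0.1 = 10% head-based sampling
  bytesPerSpan?: number;   // ASSUMPTION: 16KB default — measure yours
}): number {
  const bytesPerSpan = opts.bytesPerSpan ?? 16384;
  const spansPerDay =
    opts.conversationsPerDay * opts.turnsPerConversation * opts.spansPerTurn * opts.sampleRate;
  return (spansPerDay * bytesPerSpan * 30) / 1e9; // 30-day month
}

// monthlyTraceGB({ conversationsPerDay: 10000, turnsPerConversation: 8, spansPerTurn: 15, sampleRate: 0.1 })
// ≈ 59 GB — inside the 50-100GB range quoted above
```

Running your real numbers through this before picking a backend tier is cheaper than discovering the overage on the first invoice.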

Why P99 Matters More Than P50

P99 latency matters more than P50 because the slowest conversations are where your agent handles the hardest problems. A P50 of 1.2 seconds looks great. A P99 of 14 seconds means 1 in 100 users is waiting through multi-step reasoning, multiple tool calls, and retry loops. Those aren't random slow paths. They're the billing disputes, escalation flows, and compliance-sensitive questions. Your highest-value interactions are getting your worst latency.

Track percentiles per agent, per complexity tier:

typescript
const meter = metrics.getMeter('agent-service');
// Histogram, not gauge — you need percentile buckets, not just averages
const turnLatency = meter.createHistogram('agent.turn.duration', { unit: 'ms' });
 
function recordTurnLatency(durationMs: number, attrs: { agentId: string; toolCallCount: number; llmCallCount: number }) {
  // THIS is the key insight: segment by complexity so P99 means something
  const totalSteps = attrs.toolCallCount + attrs.llmCallCount;
  const tier = totalSteps <= 2 ? 'simple' : totalSteps <= 5 ? 'moderate' : 'complex';
 
  turnLatency.record(durationMs, {
    'agent.id': attrs.agentId,
    'agent.complexity_tier': tier,  // Without this, your "simple" P50 hides your "complex" P99
  });
}

When you segment by complexity tier, the picture changes. Your "simple" tier P99 might be 2 seconds. Your "complex" tier P99 might be 18 seconds. That's the gap where optimization actually matters, and it's invisible if you're only looking at aggregate latency. The same principle applies to agent analytics dashboards — aggregate numbers hide the conversations that need attention most.

Choosing an Observability Backend

The market split into two camps: AI-native observability tools and infrastructure tools that added AI support. Both speak OTEL. The difference is what they do after ingestion.

| Tool | OTEL GenAI Support | Strengths | Gaps | Best For |
| --- | --- | --- | --- | --- |
| Braintrust | Native (built on OTEL) | Evals, prompt versioning, dataset management, scoring | No infrastructure metrics, no alerting | Teams focused on model quality iteration |
| Arize Phoenix | Native ingestion | Trace visualization, embedding drift, LLM evaluation | Lighter on alerting, newer ecosystem | ML teams wanting deep model analytics |
| Langfuse | OTEL + native SDK | Open-source, prompt management, cost tracking, evals | Self-hosted complexity, smaller ecosystem | Teams wanting full control, budget-conscious |
| Grafana Cloud | OTEL plugin | Correlates AI traces with infra metrics, mature alerting | AI-specific features are newer, less eval depth | Teams already on Grafana stack |
| Datadog | OTEL + native APM | Full-stack correlation, mature alerting, broad integrations | AI features are add-on, pricing at scale | Enterprise teams with existing Datadog |
| Honeycomb | OTEL native | Query flexibility, high-cardinality support, trace analysis | No built-in AI eval, no prompt management | Debugging-focused teams |

The practical answer for most teams: use an AI-native tool (Braintrust or Langfuse) for evals and prompt iteration, plus your existing infrastructure tool (Grafana, Datadog) for alerting and correlation. OTEL makes this dual-export trivial:

typescript
const provider = new NodeTracerProvider({ sampler: new AgentSampler(0.1) });
 
// Destination 1: AI-native tool for evals, prompt versioning, scoring
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({
    url: 'https://otel.braintrust.dev/v1/traces',          // Braintrust, Langfuse, or Arize
    headers: { Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}` },
  })
));
 
// Destination 2: Infra tool for alerting, dashboards, on-call correlation
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({
    url: process.env.GRAFANA_OTLP_ENDPOINT,                // Grafana, Datadog, or Honeycomb
    headers: { Authorization: `Basic ${process.env.GRAFANA_OTLP_TOKEN}` },
  })
));
 
provider.register(); // One instrumentation, two backends, zero vendor lock-in

Same instrumentation code, two destinations, different purposes. That's the OTEL value proposition in one code block.

Five Ways Production Breaks You

These five patterns account for most production observability failures we've seen across agent teams.

1. Trace Explosion from Retry Loops

Agent frameworks that auto-retry on LLM errors can generate hundreds of spans in a single conversation. An LLM rate limit error triggers a retry, which triggers another rate limit, which triggers another retry. Each creates a full span tree. We've seen a single conversation produce 847 spans during an OpenAI outage. Your trace storage bill quadruples overnight, and the tracing overhead itself starts adding latency to an already degraded system.

Fix: cap retries at the instrumentation level, not just the business logic level. Add a retry_count attribute and stop creating child spans after retry 3.
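Here's one way to sketch that cap. retryWithCappedSpans is a hypothetical wrapper, not part of any framework; the span-creating callback stands in for tracedLLMCall from earlier, and the caller would set the retryCount it receives as an app.retry_count span attribute:

```typescript
// Past MAX_TRACED_RETRIES we keep retrying, but stop opening new child spans —
// a rate-limit storm can't explode the trace tree.
const MAX_TRACED_RETRIES = 3;

async function retryWithCappedSpans<T>(
  attempt: () => Promise<T>,
  withSpan: (retryCount: number, fn: () => Promise<T>) => Promise<T>, // wraps attempt in a span, e.g. via tracedLLMCall
  maxRetries = 6,
): Promise<T> {
  let lastError: unknown;
  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      // First few attempts get spans (tagged with retryCount); the rest run untraced.
      return retryCount < MAX_TRACED_RETRIES
        ? await withSpan(retryCount, attempt)
        : await attempt();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Note the cap lives in the instrumentation wrapper, not the business logic: even if a framework upgrade changes retry behavior underneath you, span volume stays bounded.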

2. Sensitive Data in Span Attributes

Tool parameters and LLM prompts often contain customer PII. Standard OTEL exporters will happily ship those to your observability backend. Your security team will not be happy.

Fix: implement attribute redaction in your span processor, before export:

typescript
// Runs BEFORE export — last chance to strip PII before it leaves your infra
class RedactingSpanProcessor implements SpanProcessor {
  // Catches the obvious stuff. Add your domain patterns (MRN, account numbers, etc.)
  private sensitivePatterns = [
    /\b\d{3}-?\d{2}-?\d{4}\b/g,    // SSN — tool params often contain these
    /\b\d{13,19}\b/g,               // Credit card numbers
    /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, // Email — shows up in agent prompts constantly
  ];
 
  onEnd(span: ReadableSpan): void {
    // Scan every string attribute — tool.parameters is the biggest risk
    for (const [key, value] of Object.entries(span.attributes)) {
      if (typeof value !== 'string') continue;
      let redacted = value;
      for (const pattern of this.sensitivePatterns) redacted = redacted.replace(pattern, '[REDACTED]');
      if (redacted !== value) (span.attributes as Record<string, unknown>)[key] = redacted;
    }
  }
 
  onStart(): void {} // Required by interface
  forceFlush(): Promise<void> { return Promise.resolve(); }
  shutdown(): Promise<void> { return Promise.resolve(); }
}

3. Missing Correlation Between Traces and Quality Scores

You have traces in Grafana and quality scores from your scorecard evaluations. But they're in separate systems with no join key. When a conversation scores poorly, you can't find its trace. When a trace looks slow, you can't find its quality score.

Fix: add conversation_id and agent_id as attributes on every root span. Use the same IDs in your quality evaluation pipeline. This is trivially simple but almost everyone forgets it until they need it.

4. Context Propagation Across Tool Boundaries

When your agent calls an external tool via HTTP, the trace context should propagate so the tool execution appears as a child span. But many tool servers don't support W3C trace context headers. Your trace shows the agent calling a tool, then a gap, then the response.

Fix: at minimum, create a client-side span for the tool call duration even if the server doesn't participate in the trace. The timing and success/failure data is still valuable. For tools you control, add the OTEL HTTP instrumentation middleware.
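A self-contained sketch of the outbound side, assuming the W3C Trace Context header format (version-traceid-spanid-flags, lowercase hex). In a real service you'd call propagation.inject() from @opentelemetry/api rather than hand-building the header; this version just shows what goes over the wire:

```typescript
// traceparent per the W3C Trace Context spec: "00-<trace-id>-<span-id>-<flags>"
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Attach it to every tool request. Servers that don't speak trace context ignore it;
// the day the tool adds OTEL support, your traces become end-to-end for free.
// Wrap the call itself in a client-side span (tracedToolExecution above) either way,
// so duration and success/failure are never lost.
function toolRequestHeaders(traceId: string, spanId: string): Record<string, string> {
  return {
    'content-type': 'application/json',
    traceparent: buildTraceparent(traceId, spanId, true),
  };
}
```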

5. Cardinality Explosion from Dynamic Tool Names

If your agent framework uses dynamic tool names (e.g., query_table_orders, query_table_customers), each unique tool name becomes a new metric series. With 50 tables, you have 50x the cardinality. With 500, your metrics backend starts rejecting data.

Fix: use a static tool name attribute (gen_ai.tool.name: "query_table") with the dynamic part as a separate attribute (app.tool.target: "orders"). This keeps cardinality bounded while preserving debugging detail.
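A sketch of that split. The query_table_ regex matches the example naming above; adapt it to whatever pattern your framework generates:

```typescript
// Bounded name for metrics, unbounded detail as a separate trace attribute.
function splitToolName(dynamicName: string): { name: string; target?: string } {
  const match = dynamicName.match(/^(query_table)_(.+)$/); // ASSUMPTION: query_table_<name> pattern
  if (!match) return { name: dynamicName };                // Static tool names pass through untouched
  return { name: match[1], target: match[2] };             // 'query_table' + 'orders'
}

function toolSpanAttributes(dynamicName: string): Record<string, string> {
  const { name, target } = splitToolName(dynamicName);
  return {
    'gen_ai.tool.name': name,                              // One metric series per tool *family*
    ...(target ? { 'app.tool.target': target } : {}),      // High-cardinality detail stays off the metric path
  };
}
```

The trace keeps full debugging detail (you can still filter spans by app.tool.target), while the metrics backend only ever sees the bounded family name.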

The Complete Setup

A production pipeline needs four components: tracer provider, metric meter, span processors with redaction, and exporters. Here's the complete setup.

typescript
// Call ONCE at service startup, BEFORE any agent code runs
function initTelemetry(config: { serviceName: string; otlpEndpoint: string; otlpHeaders: Record<string, string>; sampleRate: number }) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: config.serviceName,
      'deployment.environment': process.env.NODE_ENV ?? 'development',
    }),
    sampler: new AgentSampler(config.sampleRate),
    spanProcessors: [
      new RedactingSpanProcessor(),              // Strip PII BEFORE batching (order matters)
      new BatchSpanProcessor(new OTLPTraceExporter({
        url: `${config.otlpEndpoint}/v1/traces`,
        headers: config.otlpHeaders,
      }), {
        maxQueueSize: 2048,                       // Default 2048 is fine for most agent workloads
        scheduledDelayMillis: 5000,               // 5s batches — tracing adds ~1-3% latency overhead at this setting
      }),
    ],
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: `${config.otlpEndpoint}/v1/metrics`, headers: config.otlpHeaders }),
      exportIntervalMillis: 30000,                // 30s is right for agents — conversations last minutes, not milliseconds
    }),
    instrumentations: [new HttpInstrumentation()], // Auto-traces all outbound HTTP (tool calls, LLM APIs)
  });
 
  sdk.start();
  process.on('SIGTERM', () => sdk.shutdown().then(() => process.exit(0))); // Flush remaining spans on shutdown
  return sdk;
}

Call this once at startup. The BatchSpanProcessor adds roughly 1-3% latency overhead in practice (benchmarked by the OTEL team). That's negligible against LLM call latency. The real cost is storage: at 10,000 conversations/day with 15 spans each, you're generating roughly 2-3GB of trace data daily before sampling.

Connecting Traces to Quality Scores

Traces tell you what happened. Quality scores tell you if it was good. Connect them and you can answer: what do bad conversations look like structurally? This is where observability stops being monitoring and starts being improvement.

  1. The agent runtime exports spans, tagged with conversation_id and agent_id, to the OTEL collector, and sends each transcript plus metadata (same conversation_id) to the quality evaluator.
  2. The evaluator scores the conversation (accuracy, helpfulness, safety). Trace data and quality scores both end up indexed by conversation_id.
  3. The dashboard JOINs on conversation_id: "Show me traces where quality < 0.7."

The query you want: "Show me traces from the last 24 hours where quality score < 0.7, grouped by failure pattern." This is the query that would have caught our stale-data bug before any customer noticed — the 3-4% of conversations returning wrong order statuses would have clustered around one tool span returning suspiciously fast. Maybe every low-quality conversation had a tool timeout on step 3. Maybe memory retrieval returned zero results. Trace data identifies the structural patterns, quality scores identify which patterns matter. Teams running automated quality evaluation with scorecards can correlate trace anomalies with scoring drops automatically.
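In miniature, that join looks like this. The record shapes are illustrative rather than from the conventions, and in production this is a dashboard or warehouse query, not in-memory code; the point is that one shared key makes it a one-liner:

```typescript
// Minimal shapes — your real trace summary and score records will carry more fields.
interface TraceSummary { conversationId: string; toolFailures: number; p99ms: number; }
interface QualityScore { conversationId: string; score: number; }

function lowQualityTraces(traces: TraceSummary[], scores: QualityScore[], threshold = 0.7): TraceSummary[] {
  const badIds = new Set(
    scores.filter((s) => s.score < threshold).map((s) => s.conversationId),
  );
  return traces.filter((t) => badIds.has(t.conversationId)); // JOIN on the shared conversation_id
}
```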

Your Four-Week Rollout

Week 1 is about getting any data flowing. Week 4 is about acting on it automatically. Don't try to build the full pipeline on day one.

  • Week 1: Add OTEL SDK, instrument LLM calls with GenAI attributes
  • Week 1: Add tool execution spans with name, params, success/fail
  • Week 1: Export to at least one backend (Grafana, Braintrust, Langfuse)
  • Week 2: Add memory retrieval spans with result count and age
  • Week 2: Add conversation_id and agent_id to all root spans
  • Week 2: Implement PII redaction in span processor
  • Week 2: Set up P50/P95/P99 latency dashboards segmented by complexity
  • Week 3: Connect quality scores to trace data via conversation_id
  • Week 3: Implement head-based sampling (10-25%) with 100% error capture
  • Week 3: Add alerting on P99 latency and tool failure rate
  • Week 4: Build a low-quality trace investigation workflow
  • Week 4: Set up weekly trace review for pattern detection
  • Week 4: Tune sampling rates based on cost vs coverage needs

From Grep to Trace in 15 Minutes

Remember our 2 AM incident? Here's exactly how it would have played out with proper traces. Quality scores drop on 3-4% of conversations. We filter traces where score < 0.7. Every one shows the same tool.execute span for order lookup completing in 2ms (cache hit) instead of the normal 200ms (live API). We click into the span, see the stale connection pool, fix it. Fifteen minutes, not six hours. No grep. No guessing.

That's what OTEL-standardized tracing changes. Debugging goes from "grep and hope" to "find the trace, see every step, identify the broken span."

The conventions exist. The tooling is mature. The only question is whether you instrument before or after your own 2 AM incident.

If you haven't already, start with our complete guide to what to monitor once your agents go live. It covers drift detection, alerting strategies, and the five pillars of agent observability beyond tracing.

Monitor Your AI Agents in Production

Chanl connects to any voice, chat, or messaging agent. Trace tool calls, evaluate quality with scorecards, and catch degradation before your customers do.

Start monitoring
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

One lesson a week — practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed
