Chanl
Operations

What to Trace When Your AI Agent Hits Production

OpenTelemetry GenAI conventions are the production standard for agent tracing. What to instrument, what to skip, and what breaks — from a 2 AM debugging war story.

Dean Grover, Co-founder
March 20, 2026
18 min read
Watercolor illustration of distributed trace spans flowing through an AI agent pipeline with OpenTelemetry instrumentation

It was 2 AM on a Tuesday, and our production agent was hallucinating order statuses. Not for every customer. Not in a pattern we could spot from dashboards. Just 3-4% of conversations, scattered across timezones, where the agent confidently told customers their orders had shipped when they hadn't. Our error rate was zero. Our latency was nominal. Every health check was green.

We spent six hours grep-ing through unstructured logs before we found it: a tool execution was silently returning cached data from a stale connection pool, and the agent was treating stale data as fresh. The LLM wasn't wrong. The tool wasn't throwing errors. The trace was invisible because we weren't tracing tool executions as first-class spans.

That night taught us something that saved us months later: agent observability isn't about watching the model. It's about watching the entire execution graph — every LLM call, every tool invocation, every memory retrieval, every planning step — as a connected trace. And as of 2026, there's finally a standard way to do it.


The Standard That Changed Everything

OpenTelemetry's GenAI semantic conventions standardized how the industry traces LLM and agent workloads. Released in late 2025 and widely adopted throughout early 2026, they define attribute names, span kinds, and event structures specifically for generative AI. Before this, every observability vendor invented their own schema.

The conventions cover three layers:

  1. LLM call spans — model name, provider, token counts (input/output/total), finish reason, temperature, response format
  2. Tool execution spans — tool name, parameters, result status, execution duration
  3. Agent orchestration spans — planning steps, reasoning chains, memory operations, goal completion

Here's what the attribute namespace looks like in practice:

typescript
// The three namespaces that matter — everything else is optional
const GenAIAttributes = {
  // WHO made the call (swap providers without changing dashboards)
  SYSTEM: 'gen_ai.system',              // 'openai', 'anthropic', 'google'
  REQUEST_MODEL: 'gen_ai.request.model', // 'claude-sonnet-4-20250514'
 
  // WHAT it cost (the only attributes your CFO cares about)
  INPUT_TOKENS: 'gen_ai.usage.input_tokens',
  OUTPUT_TOKENS: 'gen_ai.usage.output_tokens',
 
  // WHERE in the agent graph (connects LLM calls to business logic)
  AGENT_ID: 'gen_ai.agent.id',
  TOOL_NAME: 'gen_ai.tool.name',
};

The adoption was fast. Braintrust shipped OTEL-native tracing in Q1 2026. Arize Phoenix built their entire trace viewer around the conventions. Langfuse added OTEL ingestion alongside their native SDK. Grafana Cloud released an AI observability plugin that reads GenAI spans natively. This convergence means you instrument once and export to any backend.

Before the conventions, switching from Langfuse to Grafana meant re-instrumenting every trace point. Now you swap an exporter.

The Four Spans Every Agent Needs

Every production agent trace needs four span types: LLM calls, tool executions, memory operations, and agent orchestration. Miss any one of these and you'll have blind spots that only surface as customer complaints.

Here's the span hierarchy for a single agent turn:

User: "What's the status of my order?"

agent.turn (root span for the whole turn)
├─ memory.retrieve        → last 5 messages + customer context
├─ gen_ai.chat (planning) → intent: order_status, need: order_id
├─ tool.execute           → lookup_order(customer_id="cust_123") returns {status: "shipped", tracking: "1Z999..."}
├─ gen_ai.chat (response) → "Your order shipped yesterday..."
└─ memory.store           → stores the turn for the next retrieval

That's six spans for one conversation turn. Multiply by 5-10 turns per conversation, and you understand why production agents generate 10-100x more telemetry than traditional APIs.

Instrumenting LLM Calls

The LLM span is the most straightforward. Wrap every model call with a span that captures the GenAI attributes:

typescript
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
 
const tracer = trace.getTracer('agent-service', '1.0.0');
 
async function tracedLLMCall(options: LLMCallOptions, callFn: () => Promise<LLMResult>) {
  // 'gen_ai.chat' — the standard span name. Backends key off this for AI-specific views.
  return tracer.startActiveSpan('gen_ai.chat', {
    kind: SpanKind.CLIENT, // CLIENT because we're calling an external LLM API
    attributes: {
      'gen_ai.system': options.provider,        // Makes multi-provider dashboards work
      'gen_ai.request.model': options.model,    // Track cost per model, not per request
      'gen_ai.agent.id': options.agentId,       // Links this LLM call to the agent that made it
      'app.conversation_id': options.conversationId, // YOUR join key — not in the spec, but essential
    },
  }, async (span) => {
    try {
      const result = await callFn();
      // Token counts AFTER the call — you don't know output tokens until the response arrives
      span.setAttributes({
        'gen_ai.usage.input_tokens': result.usage.inputTokens,
        'gen_ai.usage.output_tokens': result.usage.outputTokens,
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error); // Attaches stack trace to the span for debugging
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message }); // Without this, failed calls look OK in trace views
      throw error;
    } finally {
      span.end(); // ALWAYS end in finally — leaked spans poison your trace tree
    }
  });
}

Instrumenting Tool Executions

Tool spans are where most debugging value lives. A clean LLM call with a broken tool execution produces a confident, wrong answer — exactly our 2 AM incident. If we'd had this span, we would have seen app.tool.success: true with stale data in the response, and the duration_ms would have been suspiciously fast (cache hit, not a real API call). Thirty seconds of trace inspection instead of six hours of grep.

typescript
async function tracedToolExecution(options: ToolCallOptions, executeFn: () => Promise<unknown>) {
  return tracer.startActiveSpan('tool.execute', {
    kind: SpanKind.INTERNAL, // INTERNAL, not CLIENT — the agent is calling its own tool
    attributes: {
      'gen_ai.tool.name': options.toolName,       // Standard attr: backends auto-group by tool
      'gen_ai.agent.id': options.agentId,
      'app.conversation_id': options.conversationId,
      // Redact BEFORE setting — you can't un-export PII from your observability backend
      'app.tool.parameters': JSON.stringify(redactSensitiveFields(options.parameters)),
    },
  }, async (span) => {
    const start = performance.now();
    try {
      const result = await executeFn();
      // success=true is what you'll filter on: "show me all tool failures this week"
      span.setAttributes({ 'app.tool.success': true, 'app.tool.duration_ms': Math.round(performance.now() - start) });
      return result;
    } catch (error) {
      // error_type lets you distinguish timeout vs auth vs data errors in dashboards
      span.setAttributes({ 'app.tool.success': false, 'app.tool.error_type': (error as Error).constructor.name });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

The app.* attributes live outside the convention namespace. The spec gives you the structure; you fill in the domain-specific debugging fields.
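The redactSensitiveFields helper used in the tool span above isn't defined by the spec, and the article doesn't pin down its shape. A minimal sketch, assuming key-based redaction with recursion into nested parameter objects; the key list is an assumption you'd extend with your own domain's sensitive field names:

```typescript
// Keys to scrub from tool parameters before they reach a span attribute.
// This list is illustrative — add your domain's fields (MRN, account_number, etc.)
const SENSITIVE_KEYS = new Set(['ssn', 'password', 'credit_card', 'email', 'phone', 'api_key']);

function redactSensitiveFields(params: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(params)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = '[REDACTED]'; // Drop the value, keep the key so the span still shows what was passed
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = redactSensitiveFields(value as Record<string, unknown>); // Recurse into nested params
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Key-based redaction here plus pattern-based redaction in the span processor (shown later) gives you two layers: named fields are caught at the source, and anything that slips through in free-text values gets caught before export.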

Instrumenting Memory Operations

Memory retrieval spans close the last major blind spot. When an agent pulls stale context or misses relevant history, the conversation goes sideways in ways that look like model failures but aren't.

typescript
async function tracedMemoryRetrieval(agentId: string, conversationId: string, retrieveFn: () => Promise<MemoryResult[]>) {
  return tracer.startActiveSpan('memory.retrieve', {
    attributes: { 'gen_ai.agent.id': agentId, 'app.conversation_id': conversationId },
  }, async (span) => {
    try {
      const results = await retrieveFn();
      span.setAttributes({
        'app.memory.results_count': results.length,        // Zero results = agent is flying blind
        'app.memory.oldest_result_age_hours': results[0]   // Stale memory is WORSE than no memory —
          ? Math.round((Date.now() - new Date(results[0].createdAt).getTime()) / 3600000) // the agent trusts outdated facts
          : 0,
      });
      return results;
    } catch (error) {
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end(); // End in finally here too — a failed retrieval must not leak the span
    }
  });
}

The oldest_result_age_hours attribute saved us twice — the same class of bug as our 2 AM stale-data incident, but in the memory layer. Both times, retrieval was returning results from weeks ago that had drifted out of relevance. Without that attribute, those conversations just looked like bad model outputs. For deeper patterns on building persistent memory systems, the architecture matters as much as the observability.

What to Skip (and Why)

Not everything deserves a span. Trace the decision points, skip the plumbing. A single 5-step agent turn generates 15-30 spans. At 10,000 daily conversations with 8 turns each, that's 1.2-2.4 million spans per day. Teams that trace every internal function call routinely hit $2,000-5,000/month in observability costs before they realize what happened.

Here's the practical filter:

| Trace This | Skip This |
| --- | --- |
| Every LLM call (model, tokens, latency) | Internal prompt template rendering |
| Every tool execution (name, params, success/fail) | HTTP connection pooling details |
| Memory retrievals (count, age, relevance) | Individual vector similarity scores |
| Agent planning decisions (which step, why) | Token-level streaming events |
| Conversation-level metrics (turns, duration) | Per-message serialization/deserialization |
| Error and retry events | Successful health checks |
| Sampling decisions (why this trace was kept) | Cache hit/miss for static config |

The pattern: trace anything that affects the agent's output or the user's experience. Skip anything that's infrastructure plumbing. If you can't explain why a span would help you debug a customer complaint, don't create it.
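That filter can live in code as a predicate consulted before span creation. A minimal sketch, assuming the span-name prefixes used throughout this article; the skip list is illustrative and belongs in config, not hardcoded:

```typescript
// Decision points get spans; plumbing doesn't. Prefixes match this article's naming
// (gen_ai.chat, tool.execute, memory.retrieve, agent.turn) — adapt to yours.
const TRACED_PREFIXES = ['gen_ai.', 'tool.', 'memory.', 'agent.'];
const SKIPPED_SPANS = new Set(['prompt.render', 'health.check', 'http.connection_pool']);

function shouldCreateSpan(spanName: string): boolean {
  if (SKIPPED_SPANS.has(spanName)) return false;              // Known plumbing: never trace
  return TRACED_PREFIXES.some((p) => spanName.startsWith(p)); // Decision points: always trace
}
```

An allow-list of prefixes fails closed: a new internal helper doesn't silently start generating spans until someone deliberately puts it in a traced namespace.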

Sampling Strategy

At scale, you can't afford 100% trace collection. Here's the sampling configuration that balances cost and coverage:

typescript
import type { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  SamplingDecision,
  type Sampler,
  type SamplingResult,
} from '@opentelemetry/sdk-trace-base';
 
class AgentSampler implements Sampler {
  private baseSampler: ParentBasedSampler;
  constructor(sampleRate: number) {
    // ParentBased = child spans inherit the parent's sampling decision (keeps traces whole)
    this.baseSampler = new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(sampleRate) });
  }
 
  shouldSample(context: Context, traceId: string, spanName: string, spanKind: SpanKind, attrs: Attributes, links: Link[]): SamplingResult {
    // RULE 1: Never drop a tool failure. The cost of missing a failure pattern >> storage cost.
    if (attrs['app.tool.success'] === false) return { decision: SamplingDecision.RECORD_AND_SAMPLED };
 
    // RULE 2: Always keep slow conversations. These are your hardest customer problems.
    if ((attrs['app.conversation.duration_ms'] as number) > 10000) return { decision: SamplingDecision.RECORD_AND_SAMPLED };
 
    // RULE 3: Everything else at the configured rate (10-25% in prod).
    return this.baseSampler.shouldSample(context, traceId, spanName, spanKind, attrs, links);
  }
 
  toString(): string { return 'AgentSampler'; }
}

For most teams: 100% in staging, 10-25% in production, 100% on errors regardless. At 10% sampling on 10,000 daily conversations, you're looking at roughly 50-100GB of trace data per month. Adjust upward if you're under 1,000 conversations/day since you need the data density more than cost savings.
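The storage math above can be sanity-checked with a back-of-envelope estimator. The 16KB-per-span default is an assumption on my part (spans carrying prompt and response payloads typically run 10-30KB; bare metadata spans are closer to 1-2KB); measure your own average before trusting the output:

```typescript
// Rough monthly trace volume in decimal GB. All inputs are per-day figures.
function monthlyTraceGB(opts: {
  conversationsPerDay: number;
  turnsPerConversation: number;
  spansPerTurn: number;
  sampleRate: number;      // 0.1 = 10% head-based sampling
  bytesPerSpan?: number;   // ASSUMPTION: 16KB default — measure yours
}): number {
  const bytesPerSpan = opts.bytesPerSpan ?? 16384;
  const spansPerDay =
    opts.conversationsPerDay * opts.turnsPerConversation * opts.spansPerTurn * opts.sampleRate;
  return (spansPerDay * bytesPerSpan * 30) / 1e9; // 30-day month
}

// monthlyTraceGB({ conversationsPerDay: 10000, turnsPerConversation: 8, spansPerTurn: 15, sampleRate: 0.1 })
// ≈ 59 GB — inside the 50-100GB range quoted above
```

Running your real numbers through this before picking a backend tier is cheaper than discovering the overage on the first invoice.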

Why P99 Matters More Than P50

P99 latency matters more than P50 because the slowest conversations are where your agent handles the hardest problems. A P50 of 1.2 seconds looks great. A P99 of 14 seconds means 1 in 100 users is waiting through multi-step reasoning, multiple tool calls, and retry loops. Those aren't random slow paths. They're the billing disputes, escalation flows, and compliance-sensitive questions. Your highest-value interactions are getting your worst latency.

Track percentiles per agent, per complexity tier:

typescript
const meter = metrics.getMeter('agent-service');
// Histogram, not gauge — you need percentile buckets, not just averages
const turnLatency = meter.createHistogram('agent.turn.duration', { unit: 'ms' });
 
function recordTurnLatency(durationMs: number, attrs: { agentId: string; toolCallCount: number; llmCallCount: number }) {
  // THIS is the key insight: segment by complexity so P99 means something
  const totalSteps = attrs.toolCallCount + attrs.llmCallCount;
  const tier = totalSteps <= 2 ? 'simple' : totalSteps <= 5 ? 'moderate' : 'complex';
 
  turnLatency.record(durationMs, {
    'agent.id': attrs.agentId,
    'agent.complexity_tier': tier,  // Without this, your "simple" P50 hides your "complex" P99
  });
}

When you segment by complexity tier, the picture changes. Your "simple" tier P99 might be 2 seconds. Your "complex" tier P99 might be 18 seconds. That's the gap where optimization actually matters, and it's invisible if you're only looking at aggregate latency. The same principle applies to agent analytics dashboards — aggregate numbers hide the conversations that need attention most.

Choosing an Observability Backend

The market split into two camps: AI-native observability tools and infrastructure tools that added AI support. Both speak OTEL. The difference is what they do after ingestion.

| Tool | OTEL GenAI Support | Strengths | Gaps | Best For |
| --- | --- | --- | --- | --- |
| Braintrust | Native (built on OTEL) | Evals, prompt versioning, dataset management, scoring | No infrastructure metrics, no alerting | Teams focused on model quality iteration |
| Arize Phoenix | Native ingestion | Trace visualization, embedding drift, LLM evaluation | Lighter on alerting, newer ecosystem | ML teams wanting deep model analytics |
| Langfuse | OTEL + native SDK | Open-source, prompt management, cost tracking, evals | Self-hosted complexity, smaller ecosystem | Teams wanting full control, budget-conscious |
| Grafana Cloud | OTEL plugin | Correlates AI traces with infra metrics, mature alerting | AI-specific features are newer, less eval depth | Teams already on Grafana stack |
| Datadog | OTEL + native APM | Full-stack correlation, mature alerting, broad integrations | AI features are add-on, pricing at scale | Enterprise teams with existing Datadog |
| Honeycomb | OTEL native | Query flexibility, high-cardinality support, trace analysis | No built-in AI eval, no prompt management | Debugging-focused teams |

The practical answer for most teams: use an AI-native tool (Braintrust or Langfuse) for evals and prompt iteration, plus your existing infrastructure tool (Grafana, Datadog) for alerting and correlation. OTEL makes this dual-export trivial:

typescript
const provider = new NodeTracerProvider({ sampler: new AgentSampler(0.1) });
 
// Destination 1: AI-native tool for evals, prompt versioning, scoring
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({
    url: 'https://otel.braintrust.dev/v1/traces',          // Braintrust, Langfuse, or Arize
    headers: { Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}` },
  })
));
 
// Destination 2: Infra tool for alerting, dashboards, on-call correlation
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({
    url: process.env.GRAFANA_OTLP_ENDPOINT,                // Grafana, Datadog, or Honeycomb
    headers: { Authorization: `Basic ${process.env.GRAFANA_OTLP_TOKEN}` },
  })
));
 
provider.register(); // One instrumentation, two backends, zero vendor lock-in

Same instrumentation code, two destinations, different purposes. That's the OTEL value proposition in one code block.

Five Ways Production Breaks You

These five patterns account for most production observability failures we've seen across agent teams.

1. Trace Explosion from Retry Loops

Agent frameworks that auto-retry on LLM errors can generate hundreds of spans in a single conversation. An LLM rate limit error triggers a retry, which triggers another rate limit, which triggers another retry. Each creates a full span tree. We've seen a single conversation produce 847 spans during an OpenAI outage. Your trace storage bill quadruples overnight, and the tracing overhead itself starts adding latency to an already degraded system.

Fix: cap retries at the instrumentation level, not just the business logic level. Add a retry_count attribute and stop creating child spans after retry 3.
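Here's one way to sketch that cap. retryWithCappedSpans is a hypothetical wrapper, not part of any framework; the span-creating callback stands in for tracedLLMCall from earlier, and the caller would set the retryCount it receives as an app.retry_count span attribute:

```typescript
// Past MAX_TRACED_RETRIES we keep retrying, but stop opening new child spans —
// a rate-limit storm can't explode the trace tree.
const MAX_TRACED_RETRIES = 3;

async function retryWithCappedSpans<T>(
  attempt: () => Promise<T>,
  withSpan: (retryCount: number, fn: () => Promise<T>) => Promise<T>, // wraps attempt in a span, e.g. via tracedLLMCall
  maxRetries = 6,
): Promise<T> {
  let lastError: unknown;
  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      // First few attempts get spans (tagged with retryCount); the rest run untraced.
      return retryCount < MAX_TRACED_RETRIES
        ? await withSpan(retryCount, attempt)
        : await attempt();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Note the cap lives in the instrumentation wrapper, not the business logic: even if a framework upgrade changes retry behavior underneath you, span volume stays bounded.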

2. Sensitive Data in Span Attributes

Tool parameters and LLM prompts often contain customer PII. Standard OTEL exporters will happily ship those to your observability backend. Your security team will not be happy.

Fix: implement attribute redaction in your span processor, before export:

typescript
// Runs BEFORE export — last chance to strip PII before it leaves your infra
class RedactingSpanProcessor implements SpanProcessor {
  // Catches the obvious stuff. Add your domain patterns (MRN, account numbers, etc.)
  private sensitivePatterns = [
    /\b\d{3}-?\d{2}-?\d{4}\b/g,    // SSN — tool params often contain these
    /\b\d{13,19}\b/g,               // Credit card numbers
    /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, // Email — shows up in agent prompts constantly
  ];
 
  onEnd(span: ReadableSpan): void {
    // Scan every string attribute — tool.parameters is the biggest risk
    for (const [key, value] of Object.entries(span.attributes)) {
      if (typeof value !== 'string') continue;
      let redacted = value;
      for (const pattern of this.sensitivePatterns) redacted = redacted.replace(pattern, '[REDACTED]');
      if (redacted !== value) (span.attributes as Record<string, unknown>)[key] = redacted;
    }
  }
 
  onStart(): void {} // Required by interface
  forceFlush(): Promise<void> { return Promise.resolve(); }
  shutdown(): Promise<void> { return Promise.resolve(); }
}

3. Missing Correlation Between Traces and Quality Scores

You have traces in Grafana and quality scores from your scorecard evaluations. But they're in separate systems with no join key. When a conversation scores poorly, you can't find its trace. When a trace looks slow, you can't find its quality score.

Fix: add conversation_id and agent_id as attributes on every root span. Use the same IDs in your quality evaluation pipeline. This is trivially simple but almost everyone forgets it until they need it.

4. Context Propagation Across Tool Boundaries

When your agent calls an external tool via HTTP, the trace context should propagate so the tool execution appears as a child span. But many tool servers don't support W3C trace context headers. Your trace shows the agent calling a tool, then a gap, then the response.

Fix: at minimum, create a client-side span for the tool call duration even if the server doesn't participate in the trace. The timing and success/failure data is still valuable. For tools you control, add the OTEL HTTP instrumentation middleware.
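A self-contained sketch of the outbound side, assuming the W3C Trace Context header format (version-traceid-spanid-flags, lowercase hex). In a real service you'd call propagation.inject() from @opentelemetry/api rather than hand-building the header; this version just shows what goes over the wire:

```typescript
// traceparent per the W3C Trace Context spec: "00-<trace-id>-<span-id>-<flags>"
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Attach it to every tool request. Servers that don't speak trace context ignore it;
// the day the tool adds OTEL support, your traces become end-to-end for free.
// Wrap the call itself in a client-side span (tracedToolExecution above) either way,
// so duration and success/failure are never lost.
function toolRequestHeaders(traceId: string, spanId: string): Record<string, string> {
  return {
    'content-type': 'application/json',
    traceparent: buildTraceparent(traceId, spanId, true),
  };
}
```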

5. Cardinality Explosion from Dynamic Tool Names

If your agent framework uses dynamic tool names (e.g., query_table_orders, query_table_customers), each unique tool name becomes a new metric series. With 50 tables, you have 50x the cardinality. With 500, your metrics backend starts rejecting data.

Fix: use a static tool name attribute (gen_ai.tool.name: "query_table") with the dynamic part as a separate attribute (app.tool.target: "orders"). This keeps cardinality bounded while preserving debugging detail.
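A sketch of that split. The query_table_ regex matches the example naming above; adapt it to whatever pattern your framework generates:

```typescript
// Bounded name for metrics, unbounded detail as a separate trace attribute.
function splitToolName(dynamicName: string): { name: string; target?: string } {
  const match = dynamicName.match(/^(query_table)_(.+)$/); // ASSUMPTION: query_table_<name> pattern
  if (!match) return { name: dynamicName };                // Static tool names pass through untouched
  return { name: match[1], target: match[2] };             // 'query_table' + 'orders'
}

function toolSpanAttributes(dynamicName: string): Record<string, string> {
  const { name, target } = splitToolName(dynamicName);
  return {
    'gen_ai.tool.name': name,                              // One metric series per tool *family*
    ...(target ? { 'app.tool.target': target } : {}),      // High-cardinality detail stays off the metric path
  };
}
```

The trace keeps full debugging detail (you can still filter spans by app.tool.target), while the metrics backend only ever sees the bounded family name.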

The Complete Setup

A production pipeline needs four components: tracer provider, metric meter, span processors with redaction, and exporters. Here's the complete setup.

typescript
// Call ONCE at service startup, BEFORE any agent code runs
function initTelemetry(config: { serviceName: string; otlpEndpoint: string; otlpHeaders: Record<string, string>; sampleRate: number }) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: config.serviceName,
      'deployment.environment': process.env.NODE_ENV ?? 'development',
    }),
    sampler: new AgentSampler(config.sampleRate),
    spanProcessors: [
      new RedactingSpanProcessor(),              // Strip PII BEFORE batching (order matters)
      new BatchSpanProcessor(new OTLPTraceExporter({
        url: `${config.otlpEndpoint}/v1/traces`,
        headers: config.otlpHeaders,
      }), {
        maxQueueSize: 2048,                       // Default 2048 is fine for most agent workloads
        scheduledDelayMillis: 5000,               // 5s batches — tracing adds ~1-3% latency overhead at this setting
      }),
    ],
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: `${config.otlpEndpoint}/v1/metrics`, headers: config.otlpHeaders }),
      exportIntervalMillis: 30000,                // 30s is right for agents — conversations last minutes, not milliseconds
    }),
    instrumentations: [new HttpInstrumentation()], // Auto-traces all outbound HTTP (tool calls, LLM APIs)
  });
 
  sdk.start();
  process.on('SIGTERM', () => sdk.shutdown().then(() => process.exit(0))); // Flush remaining spans on shutdown
  return sdk;
}

Call this once at startup. The BatchSpanProcessor adds roughly 1-3% latency overhead in practice (benchmarked by the OTEL team). That's negligible against LLM call latency. The real cost is storage: at 10,000 conversations/day with 15 spans each, you're generating roughly 2-3GB of trace data daily before sampling.

Connecting Traces to Quality Scores

Traces tell you what happened. Quality scores tell you if it was good. Connect them and you can answer: what do bad conversations look like structurally? This is where observability stops being monitoring and starts being improvement.

  1. The agent runtime exports spans, tagged with conversation_id and agent_id, to the OTEL collector, and sends each transcript plus metadata (same conversation_id) to the quality evaluator.
  2. The evaluator scores the conversation (accuracy, helpfulness, safety). Trace data and quality scores both end up indexed by conversation_id.
  3. The dashboard JOINs on conversation_id: "Show me traces where quality < 0.7."

The query you want: "Show me traces from the last 24 hours where quality score < 0.7, grouped by failure pattern." This is the query that would have caught our stale-data bug before any customer noticed — the 3-4% of conversations returning wrong order statuses would have clustered around one tool span returning suspiciously fast. Maybe every low-quality conversation had a tool timeout on step 3. Maybe memory retrieval returned zero results. Trace data identifies the structural patterns, quality scores identify which patterns matter. Teams running automated quality evaluation with scorecards can correlate trace anomalies with scoring drops automatically.
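In miniature, that join looks like this. The record shapes are illustrative rather than from the conventions, and in production this is a dashboard or warehouse query, not in-memory code; the point is that one shared key makes it a one-liner:

```typescript
// Minimal shapes — your real trace summary and score records will carry more fields.
interface TraceSummary { conversationId: string; toolFailures: number; p99ms: number; }
interface QualityScore { conversationId: string; score: number; }

function lowQualityTraces(traces: TraceSummary[], scores: QualityScore[], threshold = 0.7): TraceSummary[] {
  const badIds = new Set(
    scores.filter((s) => s.score < threshold).map((s) => s.conversationId),
  );
  return traces.filter((t) => badIds.has(t.conversationId)); // JOIN on the shared conversation_id
}
```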

Your Four-Week Rollout

Week 1 is about getting any data flowing. Week 4 is about acting on it automatically. Don't try to build the full pipeline on day one.

  • Week 1: Add OTEL SDK, instrument LLM calls with GenAI attributes
  • Week 1: Add tool execution spans with name, params, success/fail
  • Week 1: Export to at least one backend (Grafana, Braintrust, Langfuse)
  • Week 2: Add memory retrieval spans with result count and age
  • Week 2: Add conversation_id and agent_id to all root spans
  • Week 2: Implement PII redaction in span processor
  • Week 2: Set up P50/P95/P99 latency dashboards segmented by complexity
  • Week 3: Connect quality scores to trace data via conversation_id
  • Week 3: Implement head-based sampling (10-25%) with 100% error capture
  • Week 3: Add alerting on P99 latency and tool failure rate
  • Week 4: Build a low-quality trace investigation workflow
  • Week 4: Set up weekly trace review for pattern detection
  • Week 4: Tune sampling rates based on cost vs coverage needs

From Grep to Trace in 15 Minutes

Remember our 2 AM incident? Here's exactly how it would have played out with proper traces. Quality scores drop on 3-4% of conversations. We filter traces where score < 0.7. Every one shows the same tool.execute span for order lookup completing in 2ms (cache hit) instead of the normal 200ms (live API). We click into the span, see the stale connection pool, fix it. Fifteen minutes, not six hours. No grep. No guessing.

That's what OTEL-standardized tracing changes. Debugging goes from "grep and hope" to "find the trace, see every step, identify the broken span."

The conventions exist. The tooling is mature. The only question is whether you instrument before or after your own 2 AM incident.

If you haven't already, start with our complete guide to what to monitor once your agents go live. It covers drift detection, alerting strategies, and the five pillars of agent observability beyond tracing.

Monitor Your AI Agents in Production

Chanl connects to any voice, chat, or messaging agent. Trace tool calls, evaluate quality with scorecards, and catch degradation before your customers do.

Start monitoring
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

One lesson a week — practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed
