
Build an AI Agent Observability Pipeline from Scratch

Build a production observability pipeline for AI agents using TypeScript and the Chanl SDK. Covers metrics, traces, quality scoring, drift detection, and alerting.

Dean Grover, Co-founder
March 31, 2026
22 min read

Your AI agent passed every test in staging. The demo went well. Leadership signed off. You pushed it to production, and for two weeks it handled 500 conversations a day without a single alert firing.

Then a customer tweets that your agent confidently quoted a return policy you retired three months ago. You check the logs. No errors. No exceptions. HTTP 200 on every request. Latency under a second. By every traditional metric, the system was healthy the entire time. It was just wrong, confidently and repeatedly, for who knows how long.

Microsoft's security team named this the observability gap in March 2026: "Traditional monitoring, built around uptime, latency, and error rates, can miss the root cause and provide limited signal for attribution or reconstruction in AI-related scenarios." Your Datadog dashboard says green. Your customers say broken.

This article builds a working observability pipeline in TypeScript using the Chanl SDK. Not a monitoring checklist. Actual code that pulls real metrics, scores quality, detects drift, and triggers alerts. By the end, you'll have a system that answers the question traditional monitoring can't: is the agent actually behaving correctly?

| What you'll build | Why it matters |
| --- | --- |
| Metrics collector | Pull latency, token cost, and call volume from production conversations |
| Quality scorer | Automatically evaluate agent output against defined criteria |
| Drift detector | Catch gradual degradation before customers notice |
| Alert pipeline | Composite signals that fire on real incidents, not noise |
| Full dashboard loop | Pull metrics, score, detect, alert, all in one pipeline |

Prerequisites and setup

You'll need Node.js 20+, TypeScript 5+, and a Chanl account with API access. All code examples are TypeScript, and we'll use the Chanl SDK for data access throughout.

Install the SDK:

bash
npm install @chanl/sdk

Set your environment:

bash
export CHANL_API_KEY=your-api-key
export CHANL_BASE_URL=https://platform.chanl.ai

Initialize the client:

typescript
import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});

If you're new to agent observability concepts, AI Agent Observability: What to Monitor When Your Agent Goes Live covers the foundational thinking. This article assumes you understand why observability matters and focuses on building the pipeline.

Why traditional APM fails for AI agents

Traditional Application Performance Monitoring was built for deterministic systems. A REST API either returns the right data or throws an error. A database query either succeeds or times out. The same input produces the same output, and a failure looks like a failure.

AI agents broke that contract. The same question asked twice produces two different answers, both potentially correct, both phrased differently. A "successful" HTTP 200 response might contain completely hallucinated data. An agent might be technically "up" while stuck in an infinite reasoning loop burning $50 per minute in API costs. OneUptime documented a case where an agent hallucinated product features that don't exist, and nobody noticed for three days because the response format looked correct.

| Dimension | Traditional APM | AI Agent Observability |
| --- | --- | --- |
| Core question | Is it running? | Is it behaving correctly? |
| Failure mode | Errors, timeouts, crashes | Confident wrong answers, policy violations, hallucinations |
| Input/output | Deterministic (same in, same out) | Non-deterministic (same in, different out) |
| Success signal | HTTP 200, low latency | Correct, complete, policy-adherent response |
| What breaks | Infrastructure | Reasoning, context, tool selection |
| Trace complexity | 2-3 spans per request | 8-15 spans per request (LLM + tools + memory) |
| Cost model | Fixed compute | Variable per-token, per-call |
| Drift risk | Low (code doesn't change itself) | High (model behavior shifts, knowledge goes stale) |
| Quality measurement | Binary pass/fail | Multi-criteria scored evaluation |

The 2026 Elastic observability report found that 85% of organizations now use some form of GenAI for observability, but the majority are applying GenAI to existing monitoring rather than building observability for their AI systems. That's the gap this article fills.

What are the four pillars of AI agent observability?

Agent observability has four pillars: metrics, logs, traces, and quality. The first three are borrowed from traditional observability. The fourth, quality, is what makes agent monitoring different. Without quality scoring, you're monitoring a system whose primary failure mode is invisible to the other three pillars.

[Diagram] The four pillars of AI agent observability and how they connect: Metrics (volume, latency, cost), Logs (structured events, tool calls), Traces (multi-step reasoning chains), and Quality (scorecard evaluations, drift) feed a dashboard, an alert engine (composite signals to PagerDuty / Slack), and a weekly trend review.

Pillar 1: Metrics

Metrics are the numbers that tell you what's happening at a glance. For AI agents, the critical metrics are different from traditional services.

Latency isn't just response time. It's end-to-end conversation turn latency, which includes LLM inference, tool calls, memory retrieval, and response formatting. Track p50, p95, and p99 separately. An agent might respond in 800ms for 90% of queries but take 12 seconds for complex ones requiring multiple tool calls. The average looks fine. The tail doesn't.

Token cost is the new compute bill. Unlike traditional services where compute is relatively fixed, every agent conversation has a variable cost based on prompt length, conversation history, and the number of reasoning steps. A cost spike might signal a reasoning loop or bloated context window.

Call volume and success rate tell you throughput and reliability. But for agents, "success" needs redefinition. An API call that returns 200 but gives a wrong answer isn't a success. You need to pair success rate with quality scores, which we'll build in Pillar 4.

Here's how to pull these metrics from production using the Chanl SDK:

typescript
import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});
 
// Pull aggregate metrics for the last 24 hours
const metrics = await chanl.calls.getMetrics({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});
 
console.log("Latency p50:", metrics.latency.p50, "ms");
console.log("Latency p95:", metrics.latency.p95, "ms");
console.log("Total calls:", metrics.totalCalls);
console.log("Avg tokens/call:", metrics.tokenUsage.average);
console.log("Total cost:", `$${metrics.cost.total.toFixed(2)}`);

The getMetrics method returns pre-aggregated data. This is important. You're not fetching 500 calls and computing averages client-side. The backend does the aggregation, which means this call is fast even when you're pulling metrics across millions of conversations.

For more granular analysis, pull individual call data:

typescript
// Get a specific call's full detail
const call = await chanl.calls.get("call_abc123");
 
console.log("Duration:", call.duration, "seconds");
console.log("Turns:", call.turns);
console.log("Tool calls:", call.toolCalls?.length ?? 0);
console.log("Total tokens:", call.tokenUsage?.total);
console.log("Status:", call.status);

Pillar 2: Logs

Structured logging for AI agents means capturing every decision point in a format you can query later. The critical insight from OneUptime's analysis is that the highest-value approach is instrumenting decision points, not just request/response boundaries.

For each conversation turn, you want to capture:

typescript
interface AgentLogEntry {
  timestamp: string;
  conversationId: string;
  turnIndex: number;
  phase: "reasoning" | "tool_call" | "tool_result" | "response";
  input: string;
  output: string;
  toolName?: string;
  toolSuccess?: boolean;
  latencyMs: number;
  tokenCount: number;
  metadata: Record<string, unknown>;
}
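As a minimal sketch of emitting one of these entries, assuming JSON lines to stdout (the logTurn helper is illustrative, not part of the Chanl SDK; swap the output for your log shipper in production):

```typescript
// Emit one conversation turn as a structured JSON log line. The shape
// mirrors the AgentLogEntry interface above; logging to stdout is an
// assumption -- route the line to your log pipeline in production.
function logTurn(entry: {
  conversationId: string;
  turnIndex: number;
  phase: "reasoning" | "tool_call" | "tool_result" | "response";
  input: string;
  output: string;
  latencyMs: number;
  tokenCount: number;
}): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...entry });
  console.log(line); // one JSON object per line, easy to query later
  return line;
}
```

Because every line is a self-describing JSON object, you can later filter by conversationId or phase without parsing free-form text.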

With the Chanl SDK, you can pull the full transcript for any call, which includes tool calls, agent reasoning, and customer messages in order:

typescript
// Pull full conversation transcript
const transcript = await chanl.calls.getTranscript("call_abc123");
 
for (const entry of transcript.entries) {
  console.log(`[${entry.role}] ${entry.content}`);
 
  if (entry.toolCall) {
    console.log(`  Tool: ${entry.toolCall.name}`);
    console.log(`  Args: ${JSON.stringify(entry.toolCall.arguments)}`);
    console.log(`  Result: ${entry.toolCall.result?.status}`);
  }
}

The transcript gives you the raw material for debugging. When a customer reports a problem, you don't grep through CloudWatch logs hoping to find the right request. You pull the transcript, see every step the agent took, and identify exactly where the reasoning went wrong.

Pillar 3: Traces

Traces connect the dots between individual log entries. A single conversation turn might involve an LLM reasoning step, a knowledge base query, two tool calls, and a final response generation. Traces show you the causal chain: which step caused which, what ran in parallel, and where time was spent.

OpenTelemetry's experimental semantic conventions for generative AI now cover model calls, tool executions, and agent planning steps. According to the 2026 Elastic report, OTel production adoption tripled from 2025 to 2026, and 89% of production users consider OTel compliance at least very important.

Here's a basic trace structure for an agent conversation turn:

typescript
import { trace, SpanKind } from "@opentelemetry/api";
 
const tracer = trace.getTracer("agent-observability");
// retrieveContext, generateResponse, and executeToolCall are placeholders
// for your agent's own retrieval, LLM, and tool-execution layers.
 
async function traceConversationTurn(
  conversationId: string,
  userMessage: string
) {
  return tracer.startActiveSpan(
    "agent.conversation_turn",
    { kind: SpanKind.SERVER },
    async (rootSpan) => {
      rootSpan.setAttribute("conversation.id", conversationId);
      rootSpan.setAttribute("user.message.length", userMessage.length);
 
      // Trace knowledge retrieval
      const context = await tracer.startActiveSpan(
        "agent.knowledge_retrieval",
        async (span) => {
          const results = await retrieveContext(userMessage);
          span.setAttribute("retrieval.results_count", results.length);
          span.setAttribute("retrieval.top_score", results[0]?.score ?? 0);
          span.end();
          return results;
        }
      );
 
      // Trace LLM reasoning
      const response = await tracer.startActiveSpan(
        "agent.llm_reasoning",
        async (span) => {
          const result = await generateResponse(userMessage, context);
          span.setAttribute("llm.model", result.model);
          span.setAttribute("llm.tokens.prompt", result.promptTokens);
          span.setAttribute("llm.tokens.completion", result.completionTokens);
          span.setAttribute("llm.tool_calls", result.toolCalls?.length ?? 0);
          span.end();
          return result;
        }
      );
 
      // Trace each tool call
      for (const toolCall of response.toolCalls ?? []) {
        await tracer.startActiveSpan(
          `agent.tool_call.${toolCall.name}`,
          async (span) => {
            span.setAttribute("tool.name", toolCall.name);
            span.setAttribute(
              "tool.arguments",
              JSON.stringify(toolCall.arguments)
            );
            const result = await executeToolCall(toolCall);
            span.setAttribute("tool.success", result.success);
            span.setAttribute("tool.latency_ms", result.latencyMs);
            span.end();
          }
        );
      }
 
      rootSpan.setAttribute("response.length", response.text.length);
      rootSpan.end();
      return response;
    }
  );
}

The key insight is that you're creating parent-child span relationships. When you look at a trace in Jaeger or your tracing backend, you see the full tree: conversation turn at the root, with knowledge retrieval, LLM reasoning, and tool calls as children. If a tool call takes 4 seconds, you see it immediately. If knowledge retrieval returns zero results and the agent hallucinates, you see the empty retrieval and the hallucinated response connected in the same trace.

Pillar 4: Quality

This is the pillar that traditional observability doesn't have, and it's the one that matters most for AI agents. Quality scoring answers the question that metrics, logs, and traces can't: was the agent's output actually correct?

Quality scoring works by evaluating agent responses against defined criteria: accuracy, policy adherence, tone, completeness, and domain-specific rules. If you're deciding between vibes-based assessment and structured scoring, Scorecards vs. Vibes breaks down the tradeoffs. You can run these evaluations on every call or on a statistical sample.

The Chanl SDK provides two approaches. First, evaluate a specific call against a scorecard:

typescript
// Score a specific call against quality criteria
const scorecard = await chanl.calls.getScorecard("call_abc123");
 
console.log("Overall score:", scorecard.overallScore);
for (const criterion of scorecard.criteria) {
  console.log(`  ${criterion.name}: ${criterion.score}/${criterion.maxScore}`);
  if (criterion.score < criterion.threshold) {
    console.log(`  ⚠ Below threshold: ${criterion.feedback}`);
  }
}

Second, run an evaluation programmatically. This is useful for sampling production traffic:

typescript
// Run quality evaluation on a call
const evaluation = await chanl.scorecards.evaluate({
  callId: "call_abc123",
  scorecardId: "sc_support_quality",
});
 
console.log("Evaluation ID:", evaluation.id);
console.log("Score:", evaluation.score);
console.log("Pass:", evaluation.pass);
console.log("Criteria results:", evaluation.results);

And third, pull aggregate quality data to track trends:

typescript
// Get quality trends over time
const qualityResults = await chanl.scorecards.listResults({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});
 
console.log("Total evaluations:", qualityResults.total);
console.log("Average score:", qualityResults.averageScore);
console.log("Pass rate:", `${qualityResults.passRate}%`);

Quality scoring is what closes the loop. Without it, you're flying with three instruments in a four-instrument cockpit.

How to build the full observability pipeline

A production pipeline connects the four pillars into a single cron job: collect metrics, sample calls for quality scoring, compare scores against baselines, and fire alerts when multiple signals correlate. Here's how to wire it together.

[Diagram] End-to-end observability pipeline: collect, score, detect, alert. An hourly cron collects metrics, samples calls, scores quality, and detects drift; if a threshold is exceeded, it alerts Slack/PagerDuty; otherwise it stores the baseline and feeds the weekly report.

Step 1: Metrics collector

The metrics collector runs hourly and pulls aggregate data:

typescript
interface MetricsSnapshot {
  timestamp: string;
  agentId: string;
  period: { start: string; end: string };
  latency: { p50: number; p95: number; p99: number };
  calls: { total: number; successful: number; failed: number };
  tokens: { average: number; total: number; cost: number };
  tools: { totalCalls: number; successRate: number };
}
 
async function collectMetrics(agentId: string): Promise<MetricsSnapshot> {
  const now = new Date();
  const oneHourAgo = new Date(now.getTime() - 60 * 60 * 1000);
 
  const metrics = await chanl.calls.getMetrics({
    agentId,
    startDate: oneHourAgo.toISOString(),
    endDate: now.toISOString(),
  });
 
  return {
    timestamp: now.toISOString(),
    agentId,
    period: {
      start: oneHourAgo.toISOString(),
      end: now.toISOString(),
    },
    latency: {
      p50: metrics.latency.p50,
      p95: metrics.latency.p95,
      p99: metrics.latency.p99,
    },
    calls: {
      total: metrics.totalCalls,
      successful: metrics.successfulCalls,
      failed: metrics.failedCalls,
    },
    tokens: {
      average: metrics.tokenUsage.average,
      total: metrics.tokenUsage.total,
      cost: metrics.cost.total,
    },
    tools: {
      totalCalls: metrics.toolCalls?.total ?? 0,
      successRate: metrics.toolCalls?.successRate ?? 1,
    },
  };
}

Step 2: Quality sampler

For quality scoring, you don't need to evaluate every call. A 5-10% sample gives you statistical significance for trend detection. The sampler selects recent calls and runs them through Chanl's scorecard evaluation:

typescript
interface QualitySample {
  callId: string;
  score: number;
  pass: boolean;
  criteria: Array<{
    name: string;
    score: number;
    maxScore: number;
  }>;
}
 
async function sampleAndScore(
  agentId: string,
  scorecardId: string,
  sampleRate: number = 0.1
): Promise<QualitySample[]> {
  // Pull recent calls
  const recentCalls = await chanl.calls.list({
    agentId,
    startDate: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
    limit: 100,
  });
 
  // Random sample
  const sampled = recentCalls.items.filter(() => Math.random() < sampleRate);
  const results: QualitySample[] = [];
 
  for (const call of sampled) {
    const evaluation = await chanl.scorecards.evaluate({
      callId: call.id,
      scorecardId,
    });
 
    results.push({
      callId: call.id,
      score: evaluation.score,
      pass: evaluation.pass,
      criteria: evaluation.results.map((r) => ({
        name: r.criterionName,
        score: r.score,
        maxScore: r.maxScore,
      })),
    });
  }
 
  return results;
}

Step 3: Drift detector

One bad call is an outlier. A week of declining scores is drift. The drift detector compares current quality scores against your baseline using standard deviation thresholds.

typescript
interface DriftSignal {
  detected: boolean;
  severity: "none" | "warning" | "critical";
  metric: string;
  currentValue: number;
  baselineValue: number;
  deviationPercent: number;
}
 
function detectDrift(
  currentScores: QualitySample[],
  baselineAvg: number,
  baselineStdDev: number
): DriftSignal {
  if (currentScores.length === 0) {
    return {
      detected: false,
      severity: "none",
      metric: "quality_score",
      currentValue: 0,
      baselineValue: baselineAvg,
      deviationPercent: 0,
    };
  }
 
  const currentAvg =
    currentScores.reduce((sum, s) => sum + s.score, 0) / currentScores.length;
  const deviation = baselineAvg - currentAvg;
  const deviationPercent = (deviation / baselineAvg) * 100;
 
  // Warning at 1 std dev below baseline, critical at 2
  let severity: "none" | "warning" | "critical" = "none";
  if (deviation > 2 * baselineStdDev) {
    severity = "critical";
  } else if (deviation > baselineStdDev) {
    severity = "warning";
  }
 
  return {
    detected: severity !== "none",
    severity,
    metric: "quality_score",
    currentValue: currentAvg,
    baselineValue: baselineAvg,
    deviationPercent,
  };
}

Step 4: Alert engine

Here's where the pipeline earns its keep. A single metric anomaly is noise. Multiple correlated anomalies are an incident. The alert engine combines signals from metrics and drift detection to decide what's worth waking someone up for.

typescript
interface AlertDecision {
  shouldAlert: boolean;
  severity: "info" | "warning" | "critical";
  signals: string[];
  message: string;
}
 
function evaluateAlerts(
  metrics: MetricsSnapshot,
  drift: DriftSignal,
  baseline: {
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  }
): AlertDecision {
  const signals: string[] = [];
 
  // Check latency spike
  if (metrics.latency.p95 > baseline.avgLatencyP95 * 2) {
    signals.push(
      `Latency p95 at ${metrics.latency.p95}ms (baseline: ${baseline.avgLatencyP95}ms)`
    );
  }
 
  // Check cost spike (possible reasoning loop)
  if (metrics.tokens.cost > baseline.avgCost * 3) {
    signals.push(
      `Token cost $${metrics.tokens.cost.toFixed(2)} (baseline: $${baseline.avgCost.toFixed(2)})`
    );
  }
 
  // Check tool failure rate
  if (metrics.tools.successRate < baseline.avgToolSuccessRate * 0.8) {
    signals.push(
      `Tool success rate ${(metrics.tools.successRate * 100).toFixed(1)}% ` +
        `(baseline: ${(baseline.avgToolSuccessRate * 100).toFixed(1)}%)`
    );
  }
 
  // Check quality drift
  if (drift.detected) {
    signals.push(
      `Quality drift: ${drift.severity} ` +
        `(${drift.deviationPercent.toFixed(1)}% below baseline)`
    );
  }
 
  // Composite severity
  let severity: "info" | "warning" | "critical" = "info";
  if (signals.length >= 3 || drift.severity === "critical") {
    severity = "critical";
  } else if (signals.length >= 2 || drift.severity === "warning") {
    severity = "warning";
  }
 
  return {
    shouldAlert: signals.length >= 2,
    severity,
    signals,
    message:
      signals.length >= 2
        ? `${signals.length} correlated signals detected for ${metrics.agentId}: ${signals.join("; ")}`
        : "No actionable signals",
  };
}

The key design decision: shouldAlert requires two or more signals. A latency spike alone doesn't page anyone. A latency spike combined with a quality score drop and a cost increase means something is actually wrong.
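The pipeline in the next step calls a sendAlert function. Here's a minimal sketch that posts to a Slack incoming webhook; the Alert shape mirrors AlertDecision above, and SLACK_WEBHOOK_URL is an assumed environment variable:

```typescript
// Format an alert for Slack and post it to an incoming webhook.
// SLACK_WEBHOOK_URL is an assumption -- point the request at PagerDuty's
// events API or your own channel if that's where alerts should land.
interface Alert {
  severity: "info" | "warning" | "critical";
  signals: string[];
  message: string;
}

function formatSlackPayload(alert: Alert): { text: string } {
  const icon = alert.severity === "critical" ? "🔴" : "🟡";
  return {
    text:
      `${icon} [${alert.severity.toUpperCase()}] ${alert.message}\n` +
      alert.signals.map((s) => `• ${s}`).join("\n"),
  };
}

async function sendAlert(alert: Alert): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) return; // no webhook configured; skip silently
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(formatSlackPayload(alert)),
  });
}
```

Keeping the formatter separate from the HTTP call makes the message shape testable without a live webhook.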

Step 5: Putting it all together

Here's the full pipeline, designed to run on a cron schedule:

typescript
async function runObservabilityPipeline(config: {
  agentId: string;
  scorecardId: string;
  sampleRate: number;
  baseline: {
    qualityAvg: number;
    qualityStdDev: number;
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  };
}) {
  console.log(`[${new Date().toISOString()}] Running pipeline for ${config.agentId}`);
 
  // Step 1: Collect metrics
  const metrics = await collectMetrics(config.agentId);
  console.log(`  Calls: ${metrics.calls.total}, P95: ${metrics.latency.p95}ms`);
 
  // Step 2: Sample and score quality
  const qualitySamples = await sampleAndScore(
    config.agentId,
    config.scorecardId,
    config.sampleRate
  );
  console.log(`  Scored ${qualitySamples.length} calls`);
 
  // Step 3: Detect drift
  const drift = detectDrift(
    qualitySamples,
    config.baseline.qualityAvg,
    config.baseline.qualityStdDev
  );
  console.log(`  Drift: ${drift.severity}`);
 
  // Step 4: Evaluate alerts
  const alert = evaluateAlerts(metrics, drift, {
    avgLatencyP95: config.baseline.avgLatencyP95,
    avgCost: config.baseline.avgCost,
    avgToolSuccessRate: config.baseline.avgToolSuccessRate,
  });
 
  if (alert.shouldAlert) {
    console.log(`  🚨 ALERT [${alert.severity}]: ${alert.message}`);
    await sendAlert(alert);
  } else {
    console.log("  ✓ No actionable signals");
  }
 
  // Step 5: Store for trend analysis (storeSnapshot persists to your own store)
  await storeSnapshot({
    metrics,
    qualitySamples,
    drift,
    alert,
  });
}

Run it every hour and you have a working observability pipeline. The first two weeks of data become your baseline. After that, drift detection has enough history to catch gradual degradation.
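If you don't have a system cron handy, a minimal in-process scheduler is enough to get started (node-cron or your platform's scheduler are equally valid; this helper is illustrative):

```typescript
// Run a task immediately, then on a fixed interval. Errors are caught
// and logged so one failed run doesn't kill the schedule.
function startOnSchedule(
  task: () => void | Promise<void>,
  intervalMs: number
): ReturnType<typeof setInterval> {
  const safeRun = async () => {
    try {
      await task();
    } catch (err) {
      console.error("pipeline run failed:", err);
    }
  };
  void safeRun(); // first run right away
  return setInterval(safeRun, intervalMs);
}
```

Call startOnSchedule(() => runObservabilityPipeline(config), 60 * 60 * 1000) and keep the returned handle if you ever need to stop it.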

How to investigate a production incident with this pipeline

Start with the composite alert signal to narrow the search. A latency spike plus quality drop points to a slow or failing tool. A cost spike plus quality drop suggests a reasoning loop. The combination tells you where to look first.

Pull a failing call. Use the Chanl SDK to get the full conversation:

typescript
// Get the call detail
const call = await chanl.calls.get("call_failing_123");
 
// Pull the full transcript
const transcript = await chanl.calls.getTranscript("call_failing_123");
 
// Run AI analysis on the call
const analysis = await chanl.calls.analyze("call_failing_123");
 
console.log("AI Analysis:", analysis.summary);
console.log("Issues found:", analysis.issues);
console.log("Suggested fixes:", analysis.suggestions);

The analyze method runs AI-powered analysis on the call, identifying issues like policy violations, hallucinations, or tool misuse. It doesn't replace human review, but it narrows the search space from "something is wrong with 500 daily conversations" to "here are the three specific failure patterns in calls from the last six hours."

Check the scorecard breakdown. The overall quality score tells you the agent is underperforming. The per-criterion breakdown tells you why:

typescript
const scorecard = await chanl.calls.getScorecard("call_failing_123");
 
// Find which criteria failed
const failures = scorecard.criteria.filter(
  (c) => c.score < c.threshold
);
 
for (const failure of failures) {
  console.log(`Failed: ${failure.name}`);
  console.log(`  Score: ${failure.score}/${failure.maxScore}`);
  console.log(`  Feedback: ${failure.feedback}`);
}

Maybe accuracy is fine but policy adherence dropped. That's a stale knowledge base. Maybe accuracy and tone both dropped. That's probably a model change or prompt regression. The per-criterion data tells you where to look, and the Chanl scorecard system handles the evaluation so you're not building an LLM-as-judge from scratch.

How to establish baselines and catch drift over time

Collect two weeks of metrics and quality scores before enabling alerts. That data becomes your baseline. Drift detection compares each new data point against the baseline's mean and standard deviation. A 0.5% daily decline is invisible in real time but adds up to 15% over a month.
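The arithmetic behind that claim is simple compounding; a quick sketch:

```typescript
// Cumulative effect of a small daily decline: 0.5% a day compounds to
// roughly a 14% drop over 30 days (about 15% if you count it linearly).
function cumulativeDecline(dailyRate: number, days: number): number {
  return 1 - Math.pow(1 - dailyRate, days);
}

console.log(`${(cumulativeDecline(0.005, 30) * 100).toFixed(1)}% drop over 30 days`);
```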

Building your baseline

Run the metrics collector and quality sampler for two weeks without alerting. Store every data point. After two weeks, compute:

typescript
interface Baseline {
  qualityAvg: number;
  qualityStdDev: number;
  avgLatencyP95: number;
  avgCost: number;
  avgToolSuccessRate: number;
  sampleSize: number;
  periodDays: number;
}
 
function computeBaseline(snapshots: MetricsSnapshot[], qualityScores: number[]): Baseline {
  const latencies = snapshots.map((s) => s.latency.p95);
  const costs = snapshots.map((s) => s.tokens.cost);
  const toolRates = snapshots.map((s) => s.tools.successRate);
 
  const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;
  const stdDev = (arr: number[]) => {
    const mean = avg(arr);
    return Math.sqrt(arr.reduce((sum, v) => sum + (v - mean) ** 2, 0) / arr.length);
  };
 
  return {
    qualityAvg: avg(qualityScores),
    qualityStdDev: stdDev(qualityScores),
    avgLatencyP95: avg(latencies),
    avgCost: avg(costs),
    avgToolSuccessRate: avg(toolRates),
    sampleSize: qualityScores.length,
    periodDays: 14,
  };
}

Use Chanl's aggregate scorecard results to track quality over time without running individual evaluations on historical data:

typescript
async function getQualityTrend(agentId: string, weeks: number = 4) {
  const trends: Array<{ week: number; avgScore: number; passRate: number }> = [];
 
  for (let w = 0; w < weeks; w++) {
    const end = new Date(Date.now() - w * 7 * 24 * 60 * 60 * 1000);
    const start = new Date(end.getTime() - 7 * 24 * 60 * 60 * 1000);
 
    const results = await chanl.scorecards.listResults({
      agentId,
      startDate: start.toISOString(),
      endDate: end.toISOString(),
    });
 
    trends.push({
      week: w,
      avgScore: results.averageScore,
      passRate: results.passRate,
    });
  }
 
  // Check for downward trend
  const scores = trends.map((t) => t.avgScore).reverse();
  let declining = true;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] >= scores[i - 1]) {
      declining = false;
      break;
    }
  }
 
  if (declining && scores.length > 2) {
    const totalDrop = scores[0] - scores[scores.length - 1];
    console.log(
      `⚠ Consistent quality decline: ${totalDrop.toFixed(2)} points over ${weeks} weeks`
    );
  }
 
  return trends;
}

A consistent decline of 0.3 points or more over two weeks is a clear signal. At that point, pull the per-criterion breakdown from your scorecard results to identify which specific quality dimensions are degrading. Is it accuracy? Policy adherence? Tone? Each points to a different root cause.
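One way to get that per-criterion view from the samples the pipeline already collects (the input shape mirrors QualitySample from Step 2; this helper is a sketch, not an SDK method):

```typescript
// Average each criterion across a batch of scored samples so you can
// see which quality dimension is dragging the overall score down.
interface ScoredSample {
  criteria: Array<{ name: string; score: number; maxScore: number }>;
}

function criterionAverages(samples: ScoredSample[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const sample of samples) {
    for (const c of sample.criteria) {
      const entry = sums.get(c.name) ?? { total: 0, count: 0 };
      entry.total += c.score / c.maxScore; // normalize to 0..1
      entry.count += 1;
      sums.set(c.name, entry);
    }
  }
  const averages = new Map<string, number>();
  for (const [name, { total, count }] of sums) {
    averages.set(name, total / count);
  }
  return averages;
}
```

Sort the resulting map ascending and the worst criterion is at the top of your investigation queue.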

What does the cost of observability look like?

A single LLM call creates 8-15 spans compared to 2-3 for a typical API endpoint. Commercial observability platforms charging $0.10-$0.30 per GB of ingested data can produce observability bills that exceed your actual LLM costs. The fix is aggregation and sampling.

Here's a realistic cost breakdown for an agent handling 500 calls per day:

| Component | Volume/Day | Approach | Monthly Cost |
| --- | --- | --- | --- |
| Metrics | 500 calls x 24 metrics | Aggregate before storing | ~$5 |
| Logs | 500 transcripts x ~5KB each | Store full, query on demand | ~$15 |
| Traces | 500 calls x ~12 spans each | Sample 10%, full trace | ~$10 |
| Quality evals | 50 calls x scorecard | 10% sample rate | ~$25 (LLM eval cost) |
| Storage | All above, 90-day retention | Time-series DB | ~$20 |
| Total | | | ~$75/month |

Compare that to the cost of not monitoring: a three-day undetected hallucination affecting 1,500 conversations. Or a $15,000 monthly bill spike from a tool retry loop. $75 per month for a pipeline that catches these problems within an hour is a reasonable trade.

The trick is aggregation and sampling. Don't store every span for every call. Aggregate metrics hourly, sample 10% for full tracing, and run quality evaluations on the sample. You get statistical significance without the data volume of full instrumentation.
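The quality sampler in Step 2 uses Math.random(), which is fine for evals. For tracing, you usually want a deterministic decision so a given call is either fully traced or not traced at all; hashing the call ID gets you that (a sketch):

```typescript
import { createHash } from "node:crypto";

// Deterministic sampler: hash the call ID and map it to [0, 1). The same
// ID always gets the same verdict, so a sampled call keeps all its spans.
function shouldTrace(callId: string, sampleRate = 0.1): boolean {
  const digest = createHash("sha256").update(callId).digest();
  return digest.readUInt32BE(0) / 0x1_0000_0000 < sampleRate;
}
```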

What should your observability dashboard look like?

A good dashboard answers three questions at a glance: Is the agent healthy right now? Has quality changed over time? Where should I investigate?

Organize it in three rows, each serving a different time horizon:

| Row | Content | Time horizon |
| --- | --- | --- |
| Top: Health indicators | Four cards: p95 latency, call volume, tool success rate, quality score. Sparkline of last 24 hours. Color-coded against baseline (green/yellow/red). | Right now |
| Middle: Trends | Full-width chart showing quality score, latency, and cost over 30 days. Drift that looks flat hour-to-hour becomes obvious at this scale. | Last month |
| Bottom: Investigation | Recent alerts, lowest-scoring calls, drift-flagged calls. Each links to chanl.calls.get(callId) for transcript and scorecard. | Action queue |

The Chanl analytics dashboard provides this layout out of the box, with real-time metrics, quality trends, and drill-down into individual conversations. If you're building your own dashboard, the SDK methods we've covered (getMetrics, getScorecard, listResults, getTranscript, analyze) provide all the data you need. Chanl's monitoring features handle the alerting layer, so you can focus on the analysis rather than the plumbing.

The closed loop: observability that improves the agent

Observability is not a passive activity. The whole point of collecting metrics, scoring quality, and detecting drift is to feed improvements back into the agent. Here's how the loop closes:

  1. Pipeline detects drift in the "policy adherence" quality criterion
  2. Investigation reveals the knowledge base contains outdated return policy documents
  3. Fix: update the knowledge base, run scenario tests against the updated content
  4. Verification: quality scores recover to baseline within 24 hours
  5. Baseline update: the pipeline incorporates the new, higher-quality data into its rolling baseline

This is the data flywheel. Every observation becomes an input to improvement, the same loop described in turning conversation data into agent improvements. The agent gets better not because you're guessing what's wrong, but because you have data showing exactly what degraded and by how much.

The same loop applies to prompt changes. Before deploying a prompt update, run your scorecard evaluations against a set of production calls using the new prompt. Compare scores to your baseline. If quality improves, ship it. If it degrades on any criterion, investigate before deploying. The pipeline gives you the data to make that decision with confidence.
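That comparison can be a simple per-criterion diff (a sketch; the criterion names and tolerance are illustrative):

```typescript
// Flag criteria where the candidate prompt scores worse than baseline
// by more than a tolerance (scores normalized to 0..1).
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  tolerance = 0.02
): string[] {
  return Object.keys(baseline).filter(
    (name) => (candidate[name] ?? 0) < baseline[name] - tolerance
  );
}
```

An empty array means ship it; anything else names the criterion to investigate before deploying.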

Wrapping up

80% of Fortune 500 companies are running active AI agents. Most of them are monitoring uptime when they should be monitoring behavior. That's the same blind spot as the retired return policy from the opening: HTTP 200, zero alerts, completely wrong answers.

The pipeline we built covers the full loop: collect metrics with chanl.calls.getMetrics(), score quality with chanl.scorecards.evaluate(), detect drift by comparing against baselines, and alert on composite signals. It runs on a cron, costs about $75 per month for a 500-call-per-day agent, and catches the failures that traditional APM misses.

Start with the metrics collector. Get two weeks of baseline data. Then add quality scoring and drift detection. Every team that instruments quality scoring discovers problems they didn't know existed. The question isn't whether your agent has issues. It's whether you're seeing them before your customers do.

To build the evaluation framework that powers quality scoring, see How to Evaluate AI Agents: Build an Eval Framework from Scratch. For testing agents before they hit production, Scenario Testing: The QA Strategy That Catches What Unit Tests Miss covers the pre-deploy side of the loop.

See your agent's blind spots

Chanl's observability pipeline gives you metrics, quality scoring, and drift detection for every AI agent conversation. Start monitoring what traditional APM misses.
