
Build an AI Agent Observability Pipeline from Scratch

Build a production observability pipeline for AI agents using TypeScript and the Chanl SDK. Covers metrics, traces, quality scoring, drift detection, and alerting.

Dean Grover, Co-founder
March 31, 2026
22 min read

Your AI agent passed every test in staging. The demo went well. Leadership signed off. You pushed it to production, and for two weeks it handled 500 conversations a day without a single alert firing.

Then a customer tweets that your agent confidently quoted a return policy you retired three months ago. You check the logs. No errors. No exceptions. HTTP 200 on every request. Latency under a second. By every traditional metric, the system was healthy the entire time. It was just wrong, confidently and repeatedly, for who knows how long.

Microsoft's security team named this the observability gap in March 2026: "Traditional monitoring, built around uptime, latency, and error rates, can miss the root cause and provide limited signal for attribution or reconstruction in AI-related scenarios." Your Datadog dashboard says green. Your customers say broken.

This article builds a working observability pipeline in TypeScript using the Chanl SDK. Not a monitoring checklist. Actual code that pulls real metrics, scores quality, detects drift, and triggers alerts. By the end, you'll have a system that answers the question traditional monitoring can't: is the agent actually behaving correctly?

| What you'll build | Why it matters |
| --- | --- |
| Metrics collector | Pull latency, token cost, and call volume from production conversations |
| Quality scorer | Automatically evaluate agent output against defined criteria |
| Drift detector | Catch gradual degradation before customers notice |
| Alert pipeline | Composite signals that fire on real incidents, not noise |
| Full dashboard loop | Pull metrics, score, detect, alert, all in one pipeline |

Prerequisites and setup

You'll need Node.js 20+, TypeScript 5+, and a Chanl account with API access. All code examples are TypeScript, and we'll use the Chanl SDK for data access throughout.

Install the SDK:

bash
npm install @chanl/sdk

Set your environment:

bash
export CHANL_API_KEY=your-api-key
export CHANL_BASE_URL=https://platform.chanl.ai

Initialize the client:

typescript
import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});

If you're new to agent observability concepts, AI Agent Observability: What to Monitor When Your Agent Goes Live covers the foundational thinking. This article assumes you understand why observability matters and focuses on building the pipeline.

Why traditional APM fails for AI agents

Traditional Application Performance Monitoring was built for deterministic systems. A REST API either returns the right data or throws an error. A database query either succeeds or times out. The same input produces the same output, and a failure looks like a failure.

AI agents broke that contract. The same question asked twice produces two different answers, both potentially correct, both phrased differently. A "successful" HTTP 200 response might contain completely hallucinated data. An agent might be technically "up" while stuck in an infinite reasoning loop burning $50 per minute in API costs. OneUptime documented a case where an agent hallucinated product features that don't exist, and nobody noticed for three days because the response format looked correct.

| Dimension | Traditional APM | AI Agent Observability |
| --- | --- | --- |
| Core question | Is it running? | Is it behaving correctly? |
| Failure mode | Errors, timeouts, crashes | Confident wrong answers, policy violations, hallucinations |
| Input/output | Deterministic (same in, same out) | Non-deterministic (same in, different out) |
| Success signal | HTTP 200, low latency | Correct, complete, policy-adherent response |
| What breaks | Infrastructure | Reasoning, context, tool selection |
| Trace complexity | 2-3 spans per request | 8-15 spans per request (LLM + tools + memory) |
| Cost model | Fixed compute | Variable per-token, per-call |
| Drift risk | Low (code doesn't change itself) | High (model behavior shifts, knowledge goes stale) |
| Quality measurement | Binary pass/fail | Multi-criteria scored evaluation |

The 2026 Elastic observability report found that 85% of organizations now use some form of GenAI for observability, but the majority are applying GenAI to existing monitoring rather than building observability for their AI systems. That's the gap this article fills.

What are the four pillars of AI agent observability?

Agent observability has four pillars: metrics, logs, traces, and quality. The first three are borrowed from traditional observability. The fourth, quality, is what makes agent monitoring different. Without quality scoring, you're monitoring a system whose primary failure mode is invisible to the other three pillars.

[Diagram] The four pillars of AI agent observability and how they connect: Metrics (volume, latency, cost), Logs (structured events, tool calls), Traces (multi-step reasoning chains), and Quality (scorecard evaluations, drift) feed a dashboard, an alert engine (composite signals to PagerDuty / Slack), and a weekly trend review.

Pillar 1: Metrics

Metrics are the numbers that tell you what's happening at a glance. For AI agents, the critical metrics are different from traditional services.

Latency isn't just response time. It's end-to-end conversation turn latency, which includes LLM inference, tool calls, memory retrieval, and response formatting. Track p50, p95, and p99 separately. An agent might respond in 800ms for 90% of queries but take 12 seconds for complex ones requiring multiple tool calls. The average looks fine. The tail doesn't.

Token cost is the new compute bill. Unlike traditional services where compute is relatively fixed, every agent conversation has a variable cost based on prompt length, conversation history, and the number of reasoning steps. A cost spike might signal a reasoning loop or bloated context window.

Call volume and success rate tell you throughput and reliability. But for agents, "success" needs redefinition. An API call that returns 200 but gives a wrong answer isn't a success. You need to pair success rate with quality scores, which we'll build in Pillar 4.

Here's how to pull these metrics from production using the Chanl SDK:

typescript
import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});
 
// Pull aggregate metrics for the last 24 hours
const metrics = await chanl.calls.getMetrics({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});
 
console.log("Latency p50:", metrics.latency.p50, "ms");
console.log("Latency p95:", metrics.latency.p95, "ms");
console.log("Total calls:", metrics.totalCalls);
console.log("Avg tokens/call:", metrics.tokenUsage.average);
console.log("Total cost:", `$${metrics.cost.total.toFixed(2)}`);

The getMetrics method returns pre-aggregated data. This is important. You're not fetching 500 calls and computing averages client-side. The backend does the aggregation, which means this call is fast even when you're pulling metrics across millions of conversations.

For more granular analysis, pull individual call data:

typescript
// Get a specific call's full detail
const call = await chanl.calls.get("call_abc123");
 
console.log("Duration:", call.duration, "seconds");
console.log("Turns:", call.turns);
console.log("Tool calls:", call.toolCalls?.length ?? 0);
console.log("Total tokens:", call.tokenUsage?.total);
console.log("Status:", call.status);

Pillar 2: Logs

Structured logging for AI agents means capturing every decision point in a format you can query later. The critical insight from OneUptime's analysis is that the highest-value approach is instrumenting decision points, not just request/response boundaries.

For each conversation turn, you want to capture:

typescript
interface AgentLogEntry {
  timestamp: string;
  conversationId: string;
  turnIndex: number;
  phase: "reasoning" | "tool_call" | "tool_result" | "response";
  input: string;
  output: string;
  toolName?: string;
  toolSuccess?: boolean;
  latencyMs: number;
  tokenCount: number;
  metadata: Record<string, unknown>;
}
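As a minimal sketch of emitting one of these entries, assuming JSON lines to stdout (the logTurn helper is illustrative, not part of the Chanl SDK; swap the output for your log shipper in production):

```typescript
// Emit one conversation turn as a structured JSON log line. The shape
// mirrors the AgentLogEntry interface above; logging to stdout is an
// assumption -- route the line to your log pipeline in production.
function logTurn(entry: {
  conversationId: string;
  turnIndex: number;
  phase: "reasoning" | "tool_call" | "tool_result" | "response";
  input: string;
  output: string;
  latencyMs: number;
  tokenCount: number;
}): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...entry });
  console.log(line); // one JSON object per line, easy to query later
  return line;
}
```

Because every line is a self-describing JSON object, you can later filter by conversationId or phase without parsing free-form text.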

With the Chanl SDK, you can pull the full transcript for any call, which includes tool calls, agent reasoning, and customer messages in order:

typescript
// Pull full conversation transcript
const transcript = await chanl.calls.getTranscript("call_abc123");
 
for (const entry of transcript.entries) {
  console.log(`[${entry.role}] ${entry.content}`);
 
  if (entry.toolCall) {
    console.log(`  Tool: ${entry.toolCall.name}`);
    console.log(`  Args: ${JSON.stringify(entry.toolCall.arguments)}`);
    console.log(`  Result: ${entry.toolCall.result?.status}`);
  }
}

The transcript gives you the raw material for debugging. When a customer reports a problem, you don't grep through CloudWatch logs hoping to find the right request. You pull the transcript, see every step the agent took, and identify exactly where the reasoning went wrong.

Pillar 3: Traces

Traces connect the dots between individual log entries. A single conversation turn might involve an LLM reasoning step, a knowledge base query, two tool calls, and a final response generation. Traces show you the causal chain: which step caused which, what ran in parallel, and where time was spent.

OpenTelemetry's experimental semantic conventions for generative AI now cover model calls, tool executions, and agent planning steps. According to the 2026 Elastic report, OTel production adoption tripled from 2025 to 2026, and 89% of production users consider OTel compliance at least very important.

Here's a basic trace structure for an agent conversation turn:

typescript
import { trace, SpanKind } from "@opentelemetry/api";
 
const tracer = trace.getTracer("agent-observability");
// retrieveContext, generateResponse, and executeToolCall are placeholders
// for your agent's own retrieval, LLM, and tool-execution layers.
 
async function traceConversationTurn(
  conversationId: string,
  userMessage: string
) {
  return tracer.startActiveSpan(
    "agent.conversation_turn",
    { kind: SpanKind.SERVER },
    async (rootSpan) => {
      rootSpan.setAttribute("conversation.id", conversationId);
      rootSpan.setAttribute("user.message.length", userMessage.length);
 
      // Trace knowledge retrieval
      const context = await tracer.startActiveSpan(
        "agent.knowledge_retrieval",
        async (span) => {
          const results = await retrieveContext(userMessage);
          span.setAttribute("retrieval.results_count", results.length);
          span.setAttribute("retrieval.top_score", results[0]?.score ?? 0);
          span.end();
          return results;
        }
      );
 
      // Trace LLM reasoning
      const response = await tracer.startActiveSpan(
        "agent.llm_reasoning",
        async (span) => {
          const result = await generateResponse(userMessage, context);
          span.setAttribute("llm.model", result.model);
          span.setAttribute("llm.tokens.prompt", result.promptTokens);
          span.setAttribute("llm.tokens.completion", result.completionTokens);
          span.setAttribute("llm.tool_calls", result.toolCalls?.length ?? 0);
          span.end();
          return result;
        }
      );
 
      // Trace each tool call
      for (const toolCall of response.toolCalls ?? []) {
        await tracer.startActiveSpan(
          `agent.tool_call.${toolCall.name}`,
          async (span) => {
            span.setAttribute("tool.name", toolCall.name);
            span.setAttribute(
              "tool.arguments",
              JSON.stringify(toolCall.arguments)
            );
            const result = await executeToolCall(toolCall);
            span.setAttribute("tool.success", result.success);
            span.setAttribute("tool.latency_ms", result.latencyMs);
            span.end();
          }
        );
      }
 
      rootSpan.setAttribute("response.length", response.text.length);
      rootSpan.end();
      return response;
    }
  );
}

The key insight is that you're creating parent-child span relationships. When you look at a trace in Jaeger or your tracing backend, you see the full tree: conversation turn at the root, with knowledge retrieval, LLM reasoning, and tool calls as children. If a tool call takes 4 seconds, you see it immediately. If knowledge retrieval returns zero results and the agent hallucinates, you see the empty retrieval and the hallucinated response connected in the same trace.

Pillar 4: Quality

This is the pillar that traditional observability doesn't have, and it's the one that matters most for AI agents. Quality scoring answers the question that metrics, logs, and traces can't: was the agent's output actually correct?

Quality scoring works by evaluating agent responses against defined criteria: accuracy, policy adherence, tone, completeness, and domain-specific rules. If you're deciding between vibes-based assessment and structured scoring, Scorecards vs. Vibes breaks down the tradeoffs. You can run these evaluations on every call or on a statistical sample.

The Chanl SDK provides two approaches. First, evaluate a specific call against a scorecard:

typescript
// Score a specific call against quality criteria
const scorecard = await chanl.calls.getScorecard("call_abc123");
 
console.log("Overall score:", scorecard.overallScore);
for (const criterion of scorecard.criteria) {
  console.log(`  ${criterion.name}: ${criterion.score}/${criterion.maxScore}`);
  if (criterion.score < criterion.threshold) {
    console.log(`  ⚠ Below threshold: ${criterion.feedback}`);
  }
}

Second, run an evaluation programmatically. This is useful for sampling production traffic:

typescript
// Run quality evaluation on a call
const evaluation = await chanl.scorecards.evaluate({
  callId: "call_abc123",
  scorecardId: "sc_support_quality",
});
 
console.log("Evaluation ID:", evaluation.id);
console.log("Score:", evaluation.score);
console.log("Pass:", evaluation.pass);
console.log("Criteria results:", evaluation.results);

And third, pull aggregate quality data to track trends:

typescript
// Get quality trends over time
const qualityResults = await chanl.scorecards.listResults({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});
 
console.log("Total evaluations:", qualityResults.total);
console.log("Average score:", qualityResults.averageScore);
console.log("Pass rate:", `${qualityResults.passRate}%`);

Quality scoring is what closes the loop. Without it, you're flying with three instruments in a four-instrument cockpit.

How to build the full observability pipeline

A production pipeline connects the four pillars into a single cron job: collect metrics, sample calls for quality scoring, compare scores against baselines, and fire alerts when multiple signals correlate. Here's how to wire it together.

[Diagram] End-to-end observability pipeline: collect, score, detect, alert. An hourly cron collects metrics, samples calls, scores quality, and detects drift; if a threshold is exceeded, it alerts Slack/PagerDuty; otherwise it stores the baseline and feeds the weekly report.

Step 1: Metrics collector

The metrics collector runs hourly and pulls aggregate data:

typescript
interface MetricsSnapshot {
  timestamp: string;
  agentId: string;
  period: { start: string; end: string };
  latency: { p50: number; p95: number; p99: number };
  calls: { total: number; successful: number; failed: number };
  tokens: { average: number; total: number; cost: number };
  tools: { totalCalls: number; successRate: number };
}
 
async function collectMetrics(agentId: string): Promise<MetricsSnapshot> {
  const now = new Date();
  const oneHourAgo = new Date(now.getTime() - 60 * 60 * 1000);
 
  const metrics = await chanl.calls.getMetrics({
    agentId,
    startDate: oneHourAgo.toISOString(),
    endDate: now.toISOString(),
  });
 
  return {
    timestamp: now.toISOString(),
    agentId,
    period: {
      start: oneHourAgo.toISOString(),
      end: now.toISOString(),
    },
    latency: {
      p50: metrics.latency.p50,
      p95: metrics.latency.p95,
      p99: metrics.latency.p99,
    },
    calls: {
      total: metrics.totalCalls,
      successful: metrics.successfulCalls,
      failed: metrics.failedCalls,
    },
    tokens: {
      average: metrics.tokenUsage.average,
      total: metrics.tokenUsage.total,
      cost: metrics.cost.total,
    },
    tools: {
      totalCalls: metrics.toolCalls?.total ?? 0,
      successRate: metrics.toolCalls?.successRate ?? 1,
    },
  };
}

Step 2: Quality sampler

For quality scoring, you don't need to evaluate every call. A 5-10% sample gives you statistical significance for trend detection. The sampler selects recent calls and runs them through Chanl's scorecard evaluation:

typescript
interface QualitySample {
  callId: string;
  score: number;
  pass: boolean;
  criteria: Array<{
    name: string;
    score: number;
    maxScore: number;
  }>;
}
 
async function sampleAndScore(
  agentId: string,
  scorecardId: string,
  sampleRate: number = 0.1
): Promise<QualitySample[]> {
  // Pull recent calls
  const recentCalls = await chanl.calls.list({
    agentId,
    startDate: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
    limit: 100,
  });
 
  // Random sample
  const sampled = recentCalls.items.filter(() => Math.random() < sampleRate);
  const results: QualitySample[] = [];
 
  for (const call of sampled) {
    const evaluation = await chanl.scorecards.evaluate({
      callId: call.id,
      scorecardId,
    });
 
    results.push({
      callId: call.id,
      score: evaluation.score,
      pass: evaluation.pass,
      criteria: evaluation.results.map((r) => ({
        name: r.criterionName,
        score: r.score,
        maxScore: r.maxScore,
      })),
    });
  }
 
  return results;
}

Step 3: Drift detector

One bad call is an outlier. A week of declining scores is drift. The drift detector compares current quality scores against your baseline using standard deviation thresholds.

typescript
interface DriftSignal {
  detected: boolean;
  severity: "none" | "warning" | "critical";
  metric: string;
  currentValue: number;
  baselineValue: number;
  deviationPercent: number;
}
 
function detectDrift(
  currentScores: QualitySample[],
  baselineAvg: number,
  baselineStdDev: number
): DriftSignal {
  if (currentScores.length === 0) {
    return {
      detected: false,
      severity: "none",
      metric: "quality_score",
      currentValue: 0,
      baselineValue: baselineAvg,
      deviationPercent: 0,
    };
  }
 
  const currentAvg =
    currentScores.reduce((sum, s) => sum + s.score, 0) / currentScores.length;
  const deviation = baselineAvg - currentAvg;
  const deviationPercent = (deviation / baselineAvg) * 100;
 
  // Warning at 1 std dev below baseline, critical at 2
  let severity: "none" | "warning" | "critical" = "none";
  if (deviation > 2 * baselineStdDev) {
    severity = "critical";
  } else if (deviation > baselineStdDev) {
    severity = "warning";
  }
 
  return {
    detected: severity !== "none",
    severity,
    metric: "quality_score",
    currentValue: currentAvg,
    baselineValue: baselineAvg,
    deviationPercent,
  };
}

Step 4: Alert engine

Here's where the pipeline earns its keep. A single metric anomaly is noise. Multiple correlated anomalies are an incident. The alert engine combines signals from metrics and drift detection to decide what's worth waking someone up for.

typescript
interface AlertDecision {
  shouldAlert: boolean;
  severity: "info" | "warning" | "critical";
  signals: string[];
  message: string;
}
 
function evaluateAlerts(
  metrics: MetricsSnapshot,
  drift: DriftSignal,
  baseline: {
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  }
): AlertDecision {
  const signals: string[] = [];
 
  // Check latency spike
  if (metrics.latency.p95 > baseline.avgLatencyP95 * 2) {
    signals.push(
      `Latency p95 at ${metrics.latency.p95}ms (baseline: ${baseline.avgLatencyP95}ms)`
    );
  }
 
  // Check cost spike (possible reasoning loop)
  if (metrics.tokens.cost > baseline.avgCost * 3) {
    signals.push(
      `Token cost $${metrics.tokens.cost.toFixed(2)} (baseline: $${baseline.avgCost.toFixed(2)})`
    );
  }
 
  // Check tool failure rate
  if (metrics.tools.successRate < baseline.avgToolSuccessRate * 0.8) {
    signals.push(
      `Tool success rate ${(metrics.tools.successRate * 100).toFixed(1)}% ` +
        `(baseline: ${(baseline.avgToolSuccessRate * 100).toFixed(1)}%)`
    );
  }
 
  // Check quality drift
  if (drift.detected) {
    signals.push(
      `Quality drift: ${drift.severity} ` +
        `(${drift.deviationPercent.toFixed(1)}% below baseline)`
    );
  }
 
  // Composite severity
  let severity: "info" | "warning" | "critical" = "info";
  if (signals.length >= 3 || drift.severity === "critical") {
    severity = "critical";
  } else if (signals.length >= 2 || drift.severity === "warning") {
    severity = "warning";
  }
 
  return {
    shouldAlert: signals.length >= 2,
    severity,
    signals,
    message:
      signals.length >= 2
        ? `${signals.length} correlated signals detected for ${metrics.agentId}: ${signals.join("; ")}`
        : "No actionable signals",
  };
}

The key design decision: shouldAlert requires two or more signals. A latency spike alone doesn't page anyone. A latency spike combined with a quality score drop and a cost increase means something is actually wrong.
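The pipeline in the next step calls a sendAlert function. Here's a minimal sketch that posts to a Slack incoming webhook; the Alert shape mirrors AlertDecision above, and SLACK_WEBHOOK_URL is an assumed environment variable:

```typescript
// Format an alert for Slack and post it to an incoming webhook.
// SLACK_WEBHOOK_URL is an assumption -- point the request at PagerDuty's
// events API or your own channel if that's where alerts should land.
interface Alert {
  severity: "info" | "warning" | "critical";
  signals: string[];
  message: string;
}

function formatSlackPayload(alert: Alert): { text: string } {
  const icon = alert.severity === "critical" ? "🔴" : "🟡";
  return {
    text:
      `${icon} [${alert.severity.toUpperCase()}] ${alert.message}\n` +
      alert.signals.map((s) => `• ${s}`).join("\n"),
  };
}

async function sendAlert(alert: Alert): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) return; // no webhook configured; skip silently
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(formatSlackPayload(alert)),
  });
}
```

Keeping the formatter separate from the HTTP call makes the message shape testable without a live webhook.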

Step 5: Putting it all together

Here's the full pipeline, designed to run on a cron schedule:

typescript
async function runObservabilityPipeline(config: {
  agentId: string;
  scorecardId: string;
  sampleRate: number;
  baseline: {
    qualityAvg: number;
    qualityStdDev: number;
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  };
}) {
  console.log(`[${new Date().toISOString()}] Running pipeline for ${config.agentId}`);
 
  // Step 1: Collect metrics
  const metrics = await collectMetrics(config.agentId);
  console.log(`  Calls: ${metrics.calls.total}, P95: ${metrics.latency.p95}ms`);
 
  // Step 2: Sample and score quality
  const qualitySamples = await sampleAndScore(
    config.agentId,
    config.scorecardId,
    config.sampleRate
  );
  console.log(`  Scored ${qualitySamples.length} calls`);
 
  // Step 3: Detect drift
  const drift = detectDrift(
    qualitySamples,
    config.baseline.qualityAvg,
    config.baseline.qualityStdDev
  );
  console.log(`  Drift: ${drift.severity}`);
 
  // Step 4: Evaluate alerts
  const alert = evaluateAlerts(metrics, drift, {
    avgLatencyP95: config.baseline.avgLatencyP95,
    avgCost: config.baseline.avgCost,
    avgToolSuccessRate: config.baseline.avgToolSuccessRate,
  });
 
  if (alert.shouldAlert) {
    console.log(`  🚨 ALERT [${alert.severity}]: ${alert.message}`);
    await sendAlert(alert);
  } else {
    console.log("  ✓ No actionable signals");
  }
 
  // Step 5: Store for trend analysis (storeSnapshot persists to your own store)
  await storeSnapshot({
    metrics,
    qualitySamples,
    drift,
    alert,
  });
}

Run it every hour and you have a working observability pipeline. The first two weeks of data become your baseline. After that, drift detection has enough history to catch gradual degradation.
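If you don't have a system cron handy, a minimal in-process scheduler is enough to get started (node-cron or your platform's scheduler are equally valid; this helper is illustrative):

```typescript
// Run a task immediately, then on a fixed interval. Errors are caught
// and logged so one failed run doesn't kill the schedule.
function startOnSchedule(
  task: () => void | Promise<void>,
  intervalMs: number
): ReturnType<typeof setInterval> {
  const safeRun = async () => {
    try {
      await task();
    } catch (err) {
      console.error("pipeline run failed:", err);
    }
  };
  void safeRun(); // first run right away
  return setInterval(safeRun, intervalMs);
}
```

Call startOnSchedule(() => runObservabilityPipeline(config), 60 * 60 * 1000) and keep the returned handle if you ever need to stop it.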

How to investigate a production incident with this pipeline

Start with the composite alert signal to narrow the search. A latency spike plus quality drop points to a slow or failing tool. A cost spike plus quality drop suggests a reasoning loop. The combination tells you where to look first.

Pull a failing call. Use the Chanl SDK to get the full conversation:

typescript
// Get the call detail
const call = await chanl.calls.get("call_failing_123");
 
// Pull the full transcript
const transcript = await chanl.calls.getTranscript("call_failing_123");
 
// Run AI analysis on the call
const analysis = await chanl.calls.analyze("call_failing_123");
 
console.log("AI Analysis:", analysis.summary);
console.log("Issues found:", analysis.issues);
console.log("Suggested fixes:", analysis.suggestions);

The analyze method runs AI-powered analysis on the call, identifying issues like policy violations, hallucinations, or tool misuse. It doesn't replace human review, but it narrows the search space from "something is wrong with 500 daily conversations" to "here are the three specific failure patterns in calls from the last six hours."

Check the scorecard breakdown. The overall quality score tells you the agent is underperforming. The per-criterion breakdown tells you why:

typescript
const scorecard = await chanl.calls.getScorecard("call_failing_123");
 
// Find which criteria failed
const failures = scorecard.criteria.filter(
  (c) => c.score < c.threshold
);
 
for (const failure of failures) {
  console.log(`Failed: ${failure.name}`);
  console.log(`  Score: ${failure.score}/${failure.maxScore}`);
  console.log(`  Feedback: ${failure.feedback}`);
}

Maybe accuracy is fine but policy adherence dropped. That's a stale knowledge base. Maybe accuracy and tone both dropped. That's probably a model change or prompt regression. The per-criterion data tells you where to look, and the Chanl scorecard system handles the evaluation so you're not building an LLM-as-judge from scratch.

How to establish baselines and catch drift over time

Collect two weeks of metrics and quality scores before enabling alerts. That data becomes your baseline. Drift detection compares each new data point against the baseline's mean and standard deviation. A 0.5% daily decline is invisible in real time but adds up to 15% over a month.
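The arithmetic behind that claim is simple compounding; a quick sketch:

```typescript
// Cumulative effect of a small daily decline: 0.5% a day compounds to
// roughly a 14% drop over 30 days (about 15% if you count it linearly).
function cumulativeDecline(dailyRate: number, days: number): number {
  return 1 - Math.pow(1 - dailyRate, days);
}

console.log(`${(cumulativeDecline(0.005, 30) * 100).toFixed(1)}% drop over 30 days`);
```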

Building your baseline

Run the metrics collector and quality sampler for two weeks without alerting. Store every data point. After two weeks, compute:

typescript
interface Baseline {
  qualityAvg: number;
  qualityStdDev: number;
  avgLatencyP95: number;
  avgCost: number;
  avgToolSuccessRate: number;
  sampleSize: number;
  periodDays: number;
}
 
function computeBaseline(snapshots: MetricsSnapshot[], qualityScores: number[]): Baseline {
  const latencies = snapshots.map((s) => s.latency.p95);
  const costs = snapshots.map((s) => s.tokens.cost);
  const toolRates = snapshots.map((s) => s.tools.successRate);
 
  const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;
  const stdDev = (arr: number[]) => {
    const mean = avg(arr);
    return Math.sqrt(arr.reduce((sum, v) => sum + (v - mean) ** 2, 0) / arr.length);
  };
 
  return {
    qualityAvg: avg(qualityScores),
    qualityStdDev: stdDev(qualityScores),
    avgLatencyP95: avg(latencies),
    avgCost: avg(costs),
    avgToolSuccessRate: avg(toolRates),
    sampleSize: qualityScores.length,
    periodDays: 14,
  };
}

Use Chanl's aggregate scorecard results to track quality over time without running individual evaluations on historical data:

typescript
async function getQualityTrend(agentId: string, weeks: number = 4) {
  const trends: Array<{ week: number; avgScore: number; passRate: number }> = [];
 
  for (let w = 0; w < weeks; w++) {
    const end = new Date(Date.now() - w * 7 * 24 * 60 * 60 * 1000);
    const start = new Date(end.getTime() - 7 * 24 * 60 * 60 * 1000);
 
    const results = await chanl.scorecards.listResults({
      agentId,
      startDate: start.toISOString(),
      endDate: end.toISOString(),
    });
 
    trends.push({
      week: w,
      avgScore: results.averageScore,
      passRate: results.passRate,
    });
  }
 
  // Check for downward trend
  const scores = trends.map((t) => t.avgScore).reverse();
  let declining = true;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] >= scores[i - 1]) {
      declining = false;
      break;
    }
  }
 
  if (declining && scores.length > 2) {
    const totalDrop = scores[0] - scores[scores.length - 1];
    console.log(
      `⚠ Consistent quality decline: ${totalDrop.toFixed(2)} points over ${weeks} weeks`
    );
  }
 
  return trends;
}

A consistent decline of 0.3 points or more over two weeks is a clear signal. At that point, pull the per-criterion breakdown from your scorecard results to identify which specific quality dimensions are degrading. Is it accuracy? Policy adherence? Tone? Each points to a different root cause.
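One way to get that per-criterion view from the samples the pipeline already collects (the input shape mirrors QualitySample from Step 2; this helper is a sketch, not an SDK method):

```typescript
// Average each criterion across a batch of scored samples so you can
// see which quality dimension is dragging the overall score down.
interface ScoredSample {
  criteria: Array<{ name: string; score: number; maxScore: number }>;
}

function criterionAverages(samples: ScoredSample[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const sample of samples) {
    for (const c of sample.criteria) {
      const entry = sums.get(c.name) ?? { total: 0, count: 0 };
      entry.total += c.score / c.maxScore; // normalize to 0..1
      entry.count += 1;
      sums.set(c.name, entry);
    }
  }
  const averages = new Map<string, number>();
  for (const [name, { total, count }] of sums) {
    averages.set(name, total / count);
  }
  return averages;
}
```

Sort the resulting map ascending and the worst criterion is at the top of your investigation queue.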

What does the cost of observability look like?

A single LLM call creates 8-15 spans compared to 2-3 for a typical API endpoint. Commercial observability platforms charging $0.10-$0.30 per GB of ingested data can produce observability bills that exceed your actual LLM costs. The fix is aggregation and sampling.

Here's a realistic cost breakdown for an agent handling 500 calls per day:

| Component | Volume/Day | Approach | Monthly Cost |
| --- | --- | --- | --- |
| Metrics | 500 calls x 24 metrics | Aggregate before storing | ~$5 |
| Logs | 500 transcripts x ~5KB each | Store full, query on demand | ~$15 |
| Traces | 500 calls x ~12 spans each | Sample 10%, full trace | ~$10 |
| Quality evals | 50 calls x scorecard | 10% sample rate | ~$25 (LLM eval cost) |
| Storage | All above, 90-day retention | Time-series DB | ~$20 |
| Total | | | ~$75/month |

Compare that to the cost of not monitoring: a three-day undetected hallucination affecting 1,500 conversations. Or a $15,000 monthly bill spike from a tool retry loop. $75 per month for a pipeline that catches these problems within an hour is a reasonable trade.

The trick is aggregation and sampling. Don't store every span for every call. Aggregate metrics hourly, sample 10% for full tracing, and run quality evaluations on the sample. You get statistical significance without the data volume of full instrumentation.
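The quality sampler in Step 2 uses Math.random(), which is fine for evals. For tracing, you usually want a deterministic decision so a given call is either fully traced or not traced at all; hashing the call ID gets you that (a sketch):

```typescript
import { createHash } from "node:crypto";

// Deterministic sampler: hash the call ID and map it to [0, 1). The same
// ID always gets the same verdict, so a sampled call keeps all its spans.
function shouldTrace(callId: string, sampleRate = 0.1): boolean {
  const digest = createHash("sha256").update(callId).digest();
  return digest.readUInt32BE(0) / 0x1_0000_0000 < sampleRate;
}
```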

What should your observability dashboard look like?

A good dashboard answers three questions at a glance: Is the agent healthy right now? Has quality changed over time? Where should I investigate?

Organize it in three rows, each serving a different time horizon:

| Row | Content | Time horizon |
| --- | --- | --- |
| Top: Health indicators | Four cards: p95 latency, call volume, tool success rate, quality score. Sparkline of last 24 hours. Color-coded against baseline (green/yellow/red). | Right now |
| Middle: Trends | Full-width chart showing quality score, latency, and cost over 30 days. Drift that looks flat hour-to-hour becomes obvious at this scale. | Last month |
| Bottom: Investigation | Recent alerts, lowest-scoring calls, drift-flagged calls. Each links to chanl.calls.get(callId) for transcript and scorecard. | Action queue |

The Chanl analytics dashboard provides this layout out of the box, with real-time metrics, quality trends, and drill-down into individual conversations. If you're building your own dashboard, the SDK methods we've covered (getMetrics, getScorecard, listResults, getTranscript, analyze) provide all the data you need. Chanl's monitoring features handle the alerting layer, so you can focus on the analysis rather than the plumbing.

The closed loop: observability that improves the agent

Observability is not a passive activity. The whole point of collecting metrics, scoring quality, and detecting drift is to feed improvements back into the agent. Here's how the loop closes:

  1. Pipeline detects drift in the "policy adherence" quality criterion
  2. Investigation reveals the knowledge base contains outdated return policy documents
  3. Fix: update the knowledge base, run scenario tests against the updated content
  4. Verification: quality scores recover to baseline within 24 hours
  5. Baseline update: the pipeline incorporates the new, higher-quality data into its rolling baseline

This is the data flywheel. Every observation becomes an input to improvement, the same loop described in turning conversation data into agent improvements. The agent gets better not because you're guessing what's wrong, but because you have data showing exactly what degraded and by how much.

The same loop applies to prompt changes. Before deploying a prompt update, run your scorecard evaluations against a set of production calls using the new prompt. Compare scores to your baseline. If quality improves, ship it. If it degrades on any criterion, investigate before deploying. The pipeline gives you the data to make that decision with confidence.
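That comparison can be a simple per-criterion diff (a sketch; the criterion names and tolerance are illustrative):

```typescript
// Flag criteria where the candidate prompt scores worse than baseline
// by more than a tolerance (scores normalized to 0..1).
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  tolerance = 0.02
): string[] {
  return Object.keys(baseline).filter(
    (name) => (candidate[name] ?? 0) < baseline[name] - tolerance
  );
}
```

An empty array means ship it; anything else names the criterion to investigate before deploying.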

Wrapping up

80% of Fortune 500 companies are running active AI agents. Most of them are monitoring uptime when they should be monitoring behavior. That's the same blind spot as the retired return policy from the opening: HTTP 200, zero alerts, completely wrong answers.

The pipeline we built covers the full loop: collect metrics with chanl.calls.getMetrics(), score quality with chanl.scorecards.evaluate(), detect drift by comparing against baselines, and alert on composite signals. It runs on a cron, costs about $75 per month for a 500-call-per-day agent, and catches the failures that traditional APM misses.

Start with the metrics collector. Get two weeks of baseline data. Then add quality scoring and drift detection. Every team that instruments quality scoring discovers problems they didn't know existed. The question isn't whether your agent has issues. It's whether you're seeing them before your customers do.

To build the evaluation framework that powers quality scoring, see How to Evaluate AI Agents: Build an Eval Framework from Scratch. For testing agents before they hit production, Scenario Testing: The QA Strategy That Catches What Unit Tests Miss covers the pre-deploy side of the loop.

See your agent's blind spots

Chanl's observability pipeline gives you metrics, quality scoring, and drift detection for every AI agent conversation. Start monitoring what traditional APM misses.
