Your AI agent passed every test in staging. The demo went well. Leadership signed off. You pushed it to production, and for two weeks it handled 500 conversations a day without a single alert firing.
Then a customer tweets that your agent confidently quoted a return policy you retired three months ago. You check the logs. No errors. No exceptions. HTTP 200 on every request. Latency under a second. By every traditional metric, the system was healthy the entire time. It was just wrong, confidently and repeatedly, for who knows how long.
Microsoft's security team named this the observability gap in March 2026: "Traditional monitoring, built around uptime, latency, and error rates, can miss the root cause and provide limited signal for attribution or reconstruction in AI-related scenarios." Your Datadog dashboard says green. Your customers say broken.
This article builds a working observability pipeline in TypeScript using the Chanl SDK. Not a monitoring checklist. Actual code that pulls real metrics, scores quality, detects drift, and triggers alerts. By the end, you'll have a system that answers the question traditional monitoring can't: is the agent actually behaving correctly?
| What you'll build | Why it matters |
|---|---|
| Metrics collector | Pull latency, token cost, and call volume from production conversations |
| Quality scorer | Automatically evaluate agent output against defined criteria |
| Drift detector | Catch gradual degradation before customers notice |
| Alert pipeline | Composite signals that fire on real incidents, not noise |
| Full dashboard loop | Pull metrics, score, detect, alert, all in one pipeline |
Prerequisites and setup
You'll need Node.js 20+, TypeScript 5+, and a Chanl account with API access. All code examples are TypeScript, and we'll use the Chanl SDK for data access throughout.
Install the SDK:
```bash
npm install @chanl/sdk
```

Set your environment:

```bash
export CHANL_API_KEY=your-api-key
export CHANL_BASE_URL=https://platform.chanl.ai
```

Initialize the client:
```typescript
import Chanl from "@chanl/sdk";

const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});
```

If you're new to agent observability concepts, AI Agent Observability: What to Monitor When Your Agent Goes Live covers the foundational thinking. This article assumes you understand why observability matters and focuses on building the pipeline.
Why traditional APM fails for AI agents
Traditional Application Performance Monitoring was built for deterministic systems. A REST API either returns the right data or throws an error. A database query either succeeds or times out. The same input produces the same output, and a failure looks like a failure.
AI agents broke that contract. The same question asked twice produces two different answers, both potentially correct, both phrased differently. A "successful" HTTP 200 response might contain completely hallucinated data. An agent might be technically "up" while stuck in an infinite reasoning loop burning $50 per minute in API costs. OneUptime documented a case where an agent hallucinated product features that don't exist, and nobody noticed for three days because the response format looked correct.
| Dimension | Traditional APM | AI Agent Observability |
|---|---|---|
| Core question | Is it running? | Is it behaving correctly? |
| Failure mode | Errors, timeouts, crashes | Confident wrong answers, policy violations, hallucinations |
| Input/output | Deterministic (same in, same out) | Non-deterministic (same in, different out) |
| Success signal | HTTP 200, low latency | Correct, complete, policy-adherent response |
| What breaks | Infrastructure | Reasoning, context, tool selection |
| Trace complexity | 2-3 spans per request | 8-15 spans per request (LLM + tools + memory) |
| Cost model | Fixed compute | Variable per-token, per-call |
| Drift risk | Low (code doesn't change itself) | High (model behavior shifts, knowledge goes stale) |
| Quality measurement | Binary pass/fail | Multi-criteria scored evaluation |
The 2026 Elastic observability report found that 85% of organizations now use some form of GenAI for observability, but the majority are applying GenAI to existing monitoring rather than building observability for their AI systems. That's the gap this article fills.
What are the four pillars of AI agent observability?
Agent observability has four pillars: metrics, logs, traces, and quality. The first three are borrowed from traditional observability. The fourth, quality, is what makes agent monitoring different. Without quality scoring, you're monitoring a system whose primary failure mode is invisible to the other three pillars.
Pillar 1: Metrics
Metrics are the numbers that tell you what's happening at a glance. For AI agents, the critical metrics are different from traditional services.
Latency isn't just response time. It's end-to-end conversation turn latency, which includes LLM inference, tool calls, memory retrieval, and response formatting. Track p50, p95, and p99 separately. An agent might respond in 800ms for 90% of queries but take 12 seconds for complex ones requiring multiple tool calls. The average looks fine. The tail doesn't.
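To see what the mean hides, compute percentiles directly. A minimal sketch in plain TypeScript, using a nearest-rank percentile and an illustrative 90/10 latency split (the numbers are assumptions, not SDK output):

```typescript
// Nearest-rank percentile over a set of latency samples (ms).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// 90% of turns are fast; 10% hit slow multi-tool paths.
const latencies = [
  ...Array.from({ length: 90 }, () => 800),
  ...Array.from({ length: 10 }, () => 12_000),
];

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log("mean:", mean);                      // 1920 — looks tolerable
console.log("p50:", percentile(latencies, 50));  // 800 — looks great
console.log("p99:", percentile(latencies, 99));  // 12000 — the real story
```

The mean and p50 both look healthy; only the tail percentile exposes the 12-second experience one in ten users actually gets.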
Token cost is the new compute bill. Unlike traditional services where compute is relatively fixed, every agent conversation has a variable cost based on prompt length, conversation history, and the number of reasoning steps. A cost spike might signal a reasoning loop or bloated context window.
Call volume and success rate tell you throughput and reliability. But for agents, "success" needs redefinition. An API call that returns 200 but gives a wrong answer isn't a success. You need to pair success rate with quality scores, which we'll build in Pillar 4.
Here's how to pull these metrics from production using the Chanl SDK:
```typescript
import Chanl from "@chanl/sdk";

const chanl = new Chanl({
  apiKey: process.env.CHANL_API_KEY,
  baseUrl: process.env.CHANL_BASE_URL,
});

// Pull aggregate metrics for the last 24 hours
const metrics = await chanl.calls.getMetrics({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});

console.log("Latency p50:", metrics.latency.p50, "ms");
console.log("Latency p95:", metrics.latency.p95, "ms");
console.log("Total calls:", metrics.totalCalls);
console.log("Avg tokens/call:", metrics.tokenUsage.average);
console.log("Total cost:", `$${metrics.cost.total.toFixed(2)}`);
```

The getMetrics method returns pre-aggregated data. This is important: you're not fetching 500 calls and computing averages client-side. The backend does the aggregation, which means this call is fast even when you're pulling metrics across millions of conversations.
For more granular analysis, pull individual call data:
```typescript
// Get a specific call's full detail
const call = await chanl.calls.get("call_abc123");

console.log("Duration:", call.duration, "seconds");
console.log("Turns:", call.turns);
console.log("Tool calls:", call.toolCalls?.length ?? 0);
console.log("Total tokens:", call.tokenUsage?.total);
console.log("Status:", call.status);
```
Pillar 2: Logs
Structured logging for AI agents means capturing every decision point in a format you can query later. The critical insight from OneUptime's analysis is that the highest-value approach is instrumenting decision points, not just request/response boundaries.
For each conversation turn, you want to capture:
```typescript
interface AgentLogEntry {
  timestamp: string;
  conversationId: string;
  turnIndex: number;
  phase: "reasoning" | "tool_call" | "tool_result" | "response";
  input: string;
  output: string;
  toolName?: string;
  toolSuccess?: boolean;
  latencyMs: number;
  tokenCount: number;
  metadata: Record<string, unknown>;
}
```

With the Chanl SDK, you can pull the full transcript for any call, which includes tool calls, agent reasoning, and customer messages in order:
```typescript
// Pull full conversation transcript
const transcript = await chanl.calls.getTranscript("call_abc123");

for (const entry of transcript.entries) {
  console.log(`[${entry.role}] ${entry.content}`);
  if (entry.toolCall) {
    console.log(`  Tool: ${entry.toolCall.name}`);
    console.log(`  Args: ${JSON.stringify(entry.toolCall.arguments)}`);
    console.log(`  Result: ${entry.toolCall.result?.status}`);
  }
}
```

The transcript gives you the raw material for debugging. When a customer reports a problem, you don't grep through CloudWatch logs hoping to find the right request. You pull the transcript, see every step the agent took, and identify exactly where the reasoning went wrong.
Pillar 3: Traces
Traces connect the dots between individual log entries. A single conversation turn might involve an LLM reasoning step, a knowledge base query, two tool calls, and a final response generation. Traces show you the causal chain: which step caused which, what ran in parallel, and where time was spent.
OpenTelemetry's experimental semantic conventions for generative AI now cover model calls, tool executions, and agent planning steps. According to the 2026 Elastic report, OTel production adoption tripled from 2025 to 2026, and 89% of production users consider OTel compliance at least very important.
Here's a basic trace structure for an agent conversation turn:
```typescript
import { trace, SpanKind } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-observability");

async function traceConversationTurn(
  conversationId: string,
  userMessage: string
) {
  return tracer.startActiveSpan(
    "agent.conversation_turn",
    { kind: SpanKind.SERVER },
    async (rootSpan) => {
      rootSpan.setAttribute("conversation.id", conversationId);
      rootSpan.setAttribute("user.message.length", userMessage.length);

      // Trace knowledge retrieval
      const context = await tracer.startActiveSpan(
        "agent.knowledge_retrieval",
        async (span) => {
          const results = await retrieveContext(userMessage);
          span.setAttribute("retrieval.results_count", results.length);
          span.setAttribute("retrieval.top_score", results[0]?.score ?? 0);
          span.end();
          return results;
        }
      );

      // Trace LLM reasoning
      const response = await tracer.startActiveSpan(
        "agent.llm_reasoning",
        async (span) => {
          const result = await generateResponse(userMessage, context);
          span.setAttribute("llm.model", result.model);
          span.setAttribute("llm.tokens.prompt", result.promptTokens);
          span.setAttribute("llm.tokens.completion", result.completionTokens);
          span.setAttribute("llm.tool_calls", result.toolCalls?.length ?? 0);
          span.end();
          return result;
        }
      );

      // Trace each tool call
      for (const toolCall of response.toolCalls ?? []) {
        await tracer.startActiveSpan(
          `agent.tool_call.${toolCall.name}`,
          async (span) => {
            span.setAttribute("tool.name", toolCall.name);
            span.setAttribute(
              "tool.arguments",
              JSON.stringify(toolCall.arguments)
            );
            const result = await executeToolCall(toolCall);
            span.setAttribute("tool.success", result.success);
            span.setAttribute("tool.latency_ms", result.latencyMs);
            span.end();
          }
        );
      }

      rootSpan.setAttribute("response.length", response.text.length);
      rootSpan.end();
      return response;
    }
  );
}
```

The key insight is that you're creating parent-child span relationships. When you look at a trace in Jaeger or your tracing backend, you see the full tree: conversation turn at the root, with knowledge retrieval, LLM reasoning, and tool calls as children. If a tool call takes 4 seconds, you see it immediately. If knowledge retrieval returns zero results and the agent hallucinates, you see the empty retrieval and the hallucinated response connected in the same trace.
Pillar 4: Quality
This is the pillar that traditional observability doesn't have, and it's the one that matters most for AI agents. Quality scoring answers the question that metrics, logs, and traces can't: was the agent's output actually correct?
Quality scoring works by evaluating agent responses against defined criteria: accuracy, policy adherence, tone, completeness, and domain-specific rules. If you're deciding between vibes-based assessment and structured scoring, Scorecards vs. Vibes breaks down the tradeoffs. You can run these evaluations on every call or on a statistical sample.
The Chanl SDK provides two approaches. First, evaluate a specific call against a scorecard:
```typescript
// Score a specific call against quality criteria
const scorecard = await chanl.calls.getScorecard("call_abc123");

console.log("Overall score:", scorecard.overallScore);
for (const criterion of scorecard.criteria) {
  console.log(`  ${criterion.name}: ${criterion.score}/${criterion.maxScore}`);
  if (criterion.score < criterion.threshold) {
    console.log(`  ⚠ Below threshold: ${criterion.feedback}`);
  }
}
```

Second, run an evaluation programmatically. This is useful for sampling production traffic:
```typescript
// Run quality evaluation on a call
const evaluation = await chanl.scorecards.evaluate({
  callId: "call_abc123",
  scorecardId: "sc_support_quality",
});

console.log("Evaluation ID:", evaluation.id);
console.log("Score:", evaluation.score);
console.log("Pass:", evaluation.pass);
console.log("Criteria results:", evaluation.results);
```

And third, pull aggregate quality data to track trends:
```typescript
// Get quality trends over time
const qualityResults = await chanl.scorecards.listResults({
  agentId: "agent_support_v3",
  startDate: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString(),
  endDate: new Date().toISOString(),
});

console.log("Total evaluations:", qualityResults.total);
console.log("Average score:", qualityResults.averageScore);
console.log("Pass rate:", `${qualityResults.passRate}%`);
```

Quality scoring is what closes the loop. Without it, you're flying with three instruments in a four-instrument cockpit.
How to build the full observability pipeline
A production pipeline connects the four pillars into a single cron job: collect metrics, sample calls for quality scoring, compare scores against baselines, and fire alerts when multiple signals correlate. Here's how to wire it together.
Step 1: Metrics collector
The metrics collector runs hourly and pulls aggregate data:
```typescript
interface MetricsSnapshot {
  timestamp: string;
  agentId: string;
  period: { start: string; end: string };
  latency: { p50: number; p95: number; p99: number };
  calls: { total: number; successful: number; failed: number };
  tokens: { average: number; total: number; cost: number };
  tools: { totalCalls: number; successRate: number };
}

async function collectMetrics(agentId: string): Promise<MetricsSnapshot> {
  const now = new Date();
  const oneHourAgo = new Date(now.getTime() - 60 * 60 * 1000);

  const metrics = await chanl.calls.getMetrics({
    agentId,
    startDate: oneHourAgo.toISOString(),
    endDate: now.toISOString(),
  });

  return {
    timestamp: now.toISOString(),
    agentId,
    period: {
      start: oneHourAgo.toISOString(),
      end: now.toISOString(),
    },
    latency: {
      p50: metrics.latency.p50,
      p95: metrics.latency.p95,
      p99: metrics.latency.p99,
    },
    calls: {
      total: metrics.totalCalls,
      successful: metrics.successfulCalls,
      failed: metrics.failedCalls,
    },
    tokens: {
      average: metrics.tokenUsage.average,
      total: metrics.tokenUsage.total,
      cost: metrics.cost.total,
    },
    tools: {
      totalCalls: metrics.toolCalls?.total ?? 0,
      successRate: metrics.toolCalls?.successRate ?? 1,
    },
  };
}
```

Step 2: Quality sampler
For quality scoring, you don't need to evaluate every call. A 5-10% sample gives you statistical significance for trend detection. The sampler selects recent calls and runs them through Chanl's scorecard evaluation:
```typescript
interface QualitySample {
  callId: string;
  score: number;
  pass: boolean;
  criteria: Array<{
    name: string;
    score: number;
    maxScore: number;
  }>;
}

async function sampleAndScore(
  agentId: string,
  scorecardId: string,
  sampleRate: number = 0.1
): Promise<QualitySample[]> {
  // Pull recent calls
  const recentCalls = await chanl.calls.list({
    agentId,
    startDate: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
    limit: 100,
  });

  // Random sample
  const sampled = recentCalls.items.filter(() => Math.random() < sampleRate);

  const results: QualitySample[] = [];
  for (const call of sampled) {
    const evaluation = await chanl.scorecards.evaluate({
      callId: call.id,
      scorecardId,
    });
    results.push({
      callId: call.id,
      score: evaluation.score,
      pass: evaluation.pass,
      criteria: evaluation.results.map((r) => ({
        name: r.criterionName,
        score: r.score,
        maxScore: r.maxScore,
      })),
    });
  }
  return results;
}
```

Step 3: Drift detector
One bad call is an outlier. A week of declining scores is drift. The drift detector compares current quality scores against your baseline using standard deviation thresholds.
```typescript
interface DriftSignal {
  detected: boolean;
  severity: "none" | "warning" | "critical";
  metric: string;
  currentValue: number;
  baselineValue: number;
  deviationPercent: number;
}

function detectDrift(
  currentScores: QualitySample[],
  baselineAvg: number,
  baselineStdDev: number
): DriftSignal {
  if (currentScores.length === 0) {
    return {
      detected: false,
      severity: "none",
      metric: "quality_score",
      currentValue: 0,
      baselineValue: baselineAvg,
      deviationPercent: 0,
    };
  }

  const currentAvg =
    currentScores.reduce((sum, s) => sum + s.score, 0) / currentScores.length;
  const deviation = baselineAvg - currentAvg;
  const deviationPercent = (deviation / baselineAvg) * 100;

  // Warning at 1 std dev below baseline, critical at 2
  let severity: "none" | "warning" | "critical" = "none";
  if (deviation > 2 * baselineStdDev) {
    severity = "critical";
  } else if (deviation > baselineStdDev) {
    severity = "warning";
  }

  return {
    detected: severity !== "none",
    severity,
    metric: "quality_score",
    currentValue: currentAvg,
    baselineValue: baselineAvg,
    deviationPercent,
  };
}
```

Step 4: Alert engine
Here's where the pipeline earns its keep. A single metric anomaly is noise. Multiple correlated anomalies are an incident. The alert engine combines signals from metrics and drift detection to decide what's worth waking someone up for.
```typescript
interface AlertDecision {
  shouldAlert: boolean;
  severity: "info" | "warning" | "critical";
  signals: string[];
  message: string;
}

function evaluateAlerts(
  metrics: MetricsSnapshot,
  drift: DriftSignal,
  baseline: {
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  }
): AlertDecision {
  const signals: string[] = [];

  // Check latency spike
  if (metrics.latency.p95 > baseline.avgLatencyP95 * 2) {
    signals.push(
      `Latency p95 at ${metrics.latency.p95}ms (baseline: ${baseline.avgLatencyP95}ms)`
    );
  }

  // Check cost spike (possible reasoning loop)
  if (metrics.tokens.cost > baseline.avgCost * 3) {
    signals.push(
      `Token cost $${metrics.tokens.cost.toFixed(2)} (baseline: $${baseline.avgCost.toFixed(2)})`
    );
  }

  // Check tool failure rate
  if (metrics.tools.successRate < baseline.avgToolSuccessRate * 0.8) {
    signals.push(
      `Tool success rate ${(metrics.tools.successRate * 100).toFixed(1)}% ` +
        `(baseline: ${(baseline.avgToolSuccessRate * 100).toFixed(1)}%)`
    );
  }

  // Check quality drift
  if (drift.detected) {
    signals.push(
      `Quality drift: ${drift.severity} ` +
        `(${drift.deviationPercent.toFixed(1)}% below baseline)`
    );
  }

  // Composite severity
  let severity: "info" | "warning" | "critical" = "info";
  if (signals.length >= 3 || drift.severity === "critical") {
    severity = "critical";
  } else if (signals.length >= 2 || drift.severity === "warning") {
    severity = "warning";
  }

  return {
    shouldAlert: signals.length >= 2,
    severity,
    signals,
    message:
      signals.length >= 2
        ? `${signals.length} correlated signals detected for ${metrics.agentId}: ${signals.join("; ")}`
        : "No actionable signals",
  };
}
```

The key design decision: shouldAlert requires two or more signals. A latency spike alone doesn't page anyone. A latency spike combined with a quality score drop and a cost increase means something is actually wrong.
Step 5: Putting it all together
Here's the full pipeline, designed to run on a cron schedule:
```typescript
async function runObservabilityPipeline(config: {
  agentId: string;
  scorecardId: string;
  sampleRate: number;
  baseline: {
    qualityAvg: number;
    qualityStdDev: number;
    avgLatencyP95: number;
    avgCost: number;
    avgToolSuccessRate: number;
  };
}) {
  console.log(`[${new Date().toISOString()}] Running pipeline for ${config.agentId}`);

  // Step 1: Collect metrics
  const metrics = await collectMetrics(config.agentId);
  console.log(`  Calls: ${metrics.calls.total}, P95: ${metrics.latency.p95}ms`);

  // Step 2: Sample and score quality
  const qualitySamples = await sampleAndScore(
    config.agentId,
    config.scorecardId,
    config.sampleRate
  );
  console.log(`  Scored ${qualitySamples.length} calls`);

  // Step 3: Detect drift
  const drift = detectDrift(
    qualitySamples,
    config.baseline.qualityAvg,
    config.baseline.qualityStdDev
  );
  console.log(`  Drift: ${drift.severity}`);

  // Step 4: Evaluate alerts
  const alert = evaluateAlerts(metrics, drift, {
    avgLatencyP95: config.baseline.avgLatencyP95,
    avgCost: config.baseline.avgCost,
    avgToolSuccessRate: config.baseline.avgToolSuccessRate,
  });

  if (alert.shouldAlert) {
    console.log(`  🚨 ALERT [${alert.severity}]: ${alert.message}`);
    await sendAlert(alert);
  } else {
    console.log("  ✓ No actionable signals");
  }

  // Step 5: Store for trend analysis
  await storeSnapshot({
    metrics,
    qualitySamples,
    drift,
    alert,
  });
}
```

Run it every hour and you have a working observability pipeline. The first two weeks of data become your baseline. After that, drift detection has enough history to catch gradual degradation.
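If you don't want to depend on a cron daemon, a plain setTimeout loop is enough. A sketch using only the Node standard library; the re-arm-after-completion design is a deliberate choice so a slow run can never overlap the next one:

```typescript
// Milliseconds until the next top of the hour (local time).
function msUntilNextHour(now: Date = new Date()): number {
  const next = new Date(now);
  next.setMinutes(0, 0, 0);
  next.setHours(next.getHours() + 1);
  return next.getTime() - now.getTime();
}

// Run the task at each top of the hour, re-arming only after the
// current run finishes so runs can't pile up.
function scheduleHourly(task: () => Promise<void>): void {
  setTimeout(async () => {
    try {
      await task();
    } finally {
      scheduleHourly(task);
    }
  }, msUntilNextHour());
}

// Usage (config as defined for runObservabilityPipeline above):
// scheduleHourly(() => runObservabilityPipeline(config));
```

In production you'd likely add error reporting around the task and run this inside a supervised process; the sketch shows the scheduling shape, not the ops hardening.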
How to investigate a production incident with this pipeline
Start with the composite alert signal to narrow the search. A latency spike plus quality drop points to a slow or failing tool. A cost spike plus quality drop suggests a reasoning loop. The combination tells you where to look first.
Pull a failing call. Use the Chanl SDK to get the full conversation:
```typescript
// Get the call detail
const call = await chanl.calls.get("call_failing_123");

// Pull the full transcript
const transcript = await chanl.calls.getTranscript("call_failing_123");

// Run AI analysis on the call
const analysis = await chanl.calls.analyze("call_failing_123");

console.log("AI Analysis:", analysis.summary);
console.log("Issues found:", analysis.issues);
console.log("Suggested fixes:", analysis.suggestions);
```

The analyze method runs AI-powered analysis on the call, identifying issues like policy violations, hallucinations, or tool misuse. It doesn't replace human review, but it narrows the search space from "something is wrong with 500 daily conversations" to "here are the three specific failure patterns in calls from the last six hours."
Check the scorecard breakdown. The overall quality score tells you the agent is underperforming. The per-criterion breakdown tells you why:
```typescript
const scorecard = await chanl.calls.getScorecard("call_failing_123");

// Find which criteria failed
const failures = scorecard.criteria.filter(
  (c) => c.score < c.threshold
);

for (const failure of failures) {
  console.log(`Failed: ${failure.name}`);
  console.log(`  Score: ${failure.score}/${failure.maxScore}`);
  console.log(`  Feedback: ${failure.feedback}`);
}
```

Maybe accuracy is fine but policy adherence dropped. That's a stale knowledge base. Maybe accuracy and tone both dropped. That's probably a model change or prompt regression. The per-criterion data tells you where to look, and the Chanl scorecard system handles the evaluation so you're not building an LLM-as-judge from scratch.
How to establish baselines and catch drift over time
Collect two weeks of metrics and quality scores before enabling alerts. That data becomes your baseline. Drift detection compares each new data point against the baseline's mean and standard deviation. A 0.5% daily decline is invisible in real time but adds up to 15% over a month.
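The arithmetic behind that claim is worth a quick check: a relative 0.5% daily decline compounds to roughly 14% over 30 days (about 15% if you sum it linearly), while any single day-over-day comparison stays well inside normal noise:

```typescript
// How a small daily decline compounds. Starting score and decline rate
// are illustrative assumptions.
function scoreAfterDays(
  baseline: number,
  dailyDecline: number,
  days: number
): number {
  return baseline * Math.pow(1 - dailyDecline, days);
}

const day1 = scoreAfterDays(90, 0.005, 1);   // ≈89.55 — within noise
const day30 = scoreAfterDays(90, 0.005, 30); // ≈77.4 — ~14% below baseline
console.log({ day1, day30 });
```

This is exactly the failure mode baseline comparison catches and threshold-on-latest-value alerting misses.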
Building your baseline
Run the metrics collector and quality sampler for two weeks without alerting. Store every data point. After two weeks, compute:
```typescript
interface Baseline {
  qualityAvg: number;
  qualityStdDev: number;
  avgLatencyP95: number;
  avgCost: number;
  avgToolSuccessRate: number;
  sampleSize: number;
  periodDays: number;
}

function computeBaseline(
  snapshots: MetricsSnapshot[],
  qualityScores: number[]
): Baseline {
  const latencies = snapshots.map((s) => s.latency.p95);
  const costs = snapshots.map((s) => s.tokens.cost);
  const toolRates = snapshots.map((s) => s.tools.successRate);

  const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;
  const stdDev = (arr: number[]) => {
    const mean = avg(arr);
    return Math.sqrt(arr.reduce((sum, v) => sum + (v - mean) ** 2, 0) / arr.length);
  };

  return {
    qualityAvg: avg(qualityScores),
    qualityStdDev: stdDev(qualityScores),
    avgLatencyP95: avg(latencies),
    avgCost: avg(costs),
    avgToolSuccessRate: avg(toolRates),
    sampleSize: qualityScores.length,
    periodDays: 14,
  };
}
```

Tracking quality trends
Use Chanl's aggregate scorecard results to track quality over time without running individual evaluations on historical data:
```typescript
async function getQualityTrend(agentId: string, weeks: number = 4) {
  const trends: Array<{ week: number; avgScore: number; passRate: number }> = [];

  for (let w = 0; w < weeks; w++) {
    const end = new Date(Date.now() - w * 7 * 24 * 60 * 60 * 1000);
    const start = new Date(end.getTime() - 7 * 24 * 60 * 60 * 1000);

    const results = await chanl.scorecards.listResults({
      agentId,
      startDate: start.toISOString(),
      endDate: end.toISOString(),
    });

    trends.push({
      week: w,
      avgScore: results.averageScore,
      passRate: results.passRate,
    });
  }

  // Check for a strictly downward trend (reversed so oldest week comes first)
  const scores = trends.map((t) => t.avgScore).reverse();
  let declining = true;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] >= scores[i - 1]) {
      declining = false;
      break;
    }
  }

  if (declining && scores.length > 2) {
    const totalDrop = scores[0] - scores[scores.length - 1];
    console.log(
      `⚠ Consistent quality decline: ${totalDrop.toFixed(2)} points over ${weeks} weeks`
    );
  }

  return trends;
}
```

A consistent decline of 0.3 points or more over two weeks is a clear signal. At that point, pull the per-criterion breakdown from your scorecard results to identify which specific quality dimensions are degrading. Is it accuracy? Policy adherence? Tone? Each points to a different root cause.
What does the cost of observability look like?
A single LLM call creates 8-15 spans compared to 2-3 for a typical API endpoint. Commercial observability platforms charging $0.10-$0.30 per GB of ingested data can produce observability bills that exceed your actual LLM costs. The fix is aggregation and sampling.
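You can sanity-check the per-GB math against your own traffic. A rough estimator, where spans-per-call and bytes-per-span are assumptions to replace with measurements from your own instrumentation:

```typescript
// Estimate monthly trace-ingest volume in GB. All inputs are assumptions;
// measure spansPerCall and bytesPerSpan from your own instrumentation.
function monthlyIngestGB(
  callsPerDay: number,
  spansPerCall: number,
  bytesPerSpan: number,
  sampleRate: number // fraction of calls fully traced
): number {
  const bytesPerDay = callsPerDay * sampleRate * spansPerCall * bytesPerSpan;
  return (bytesPerDay * 30) / 1e9;
}

// 500 calls/day, 12 spans/call, ~2KB/span: full tracing vs. a 10% sample.
const fullGB = monthlyIngestGB(500, 12, 2048, 1.0);
const sampledGB = monthlyIngestGB(500, 12, 2048, 0.1);
console.log(fullGB.toFixed(3), "GB vs", sampledGB.toFixed(3), "GB");
```

Multiply the result by your platform's per-GB price to get the line item; because the bill scales linearly with volume, sampling 10% cuts trace-ingest cost by 10x, and span payload size (full prompts attached to spans are a common surprise) moves the number just as much as call volume does.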
Here's a realistic cost breakdown for an agent handling 500 calls per day:
| Component | Volume/Day | Approach | Monthly Cost |
|---|---|---|---|
| Metrics | 500 calls x 24 metrics | Aggregate before store | ~$5 |
| Logs | 500 transcripts x ~5KB each | Store full, query on demand | ~$15 |
| Traces | 500 calls x ~12 spans each | Sample 10%, full trace | ~$10 |
| Quality evals | 50 calls x scorecard | 10% sample rate | ~$25 (LLM eval cost) |
| Storage | All above, 90-day retention | Time-series DB | ~$20 |
| Total | | | ~$75/month |
Compare that to the cost of not monitoring: a three-day undetected hallucination affecting 1,500 conversations. Or a $15,000 monthly bill spike from a tool retry loop. $75 per month for a pipeline that catches these problems within an hour is a reasonable trade.
In practice that means: don't store every span for every call. Aggregate metrics hourly, sample 10% for full tracing, and run quality evaluations on the sample. You get statistical significance without the data volume of full instrumentation.
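One refinement worth considering: `Math.random()` sampling, as used in the quality sampler earlier, isn't reproducible, so your traced 10% and your quality-scored 10% end up being different calls. A deterministic alternative keyed on the call ID (a sketch, not an SDK feature) lets every pipeline stage sample the same slice of traffic:

```typescript
import { createHash } from "node:crypto";

// Deterministically decide whether a call is in the sample, keyed on its ID.
// The same call ID always gets the same decision, so metrics, traces, and
// quality evals can all sample the *same* subset of traffic.
function inSample(callId: string, sampleRate: number): boolean {
  const digest = createHash("sha256").update(callId).digest();
  // First 4 bytes of the hash as an unsigned int, mapped to [0, 1).
  const bucket = digest.readUInt32BE(0) / 0x1_0000_0000;
  return bucket < sampleRate;
}

console.log(inSample("call_abc123", 0.1)); // stable across runs and services
```

When a low-scoring call shows up in the quality sample, its full trace is guaranteed to exist too, which makes the investigation step a single lookup instead of a lucky draw.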
What should your observability dashboard look like?
A good dashboard answers three questions at a glance: Is the agent healthy right now? Has quality changed over time? Where should I investigate?
Organize it in three rows, each serving a different time horizon:
| Row | Content | Time horizon |
|---|---|---|
| Top: Health indicators | Four cards: p95 latency, call volume, tool success rate, quality score. Sparkline of last 24 hours. Color-coded against baseline (green/yellow/red). | Right now |
| Middle: Trends | Full-width chart showing quality score, latency, and cost over 30 days. Drift that looks flat hour-to-hour becomes obvious at this scale. | Last month |
| Bottom: Investigation | Recent alerts, lowest-scoring calls, drift-flagged calls. Each links to chanl.calls.get(callId) for transcript and scorecard. | Action queue |
The Chanl analytics dashboard provides this layout out of the box, with real-time metrics, quality trends, and drill-down into individual conversations. If you're building your own dashboard, the SDK methods we've covered (getMetrics, getScorecard, listResults, getTranscript, analyze) provide all the data you need. Chanl's monitoring features handle the alerting layer, so you can focus on the analysis rather than the plumbing.
The closed loop: observability that improves the agent
Observability is not a passive activity. The whole point of collecting metrics, scoring quality, and detecting drift is to feed improvements back into the agent. Here's how the loop closes:
- Pipeline detects drift in the "policy adherence" quality criterion
- Investigation reveals the knowledge base contains outdated return policy documents
- Fix: update the knowledge base, run scenario tests against the updated content
- Verification: quality scores recover to baseline within 24 hours
- Baseline update: the pipeline incorporates the new, higher-quality data into its rolling baseline
This is the data flywheel. Every observation becomes an input to improvement, the same loop described in turning conversation data into agent improvements. The agent gets better not because you're guessing what's wrong, but because you have data showing exactly what degraded and by how much.
The same loop applies to prompt changes. Before deploying a prompt update, run your scorecard evaluations against a set of production calls using the new prompt. Compare scores to your baseline. If quality improves, ship it. If it degrades on any criterion, investigate before deploying. The pipeline gives you the data to make that decision with confidence.
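That before/after comparison can be as simple as a per-criterion diff between two evaluation runs. A sketch, assuming each run has been summarized as a map of criterion name to average score (the criterion names and tolerance are illustrative):

```typescript
interface CriterionDiff {
  name: string;
  baseline: number;
  candidate: number;
  delta: number;
}

// Compare a candidate prompt's per-criterion averages against the baseline.
// Ship only if no criterion regresses by more than the tolerance.
function comparePromptRuns(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  tolerance = 0.05
): { ship: boolean; regressions: CriterionDiff[] } {
  const regressions: CriterionDiff[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const cand = candidate[name] ?? 0;
    if (base - cand > tolerance) {
      regressions.push({ name, baseline: base, candidate: cand, delta: cand - base });
    }
  }
  return { ship: regressions.length === 0, regressions };
}

const verdict = comparePromptRuns(
  { accuracy: 0.92, policy_adherence: 0.88, tone: 0.95 },
  { accuracy: 0.94, policy_adherence: 0.79, tone: 0.95 }
);
console.log(verdict.ship);        // false — policy_adherence regressed
console.log(verdict.regressions);
```

Note the asymmetry: an improvement on one criterion (accuracy here) doesn't offset a regression on another, which matches the "investigate before deploying" rule above.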
Wrapping up
80% of Fortune 500 companies are running active AI agents. Most of them are monitoring uptime when they should be monitoring behavior. That's the same blind spot as the retired return policy from the opening: HTTP 200, zero alerts, completely wrong answers.
The pipeline we built covers the full loop: collect metrics with chanl.calls.getMetrics(), score quality with chanl.scorecards.evaluate(), detect drift by comparing against baselines, and alert on composite signals. It runs on a cron, costs about $75 per month for a 500-call-per-day agent, and catches the failures that traditional APM misses.
Start with the metrics collector. Get two weeks of baseline data. Then add quality scoring and drift detection. Every team that instruments quality scoring discovers problems they didn't know existed. The question isn't whether your agent has issues. It's whether you're seeing them before your customers do.
To build the evaluation framework that powers quality scoring, see How to Evaluate AI Agents: Build an Eval Framework from Scratch. For testing agents before they hit production, Scenario Testing: The QA Strategy That Catches What Unit Tests Miss covers the pre-deploy side of the loop.
See your agent's blind spots
Chanl's observability pipeline gives you metrics, quality scoring, and drift detection for every AI agent conversation. Start monitoring what traditional APM misses.
Try Chanl free