Last year, a team I know shipped a customer support agent to production after three days of manual testing. They'd asked it maybe forty questions, liked the answers, and called it ready. Within a week, the agent was confidently quoting a refund policy that had been deprecated six months earlier. It told three customers they were eligible for full refunds on items that were, in fact, final sale.
The agent didn't hallucinate. It didn't crash. It just gave plausible-sounding wrong answers — the hardest failure mode to catch without structured evaluation. The fix took twenty minutes. Finding the problem took five days and an angry email from the VP of Customer Success.
This is the vibes-based testing trap. You run a few prompts, the outputs look reasonable, and you ship. It works until it doesn't, and when it doesn't, the failures are always subtle and always expensive.
This guide is about building something better: a real evaluation framework you can run before every deploy, catch regressions automatically, and actually trust. We'll build working eval harnesses in both TypeScript and Python, cover the major evaluation strategies, and set you up with patterns that scale from side project to production.
Why You Need Evals (Not Just Tests)
Traditional software testing is binary. A function returns the right value or it doesn't. The API responds with a 200 or it doesn't. You can assert on exact outputs.
AI agents don't work that way. Ask the same agent the same question twice and you'll get two different answers — both potentially correct, both phrased differently. "Did the agent do a good job?" isn't a boolean question. It's a spectrum across multiple dimensions: accuracy, tone, completeness, policy adherence, helpfulness.
This means you need evaluations, not just tests. Evals score agent behavior on a rubric. They measure quality on a scale. They let you compare version A of your prompt against version B and know — with numbers, not feelings — which one is better.
Without evals, you'll hit these problems eventually:
Prompt regressions go undetected. You tweak the system prompt to improve handling of billing questions. It works great for billing. But now the agent is 15% worse at answering shipping questions, and you won't notice for weeks.
Model upgrades break things silently. You switch from GPT-4o to a newer model. Overall quality looks fine. But the new model is worse at following your specific formatting instructions, and your downstream systems that parse agent responses start failing.
You can't compare approaches. Is few-shot prompting better than a detailed system prompt for your use case? Without evals, you're guessing. With evals, you run both and look at the numbers.
“Evaluation is the immune system of an AI application. Without it, every change is a potential infection you won't detect until the symptoms are obvious.”
The Six Types of AI Agent Evals
Not all evaluations work the same way. Each type has strengths, weaknesses, and situations where it's the right tool. Let's walk through all six.
1. Exact Match and Heuristic Evals
The simplest kind. Does the output contain a specific string? Does it match a regex pattern? Is it valid JSON? Is it under a certain length?
// Simple heuristic checks
function evalFormatting(output: string): boolean {
// Must not contain internal system tags
if (output.includes("[INTERNAL]") || output.includes("{{")) return false;
// Must stay under 500 words
if (output.split(/\s+/).length > 500) return false;
// Must not quote a dollar amount without a disclaimer
const hasDollar = /\$\d+/.test(output);
const hasDisclaimer = /subject to change|may vary|contact.*for.*pricing/i.test(output);
if (hasDollar && !hasDisclaimer) return false;
return true;
}Heuristic evals are fast, deterministic, and cheap. Use them as a first pass — they catch obvious structural failures before you spend money on LLM-as-judge scoring.
2. LLM-as-Judge
This is the workhorse of modern eval frameworks. You use one LLM to evaluate another LLM's output. The judge model gets the original question, the agent's response, and a scoring rubric, then produces a structured score with reasoning.
The key insight: the judge prompt matters enormously. A vague judge prompt produces vague, inconsistent scores. A precise judge prompt with scoring rubrics and examples produces scores that correlate strongly with human judgment.
We'll build a full LLM-as-judge system later in this article. For now, here's the high-level pattern:
Input: "What's your return policy?"
Agent output: "You can return any item within 30 days for a
full refund, no questions asked!"
Judge prompt:
- Score ACCURACY (1-5): Is the information factually correct
given the reference policy?
- Score COMPLETENESS (1-5): Did the agent cover all relevant
details (timeframe, conditions, exceptions)?
- Score TONE (1-5): Was the response appropriately helpful
without being misleading?
Judge output:
accuracy: 3 (correct timeframe but omitted the
"original packaging required" condition)
completeness: 2 (missing restocking fee, packaging
requirement, and gift card exception)
tone: 4 (friendly and clear, slightly overpromises
with "no questions asked")3. Reference-Based Evals
You provide a "gold standard" answer and measure how close the agent's response is. This isn't exact string matching — you're typically using semantic similarity or an LLM judge to compare meaning, not wording.
Reference-based evals are great for factual questions where there's a clearly correct answer. They're less useful for open-ended conversations where many different responses could be equally good.
4. Rubric-Based Evals
Instead of comparing against a reference answer, you define a rubric — a structured set of criteria with score levels. This is what you'll use most in practice. A rubric for a customer support agent might evaluate accuracy, empathy, policy adherence, and resolution effectiveness as separate dimensions.
The power of rubric-based evals is that they decompose "quality" into measurable components. An agent can score 5/5 on accuracy while scoring 2/5 on empathy. That tells you exactly what to fix — something a single overall score never will. This is the same principle behind structured scorecard systems: multi-dimensional evaluation with clear criteria at each level.
5. Human Preference Evals
Show a human two agent responses to the same prompt and ask which is better. Aggregate enough preferences and you get reliable rankings using Elo ratings or Bradley-Terry models — the same math used to rank chess players.
Human preference evals are expensive and slow, but they're the gold standard for subjective quality. Use them to calibrate your automated evals: if your LLM judge consistently disagrees with human preferences, your judge prompt needs work.
6. End-to-End Task Completion
Did the agent actually accomplish the goal? For a customer support agent: was the issue resolved? For a booking agent: was the reservation made correctly? For a data extraction agent: were the right fields populated?
Task completion evals often require integration with your actual systems — checking that a ticket was created, a database was updated, or a correct API call was made. They're the most realistic eval type, but also the most complex to set up.
For agents handling multi-step workflows, scenario-based testing lets you simulate entire conversations with personas and validate the end state — not just individual responses.
Building an Eval Harness in TypeScript
Let's build a real, working eval framework. This isn't a toy example — you can use this as the foundation for production evaluations.
The architecture is simple: define test cases, run them against your agent, score the results with LLM-as-judge, and output a report.
import Anthropic from "@anthropic-ai/sdk";
// — Types ---
interface TestCase {
id: string;
input: string;
context?: string; // optional reference info for the judge
criteria: string[]; // what the judge should evaluate
expectedBehavior: string; // natural language description
}
interface EvalScore {
criterion: string;
score: number; // 1-5
reasoning: string;
}
interface EvalResult {
testCase: TestCase;
agentOutput: string;
scores: EvalScore[];
averageScore: number;
pass: boolean;
latencyMs: number;
}
// — Agent Under Test ---
async function runAgent(
client: Anthropic,
systemPrompt: string,
userMessage: string
): Promise<{ output: string; latencyMs: number }> {
const start = Date.now();
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: systemPrompt,
messages: [{ role: "user", content: userMessage }],
});
const output =
response.content[0].type === "text" ? response.content[0].text : "";
return { output, latencyMs: Date.now() - start };
}
// — LLM-as-Judge ---
const JUDGE_PROMPT = `You are an expert evaluator for AI agent responses.
You will be given:
1. The user's input message
2. The agent's response
3. Context about what the correct behavior should be
4. A list of criteria to evaluate
For each criterion, provide:
- A score from 1-5 (1=terrible, 2=poor, 3=adequate, 4=good, 5=excellent)
- A brief reasoning explaining the score
Think step-by-step before scoring. Consider edge cases and subtle issues.
Respond in this exact JSON format:
{
"scores": [
{
"criterion": "<criterion name>",
"score": <1-5>,
"reasoning": "<1-2 sentence explanation>"
}
]
}`;
async function judgeResponse(
client: Anthropic,
testCase: TestCase,
agentOutput: string
): Promise<EvalScore[]> {
const message = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: JUDGE_PROMPT,
messages: [
{
role: "user",
content: `## User Input
${testCase.input}
## Agent Response
${agentOutput}
## Expected Behavior
${testCase.expectedBehavior}
${testCase.context ? "## Reference Context\n" + testCase.context : ""}
## Criteria to Evaluate
${testCase.criteria.map((c, i) => `${i + 1}. ${c}`).join("\n")}`,
},
],
});
const text =
message.content[0].type === "text" ? message.content[0].text : "{}";
const jsonMatch = text.match(/\{[\s\S]*\}/);
if (!jsonMatch) throw new Error("Judge did not return valid JSON");
const parsed = JSON.parse(jsonMatch[0]);
return parsed.scores;
}
// — Test Runner ---
async function runEvals(
testCases: TestCase[],
systemPrompt: string,
passThreshold: number = 3.5
): Promise<EvalResult[]> {
const client = new Anthropic();
const results: EvalResult[] = [];
for (const testCase of testCases) {
console.log(`Running: ${testCase.id}...`);
const { output, latencyMs } = await runAgent(
client, systemPrompt, testCase.input
);
const scores = await judgeResponse(client, testCase, output);
const avg =
scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
results.push({
testCase,
agentOutput: output,
scores,
averageScore: Math.round(avg * 100) / 100,
pass: avg >= passThreshold,
latencyMs,
});
}
return results;
}
// — Report ---
function printReport(results: EvalResult[]): void {
console.log("\n" + "=".repeat(60));
console.log("EVALUATION REPORT");
console.log("=".repeat(60));
const passed = results.filter((r) => r.pass).length;
console.log(`\nOverall: ${passed}/${results.length} passed\n`);
for (const r of results) {
const icon = r.pass ? "PASS" : "FAIL";
console.log(`[${icon}] ${r.testCase.id} — avg: ${r.averageScore} (${r.latencyMs}ms)`);
for (const s of r.scores) {
console.log(` ${s.criterion}: ${s.score}/5 — ${s.reasoning}`);
}
console.log();
}
}
// — Test Cases ---
const SUPPORT_AGENT_PROMPT = `You are a customer support agent for TechCo.
Our return policy: 30-day returns with original packaging. Restocking
fee of 15% for opened electronics. Gift cards are final sale.
Business hours: Mon-Fri 9am-6pm EST.
Always be helpful, accurate, and empathetic.`;
const testCases: TestCase[] = [
{
id: "eval-001",
input: "I bought a laptop 3 weeks ago and want to return it. I opened the box though.",
context: "30-day return window. Opened electronics have 15% restocking fee.",
criteria: ["Accuracy", "Completeness", "Empathy"],
expectedBehavior:
"Should confirm the return is within the 30-day window, mention the " +
"15% restocking fee for opened electronics, and be empathetic.",
},
{
id: "eval-002",
input: "Can I return a gift card?",
context: "Gift cards are final sale and cannot be returned.",
criteria: ["Accuracy", "Tone", "Policy adherence"],
expectedBehavior:
"Should clearly state that gift cards are final sale and cannot be " +
"returned. Should be empathetic but firm. Must not offer alternatives " +
"that contradict the policy.",
},
{
id: "eval-003",
input: "Your product broke after 2 days! This is unacceptable!",
context: "Defective items within 30 days get full refund, no restocking fee.",
criteria: ["Empathy", "Accuracy", "De-escalation", "Resolution"],
expectedBehavior:
"Should acknowledge frustration, apologize, explain that defective items " +
"qualify for full refund without restocking fee, and offer clear next steps.",
},
{
id: "eval-004",
input: "What are your hours? Also can I return something I bought 45 days ago?",
context: "Hours: Mon-Fri 9-6 EST. Returns within 30 days only.",
criteria: ["Accuracy", "Completeness", "Clarity"],
expectedBehavior:
"Should answer BOTH questions. State business hours correctly AND explain " +
"that the 45-day return is outside the 30-day window. Must not skip either question.",
},
];
// — Run ---
runEvals(testCases, SUPPORT_AGENT_PROMPT).then(printReport);That's a complete, runnable eval harness. Install the SDK (npm install @anthropic-ai/sdk), set your ANTHROPIC_API_KEY environment variable, and run it with npx tsx eval-harness.ts.
Let's break down what's happening:
Test cases define the input, expected behavior, reference context, and specific criteria to evaluate. Each criterion gets its own score — you're not collapsing everything into a single number.
The agent runner calls your LLM with the system prompt and captures both the output and latency. In a production setup, you'd swap this out for a call to your actual agent API.
The LLM judge is the critical piece. It gets the test case, the agent's response, and a detailed rubric. The judge prompt asks for step-by-step reasoning before scoring — this chain-of-thought approach significantly improves scoring consistency. It returns structured JSON with per-criterion scores and explanations.
The report shows pass/fail for each test case with a detailed breakdown. You can immediately see which criteria failed and why.
A Note on Judge Prompt Design
The judge prompt is the most important piece of your eval framework. A few principles that matter:
Be specific about what each score level means. "Score 1-5" is too vague. Consider adding anchored examples: "A score of 3 means the response is factually correct but incomplete. A score of 5 means the response is correct, complete, and proactively addresses likely follow-up questions."
Ask for reasoning before the score. When the judge explains its thinking first, the scores are more consistent. This is chain-of-thought prompting applied to evaluation.
Use a strong model for judging. Your judge should be at least as capable as the model you're evaluating. Using a weaker model to judge a stronger one produces unreliable results.
Building the Same Harness in Python
Here's the equivalent framework in Python. Same architecture, same patterns — just idiomatic Python.
import anthropic
import json
import time
from dataclasses import dataclass, field
# — Types ---
@dataclass
class TestCase:
id: str
input: str
criteria: list[str]
expected_behavior: str
context: str = ""
@dataclass
class EvalScore:
criterion: str
score: int # 1-5
reasoning: str
@dataclass
class EvalResult:
test_case: TestCase
agent_output: str
scores: list[EvalScore]
average_score: float
passed: bool
latency_ms: float
# — Agent Under Test ---
def run_agent(
client: anthropic.Anthropic,
system_prompt: str,
user_message: str,
) -> tuple[str, float]:
start = time.time()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": user_message}],
)
output = response.content[0].text
latency_ms = (time.time() - start) * 1000
return output, latency_ms
# — LLM-as-Judge ---
JUDGE_PROMPT = """You are an expert evaluator for AI agent responses.
You will be given:
1. The user's input message
2. The agent's response
3. Context about what the correct behavior should be
4. A list of criteria to evaluate
For each criterion, provide:
- A score from 1-5 (1=terrible, 2=poor, 3=adequate, 4=good, 5=excellent)
- A brief reasoning explaining the score
Think step-by-step before scoring. Consider edge cases and subtle issues.
Respond in this exact JSON format:
{
"scores": [
{
"criterion": "<criterion name>",
"score": <1-5>,
"reasoning": "<1-2 sentence explanation>"
}
]
}"""
def judge_response(
client: anthropic.Anthropic,
test_case: TestCase,
agent_output: str,
) -> list[EvalScore]:
context_block = ""
if test_case.context:
context_block = f"\n## Reference Context\n{test_case.context}"
criteria_list = "\n".join(
f"{i+1}. {c}" for i, c in enumerate(test_case.criteria)
)
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=JUDGE_PROMPT,
messages=[{
"role": "user",
"content": f"""## User Input
{test_case.input}
## Agent Response
{agent_output}
## Expected Behavior
{test_case.expected_behavior}
{context_block}
## Criteria to Evaluate
{criteria_list}""",
}],
)
text = message.content[0].text
# Extract JSON from response
import re
json_match = re.search(r"\{[\s\S]*\}", text)
if not json_match:
raise ValueError("Judge did not return valid JSON")
parsed = json.loads(json_match.group())
return [
EvalScore(
criterion=s["criterion"],
score=s["score"],
reasoning=s["reasoning"],
)
for s in parsed["scores"]
]
# — Test Runner ---
def run_evals(
test_cases: list[TestCase],
system_prompt: str,
pass_threshold: float = 3.5,
) -> list[EvalResult]:
client = anthropic.Anthropic()
results = []
for tc in test_cases:
print(f"Running: {tc.id}...")
output, latency_ms = run_agent(client, system_prompt, tc.input)
scores = judge_response(client, tc, output)
avg = sum(s.score for s in scores) / len(scores)
results.append(EvalResult(
test_case=tc,
agent_output=output,
scores=scores,
average_score=round(avg, 2),
passed=avg >= pass_threshold,
latency_ms=round(latency_ms, 1),
))
return results
# — Report ---
def print_report(results: list[EvalResult]) -> None:
print("\n" + "=" * 60)
print("EVALUATION REPORT")
print("=" * 60)
passed = sum(1 for r in results if r.passed)
print(f"\nOverall: {passed}/{len(results)} passed\n")
for r in results:
icon = "PASS" if r.passed else "FAIL"
print(f"[{icon}] {r.test_case.id} — avg: {r.average_score} ({r.latency_ms}ms)")
for s in r.scores:
print(f" {s.criterion}: {s.score}/5 — {s.reasoning}")
print()
# — Test Cases ---
SUPPORT_AGENT_PROMPT = """You are a customer support agent for TechCo.
Our return policy: 30-day returns with original packaging. Restocking
fee of 15% for opened electronics. Gift cards are final sale.
Business hours: Mon-Fri 9am-6pm EST.
Always be helpful, accurate, and empathetic."""
test_cases = [
TestCase(
id="eval-001",
input="I bought a laptop 3 weeks ago and want to return it. I opened the box though.",
context="30-day return window. Opened electronics have 15% restocking fee.",
criteria=["Accuracy", "Completeness", "Empathy"],
expected_behavior=(
"Should confirm the return is within the 30-day window, mention "
"the 15% restocking fee for opened electronics, and be empathetic."
),
),
TestCase(
id="eval-002",
input="Can I return a gift card?",
context="Gift cards are final sale and cannot be returned.",
criteria=["Accuracy", "Tone", "Policy adherence"],
expected_behavior=(
"Should clearly state that gift cards are final sale and cannot "
"be returned. Should be empathetic but firm."
),
),
TestCase(
id="eval-003",
input="Your product broke after 2 days! This is unacceptable!",
context="Defective items within 30 days get full refund, no restocking fee.",
criteria=["Empathy", "Accuracy", "De-escalation", "Resolution"],
expected_behavior=(
"Should acknowledge frustration, apologize, explain that defective "
"items qualify for full refund without restocking fee, and offer "
"clear next steps."
),
),
TestCase(
id="eval-004",
input="What are your hours? Also can I return something I bought 45 days ago?",
context="Hours: Mon-Fri 9-6 EST. Returns within 30 days only.",
criteria=["Accuracy", "Completeness", "Clarity"],
expected_behavior=(
"Should answer BOTH questions. State business hours correctly "
"AND explain that the 45-day return is outside the 30-day window."
),
),
]
# — Run ---
if __name__ == "__main__":
results = run_evals(test_cases, SUPPORT_AGENT_PROMPT)
print_report(results)Install the SDK (pip install anthropic), set ANTHROPIC_API_KEY, and run with python eval_harness.py. Same logic, same output format.
Eval Metrics That Matter
Once you're collecting eval data, you need to decide what to measure. Here are the metrics that actually tell you something useful, organized by what they capture.
Quality Metrics
Accuracy — Is the agent's response factually correct? This is non-negotiable for any production agent. Measure it per-response with LLM-as-judge scoring against known facts or reference documents.
Faithfulness — Does the response stay grounded in the provided context? An agent that's "accurate" but draws on training data instead of your knowledge base is a liability. Faithfulness specifically measures whether claims are supported by the retrieved context, not just whether they happen to be true.
Relevance — Did the agent actually address what the user asked? An accurate, faithful response that doesn't answer the question is still a failure. This catches agents that go off-topic or provide correct but unhelpful information.
Completeness — Did the response cover everything it should? Missing the restocking fee when explaining return policy isn't inaccurate — it's incomplete. These are different failure modes that need different scores.
Operational Metrics
Latency — How long does the agent take to respond? Track both p50 (typical experience) and p95 (worst-case experience). For conversational agents, anything over 3 seconds at p95 feels broken to users.
Cost per evaluation — LLM-as-judge isn't free. Track the total cost of running your eval suite. If it costs $50 to run a full eval, you'll run it less often. Optimize for a suite that costs pennies per run and can execute on every PR.
Token usage — Both for the agent being evaluated and for the judge. Verbose agents cost more and often provide worse experiences. Track input and output tokens separately.
Aggregate Metrics
Pass rate — What percentage of test cases pass your threshold? Track this over time. A declining pass rate is an early warning signal.
Mean score by criterion — Average accuracy score across all test cases, average empathy score, and so on. This shows which dimensions are strong and which need work.
Score variance — High variance on a criterion means your agent is inconsistent. It might ace 8 out of 10 empathy tests but completely fail the other 2. Low average scores are a systematic problem; high variance is a robustness problem.
Example eval report: per-criterion average scores across a test suite. You can immediately see that de-escalation and completeness need work — even though the overall average looks acceptable.
Designing Your Eval Set
Your eval set is the collection of test cases you run against your agent. Think of it as a test suite, but for behavior instead of code. Here's how to build one that actually catches problems.
Coverage Over Volume
Twenty well-designed test cases that cover your key scenarios are more valuable than two hundred random ones. Structure your eval set around the categories of conversations your agent handles:
| Category | Example Test Cases |
|---|---|
| Happy path | Standard questions with clear answers |
| Edge cases | Boundary conditions (day 30 of a 30-day return window) |
| Policy conflicts | User wants something the policy doesn't allow |
| Multi-part questions | Two or three questions in a single message |
| Emotional users | Frustrated, confused, or upset callers |
| Ambiguous inputs | Questions that could mean multiple things |
| Out-of-scope | Questions the agent shouldn't try to answer |
| Adversarial | Attempts to get the agent to break its rules |
The Golden Test Set
Maintain a curated set of 20-50 test cases that you treat as your regression suite. These never change (unless the underlying policy changes). Every prompt edit, every model change, every configuration update gets run against this set before deployment.
When a production bug surfaces, add a test case for it. Your golden set should grow over time, accumulating the hard-won knowledge of every failure you've encountered.
Versioning and Tracking
Version your eval set just like you version your code. When you change a test case, you should know why. When scores change between runs, you need to determine whether it's because the agent changed or because the test changed.
Store eval results with metadata: which prompt version was tested, which model, which eval set version, and the timestamp. This creates the audit trail you need for debugging regressions. Production monitoring complements this by catching issues that your eval set didn't anticipate.
CI Integration: Evals on Every PR
The real power of an eval framework comes from automation. Running evals manually is better than nothing, but running them on every pull request is transformative.
Here's a GitHub Actions workflow that runs your eval suite and blocks merging if scores drop:
name: Agent Evals
on:
pull_request:
paths:
- "prompts/**"
- "src/agent/**"
- "eval/**"
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Install dependencies
run: npm ci
- name: Run eval suite
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: npx tsx eval/run.ts --output eval-results.json
- name: Check thresholds
run: |
node -e "
const r = require('./eval-results.json');
const failed = r.results.filter(t => !t.pass);
if (failed.length > 0) {
console.log('FAILED EVALS:');
failed.forEach(f => console.log(' ' + f.testCase.id + ': ' + f.averageScore));
process.exit(1);
}
const avgScore = r.results.reduce((s,t) => s + t.averageScore, 0) / r.results.length;
if (avgScore < 4.0) {
console.log('Average score ' + avgScore + ' below threshold 4.0');
process.exit(1);
}
console.log('All evals passed. Average: ' + avgScore.toFixed(2));
"
- name: Comment results on PR
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
const passed = results.results.filter(r => r.pass).length;
const total = results.results.length;
const avg = (results.results.reduce((s,r) => s + r.averageScore, 0) / total).toFixed(2);
let body = '## Agent Eval Results\n\n';
body += '| Test | Score | Status |\n|------|-------|--------|\n';
results.results.forEach(r => {
const status = r.pass ? 'Pass' : 'Fail';
body += '| ' + r.testCase.id + ' | ' + r.averageScore + ' | ' + status + ' |\n';
});
body += '\n**Average: ' + avg + '** | **' + passed + '/' + total + ' passed**';
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});This workflow triggers whenever someone changes a prompt file or agent code. It runs the full eval suite, checks that all test cases pass and the average score meets your threshold, and posts a summary comment on the PR. If scores drop, the PR is blocked from merging.
A few practical considerations:
Cost control. Each eval run calls your LLM twice per test case (once for the agent, once for the judge). With 30 test cases, that's 60 LLM calls. At typical API pricing, a full run costs $0.50-$2.00. That's cheap insurance against shipping a broken prompt.
Flakiness. LLM-as-judge scores have natural variance. A test case that scores 3.8 on one run might score 3.4 on the next. Set your pass threshold with a margin — if you need 3.5, set your actual threshold at 3.8 to account for variance. Or run each test case three times and take the median.
Speed. Run test cases in parallel where possible. A 30-case suite running sequentially might take 3 minutes. Running in batches of 10 brings it under a minute.
Regression Testing: Catching What Breaks
Regression testing is where evals deliver the most value. The pattern is straightforward: maintain a baseline of scores, and flag any significant drops.
interface Baseline {
[testCaseId: string]: {
[criterion: string]: number; // baseline score
};
}
function checkRegressions(
results: EvalResult[],
baseline: Baseline,
regressionThreshold: number = 1.0
): { testId: string; criterion: string; drop: number }[] {
const regressions: { testId: string; criterion: string; drop: number }[] = [];
for (const result of results) {
const baselineScores = baseline[result.testCase.id];
if (!baselineScores) continue;
for (const score of result.scores) {
const baseScore = baselineScores[score.criterion];
if (baseScore === undefined) continue;
const drop = baseScore - score.score;
if (drop >= regressionThreshold) {
regressions.push({
testId: result.testCase.id,
criterion: score.criterion,
drop,
});
}
}
}
return regressions;
}
// Usage: after running evals, check for regressions
const regressions = checkRegressions(results, previousBaseline);
if (regressions.length > 0) {
console.error("REGRESSIONS DETECTED:");
regressions.forEach((r) =>
console.error(` ${r.testId} / ${r.criterion}: dropped ${r.drop} points`)
);
process.exit(1);
}Store your baseline scores in a JSON file committed to your repo. After every successful eval run that you're happy with, update the baseline. This creates a ratchet: quality can only go up, never quietly degrade.
Advanced Patterns
Once you've got the basics working, here are patterns that take your eval framework to the next level.
A/B Eval Comparison
When you're testing a new prompt version, run the same test cases against both the old and new prompts and compare scores side by side:
async function comparePrompts(
testCases: TestCase[],
promptA: string,
promptB: string
): Promise<void> {
const resultsA = await runEvals(testCases, promptA);
const resultsB = await runEvals(testCases, promptB);
console.log("\nA/B COMPARISON");
console.log("=" .repeat(50));
console.log("Test ID | Prompt A | Prompt B | Delta");
console.log("-".repeat(50));
let totalA = 0, totalB = 0;
for (let i = 0; i < testCases.length; i++) {
const a = resultsA[i].averageScore;
const b = resultsB[i].averageScore;
const delta = b - a;
const arrow = delta > 0 ? "+" : "";
totalA += a;
totalB += b;
console.log(
`${testCases[i].id.padEnd(18)}| ${a.toFixed(2).padEnd(9)}| ${b.toFixed(2).padEnd(9)}| ${arrow}${delta.toFixed(2)}`
);
}
const avgA = totalA / testCases.length;
const avgB = totalB / testCases.length;
console.log("-".repeat(50));
console.log(
`Average | ${avgA.toFixed(2).padEnd(9)}| ${avgB.toFixed(2).padEnd(9)}| ${(avgB - avgA > 0 ? "+" : "")}${(avgB - avgA).toFixed(2)}`
);
}This pattern is essential for prompt engineering workflows. Instead of guessing whether your prompt change helped, you get a clear comparison table.
Multi-Turn Conversation Evals
Real agents don't just answer one question. They handle entire conversations. Evaluating multi-turn interactions requires a slightly different approach:
interface ConversationTestCase {
id: string;
turns: { role: "user" | "assistant"; content: string }[];
// The last turn is the one we evaluate; earlier turns are context
criteria: string[];
expectedBehavior: string;
}
async function runConversationEval(
client: Anthropic,
systemPrompt: string,
testCase: ConversationTestCase
): Promise<EvalResult> {
// Build message history from all turns except the last user message
const messages = testCase.turns.slice(0, -1).map((t) => ({
role: t.role as "user" | "assistant",
content: t.content,
}));
// Add the final user message
const lastTurn = testCase.turns[testCase.turns.length - 1];
messages.push({ role: "user", content: lastTurn.content });
const start = Date.now();
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: systemPrompt,
messages,
});
const output = response.content[0].type === "text"
? response.content[0].text : "";
// Judge with full conversation context
const scores = await judgeResponse(client, {
id: testCase.id,
input: testCase.turns.map(
(t) => `${t.role}: ${t.content}`
).join("\n"),
criteria: testCase.criteria,
expectedBehavior: testCase.expectedBehavior,
}, output);
const avg = scores.reduce((s, sc) => s + sc.score, 0) / scores.length;
return {
testCase: {
id: testCase.id,
input: lastTurn.content,
criteria: testCase.criteria,
expectedBehavior: testCase.expectedBehavior,
},
agentOutput: output,
scores,
averageScore: Math.round(avg * 100) / 100,
pass: avg >= 3.5,
latencyMs: Date.now() - start,
};
}Multi-turn evals are critical for catching context loss. An agent that handles individual questions well might completely forget details from earlier in the conversation. The analytics from production conversations will tell you where these breakdowns happen most.
Cost-Aware Evaluation
Track the cost of every eval run so you can optimize:
def estimate_cost(results: list[EvalResult], price_per_1k_input: float = 0.003, price_per_1k_output: float = 0.015) -> dict:
"""Estimate total eval cost based on typical token usage."""
total_input_tokens = 0
total_output_tokens = 0
for r in results:
# Rough estimate: ~200 input tokens per agent call,
# ~300 per judge call, ~200 output each
total_input_tokens += 500
total_output_tokens += 400
input_cost = (total_input_tokens / 1000) * price_per_1k_input
output_cost = (total_output_tokens / 1000) * price_per_1k_output
return {
"total_cost": round(input_cost + output_cost, 4),
"cost_per_test_case": round((input_cost + output_cost) / len(results), 4),
"input_tokens": total_input_tokens,
"output_tokens": total_output_tokens,
}The Eval Ecosystem: Frameworks Worth Knowing
You don't have to build everything from scratch. The eval ecosystem has matured significantly. Here's a quick overview of the major frameworks and where they fit:
Braintrust connects evaluation scoring with production tracing, dataset management, and CI-based release enforcement in a single system. It's particularly strong if you want a managed platform that covers the full eval lifecycle.
DeepEval is open-source and developer-friendly, with plug-and-play metrics and pytest integration. Great if you want something you can embed directly in your test suite without a separate platform.
RAGAS focuses specifically on RAG evaluation with research-backed retrieval and generation metrics. If your agent relies heavily on retrieval-augmented generation, RAGAS metrics like faithfulness and answer relevancy are worth adding to your framework.
Langfuse offers open-source observability with built-in evaluation capabilities. Good for teams that want to self-host their eval infrastructure.
Promptfoo focuses on red-teaming and security validation alongside standard evals. Worth looking at if you need adversarial testing.
The framework you built earlier in this article gives you the core patterns. These platforms add managed infrastructure, pre-built metrics, and dashboards on top of the same fundamental ideas.
Best Practices Checklist
- Start with 20-30 well-designed test cases covering happy paths, edge cases, and adversarial inputs
- Use LLM-as-judge with a detailed rubric — not a vague "rate this 1-5" prompt
- Score multiple criteria independently (accuracy, completeness, tone, policy adherence)
- Run evals in CI on every PR that touches prompts or agent code
- Maintain a golden test set that grows with every production bug
- Store baselines and check for regressions — quality should ratchet up, never quietly degrade
- Track cost and latency alongside quality scores
- Run A/B comparisons when testing prompt changes — never guess
- Use a strong model as judge (at least as capable as the model being evaluated)
- Add multi-turn conversation evals, not just single-turn Q&A
- Version your eval set and track changes alongside code changes
- Review judge scores against human judgment quarterly to check calibration
Where to Go From Here
You've now got the building blocks for a real eval framework: test case design, LLM-as-judge scoring, regression detection, CI integration, and a working codebase in both TypeScript and Python. That's enough to catch the vast majority of issues before they reach production.
The next steps depend on your situation. If you're just starting, get the basic eval harness running and write test cases for your ten most common customer interactions. If you're already running evals, focus on CI integration and regression baselines. If you've got all of that, explore multi-turn evaluation and A/B comparison for prompt optimization.
The teams shipping the most reliable agents aren't the ones with the fanciest models or the most sophisticated architectures. They're the ones who test systematically, measure specifically, and never ship on vibes alone.
If you'd rather not build your eval infrastructure from scratch, Chanl's scorecard and scenario testing systems provide production-ready evaluation workflows — but the principles in this guide apply regardless of the tools you use.
Start measuring. Stop guessing.
Chanl Team
AI Agent Testing Platform
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
Get AI Agent Insights
Subscribe to our newsletter for weekly tips and best practices.



