Your agent handles the first five turns perfectly. The customer asks about a billing issue, the agent pulls up the account, explains the charge, offers a credit. Textbook execution. Your eval suite gives it a 94% score.
By turn twelve, the same agent in the same conversation has forgotten the customer's name. By turn eighteen, it contradicts the credit it offered six turns earlier. By turn twenty-three, it confidently recommends a plan the customer already told it they cancelled.
Nothing crashed. No error logs. No alerts fired. The agent just quietly got worse.
This is agent drift. Not a catastrophic failure but a slow erosion of quality that hides behind passing test scores and clean dashboards. It's one of the hardest problems in production AI because the agents that drift the most are often the ones that look the best on standard evaluations.
Table of contents
- What agent drift actually is
- Three types of drift
- Why don't standard evals catch it?
- The math of compounding unreliability
- Memory fails when you need it most
- How do you actually measure drift?
- Anchoring agents against drift
- What to do Monday morning
What agent drift actually is
Agent drift is your AI agent getting progressively worse during a single conversation, not over weeks, but within minutes. It differs from model drift (where a model's training data becomes stale over weeks or months). Agent drift happens live, within a single session, while the customer is still on the line.
Think of it like a game of telephone. The agent starts with clear instructions, sharp context, and well-defined guardrails. But every turn adds more tokens to the context window. Attention shifts. Earlier instructions compete with recent exchanges for the model's limited focus. The agent doesn't suddenly break. It gradually bends.
A January 2026 paper from researchers studying multi-agent systems put formal structure around this problem for the first time. They identified three distinct types of drift, proposed a 12-dimension measurement framework, and showed that drift isn't random. It follows predictable patterns that get worse under specific conditions.
Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Introduces three types of progressive agent degradation and the Agent Stability Index (ASI) for measurement across 12 behavioral dimensions. Read the paper
Before that paper, "drift" was a vague complaint. Now it has taxonomy.
Three types of drift
Each type of drift attacks a different layer of agent behavior. Understanding which one you're seeing determines how you fix it.
Semantic drift is the most common. The agent gradually deviates from its original intent or instructions.
At turn 3, it correctly follows the refund policy: "I can process a refund for orders within 30 days." By turn 15, after a long back-and-forth about the customer's frustration, the agent's responses have drifted toward appeasement: "Given the circumstances, I think we can make an exception."
No one told it to do that. The original instructions haven't changed. But the weight of recent conversational context has slowly overridden the earlier system prompt.
Behavioral drift is subtler. The agent develops unintended strategies that weren't part of its design. A support agent trained to be helpful starts giving overly verbose answers to simple questions because longer responses correlated with higher satisfaction scores in its training data. Over the course of a conversation, this tendency amplifies. What starts as thoroughness at turn 5 becomes rambling by turn 20. The behavior emerges from the interaction between the model's learned patterns and the growing context, not from any explicit instruction.
Coordination drift applies when multiple agents or components work together. In a system where a router agent hands off to specialist agents, the handoff quality degrades over time. The router's understanding of which specialist to call becomes less precise. The specialist receives increasingly garbled context summaries. The agents' shared understanding of the task erodes, even when each individual agent still performs reasonably on its own.
| Drift Type | What Degrades | Example | Detection Signal |
|---|---|---|---|
| Semantic | Adherence to original instructions | Refund policy loosens after emotional escalation | Scorecard policy criteria drops after turn 10 |
| Behavioral | Response strategy and style | Concise answers become verbose rambling | Response length increases 3x by turn 20 |
| Coordination | Multi-agent handoff quality | Router sends customer to wrong specialist | Handoff accuracy drops in longer sessions |
The research measured these types independently and found they don't always correlate. An agent can have stable semantics (sticking to its instructions) while exhibiting significant behavioral drift (developing new response patterns). Testing for one type and assuming you've covered the others is a mistake.
Why don't standard evals catch it?
Short answer: they test at the wrong conversation length. Standard evaluations test agents at a single point in time, usually with 3-5 turn conversations, and that approach has a blind spot the size of a highway.
The agent gets a question, responds, maybe handles a follow-up, and the evaluator scores the result. This is like testing a car's engine at idle and declaring it road-ready. The interesting failures happen at speed, under load, over time.
Here's why point-in-time testing fails for drift detection.
First, drift is definitionally longitudinal. It doesn't exist in a single exchange. An agent that drifts badly at turn 20 might be flawless at turn 5. If your eval only reaches turn 5, you'll never see the problem. And most test suites don't go further because longer scenarios cost more to run, take more time to design, and produce harder-to-interpret results.
Second, drift is sensitive to conversation path. The same agent can drift in completely different ways depending on what the customer says. A straightforward conversation might show minimal drift. A conversation with emotional escalation, topic switching, or ambiguous requests triggers drift much faster. Your eval needs to cover these divergent paths, not just the happy one.
Third, drift compounds with tool use. Every tool call introduces a potential drift vector. The agent calls a tool, gets a result, interprets it, and uses that interpretation in the next response. If the interpretation is slightly off, the error propagates. By the fifth tool call, small misinterpretations have stacked into meaningful behavioral changes.
Anthropic's evaluation research makes this quantitative with two complementary metrics. Pass@k measures whether at least one of k attempts succeeds. It answers: "Can the agent do this at all?" Pass^k measures whether all k attempts succeed. It answers: "Can the agent do this reliably?"
An agent with 90% pass@1 looks great. Run it five times and you get 59% pass^5. Run it ten times: 35%. The single-run eval creates an illusion of reliability that evaporates under repetition. And repetition is exactly what production looks like.
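The arithmetic behind those numbers is simple enough to sanity-check in a few lines. A minimal sketch, under the simplifying assumption that attempts are independent and each succeeds with probability p (the pass@1 rate) — real runs share prompts and state, so production numbers can be worse:

```typescript
// pass^k: probability that ALL k independent attempts succeed,
// given per-attempt success probability p (the pass@1 rate).
function passAllK(p: number, k: number): number {
  return Math.pow(p, k);
}

// A 90% pass@1 agent, run repeatedly:
for (const k of [1, 5, 10]) {
  console.log(`pass^${k} at p=0.9: ${(passAllK(0.9, k) * 100).toFixed(0)}%`);
}
// pass^1 = 90%, pass^5 ≈ 59%, pass^10 ≈ 35%
```

The independence assumption makes this a lower bound on how surprising production can be: correlated failures (same tricky customer phrasing every time) cluster rather than average out.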
Anthropic: Demystifying Evals Introduces the pass@k vs pass^k distinction for measuring agent capability versus consistency. Read the guide
Anthropic's practical advice: start with 20-50 real production failures, not synthetic benchmarks. Those actual failure transcripts tell you where drift matters for your specific use case.
The math of compounding unreliability
That eval gap between capability and consistency plays out in the math. Even without drift, reliability in compound AI systems is worse than most teams realize. Drift makes it dramatically worse.
Princeton researchers Sayash Kapoor and Arvind Narayanan studied this gap across multiple domains and found a consistent pattern: reliability improves at roughly half the rate of accuracy. In customer service applications specifically, reliability improvements were one-seventh the rate of accuracy improvements.
That ratio means the model upgrades your team celebrates (GPT-5 is 15% more accurate!) translate to much smaller reliability gains in practice. A 15% accuracy improvement might yield only a 2% reliability improvement when the agent is handling real conversations with tool calls, memory lookups, and multi-turn context.
The compounding math is brutal. If your agent has three components in its pipeline, say an LLM call, a tool execution, and a memory retrieval, each at 90%, 85%, and 97% individual reliability, the system reliability isn't the average (90.7%). It's the product: 74%.
And that's the static number. With drift, each component's reliability decreases over conversation length. If the LLM call drops from 90% to 80% by turn 20, the tool execution drops from 85% to 75%, and memory retrieval drops from 97% to 90%, your system reliability at turn 20 is 54%. That's a coin flip. For a system that scored 74% on your eval.
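That product is worth computing explicitly. A minimal sketch using the figures above — the turn-20 reliability drops are the article's illustrative numbers, not measured decay curves:

```typescript
// Serial-pipeline reliability is the product of component reliabilities,
// not the average: one failed stage fails the whole exchange.
function systemReliability(components: number[]): number {
  return components.reduce((acc, r) => acc * r, 1);
}

// LLM call, tool execution, memory retrieval
const atTurn5 = systemReliability([0.90, 0.85, 0.97]);   // ≈ 0.74
const atTurn20 = systemReliability([0.80, 0.75, 0.90]);  // = 0.54
console.log(`turn 5: ${(atTurn5 * 100).toFixed(0)}%, turn 20: ${(atTurn20 * 100).toFixed(0)}%`);
```

Adding a fourth component at 95% reliability drops the turn-5 figure to about 70% before any drift at all — which is why longer pipelines need disproportionately more reliable parts.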
A February 2026 paper from a team studying agent reliability formalized this into four dimensions: Consistency (same input, same output), Robustness (correct behavior under perturbation), Predictability (behavior within expected bounds), and Safety (avoidance of harmful outputs). Their finding: reliability doesn't improve uniformly with capability. A more capable model can actually be less reliable in specific dimensions.
Towards a Science of AI Agent Reliability Defines four reliability dimensions and shows that capability improvements don't uniformly improve reliability, with significant brittleness to prompt paraphrasing. Read the paper
One example from the research: Claude Opus 4.5, one of the most capable models available, showed only 73% consistency when asked paraphrased versions of the same question. Not different questions. The same question, worded differently. If the model can't give consistent answers to rephrased versions of identical queries, what happens when a customer restates their problem in turn 15 after first explaining it in turn 3?
The answer: the agent might give a completely different response. Not because anything changed. Because the paraphrase, combined with the intervening context, is enough to trigger a different behavioral path.
Memory fails when you need it most
You might think persistent memory would save you. If the agent can look up what happened at turn 3, who cares if the context window is noisy? The problem: memory itself degrades under complexity.
A December 2025 study that examined agent behavior across varying complexity levels found a pattern that should concern anyone running agents in production. Memory failures scaled with task complexity: 0.67 failures per task in simple scenarios, 2.33 in moderate scenarios, and 3.67 in complex ones. That's a 5.5x increase in memory failures as complexity grows.
But the most striking finding was the disconnect between task completion and memory quality. Agents achieved reasonable task completion rates even in complex scenarios, while their memory retrieval recall dropped to 13.1%. The agents were completing tasks while forgetting most of what happened during those tasks.
Beyond Task Completion: Evaluating Agent Memory in Complex Scenarios Reveals that memory failures scale dramatically with complexity, reaching only 13.1% recall in complex scenarios despite reasonable task completion rates. Read the paper
This creates a particularly dangerous form of drift. The agent looks like it's working. It's completing tasks, giving responses, following the conversation. But its internal model of the conversation is increasingly disconnected from reality. At turn 5, the agent has a 90%+ accurate picture of the conversation. By turn 20 in a complex interaction, it's operating on a 13% accurate picture and confabulating the rest.
The customer doesn't know this. The agent doesn't know this. Your monitoring dashboard doesn't know this, because the agent is still generating confident, fluent responses. It's just generating them based on a hallucinated version of the conversation history.
This is why drift detection can't rely on output quality alone. An agent can produce grammatically perfect, tonally appropriate, completely wrong responses. The output looks fine. The underlying state is degraded.
How do you actually measure drift?
Drift is measurable, but most teams measure the wrong thing.
The Agent Stability Index (ASI) framework introduced in the January 2026 paper provides twelve dimensions for tracking drift. These include semantic consistency (is the agent saying the same things about the same topics?), behavioral predictability (is the agent's response strategy stable?), and coordination alignment (in multi-agent systems, are the agents still working toward the same goal?).
For production teams, the practical approach is simpler than implementing a 12-dimension index. You need three things: the same scenario run at different conversation depths, a multi-criteria scorecard that measures consistency explicitly, and a comparison framework that highlights where scores diverge.
Running the same scenario at turn 5 versus turn 20 is the single most revealing test you can add to your eval suite. Take a scenario your agent handles well. Run it as a standalone 5-turn conversation. Then embed the exact same scenario at turn 20 of a longer conversation where the first 15 turns cover different topics.
Compare the results across multiple dimensions. Did the agent give the same factual answer? Did it maintain the same tone? Did it follow the same policy constraints? Did it use the same tools in the same order? Any significant divergence between the turn-5 and turn-20 versions is drift you need to investigate.
Multi-criteria scorecards are essential here because drift rarely affects all dimensions equally. An agent might maintain perfect factual accuracy while its tone drifts from professional to casual. Or it might preserve tone while its policy adherence weakens. A single "quality score" hides these dimension-specific degradations. You need separate scores for each criterion, tracked across conversation depth.
Here's what a depth-comparison test looks like with the Chanl SDK:
```typescript
import Chanl from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Run the same scenario at different conversation depths
const depths = [5, 15, 30];
const results = [];

for (const depth of depths) {
  // Execute the scenario with depth-controlling parameters
  const { data } = await chanl.scenarios.run('scenario_billing_dispute', {
    parameters: {
      conversationDepth: depth,
      priorTopics: depth > 5 ? ['shipping-inquiry', 'product-return', 'account-update'] : [],
    },
  });

  // Wait for completion, then evaluate with a consistency-focused scorecard
  const callId = data.execution.callDetails?.callId;
  if (callId) {
    await chanl.scorecards.evaluate(callId, { scorecardId: 'scorecard_consistency' });
    const { data: scoreData } = await chanl.scorecards.getResultsByCall(callId);
    results.push({ depth, criteriaResults: scoreData.results[0]?.criteriaResults || [] });
  }
}

// Compare: where do scores diverge between the shallowest and deepest runs?
const atFive = results.find(r => r.depth === 5);
const atThirty = results.find(r => r.depth === 30);
for (const key of ['factual_accuracy', 'tone_stability', 'policy_adherence', 'fact_retention']) {
  const early = atFive?.criteriaResults.find(c => c.criteriaKey === key);
  const late = atThirty?.criteriaResults.find(c => c.criteriaKey === key);
  if (early && late && typeof early.result === 'number' && typeof late.result === 'number') {
    const drift = early.result - late.result;
    if (drift > 0.15) {
      console.warn(`${key}: ${(drift * 100).toFixed(0)}% drift between turn 5 and turn 30`);
    }
  }
}
```

The 15% threshold is a starting point. Some dimensions tolerate more variation than others. Factual accuracy should show near-zero drift. Tone might naturally vary by 5-10% without causing problems. You'll calibrate these thresholds based on your specific quality requirements.
Anchoring agents against drift
Detecting drift is step one. Preventing it requires changing how agents maintain state across long conversations.
Three strategies work, each addressing a different root cause.
Persistent memory as a drift anchor. The core insight: information stored in external memory doesn't decay with conversation length. A fact retrieved from persistent memory at turn 30 is just as accurate as it was at turn 1. In-context information, by contrast, degrades as the context window fills with more recent tokens.
This means treating memory not just as a convenience feature but as a reliability mechanism. Key facts from the conversation, the customer's stated problem, any commitments the agent has made, relevant policy constraints, should be written to persistent memory early and re-injected into context at regular intervals. The agent's working context degrades. The external memory store doesn't.
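A minimal sketch of memory-as-anchor, assuming a simple key-value fact store rendered as a pinned block at the top of each prompt — all names here are hypothetical, not a specific SDK's API:

```typescript
// Key facts are written to an external store as they're established and
// re-injected every turn, so they don't compete with recent tokens for
// attention the way in-context history does.
type Fact = { key: string; value: string; turn: number };

class ConversationMemory {
  private facts = new Map<string, Fact>();

  record(key: string, value: string, turn: number): void {
    this.facts.set(key, { key, value, turn });
  }

  // Rendered as a pinned block at the top of each prompt
  asContextBlock(): string {
    const lines = Array.from(this.facts.values()).map(
      f => `- ${f.key}: ${f.value} (established turn ${f.turn})`
    );
    return lines.length ? `Known facts:\n${lines.join('\n')}` : '';
  }
}

const memory = new ConversationMemory();
memory.record('customer_name', 'Dana', 2);
memory.record('credit_offered', '$15 on next invoice', 6);
memory.record('cancelled_plan', 'Premium tier', 9);
console.log(memory.asContextBlock());
```

The point of the pinned block is that the turn-30 prompt carries the turn-2 fact verbatim, rather than relying on the model to attend back thirty turns of raw history.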
Instruction re-injection. System prompts lose influence as conversations grow. The model's attention increasingly focuses on recent turns, and the system prompt set 20 turns ago has diminishing weight. A practical fix: periodically re-inject critical instructions into the conversation context. Not the full system prompt, just the constraints that matter most. Policy boundaries, tone guidelines, factual commitments already made.
Some teams do this every 10 turns. Others trigger re-injection when they detect early signs of drift, like a response that scores below threshold on a mid-conversation evaluation. The right cadence depends on how quickly your specific agent drifts, which you'll know from the depth-comparison testing described above.
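Both triggers can live in one small helper. A sketch under stated assumptions — the message shape and the `shouldReinject` logic are illustrative, not a particular framework's API:

```typescript
// Cadence-based instruction re-injection: every N turns, or when a
// mid-conversation eval score dips, append a condensed reminder of the
// constraints that matter most (not the full system prompt).
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

const CRITICAL_CONSTRAINTS =
  'Reminder: refunds only within 30 days; keep a professional tone; ' +
  'honor any credit already offered in this conversation.';

function shouldReinject(turn: number, lastScore: number | null): boolean {
  // Fixed cadence every 10 turns, or early if quality dips below threshold
  return turn % 10 === 0 || (lastScore !== null && lastScore < 0.8);
}

function withReinjection(history: Message[], turn: number, lastScore: number | null): Message[] {
  if (!shouldReinject(turn, lastScore)) return history;
  return [...history, { role: 'system', content: CRITICAL_CONSTRAINTS }];
}

const history: Message[] = [{ role: 'user', content: 'About that refund...' }];
console.log(withReinjection(history, 20, null).length); // 2: reminder appended at turn 20
```

Keeping the reminder short matters: re-injecting the entire system prompt every ten turns just accelerates the context growth that causes drift in the first place.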
Context compression and summarization. Instead of letting the context window fill with raw conversation history, periodically compress earlier turns into summaries. The agent gets a summary of turns 1-15 plus the full text of turns 16-20, rather than the full text of all 20 turns. This preserves the essential information while reducing the noise that causes drift.
The risk: summaries lose nuance. The specific word a customer used, the emotional tone of a particular exchange, edge-case details that matter. Good compression preserves the facts and commitments while dropping the filler. Bad compression drops exactly the details the agent needs at turn 25.
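The mechanics are a sliding window plus a summary slot. A minimal sketch, where `summarize` is a placeholder for an LLM summarization call (in practice, prompted to keep facts and commitments and drop filler):

```typescript
// Windowed compression: turns older than the window collapse into a single
// summary message; the most recent turns stay verbatim.
type Turn = { role: 'user' | 'assistant'; content: string };
type ContextItem = Turn | { role: 'system'; content: string };

function summarize(turns: Turn[]): string {
  // Placeholder for an LLM call; returns a stand-in summary here.
  return `[Summary of ${turns.length} earlier turns]`;
}

function compressContext(turns: Turn[], keepRecent: number): ContextItem[] {
  if (turns.length <= keepRecent) return turns;
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(-keepRecent);
  return [{ role: 'system', content: summarize(older) }, ...recent];
}

const turns: Turn[] = Array.from({ length: 20 }, (_, i) => ({
  role: i % 2 === 0 ? 'user' : 'assistant',
  content: `turn ${i + 1}`,
}));
console.log(compressContext(turns, 5).length); // 6: one summary + 5 verbatim turns
```

A common refinement is to pair this with the memory anchor above: facts promoted to persistent memory survive compression even when the summary that mentioned them gets re-summarized away.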
What to do Monday morning
Agent drift is real, it's measurable, and you can start testing for it this week. Here's the priority order.
First, run a depth test. Take your three most common customer scenarios. Run each one as a standalone 5-turn conversation and score it. Then embed each one at turn 20 of a longer conversation and score it again. Compare the results. If you see more than a 15% drop on any scorecard dimension, you have a drift problem worth investigating.
Second, add consistency to your scorecards. If your current evaluation only measures accuracy and helpfulness, you're missing drift entirely. Add "fact retention" (does the agent remember what was established earlier?), "policy consistency" (does the agent still follow the same rules?), and "tone stability" (does the agent maintain appropriate register?). These dimensions are where drift shows up first.
Third, test with repetition, not just variation. Run every critical scenario at least five times. An agent that passes once and fails once out of five runs has a 20% production failure rate. That's pass^5 thinking instead of pass@1 thinking. It changes how you interpret your eval results.
Fourth, monitor conversation length in production. Track where your longest conversations happen and sample-score them specifically. If your average conversation is 8 turns but 5% of conversations reach 25 turns, those long conversations are where drift is hiding. Score them separately from your overall metrics.
Fifth, treat memory as infrastructure, not a feature. If your agent handles conversations longer than 10 turns, persistent memory isn't optional. It's the difference between an agent that degrades predictably and one that maintains quality. Write critical facts to memory early. Re-inject them when context degrades.
The research is clear: agents drift, and they drift in ways that standard evaluations don't catch. But drift is a solvable problem once you know to look for it. The teams that test at depth, measure over time, and anchor their agents with external memory will ship agents that work as well on turn 30 as they do on turn 3.
The teams that don't will keep wondering why their 94% eval score doesn't match their 2.1-star customer rating.