Open your browser's developer tools during an AI agent conversation. Watch the network tab. Every request to the model carries a payload that's much larger than what the user typed. Most of the tokens aren't the conversation. They're infrastructure.
System prompt. Tool definitions. MCP schemas. Injected memories from previous sessions. Knowledge base chunks pulled in via RAG. Conversation history. All of it compressed into a single context window, all of it competing for the same fixed budget of tokens.
The user types "What's my order status?" That's 6 tokens. The full request to the model? Easily 15,000. Maybe 40,000 if your agent has a lot of tools. And the model has to reason over all of it to produce a response.
This is the context window crisis. Not that context windows are too small (they've grown from 4k to 200k tokens in two years), but that everything we're putting into them has grown faster. Every feature you add to an agent, whether it's a new tool, a memory system, or a knowledge base, costs tokens. And tokens aren't just money. They're attention. They're accuracy. They're the difference between an agent that picks the right tool and one that confidently picks the wrong one.
Martin Fowler calls this discipline "context engineering": the skill of designing and orchestrating the full information environment that a model operates within. It's what separates developers who get 10x value from AI agents from those who get 2x. And it starts with understanding where your tokens actually go.
| Layer | Tokens per Request | Notes |
|---|---|---|
| System prompt | 500 - 3,000 | Persona, rules, formatting instructions |
| Tool definitions | 2,000 - 55,000+ | 150-400 tokens per tool; scales with tool count |
| Memory injection | 500 - 2,000 | Customer facts from previous conversations |
| RAG / knowledge base | 1,000 - 4,000 | Retrieved document chunks |
| Conversation history | 500 - 100,000+ | Grows with every turn |
| Model response | 500 - 4,000 | The actual output |
| Total overhead (before user message) | 4,500 - 64,000+ | Often 10-50% of the window |
This article breaks down each layer, shows you the math, and walks through the compression strategies that keep your agent effective as conversations grow. We'll look at how Claude Code manages context when it has access to hundreds of tools, how production chat services handle history compression, and how memory injection stays selective instead of dumping everything into the prompt.
Prerequisites & Setup
This is a conceptual deep-dive. You don't need to install anything to follow along, but if you want to count tokens yourself:
```typescript
// npm install tiktoken
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
console.log(enc.encode('What is my order status?').length);
// => 6 tokens
```

Two prerequisite articles will help if you're new to agent infrastructure: Build Your Own AI Agent Memory System covers the three memory layers (session, persistent, semantic), and MCP Explained covers how tool schemas work at the protocol level.
What Is a Context Window, Really?
A context window is the total number of tokens a language model can process in a single request. It includes everything: inputs, instructions, history, and the model's own response. Think of it as working memory. Not storage. Not long-term recall. Working memory, with a hard ceiling.
A common misconception is that the context window is "how much the model can read." It's more accurate to say it's how much the model can hold in its head while producing a single response. Everything must fit: the system prompt that defines behavior, the tool definitions that describe capabilities, the conversation so far, any injected context, and the response the model is generating.
Tokens Are Not Words
Tokens are the fundamental unit of language models, and they don't map cleanly to words. In English, one token averages about 0.75 words (or about 4 characters). The word "tokenization" is two tokens. "AI" is one. A JSON schema for a tool parameter might look short to a human but consume hundreds of tokens because of structural characters like braces, colons, and quotes.
This matters because tool definitions, which are heavily structured JSON, are more token-expensive per "word" of useful information than natural language. A tool description that's three sentences long might consume 80 tokens for the text and another 200 for the parameter schema. The schema is the expensive part.
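To see why the schema dominates, here's a rough back-of-envelope estimator using the ~4-characters-per-token heuristic from above. The tool definition is hypothetical, and an exact count requires the model's real tokenizer (e.g. tiktoken); this is only for quick budgeting:

```typescript
// Rough token estimate using the ~4-characters-per-token heuristic.
// Only for budgeting; real counts require the model's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A hypothetical tool definition in the function-schema style.
const getOrderStatus = {
  name: "get_order_status",
  description: "Look up the current status of a customer order by ID.",
  parameters: {
    type: "object",
    properties: {
      order_id: { type: "string", description: "The order identifier." },
      include_history: {
        type: "boolean",
        description: "Include shipment history events.",
      },
    },
    required: ["order_id"],
  },
};

const textTokens = estimateTokens(getOrderStatus.description);
const schemaTokens = estimateTokens(JSON.stringify(getOrderStatus.parameters));
console.log({ textTokens, schemaTokens }); // the schema dwarfs the prose
```

Run this against your own tool definitions and the pattern holds: the braces, quotes, and property names of the JSON Schema cost several times more than the human-readable description.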
The budget table above shows a healthy allocation. The concerning version is when tool definitions alone consume 30-50% of the window, which happens more often than you'd think.
Where Do Your Tokens Actually Go?
Tokens are consumed by five infrastructure layers before the user's message even enters the context. The system prompt is the cheapest layer; tool definitions are usually the most expensive. Understanding the cost of each layer is the foundation of context engineering.
Let's walk through each layer in the order the model sees them.
Layer 1: The System Prompt
The system prompt defines who the agent is, how it should behave, what format to use for responses, and what constraints to follow. In production agents, this is rarely a single paragraph. It often includes persona instructions, safety rules, output formatting requirements, language preferences, and domain-specific guidelines.
A minimal system prompt runs 200-500 tokens. A production system prompt for a customer service agent with detailed escalation rules, formatting requirements, and persona guidelines typically runs 1,500-3,000 tokens. Some enterprise deployments push past 5,000.
The system prompt is relatively cheap compared to other layers, but it's also the layer developers over-engineer most. Every rule you add is tokens you can't use for conversation. A system prompt that's 5,000 tokens of exhaustive edge-case rules is often less effective than a 1,500-token prompt that covers the core behavior and lets the model's training handle the rest.
Layer 2: Tool Definitions (The Hidden Giant)
This is where context budgets break. Each tool definition includes a name, a description (which the model reads to decide when to use the tool), and a JSON Schema for input parameters. A simple tool with two string parameters costs 150-200 tokens. A complex tool with nested objects, enums, and detailed descriptions costs 300-400 tokens.
The math gets alarming fast:
| Tool Count | Estimated Tokens | % of 128k Window | % of 200k Window |
|---|---|---|---|
| 5 tools | 1,000 | 0.8% | 0.5% |
| 15 tools | 3,500 | 2.7% | 1.8% |
| 30 tools | 8,000 | 6.3% | 4.0% |
| 50 tools | 15,000 | 11.7% | 7.5% |
| 100 tools | 35,000 | 27.3% | 17.5% |
| 7 MCP servers | 67,000 | 52.3% | 33.5% |
That last row isn't hypothetical. Anthropic's measurements found that a typical multi-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk, and a couple of internal tools) consumes approximately 67,000 tokens in tool definitions before the agent does any actual work. That's more than half of a 128k context window, gone before the first message.
This is why Perplexity's CTO Denis Yarats announced at the Ask 2026 conference that Perplexity was moving away from MCP internally, citing context window overhead as a core issue. When tool definitions eat more than half your budget, you've inverted the purpose of the agent. It's spending more attention on what it could do than on what the user is asking it to do.
Berkeley Function-Calling Leaderboard benchmarks confirm what practitioners observe: tool selection accuracy drops 15-25% when moving from a 10-tool agent to a 50-tool agent with otherwise identical prompts. The failure mode is subtle. The agent doesn't refuse to use tools. It confidently picks the wrong one. A customer asks about their order status and the agent calls update_shipping_address instead of get_order_status because both descriptions mention "order" and "address."
Layer 3: Memory Injection
If your agent has a memory system, relevant facts from previous conversations are retrieved and injected into the context at the start of each request. A customer who called last week about a billing issue, prefers email, and has an enterprise account generates memory entries. The agent searches stored memories semantically, retrieves the top matches, and prepends them to the context.
The cost is moderate: typically 500-2,000 tokens for 5-10 memory entries. But it's a layer that grows over time. A customer with 50+ memory entries and loose retrieval settings (low score threshold, high result limit) will inject noise. Memories about a resolved shipping issue from six months ago waste tokens when the customer is calling about a new product.
The key insight: memory injection should be surgical. You want the 5-10 facts most relevant to this conversation, not a complete biography.
Layer 4: RAG (Knowledge Base Retrieval)
When an agent needs domain knowledge that isn't in its training data (your product catalog, internal policies, technical documentation), a RAG pipeline retrieves relevant document chunks and injects them. Each chunk is typically 200-800 tokens, and a standard retrieval returns 3-5 chunks: 1,000-4,000 tokens per request.
The catch: retrieval isn't perfect. If your embeddings aren't well-tuned or your chunks are too large, you inject partially relevant content that wastes tokens and confuses the model. Two chunks about slightly different product versions, both partially matching the query, burn 1,200 tokens and give the model contradictory information.
Layer 5: Conversation History
Conversation history is the one layer that grows during the session. Every user message and assistant response adds to the history. A single exchange (user turn + assistant response) typically costs 200-1,000 tokens depending on response length.
After 20 exchanges, history can consume 8,000-20,000 tokens. After 50 exchanges, 30,000-60,000 tokens. In a long customer service conversation with detailed back-and-forth about a complex issue, history alone can push past 100,000 tokens.
This is where the math gets critical. Your static overhead (system prompt + tools + memory + RAG) is fixed per request. History grows linearly. At some point, the sum exceeds what the model can hold. And unlike the other layers, you can't just remove history without losing the thread of the conversation.
What Is the Lost-in-the-Middle Problem?
Language models pay disproportionate attention to information at the beginning and end of the context window. Content placed in the middle 40-60% of a long context sees measurably lower recall. This means that even when you have budget remaining, the position of information within the context matters as much as whether it fits.
Research from Stanford and other groups on long-context language models consistently shows this U-shaped attention pattern. The model reads the beginning carefully (system prompt, tool definitions), skims the middle (older conversation turns), and reads the end carefully (recent messages, the current query).
This has practical implications for every layer:
System prompt: Placed first, gets maximum attention. Good.
Tool definitions: Placed right after the system prompt, gets strong attention. But in a 50-tool setup, the tools in positions 20-40 get less attention than the first 10 and last 10. This partly explains why tool selection degrades with tool count. It's not just quantity. It's position.
Memory and RAG: Often placed in the middle of the context, between tools and conversation history. This is the worst position for recall. If a critical memory fact is injected between tool definitions and conversation history, the model may effectively ignore it.
Conversation history: The oldest turns are in the middle. The newest turns are at the end, getting strong attention. This is why models seem to "forget" things mentioned 20 turns ago even when the history is still in context.
The takeaway: fitting everything into the window is necessary but not sufficient. You also need to position high-priority information at the beginning or end of the context, and compress or summarize the middle.
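One way to act on this is to fix the assembly order of the layers so high-value content sits at the window's edges. A minimal sketch, with illustrative layer names rather than any specific framework's API:

```typescript
interface ContextParts {
  systemPrompt: string;   // behavior rules
  toolDefs: string;       // serialized tool definitions
  historySummary: string; // compressed older turns
  memories: string;       // injected customer facts
  recentTurns: string;    // verbatim recent turns
  userMessage: string;    // the current query
}

// Order the layers for U-shaped attention: instructions and tools at
// the start, already-compressed material in the weak middle, and the
// live conversation at the end where recall is strongest.
function assembleContext(p: ContextParts): string {
  return [
    p.systemPrompt,   // beginning: strong attention
    p.toolDefs,
    p.historySummary, // middle: weakest attention, lowest-fidelity content
    p.memories,       // near the end so key facts stay salient
    p.recentTurns,    // end: strong attention
    p.userMessage,
  ].join("\n\n");
}
```

The design choice here is that the summary, which is already lossy, is the only layer relegated to the dead zone, while memories sit close to the recent turns they're meant to inform.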
How Does Claude Code Manage 200+ Tools?
Claude Code has access to hundreds of tools through MCP servers: file operations, search, web browsing, code execution, Git operations, and dozens of project-specific tools. Loading all of them into every request would consume most of the context window. Instead, Anthropic built Tool Search: a meta-tool that dynamically discovers relevant tools per request.
The approach is elegant. Instead of injecting all 200+ tool definitions (which would cost 40,000-80,000 tokens), Claude Code injects a single Tool Search tool (~200 tokens) that can query for relevant tools on demand. When the model needs a capability, it first calls Tool Search with a description of what it needs. Tool Search returns only the matching tools, which are then available for the current request.
Anthropic reports that with Tool Search enabled, 191,300 tokens of the window remain available for actual work, versus 122,800 with the traditional "load everything" approach. That's an 85% reduction in tool definition overhead while maintaining access to the full tool library.
The recommendation is clear: when your agent needs access to more than 30 tools, stop loading them all. Use a dynamic discovery mechanism. This could be Anthropic's Tool Search, a custom tool-routing layer, or (the approach we'll discuss later) scoping tools into named subsets that load based on the agent's role.
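A toy version of the discovery idea, assuming a simple keyword match where a production system would use embedding similarity. The registry and tool names are illustrative:

```typescript
interface ToolDef {
  name: string;
  description: string;
  schemaTokens: number;
}

// Instead of injecting every definition, keep tools in a registry and
// return only the ones matching the model's query. A keyword-overlap
// stand-in for what would normally be an embedding search.
class ToolRegistry {
  constructor(private tools: ToolDef[]) {}

  search(query: string, limit = 5): ToolDef[] {
    const terms = query.toLowerCase().split(/\s+/);
    return this.tools
      .map(t => ({
        tool: t,
        // score = number of query terms appearing in name or description
        score: terms.filter(term =>
          (t.name + " " + t.description).toLowerCase().includes(term)
        ).length,
      }))
      .filter(r => r.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map(r => r.tool);
  }
}

const registry = new ToolRegistry([
  { name: "get_order_status", description: "Look up the status of an order.", schemaTokens: 180 },
  { name: "update_shipping_address", description: "Change the shipping address on an order.", schemaTokens: 240 },
  { name: "create_invoice", description: "Create a billing invoice.", schemaTokens: 200 },
]);
console.log(registry.search("order status").map(t => t.name));
```

Only the matched definitions enter the context; the rest of the catalog costs nothing until it's asked for.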
“A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) can consume ~55k tokens in definitions before Claude does any actual work.”
What Compression Strategies Actually Work?
The four most effective compression strategies are: conversation history summarization, sliding window with summary prefix, selective tool loading via toolsets, and smart retrieval thresholds for memory and RAG. Each targets a different layer and they stack.
Strategy 1: History Summarization
When conversation history exceeds a threshold (typically 15-20 turns), older messages are summarized into a compact representation. A 20-turn conversation that originally consumed 15,000 tokens might compress to a 2,000-token summary that preserves the key facts: customer identity, problem description, steps taken, current status, and any commitments made.
The summary replaces the original messages in the context. Recent messages (the last 5-8 turns) are kept verbatim so the model has full fidelity on the current thread. The result is a "sliding window" where the model always has:
- Full context on the system prompt, tools, and injected knowledge (static layers)
- A compressed summary of the conversation so far (summarized middle)
- Verbatim recent messages (high-fidelity recent context)
The compression model matters. Using a smaller, faster model (like GPT-4o-mini) for summarization keeps latency low and cost negligible. The summarization itself takes 200-500ms and costs fractions of a cent. The savings in context window space and downstream model inference cost far outweigh it.
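The sliding window can be sketched as a pure function over the message list, with `summarize` standing in for the call to the small summarization model:

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Keep the last `keepVerbatim` turns intact; replace everything older
// with a single summary message. `summarize` is a stand-in for a call
// to a fast, cheap model (e.g. GPT-4o-mini).
function compressHistory(
  history: Turn[],
  keepVerbatim: number,
  summarize: (turns: Turn[]) => string
): Turn[] {
  if (history.length <= keepVerbatim) return history;
  const older = history.slice(0, history.length - keepVerbatim);
  const recent = history.slice(history.length - keepVerbatim);
  return [
    {
      role: "assistant",
      content: `[Summary of earlier conversation] ${summarize(older)}`,
    },
    ...recent,
  ];
}
```

In production you would also cache the summary and extend it incrementally rather than re-summarizing the whole prefix on every request.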
Strategy 2: Selective Tool Loading (Toolsets)
Instead of loading every tool an agent could use, load only the tools relevant to the agent's current role or task. A billing support agent doesn't need shipping tools. A scheduling agent doesn't need CRM write tools.
The toolset pattern groups tools into named subsets. Each agent is configured with one or more toolsets. At request time, only tools from those toolsets are loaded into the context.
The impact is dramatic. An agent with access to 80 tools across 4 toolsets might only load 15-20 tools per request, cutting tool definition overhead from 25,000 tokens to 5,000-7,000 tokens. Selection accuracy improves too, because the model is choosing among 15 relevant tools instead of 80 loosely related ones.
For agents that genuinely need broad tool access (a workspace assistant or admin bot), the toolset approach still helps. Instead of loading all tools from all toolsets simultaneously, you load one toolset at a time, creating separate MCP connections per toolset. Each connection resolves only that toolset's tools, not the entire workspace catalog.
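In code, the pattern can be as simple as a map from toolset names to tool lists. The toolset and tool names below are illustrative:

```typescript
// Group tools into named subsets; an agent loads only its configured sets.
const toolsets: Record<string, string[]> = {
  "billing-tools": ["get_invoice", "refund_payment", "update_card"],
  "shipping-tools": ["get_shipment", "update_shipping_address"],
  "crm-tools": ["get_customer", "update_customer", "log_interaction"],
};

// Resolve only the tools belonging to the agent's toolsets; unknown
// toolset names resolve to nothing rather than throwing.
function loadTools(agentToolsets: string[]): string[] {
  return agentToolsets.flatMap(name => toolsets[name] ?? []);
}

console.log(loadTools(["billing-tools"])); // 3 tools, not the whole catalog
```

A billing agent configured with only `billing-tools` never pays the token cost of the shipping or CRM schemas.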
Strategy 3: Smart Memory Retrieval
Memory injection becomes a budget problem when retrieval is too aggressive. The fix is tuning three parameters:
- Score threshold: Only inject memories with a semantic similarity score above 0.3 (or higher for noisy memory stores). This filters out tangentially related memories that waste tokens.
- Result limit: Cap at 5-10 memories per request. More than 10 rarely adds useful information and starts competing with conversation history for attention.
- Recency bias: Weight recent memories higher. A customer preference from last week is more likely relevant than a complaint from six months ago.
The goal is surgical injection: the 5-8 facts most likely to help this specific conversation, not the customer's entire dossier.
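The three parameters combine naturally into a small selection function. This is a sketch: the exponential-decay recency model and its 90-day half-life are assumptions for illustration, not a standard formula:

```typescript
interface Memory {
  fact: string;
  score: number;   // semantic similarity to the query, 0..1
  ageDays: number; // days since the memory was recorded
}

// Surgical selection: score threshold, recency bias, hard cap.
function selectMemories(
  candidates: Memory[],
  { minScore = 0.3, limit = 8, recencyHalfLifeDays = 90 } = {}
): Memory[] {
  const weighted = candidates
    .filter(m => m.score >= minScore) // drop tangential matches
    .map(m => ({
      memory: m,
      // a memory loses half its weight every `recencyHalfLifeDays`
      weight: m.score * Math.pow(0.5, m.ageDays / recencyHalfLifeDays),
    }));
  weighted.sort((a, b) => b.weight - a.weight);
  return weighted.slice(0, limit).map(w => w.memory);
}
```

With these defaults, a six-month-old complaint needs roughly four times the similarity score of last week's note to outrank it.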
Strategy 4: Chunked RAG with Reranking
RAG retrieval quality directly affects context budget efficiency. The two biggest token-wasters in RAG are:
- Oversized chunks: 800-token chunks that contain 200 tokens of relevant information and 600 tokens of surrounding context. Smaller, more focused chunks (200-400 tokens) are more token-efficient.
- Low-precision retrieval: Returning 5 chunks when only 2 are truly relevant. A reranking step after initial retrieval (using a cross-encoder or a fast reranker) can cut the chunk count by 40-60% while improving relevance.
Both optimizations reduce RAG overhead from 4,000+ tokens to 1,000-2,000 without reducing answer quality. If anything, quality improves because the model has less noise to reason through.
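The rerank-then-cut step can be sketched as follows, with `rerank` standing in for a cross-encoder call and the cutoff doing the actual trimming of marginal chunks:

```typescript
interface Chunk {
  text: string;
  retrievalScore: number; // initial embedding-similarity score
}

// After first-stage retrieval, rescore each (query, chunk) pair with a
// more precise model, then keep only chunks above `cutoff`, capped at
// `limit`. The cutoff is what removes partially relevant chunks.
function rerankChunks(
  query: string,
  chunks: Chunk[],
  rerank: (query: string, chunk: Chunk) => number,
  { cutoff = 0.5, limit = 3 } = {}
): Chunk[] {
  return chunks
    .map(c => ({ chunk: c, score: rerank(query, c) }))
    .filter(r => r.score >= cutoff)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.chunk);
}
```

Note that the initial retrieval score is deliberately ignored after reranking: the cross-encoder's judgment replaces it rather than averaging with it, which is a common design choice.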
How Does Chanl Manage Context for Production Agents?
Chanl's chat service handles context management at the platform level so you don't reimplement compression and retrieval tuning for every agent. Three mechanisms work together: automatic history compression, selective memory injection, and per-toolset MCP resolution.
Automatic History Compression
When you send a message through chanl.chat.streamMessage(), the service checks conversation length against a configurable threshold. If the history exceeds it (default: 15 turns), older messages are compressed using a fast summarization model before the full request goes to the primary model.
The compression is transparent to the caller. The SDK handles it internally. From the developer's perspective, conversations can run indefinitely without context window overflows. From the model's perspective, it always sees a right-sized context: full-fidelity recent turns plus a compressed summary of earlier ones.
If compression fails (model timeout, rate limit), the service falls back to sending the full history rather than dropping messages. Risking an oversized request is better than silently truncating the conversation.
Selective Memory Injection
Memory retrieval is configured per agent. Each agent has a memory configuration that controls:
- Whether memory auto-injection is enabled (default: yes)
- The search query used for retrieval (default: "customer context preferences account history")
- Maximum results (default: 10)
- Minimum similarity score (default: 0.3)
- Minimum confidence threshold (optional additional filter)
When a customer is identified, the service calls memoryService.search() with these parameters and injects matching memories as a structured <customer_memories> block. The model sees a bulleted list of relevant facts, not raw database entries.
This design means memory injection scales with relevance, not with customer history length. A customer with 200 memory entries gets the same 5-10 most relevant facts injected as a customer with 20 entries. The token cost stays bounded.

*Example memory injection card: 4 memories recalled, including "Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday."*
Per-Toolset MCP Resolution
Rather than loading all workspace tools into every agent conversation, Chanl's MCP architecture resolves tools per toolset. When an agent has multiple toolsets configured, the chat service creates a separate MCP connection for each toolset. Each connection resolves only that toolset's tools.
This is the toolset pattern applied at the protocol level. An agent with three toolsets (billing-tools, account-tools, escalation-tools) creates three MCP connections, each loading 8-15 tools. The total tool count in context is the sum of the active toolsets, but each individual resolution is scoped and efficient.
For agents with no tools configured, the service skips MCP entirely. No connection, no tool definitions, no wasted tokens. This check (toolIds.length > 0 || toolsetIds.length > 0) prevents the "empty MCP" error that occurs when a client tries to discover tools from a server that has none to advertise.
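The guard itself is trivial. Here's a hypothetical sketch; the config shape and function name are illustrative, not Chanl's actual API:

```typescript
// Hypothetical agent config shape for illustration.
interface AgentConfig {
  toolIds: string[];    // individually attached tools
  toolsetIds: string[]; // named toolsets
}

// Skip MCP setup entirely when there is nothing to resolve: no
// connection, no tool definitions, no "empty MCP" discovery error.
function shouldConnectMcp(config: AgentConfig): boolean {
  return config.toolIds.length > 0 || config.toolsetIds.length > 0;
}
```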
How Do You Build a Context Budget?
Start by measuring your actual overhead, not estimating it. Count tokens for each layer in a real request, then set your compression thresholds based on what you find. The goal is a budget that leaves at least 60% of the context window available for conversation and response.
Here's a practical worksheet:
| Step | Action | Target |
|---|---|---|
| 1 | Count your system prompt tokens | Under 3,000 |
| 2 | Count tool definitions (multiply tools by ~250) | Under 12,000 |
| 3 | Set memory injection limit | 5-10 results, score > 0.3 |
| 4 | Set RAG chunk limit | 3-5 chunks, 200-400 tokens each |
| 5 | Calculate static overhead (steps 1-4) | Under 20,000 |
| 6 | Subtract from context window | Remaining = conversation budget |
| 7 | Set compression threshold | When history exceeds 40% of conversation budget |
For a 200k token window with 18,000 tokens of static overhead, you have 182,000 tokens for conversation + response. Setting compression to trigger at 70,000 tokens of history (roughly 40-50 turns) gives you generous room.
For a 128k window with the same overhead, you have 110,000 tokens. Compression should trigger earlier: around 40,000-50,000 tokens of history (roughly 25-30 turns).
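The worksheet arithmetic is simple enough to encode directly. The 40% trigger matches step 7 above; the numbers are the illustrative ones from the text:

```typescript
// Derive the conversation budget and compression trigger from a
// window size and measured static overhead.
function conversationBudget(windowSize: number, staticOverhead: number) {
  const remaining = windowSize - staticOverhead;
  return {
    remaining,
    // compress once history passes ~40% of the remaining budget
    compressionThreshold: Math.round(remaining * 0.4),
  };
}

console.log(conversationBudget(200_000, 18_000));
// { remaining: 182000, compressionThreshold: 72800 }
```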
The Token Economics
Context management isn't just a technical concern. It's a cost concern. Every token in the context is billed on input. Anthropic charges $3.00 per million input tokens for Claude Sonnet (or $0.30/M for cached tokens with prefix caching). An agent with 40,000 tokens of overhead costs $0.12 per request in input tokens alone, before the conversation content.
Cutting overhead from 40,000 to 15,000 tokens saves $0.075 per request. At 100,000 requests per month, that's $7,500 in savings, just from managing context more efficiently. Add prefix caching (which caches static portions of the context across requests) and the savings compound further: cached tokens cost 90% less.
This is why Factory.ai treats context as "a scarce, high-value resource, carefully allocating and curating it with the same rigor one might apply to managing CPU time or memory." Their engineering blog on the context window problem describes building structured repository overviews, semantic search, and targeted file operations, all to keep token consumption within budget for coding agent tasks.
What Does the Future of Context Management Look Like?
Context windows will keep growing. Gemini already offers 1 million tokens. Google is testing 10 million. But bigger windows don't solve the underlying problem, because the things we put into context are growing faster than the windows themselves.
VentureBeat's analysis of why AI coding agents aren't production-ready identified "brittle context windows" as one of three core failure modes, alongside broken refactors and missing operational awareness. The issue isn't the size of the window. It's that models progressively lose coherence when reasoning across very long contexts. A 128k context where the first 80k tokens are infrastructure and only the last 48k are actual conversation is a model spending most of its attention on capabilities rather than the task.
Three trends are converging to address this:
Dynamic tool loading (like Anthropic's Tool Search) makes large tool catalogs compatible with bounded context budgets. Instead of "load everything, hope it fits," agents will discover tools on demand.
Hierarchical memory systems that store and retrieve at multiple granularity levels. Instead of injecting individual facts, future memory systems will inject pre-composed "context packets" that combine relevant facts, preferences, and history into a single, dense representation.
Model-side improvements in long-context attention. Techniques like RoPE scaling, landmark attention, and sparse attention mechanisms are making models better at using information from the middle of long contexts, partially addressing the lost-in-the-middle problem.
But the fundamental tension remains: a fixed context window is a shared resource, and everything in it competes for the model's attention. Context engineering, the discipline of deciding what goes in, where it goes, and when to compress it, will remain a core skill for agent developers regardless of how large windows grow.
Quick Reference: Context Budget Checklist
- Measure actual token count of system prompt (target: under 3,000)
- Count tool definitions (multiply tool count by ~250 tokens)
- Audit tool count: implement toolsets if over 20 tools
- Set memory injection threshold (score > 0.3, limit 5-10 results)
- Set RAG chunk size (200-400 tokens) and limit (3-5 chunks)
- Calculate total static overhead per request
- Set history compression threshold (40% of remaining budget)
- Enable prefix caching for static context layers
- Monitor per-request token usage in production
- Review and prune tool descriptions quarterly
Conclusion
Your context window is a budget. Like any budget, the answer isn't to spend less on everything. It's to spend deliberately.
The system prompt is cheap, so write it well. Tool definitions are expensive, so scope them with toolsets. Memory injection should be selective, not exhaustive. History compression should be automatic, not manual. RAG retrieval should favor precision over recall.
Every token in your context window is attention the model is spending. That first request where the user typed 6 tokens and you sent 15,000? Now you know where the other 14,994 went. The question is whether those tokens earned their place, or whether they're infrastructure the model will never use for this particular conversation.
Further Reading

- Martin Fowler: Context Engineering for Coding Agents (2026)
- Factory.ai: The Context Window Problem
- Factory.ai: Compressing Context
- Anthropic: Advanced Tool Use and Tool Search
- Anthropic: Tool Search Tool Documentation
- VentureBeat: Why AI Coding Agents Aren't Production-Ready
- Anthropic: Context Windows Documentation
- StackOne: Agentic Context Engineering
- Block (Goose): The AI Skeptic Guide to Context Windows