Open your browser's developer tools during an AI agent conversation. Watch the network tab. Every request to the model carries a payload that's much larger than what the user typed. Most of the tokens aren't the conversation. They're infrastructure.
System prompt. Tool definitions. MCP schemas. Injected memories from previous sessions. Knowledge base chunks pulled in via RAG. Conversation history. All of it compressed into a single context window, all of it competing for the same fixed budget of tokens.
The user types "What's my order status?" That's 6 tokens. The full request to the model? Easily 15,000. Maybe 40,000 if your agent has a lot of tools. And the model has to reason over all of it to produce a response.
This is the context window crisis. Not that context windows are too small (they've grown from 4k to 200k tokens in two years), but that everything we're putting into them has grown faster. Every feature you add to an agent, whether it's a new tool, a memory system, or a knowledge base, costs tokens. And tokens aren't just money. They're attention. They're accuracy. They're the difference between an agent that picks the right tool and one that confidently picks the wrong one.
Martin Fowler calls this discipline "context engineering": the skill of designing and orchestrating the full information environment that a model operates within. It's what separates developers who get 10x value from AI agents from those who get 2x. And it starts with understanding where your tokens actually go.
| Layer | Tokens per Request | Notes |
|---|---|---|
| System prompt | 500 - 3,000 | Persona, rules, formatting instructions |
| Tool definitions | 2,000 - 55,000+ | 150-400 tokens per tool; scales with tool count |
| Memory injection | 500 - 2,000 | Customer facts from previous conversations |
| RAG / knowledge base | 1,000 - 4,000 | Retrieved document chunks |
| Conversation history | 500 - 100,000+ | Grows with every turn |
| Model response | 500 - 4,000 | The actual output |
| Total overhead (before user message) | 4,500 - 64,000+ | Often 10-50% of the window |
This article breaks down each layer, shows you the math, and walks through the compression strategies that keep your agent effective as conversations grow. We'll look at how Claude Code manages context when it has access to hundreds of tools, how production chat services handle history compression, and how memory injection stays selective instead of dumping everything into the prompt.
Prerequisites & Setup
This is a conceptual deep-dive. You don't need to install anything to follow along, but if you want to count tokens yourself:
```typescript
// npm install tiktoken
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
console.log(enc.encode('What is my order status?').length);
// => 6 tokens
```

Two prerequisite articles will help if you're new to agent infrastructure: Build Your Own AI Agent Memory System covers the three memory layers (session, persistent, semantic), and MCP Explained covers how tool schemas work at the protocol level.
What Is a Context Window, Really?
A context window is the total number of tokens a language model can process in a single request. It includes everything: inputs, instructions, history, and the model's own response. Think of it as working memory. Not storage. Not long-term recall. Working memory, with a hard ceiling.
A common misconception is that the context window is "how much the model can read." It's more accurate to say it's how much the model can hold in its head while producing a single response. Everything must fit: the system prompt that defines behavior, the tool definitions that describe capabilities, the conversation so far, any injected context, and the response the model is generating.
Tokens Are Not Words
Tokens are the fundamental unit of language models, and they don't map cleanly to words. In English, one token averages about 0.75 words (or about 4 characters). The word "tokenization" is two tokens. "AI" is one. A JSON schema for a tool parameter might look short to a human but consume hundreds of tokens because of structural characters like braces, colons, and quotes.
This matters because tool definitions, which are heavily structured JSON, are more token-expensive per "word" of useful information than natural language. A tool description that's three sentences long might consume 80 tokens for the text and another 200 for the parameter schema. The schema is the expensive part.
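To see why the schema dominates, here's a rough back-of-envelope estimator using the ~4-characters-per-token heuristic from above. The tool definition is hypothetical, and an exact count requires the model's real tokenizer (e.g. tiktoken); this is only for quick budgeting:

```typescript
// Rough token estimate using the ~4-characters-per-token heuristic.
// Only for budgeting; real counts require the model's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A hypothetical tool definition in the function-schema style.
const getOrderStatus = {
  name: "get_order_status",
  description: "Look up the current status of a customer order by ID.",
  parameters: {
    type: "object",
    properties: {
      order_id: { type: "string", description: "The order identifier." },
      include_history: {
        type: "boolean",
        description: "Include shipment history events.",
      },
    },
    required: ["order_id"],
  },
};

const textTokens = estimateTokens(getOrderStatus.description);
const schemaTokens = estimateTokens(JSON.stringify(getOrderStatus.parameters));
console.log({ textTokens, schemaTokens }); // the schema dwarfs the prose
```

Run this against your own tool definitions and the pattern holds: the braces, quotes, and property names of the JSON Schema cost several times more than the human-readable description.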
The budget table above shows a healthy allocation. The concerning version is when tool definitions alone consume 30-50% of the window, which happens more often than you'd think.
Where Do Your Tokens Actually Go?
Tokens are consumed by five infrastructure layers before the user's message even enters the context. The system prompt is the cheapest layer; tool definitions are usually the most expensive. Understanding the cost of each layer is the foundation of context engineering.
Let's walk through each layer in the order the model sees them.
Layer 1: The System Prompt
The system prompt defines who the agent is, how it should behave, what format to use for responses, and what constraints to follow. In production agents, this is rarely a single paragraph. It often includes persona instructions, safety rules, output formatting requirements, language preferences, and domain-specific guidelines.
A minimal system prompt runs 200-500 tokens. A production system prompt for a customer service agent with detailed escalation rules, formatting requirements, and persona guidelines typically runs 1,500-3,000 tokens. Some enterprise deployments push past 5,000.
The system prompt is relatively cheap compared to other layers, but it's also the layer developers over-engineer most. Every rule you add is tokens you can't use for conversation. A system prompt that's 5,000 tokens of exhaustive edge-case rules is often less effective than a 1,500-token prompt that covers the core behavior and lets the model's training handle the rest.
Layer 2: Tool Definitions (The Hidden Giant)
This is where context budgets break. Each tool definition includes a name, a description (which the model reads to decide when to use the tool), and a JSON Schema for input parameters. A simple tool with two string parameters costs 150-200 tokens. A complex tool with nested objects, enums, and detailed descriptions costs 300-400 tokens.
The math gets alarming fast:
| Tool Count | Estimated Tokens | % of 128k Window | % of 200k Window |
|---|---|---|---|
| 5 tools | 1,000 | 0.8% | 0.5% |
| 15 tools | 3,500 | 2.7% | 1.8% |
| 30 tools | 8,000 | 6.3% | 4.0% |
| 50 tools | 15,000 | 11.7% | 7.5% |
| 100 tools | 35,000 | 27.3% | 17.5% |
| 7 MCP servers | 67,000 | 52.3% | 33.5% |
That last row isn't hypothetical. Anthropic's measurements found that a typical multi-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk, and a couple of internal tools) consumes approximately 67,000 tokens in tool definitions before the agent does any actual work. That's more than half of a 128k context window, gone before the first message.
This is why Perplexity's CTO Denis Yarats announced at the Ask 2026 conference that Perplexity was moving away from MCP internally, citing context window overhead as a core issue. When tool definitions eat more than half your budget, you've inverted the purpose of the agent. It's spending more attention on what it could do than on what the user is asking it to do.
Berkeley Function-Calling Leaderboard benchmarks confirm what practitioners observe: tool selection accuracy drops 15-25% when moving from a 10-tool agent to a 50-tool agent with otherwise identical prompts. The failure mode is subtle. The agent doesn't refuse to use tools. It confidently picks the wrong one. A customer asks about their order status and the agent calls update_shipping_address instead of get_order_status because both descriptions mention "order" and "address."
Layer 3: Memory Injection
If your agent has a memory system, relevant facts from previous conversations are retrieved and injected into the context at the start of each request. A customer who called last week about a billing issue, prefers email, and has an enterprise account generates memory entries. The agent searches stored memories semantically, retrieves the top matches, and prepends them to the context.
The cost is moderate: typically 500-2,000 tokens for 5-10 memory entries. But it's a layer that grows over time. A customer with 50+ memory entries and loose retrieval settings (low score threshold, high result limit) will inject noise. Memories about a resolved shipping issue from six months ago waste tokens when the customer is calling about a new product.
The key insight: memory injection should be surgical. You want the 5-10 facts most relevant to this conversation, not a complete biography.
Layer 4: RAG (Knowledge Base Retrieval)
When an agent needs domain knowledge that isn't in its training data (your product catalog, internal policies, technical documentation), a RAG pipeline retrieves relevant document chunks and injects them. Each chunk is typically 200-800 tokens, and a standard retrieval returns 3-5 chunks: 1,000-4,000 tokens per request.
The catch: retrieval isn't perfect. If your embeddings aren't well-tuned or your chunks are too large, you inject partially relevant content that wastes tokens and confuses the model. Two chunks about slightly different product versions, both partially matching the query, burn 1,200 tokens and give the model contradictory information.
Layer 5: Conversation History
Conversation history is the one layer that grows during the session. Every user message and assistant response adds to the history. A single exchange (user turn + assistant response) typically costs 200-1,000 tokens depending on response length.
After 20 exchanges, history can consume 8,000-20,000 tokens. After 50 exchanges, 30,000-60,000 tokens. In a long customer service conversation with detailed back-and-forth about a complex issue, history alone can push past 100,000 tokens.
This is where the math gets critical. Your static overhead (system prompt + tools + memory + RAG) is fixed per request. History grows linearly. At some point, the sum exceeds what the model can hold. And unlike the other layers, you can't just remove history without losing the thread of the conversation.
What Is the Lost-in-the-Middle Problem?
Language models pay disproportionate attention to information at the beginning and end of the context window. Content placed in the middle 40-60% of a long context sees measurably lower recall. This means that even when you have budget remaining, the position of information within the context matters as much as whether it fits.
Research from Stanford and other groups on long-context language models consistently shows this U-shaped attention pattern. The model reads the beginning carefully (system prompt, tool definitions), skims the middle (older conversation turns), and reads the end carefully (recent messages, the current query).
This has practical implications for every layer:
System prompt: Placed first, gets maximum attention. Good.
Tool definitions: Placed right after the system prompt, gets strong attention. But in a 50-tool setup, the tools in positions 20-40 get less attention than the first 10 and last 10. This partly explains why tool selection degrades with tool count. It's not just quantity. It's position.
Memory and RAG: Often placed in the middle of the context, between tools and conversation history. This is the worst position for recall. If a critical memory fact is injected between tool definitions and conversation history, the model may effectively ignore it.
Conversation history: The oldest turns are in the middle. The newest turns are at the end, getting strong attention. This is why models seem to "forget" things mentioned 20 turns ago even when the history is still in context.
The takeaway: fitting everything into the window is necessary but not sufficient. You also need to position high-priority information at the beginning or end of the context, and compress or summarize the middle.
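One way to act on this is to fix the assembly order of the layers so high-value content sits at the window's edges. A minimal sketch, with illustrative layer names rather than any specific framework's API:

```typescript
interface ContextParts {
  systemPrompt: string;   // behavior rules
  toolDefs: string;       // serialized tool definitions
  historySummary: string; // compressed older turns
  memories: string;       // injected customer facts
  recentTurns: string;    // verbatim recent turns
  userMessage: string;    // the current query
}

// Order the layers for U-shaped attention: instructions and tools at
// the start, already-compressed material in the weak middle, and the
// live conversation at the end where recall is strongest.
function assembleContext(p: ContextParts): string {
  return [
    p.systemPrompt,   // beginning: strong attention
    p.toolDefs,
    p.historySummary, // middle: weakest attention, lowest-fidelity content
    p.memories,       // near the end so key facts stay salient
    p.recentTurns,    // end: strong attention
    p.userMessage,
  ].join("\n\n");
}
```

The design choice here is that the summary, which is already lossy, is the only layer relegated to the dead zone, while memories sit close to the recent turns they're meant to inform.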
How Does Claude Code Manage 200+ Tools?
Claude Code has access to hundreds of tools through MCP servers: file operations, search, web browsing, code execution, Git operations, and dozens of project-specific tools. Loading all of them into every request would consume most of the context window. Instead, Anthropic built Tool Search: a meta-tool that dynamically discovers relevant tools per request.
The approach is elegant. Instead of injecting all 200+ tool definitions (which would cost 40,000-80,000 tokens), Claude Code injects a single Tool Search tool (~200 tokens) that can query for relevant tools on demand. When the model needs a capability, it first calls Tool Search with a description of what it needs. Tool Search returns only the matching tools, which are then available for the current request.
Anthropic reports that with Tool Search enabled, 191,300 tokens of the window remain available for actual work, versus 122,800 with the traditional "load everything" approach. That's an 85% reduction in tool definition overhead while maintaining access to the full tool library.
The recommendation is clear: when your agent needs access to more than 30 tools, stop loading them all. Use a dynamic discovery mechanism. This could be Anthropic's Tool Search, a custom tool-routing layer, or (the approach we'll discuss later) scoping tools into named subsets that load based on the agent's role.
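A toy version of the discovery idea, assuming a simple keyword match where a production system would use embedding similarity. The registry and tool names are illustrative:

```typescript
interface ToolDef {
  name: string;
  description: string;
  schemaTokens: number;
}

// Instead of injecting every definition, keep tools in a registry and
// return only the ones matching the model's query. A keyword-overlap
// stand-in for what would normally be an embedding search.
class ToolRegistry {
  constructor(private tools: ToolDef[]) {}

  search(query: string, limit = 5): ToolDef[] {
    const terms = query.toLowerCase().split(/\s+/);
    return this.tools
      .map(t => ({
        tool: t,
        // score = number of query terms appearing in name or description
        score: terms.filter(term =>
          (t.name + " " + t.description).toLowerCase().includes(term)
        ).length,
      }))
      .filter(r => r.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map(r => r.tool);
  }
}

const registry = new ToolRegistry([
  { name: "get_order_status", description: "Look up the status of an order.", schemaTokens: 180 },
  { name: "update_shipping_address", description: "Change the shipping address on an order.", schemaTokens: 240 },
  { name: "create_invoice", description: "Create a billing invoice.", schemaTokens: 200 },
]);
console.log(registry.search("order status").map(t => t.name));
```

Only the matched definitions enter the context; the rest of the catalog costs nothing until it's asked for.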
“A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) can consume ~55k tokens in definitions before Claude does any actual work.”
What Compression Strategies Actually Work?
The four most effective compression strategies are: conversation history summarization, sliding window with summary prefix, selective tool loading via toolsets, and smart retrieval thresholds for memory and RAG. Each targets a different layer and they stack.
Strategy 1: History Summarization
When conversation history exceeds a threshold (typically 15-20 turns), older messages are summarized into a compact representation. A 20-turn conversation that originally consumed 15,000 tokens might compress to a 2,000-token summary that preserves the key facts: customer identity, problem description, steps taken, current status, and any commitments made.
The summary replaces the original messages in the context. Recent messages (the last 5-8 turns) are kept verbatim so the model has full fidelity on the current thread. The result is a "sliding window" where the model always has:
- Full context on the system prompt, tools, and injected knowledge (static layers)
- A compressed summary of the conversation so far (summarized middle)
- Verbatim recent messages (high-fidelity recent context)
The compression model matters. Using a smaller, faster model (like GPT-4o-mini) for summarization keeps latency low and cost negligible. The summarization itself takes 200-500ms and costs fractions of a cent. The savings in context window space and downstream model inference cost far outweigh it.
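The sliding window can be sketched as a pure function over the message list, with `summarize` standing in for the call to the small summarization model:

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Keep the last `keepVerbatim` turns intact; replace everything older
// with a single summary message. `summarize` is a stand-in for a call
// to a fast, cheap model (e.g. GPT-4o-mini).
function compressHistory(
  history: Turn[],
  keepVerbatim: number,
  summarize: (turns: Turn[]) => string
): Turn[] {
  if (history.length <= keepVerbatim) return history;
  const older = history.slice(0, history.length - keepVerbatim);
  const recent = history.slice(history.length - keepVerbatim);
  return [
    {
      role: "assistant",
      content: `[Summary of earlier conversation] ${summarize(older)}`,
    },
    ...recent,
  ];
}
```

In production you would also cache the summary and extend it incrementally rather than re-summarizing the whole prefix on every request.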
Strategy 2: Selective Tool Loading (Toolsets)
Instead of loading every tool an agent could use, load only the tools relevant to the agent's current role or task. A billing support agent doesn't need shipping tools. A scheduling agent doesn't need CRM write tools.
The toolset pattern groups tools into named subsets. Each agent is configured with one or more toolsets. At request time, only tools from those toolsets are loaded into the context.
The impact is dramatic. An agent with access to 80 tools across 4 toolsets might only load 15-20 tools per request, cutting tool definition overhead from 25,000 tokens to 5,000-7,000 tokens. Selection accuracy improves too, because the model is choosing among 15 relevant tools instead of 80 loosely related ones.
For agents that genuinely need broad tool access (a workspace assistant or admin bot), the toolset approach still helps. Instead of loading all tools from all toolsets simultaneously, you load one toolset at a time, creating separate MCP connections per toolset. Each connection resolves only that toolset's tools, not the entire workspace catalog.
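In code, the pattern can be as simple as a map from toolset names to tool lists. The toolset and tool names below are illustrative:

```typescript
// Group tools into named subsets; an agent loads only its configured sets.
const toolsets: Record<string, string[]> = {
  "billing-tools": ["get_invoice", "refund_payment", "update_card"],
  "shipping-tools": ["get_shipment", "update_shipping_address"],
  "crm-tools": ["get_customer", "update_customer", "log_interaction"],
};

// Resolve only the tools belonging to the agent's toolsets; unknown
// toolset names resolve to nothing rather than throwing.
function loadTools(agentToolsets: string[]): string[] {
  return agentToolsets.flatMap(name => toolsets[name] ?? []);
}

console.log(loadTools(["billing-tools"])); // 3 tools, not the whole catalog
```

A billing agent configured with only `billing-tools` never pays the token cost of the shipping or CRM schemas.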
Strategy 3: Smart Memory Retrieval
Memory injection becomes a budget problem when retrieval is too aggressive. The fix is tuning three parameters:
- Score threshold: Only inject memories with a semantic similarity score above 0.3 (or higher for noisy memory stores). This filters out tangentially related memories that waste tokens.
- Result limit: Cap at 5-10 memories per request. More than 10 rarely adds useful information and starts competing with conversation history for attention.
- Recency bias: Weight recent memories higher. A customer preference from last week is more likely relevant than a complaint from six months ago.
The goal is surgical injection: the 5-8 facts most likely to help this specific conversation, not the customer's entire dossier.
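The three parameters combine naturally into a small selection function. This is a sketch: the exponential-decay recency model and its 90-day half-life are assumptions for illustration, not a standard formula:

```typescript
interface Memory {
  fact: string;
  score: number;   // semantic similarity to the query, 0..1
  ageDays: number; // days since the memory was recorded
}

// Surgical selection: score threshold, recency bias, hard cap.
function selectMemories(
  candidates: Memory[],
  { minScore = 0.3, limit = 8, recencyHalfLifeDays = 90 } = {}
): Memory[] {
  const weighted = candidates
    .filter(m => m.score >= minScore) // drop tangential matches
    .map(m => ({
      memory: m,
      // a memory loses half its weight every `recencyHalfLifeDays`
      weight: m.score * Math.pow(0.5, m.ageDays / recencyHalfLifeDays),
    }));
  weighted.sort((a, b) => b.weight - a.weight);
  return weighted.slice(0, limit).map(w => w.memory);
}
```

With these defaults, a six-month-old complaint needs roughly four times the similarity score of last week's note to outrank it.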
Strategy 4: Chunked RAG with Reranking
RAG retrieval quality directly affects context budget efficiency. The two biggest token-wasters in RAG are:
- Oversized chunks: 800-token chunks that contain 200 tokens of relevant information and 600 tokens of surrounding context. Smaller, more focused chunks (200-400 tokens) are more token-efficient.
- Low-precision retrieval: Returning 5 chunks when only 2 are truly relevant. A reranking step after initial retrieval (using a cross-encoder or a fast reranker) can cut the chunk count by 40-60% while improving relevance.
Both optimizations reduce RAG overhead from 4,000+ tokens to 1,000-2,000 without reducing answer quality. If anything, quality improves because the model has less noise to reason through.
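The rerank-then-cut step can be sketched as follows, with `rerank` standing in for a cross-encoder call and the cutoff doing the actual trimming of marginal chunks:

```typescript
interface Chunk {
  text: string;
  retrievalScore: number; // initial embedding-similarity score
}

// After first-stage retrieval, rescore each (query, chunk) pair with a
// more precise model, then keep only chunks above `cutoff`, capped at
// `limit`. The cutoff is what removes partially relevant chunks.
function rerankChunks(
  query: string,
  chunks: Chunk[],
  rerank: (query: string, chunk: Chunk) => number,
  { cutoff = 0.5, limit = 3 } = {}
): Chunk[] {
  return chunks
    .map(c => ({ chunk: c, score: rerank(query, c) }))
    .filter(r => r.score >= cutoff)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.chunk);
}
```

Note that the initial retrieval score is deliberately ignored after reranking: the cross-encoder's judgment replaces it rather than averaging with it, which is a common design choice.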
How Does Chanl Manage Context for Production Agents?
Chanl's chat service handles context management at the platform level so you don't reimplement compression and retrieval tuning for every agent. Three mechanisms work together: automatic history compression, selective memory injection, and per-toolset MCP resolution.
Automatic History Compression
When you send a message through chanl.chat.streamMessage(), the service checks conversation length against a configurable threshold. If the history exceeds it (default: 15 turns), older messages are compressed using a fast summarization model before the full request goes to the primary model.
The compression is transparent to the caller. The SDK handles it internally. From the developer's perspective, conversations can run indefinitely without context window overflows. From the model's perspective, it always sees a right-sized context: full-fidelity recent turns plus a compressed summary of earlier ones.
If compression fails (model timeout, rate limit), the service falls back to sending the full history rather than dropping messages. Risking an oversized request is better than silently truncating the conversation.
Selective Memory Injection
Memory retrieval is configured per agent. Each agent has a memory configuration that controls:
- Whether memory auto-injection is enabled (default: yes)
- The search query used for retrieval (default: "customer context preferences account history")
- Maximum results (default: 10)
- Minimum similarity score (default: 0.3)
- Minimum confidence threshold (optional additional filter)
When a customer is identified, the service calls memoryService.search() with these parameters and injects matching memories as a structured <customer_memories> block. The model sees a bulleted list of relevant facts, not raw database entries.
This design means memory injection scales with relevance, not with customer history length. A customer with 200 memory entries gets the same 5-10 most relevant facts injected as a customer with 20 entries. The token cost stays bounded.

*Example memory injection card: 4 memories recalled, including "Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday."*
Per-Toolset MCP Resolution
Rather than loading all workspace tools into every agent conversation, Chanl's MCP architecture resolves tools per toolset. When an agent has multiple toolsets configured, the chat service creates a separate MCP connection for each toolset. Each connection resolves only that toolset's tools.
This is the toolset pattern applied at the protocol level. An agent with three toolsets (billing-tools, account-tools, escalation-tools) creates three MCP connections, each loading 8-15 tools. The total tool count in context is the sum of the active toolsets, but each individual resolution is scoped and efficient.
For agents with no tools configured, the service skips MCP entirely. No connection, no tool definitions, no wasted tokens. This check (toolIds.length > 0 || toolsetIds.length > 0) prevents the "empty MCP" error that occurs when a client tries to discover tools from a server that has none to advertise.
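The guard itself is trivial. Here's a hypothetical sketch; the config shape and function name are illustrative, not Chanl's actual API:

```typescript
// Hypothetical agent config shape for illustration.
interface AgentConfig {
  toolIds: string[];    // individually attached tools
  toolsetIds: string[]; // named toolsets
}

// Skip MCP setup entirely when there is nothing to resolve: no
// connection, no tool definitions, no "empty MCP" discovery error.
function shouldConnectMcp(config: AgentConfig): boolean {
  return config.toolIds.length > 0 || config.toolsetIds.length > 0;
}
```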
How Do You Build a Context Budget?
Start by measuring your actual overhead, not estimating it. Count tokens for each layer in a real request, then set your compression thresholds based on what you find. The goal is a budget that leaves at least 60% of the context window available for conversation and response.
Here's a practical worksheet:
| Step | Action | Target |
|---|---|---|
| 1 | Count your system prompt tokens | Under 3,000 |
| 2 | Count tool definitions (multiply tools by ~250) | Under 12,000 |
| 3 | Set memory injection limit | 5-10 results, score > 0.3 |
| 4 | Set RAG chunk limit | 3-5 chunks, 200-400 tokens each |
| 5 | Calculate static overhead (steps 1-4) | Under 20,000 |
| 6 | Subtract from context window | Remaining = conversation budget |
| 7 | Set compression threshold | When history exceeds 40% of conversation budget |
For a 200k token window with 18,000 tokens of static overhead, you have 182,000 tokens for conversation + response. Setting compression to trigger at 70,000 tokens of history (roughly 40-50 turns) gives you generous room.
For a 128k window with the same overhead, you have 110,000 tokens. Compression should trigger earlier: around 40,000-50,000 tokens of history (roughly 25-30 turns).
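The worksheet arithmetic is simple enough to encode directly. The 40% trigger matches step 7 above; the numbers are the illustrative ones from the text:

```typescript
// Derive the conversation budget and compression trigger from a
// window size and measured static overhead.
function conversationBudget(windowSize: number, staticOverhead: number) {
  const remaining = windowSize - staticOverhead;
  return {
    remaining,
    // compress once history passes ~40% of the remaining budget
    compressionThreshold: Math.round(remaining * 0.4),
  };
}

console.log(conversationBudget(200_000, 18_000));
// { remaining: 182000, compressionThreshold: 72800 }
```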
The Token Economics
Context management isn't just a technical concern. It's a cost concern. Every token in the context is billed on input. Anthropic charges $3.00 per million input tokens for Claude Sonnet (or $0.30/M for cached tokens with prefix caching). An agent with 40,000 tokens of overhead costs $0.12 per request in input tokens alone, before the conversation content.
Cutting overhead from 40,000 to 15,000 tokens saves $0.075 per request. At 100,000 requests per month, that's $7,500 in savings, just from managing context more efficiently. Add prefix caching (which caches static portions of the context across requests) and the savings compound further: cached tokens cost 90% less.
This is why Factory.ai treats context as "a scarce, high-value resource, carefully allocating and curating it with the same rigor one might apply to managing CPU time or memory." Their engineering blog on the context window problem describes building structured repository overviews, semantic search, and targeted file operations, all to keep token consumption within budget for coding agent tasks.
What Does the Future of Context Management Look Like?
Context windows will keep growing. Gemini already offers 1 million tokens. Google is testing 10 million. But bigger windows don't solve the underlying problem, because the things we put into context are growing faster than the windows themselves.
VentureBeat's analysis of why AI coding agents aren't production-ready identified "brittle context windows" as one of three core failure modes, alongside broken refactors and missing operational awareness. The issue isn't the size of the window. It's that models progressively lose coherence when reasoning across very long contexts. A 128k context where the first 80k tokens are infrastructure and only the last 48k are actual conversation is a model spending most of its attention on capabilities rather than the task.
Three trends are converging to address this:
Dynamic tool loading (like Anthropic's Tool Search) makes large tool catalogs compatible with bounded context budgets. Instead of "load everything, hope it fits," agents will discover tools on demand.
Hierarchical memory systems that store and retrieve at multiple granularity levels. Instead of injecting individual facts, future memory systems will inject pre-composed "context packets" that combine relevant facts, preferences, and history into a single, dense representation.
Model-side improvements in long-context attention. Techniques like RoPE scaling, landmark attention, and sparse attention mechanisms are making models better at using information from the middle of long contexts, partially addressing the lost-in-the-middle problem.
But the fundamental tension remains: a fixed context window is a shared resource, and everything in it competes for the model's attention. Context engineering, the discipline of deciding what goes in, where it goes, and when to compress it, will remain a core skill for agent developers regardless of how large windows grow.
Quick Reference: Context Budget Checklist
- Measure actual token count of system prompt (target: under 3,000)
- Count tool definitions (multiply tool count by ~250 tokens)
- Audit tool count: implement toolsets if over 20 tools
- Set memory injection threshold (score > 0.3, limit 5-10 results)
- Set RAG chunk size (200-400 tokens) and limit (3-5 chunks)
- Calculate total static overhead per request
- Set history compression threshold (40% of remaining budget)
- Enable prefix caching for static context layers
- Monitor per-request token usage in production
- Review and prune tool descriptions quarterly
Conclusion
Your context window is a budget. Like any budget, the answer isn't to spend less on everything. It's to spend deliberately.
The system prompt is cheap, so write it well. Tool definitions are expensive, so scope them with toolsets. Memory injection should be selective, not exhaustive. History compression should be automatic, not manual. RAG retrieval should favor precision over recall.
Every token in your context window is attention the model is spending. That first request where the user typed 6 tokens and you sent 15,000? Now you know where the other 14,994 went. The question is whether those tokens earned their place, or whether they're infrastructure the model will never use for this particular conversation.
Further Reading

- Martin Fowler: Context Engineering for Coding Agents (2026)
- Factory.ai: The Context Window Problem
- Factory.ai: Compressing Context
- Anthropic: Advanced Tool Use and Tool Search
- Anthropic: Tool Search Tool Documentation
- VentureBeat: Why AI Coding Agents Aren't Production-Ready
- Anthropic: Context Windows Documentation
- StackOne: Agentic Context Engineering
- Block (Goose): The AI Skeptic Guide to Context Windows