You've spent hours perfecting a system prompt. The agent still hallucinates, forgets what the customer said three turns ago, and ignores half its tools. Sound familiar?
The problem isn't your prompt. It's everything else in the context window.
Anthropic's Applied AI team published their guide to context engineering in late 2025, and by early 2026 the term had swept through the developer community. Shopify CEO Tobi Lutke put it bluntly: context engineering "describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy called it "the delicate art and science of filling the context window with just the right information for the next step."
This isn't a rebrand. It's a recognition that the hard engineering problem behind production AI agents was never about writing better sentences. It's about building systems that dynamically assemble the right information -- system instructions, conversation history, retrieved knowledge, persistent memory, tool definitions -- into a limited context window, on every single inference call, across multi-step workflows that can run for dozens of turns.
This tutorial teaches you to build that system. We'll go from the conceptual shift through each layer of a production context pipeline, with working code in both TypeScript and Python. By the end, you'll have a context engine that handles memory injection, knowledge retrieval, history compression, and dynamic tool resolution.
What's in this article
- Prerequisites
- What Is Context Engineering
- The Five Layers
- Layer 1: System Instructions
- Layer 2: Knowledge Retrieval
- Layer 3: Memory Injection
- Layer 4: History Compression
- Layer 5: Tool Resolution
- Assembling the Engine
- Just-In-Time vs. Upfront Retrieval
- Sub-Agents for Complex Tasks
- Token Budgeting
- Monitoring
- Common Pitfalls
- What's Next
Prerequisites
You'll need Node.js 20+ and Python 3.11+ installed. We'll use the Vercel AI SDK for TypeScript and the openai package for Python.
TypeScript setup:
mkdir context-engine && cd context-engine
npm init -y
npm install ai @ai-sdk/openai zod
npm install -D typescript @types/node tsx
npx tsc --init --target ES2022 --module NodeNext --moduleResolution NodeNext --outDir dist
Python setup:
mkdir context_engine && cd context_engine
python -m venv .venv && source .venv/bin/activate
pip install openai numpy tiktoken
We install tiktoken alongside openai because accurate token counting matters -- the naive "divide by 4" heuristic can be off by 20-30% on code-heavy or multilingual content, and that margin is the difference between fitting in your budget and silently truncating context.
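To make that concrete, here is a small sketch of a token counter that prefers tiktoken's exact BPE counts and falls back to the chars/4 heuristic only when the library (or the model's encoding) is unavailable. The function name and fallback behavior are our own convention, not part of either SDK:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens exactly with tiktoken when possible; otherwise estimate."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except (ImportError, KeyError):
        # Heuristic fallback: ~4 characters per token for English prose.
        # Expect it to drift on code, JSON, and non-English text.
        return max(1, len(text) // 4)

print(count_tokens("Hello, how can I help you today?"))
```

Every token-budget check in the rest of this tutorial assumes some counter like this exists; swap the heuristic out for the exact count wherever precision matters.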
Environment:
export OPENAI_API_KEY="sk-..."
If you've worked through our RAG tutorial or MCP server guide, you already have the foundation. Context engineering is what ties those pieces together into a coherent runtime.
What Is Context Engineering
Context engineering is the discipline of designing and managing everything a model sees at inference time -- not just the prompt, but the entire assembled context window including system instructions, tool schemas, retrieved documents, conversation history, and injected memories. It's the difference between writing a good question and building the information system that makes good answers possible.
The scale of the shift is hard to overstate. According to LangChain's 2026 State of Agent Engineering report, 57% of organizations now have agents in production -- but among large enterprises, "quality" remains the number-one barrier to scaling, cited by 32% of respondents. The report specifically calls out "context engineering and managing context at scale" as an ongoing difficulty. In other words: most teams can get an agent demo working. The context pipeline is what separates the demo from production.
Here's how Anthropic's team defines it: "Context engineering is the art and science of curating what will go into the limited context window from that constantly evolving universe of possible information."
That word "curating" is doing a lot of work. Remember the agent that hallucinates and forgets what the customer said three turns ago? Here's exactly why. When your agent handles a customer support call, the universe of possibly-relevant information includes:
- The system prompt defining agent behavior
- The customer's full conversation history (maybe 50+ turns)
- Their profile and previous interactions from your CRM
- Knowledge base articles matching their question
- Definitions and schemas for 20+ available tools
- Memories from previous conversations ("prefers email over phone", "has a premium plan")
- Results from tools already called in this conversation
All of that needs to fit in a context window. Claude Opus 4.6 gives you 1 million tokens -- roughly 750,000 words. That sounds like plenty until you realize that a 30-minute customer support conversation with tool calls, knowledge retrieval, and memory injection can easily consume 50,000+ tokens. And more isn't better: as deepset's research notes, "models can actually get worse at recalling specific facts as the context gets very large." They call this phenomenon context rot.
The transformer architecture processes tokens with attention that scales quadratically -- n tokens create n-squared pairwise relationships. Stuff in everything, and the model's attention budget gets diluted across irrelevant information.
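The arithmetic behind that dilution is easy to run yourself. A 10x increase in context length is a 100x increase in pairwise relationships competing for the same attention budget:

```python
# Pairwise attention relationships grow quadratically with context length:
# n tokens -> n * n pairs the attention mechanism must spread itself over.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>18,} pairwise relationships")
```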
Prompt Engineering vs. Context Engineering
If you've read our prompt engineering techniques guide, you know that individual prompt techniques -- chain-of-thought, few-shot examples, role prompting -- are powerful. Context engineering doesn't replace those techniques. It's the layer above them.
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Scope | The instruction text you write | Everything the model sees at inference |
| Nature | Mostly static templates | Dynamic, assembled per-request at runtime |
| Focus | How to phrase the question | What information to provide and when |
| Integration | Text-based | Tools, APIs, memory, retrieved docs, history |
| Lifecycle | Write once, iterate | Evolves with every conversation turn |
| Debugging | Read the prompt | Inspect the full assembled context |
| Cost driver | Prompt length (small, fixed) | Retrieved context volume (variable, 5-50x prompt size) |
| Performance ceiling | ~15-20% improvement from prompt rewrites | ~40-60% improvement from context pipeline changes |
As deepset puts it: "Performance gains increasingly come not from better models, but from smarter context." You could swap from GPT-4o to Claude Opus and see a 10% improvement. Or you could fix your context pipeline and see a 50% improvement on the same model.
Here's the counterintuitive part that most teams learn the hard way: more context often makes agents worse, not better. "Needle in a Haystack" evaluations and subsequent research show that recall accuracy drops as context fills up -- even on models that technically support the full window. A model at 20% context utilization will reliably retrieve a specific fact. At 80% utilization, that same fact can get lost -- especially information buried in the middle third of the window (the "lost in the middle" effect documented by Liu et al., 2023). A well-curated 30K-token context will outperform a carelessly assembled 120K-token context every time. Context engineering is as much about what you exclude as what you include.
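A cheap way to operationalize this is a utilization guard that flags context assemblies drifting into the danger zone. The 70% threshold below is illustrative -- recall degradation is gradual, not a cliff -- but warning early leaves headroom before context rot bites:

```python
# Flag context assemblies that are approaching the recall danger zone.
# warn_at=0.7 is an assumed starting point, not a measured constant.
def check_utilization(used_tokens: int, window_tokens: int,
                      warn_at: float = 0.7) -> str:
    ratio = used_tokens / window_tokens
    if ratio >= warn_at:
        return (f"WARNING: {ratio:.0%} of the window used -- "
                "recall of mid-context facts may degrade")
    return f"OK: {ratio:.0%} of the window used"

print(check_utilization(30_000, 128_000))   # well-curated context
print(check_utilization(110_000, 128_000))  # carelessly assembled context
```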
The Five Layers
A production context pipeline has five layers, each assembled dynamically at runtime: system instructions at the foundation, then retrieved knowledge, persistent memory, conversation history, and tool definitions. The order matters -- models pay more attention to the beginning and end of context windows.
Each layer has its own engineering challenges. Let's build them.
Layer 1: System Instructions
System instructions should be the smallest set of high-signal tokens that fully define your agent's behavior. Anthropic's team advises aiming for "minimal information that fully outlines expected behavior" -- specific enough to guide decisions, flexible enough to handle edge cases.
The mistake most teams make is treating the system prompt like a feature specification. They stuff in every edge case, every policy rule, every exception. A 5,000-token system prompt might sound thorough, but you've just consumed 5% of a 100K context window before the conversation even starts.
TypeScript -- Structured System Prompt Builder:
import { z } from "zod";
// Zod validates config at runtime -- catches missing fields before they silently degrade output
const SystemPromptConfig = z.object({
role: z.string(),
constraints: z.array(z.string()),
personality: z.string().optional(),
escalationRules: z.array(z.string()).optional(),
outputFormat: z.string().optional(),
});
type SystemPromptConfig = z.infer<typeof SystemPromptConfig>;
function buildSystemPrompt(
config: SystemPromptConfig,
injectedContext: { memories?: string; knowledge?: string }
): string {
const sections: string[] = [];
// Role first: models weight the beginning of context most heavily
sections.push(`<role>\n${config.role}\n</role>`);
// Constraints next: behavioral boundaries the model must not cross
if (config.constraints.length > 0) {
sections.push(
`<constraints>\n${config.constraints.map((c) => `- ${c}`).join("\n")}\n</constraints>`
);
}
// Memories injected per-request -- this section doesn't exist at design time
if (injectedContext.memories) {
sections.push(
`<customer_context>\nRelevant information from previous interactions:\n${injectedContext.memories}\n</customer_context>`
);
}
// RAG results injected per-request -- grounds the model in facts, prevents hallucination
if (injectedContext.knowledge) {
sections.push(
`<knowledge>\nRelevant documentation:\n${injectedContext.knowledge}\n</knowledge>`
);
}
// Escalation rules: safety net for situations the agent shouldn't handle alone
if (config.escalationRules?.length) {
sections.push(
`<escalation>\n${config.escalationRules.map((r) => `- ${r}`).join("\n")}\n</escalation>`
);
}
// XML sections help the model attend to specific blocks vs. parsing a wall of text
return sections.join("\n\n");
}
// Usage
const systemPrompt = buildSystemPrompt(
{
role: "You are a customer support agent for Acme Corp. You help customers with billing questions, account issues, and product information.",
constraints: [
"Never reveal internal pricing formulas or discount logic",
"Always verify customer identity before accessing account details",
"If you cannot resolve an issue in 3 attempts, offer to escalate to a human agent",
],
escalationRules: [
"Escalate immediately if customer mentions legal action",
"Escalate if customer sentiment is consistently negative for 3+ turns",
],
},
{
memories: "Customer prefers email communication. Has been a subscriber since 2024. Previously had a billing dispute resolved in their favor.",
knowledge: "Current promotion: 20% off annual plans through March 31. Refund policy: full refund within 30 days, prorated after.",
}
);
Python -- Structured System Prompt Builder:
from dataclasses import dataclass, field
@dataclass
class SystemPromptConfig:
role: str
constraints: list[str] = field(default_factory=list)
personality: str | None = None
escalation_rules: list[str] = field(default_factory=list)
output_format: str | None = None
def build_system_prompt(
config: SystemPromptConfig,
memories: str | None = None,
knowledge: str | None = None,
) -> str:
sections = []
# Role first: models weight the beginning of context most heavily
sections.append(f"<role>\n{config.role}\n</role>")
# Hard behavioral boundaries the model must not cross
if config.constraints:
items = "\n".join(f"- {c}" for c in config.constraints)
sections.append(f"<constraints>\n{items}\n</constraints>")
# Per-request injection -- doesn't exist at design time, populated from memory store
if memories:
sections.append(
f"<customer_context>\nRelevant information from previous interactions:\n{memories}\n</customer_context>"
)
# Per-request injection -- RAG results ground the model in facts
if knowledge:
sections.append(
f"<knowledge>\nRelevant documentation:\n{knowledge}\n</knowledge>"
)
# Safety net for situations the agent shouldn't handle alone
if config.escalation_rules:
items = "\n".join(f"- {r}" for r in config.escalation_rules)
sections.append(f"<escalation>\n{items}\n</escalation>")
return "\n\n".join(sections)
# Usage
config = SystemPromptConfig(
role="You are a customer support agent for Acme Corp.",
constraints=[
"Never reveal internal pricing formulas",
"Verify customer identity before accessing account details",
],
escalation_rules=[
"Escalate immediately if customer mentions legal action",
],
)
system_prompt = build_system_prompt(
config,
memories="Customer prefers email. Subscriber since 2024.",
knowledge="Current promotion: 20% off annual plans through March 31.",
)
The key insight is that this system prompt is assembled at runtime. The memories and knowledge sections don't exist at design time -- they're populated per-request based on who the customer is and what they're asking about. That's context engineering. That's why your carefully crafted static prompt wasn't enough -- it had no mechanism to inject what it needed to know right now.
Layer 2: Knowledge Retrieval
Before each LLM call, retrieve the 3-5 most relevant knowledge base documents using the current message as a query. This grounds the model in facts and prevents hallucination. (New to RAG? Start with our RAG from Scratch tutorial and come back here.)
The challenge isn't retrieval -- it's deciding how much to retrieve. A naive approach dumps 10 chunks into the context. That's 2,000-5,000 tokens that might be 80% irrelevant.
TypeScript -- Token-Budgeted RAG Retrieval:
interface RetrievedChunk {
content: string;
score: number;
source: string;
tokenCount: number;
}
interface TokenBudget {
knowledge: number; // Hard ceiling -- RAG cannot exceed this
memory: number; // Hard ceiling -- memory injection
history: number; // Hard ceiling -- conversation history
tools: number; // Hard ceiling -- tool schemas
system: number; // Hard ceiling -- system prompt
}
// Default per-layer ceilings -- tune these toward the history-heavy split described below
const DEFAULT_BUDGET: TokenBudget = {
system: 2000,
knowledge: 3000,
memory: 1000,
history: 8000,
tools: 4000,
};
async function retrieveWithBudget(
query: string,
budget: number,
retriever: { search(query: string, topK: number): Promise<RetrievedChunk[]> }
): Promise<string> {
// Over-fetch, then trim -- cheaper than multiple retrieval calls
const chunks = await retriever.search(query, 10);
// 0.7 threshold: below this, chunks inject noise that dilutes attention
const relevant = chunks.filter((c) => c.score >= 0.7);
// Greedy packing: highest-score chunks first until budget exhausted
const selected: RetrievedChunk[] = [];
let tokenCount = 0;
for (const chunk of relevant) {
if (tokenCount + chunk.tokenCount > budget) break;
selected.push(chunk);
tokenCount += chunk.tokenCount;
}
if (selected.length === 0) return "";
// Source attribution helps the model cite its reasoning
return selected
.map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.content}`)
.join("\n\n---\n\n");
}
Python -- Token-Budgeted RAG Retrieval:
from dataclasses import dataclass
@dataclass
class RetrievedChunk:
content: str
score: float
source: str
token_count: int
# Hard ceilings per layer -- prevents any single layer from starving the others
DEFAULT_BUDGET = {
"system": 2000,
"knowledge": 3000,
"memory": 1000,
"history": 8000,
"tools": 4000,
}
async def retrieve_with_budget(
query: str,
budget: int,
retriever, # has async search(query, top_k) -> list[RetrievedChunk]
) -> str:
# Over-fetch, then trim -- cheaper than multiple retrieval calls
chunks = await retriever.search(query, top_k=10)
# 0.7 threshold: below this, chunks inject noise that dilutes attention
relevant = [c for c in chunks if c.score >= 0.7]
# Greedy packing: highest-score chunks first until budget exhausted
selected = []
token_count = 0
for chunk in relevant:
if token_count + chunk.token_count > budget:
break
selected.append(chunk)
token_count += chunk.token_count
if not selected:
return ""
sections = []
for i, c in enumerate(selected, 1):
sections.append(f"[Source {i}: {c.source}]\n{c.content}")
    return "\n\n---\n\n".join(sections)
The token budget forces a critical design decision: how do you allocate your context window? In production, we've found a roughly 60/20/20 split works well for customer-facing agents -- 60% for conversation history (the customer expects you to remember what they said), 20% for knowledge and memory, 20% for system prompt and tools.
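One way to make that split explicit is to derive per-layer ceilings from a single total budget instead of hand-picking constants. The ratios below follow the history-heavy split suggested above; treat them as a tuning starting point, not fixed values:

```python
# Derive per-layer token ceilings from a total context budget and a ratio map.
# Ratios must sum to 1; rounding keeps the allocations as whole token counts.
def allocate_budget(total_tokens: int,
                    ratios: dict[str, float]) -> dict[str, int]:
    assert abs(sum(ratios.values()) - 1.0) < 1e-9, "ratios must sum to 1"
    return {layer: round(total_tokens * share) for layer, share in ratios.items()}

budget = allocate_budget(18_000, {
    "history": 0.60,                    # the customer expects you to remember
    "knowledge": 0.12, "memory": 0.08,  # retrieval layers share ~20%
    "system": 0.10, "tools": 0.10,      # fixed overhead shares ~20%
})
print(budget)
```

Changing one number (the total) then rescales every layer consistently, which is handy when you move an agent between models with different window sizes.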
Layer 3: Memory Injection
Memory injection transforms a stateless chatbot into something that knows you. Before each LLM call, the system performs a semantic search against the user's stored memories and injects relevant hits into the system prompt.
Without memory, every conversation starts from zero. The customer re-explains their situation, preferences, history. With memory injection, the agent already knows this customer prefers email, has been a subscriber for two years, and had a billing dispute last month.
TypeScript -- Memory Injection Service:
interface Memory {
id: string;
content: string;
category: "preference" | "fact" | "interaction" | "feedback";
createdAt: Date;
embedding: number[];
}
interface MemorySearchResult {
memory: Memory;
score: number;
}
class MemoryInjector {
private minScore = 0.3; // Lower than RAG (0.7) because memories are broader signals
private maxMemories = 10; // Cap prevents flooding context with stale facts
private tokenBudget = 1000;
async injectMemories(
customerId: string,
currentMessage: string,
memoryStore: {
search(
customerId: string,
query: string,
options: { minScore: number; limit: number }
): Promise<MemorySearchResult[]>;
}
): Promise<string> {
// The current message IS the search query -- "billing address" retrieves billing-related memories
const results = await memoryStore.search(customerId, currentMessage, {
minScore: this.minScore,
limit: this.maxMemories,
});
if (results.length === 0) return ""; // No memories = no tokens wasted
// Grouping by category creates structure the model can attend to selectively
const grouped = new Map<string, string[]>();
for (const { memory } of results) {
const items = grouped.get(memory.category) || [];
items.push(memory.content);
grouped.set(memory.category, items);
}
const sections: string[] = [];
if (grouped.has("preference")) {
sections.push(
`Customer preferences:\n${grouped.get("preference")!.map((m) => `- ${m}`).join("\n")}`
);
}
if (grouped.has("fact")) {
sections.push(
`Known facts:\n${grouped.get("fact")!.map((m) => `- ${m}`).join("\n")}`
);
}
if (grouped.has("interaction")) {
sections.push(
`Previous interactions:\n${grouped.get("interaction")!.map((m) => `- ${m}`).join("\n")}`
);
}
return sections.join("\n\n");
}
}
Python -- Memory Injection Service:
from dataclasses import dataclass
from collections import defaultdict
@dataclass
class Memory:
id: str
content: str
category: str # "preference" | "fact" | "interaction" | "feedback"
created_at: str
embedding: list[float]
@dataclass
class MemorySearchResult:
memory: Memory
score: float
class MemoryInjector:
def __init__(self, min_score: float = 0.3, max_memories: int = 10):
self.min_score = min_score
self.max_memories = max_memories
async def inject_memories(
self,
customer_id: str,
current_message: str,
memory_store, # has async search(customer_id, query, min_score, limit)
) -> str:
results = await memory_store.search(
customer_id,
current_message,
min_score=self.min_score,
limit=self.max_memories,
)
if not results:
return ""
# Group by category
grouped: dict[str, list[str]] = defaultdict(list)
for result in results:
grouped[result.memory.category].append(result.memory.content)
sections = []
if "preference" in grouped:
items = "\n".join(f"- {m}" for m in grouped["preference"])
sections.append(f"Customer preferences:\n{items}")
if "fact" in grouped:
items = "\n".join(f"- {m}" for m in grouped["fact"])
sections.append(f"Known facts:\n{items}")
if "interaction" in grouped:
items = "\n".join(f"- {m}" for m in grouped["interaction"])
sections.append(f"Previous interactions:\n{items}")
        return "\n\n".join(sections)
The minScore: 0.3 threshold is important. Too high and you miss relevant memories. Too low and you inject noise. In production, 0.3 is a good starting point for cosine similarity with modern embedding models -- it catches memories that are topically related without flooding the context with tangential facts.
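Here's the score that threshold applies to, computed from scratch. The 3-dimensional vectors are toy embeddings invented for illustration -- real embedding models produce hundreds to thousands of dimensions, but the math is identical:

```python
import math

# Cosine similarity between two embedding vectors: dot product over the
# product of magnitudes. Scores near 1.0 mean topically close; near 0, unrelated.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.9, 0.1, 0.2]  # toy embedding of the customer's current message
memories = {
    "prefers email over phone": [0.8, 0.2, 0.1],    # topically close
    "owns a red bicycle": [0.05, 0.05, 0.99],       # tangential
}
for content, embedding in memories.items():
    score = cosine_similarity(query, embedding)
    verdict = "inject" if score >= 0.3 else "skip"
    print(f"{score:.2f} {verdict}: {content}")
```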
Memory injection feeds directly into the system prompt builder from Layer 1. The <customer_context> section is populated by whatever the memory injector returns. If there are no relevant memories, that section is simply omitted -- no wasted tokens.
This is the layer that directly fixes the problem we opened with. Your agent "forgot" what the customer said three turns ago? That's a history problem. But your agent doesn't know the customer at all -- their preferences, their plan, their last dispute? That's a memory problem. And no amount of prompt engineering can solve it, because the information simply isn't there.
Layer 4: History Compression
When conversations exceed a token threshold, older messages get summarized while recent ones stay verbatim. Anthropic calls this compaction: "summarizing conversation nearing context window limits, reinitializing new windows with summaries." It reduces token usage by 70-90%.
A 50-turn conversation can easily hit 20,000 tokens. Without compression, you're burning budget on "Hi, how can I help you?" With compression, those 50 turns become a 500-token summary plus the last 5-10 turns verbatim.
TypeScript -- History Compressor:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
interface Message {
role: "user" | "assistant" | "system";
content: string;
}
interface CompressedHistory {
summary: string | null;
recentMessages: Message[];
originalCount: number;
compressedTokens: number;
}
class HistoryCompressor {
private maxTokens: number;
private recentTurnCount: number;
// gpt-4o-mini for summaries: 16x cheaper than gpt-4o, good enough for compression
private summaryModel = openai("gpt-4o-mini");
constructor(maxTokens = 8000, recentTurnCount = 10) {
this.maxTokens = maxTokens;
this.recentTurnCount = recentTurnCount;
}
async compress(messages: Message[]): Promise<CompressedHistory> {
const estimatedTokens = this.estimateTokens(messages);
// No compression needed
if (estimatedTokens <= this.maxTokens) {
return {
summary: null,
recentMessages: messages,
originalCount: messages.length,
compressedTokens: estimatedTokens,
};
}
// Split point: recent turns stay verbatim (customer expects you to remember what was just said)
const recentMessages = messages.slice(-this.recentTurnCount * 2); // *2 for user+assistant pairs
const olderMessages = messages.slice(0, -this.recentTurnCount * 2);
if (olderMessages.length === 0) {
return {
summary: null,
recentMessages: messages,
originalCount: messages.length,
compressedTokens: estimatedTokens,
};
}
// Summarize older messages
const conversationText = olderMessages
.map((m) => `${m.role}: ${m.content}`)
.join("\n");
const { text: summary } = await generateText({
model: this.summaryModel,
system: `You are a conversation summarizer. Extract key facts, decisions,
and unresolved issues from the conversation. Be concise but preserve all
actionable information. Format as bullet points.`,
prompt: `Summarize this conversation:\n\n${conversationText}`,
});
return {
summary,
recentMessages,
originalCount: messages.length,
      compressedTokens: this.estimateTokens(recentMessages) + Math.ceil(summary.length / 4),
};
}
private estimateTokens(messages: Message[]): number {
// Rough estimate: 1 token ~= 4 chars for English text.
// For production, use tiktoken-node or the AI SDK's token counting.
// This heuristic overestimates code/JSON by ~20% and underestimates CJK by ~40%.
return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
}
Python -- History Compressor:
from openai import AsyncOpenAI
from dataclasses import dataclass
@dataclass
class CompressedHistory:
summary: str | None
recent_messages: list[dict]
original_count: int
compressed_tokens: int
class HistoryCompressor:
def __init__(
self,
max_tokens: int = 8000,
recent_turn_count: int = 10,
):
self.max_tokens = max_tokens
self.recent_turn_count = recent_turn_count
self.client = AsyncOpenAI()
async def compress(self, messages: list[dict]) -> CompressedHistory:
estimated = self._estimate_tokens(messages)
if estimated <= self.max_tokens:
return CompressedHistory(
summary=None,
recent_messages=messages,
original_count=len(messages),
compressed_tokens=estimated,
)
# Split into older (summarize) and recent (keep verbatim)
split_idx = -self.recent_turn_count * 2
recent = messages[split_idx:]
older = messages[:split_idx]
if not older:
return CompressedHistory(
summary=None,
recent_messages=messages,
original_count=len(messages),
compressed_tokens=estimated,
)
conversation_text = "\n".join(
f"{m['role']}: {m['content']}" for m in older
)
response = await self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"You are a conversation summarizer. Extract key facts, "
"decisions, and unresolved issues. Be concise but preserve "
"all actionable information. Format as bullet points."
),
},
{"role": "user", "content": f"Summarize:\n\n{conversation_text}"},
],
)
summary = response.choices[0].message.content
return CompressedHistory(
summary=summary,
recent_messages=recent,
original_count=len(messages),
compressed_tokens=(
self._estimate_tokens(recent) + len(summary or "") // 4
),
)
def _estimate_tokens(self, messages: list[dict]) -> int:
# Fast approximation. For production accuracy, use:
# import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o")
# return sum(len(enc.encode(m.get("content", ""))) for m in messages)
        return sum(len(m.get("content", "")) // 4 for m in messages)
Two production details worth calling out. First, we use gpt-4o-mini for summarization -- 16x cheaper than gpt-4o. At 10,000 conversations per day, compression costs roughly $20/day versus $320/day for naively sending full history to a frontier model. The compression pays for itself by reducing input tokens on every subsequent turn. Second, we keep recent turns verbatim. Summarizing the last message ("Customer asked about billing") when the customer is actively waiting for a billing answer would be jarring.
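The back-of-envelope math behind those figures, with prices stated as assumptions (USD per 1M input tokens, roughly gpt-4o vs. gpt-4o-mini list pricing at the time of writing -- plug in your provider's current rates):

```python
# Assumed prices per 1M input tokens -- verify against current provider pricing.
PRICE_PER_M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def daily_cost(conversations: int, tokens_each: int, model: str) -> float:
    return conversations * tokens_each * PRICE_PER_M[model] / 1_000_000

# Naive: resend ~13K tokens of raw history per conversation to the frontier model.
naive = daily_cost(10_000, 13_000, "gpt-4o")
# Compressed: run a cheap mini-model summarization pass over the same history.
summarize = daily_cost(10_000, 13_000, "gpt-4o-mini")
print(f"naive: ${naive:.0f}/day, summarization pass: ${summarize:.0f}/day")
```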
Layer 5: Tool Resolution
Tool definitions should be resolved per-request, not hardcoded. An agent with 30 tools wastes thousands of tokens on irrelevant schemas every call. This is where MCP (Model Context Protocol) transforms context engineering: instead of static tool lists, MCP servers advertise capabilities at runtime, and the context engine includes only what's relevant.
Anthropic's team warns: "If engineers cannot definitively choose which tool applies, agents cannot either."
TypeScript -- Dynamic Tool Resolution:
interface ToolDefinition {
name: string;
description: string;
inputSchema: Record<string, unknown>;
tokenCost: number; // Estimated tokens for this tool's schema
}
interface ToolSet {
id: string;
name: string;
tools: ToolDefinition[];
}
class ToolResolver {
private tokenBudget: number;
constructor(tokenBudget = 4000) {
this.tokenBudget = tokenBudget;
}
async resolveTools(
agentConfig: {
toolsetIds: string[];
toolIds: string[];
},
toolsetStore: { get(id: string): Promise<ToolSet> },
conversationContext?: string // Current conversation topic for relevance filtering
): Promise<ToolDefinition[]> {
const allTools: ToolDefinition[] = [];
// Toolsets group related tools (e.g., "Billing", "CRM") -- load only what's relevant
for (const tsId of agentConfig.toolsetIds) {
const toolset = await toolsetStore.get(tsId);
allTools.push(...toolset.tools);
}
// Fallback: if no toolsets configured, resolve individual tool IDs
if (agentConfig.toolsetIds.length === 0 && agentConfig.toolIds.length > 0) {
// Agent-scoped resolution -- implementation depends on your registry
}
// Hard budget enforcement: surplus tools get dropped, not squeezed in
let tokenCount = 0;
const selected: ToolDefinition[] = [];
for (const tool of allTools) {
if (tokenCount + tool.tokenCost > this.tokenBudget) {
console.warn(
`Tool budget exceeded. Included ${selected.length}/${allTools.length} tools.`
);
break;
}
selected.push(tool);
tokenCount += tool.tokenCost;
}
return selected;
}
}
Python -- Dynamic Tool Resolution:
from dataclasses import dataclass
@dataclass
class ToolDefinition:
name: str
description: str
input_schema: dict
token_cost: int # Estimated tokens for schema
@dataclass
class ToolSet:
id: str
name: str
tools: list[ToolDefinition]
class ToolResolver:
def __init__(self, token_budget: int = 4000):
self.token_budget = token_budget
async def resolve_tools(
self,
toolset_ids: list[str],
tool_ids: list[str],
toolset_store, # has async get(id) -> ToolSet
) -> list[ToolDefinition]:
all_tools: list[ToolDefinition] = []
# Resolve from toolsets (MCP-style grouped tools)
for ts_id in toolset_ids:
toolset = await toolset_store.get(ts_id)
all_tools.extend(toolset.tools)
# Agent-scoped fallback
if not toolset_ids and tool_ids:
pass # Resolve individual tools from registry
# Budget enforcement
token_count = 0
selected: list[ToolDefinition] = []
for tool in all_tools:
if token_count + tool.token_cost > self.token_budget:
print(
f"Tool budget exceeded. "
f"Included {len(selected)}/{len(all_tools)} tools."
)
break
selected.append(tool)
token_count += tool.token_cost
        return selected
Rather than one giant bag of tools, agents organize tools into logical groups (a "Customer Management" toolset, a "Billing" toolset). The context engine creates separate MCP connections per toolset and merges the results. This is why the agent in our opening scenario "ignored half its tools" -- they were all dumped in with no relevance filtering, and the model's attention was spread too thin to use them effectively.
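The resolver above leaves its conversation-context filter open. Here's one minimal way to fill that gap with keyword matching -- a deliberately simple sketch whose tool names and scoring are invented for illustration; production systems typically use embedding similarity instead:

```python
# Keep only tools whose descriptions overlap with the live conversation,
# plus a safety-critical set that is never filtered out.
def filter_tools_by_relevance(
    tools: list[dict],
    conversation: str,
    always_keep: frozenset[str] = frozenset({"escalate_to_human"}),
) -> list[dict]:
    text = conversation.lower()
    selected = []
    for tool in tools:
        keywords = tool["description"].lower().split()
        if tool["name"] in always_keep or any(k in text for k in keywords):
            selected.append(tool)
    return selected

tools = [
    {"name": "issue_refund", "description": "refund billing charge"},
    {"name": "update_shipping", "description": "shipping address delivery"},
    {"name": "escalate_to_human", "description": "escalate to human agent"},
]
relevant = filter_tools_by_relevance(
    tools, "I was double charged on my billing statement"
)
print([t["name"] for t in relevant])
```

Even this crude filter drops the shipping tool from a billing conversation, reclaiming its schema tokens for layers that need them.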
Assembling the Engine
Now we wire all five layers into one engine. This is the core of the system -- the thing that runs on every single inference call.
TypeScript -- Full Context Engine:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
interface ContextEngineConfig {
model: Parameters<typeof openai>[0];
maxContextTokens: number;
budget: TokenBudget;
}
class ContextEngine {
private systemPromptBuilder: typeof buildSystemPrompt;
private memoryInjector: MemoryInjector;
private historyCompressor: HistoryCompressor;
private toolResolver: ToolResolver;
private config: ContextEngineConfig;
constructor(config: ContextEngineConfig) {
this.config = config;
this.memoryInjector = new MemoryInjector();
this.historyCompressor = new HistoryCompressor(config.budget.history);
this.toolResolver = new ToolResolver(config.budget.tools);
this.systemPromptBuilder = buildSystemPrompt;
}
async processMessage(
message: string,
context: {
customerId: string;
agentConfig: SystemPromptConfig;
toolsetIds: string[];
conversationHistory: Message[];
// Injected dependencies
memoryStore: any;
knowledgeRetriever: any;
toolsetStore: any;
}
) {
// 1. Parallel retrieval saves 200-300ms vs. sequential -- critical for voice agents
const [memories, knowledge, tools] = await Promise.all([
this.memoryInjector.injectMemories(
context.customerId,
message,
context.memoryStore
),
retrieveWithBudget(
message,
this.config.budget.knowledge,
context.knowledgeRetriever
),
this.toolResolver.resolveTools(
{ toolsetIds: context.toolsetIds, toolIds: [] },
context.toolsetStore
),
]);
// 2. Compress history if needed
const history = await this.historyCompressor.compress(
context.conversationHistory
);
// 3. Build system prompt with injected context
const systemPrompt = this.systemPromptBuilder(context.agentConfig, {
memories,
knowledge,
});
// 4. Assemble messages array
const messages: Message[] = [];
// Add compressed history summary if it exists
if (history.summary) {
messages.push({
role: "system",
content: `Previous conversation summary:\n${history.summary}`,
});
}
// Add recent messages
messages.push(...history.recentMessages);
// Add current message
messages.push({ role: "user", content: message });
// 5. maxSteps: 10 allows multi-turn tool use (call tool, read result, call another)
const result = await generateText({
model: openai(this.config.model),
system: systemPrompt,
messages,
tools: this.convertTools(tools),
maxSteps: 10,
});
return {
response: result.text,
toolCalls: result.steps.flatMap((s) => s.toolCalls),
tokenUsage: result.usage,
};
}
private convertTools(tools: ToolDefinition[]) {
// Convert to AI SDK tool format
const converted: Record<string, any> = {};
for (const tool of tools) {
converted[tool.name] = {
description: tool.description,
parameters: tool.inputSchema,
};
}
return converted;
}
}
// Usage
const engine = new ContextEngine({
model: "gpt-4o",
maxContextTokens: 128000,
budget: DEFAULT_BUDGET,
});

Python -- Full Context Engine:
import asyncio
from openai import AsyncOpenAI
from dataclasses import dataclass
@dataclass
class ContextEngineConfig:
model: str = "gpt-4o"
max_context_tokens: int = 128000
budget: dict | None = None
def __post_init__(self):
if self.budget is None:
self.budget = DEFAULT_BUDGET
class ContextEngine:
def __init__(self, config: ContextEngineConfig):
self.config = config
self.client = AsyncOpenAI()
self.memory_injector = MemoryInjector()
self.history_compressor = HistoryCompressor(
max_tokens=config.budget["history"]
)
self.tool_resolver = ToolResolver(
token_budget=config.budget["tools"]
)
async def process_message(
self,
message: str,
customer_id: str,
agent_config: SystemPromptConfig,
toolset_ids: list[str],
conversation_history: list[dict],
memory_store,
knowledge_retriever,
toolset_store,
) -> dict:
# 1. Parallel retrieval saves 200-300ms vs. sequential -- critical for voice agents
memories, knowledge, tools = await asyncio.gather(
self.memory_injector.inject_memories(
customer_id, message, memory_store
),
retrieve_with_budget(
message, self.config.budget["knowledge"], knowledge_retriever
),
self.tool_resolver.resolve_tools(
toolset_ids, [], toolset_store
),
)
# 2. Compress history
history = await self.history_compressor.compress(
conversation_history
)
# 3. Build system prompt
system_prompt = build_system_prompt(
agent_config, memories=memories, knowledge=knowledge
)
# 4. Assemble messages
messages = []
if history.summary:
messages.append({
"role": "system",
"content": f"Previous conversation summary:\n{history.summary}",
})
messages.extend(history.recent_messages)
messages.append({"role": "user", "content": message})
# 5. Call LLM
tool_defs = [
{
"type": "function",
"function": {
"name": t.name,
"description": t.description,
"parameters": t.input_schema,
},
}
for t in tools
]
response = await self.client.chat.completions.create(
model=self.config.model,
messages=[
{"role": "system", "content": system_prompt},
*messages,
],
tools=tool_defs if tool_defs else None,
)
return {
"response": response.choices[0].message.content,
"usage": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
},
}

The Promise.all / asyncio.gather pattern matters more than most teams expect. Memory search, knowledge retrieval, and tool resolution each take 50-150ms. Sequentially: 150-450ms before the LLM even starts. In parallel: 80-150ms total. On a voice agent, that 200-300ms savings is the difference between a natural response and an awkward pause.
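A toy demonstration of the latency math, using `asyncio.sleep` to stand in for the three retrieval calls:

```python
import asyncio
import time

# Three fake retrieval calls with ~100ms latency each. Run concurrently
# with gather, total wall time stays near 0.1s instead of ~0.3s sequential.
async def fake_lookup(delay: float) -> str:
    await asyncio.sleep(delay)
    return "result"

async def main() -> float:
    start = time.perf_counter()
    await asyncio.gather(fake_lookup(0.1), fake_lookup(0.1), fake_lookup(0.1))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"parallel elapsed: {elapsed:.2f}s")  # ~0.1s, not ~0.3s
```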
Just-In-Time vs. Upfront Retrieval
The most important architecture decision in context engineering is when to load information: upfront (before the conversation starts) or just-in-time (when the agent needs it).
Anthropic describes the just-in-time pattern as one that "mirrors human cognition by using external organization systems." Instead of memorizing an entire codebase, you remember where to look. The agent maintains lightweight identifiers -- file paths, search queries, tool names -- and dynamically loads full content at runtime. In practice, production systems use a hybrid:
| Strategy | Load When | Tokens Used | Best For |
|---|---|---|---|
| Upfront | Before first message | Fixed per session | Agent config, system prompt, customer profile |
| Just-in-time | When needed during conversation | Variable per turn | Knowledge base, tool results, detailed records |
| Cached | First access, then reuse | Amortized | Frequently accessed documents, pricing tables |
| Compressed | After threshold exceeded | Decreases over time | Conversation history, earlier tool results |
Claude Code is a good example of this hybrid pattern. It loads CLAUDE.md files upfront (your project context), but uses glob and grep tools for just-in-time retrieval of specific code files. The model decides what to look up based on what the conversation needs.
For customer-facing agents, we typically load upfront: the agent configuration, system prompt template, and customer profile. Everything else -- knowledge base results, tool outputs, detailed account records -- is loaded just-in-time as the conversation evolves.
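A minimal sketch of that split, assuming hypothetical `load_profile` and `search_kb` callables: profile data is loaded upfront once per session, knowledge is fetched just-in-time and cached on first access.

```python
# Hybrid loading sketch: upfront data is paid for once per session;
# just-in-time lookups run on demand and are cached for repeat access.
class HybridContextLoader:
    def __init__(self, load_profile, search_kb):
        self.search_kb = search_kb
        self.upfront = load_profile()     # fixed cost, once per session
        self.cache: dict[str, str] = {}

    def lookup(self, query: str) -> str:
        if query not in self.cache:       # just-in-time, then cached
            self.cache[query] = self.search_kb(query)
        return self.cache[query]

calls = []
loader = HybridContextLoader(
    load_profile=lambda: {"customer": "Acme", "plan": "pro"},
    search_kb=lambda q: calls.append(q) or f"docs for {q}",
)
loader.lookup("refund policy")
loader.lookup("refund policy")  # second call served from cache
```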
Sub-Agents for Complex Tasks
When a single context window isn't enough, sub-agents handle focused sub-tasks and return condensed summaries -- typically "1,000-2,000 tokens" each, per Anthropic. A research task might need 50 documents. A debugging task might need dozens of files. No single context window can hold all of that.
The pattern: an orchestrator dispatches sub-tasks. Each sub-agent gets a clean, focused context window. When finished, it returns a condensed summary. The orchestrator synthesizes the final result.
The key insight: each sub-agent has an optimized context window. The billing research agent doesn't need product documentation. The product lookup agent doesn't need account history. By splitting context, each agent performs better than a single agent would with everything crammed in.
The economics are counterintuitive. Three sub-agent calls at 10K tokens each cost the same as one call at 30K tokens -- but the sub-agents produce better results because each operates in the high-accuracy zone of context utilization. You pay the same and get better output.
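The orchestration loop can be sketched in a few lines. `run_subagent` here is a hypothetical stand-in for a real LLM call made with its own clean, focused context window:

```python
import asyncio

# Orchestrator sketch: each sub-agent works in isolation and returns a
# condensed summary; the orchestrator only ever sees those summaries.
async def run_subagent(task: str, context: str) -> str:
    await asyncio.sleep(0)  # stands in for an LLM call over `context`
    return f"summary({task})"

async def orchestrate(tasks: dict[str, str]) -> str:
    summaries = await asyncio.gather(
        *(run_subagent(name, ctx) for name, ctx in tasks.items())
    )
    # Final synthesis sees condensed summaries, not the raw documents
    return " | ".join(summaries)

result = asyncio.run(orchestrate({
    "billing": "billing records only",
    "product": "product docs only",
}))
```

The raw billing records and product docs never share a window; only the short summaries meet at synthesis time.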
Token Budgeting
Without explicit budgets, a long conversation history silently consumes your entire context window, leaving no room for the knowledge the agent actually needs. Here's a practical framework for a 128K token window:
| Layer | Budget | Tokens | Notes |
|---|---|---|---|
| System prompt | 2% | 2,560 | Role, constraints, escalation rules |
| Injected memories | 1% | 1,280 | 10-15 memories max |
| Retrieved knowledge | 3% | 3,840 | 3-5 RAG chunks |
| Tool definitions | 4% | 5,120 | 15-20 tools with schemas |
| Conversation history | 15% | 19,200 | Compressed if needed |
| Current turn | 5% | 6,400 | User message + immediate context |
| Reserved for output | 10% | 12,800 | Response generation |
| Safety margin | 60% | 76,800 | Breathing room for attention quality |
That 60% safety margin looks wasteful. It's not. Anthropic notes that LLMs have "an attention budget that they draw on when parsing large volumes of context." Using 40% of a 128K window gives better results than using 90%. And the cost angle seals it: at GPT-4o's pricing ($2.50/1M input tokens), sending 115K tokens per call instead of 50K costs an extra ~$0.16 per call -- roughly $16,000 per day at 100K daily requests -- for worse quality. You're paying more for degraded performance.
The goal, as Anthropic puts it, is to "find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome."
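One way to make those percentages enforceable, as a sketch: compute hard per-layer limits from the window size, with everything left over falling to the safety margin. The percentages mirror the table above.

```python
# Per-layer budget allocator. Percentages match the 128K framework above;
# the safety margin is whatever remains after every layer is funded.
BUDGET_PCT = {
    "system": 0.02, "memories": 0.01, "knowledge": 0.03,
    "tools": 0.04, "history": 0.15, "current_turn": 0.05, "output": 0.10,
}

def allocate_budget(window: int) -> dict[str, int]:
    budgets = {layer: int(window * pct) for layer, pct in BUDGET_PCT.items()}
    budgets["safety_margin"] = window - sum(budgets.values())
    return budgets

b = allocate_budget(128_000)
# b["history"] == 19200, b["safety_margin"] == 76800
```

Because the margin is computed as a remainder, the allocator also works unchanged for a 200K or 1M window.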
Monitoring
You can't improve what you can't measure. LangChain's 2026 survey found that 89% of organizations with production agents have implemented some form of observability, and 62% have detailed tracing down to individual tool calls. The teams that skip this step are the ones stuck debugging context pipelines blind.
Track token usage per layer, retrieval relevance scores, compression ratios, and which tools actually get called. You need observability into what's actually happening inside that context window:
What to track:
| Metric | Why It Matters | Target |
|---|---|---|
| Tokens per layer | Identify budget violations | Within 10% of budget |
| RAG relevance scores | Are you retrieving useful knowledge? | Mean score > 0.75 |
| Memory hit rate | Are injected memories being used? | > 60% utilization |
| Compression ratio | How much history are you losing? | 5:1 to 10:1 |
| Tool usage rate | Are defined tools actually called? | > 30% per session |
| Context window utilization | How full is the window? | 20-50% for best quality |
If your tool usage rate is below 10%, you're wasting tokens on tool definitions the model never uses. If your RAG relevance scores are below 0.5, your retrieval is injecting noise. If your context window utilization is consistently above 70%, you're likely seeing context rot -- the model starts forgetting earlier parts of the conversation.
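A minimal per-turn health check over those thresholds might look like this -- the token counts are assumed to come from whatever tokenizer you already use:

```python
# Flag the two failure modes called out above: context-rot risk from high
# window utilization, and noisy retrieval from low RAG relevance scores.
def context_health(layer_tokens: dict[str, int], window: int,
                   rag_scores: list[float]) -> list[str]:
    warnings = []
    utilization = sum(layer_tokens.values()) / window
    if utilization > 0.70:
        warnings.append(f"high utilization: {utilization:.0%} (context rot risk)")
    mean_score = sum(rag_scores) / len(rag_scores) if rag_scores else 0.0
    if mean_score < 0.5:
        warnings.append(f"low RAG relevance: {mean_score:.2f}")
    return warnings

w = context_health(
    {"system": 2500, "history": 95000, "knowledge": 4000},
    window=128_000,
    rag_scores=[0.42, 0.38],
)
```

Emit these warnings into the same trace as the LLM call so budget violations show up next to the turn that caused them.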
Common Pitfalls
The most common mistake is treating the context window like a database -- stuffing everything in and hoping the model finds what it needs.
Kitchen sink system prompt. You write a 6,000-token system prompt covering every edge case. The model ignores half of it. Fix: start under 1,000 tokens. Add rules only when you see the model fail without them.
Retrieving everything, filtering nothing. Your RAG pipeline returns 10 chunks per query, half marginally relevant. Fix: set a minimum relevance threshold (0.7 for knowledge, 0.3 for memories). Three highly relevant chunks beat ten mediocre ones.
Ignoring history growth. The conversation is 100 turns deep, 40,000 tokens of "Let me check on that for you." Fix: compress when history exceeds 8,000 tokens. Keep the last 10 turns verbatim, summarize the rest.
Static tool lists. All 30 tool schemas included on every request -- 4,000+ tokens -- even for a simple FAQ. Fix: resolve tools per-request. Use toolsets to load only relevant groups.
No budget enforcement. Each layer decides independently how much to include. On a bad day, the window overflows. Fix: hard token limits per layer. The context engine is the budget authority.
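The budget-enforcement fix can be as simple as a greedy cutoff. A sketch, with items as hypothetical `(text, token_count)` pairs ordered highest-priority first:

```python
# Hard per-layer enforcement: the engine, not each layer, decides what fits.
# Items must arrive sorted by priority (e.g. RAG relevance, memory recency).
def enforce_budget(items: list[tuple[str, int]], max_tokens: int) -> list[str]:
    kept, used = [], 0
    for text, tokens in items:
        if used + tokens > max_tokens:
            break                 # everything past the limit is dropped
        kept.append(text)
        used += tokens
    return kept

chunks = [("top chunk", 1500), ("second chunk", 1500), ("third chunk", 1500)]
kept = enforce_budget(chunks, max_tokens=3840)  # knowledge budget from the table
```

Run one of these per layer with the budgets from the allocator and the window can never overflow, no matter how much each retriever returns.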
What's Next
Remember the agent we opened with -- the one that hallucinates, forgets the customer, and ignores its tools? Every one of those failures traces back to a context problem. The hallucination happens because no knowledge was retrieved. The forgetting happens because history wasn't compressed and memories weren't injected. The ignored tools happen because 30 schemas were dumped in with no budget enforcement.
The fix was never a better prompt. It was a better pipeline.
Start simple: a structured system prompt, a single RAG source, basic history management. Add memory injection when you see the agent forgetting customer context. Add compression when conversations get long enough to hurt quality. Add sub-agents when tasks outgrow a single context window.
As models get larger context windows -- Claude Opus 4.6 now offers 1 million tokens -- the temptation will be to skip all of this. Don't. A well-engineered 50K-token context will outperform a carelessly assembled 500K-token context every time. The discipline of curating what goes in, and what stays out, is what separates agents that work in demos from agents that work in production.