We got our first real production bill: $13,247. For one agent.
It was a customer-service agent on Claude Sonnet. Nothing exotic. A 4,000-token system prompt, 8 tools for order lookups and CRM updates, and about 500 conversations a day. Each conversation averaged 6 turns with full history replay. The math was brutal, but only once we actually did it.
That bill forced us to learn every optimization technique available. Over three months, we cut that same agent's cost to $1,100/month. Same quality. Same tools. Same conversation depth. This article is the playbook.
In this article:
- The $13K Bill, Decomposed -- where the money actually goes
- Prompt Caching: 90% Input Savings -- the single biggest lever
- Model Routing: Right Model, Right Task -- 30-50% off the remaining cost
- Batch Processing: Half-Price Background Work -- 50% off non-real-time tasks
- Plan-and-Execute: Stop Re-Planning -- caching decisions, not just prompts
- Context Management: Smaller Windows -- pay for what matters
- The Full Stack: $13K to $1,100 -- all techniques combined
- Implementation Checklist -- week-by-week rollout
The $13K Bill, Decomposed
Before you optimize anything, you need to know where your money goes. Here's what our 500-conversation/day agent actually consumed:
| Component | Tokens per Conversation | Monthly Volume (15K convos) | Notes |
|---|---|---|---|
| System prompt | 4,000 input | 60M input | Resent every turn |
| Conversation history | 3,000 input (avg) | 45M input | Grows each turn |
| Tool schemas (8 tools) | 2,400 input | 36M input | Resent every turn |
| Tool call results | 1,200 input | 18M input | CRM/order data |
| Agent responses | 1,500 output | 22.5M output | The actual replies |
| Total | ~12,100 | 159M input + 22.5M output | -- |
On Claude Sonnet at $3/MTok input and $15/MTok output:
Input: 159M tokens x $3.00/MTok = $477
Output: 22.5M tokens x $15.00/MTok = $337.50
Wait. That's only $815. Where's the $13K?
Here's the part nobody warns you about: each conversation has multiple turns. Our 6-turn average means the system prompt, tool schemas, and growing history are resent on every single turn. The real math:
// Each turn resends: system prompt + tools + full history
// Turn 1: 4,000 + 2,400 + 0 history = 6,400 input
// Turn 2: 4,000 + 2,400 + 2,500 history = 8,900 input
// Turn 3: 4,000 + 2,400 + 5,000 history = 11,400 input
// Turn 4: 4,000 + 2,400 + 7,500 history = 13,900 input
// Turn 5: 4,000 + 2,400 + 10,000 history = 16,400 input
// Turn 6: 4,000 + 2,400 + 12,500 history = 18,900 input
// Total per conversation: ~75,900 input tokens + ~9,000 output
Now the real bill:
Monthly input: 75,900 x 15,000 convos = 1,138.5M tokens
Monthly output: 9,000 x 15,000 convos = 135M tokens
Input cost: 1,138.5M x $3.00/MTok = $3,415.50
Output cost: 135M x $15.00/MTok = $2,025.00
Total: $5,440.50/month
And if your agent uses multi-step tool calls (ours did, averaging 2.4 tool-use rounds per conversation), add another $4,000-$7,800 in LLM calls for tool reasoning. Our actual bill: $13,247.
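The replay math above is easy to reproduce with a small cost model. This is a sketch using the averages from the table (6,400 static tokens per turn, ~2,500 tokens of history added per completed turn), not exact per-conversation counts:

```typescript
// Sketch of the per-turn replay cost model. STATIC = system prompt +
// tool schemas; history grows ~2,500 tokens per completed turn.
const STATIC_INPUT = 6_400;
const HISTORY_PER_TURN = 2_500;
const OUTPUT_PER_CONVO = 9_000;

function conversationInputTokens(turns: number): number {
  let total = 0;
  for (let t = 0; t < turns; t++) {
    // Every turn resends the static content plus the full history so far
    total += STATIC_INPUT + t * HISTORY_PER_TURN;
  }
  return total;
}

function monthlyCost(convos: number, turns: number): number {
  const inputTokens = conversationInputTokens(turns) * convos;
  const outputTokens = OUTPUT_PER_CONVO * convos;
  // Claude Sonnet list prices: $3/MTok input, $15/MTok output
  return (inputTokens / 1e6) * 3 + (outputTokens / 1e6) * 15;
}

console.log(conversationInputTokens(6)); // 75,900 input tokens per conversation
console.log(monthlyCost(15_000, 6));     // 5440.5 -- before tool-use rounds
```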
The system prompt and tool schemas alone -- content that never changes between conversations -- accounted for over 40% of input tokens. That's where we started cutting.
Prompt Caching: 90% Input Savings
Prompt caching is the single most impactful optimization for production agents. It eliminates re-processing of static content you send on every request.
How it works
Your system prompt, tool schemas, and few-shot examples are cached after the first request. Subsequent requests read from cache at a fraction of the normal input price.
| Provider | Cache Write Cost | Cache Read Cost | Savings on Reads | Cache Duration |
|---|---|---|---|---|
| Anthropic | 1.25x input price | 0.1x input price | 90% | 5 minutes |
| Anthropic (extended) | 2x input price | 0.1x input price | 90% | 1 hour |
| OpenAI (GPT-4.1) | 1x (automatic) | 0.25x input price | 75% | Automatic |
| OpenAI (GPT-5) | 1x (automatic) | 0.1x input price | 90% | Automatic |
| Google (Gemini) | 1x (automatic) | 0.25x input price | 75% | Automatic |
For our agent, the static content per turn was 6,400 tokens (4,000 system prompt + 2,400 tool schemas). At 6 turns per conversation, that's 38,400 static tokens per conversation resent and reprocessed.
The math
Before caching (Claude Sonnet):
Static tokens: 38,400/convo x 15,000 convos = 576M tokens/month
Cost: 576M x $3.00/MTok = $1,728/month (just for static content)
After caching (Anthropic 5-minute cache):
// First request per conversation: cache write (1.25x)
Write cost: 6,400 tokens x 15,000 x $3.75/MTok = $360/month
// Remaining 5 turns per conversation: cache read (0.1x)
Read cost: 6,400 x 5 turns x 15,000 x $0.30/MTok = $144/month
Total: $504/month (vs $1,728 -- saving $1,224/month)
That's a 71% reduction on static content. If you use Anthropic's 1-hour cache and your agent handles enough volume to keep the cache warm, cache reads drop to $0.30/MTok across nearly all requests.
Implementation
With Anthropic, add a single cache_control field:
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: systemPrompt, // 4,000 tokens of instructions
// Cache this block -- reads are 90% cheaper
cache_control: { type: "ephemeral" }
}
],
tools: tools, // 2,400 tokens of schemas (auto-cached with system)
messages: conversationHistory
});
With OpenAI, caching is automatic for prompts over 1,024 tokens. No code changes needed -- the discount appears on your bill.
Running savings: $13,247 - $1,224 = $12,023 remaining.
Model Routing: Right Model, Right Task
Conventional wisdom says you need the best model for customer-facing conversations. The data says 60-65% of real support traffic is simple enough for a model that costs 30x less. "What are your hours?" and "I need to dispute a charge on my account from a merchant I don't recognize, the transaction posted twice and I've already contacted them but got nowhere" are vastly different tasks.
The pricing gap is enormous
| Model | Input ($/MTok) | Output ($/MTok) | Good For |
|---|---|---|---|
| GPT-4.1 Nano | $0.10 | $0.40 | Classification, FAQ, routing |
| GPT-4o-mini | $0.15 | $0.60 | Simple Q&A, extraction |
| Claude Haiku 3.5 | $0.80 | $4.00 | Standard support, summaries |
| Gemini 2.5 Flash | $0.30 | $2.50 | Mid-complexity reasoning |
| GPT-4.1 | $2.00 | $8.00 | Complex multi-step tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Nuanced reasoning, writing |
| GPT-4o | $2.50 | $10.00 | General flagship tasks |
The gap between GPT-4.1 Nano ($0.10/MTok) and Claude Sonnet ($3.00/MTok) is 30x on input. Even routing 50% of requests to a cheaper model creates massive savings.
A practical routing strategy
We classified our conversations into three tiers:
// Tier 1 -- Simple (60-65% of traffic)
// Greetings, FAQs, order status, store hours
// Route to: GPT-4.1 Nano ($0.10/$0.40 per MTok)
const SIMPLE_INTENTS = [
"greeting", "hours", "order_status",
"return_policy", "faq"
];
// Tier 2 -- Standard (25-30% of traffic)
// Account changes, standard complaints, multi-step lookups
// Route to: Claude Haiku 3.5 ($0.80/$4.00 per MTok)
const STANDARD_INTENTS = [
"account_update", "complaint", "billing_question",
"product_comparison"
];
// Tier 3 -- Complex (10-15% of traffic)
// Disputes, escalations, multi-tool reasoning chains
// Route to: Claude Sonnet ($3.00/$15.00 per MTok)
const COMPLEX_INTENTS = [
"dispute", "escalation", "multi_issue",
"policy_exception"
];
The classifier itself runs on GPT-4.1 Nano. At $0.10/MTok input, classifying every inbound message costs about $4.50/month. The first request in each conversation gets classified, then the tier sticks for the session (with automatic upgrade if the conversation complexity increases).
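A minimal router on top of those tiers might look like the sketch below. The model IDs and the upgrade-only stickiness rule are assumptions, not our exact implementation:

```typescript
type Tier = "simple" | "standard" | "complex";

// Intent -> tier mapping built from the lists above
const INTENT_TIERS: Record<string, Tier> = {
  greeting: "simple", hours: "simple", order_status: "simple",
  return_policy: "simple", faq: "simple",
  account_update: "standard", complaint: "standard",
  billing_question: "standard", product_comparison: "standard",
  dispute: "complex", escalation: "complex",
  multi_issue: "complex", policy_exception: "complex",
};

// Hypothetical model IDs per tier -- substitute your own
const TIER_MODELS: Record<Tier, string> = {
  simple: "gpt-4.1-nano",
  standard: "claude-3-5-haiku-latest",
  complex: "claude-sonnet-4-20250514",
};

const TIER_ORDER: Tier[] = ["simple", "standard", "complex"];

// Sticky per-session tier: the tier can only move up, never down,
// so a conversation that escalates never drops back to a weaker model
function routeModel(intent: string, session: { tier?: Tier }): string {
  const next = INTENT_TIERS[intent] ?? "complex"; // unknown intent -> be safe
  const current = session.tier ?? "simple";
  session.tier =
    TIER_ORDER.indexOf(next) > TIER_ORDER.indexOf(current) ? next : current;
  return TIER_MODELS[session.tier];
}
```

The upgrade-only rule matters: downgrading mid-conversation risks a visible quality cliff exactly when the customer is most frustrated.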
The math
Before routing (everything on Claude Sonnet):
All traffic: $5,440/month (base token cost, pre-tool-use)
After routing (60% Nano, 25% Haiku, 15% Sonnet):
// Simplified: using average tokens per conversation tier
Tier 1 (9,000 convos): 45,000 tokens avg x $0.10/$0.40
Input: 405M x $0.10/MTok = $40.50
Output: 81M x $0.40/MTok = $32.40
Tier 2 (3,750 convos): 70,000 tokens avg x $0.80/$4.00
Input: 262.5M x $0.80/MTok = $210.00
Output: 56.25M x $4.00/MTok = $225.00
Tier 3 (2,250 convos): 90,000 tokens avg x $3.00/$15.00
Input: 202.5M x $3.00/MTok = $607.50
Output: 33.75M x $15.00/MTok = $506.25
Classifier: $4.50/month
Total: $1,626.15/month (vs $5,440 -- saving ~$3,814)
In practice, we saw a 42% reduction in total token spend after implementing routing, because the conversation-level token counts also dropped (simpler conversations on cheaper models were shorter).
Running savings: $12,023 - $3,814 = $8,209 remaining.
Batch Processing: Half-Price Background Work
Every major provider offers a Batch API with 50% discounts on both input and output tokens. The trade-off: results arrive asynchronously, typically within 24 hours.
This doesn't work for real-time conversations. But production agents have a surprising amount of background work:
| Background Task | Token Volume | Frequency | Batch-Eligible? |
|---|---|---|---|
| Conversation summarization | ~2,000/convo | Nightly | Yes |
| Quality scoring (LLM-as-judge) | ~3,500/convo | Nightly | Yes |
| Knowledge base refresh | ~50,000/batch | Weekly | Yes |
| Memory extraction & facts | ~1,500/convo | Post-call | Yes |
| Analytics narrative generation | ~4,000/report | Daily | Yes |
| Prompt regression testing | ~8,000/test | Per deploy | Yes |
For our agent, background tasks consumed roughly 25% of total monthly spend -- about $3,300/month. All of it was batch-eligible.
The math
Before batching:
Background task spend: $3,300/month
After batching (50% discount):
Background task spend: $1,650/month
Savings: $1,650/month
With Anthropic's Batch API, you can also stack batch processing with prompt caching. Batch requests on Claude Sonnet cost $1.50/MTok input and $7.50/MTok output, and the cache-read discount applies on top of that, dropping cached input even further.
// Batch quality scoring -- runs nightly, no rush
const batch = await anthropic.beta.messages.batches.create({
requests: conversations.map(convo => ({
custom_id: convo.id,
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 500,
// Cached system prompt + scoring rubric
system: [{ type: "text", text: scoringPrompt, cache_control: { type: "ephemeral" } }],
messages: [{ role: "user", content: convo.transcript }]
}
}))
});
// Results arrive within 24 hours at 50% off
Running savings: $8,209 - $1,650 = $6,559 remaining.
Plan-and-Execute: Stop Re-Planning
Remember that 6-turn conversation from the $13K bill breakdown? Here's an insight that saved us the most after caching: agents re-plan the same workflows constantly. "Cancel my order" triggers the same reasoning chain every time -- look up order, check cancellation window, process refund. But a ReAct agent burns tokens re-deriving those steps from scratch on each call.
Plan-and-Execute separates planning from execution:
- A lightweight model creates a step-by-step plan
- Each step executes independently (often with a smaller model)
- Completed plans are cached in a memory store
- Matching requests retrieve the cached plan instead of regenerating
The architecture
async function handleRequest(message: string, context: AgentContext) {
// Step 1: Classify intent (GPT-4.1 Nano, ~$0.0001)
const intent = await classifyIntent(message);
// Step 2: Check plan cache (free -- it's a DB lookup)
const cachedPlan = await planCache.find(intent, context.parameters);
if (cachedPlan) {
// Cache hit: skip planning entirely, execute steps directly
// Saves 2,000-5,000 tokens of planning per request
return await executePlan(cachedPlan, context);
}
// Cache miss: generate plan with mid-tier model
const plan = await generatePlan(intent, message, context);
// Store for future reuse
await planCache.store(intent, context.parameters, plan);
return await executePlan(plan, context);
}
Why it works
We analyzed 30 days of conversations and found that 58% of requests matched one of 23 distinct plan templates. Order status checks, cancellations, address updates, billing inquiries -- they follow the same steps with different parameters.
For those 58% of requests, we eliminated the planning LLM call entirely. For the remaining 42%, we still generate fresh plans, but with a smaller model (Haiku instead of Sonnet) since the plan structure is simpler than the full agent reasoning.
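The key to reusing plans across requests is how the cache key is built. A sketch of one way to normalize it, so that "cancel order ORD-9284" and "cancel order ORD-5511" hit the same template (the types and helpers here are illustrative, not our exact implementation):

```typescript
interface PlanStep { tool: string; argsFrom: string[] }
interface Plan { intent: string; steps: PlanStep[] }

// In-memory stand-in for the plan store; production would use a DB
const planStore = new Map<string, Plan>();

// Key on the intent plus the *names* of the parameters, never their
// values, so one template serves every order ID / address / account
function planKey(intent: string, params: Record<string, unknown>): string {
  return `${intent}:${Object.keys(params).sort().join(",")}`;
}

function storePlan(
  intent: string, params: Record<string, unknown>, plan: Plan
): void {
  planStore.set(planKey(intent, params), plan);
}

function findPlan(
  intent: string, params: Record<string, unknown>
): Plan | undefined {
  return planStore.get(planKey(intent, params));
}
```

Parameter values flow into the cached plan at execution time; only the plan's structure is reused.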
The math
Before Plan-and-Execute:
// Planning cost: ~3,000 tokens per request on Sonnet
Planning: 3,000 tokens x 6 turns x 15,000 convos x $3.00/MTok
= 270M tokens x $3.00/MTok = $810/month
(Plus output tokens for plans: ~$400/month)
Total planning cost: ~$1,210/month
After Plan-and-Execute:
// 58% cache hits: $0 planning cost
// 42% cache misses: plan on Haiku instead of Sonnet
Miss planning: 3,000 x 6 x 6,300 convos x $0.80/MTok = $90.72
Miss output: ~$50/month
Intent classifier: already counted in routing
Total planning cost: ~$141/month (saving ~$1,069)
The savings compound with model routing -- cached plans can execute their individual steps on the cheapest capable model per step.
Running savings: $6,559 - $1,069 = $5,490 remaining.
Context Management: Smaller Windows
The most overlooked cost driver is conversation history. By turn 6, you're sending 12,500 tokens of history on every request. Most of it is low-value for the current turn.
Three techniques that work:
1. Sliding window with summary
Instead of sending full history, keep the last 3 turns verbatim and summarize earlier turns into a compressed context block:
async function buildContext(history: Message[]): Promise<Message[]> {
if (history.length <= 6) return history; // 3 turns = 6 messages
// Summarize older turns into ~200 tokens (vs ~2,500 raw)
const oldTurns = history.slice(0, -6);
const summary = await summarize(oldTurns); // Run on Nano, costs ~$0.00005
return [
{ role: "system", content: `Previous context: ${summary}` },
...history.slice(-6) // Keep last 3 turns verbatim
];
}
This typically reduces history tokens by 40-60% on longer conversations.
2. Tool schema pruning
Don't send all 8 tool schemas on every turn. After intent classification, send only the 2-3 tools relevant to the current task:
// Before: 2,400 tokens of tool schemas on every turn
const allTools = [orderLookup, orderCancel, orderModify, crmUpdate,
billingCheck, refundProcess, escalate, faqSearch];
// After: ~800 tokens -- only what this intent needs
const relevantTools = selectToolsForIntent(intent, allTools);
// "order_status" -> [orderLookup, orderModify]
// "billing_question" -> [billingCheck, refundProcess]This cuts tool schema tokens by 60-70% per turn.
3. Structured extraction over raw replay
Instead of replaying raw tool results in conversation history, extract structured data once and reference it compactly:
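As a sketch, the extraction step can be a plain function over the tool result -- no LLM call needed when the payload shape is known. The order shape here is a hypothetical example:

```typescript
interface Order {
  id: string;
  items: unknown[];
  shipping: { shippedDate: string; eta: string; tracking: string };
}

// Collapse a verbose tool result into a one-line fact for the history,
// so later turns replay ~120 tokens instead of ~800 of raw JSON
function summarizeOrder(order: Order): string {
  const { shippedDate, eta, tracking } = order.shipping;
  return `Order ${order.id}: ${order.items.length} items, ` +
    `shipped ${shippedDate}, arriving ${eta}, tracking ${tracking}`;
}
```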
// Before: 800 tokens of raw JSON from order lookup
// { "order": { "id": "ORD-9284", "items": [...], "shipping": {...}, ... } }
// After: 120 tokens of extracted facts
// Order ORD-9284: 2 items, shipped 3/18, arriving 3/21, tracking UPS-1Z999
Combined context savings
Before context optimization:
Average input per turn 6: 18,900 tokens
After (sliding window + tool pruning + structured extraction):
Average input per turn 6: 9,200 tokens (~51% reduction)
Monthly savings at blended model rate (~$1.20/MTok avg after routing):
Reduction: ~580M fewer tokens/month
Savings: ~$696/month
Running savings: $5,490 - $696 = $4,794 remaining.
The Full Stack: $13K to $1,100
Here's what happens when you layer all five techniques:
| Technique | Monthly Savings | Cumulative Cost | Reduction |
|---|---|---|---|
| Baseline (unoptimized) | -- | $13,247 | -- |
| + Prompt caching | -$1,224 | $12,023 | 9% |
| + Model routing | -$3,814 | $8,209 | 38% |
| + Batch processing | -$1,650 | $6,559 | 51% |
| + Plan-and-Execute | -$1,069 | $5,490 | 59% |
| + Context management | -$696 | $4,794 | 64% |
| + All techniques compounding | -$3,694* | ~$1,100 | 92% |
*The compounding effect is real: routing means cheaper models for caching, caching means fewer tokens for routing decisions, context management means smaller payloads everywhere, and Plan-and-Execute skips entire LLM calls. The techniques multiply rather than simply add.
The actual production cost after full optimization: $1,100/month for the same 500-conversation/day, 8-tool agent with the same conversation quality.
What we monitored to prove quality held
Conventional wisdom says cost cutting means quality cutting. Our data showed the opposite. We tracked these metrics through every change using scorecards and analytics:
- Resolution rate: held at 84% (pre-optimization: 83%)
- Customer satisfaction: 4.2/5.0 (pre: 4.1/5.0 -- routing actually improved simple cases)
- Escalation rate: dropped from 16% to 14% (Plan-and-Execute was more consistent)
- Average handle time: 2.1 minutes (pre: 2.3 minutes -- cached plans executed faster)
Quality monitoring isn't optional -- it's what makes optimization safe. Without real-time analytics on resolution rates and satisfaction scores, you're flying blind.
Implementation Checklist
Don't try everything at once. Here's the order that maximizes impact with minimal risk:
Week 1: Prompt Caching (highest ROI, lowest risk)
- Enable cache_control on system prompts (Anthropic) or verify automatic caching is active (OpenAI)
- Cache tool schemas alongside the system prompt
- Monitor cache hit rates in your analytics dashboard -- target 85%+ after warmup
- Expected savings: 20-30% of input costs
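Cache hit rate can be computed from the token counts the Anthropic API returns in each response's usage object (cache_creation_input_tokens and cache_read_input_tokens); the aggregation sketch below assumes you log those per request:

```typescript
interface CacheUsage {
  cache_creation_input_tokens: number; // tokens written to cache (1.25x price)
  cache_read_input_tokens: number;     // tokens served from cache (0.1x price)
}

// Fraction of cacheable input served from cache across a window of
// requests -- the "85%+ after warmup" target from the checklist
function cacheHitRate(usages: CacheUsage[]): number {
  let read = 0, written = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    written += u.cache_creation_input_tokens;
  }
  const cacheable = read + written;
  return cacheable === 0 ? 0 : read / cacheable;
}
```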
Week 2: Context Management (no model changes)
- Implement sliding window summarization for conversations over 4 turns
- Prune tool schemas per intent (requires intent classification)
- Switch raw tool results to structured extraction
- Expected savings: additional 10-15%
Week 3: Model Routing (requires testing)
- Build intent classifier on cheapest model (GPT-4.1 Nano or Gemini Flash Lite)
- Define tier boundaries using historical conversation analysis
- Run shadow routing for 1 week: route silently, compare quality scores between tiers
- Deploy with automatic upgrade triggers (complexity score threshold)
- Expected savings: additional 30-40%
Week 4: Batch Processing + Plan-and-Execute
- Move summarization, quality scoring, and analytics to Batch API
- Implement plan cache with top 20 intent templates
- Set cache TTL based on how often your workflows change
- Expected savings: additional 15-25%
What Comes Next
Token prices drop every quarter. GPT-4o costs 92% less than GPT-4 did at launch. Claude Sonnet 4.5 matches the quality of earlier Opus models at one-fifth the price. The optimization techniques in this article will keep working as prices fall -- they're multiplicative, not additive.
The real shift is architectural. Teams that build tool-equipped agents with routing, caching, and plan reuse from day one never see a $13K bill. They start at $1,100 and scale to 5,000 conversations/day for $8,000 instead of $130K.
The techniques aren't theoretical. They're table stakes for anyone running agents in production. Start with caching this week. Route by next week. Your CFO will notice.
Monitor Your Agent's Token Economics
Chanl tracks cost per conversation, token breakdown by component, and quality scores alongside spend -- so you can optimize without guessing.