We got our first real production bill: $13,247. For one agent.
It was a customer-service agent on Claude Sonnet. Nothing exotic. A 4,000-token system prompt, 8 tools for order lookups and CRM updates, and about 500 conversations a day. Each conversation averaged 6 turns with full history replay. The math was brutal, but only once we actually did it.
That bill forced us to learn every optimization technique available. Over three months, we cut that same agent's cost to $1,100/month. Same quality. Same tools. Same conversation depth. This article is the playbook.
In this article:
- The $13K Bill, Decomposed -- where the money actually goes
- Prompt Caching: 90% Input Savings -- the single biggest lever
- Model Routing: Right Model, Right Task -- 30-50% off the remaining cost
- Batch Processing: Half-Price Background Work -- 50% off non-real-time tasks
- Plan-and-Execute: Stop Re-Planning -- caching decisions, not just prompts
- Context Management: Smaller Windows -- pay for what matters
- The Full Stack: $13K to $1,100 -- all techniques combined
- Implementation Checklist -- week-by-week rollout
The $13K Bill, Decomposed
Before you optimize anything, you need to know where your money goes. Here's what our 500-conversation/day agent actually consumed:
| Component | Tokens per Conversation | Monthly Volume (15K convos) | Notes |
|---|---|---|---|
| System prompt | 4,000 input | 60M input | Resent every turn |
| Conversation history | 3,000 input (avg) | 45M input | Grows each turn |
| Tool schemas (8 tools) | 2,400 input | 36M input | Resent every turn |
| Tool call results | 1,200 input | 18M input | CRM/order data |
| Agent responses | 1,500 output | 22.5M output | The actual replies |
| Total | ~12,100 | 159M input + 22.5M output | -- |
On Claude Sonnet at $3/MTok input and $15/MTok output:
Input: 159M tokens x $3.00/MTok = $477
Output: 22.5M tokens x $15.00/MTok = $337.50
Wait. That's only $815. Where's the $13K?
Here's the part nobody warns you about: each conversation has multiple turns. Our 6-turn average means the system prompt, tool schemas, and growing history are resent on every single turn. The real math:
// Each turn resends: system prompt + tools + full history
// Turn 1: 4,000 + 2,400 + 0 history = 6,400 input
// Turn 2: 4,000 + 2,400 + 2,500 history = 8,900 input
// Turn 3: 4,000 + 2,400 + 5,000 history = 11,400 input
// Turn 4: 4,000 + 2,400 + 7,500 history = 13,900 input
// Turn 5: 4,000 + 2,400 + 10,000 history = 16,400 input
// Turn 6: 4,000 + 2,400 + 12,500 history = 18,900 input
// Total per conversation: ~75,900 input tokens + ~9,000 output
Now the real bill:
Monthly input: 75,900 x 15,000 convos = 1,138.5M tokens
Monthly output: 9,000 x 15,000 convos = 135M tokens
Input cost: 1,138.5M x $3.00/MTok = $3,415.50
Output cost: 135M x $15.00/MTok = $2,025.00
Total: $5,440.50/month
And if your agent uses multi-step tool calls (ours did, averaging 2.4 tool-use rounds per conversation), add another $4,000-$7,800 in LLM calls for tool reasoning. Our actual bill: $13,247.
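The replay math above is easy to reproduce with a small cost model. This is a sketch using the averages from the table (6,400 static tokens per turn, ~2,500 tokens of history added per completed turn), not exact per-conversation counts:

```typescript
// Sketch of the per-turn replay cost model. STATIC = system prompt +
// tool schemas; history grows ~2,500 tokens per completed turn.
const STATIC_INPUT = 6_400;
const HISTORY_PER_TURN = 2_500;
const OUTPUT_PER_CONVO = 9_000;

function conversationInputTokens(turns: number): number {
  let total = 0;
  for (let t = 0; t < turns; t++) {
    // Every turn resends the static content plus the full history so far
    total += STATIC_INPUT + t * HISTORY_PER_TURN;
  }
  return total;
}

function monthlyCost(convos: number, turns: number): number {
  const inputTokens = conversationInputTokens(turns) * convos;
  const outputTokens = OUTPUT_PER_CONVO * convos;
  // Claude Sonnet list prices: $3/MTok input, $15/MTok output
  return (inputTokens / 1e6) * 3 + (outputTokens / 1e6) * 15;
}

console.log(conversationInputTokens(6)); // 75,900 input tokens per conversation
console.log(monthlyCost(15_000, 6));     // 5440.5 -- before tool-use rounds
```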
The system prompt and tool schemas alone -- content that never changes between conversations -- accounted for over 40% of input tokens. That's where we started cutting.
Prompt Caching: 90% Input Savings
Prompt caching is the single most impactful optimization for production agents. It eliminates re-processing of static content you send on every request.
How it works
Your system prompt, tool schemas, and few-shot examples are cached after the first request. Subsequent requests read from cache at a fraction of the normal input price.
| Provider | Cache Write Cost | Cache Read Cost | Savings on Reads | Cache Duration |
|---|---|---|---|---|
| Anthropic | 1.25x input price | 0.1x input price | 90% | 5 minutes |
| Anthropic (extended) | 2x input price | 0.1x input price | 90% | 1 hour |
| OpenAI (GPT-4.1) | 1x (automatic) | 0.25x input price | 75% | Automatic |
| OpenAI (GPT-5) | 1x (automatic) | 0.1x input price | 90% | Automatic |
| Google (Gemini) | 1x (automatic) | 0.25x input price | 75% | Automatic |
For our agent, the static content per turn was 6,400 tokens (4,000 system prompt + 2,400 tool schemas). At 6 turns per conversation, that's 38,400 static tokens per conversation resent and reprocessed.
The math
Before caching (Claude Sonnet):
Static tokens: 38,400/convo x 15,000 convos = 576M tokens/month
Cost: 576M x $3.00/MTok = $1,728/month (just for static content)
After caching (Anthropic 5-minute cache):
// First request per conversation: cache write (1.25x)
Write cost: 6,400 tokens x 15,000 x $3.75/MTok = $360/month
// Remaining 5 turns per conversation: cache read (0.1x)
Read cost: 6,400 x 5 turns x 15,000 x $0.30/MTok = $144/month
Total: $504/month (vs $1,728 -- saving $1,224/month)
That's a 71% reduction on static content. If you use Anthropic's 1-hour cache and your agent handles enough volume to keep the cache warm, cache reads drop to $0.30/MTok across nearly all requests.
Implementation
With Anthropic, add a single cache_control field:
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: systemPrompt, // 4,000 tokens of instructions
// Cache this block -- reads are 90% cheaper
cache_control: { type: "ephemeral" }
}
],
tools: tools, // 2,400 tokens of schemas (auto-cached with system)
messages: conversationHistory
});
With OpenAI, caching is automatic for prompts over 1,024 tokens. No code changes needed -- the discount appears on your bill.
Running savings: $13,247 - $1,224 = $12,023 remaining.
Model Routing: Right Model, Right Task
Conventional wisdom says you need the best model for customer-facing conversations. The data says 60-65% of real support traffic is simple enough for a model that costs 30x less. "What are your hours?" and "I need to dispute a charge on my account from a merchant I don't recognize, the transaction posted twice and I've already contacted them but got nowhere" are vastly different tasks.
The pricing gap is enormous
| Model | Input ($/MTok) | Output ($/MTok) | Good For |
|---|---|---|---|
| GPT-4.1 Nano | $0.10 | $0.40 | Classification, FAQ, routing |
| GPT-4o-mini | $0.15 | $0.60 | Simple Q&A, extraction |
| Claude Haiku 3.5 | $0.80 | $4.00 | Standard support, summaries |
| Gemini 2.5 Flash | $0.30 | $2.50 | Mid-complexity reasoning |
| GPT-4.1 | $2.00 | $8.00 | Complex multi-step tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Nuanced reasoning, writing |
| GPT-4o | $2.50 | $10.00 | General flagship tasks |
The gap between GPT-4.1 Nano ($0.10/MTok) and Claude Sonnet ($3.00/MTok) is 30x on input. Even routing 50% of requests to a cheaper model creates massive savings.
A practical routing strategy
We classified our conversations into three tiers:
// Tier 1 -- Simple (60-65% of traffic)
// Greetings, FAQs, order status, store hours
// Route to: GPT-4.1 Nano ($0.10/$0.40 per MTok)
const SIMPLE_INTENTS = [
"greeting", "hours", "order_status",
"return_policy", "faq"
];
// Tier 2 -- Standard (25-30% of traffic)
// Account changes, standard complaints, multi-step lookups
// Route to: Claude Haiku 3.5 ($0.80/$4.00 per MTok)
const STANDARD_INTENTS = [
"account_update", "complaint", "billing_question",
"product_comparison"
];
// Tier 3 -- Complex (10-15% of traffic)
// Disputes, escalations, multi-tool reasoning chains
// Route to: Claude Sonnet ($3.00/$15.00 per MTok)
const COMPLEX_INTENTS = [
"dispute", "escalation", "multi_issue",
"policy_exception"
];
The classifier itself runs on GPT-4.1 Nano. At $0.10/MTok input, classifying every inbound message costs about $4.50/month. The first request in each conversation gets classified, then the tier sticks for the session (with automatic upgrade if the conversation complexity increases).
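A minimal router on top of those tiers might look like the sketch below. The model IDs and the upgrade-only stickiness rule are assumptions, not our exact implementation:

```typescript
type Tier = "simple" | "standard" | "complex";

// Intent -> tier mapping built from the lists above
const INTENT_TIERS: Record<string, Tier> = {
  greeting: "simple", hours: "simple", order_status: "simple",
  return_policy: "simple", faq: "simple",
  account_update: "standard", complaint: "standard",
  billing_question: "standard", product_comparison: "standard",
  dispute: "complex", escalation: "complex",
  multi_issue: "complex", policy_exception: "complex",
};

// Hypothetical model IDs per tier -- substitute your own
const TIER_MODELS: Record<Tier, string> = {
  simple: "gpt-4.1-nano",
  standard: "claude-3-5-haiku-latest",
  complex: "claude-sonnet-4-20250514",
};

const TIER_ORDER: Tier[] = ["simple", "standard", "complex"];

// Sticky per-session tier: the tier can only move up, never down,
// so a conversation that escalates never drops back to a weaker model
function routeModel(intent: string, session: { tier?: Tier }): string {
  const next = INTENT_TIERS[intent] ?? "complex"; // unknown intent -> be safe
  const current = session.tier ?? "simple";
  session.tier =
    TIER_ORDER.indexOf(next) > TIER_ORDER.indexOf(current) ? next : current;
  return TIER_MODELS[session.tier];
}
```

The upgrade-only rule matters: downgrading mid-conversation risks a visible quality cliff exactly when the customer is most frustrated.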
The math
Before routing (everything on Claude Sonnet):
All traffic: $5,440/month (base token cost, pre-tool-use)
After routing (60% Nano, 25% Haiku, 15% Sonnet):
// Simplified: using average tokens per conversation tier
Tier 1 (9,000 convos): 45,000 tokens avg x $0.10/$0.40
Input: 405M x $0.10/MTok = $40.50
Output: 81M x $0.40/MTok = $32.40
Tier 2 (3,750 convos): 70,000 tokens avg x $0.80/$4.00
Input: 262.5M x $0.80/MTok = $210.00
Output: 56.25M x $4.00/MTok = $225.00
Tier 3 (2,250 convos): 90,000 tokens avg x $3.00/$15.00
Input: 202.5M x $3.00/MTok = $607.50
Output: 33.75M x $15.00/MTok = $506.25
Classifier: $4.50/month
Total: $1,626.15/month (vs $5,440 -- saving ~$3,814)
In practice, we saw a 42% reduction in total token spend after implementing routing, because the conversation-level token counts also dropped (simpler conversations on cheaper models were shorter).
Running savings: $12,023 - $3,814 = $8,209 remaining.
Batch Processing: Half-Price Background Work
Every major provider offers a Batch API with 50% discounts on both input and output tokens. The trade-off: results arrive asynchronously, typically within 24 hours.
This doesn't work for real-time conversations. But production agents have a surprising amount of background work:
| Background Task | Token Volume | Frequency | Batch-Eligible? |
|---|---|---|---|
| Conversation summarization | ~2,000/convo | Nightly | Yes |
| Quality scoring (LLM-as-judge) | ~3,500/convo | Nightly | Yes |
| Knowledge base refresh | ~50,000/batch | Weekly | Yes |
| Memory extraction & facts | ~1,500/convo | Post-call | Yes |
| Analytics narrative generation | ~4,000/report | Daily | Yes |
| Prompt regression testing | ~8,000/test | Per deploy | Yes |
For our agent, background tasks consumed roughly 25% of total monthly spend -- about $3,300/month. All of it was batch-eligible.
The math
Before batching:
Background task spend: $3,300/month
After batching (50% discount):
Background task spend: $1,650/month
Savings: $1,650/month
With Anthropic's Batch API, you can also stack batch processing with prompt caching. Batch requests on Claude Sonnet cost $1.50/MTok input and $7.50/MTok output, and the cache-read discount applies on top of that, dropping cached input even further.
// Batch quality scoring -- runs nightly, no rush
const batch = await anthropic.beta.messages.batches.create({
requests: conversations.map(convo => ({
custom_id: convo.id,
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 500,
// Cached system prompt + scoring rubric
system: [{ type: "text", text: scoringPrompt, cache_control: { type: "ephemeral" } }],
messages: [{ role: "user", content: convo.transcript }]
}
}))
});
// Results arrive within 24 hours at 50% off
Running savings: $8,209 - $1,650 = $6,559 remaining.
Plan-and-Execute: Stop Re-Planning
Remember that 6-turn conversation from the $13K bill breakdown? Here's an insight that saved us the most after caching: agents re-plan the same workflows constantly. "Cancel my order" triggers the same reasoning chain every time -- look up order, check cancellation window, process refund. But a ReAct agent burns tokens re-deriving those steps from scratch on each call.
Plan-and-Execute separates planning from execution:
- A lightweight model creates a step-by-step plan
- Each step executes independently (often with a smaller model)
- Completed plans are cached in a memory store
- Matching requests retrieve the cached plan instead of regenerating
The architecture
async function handleRequest(message: string, context: AgentContext) {
// Step 1: Classify intent (GPT-4.1 Nano, ~$0.0001)
const intent = await classifyIntent(message);
// Step 2: Check plan cache (free -- it's a DB lookup)
const cachedPlan = await planCache.find(intent, context.parameters);
if (cachedPlan) {
// Cache hit: skip planning entirely, execute steps directly
// Saves 2,000-5,000 tokens of planning per request
return await executePlan(cachedPlan, context);
}
// Cache miss: generate plan with mid-tier model
const plan = await generatePlan(intent, message, context);
// Store for future reuse
await planCache.store(intent, context.parameters, plan);
return await executePlan(plan, context);
}
Why it works
We analyzed 30 days of conversations and found that 58% of requests matched one of 23 distinct plan templates. Order status checks, cancellations, address updates, billing inquiries -- they follow the same steps with different parameters.
For those 58% of requests, we eliminated the planning LLM call entirely. For the remaining 42%, we still generate fresh plans, but with a smaller model (Haiku instead of Sonnet) since the plan structure is simpler than the full agent reasoning.
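The key to reusing plans across requests is how the cache key is built. A sketch of one way to normalize it, so that "cancel order ORD-9284" and "cancel order ORD-5511" hit the same template (the types and helpers here are illustrative, not our exact implementation):

```typescript
interface PlanStep { tool: string; argsFrom: string[] }
interface Plan { intent: string; steps: PlanStep[] }

// In-memory stand-in for the plan store; production would use a DB
const planStore = new Map<string, Plan>();

// Key on the intent plus the *names* of the parameters, never their
// values, so one template serves every order ID / address / account
function planKey(intent: string, params: Record<string, unknown>): string {
  return `${intent}:${Object.keys(params).sort().join(",")}`;
}

function storePlan(
  intent: string, params: Record<string, unknown>, plan: Plan
): void {
  planStore.set(planKey(intent, params), plan);
}

function findPlan(
  intent: string, params: Record<string, unknown>
): Plan | undefined {
  return planStore.get(planKey(intent, params));
}
```

Parameter values flow into the cached plan at execution time; only the plan's structure is reused.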
The math
Before Plan-and-Execute:
// Planning cost: ~3,000 tokens per request on Sonnet
Planning: 3,000 tokens x 6 turns x 15,000 convos x $3.00/MTok
= 270M tokens x $3.00/MTok = $810/month
(Plus output tokens for plans: ~$400/month)
Total planning cost: ~$1,210/month
After Plan-and-Execute:
// 58% cache hits: $0 planning cost
// 42% cache misses: plan on Haiku instead of Sonnet
Miss planning: 3,000 x 6 x 6,300 convos x $0.80/MTok = $90.72
Miss output: ~$50/month
Intent classifier: already counted in routing
Total planning cost: ~$141/month (saving ~$1,069)
The savings compound with model routing -- cached plans can execute their individual steps on the cheapest capable model per step.
Running savings: $6,559 - $1,069 = $5,490 remaining.
Context Management: Smaller Windows
The most overlooked cost driver is conversation history. By turn 6, you're sending 12,500 tokens of history on every request. Most of it is low-value for the current turn.
Three techniques that work:
1. Sliding window with summary
Instead of sending full history, keep the last 3 turns verbatim and summarize earlier turns into a compressed context block:
async function buildContext(history: Message[]): Promise<Message[]> {
if (history.length <= 6) return history; // 3 turns = 6 messages
// Summarize older turns into ~200 tokens (vs ~2,500 raw)
const oldTurns = history.slice(0, -6);
const summary = await summarize(oldTurns); // Run on Nano, costs ~$0.00005
return [
{ role: "system", content: `Previous context: ${summary}` },
...history.slice(-6) // Keep last 3 turns verbatim
];
}
This typically reduces history tokens by 40-60% on longer conversations.
2. Tool schema pruning
Don't send all 8 tool schemas on every turn. After intent classification, send only the 2-3 tools relevant to the current task:
// Before: 2,400 tokens of tool schemas on every turn
const allTools = [orderLookup, orderCancel, orderModify, crmUpdate,
billingCheck, refundProcess, escalate, faqSearch];
// After: ~800 tokens -- only what this intent needs
const relevantTools = selectToolsForIntent(intent, allTools);
// "order_status" -> [orderLookup, orderModify]
// "billing_question" -> [billingCheck, refundProcess]This cuts tool schema tokens by 60-70% per turn.
3. Structured extraction over raw replay
Instead of replaying raw tool results in conversation history, extract structured data once and reference it compactly:
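As a sketch, the extraction step can be a plain function over the tool result -- no LLM call needed when the payload shape is known. The order shape here is a hypothetical example:

```typescript
interface Order {
  id: string;
  items: unknown[];
  shipping: { shippedDate: string; eta: string; tracking: string };
}

// Collapse a verbose tool result into a one-line fact for the history,
// so later turns replay ~120 tokens instead of ~800 of raw JSON
function summarizeOrder(order: Order): string {
  const { shippedDate, eta, tracking } = order.shipping;
  return `Order ${order.id}: ${order.items.length} items, ` +
    `shipped ${shippedDate}, arriving ${eta}, tracking ${tracking}`;
}
```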
// Before: 800 tokens of raw JSON from order lookup
// { "order": { "id": "ORD-9284", "items": [...], "shipping": {...}, ... } }
// After: 120 tokens of extracted facts
// Order ORD-9284: 2 items, shipped 3/18, arriving 3/21, tracking UPS-1Z999
Combined context savings
Before context optimization:
Average input per turn 6: 18,900 tokens
After (sliding window + tool pruning + structured extraction):
Average input per turn 6: 9,200 tokens (~51% reduction)
Monthly savings at blended model rate (~$1.20/MTok avg after routing):
Reduction: ~580M fewer tokens/month
Savings: ~$696/month
Running savings: $5,490 - $696 = $4,794 remaining.
The Full Stack: $13K to $1,100
Here's what happens when you layer all five techniques:
| Technique | Monthly Savings | Cumulative Cost | Reduction |
|---|---|---|---|
| Baseline (unoptimized) | -- | $13,247 | -- |
| + Prompt caching | -$1,224 | $12,023 | 9% |
| + Model routing | -$3,814 | $8,209 | 38% |
| + Batch processing | -$1,650 | $6,559 | 51% |
| + Plan-and-Execute | -$1,069 | $5,490 | 59% |
| + Context management | -$696 | $4,794 | 64% |
| + All techniques compounding | -$3,694* | ~$1,100 | 92% |
*The compounding effect is real: routing means cheaper models for caching, caching means fewer tokens for routing decisions, context management means smaller payloads everywhere, and Plan-and-Execute skips entire LLM calls. The techniques multiply rather than simply add.
The actual production cost after full optimization: $1,100/month for the same 500-conversation/day, 8-tool agent with the same conversation quality.
What we monitored to prove quality held
Conventional wisdom says cost cutting means quality cutting. Our data showed the opposite. We tracked these metrics through every change using scorecards and analytics:
- Resolution rate: held at 84% (pre-optimization: 83%)
- Customer satisfaction: 4.2/5.0 (pre: 4.1/5.0 -- routing actually improved simple cases)
- Escalation rate: dropped from 16% to 14% (Plan-and-Execute was more consistent)
- Average handle time: 2.1 minutes (pre: 2.3 minutes -- cached plans executed faster)
Quality monitoring isn't optional -- it's what makes optimization safe. Without real-time analytics on resolution rates and satisfaction scores, you're flying blind.
Implementation Checklist
Don't try everything at once. Here's the order that maximizes impact with minimal risk:
Week 1: Prompt Caching (highest ROI, lowest risk)
- Enable cache_control on system prompts (Anthropic) or verify automatic caching is active (OpenAI)
- Cache tool schemas alongside the system prompt
- Monitor cache hit rates in your analytics dashboard -- target 85%+ after warmup
- Expected savings: 20-30% of input costs
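Cache hit rate can be computed from the token counts the Anthropic API returns in each response's usage object (cache_creation_input_tokens and cache_read_input_tokens); the aggregation sketch below assumes you log those per request:

```typescript
interface CacheUsage {
  cache_creation_input_tokens: number; // tokens written to cache (1.25x price)
  cache_read_input_tokens: number;     // tokens served from cache (0.1x price)
}

// Fraction of cacheable input served from cache across a window of
// requests -- the "85%+ after warmup" target from the checklist
function cacheHitRate(usages: CacheUsage[]): number {
  let read = 0, written = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    written += u.cache_creation_input_tokens;
  }
  const cacheable = read + written;
  return cacheable === 0 ? 0 : read / cacheable;
}
```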
Week 2: Context Management (no model changes)
- Implement sliding window summarization for conversations over 4 turns
- Prune tool schemas per intent (requires intent classification)
- Switch raw tool results to structured extraction
- Expected savings: additional 10-15%
Week 3: Model Routing (requires testing)
- Build intent classifier on cheapest model (GPT-4.1 Nano or Gemini Flash Lite)
- Define tier boundaries using historical conversation analysis
- Run shadow routing for 1 week: route silently, compare quality scores between tiers
- Deploy with automatic upgrade triggers (complexity score threshold)
- Expected savings: additional 30-40%
Week 4: Batch Processing + Plan-and-Execute
- Move summarization, quality scoring, and analytics to Batch API
- Implement plan cache with top 20 intent templates
- Set cache TTL based on how often your workflows change
- Expected savings: additional 15-25%
What Comes Next
Token prices drop every quarter. GPT-4o costs 92% less than GPT-4 did at launch. Claude Sonnet 4.5 matches the quality of earlier Opus models at one-fifth the price. The optimization techniques in this article will keep working as prices fall -- they're multiplicative, not additive.
The real shift is architectural. Teams that build tool-equipped agents with routing, caching, and plan reuse from day one never see a $13K bill. They start at $1,100 and scale to 5,000 conversations/day for $8,000 instead of $130K.
The techniques aren't theoretical. They're table stakes for anyone running agents in production. Start with caching this week. Route by next week. Your CFO will notice.
Monitor Your Agent's Token Economics
Chanl tracks cost per conversation, token breakdown by component, and quality scores alongside spend -- so you can optimize without guessing.