The Demo Worked. Then What?
You spent a weekend wiring up a voice agent. VAPI, Retell, Bland — pick your flavor. By Sunday evening, you had something impressive: a phone number that answers, understands speech, talks back with a natural-sounding voice, and even calls an LLM to generate responses. You showed the demo to your team on Monday. Everyone was excited.
Then someone asked: "Does it remember what the customer said last time they called?"
No.
"Can we A/B test different prompts without redeploying?"
No.
"How do we know it's not hallucinating product details?"
Silence.
That weekend demo exposed something the voice platform marketing pages don't emphasize: you've solved the voice problem. You haven't solved the agent problem. The telephony, speech-to-text, text-to-speech, and LLM orchestration — that's the part these platforms nail. Everything behind the voice? That's entirely on you.
And "everything behind the voice" turns out to be most of the work.
What Voice Platforms Actually Do (And Where They Stop)
Let's be precise about what you get when you sign up for one of the big three voice AI platforms. Each takes a slightly different approach, but they're all solving the same core problem: making AI talk on the phone.
VAPI gives you a developer-first API for orchestrating voice calls. You configure agents via JSON, wire up STT/LLM/TTS providers, and handle function calls through webhooks. It's flexible, but that flexibility means you're writing a lot of glue code. Real costs run $0.13-0.31 per minute once you add up the separate STT, LLM, TTS, and telephony charges on top of the $0.05/min orchestration fee.
Retell offers the lowest latency of the three — around 600ms response time — with a more visual builder that non-developers can use. It includes HIPAA compliance out of the box and transparent per-minute pricing.
Bland optimizes for high-volume outbound use cases like sales calls and appointment reminders. If you're making thousands of calls with relatively simple scripts, it's the fastest to get running.
Here's what all three handle well:
- Telephony: Phone number provisioning, call routing, SIP trunking
- Speech-to-Text: Converting caller audio to text in real-time
- LLM Orchestration: Routing transcribed text to a language model and getting responses
- Text-to-Speech: Converting LLM responses back to natural-sounding audio
- Basic function calling: Triggering webhooks during conversations
And here's where every single one of them stops:
- Persistent memory across conversations
- Prompt versioning with rollback and staging environments
- Knowledge base management with retrieval and freshness guarantees
- Tool integration beyond simple webhooks
- Automated testing of agent behavior
- Production monitoring with quality scoring and drift detection
That gap — between what the voice platform handles and what a production agent needs — is where teams burn weeks or months of engineering time. Or worse, ship without it and watch quality collapse.
The Six Backend Problems Nobody Warns You About
What does it actually take to go from "demo that works" to "agent I'd trust with real customers"? Here's what you'll run into, roughly in the order you'll discover each problem.
1. Memory: Your Agent Has Amnesia
LLMs are stateless. Every API call is independent — the model has no awareness of prior turns unless context is explicitly passed in. Your voice platform handles short-term memory within a single call (it keeps the conversation history in the context window), but the moment that call ends, everything evaporates.
Customer calls on Monday about a billing issue. You resolve it. They call back Wednesday with a follow-up question. Your agent has no idea who they are, what they discussed, or what was resolved. It's like talking to a different person every time.
Building real memory for voice agents means solving several hard sub-problems:
- Session persistence: Storing conversation summaries after each call
- Semantic retrieval: Finding relevant past interactions without dumping everything into the context window
- Latency management: Memory retrieval adds 50-200ms of latency, which matters when users notice delays above 250ms
- Write timing: Synchronous memory writes during a call create noticeable pauses; you need async pipelines
None of the major voice platforms include this. You'll build it yourself with a vector database, embedding pipeline, and retrieval layer — or you'll find a backend that handles it.
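To make the shape of that work concrete, here's a minimal sketch of cross-session memory: summaries written after each call, then retrieved by relevance on the next one. Everything here is illustrative — a real system would rank with embedding similarity in a vector database rather than keyword overlap, and `MemoryStore` is a hypothetical name, not any platform's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallSummary:
    caller_id: str
    summary: str
    timestamp: float = field(default_factory=time.time)

class MemoryStore:
    """Toy cross-session memory. A production system would use a
    vector database and embedding similarity, not keyword overlap."""

    def __init__(self):
        self._summaries: list[CallSummary] = []

    def write(self, caller_id: str, summary: str) -> None:
        # In production this write runs on an async pipeline after the
        # call ends -- never synchronously mid-conversation, where it
        # would create an audible pause.
        self._summaries.append(CallSummary(caller_id, summary))

    def retrieve(self, caller_id: str, query: str, k: int = 2) -> list[str]:
        query_terms = set(query.lower().split())
        candidates = [s for s in self._summaries if s.caller_id == caller_id]
        # Rank by naive term overlap -- a stand-in for cosine similarity.
        ranked = sorted(
            candidates,
            key=lambda s: len(query_terms & set(s.summary.lower().split())),
            reverse=True,
        )
        return [s.summary for s in ranked[:k]]

store = MemoryStore()
store.write("cust-42", "Resolved billing dispute over duplicate charge")
store.write("cust-42", "Scheduled technician visit for modem replacement")
# Wednesday's call: retrieve only what's relevant to the new question,
# rather than dumping every past conversation into the context window.
context = store.retrieve("cust-42", "question about my billing charge")
```

The important structural point survives the simplification: writes are deferred until after the call, and retrieval returns a small ranked slice rather than the full history, which is what keeps the latency budget intact.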
2. Prompt Management: Copy-Paste Isn't a Deployment Strategy
Your agent's behavior lives in its system prompt. Change a few words and it might start offering discounts it shouldn't, or refuse to schedule appointments it should handle. In production, prompt changes are as consequential as code deployments.
But voice platforms treat prompts like a text field. You edit the prompt, save it, and it's live immediately for every caller. There's no staging environment. No version history. No way to roll back if something breaks. No A/B testing to compare performance between prompt variants.
Production prompt management requires treating prompts like code: versioned, reviewed, tested in staging, deployed with rollback capability. Teams that skip this learn the hard way when a "small wording tweak" causes a spike in customer complaints and nobody can figure out which change caused it.
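The mechanics are simpler than they sound. Here's a toy registry showing the core operations — publish, promote through environments, roll back — with hypothetical names (`PromptRegistry` is not a real library):

```python
class PromptRegistry:
    """Toy prompt registry: versioned prompts with staged rollout and
    rollback, treating prompt changes like code deployments."""

    def __init__(self):
        self._versions: list[str] = []
        self._env: dict[str, int] = {}  # environment name -> version index

    def publish(self, prompt: str) -> int:
        """Store a new immutable version; nothing goes live yet."""
        self._versions.append(prompt)
        return len(self._versions) - 1

    def promote(self, env: str, version: int) -> None:
        self._env[env] = version

    def rollback(self, env: str) -> None:
        # One step back; production never waits on a hotfix.
        self._env[env] = max(0, self._env[env] - 1)

    def get(self, env: str) -> str:
        return self._versions[self._env[env]]

reg = PromptRegistry()
v0 = reg.publish("You are a support agent. Never offer discounts.")
reg.promote("production", v0)

v1 = reg.publish("You are a support agent. Offer 10% off at churn risk.")
reg.promote("staging", v1)      # test the new behavior in staging first
reg.promote("production", v1)   # then go live
reg.rollback("production")      # complaints spike? one call to revert
```

Because every version is immutable and every environment is a pointer, "which change caused it" stops being a mystery: you can diff any two versions and pin any environment to a known-good one.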
3. Knowledge Bases: RAG Is Harder Than It Looks
Your voice agent needs to answer questions about your products, policies, procedures, and pricing. The standard approach is retrieval-augmented generation (RAG): chunk your documents, embed them in a vector store, retrieve relevant chunks at query time, and inject them into the LLM context.
Simple in theory. In practice, voice RAG introduces challenges that text-based RAG doesn't face:
- Latency budget is tighter. A chatbot user tolerates a 2-second retrieval delay. A caller on the phone does not. You need sub-200ms retrieval to keep conversations natural.
- Noisy transcripts: STT output contains errors, filler words, and partial sentences. Your retrieval pipeline needs to handle queries that look nothing like clean search terms.
- Dynamic data: Product prices change. Business hours vary by holiday. Return policies get updated. If your knowledge base is stale, your agent confidently states wrong information — which is worse than saying "I don't know."
- Citation is impossible: In text chat, you can link to sources. On a phone call, your agent can't say "according to paragraph 3 of document KB-2847."
Voice platforms typically offer basic knowledge base uploads — you can attach a PDF or paste some text. But managing knowledge freshness, chunking strategy, retrieval quality, and accuracy monitoring? That's backend infrastructure you'll need to build or buy.
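The noisy-transcript problem in particular has a cheap first line of defense: normalize the STT output before it hits retrieval. A minimal sketch, with a hand-picked filler list and naive overlap scoring standing in for a real embedding model:

```python
import re

# Illustrative filler list; a real pipeline would tune this per locale
# and lean on embedding models that tolerate transcription noise.
FILLER_WORDS = {"um", "uh", "er", "so", "like", "basically", "actually"}

def normalize_query(transcript: str) -> str:
    """Lowercase, strip punctuation artifacts, drop filler words."""
    tokens = re.findall(r"[a-z0-9']+", transcript.lower())
    return " ".join(t for t in tokens if t not in FILLER_WORDS)

def retrieve(chunks: list[str], query: str, k: int = 1) -> list[str]:
    # Naive term-overlap ranking -- a stand-in for vector similarity.
    terms = set(normalize_query(query).split())
    return sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )[:k]

kb = [
    "You can return items within 30 days with a receipt",
    "Store hours are 9am to 6pm Monday through Saturday",
]
query = "um, so can I, uh, return something I basically bought last week"
top = retrieve(kb, query)
```

The query that reaches the retriever looks nothing like what the caller actually said — and that's the point. Freshness, chunking strategy, and accuracy monitoring still sit on top of this, but every voice RAG pipeline needs some version of this normalization step.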
4. Tool Integration: Webhooks Don't Scale
Every voice platform supports function calling — your agent can trigger a webhook to check order status, schedule appointments, or look up account information. For a demo with two or three tools, webhooks work fine.
But production agents need 10, 20, sometimes 50+ tools. They need to check CRM records, query inventory systems, create support tickets, process payments, verify identities, and route to human agents. Each tool needs authentication, error handling, rate limiting, timeout management, and retry logic.
The Model Context Protocol (MCP) is emerging as the standard for this. Introduced by Anthropic in late 2024 and now adopted by OpenAI, Google, and Microsoft, MCP provides a standardized way to connect AI agents to external tools and data sources. Instead of writing custom webhook handlers for every integration, you define tool interfaces once and any MCP-compatible system can use them.
But here's the catch: none of the voice platforms natively support MCP yet. You'll need a backend layer that translates between your voice platform's function calling mechanism and your MCP tool servers. That's another piece of infrastructure to build and maintain.
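The translation itself is mostly shape-shifting JSON. Here's a hedged sketch: the webhook field names are invented for illustration (no voice platform's actual schema), while the output follows the JSON-RPC 2.0 `tools/call` shape MCP uses for tool invocation:

```python
def webhook_to_mcp(payload: dict) -> dict:
    """Map a voice platform's function-call webhook onto an MCP-style
    tools/call request (JSON-RPC 2.0). Input field names here are
    illustrative, not any specific platform's schema."""
    return {
        "jsonrpc": "2.0",
        "id": payload["call_id"],
        "method": "tools/call",
        "params": {
            "name": payload["function"]["name"],
            "arguments": payload["function"]["arguments"],
        },
    }

# A hypothetical inbound webhook from the voice layer:
webhook = {
    "call_id": "abc-123",
    "function": {
        "name": "check_order_status",
        "arguments": {"order_id": "9912"},
    },
}
request = webhook_to_mcp(webhook)
```

The mapping is trivial per tool; the real engineering lives around it — authentication, retries, timeouts, and translating MCP responses back into something the voice platform can speak. But the payoff is that each tool is defined once, behind the gateway, instead of once per voice platform.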
5. Testing: "Just Call It and See" Doesn't Count
This one might be the most painful. How do you test a voice agent?
With traditional software, you write unit tests, integration tests, end-to-end tests. You run them in CI. You know if something broke before it hits production. With voice agents on platforms like VAPI, the primary testing method is... calling your agent and talking to it. After every prompt change. After every tool update. After every knowledge base edit.
That's not testing. That's hoping.
Production voice agents need:
- Scenario-based testing: Define conversations that cover your critical paths — happy paths, edge cases, error conditions, multi-turn flows. Run them automatically.
- Persona simulation: Test with different caller types — angry customers, confused users, people with accents, people who interrupt, people who go off-topic.
- Regression detection: When you change a prompt or update a knowledge base, automatically verify that existing behavior didn't break.
- Quality scoring: Automated scorecards that grade every test conversation on accuracy, tone, compliance, and task completion.
Building a testing framework for voice agents is a project unto itself. Research from Hamming AI identifies a four-layer evaluation model — infrastructure, execution, user behavior, business outcome — and most teams only measure the first layer, if that.
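At its core, scenario-based testing is just: declare the conversation, declare what must and must not appear, run it against the agent, score the result. A minimal sketch with a canned stand-in for the agent under test (all names hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    user_turns: list[str]
    must_include: list[str]  # phrases the agent is required to produce
    must_avoid: list[str]    # phrases that signal a compliance failure

def run_scenario(agent: Callable[[str], str], scenario: Scenario) -> dict:
    """Drive the agent through the scripted turns and grade the output.
    Real frameworks also score tone and task completion with an LLM judge."""
    transcript = " ".join(agent(turn) for turn in scenario.user_turns).lower()
    passed = (
        all(p.lower() in transcript for p in scenario.must_include)
        and not any(p.lower() in transcript for p in scenario.must_avoid)
    )
    return {"scenario": scenario.name, "passed": passed}

# Canned stand-in for the real agent; in practice this would be a call
# into your staging deployment, not a local function.
def fake_agent(turn: str) -> str:
    if "refund" in turn:
        return "I can start a refund for you. It takes 3-5 business days."
    return "Could you tell me more about the issue?"

result = run_scenario(fake_agent, Scenario(
    name="refund_happy_path",
    user_turns=["I want a refund for my last order"],
    must_include=["refund"],
    must_avoid=["discount"],  # must never offer unsolicited discounts
))
```

Run a suite of these in CI on every prompt change and "calling it and talking to it" becomes the last line of defense instead of the only one.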
6. Monitoring: Flying Blind in Production
Your agent is live. Calls are coming in. How's it doing?
If you're relying on the voice platform's dashboard, you'll see call volume, duration, and maybe some basic transcripts. You won't see:
- Whether your agent is hallucinating more this week than last week
- Which tool calls are failing silently
- Whether prompt drift is causing quality degradation
- How specific customer segments are experiencing the agent
- Whether your knowledge base answers are accurate or outdated
Production monitoring for voice agents needs to be continuous, automated, and actionable. Not a dashboard you check once a week — an alerting system that tells you before customers start complaining that something changed.
Consider what happens when your LLM provider quietly updates their model weights (it happens more often than you'd think). Your prompts were tuned for one model version. Now responses are subtly different — slightly more verbose, slightly less accurate on edge cases, slightly more likely to hallucinate on topics where it used to stay grounded. Without automated quality monitoring, you won't notice until customer satisfaction dips weeks later. With it, you catch the drift within hours and adjust.
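The simplest useful version of that drift detection is a rolling window of per-call quality scores compared against a baseline. This sketch assumes scores already exist (in practice they come from automated rubrics grading live calls, not hand labels), and the class name is hypothetical:

```python
from collections import deque

class DriftMonitor:
    """Toy drift detector: alert when the rolling average of per-call
    quality scores falls more than `tolerance` below the baseline."""

    def __init__(self, baseline: float, window: int = 50,
                 tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # rolling window of scores

    def record(self, score: float) -> bool:
        """Record one call's quality score; return True if an alert fires."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.baseline - self.tolerance

# Quality holds steady, then a silent model update degrades it:
monitor = DriftMonitor(baseline=0.90, window=5)
alerts = [monitor.record(s) for s in [0.91, 0.92, 0.90, 0.70, 0.68]]
```

Production versions add per-segment breakdowns, tool-failure counters, and alert routing — but the principle is the same: a threshold that fires within hours of the drift, not a dashboard someone checks weeks later.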
The Real Cost of Building It Yourself
Teams that discover the backend gap typically respond in one of three ways: build everything from scratch, ignore the gap and ship anyway, or find purpose-built infrastructure for the backend layer.
Building from scratch is more expensive than most teams anticipate. Here's a rough breakdown based on what enterprise deployments report:
| Component | Build Time | Ongoing Maintenance |
|---|---|---|
| Memory system (vector DB + retrieval + async writes) | 4-8 weeks | 10-15 hrs/month |
| Prompt versioning + staging + rollback | 2-4 weeks | 5-8 hrs/month |
| Knowledge base pipeline (chunking, embedding, freshness) | 4-6 weeks | 15-20 hrs/month |
| Tool integration layer (auth, errors, retries, MCP) | 3-6 weeks | 10-15 hrs/month |
| Testing framework (scenarios, personas, regression) | 6-10 weeks | 15-20 hrs/month |
| Production monitoring + alerting | 3-5 weeks | 10-15 hrs/month |
| Total | 22-39 weeks | 65-93 hrs/month |
That's 5-9 months of engineering time before you even get to building the actual business logic your agent needs to handle. And the ongoing maintenance costs don't go down — they go up as you add more agents, more tools, and more conversation flows.
Ignoring the gap is the faster path to production and the faster path to failure. A RAND Corporation study found that over 80% of AI projects fail to reach production — roughly double the failure rate of non-AI IT projects. Gartner estimates that by 2027, more than 40% of agentic AI projects specifically will be canceled due to escalating costs, unclear value, or insufficient controls. The agent works in the demo. It works for the first hundred calls. By call ten thousand, quality has degraded to the point where you're losing customers instead of serving them.
A Backend Architecture That Actually Works
Whether you build or buy, here's the architecture pattern that production voice agents need behind whatever voice platform you've chosen:
┌─────────────────────────────────────────────┐
│ VOICE PLATFORM LAYER │
│ (VAPI / Retell / Bland / Custom) │
│ Telephony, STT, TTS, LLM Orchestration │
└─────────────────┬───────────────────────────┘
│
│ Function calls / Webhooks
│
┌─────────────────▼───────────────────────────┐
│ BACKEND AGENT LAYER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Memory │ │ Tools │ │ Knowledge│ │
│ │ Service │ │ Service │ │ Service │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Prompt │ │ Testing │ │ Monitor │ │
│ │ Manager │ │ Engine │ │ + Alerts │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ MCP Gateway (Tool Protocol) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

The key insight is that this backend layer should be voice-platform-agnostic. Your memory system, tool integrations, prompt management, and testing framework shouldn't be coupled to VAPI or Retell or Bland. When you switch voice providers (and you might — the space is moving fast), your backend infrastructure should survive the transition unchanged.
This is the same principle behind the Model Context Protocol: standardize the interface between the agent and its tools so that neither side needs to know about the other's implementation details.
Why does platform-agnostic matter? Because the voice AI market is consolidating fast. Bessemer's Voice AI Roadmap suggests that the most durable value in the voice stack sits at the infrastructure and application layers, not the orchestration layer alone. The voice platforms you're choosing between today may merge, pivot, or get acquired. The backend infrastructure — your agent's memory, tools, knowledge, and quality framework — is what carries your business logic forward regardless of what happens to the voice layer above it.
There's also a practical benefit: you can test different voice platforms without rebuilding your entire agent. Swap VAPI for Retell for a subset of calls. Try Bland for your outbound campaigns while keeping Retell for inbound support. The backend stays the same; only the voice layer changes.
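In code, platform-agnosticism usually means a thin adapter per voice platform and a shared backend behind it. The webhook field names below are invented for illustration — neither VAPI's nor Retell's actual payload schema — but the structure is the point: only the adapter changes when the voice layer does.

```python
from abc import ABC, abstractmethod

class VoiceAdapter(ABC):
    """One thin adapter per voice platform; everything behind it
    (memory, tools, prompts, monitoring) is shared and unchanged."""

    @abstractmethod
    def parse_function_call(self, payload: dict) -> tuple[str, dict]:
        """Extract (tool_name, arguments) from a platform webhook."""

class VapiAdapter(VoiceAdapter):
    # Field names are illustrative, not VAPI's actual schema.
    def parse_function_call(self, payload: dict) -> tuple[str, dict]:
        fc = payload["functionCall"]
        return fc["name"], fc["parameters"]

class RetellAdapter(VoiceAdapter):
    # Likewise illustrative.
    def parse_function_call(self, payload: dict) -> tuple[str, dict]:
        return payload["tool_name"], payload["args"]

def handle(adapter: VoiceAdapter, payload: dict, tools: dict) -> str:
    """Backend entry point: same tool registry, any voice layer."""
    name, args = adapter.parse_function_call(payload)
    return tools[name](**args)

tools = {"check_hours": lambda day: f"Open 9-6 on {day}"}
out = handle(
    VapiAdapter(),
    {"functionCall": {"name": "check_hours", "parameters": {"day": "Friday"}}},
    tools,
)
```

Swapping providers for a subset of calls then means instantiating a different adapter for those calls — the tool registry, memory, and monitoring behind `handle` never notice.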
Making It Work: A Practical Checklist
You've picked your voice platform. Here's how to evaluate and fill the backend gap, organized by priority — what'll hurt you first if you skip it.
- Memory architecture — Define short-term (in-call) vs long-term (cross-session) memory. Choose a vector store. Target sub-200ms retrieval.
- Prompt versioning — Set up version control for system prompts. Create staging and production environments. Build rollback capability.
- Knowledge base pipeline — Design chunking strategy for your domain. Set up embedding and indexing. Test retrieval accuracy with real caller queries.
- Tool integration layer — Catalog every external system your agent needs. Implement auth, error handling, retries, and timeouts for each.
- Automated testing — Write scenario tests for your top 20 conversation flows. Create persona variations. Set up regression suite for every prompt change.
- Production monitoring — Instrument quality scoring on live calls. Set up alerts for accuracy drops, tool failures, and latency spikes.
- Compliance and audit trail — Log all agent decisions, tool calls, and knowledge retrievals. Ensure PII handling meets regulatory requirements.
If building all of this from scratch sounds like it'll consume your entire roadmap, you're not wrong. That's exactly why backend platforms for AI agents exist — to handle the infrastructure so your team can focus on the business logic that makes your agent actually useful. Chanl, for example, provides the memory, tools, testing, and monitoring layers as managed services that sit behind any voice platform.
But regardless of whether you build or buy, the checklist above represents the work that separates a voice demo from a production agent. Skipping it doesn't make it go away — it just means your customers discover the gaps instead of your testing framework.
The Stack Is Bigger Than the Voice
The voice AI platform market has done something remarkable: it's made it trivially easy to get an AI agent on the phone. A few API calls, some JSON configuration, and you've got a talking agent. That's genuinely impressive engineering.
But it's also created a dangerous illusion. Because getting a voice agent talking is so easy, teams underestimate how much work remains to get it working — reliably, accurately, and at scale.
The voice platform is the front door. The backend is the house. And right now, most teams are building the door and wondering why the house is drafty.
Don't let the excitement of a working demo trick you into skipping the infrastructure that makes production possible. Pick your voice platform — they're all good at what they do. Then build the backend your agent actually needs.
Chanl Team
AI Agent Testing Platform
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.