
Voice Agent Platform Architecture: The Stack Behind Sub-300ms Responses

Deep dive into voice agent architecture — the STT→LLM→TTS pipeline, latency budgets, interruption handling, WebRTC vs WebSocket transport, and what orchestration platforms leave on the table.

Dean Grover, Co-founder
March 10, 2026
19 min read
Watercolor illustration of voice AI waveforms flowing through a technical architecture diagram with golden amber tones

Your voice agent responds in 1.2 seconds. Users don't complain — they just hang up. There's no error in the logs, no timeout, no crash. The conversation simply feels wrong, and the caller disengages before the agent finishes its first sentence. This is the reality of voice AI: latency isn't a performance metric, it's a product-or-death threshold.

Human conversation operates on a roughly 300-millisecond turn-taking rhythm. When two people talk, the gap between one person stopping and the other starting averages 200-300ms. Your voice agent needs to match that cadence — hear the user, understand what they said, formulate a response, and begin speaking — all within the time it takes a human to draw a breath. That means every millisecond in your architecture has a job, and you need to know exactly where each one goes.

This article tears apart the voice agent stack layer by layer. We'll trace an audio packet from microphone to speaker, build latency budgets for each pipeline stage, explore how interruption handling actually works, compare transport protocols, evaluate the major platforms, and identify the backend infrastructure gap that most teams discover six months too late.

Prerequisites

You'll get the most from this article if you have:

  • Familiarity with real-time audio concepts — sampling rates, codecs, buffering
  • Basic understanding of LLM APIs — streaming responses, token generation, function calling
  • Some exposure to WebSocket or WebRTC — you don't need to be an expert, but knowing the difference between TCP and UDP helps
  • TypeScript reading ability — code examples are in TypeScript, though the architectural patterns are language-agnostic

If you've worked with MCP servers or agent tool infrastructure, you already understand the backend side. This article focuses on what happens before and after those tools get called — the real-time audio layer that makes voice agents feel alive.

The Voice Pipeline: STT to LLM to TTS

A voice agent is three models chained together in real-time, with streaming connections between each stage — audio in, text through an LLM, audio back out, all within a fraction of a second. This cascaded architecture has been the dominant pattern since 2023, and despite the emergence of speech-to-speech models, it remains the production standard because of its flexibility: you can swap any individual component without rebuilding the others.

🎤 User Audio → STT (Speech-to-Text) → streaming text → LLM (Language Model) → streaming tokens → TTS (Text-to-Speech) → 🔊 Agent Audio
The cascaded voice agent pipeline — audio flows through three stages with streaming connections between each

That diagram looks simple. The complexity hides in the word "streaming." Each component doesn't wait for the previous one to finish — they overlap. The STT streams partial transcripts as the user speaks. The LLM begins generating tokens the moment it has enough context. The TTS starts synthesizing audio from the first sentence fragment, sometimes before the LLM has finished thinking.

This pipeline parallelism is what makes sub-second responses possible. Consider what happens when a user says "What's my account balance?":

  1. 0-400ms: User speaks. STT receives audio chunks every 20ms (Opus codec frame size).
  2. ~200ms: STT emits a partial transcript: "What's my account"
  3. ~350ms: STT emits the final transcript: "What's my account balance?"
  4. ~350ms: LLM receives the final transcript, begins generating tokens
  5. ~550ms: LLM emits first tokens: "Your current account balance is"
  6. ~560ms: TTS receives first sentence fragment, begins synthesis
  7. ~635ms: TTS emits first audio chunk — the user hears "Your..."

The user stops speaking at ~400ms. They hear the agent's voice at ~635ms. That's a 235ms perceived latency — well under the 300ms threshold. But it only works because every stage streams its output to the next stage incrementally, rather than waiting for completion.

In production, you'll want to instrument each stage with time-to-first-output (TTFO) metrics. When your STT suddenly jumps from 180ms to 400ms — maybe a regional outage, maybe an encoding change — those metrics are the early warning system that saves you from debugging "the agent feels slow" with no idea which component degraded.
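A minimal TTFO tracker can be as simple as recording the first output timestamp per stage for each turn. Here's a sketch (the `TtfoTracker` class and stage names are illustrative, not from any particular SDK):

```typescript
// Sketch: per-stage time-to-first-output (TTFO) tracking for one turn.
// Stage names are illustrative; wire recordOutput() into each stage's
// first-chunk callback in your pipeline.
type Stage = 'stt' | 'llm' | 'tts';

class TtfoTracker {
  private turnStart = 0;
  private firstOutput = new Map<Stage, number>();

  // Call when the user's turn begins (e.g. first VAD speech event).
  beginTurn(now: number): void {
    this.turnStart = now;
    this.firstOutput.clear();
  }

  // Call on every output chunk; only the first one per stage is recorded.
  recordOutput(stage: Stage, now: number): void {
    if (!this.firstOutput.has(stage)) {
      this.firstOutput.set(stage, now - this.turnStart);
    }
  }

  // Milliseconds from turn start to the stage's first output, if any.
  ttfo(stage: Stage): number | undefined {
    return this.firstOutput.get(stage);
  }
}
```

Export these per-stage numbers to whatever metrics system you already run; a P50/P95 chart per stage is usually enough to spot which component degraded.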

Where Every Millisecond Goes: The Latency Budget

Every voice agent has a latency budget — a fixed time envelope between the user stopping speech and hearing the agent's first word. Exceed it, and the conversation feels broken. Here's the full budget, based on published benchmarks from Twilio, Cresta, and Cerebrium:

| Component | Budget | What's Happening |
| --- | --- | --- |
| Network (user → server) | 20-60ms | Audio packets traverse the internet to your media server |
| Audio buffering | 20-40ms | Accumulating enough audio frames for the STT to process |
| STT transcription | 100-350ms | Converting audio to text (streaming, not batch) |
| Endpointing | 200-500ms | Detecting that the user has finished speaking |
| LLM time-to-first-token | 200-400ms | Generating the first response token |
| TTS time-to-first-byte | 40-150ms | Synthesizing the first audio chunk from text |
| Network (server → user) | 20-60ms | Audio packets traverse back to the user |
| Audio playback buffer | 20-60ms | Client-side jitter buffer before playback |

Total range: 620ms-1,620ms. The spread is enormous. Three components dominate: endpointing, STT, and LLM.

Endpointing: The Hidden Latency Killer

Endpointing — detecting when the user has finished speaking — is the single largest source of unnecessary latency. A basic Voice Activity Detection (VAD) model watches for silence. When it detects 300-500ms of silence, it triggers the STT to finalize and sends the text to the LLM.

That means you're burning 300-500ms just waiting to confirm the user stopped talking. That's your entire latency budget consumed by silence detection alone.

Modern endpointing uses semantic signals, not just silence. AssemblyAI's Universal-Streaming model analyzes the linguistic content of partial transcripts to predict when a user's thought is complete — before the silence threshold fires. Deepgram's Flux model cuts response latency by 200-600ms compared to pure-silence endpointing while reducing false interruptions by roughly 30%.

Here's a simplified semantic endpointer that combines VAD silence detection with partial transcript analysis:

typescript
class SemanticEndpointer {
  private silenceThresholdMs = 400;
  private semanticThresholdMs = 150; // Fire earlier when semantics confirm
 
  evaluate(vadIsSpeaking: boolean, silenceDurationMs: number, partialTranscript: string) {
    // Pure silence endpoint — conservative fallback
    if (!vadIsSpeaking && silenceDurationMs >= this.silenceThresholdMs) {
      return { type: 'silence' as const, confidence: 0.9 };
    }
 
    // Semantic endpoint — fire earlier if the sentence looks complete
    if (!vadIsSpeaking && silenceDurationMs >= this.semanticThresholdMs
        && this.looksComplete(partialTranscript)) {
      return { type: 'semantic' as const, confidence: 0.85 };
    }
    return null;
  }
 
  private looksComplete(text: string): boolean {
    const t = text.trim();
    if (!t) return false;
    if (/[.?!]$/.test(t)) return true;
    return [/\bplease$/i, /\bthanks$/i, /\bthat's all$/i].some(p => p.test(t));
  }
}

Real implementations use trained neural models analyzing prosody, intonation, and syntactic completeness — not regex. But the principle holds: by examining what the user said alongside silence duration, you can fire the endpoint 200-300ms earlier.

The tension is speed versus accuracy. Fire too early, and you clip the user mid-sentence. Fire too late, and you've wasted hundreds of milliseconds. The best systems err toward speed and handle mid-sentence triggers by appending continued speech to the in-progress LLM generation.

LLM Latency: Time-to-First-Token Is Everything

For voice agents, total generation time is almost irrelevant. What matters is TTFT — time-to-first-token. The TTS starts synthesizing audio the moment the LLM produces its first few words. Everything after that streams in parallel with audio playback.

| Model | TTFT (P50) | Notes |
| --- | --- | --- |
| GPT-4o | 250-400ms | Streaming enabled, US region |
| GPT-4o-mini | 150-250ms | Faster, good enough for most agent tasks |
| Claude 3.5 Sonnet | 200-350ms | Strong reasoning, slightly higher latency |
| Gemini 1.5 Flash | 180-280ms | Google's speed-optimized model |
| Groq (Llama 3) | 50-100ms | Custom LPU hardware, lowest TTFT |

Groq's numbers stand out — custom LPU hardware built for inference speed. For voice agents where response quality is "good enough" with Llama 3, Groq's 50-100ms TTFT transforms your latency budget.

But here's the tradeoff nobody talks about: faster models tend to produce worse first tokens. A 50ms response might start with "Certainly!" before getting to the answer. A 250ms response might jump straight to "Your account balance is $4,230." The first few words matter disproportionately in voice — they're what the user hears while the rest generates. The prompting techniques from Prompt Engineering from First Principles apply directly: instruct the model to lead with the answer.
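One concrete way to apply this is to bake the lead-with-the-answer instruction into the system prompt itself. A hypothetical example (the wording is illustrative, not a tested production prompt):

```typescript
// Hypothetical voice-optimized system prompt. The exact phrasing is
// illustrative; tune it against your own transcripts.
const VOICE_SYSTEM_PROMPT = [
  'You are a voice assistant. Your reply is spoken aloud via TTS.',
  'Lead with the answer in your first five words. No preamble like "Certainly!".',
  'Keep replies under two short sentences unless the user asks for detail.',
  'Avoid markdown, bullet points, and symbols that do not read well aloud.',
].join('\n');
```

The first line matters because models otherwise format for a screen; the second line directly targets the wasted-first-tokens problem described above.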

Interruption Handling: The Barge-In Problem

Handling user interruptions — called barge-in — is what separates a voice agent that feels conversational from one that feels like a recording. Users say "actually, never mind" mid-sentence, they correct themselves, they lose patience. If your agent can't handle these gracefully, it'll keep talking over the user.

Barge-in handling is a state machine coordinating four systems simultaneously. When the user starts speaking while the agent is talking, the system must — within 200ms — stop TTS playback, flush the audio buffer, cancel the in-flight LLM generation, and redirect user speech to the STT pipeline.

Listening → (endpoint_detected) → Processing → (tts_audio_ready) → Speaking → (agent_done_speaking) → Listening; Speaking → (user_speech_detected) → Interrupting → (generation_cancelled, barge_in_complete) → Listening
Voice agent state machine — interruption handling requires coordinated transitions across all pipeline components

The coordination logic manages these transitions. Order matters — stop audio first (user perception), then cancel generation (save compute):

typescript
type AgentState = 'listening' | 'processing' | 'speaking' | 'interrupting';
 
class InterruptionHandler {
  private state: AgentState = 'listening';
  private interruptDebounceMs = 150;
  private lastInterruptTime = 0;
 
  constructor(private controls: {
    stopTtsPlayback(): void;
    flushTtsBuffer(): void;
    cancelLlmGeneration(): void;
    restartSttStream(): void;
  }) {}
 
  onUserSpeechDetected(timestamp: number): void {
    if (this.state !== 'speaking') return;
    if (timestamp - this.lastInterruptTime < this.interruptDebounceMs) return;
    this.lastInterruptTime = timestamp;
    this.state = 'interrupting';
 
    this.controls.stopTtsPlayback();
    this.controls.flushTtsBuffer();
    this.controls.cancelLlmGeneration();
    this.controls.restartSttStream();
 
    this.state = 'listening';
  }
}

The 150ms debounce prevents false triggers from background noise or — critically — the agent's own audio leaking into the microphone. Without echo cancellation, the VAD detects the agent's voice as user speech and triggers a self-interrupt loop: the agent stops, the echo stops, the VAD clears, the agent resumes, and the cycle repeats. The result sounds like uncontrollable stuttering.

WebRTC handles this at the protocol level with built-in Acoustic Echo Cancellation (AEC). If you're using WebSocket transport with raw audio, you'll need to implement AEC yourself — a non-trivial problem involving audio delay estimation and signal subtraction. Even with AEC, edge cases persist: Bluetooth headsets add variable latency, car speakerphones create long reverb tails. Production agents need tuned VAD thresholds and fast recovery paths for inevitable false barge-ins.

Transport Protocols: WebRTC vs WebSocket

Your transport protocol determines the latency floor of your entire system — the minimum delay even if every other component is infinitely fast. The distinction comes down to reliability versus timeliness: TCP (WebSocket) guarantees every packet arrives in order; UDP (WebRTC) drops late packets instead of stalling.

For real-time audio, timeliness wins. A single dropped 20ms audio frame is barely perceptible — the codec's error concealment fills the gap. A 200ms stall while TCP retransmits a lost packet creates an audible glitch. This phenomenon — head-of-line blocking — is why WebSocket connections have 200-500ms+ P99 latency versus WebRTC's 80-150ms.

| Factor | WebRTC | WebSocket |
| --- | --- | --- |
| Transport | UDP (RTP/RTCP) | TCP |
| Latency (P50) | 30-80ms | 50-150ms |
| Latency (P99) | 80-150ms | 200-500ms+ |
| Packet loss handling | Drops late packets, no stall | Retransmits, causes stalls |
| Echo cancellation | Built-in (AEC) | You implement it |
| NAT traversal | Built-in (ICE/STUN/TURN) | Proxy or direct connection |
| Setup complexity | Higher (STUN/TURN servers) | Lower (HTTP upgrade) |
| Server-to-server | Overkill | Ideal |

The emerging production pattern combines both. WebRTC carries audio between the user and your media server — handling the unpredictable client network with UDP's resilience and built-in noise suppression. WebSocket carries data between your media server and AI providers (STT, LLM, TTS APIs) — TCP's simplicity on the controlled server-to-server path.

One nuance worth calling out: the Opus codec that WebRTC uses operates on 20ms frames by default; decoded to 16kHz mono 16-bit PCM, each frame is 640 bytes. Your STT provider might expect different chunk sizes or sample rates (most want 16kHz PCM). The media relay server needs to handle this audio repackaging — a small detail that causes outsized debugging time when it goes wrong.
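The repackaging step can be sketched as a small accumulator, assuming frames have already been decoded to PCM (the `AudioRechunker` class and the chunk sizes in the test are illustrative):

```typescript
// Sketch: accumulate decoded PCM frames (e.g. 640 bytes per 20ms at
// 16kHz mono 16-bit) and re-emit fixed-size chunks for an STT provider.
// Sizes are illustrative; match your provider's expected chunking.
class AudioRechunker {
  private pending = new Uint8Array(0);

  constructor(private readonly targetChunkBytes: number) {}

  // Push one decoded frame; returns zero or more complete target-size chunks.
  push(frame: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.pending.length + frame.length);
    merged.set(this.pending);
    merged.set(frame, this.pending.length);
    this.pending = merged;

    const chunks: Uint8Array[] = [];
    while (this.pending.length >= this.targetChunkBytes) {
      chunks.push(this.pending.slice(0, this.targetChunkBytes));
      this.pending = this.pending.slice(this.targetChunkBytes);
    }
    return chunks;
  }
}
```

Resampling (e.g. 48kHz Opus output down to 16kHz) is a separate concern that sits before this step; the accumulator only fixes chunk boundaries.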

Choosing STT Providers: Accuracy vs Speed

Your STT provider is the gateway to everything downstream — it determines how fast the LLM gets text and how accurately that text represents what the user said. A voice agent handling prescription refills needs very high accuracy; one routing support tickets can tolerate more errors because intent survives noisy transcription.

| Provider | Latency (Streaming) | WER | Streaming | Languages | Standout Feature |
| --- | --- | --- | --- | --- | --- |
| Deepgram Nova-3 | 200-300ms | 5-7% | Yes | 40+ | Lowest WER at speed |
| AssemblyAI Universal-2 | ~300ms (P50) | Sub-5% (batch) | Yes | 100+ | Semantic endpointing |
| OpenAI Whisper (API) | 1-3s (batch) | ~10.6% | No | 99 | Multilingual breadth |
| Google Cloud STT v2 | 200-400ms | 6-9% | Yes | 125+ | Largest language coverage |
| ElevenLabs Scribe | <150ms | 7-10% | Yes (WS) | 32 | Ultra-low latency |

Deepgram is the default for English voice agents — best accuracy-to-latency ratio. AssemblyAI wins on endpointing — their built-in semantic endpointing means you don't need a separate endpointing layer. Whisper isn't for real-time — no native streaming, seconds of latency. ElevenLabs Scribe is the speed play — sub-150ms but higher error rates.

Streaming Enables Speculative Execution

With streaming STT, the LLM can start processing partial transcripts before the user finishes speaking. If the partial says "What's the status of order..." the system can start querying the order API before the user says the number. When the full transcript arrives with "...twelve thirty-four," the data is already in cache.

This speculative execution — borrowed from CPU architecture — can eliminate the LLM's TTFT entirely for predictable queries. The tradeoff is wasted compute when predictions miss, but in customer service (where 70%+ of conversations follow predictable patterns) you're trading cheap API calls for expensive latency reduction.
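A minimal sketch of this pattern, assuming a hypothetical intent table and fetcher (both illustrative, not any real API):

```typescript
// Sketch: speculative prefetch from partial transcripts. The intent
// patterns and the fetcher are hypothetical placeholders.
type Fetcher = (intent: string) => Promise<unknown>;

class SpeculativePrefetcher {
  private cache = new Map<string, Promise<unknown>>();
  private patterns: Array<[RegExp, string]> = [
    [/status of (?:my )?order/i, 'order_status'],
    [/account balance/i, 'account_balance'],
  ];

  constructor(private fetch: Fetcher) {}

  // Called on every partial transcript; fires at most one fetch per intent.
  onPartialTranscript(partial: string): void {
    for (const [pattern, intent] of this.patterns) {
      if (pattern.test(partial) && !this.cache.has(intent)) {
        this.cache.set(intent, this.fetch(intent)); // fire-and-forget
      }
    }
  }

  // When the LLM's tool call arrives, it resolves instantly if prefetched.
  resolve(intent: string): Promise<unknown> | undefined {
    return this.cache.get(intent);
  }
}
```

The cache holds promises, not values, so a tool call that lands mid-fetch simply awaits the in-flight request instead of issuing a duplicate.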

Choosing TTS Providers: The Voice Your Users Hear

TTS is the last mile, and it has outsized impact on user perception. Every upstream optimization is wasted if the voice sounds robotic or the TTS adds 500ms after the LLM finishes. This is also where brand identity lives: the specific voice, cadence, and emotional range that makes your agent recognizable.

The metric that matters is TTFB (time-to-first-byte) — how quickly the TTS returns its first playable audio chunk:

| Provider | TTFB | Voice Quality | Voice Cloning | Standout Feature |
| --- | --- | --- | --- | --- |
| Cartesia Sonic Turbo | ~40ms | Good | Yes | Lowest latency in the market |
| ElevenLabs Flash v2.5 | ~75ms | Excellent | Yes (best) | Widest voice cloning, multilingual |
| Deepgram Aura-2 | 90-200ms | Good | Limited | Best pricing for high volume |
| OpenAI TTS | 150-300ms | Very good | No | Simplest API integration |
| PlayHT 2.0 | 100-200ms | Good | Yes | Emotion control |

Cartesia's 40ms TTFB is remarkable — less than two audio frames at the 20ms Opus frame size. Those 35ms saved versus ElevenLabs compound with every turn. But ElevenLabs has the best voice cloning quality, Deepgram scales better on pricing, and OpenAI's TTS is the easiest to integrate if you're already on their LLM. In production, implement provider failover — cycle through providers in preference order and promote the next on consecutive failures. A single TTS outage shouldn't silence your agents.
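The failover pattern described above can be sketched as follows (the provider interface and failure threshold are illustrative, not any vendor's SDK):

```typescript
// Sketch: TTS failover that promotes the next provider in preference order
// after N consecutive failures. Threshold and interface are illustrative.
interface TtsProvider {
  name: string;
  synthesize(text: string): Promise<Uint8Array>;
}

class FailoverTts {
  private active = 0;
  private consecutiveFailures = 0;

  constructor(
    private providers: TtsProvider[],
    private maxFailures = 3, // promote after this many failures in a row
  ) {}

  get activeProvider(): string {
    return this.providers[this.active].name;
  }

  async synthesize(text: string): Promise<Uint8Array> {
    try {
      const audio = await this.providers[this.active].synthesize(text);
      this.consecutiveFailures = 0; // any success resets the counter
      return audio;
    } catch (err) {
      this.consecutiveFailures++;
      if (this.consecutiveFailures >= this.maxFailures) {
        this.active = (this.active + 1) % this.providers.length;
        this.consecutiveFailures = 0;
      }
      throw err; // caller decides whether to retry this utterance
    }
  }
}
```

A production version would also periodically probe the demoted provider and restore it once it recovers, so a brief outage doesn't permanently pin you to a backup voice.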

Voice Agent Platforms: The Landscape

The ecosystem splits into managed platforms that handle telephony through AI orchestration, and open-source frameworks that give you building blocks. This build-vs-buy decision is consequential because switching later is expensive — infrastructure commitments that take months to unwind.

Managed Platforms: VAPI, Retell, Bland

VAPI — developer-first. JSON/API configuration, pick your STT/LLM/TTS providers, ~700ms end-to-end latency. Starts at $0.05/min plus provider costs.

Retell — fastest managed latency at ~600ms. Includes HIPAA compliance in standard pricing (VAPI charges $1,000/month). Ships a visual agent builder for non-developers. Starts at $0.07/min including the AI voice.

Bland — optimized for high-volume outbound. Visual Pathways builder, deploy agents in ~10 lines of code, dedicated infrastructure option. Starts at $0.09/min for connected calls.

All three handle the real-time conversation loop. Where they diverge is what surrounds it — and what they don't handle.

Open-Source Frameworks: Pipecat and LiveKit

Pipecat (by Daily) treats everything as a stream of Frames flowing through processors — like Unix pipes for real-time media. Define the pipeline (transport → STT → LLM → TTS → transport) and Pipecat manages streaming between stages. The strength is flexibility: swap providers mid-conversation, add custom processing stages, build novel topologies. The cost: you'll spend significant time getting turn-taking right.

LiveKit provides WebRTC infrastructure — an open-source SFU in Go. Your agent joins a Room as a headless participant, subscribes to the user's mic, processes audio through your pipeline, and publishes responses. LiveKit Agents handles turn-taking out of the box.

Here's a Pipecat-style pipeline composition showing how stages chain:

typescript
type Processor = (input: AsyncIterable<Frame>) => AsyncIterable<Frame>;
 
function createPipeline(...processors: Processor[]): Processor {
  return (input) => {
    let stream = input;
    for (const proc of processors) stream = proc(stream);
    return stream;
  };
}
 
const voiceAgent = createPipeline(
  webrtcTransportInput,   // User audio from WebRTC
  sileroVadProcessor,     // Voice activity detection
  deepgramSttProcessor,   // Speech-to-text
  semanticEndpointer,     // End-of-utterance detection
  openaiLlmProcessor,     // Response generation
  elevenLabsTtsProcessor, // Speech synthesis
  webrtcTransportOutput,  // Agent audio via WebRTC
);

Decision Framework

| Choose... | When... |
| --- | --- |
| VAPI | Fast prototyping, API-first config, provider flexibility |
| Retell | Low latency, HIPAA compliance, non-developer team members |
| Bland | High-volume outbound campaigns, simplicity |
| Pipecat | Custom turn-taking, specific provider combinations, novel topologies |
| LiveKit | WebRTC infrastructure with turn-taking out of the box |

And here's the question most comparison articles don't ask: What happens after the call?

The Backend Gap: What Orchestration Platforms Don't Handle

Orchestration platforms solve the hardest real-time problem — maintaining a sub-second conversation loop. But the conversation loop is maybe 30% of what you need for production. The other 70% is everything around, before, and after the call.

Here's a pattern that plays out everywhere. Team starts with VAPI or Retell. Prototype works brilliantly. Ships to production. Three to six months in, the questions accumulate: How do we remember what this customer told us last week? How do we know which agents perform well? How do we manage 47 tools across 12 agents? How do we roll back a prompt change that increased failures by 15%?

Conversation Loop (Orchestration Platform): Telephony / WebRTC → STT → LLM → TTS. Backend Infrastructure Layer (connected via tool calls, context retrieval, post-call evaluation, and auth): 🔧 Tool Management (MCP + OpenAPI) · 🧠 Persistent Memory (cross-session) · 📊 Monitoring & Analytics · 📚 Knowledge Base (RAG) · 🔑 Secrets & Credentials · 📋 Quality Scorecards
The full voice agent architecture — orchestration platforms handle the conversation loop, but the backend infrastructure layer is equally critical for production

Tool management. Agents need to check orders, schedule appointments, process payments. These require MCP servers or OpenAPI integrations, credential rotation, input validation, and version control. Orchestration platforms give you function calling — but not tool infrastructure that manages those functions at scale.

Persistent memory. A customer calls Tuesday about billing, then again Thursday. Does the agent remember? Orchestration platforms maintain within-session state, but cross-session memory requires a separate storage layer that retrieves relevant context while respecting privacy boundaries.

Monitoring and quality. Call volume and handle time are table stakes. You need conversation-level quality assessment. Production monitoring gives you dashboards, alerts, and trends. Quality scorecards give you automated evaluation of every conversation. Conversation analytics give you aggregate insights across thousands of calls. The difference: knowing your agent handled 10,000 calls versus knowing 8,200 met quality standards.

Knowledge base. Non-trivial agents need documentation, policies, product catalogs. That's RAG infrastructure — vector embeddings, similarity search, freshness tracking — with results fast enough to not blow your latency budget.

The real value is the closed loop these components create together. Call data flows to analytics, analytics surface quality issues, quality insights inform prompt changes, prompt changes improve the next call. Without this loop, improvement is anecdotal. With it, you know Tuesday's prompt change increased first-call resolution by 8% across 2,400 calls.

Platforms like Chanl exist to fill this gap — the infrastructure layer (tools, memory, monitoring, scorecards) behind whichever orchestration platform handles your real-time conversation loop.

Putting It All Together: A Production Architecture

A production voice agent runs two parallel paths: a real-time path optimized for sub-second response, and an async path optimized for deep analysis after the call ends. Here's the complete architecture that teams running thousands of concurrent agents have converged on:

Real-time path: 📱 Phone Call (PSTN/SIP) → Telephony Gateway (Twilio/Vonage) → Media Server (WebRTC SFU) → STT Service (Deepgram Nova-3) → VAD + Endpointer → LLM Service (GPT-4o / Groq), backed by Tool Execution (MCP Runtime), Memory Retrieval, and Knowledge Base (RAG) → TTS Service (ElevenLabs Flash) → back to the caller. Async path: Call Recording → Async Transcription → Quality Scoring → Analytics Dashboard
Complete production voice agent architecture — real-time path and async post-call path

Two paths run simultaneously: the real-time path (phone → media → STT → LLM → TTS → phone) with a latency budget under 800ms, and the async path (recording → transcription → scoring → analytics) that runs post-call with no latency constraints.

Each concurrent call requires roughly 1 WebRTC connection, 1 streaming WebSocket to each provider, 1 LLM inference slot, and ~50-100KB/s of audio bandwidth. At 1,000 concurrent calls, the media server is rarely the bottleneck — STT and LLM provider rate limits are. Plan for those limits early. Hit them and new callers get silence.
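Those per-call figures translate into a quick back-of-envelope capacity check (the quota fields are hypothetical placeholders; substitute your real provider limits):

```typescript
// Back-of-envelope capacity check from the per-call figures above.
// Quota fields are hypothetical placeholders for your actual limits.
interface CapacityPlan {
  concurrentCalls: number;
  sttStreamLimit: number;  // max concurrent STT streams your quota allows
  llmRequestLimit: number; // max concurrent LLM inference slots
}

function capacityCheck(plan: CapacityPlan) {
  const bandwidthKBps = plan.concurrentCalls * 100; // ~100 KB/s per call, worst case
  return {
    bandwidthMBps: bandwidthKBps / 1000,
    sttHeadroom: plan.sttStreamLimit - plan.concurrentCalls,
    llmHeadroom: plan.llmRequestLimit - plan.concurrentCalls,
    bottleneck:
      plan.sttStreamLimit < plan.concurrentCalls ? 'stt'
      : plan.llmRequestLimit < plan.concurrentCalls ? 'llm'
      : 'none',
  };
}
```

Running the numbers for 1,000 concurrent calls against typical default quotas makes the point: bandwidth is trivial, but the STT stream limit is usually the first wall you hit.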

Speech-to-Speech: The Alternative Architecture

There's a fundamentally different approach: speech-to-speech models that process audio tokens directly, skipping text entirely. OpenAI's gpt-realtime model is the leading example — audio-in, audio-out, single model, ~400ms voice-to-voice latency.

🎤 User Audio → gpt-realtime (Speech-to-Speech) → 🔊 Agent Audio
Speech-to-speech architecture — one model handles the entire audio pipeline

The tradeoffs are significant. Provider lock-in — you can't swap STT to Deepgram or TTS to Cartesia. Limited observability — no intermediate text to log, analyze, or score (you'd need parallel STT for transcripts). Tool calling latency — the audio pipeline pauses during function calls, and you have less control over filler behavior. Cost — significantly higher per minute than cascaded pipelines.

Cascaded architectures will remain dominant through 2026 because of flexibility, observability, and cost. Speech-to-speech is compelling for demos and low-volume use cases. Keep your architecture modular enough that swapping between them is a configuration change, not a rewrite.

Latency Optimization Playbook

Your pipeline sits at 900ms. You need it under 600ms. Here's the playbook, ordered by impact — most teams hit 500ms by implementing the first three:

1. Switch to semantic endpointing (-200 to -400ms). Replace silence-based endpointing with a model that understands when the user's thought is complete. AssemblyAI and Deepgram both offer this natively. Almost always the single biggest win.

2. Use a faster LLM for simple turns (-100 to -250ms). Route "yes," "no," and account numbers to GPT-4o-mini or Groq's Llama (50-100ms TTFT). Save frontier models for complex queries.

3. Pre-warm TTS connections (-50 to -150ms). Keep a pool of warm WebSocket connections to your TTS provider. Cold-start overhead adds up across every turn.

4. Optimize prompts for voice (-50 to -100ms perceived). "Your balance is $4,230" beats "Certainly! Let me check that for you. Your current account balance as of today is $4,230.00." Fewer tokens = faster TTS synthesis.

5. Reduce network hops (-20 to -80ms). Co-locate your media server with your AI providers. Transatlantic round-trips add 80-120ms per audio chunk.

6. Speculative execution (-100 to -300ms for predictable queries). Start prefetching data from partial transcripts. Wrong guesses waste cheap API calls; right guesses save expensive latency.

7. Sentence-level TTS streaming (-100 to -200ms for long responses). Buffer LLM tokens until a sentence boundary, then flush to TTS. The user hears sentence one while sentence two generates:

typescript
class SentenceBuffer {
  private buffer = '';
  addToken(token: string): string | null {
    this.buffer += token;
    const match = this.buffer.match(/[.!?]\s|[.!?]$/);
    if (match && match.index !== undefined) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(0, end).trim();
      this.buffer = this.buffer.slice(end);
      return sentence;
    }
    return null;
  }
  flush(): string | null {
    const r = this.buffer.trim(); this.buffer = ''; return r || null;
  }
}

What's Next for Voice Agent Architecture

The stack has matured from experimental to production-grade in 18 months. Three trends will reshape it over the next 12-24:

Multimodal agents. Voice-only gives way to voice-plus-screen — the user talks while looking at an app, dashboard, or car display. The pipeline needs to emit structured data alongside audio, synchronizing voice with visual updates.

Edge inference. Running STT and small LLMs on-device eliminates network latency for simple queries. The architecture becomes a hybrid: edge handles common patterns in 50ms, cloud handles complex ones in 500ms.

Persistent voice identities. Tomorrow's agents will adjust pacing, vocabulary, and tone to match individual users across months of interactions — requiring deep integration between the voice pipeline and agent memory.

The fundamental architecture won't change. Pipeline stages connected by streaming, latency budgets in milliseconds, interruption handling as a state machine — these patterns are dictated by physics and psychoacoustics, not fashion. But the components will keep getting faster and cheaper. Cartesia's 40ms TTS would have been science fiction two years ago. By next year, today's "fast" will be "average."

Build your pipeline with clean interfaces between stages. Measure everything. Plan for the components to change underneath you. And remember: the user doesn't care about your architecture. They care about whether the agent feels like a conversation or a hold queue.

Build the backend your voice agents need

Chanl provides the infrastructure layer behind your voice pipeline — tools, memory, monitoring, scorecards, and analytics. Connect any orchestration platform.

See the platform
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
