That typing effect in ChatGPT isn't animation. Each character arrives from the server the moment it's generated — one token at a time, pushed over a persistent connection. Behind that smooth rendering sits a stack of decisions about transport protocols, buffer management, partial JSON parsing, and proxy configuration that can each break the experience in subtle ways. This tutorial builds streaming from scratch three ways (SSE, WebSocket, Chanl SDK) and covers every layer between the model generating a token and a character appearing on screen.
| What you'll build | Why it matters |
|---|---|
| SSE streaming server (Express) | The default protocol for 90% of AI streaming — simple, reliable, browser-native |
| WebSocket streaming server | Bidirectional streaming for voice AI, cancellation, and multi-user scenarios |
| Tool call accumulator | Parse partial JSON fragments as they stream — the hardest part of AI streaming |
| React streaming chat | Production-ready UI with TTFT metrics, cancellation, and tool call rendering |
| Backpressure handling | What happens when the client can't keep up — and how to prevent buffer bloat |
| Production proxy config | Nginx/CDN settings that break streaming if you get them wrong |
What you'll need
Runtime:
- Node.js 20+ and a Chanl account (free tier includes streaming)
Install dependencies:
# TypeScript — SDK + server framework
npm install @chanl-ai/sdk express
# For WebSocket examples:
npm install ws
Set your API key:
export CHANL_API_KEY="your-api-key-here"
Create a test agent we'll stream responses from:
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({ apiKey: process.env.CHANL_API_KEY });
const agent = await chanl.agents.create({
name: "Streaming Demo Agent",
instructions: "You are a helpful assistant. Give detailed, thoughtful responses.",
model: "claude-sonnet-4-20250514",
});
console.log("Agent ID:", agent.id); // Save this — we'll use it throughout
All code in this tutorial is complete and runnable.
Why streaming matters for AI
Streaming cuts perceived latency from 5-15 seconds to under 500ms. That's not a UX nicety — it's the difference between a product people use and one they abandon.
Without streaming, users stare at a blank screen while the entire response generates:
// Non-streaming: user waits for the ENTIRE response
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
const start = Date.now();
const response = await chanl.chat.send(agentId, [
{ role: "user", content: "Explain quantum computing" },
]);
const elapsed = Date.now() - start;
console.log(`User waited ${elapsed}ms for first character`);
// User waited 8,432ms for first character
// The full response appears all at once
console.log(response.content);
With streaming, the first token arrives in a few hundred milliseconds:
// Streaming: first token arrives in ~200-400ms
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
const start = Date.now();
let ttft: number | null = null;
const stream = chanl.chat.stream(agentId, [
{ role: "user", content: "Explain quantum computing" },
]);
for await (const chunk of stream) {
if (chunk.type === "token") {
if (!ttft) {
ttft = Date.now() - start;
console.log(`First token in ${ttft}ms`); // First token in 287ms
}
process.stdout.write(chunk.content);
}
}
Total generation time is identical — the model produces tokens at the same speed either way. But the user starts reading immediately instead of staring at nothing for 8 seconds.
Streaming also unlocks three capabilities batch responses can't provide:
Early cancellation. If the model goes off-track, the user (or your code) can abort mid-stream. Without streaming, you pay for the full generation whether you use it or not.
Progressive rendering. Markdown, code blocks, and lists render incrementally. The UI feels alive.
Real-time tool calls. When an agent invokes a tool during generation — looking up a database, calling an API, searching a knowledge base — streaming shows that execution live rather than hiding it behind a spinner.
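Early cancellation is worth seeing concretely. Below is a self-contained sketch: a simulated async-generator stream stands in for the SDK (the names simulateStream and readWithCutoff are illustrative, not SDK APIs), but the AbortController pattern is the same one the real stream accepts.

```typescript
// Sketch: cancelling a token stream mid-generation with AbortController.
// simulateStream stands in for an SDK stream; it is illustrative only.
async function* simulateStream(signal: AbortSignal): AsyncGenerator<string> {
  const tokens = ["Quantum ", "computing ", "uses ", "qubits ", "to..."];
  for (const token of tokens) {
    if (signal.aborted) throw new DOMException("Aborted", "AbortError");
    yield token;
  }
}

async function readWithCutoff(maxChars: number): Promise<string> {
  const controller = new AbortController();
  let text = "";
  try {
    for await (const token of simulateStream(controller.signal)) {
      text += token;
      // Stop paying for tokens we no longer want
      if (text.length >= maxChars) controller.abort();
    }
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // abort is expected
  }
  return text;
}
```

Once aborted, the generator never produces the remaining tokens, which is exactly the cost-saving property described above.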
Server-Sent Events: the default streaming protocol
SSE is a browser-native protocol for one-way server-to-client event streams over standard HTTP. OpenAI, Anthropic, and Google all use it. No WebSocket upgrade, no special handshake — just a persistent HTTP connection with Content-Type: text/event-stream.
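On the wire, an SSE stream is plain newline-delimited text: each event is a data: line followed by a blank line. Using the payload shape this tutorial's server emits, a token stream looks like this:

```text
data: {"type":"token","content":"Str"}

data: {"type":"token","content":"eaming"}

data: {"type":"done","reason":"stop"}
```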
The data flow is simple: the client opens a standard HTTP connection, the server proxies each token from the LLM as an SSE event, and the client appends it to the UI.
Building the SSE endpoint
Three headers establish the connection. Then each token from the Chanl SDK pipes directly to the client as an SSE event:
import express from "express";
import { ChanlClient } from "@chanl-ai/sdk";
const app = express();
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
app.use(express.json());
app.post("/api/chat/stream", async (req, res) => {
const { agentId, messages } = req.body;
// SSE headers — these three are non-negotiable
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");
res.flushHeaders();
try {
const stream = chanl.chat.stream(agentId, messages);
for await (const chunk of stream) {
if (chunk.type === "token") {
// SSE format: "data: <payload>\n\n"
res.write(
`data: ${JSON.stringify({ type: "token", content: chunk.content })}\n\n`
);
}
// Check for tool calls in the stream
if (chunk.type === "tool_call") {
res.write(
`data: ${JSON.stringify({
type: "tool_call",
id: chunk.id,
name: chunk.name,
arguments: chunk.arguments,
})}\n\n`
);
}
// Stream finished
if (chunk.type === "done") {
res.write(
`data: ${JSON.stringify({
type: "done",
reason: chunk.reason,
})}\n\n`
);
}
}
} catch (error) {
res.write(
`data: ${JSON.stringify({
type: "error",
message: error instanceof Error ? error.message : "Unknown error",
})}\n\n`
);
}
res.end();
});
app.listen(3000);
Consuming SSE from the browser
The browser has a built-in SSE client. For GET endpoints, it's one line:
const source = new EventSource("/api/chat/stream");
source.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "token") {
document.getElementById("response")!.textContent += data.content;
}
};
But EventSource only supports GET. For POST (which you need for chat), use fetch with a streaming body reader:
async function streamChat(
agentId: string,
messages: Array<{ role: string; content: string }>
) {
const response = await fetch("/api/chat/stream", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ agentId, messages }),
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
const start = Date.now();
let ttft: number | null = null;
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Parse SSE lines from buffer
const lines = buffer.split("\n");
buffer = lines.pop()!; // Keep incomplete line in buffer
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
const data = JSON.parse(line.slice(6));
if (data.type === "token") {
if (!ttft) {
ttft = Date.now() - start;
console.log(`TTFT: ${ttft}ms`);
}
document.getElementById("response")!.textContent += data.content;
}
if (data.type === "done") {
console.log(`Total time: ${Date.now() - start}ms`);
}
}
}
}
Simplifying with the Chanl SDK
That's a lot of boilerplate. The SDK handles SSE connection management, reconnection, and event parsing for you:
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
// Stream a message — tokens arrive in real-time
const stream = chanl.chat.stream(agentId, [
{ role: "user", content: "Explain how our pricing works" },
]);
for await (const chunk of stream) {
if (chunk.type === "token") process.stdout.write(chunk.content);
if (chunk.type === "tool_call") console.log("Calling:", chunk.name);
if (chunk.type === "done") console.log("\nComplete");
}
Your agent's tools and MCP servers execute transparently — tool call events surface automatically during the stream.
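Under the hood, the trickiest part of the fetch-based client shown earlier is chunk-boundary handling: a network chunk can end mid-line, so the incomplete tail must stay buffered for the next read. That logic, isolated as a pure helper (a sketch for illustration, not an SDK API):

```typescript
// Sketch: extract complete SSE data payloads from a buffer, returning
// whatever trailing partial line must be carried into the next chunk.
function drainSSEBuffer(buffer: string): { events: string[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop()!; // last element may be an incomplete line
  const events = lines
    .filter((line) => line.startsWith("data: "))
    .map((line) => line.slice(6));
  return { events, rest };
}
```

Feed it a chunk that ends mid-JSON and it emits nothing; prepend the remainder to the next chunk and the full event comes out intact.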
WebSocket streaming: bidirectional and persistent
SSE is one-directional: server pushes to client. That covers "user sends message, model responds." But some scenarios need the client to push events during a stream — cancellation, voice interruption, real-time audio input. That's where WebSockets come in.
WebSockets open a persistent, full-duplex TCP connection. Either side can send messages at any time. More complexity, but you get bidirectional communication.
The defining capability: a cancel message travels from client to server over the same connection that is actively streaming tokens.
Server implementation
The server manages an AbortController per connection. When the client sends cancel, the server aborts the stream immediately:
import { WebSocketServer } from "ws";
import { ChanlClient } from "@chanl-ai/sdk";
const wss = new WebSocketServer({ port: 8080 });
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
wss.on("connection", (ws) => {
let activeController: AbortController | null = null;
ws.on("message", async (raw) => {
const message = JSON.parse(raw.toString());
if (message.type === "cancel") {
// Client can cancel mid-stream — this is the WebSocket advantage
activeController?.abort();
ws.send(JSON.stringify({ type: "cancelled" }));
return;
}
if (message.type === "chat") {
activeController = new AbortController();
const start = Date.now();
try {
const stream = chanl.chat.stream(
message.agentId,
message.messages,
{ signal: activeController.signal }
);
let tokenCount = 0;
for await (const chunk of stream) {
if (chunk.type === "token") {
tokenCount++;
ws.send(
JSON.stringify({
type: "token",
content: chunk.content,
metrics: {
ttft: tokenCount === 1 ? Date.now() - start : undefined,
tokenCount,
},
})
);
}
if (chunk.type === "done") {
const elapsed = Date.now() - start;
ws.send(
JSON.stringify({
type: "done",
metrics: {
totalMs: elapsed,
tokenCount,
tokensPerSecond: Math.round((tokenCount / elapsed) * 1000),
},
})
);
}
}
} catch (err: any) {
if (err.name === "AbortError") return; // Expected on cancel
ws.send(JSON.stringify({ type: "error", message: err.message }));
} finally {
activeController = null;
}
}
});
});
Client-side WebSocket with cancellation
This wrapper gives you a clean API for sending messages and cancelling mid-stream:
class StreamingChatClient {
private ws: WebSocket;
private handlers = new Map<string, (data: any) => void>();
constructor(url: string) {
this.ws = new WebSocket(url);
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
this.handlers.get(data.type)?.(data);
};
}
send(agentId: string, messages: Array<{ role: string; content: string }>) {
this.ws.send(JSON.stringify({ type: "chat", agentId, messages }));
}
cancel() {
this.ws.send(JSON.stringify({ type: "cancel" }));
}
on(event: string, handler: (data: any) => void) {
this.handlers.set(event, handler);
return this;
}
}
// Usage
const chat = new StreamingChatClient("ws://localhost:8080");
chat
.on("token", (data) => appendToUI(data.content))
.on("done", (data) => showMetrics(data.metrics))
.on("cancelled", () => showCancelledMessage());
chat.send(agentId, [{ role: "user", content: "Explain streaming" }]);
// User clicks "Stop" — cancels mid-stream
stopButton.onclick = () => chat.cancel();
The key difference from SSE: cancel() travels over the same connection receiving tokens. With SSE, cancellation requires closing the connection entirely or sending a separate HTTP request.
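One thing EventSource gives you for free that this WebSocket client lacks is reconnection: with WebSockets you implement retry yourself. A minimal exponential-backoff schedule, sketched as a pure function (the parameter defaults are assumptions, not a standard):

```typescript
// Sketch: exponential backoff with a cap, for manual WebSocket reconnection.
function reconnectDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// In the client: on "close", schedule a retry instead of giving up.
// this.ws.onclose = () =>
//   setTimeout(() => this.connect(), reconnectDelayMs(this.attempts++));
```

Production retry logic usually adds random jitter so many clients don't reconnect in lockstep after an outage.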
Streaming tool calls and structured outputs
Plain text streaming is straightforward — append each token to a string. Tool calls are where it gets hard. When an LLM invokes a function, it streams the call as partial JSON fragments:
// These arrive as separate chunks:
{"type": "tool_call_delta", "arguments": "{\""}
{"type": "tool_call_delta", "arguments": "query"}
{"type": "tool_call_delta", "arguments": "\": \""}
{"type": "tool_call_delta", "arguments": "weather"}
{"type": "tool_call_delta", "arguments": " in"}
{"type": "tool_call_delta", "arguments": " NYC"}
{"type": "tool_call_delta", "arguments": "\"}"}
You can't JSON.parse() any individual chunk. You need to accumulate fragments and detect when you have a complete JSON object. Here's an accumulator that buffers each fragment by index, trying to parse after every addition:
interface ToolCallBuffer {
id: string;
name: string;
argumentsBuffer: string;
}
interface CompleteToolCall {
id: string;
name: string;
arguments: Record<string, unknown>;
}
function createToolCallAccumulator() {
const buffers = new Map<number, ToolCallBuffer>();
return {
feed(delta: {
index: number;
id?: string;
name?: string;
arguments?: string;
}): { complete: boolean; toolCall?: CompleteToolCall } {
// Initialize buffer for new tool call
if (!buffers.has(delta.index)) {
buffers.set(delta.index, {
id: delta.id || "",
name: delta.name || "",
argumentsBuffer: "",
});
}
const buffer = buffers.get(delta.index)!;
// Accumulate pieces
if (delta.id) buffer.id = delta.id;
if (delta.name) buffer.name = delta.name;
if (delta.arguments) buffer.argumentsBuffer += delta.arguments;
// Try to parse — if it succeeds, the tool call is complete
try {
const parsed = JSON.parse(buffer.argumentsBuffer);
buffers.delete(delta.index);
return {
complete: true,
toolCall: {
id: buffer.id,
name: buffer.name,
arguments: parsed,
},
};
} catch {
// JSON not complete yet — keep accumulating
return { complete: false };
}
},
reset() {
buffers.clear();
},
};
}
Wire the accumulator into your stream — text tokens go to the UI, tool call fragments feed the accumulator until complete JSON emerges:
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
const accumulator = createToolCallAccumulator();
const stream = chanl.chat.stream(agentId, messages, { tools: true });
for await (const chunk of stream) {
// Regular text tokens
if (chunk.type === "token") {
appendToUI(chunk.content);
}
// Tool call fragments
if (chunk.type === "tool_call_delta") {
const result = accumulator.feed({
index: chunk.index,
id: chunk.id,
name: chunk.name,
arguments: chunk.arguments,
});
if (result.complete) {
console.log(
`Tool call complete: ${result.toolCall!.name}`,
result.toolCall!.arguments
);
// Execute the tool, send results back to the model
}
}
}
Letting the SDK handle accumulation
The Chanl SDK handles this automatically. The tool_call event fires only when a complete, parsed tool call is ready:
import { ChanlClient } from "@chanl-ai/sdk";
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
const stream = chanl.chat.stream(agentId, messages, { tools: true });
for await (const chunk of stream) {
if (chunk.type === "token") {
document.getElementById("response")!.textContent += chunk.content;
}
if (chunk.type === "tool_call") {
// Tool call JSON fully assembled from streaming chunks
showToolResult(chunk);
console.log(`Tool: ${chunk.name}`, chunk.arguments);
}
if (chunk.type === "done") {
console.log(`TTFT: ${chunk.metrics.ttft}ms`);
console.log(`Tokens/sec: ${chunk.metrics.tokensPerSecond}`);
}
}
The SDK normalizes the event format regardless of model provider — whether it's OpenAI's choices[0].delta structure or Anthropic's content_block_delta. You write one handler; it works with both.
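For illustration, here is roughly what that normalization involves. The provider shapes below match the publicly documented OpenAI and Anthropic streaming formats; the unified token shape is this tutorial's convention, and normalizeDelta is a sketch, not a published SDK function:

```typescript
// Sketch: map provider-specific stream deltas to one event shape.
type TokenEvent = { type: "token"; content: string };

function normalizeDelta(raw: any): TokenEvent | null {
  // OpenAI chat completions: { choices: [{ delta: { content } }] }
  const openai = raw?.choices?.[0]?.delta?.content;
  if (typeof openai === "string") return { type: "token", content: openai };
  // Anthropic messages: { type: "content_block_delta", delta: { type: "text_delta", text } }
  if (raw?.type === "content_block_delta" && raw.delta?.type === "text_delta") {
    return { type: "token", content: raw.delta.text };
  }
  return null; // non-text event (tool call delta, stop, etc.)
}
```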
React streaming with Chanl SDK
A streaming chat UI in React introduces state management challenges: appending tokens without re-rendering the entire list, tracking metrics, supporting cancellation, handling tool calls — all while staying responsive.
The SDK's React hook wraps all of this into a single call:
import { useStreamingChat } from "@chanl-ai/sdk/react";
function ChatInterface({ agentId }: { agentId: string }) {
const { messages, isStreaming, ttft, tokensPerSecond, send, cancel } =
useStreamingChat(agentId, {
onToolCall: (call) => {
// Show real-time tool execution in UI
addToolCallIndicator(call.name, call.arguments);
},
});
return (
<div className="flex flex-col h-full">
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.map((msg) => (
<MessageBubble key={msg.id} {...msg} />
))}
</div>
{isStreaming && (
<div className="flex items-center gap-3 px-4 py-2 text-sm text-muted-foreground">
{ttft && <span>TTFT: {ttft}ms</span>}
{tokensPerSecond && <span>{tokensPerSecond} tok/s</span>}
<button
onClick={cancel}
className="ml-auto text-destructive hover:underline"
>
Stop generating
</button>
</div>
)}
<ChatInput onSend={send} disabled={isStreaming} />
</div>
);
}
Under the hood, useStreamingChat handles four things you'd otherwise build yourself:
- Optimistic message insertion. The user's message appears before the server acknowledges it.
- Token batching. Tokens batch on requestAnimationFrame boundaries — one React render per frame instead of per token.
- Abort controller. cancel calls AbortController.abort(), tearing down the SSE connection cleanly.
- TTFT tracking. Measures time from send() to the first token event automatically.
Manual implementation (without SDK)
If you're not using React or need full control, here's what the manual version looks like — about 80 lines of state management covering token buffering, animation-frame batching, and abort control:
function useManualStreamingChat(endpoint: string) {
const [messages, setMessages] = useState<Message[]>([]);
const [isStreaming, setIsStreaming] = useState(false);
const [ttft, setTtft] = useState<number | null>(null);
const controllerRef = useRef<AbortController | null>(null);
const tokenBufferRef = useRef("");
const rafRef = useRef<number | null>(null);
// Batch DOM updates to animation frames
const flushTokens = useCallback(() => {
if (!tokenBufferRef.current) return;
const tokens = tokenBufferRef.current;
tokenBufferRef.current = "";
setMessages((prev) => {
const updated = [...prev];
const last = updated[updated.length - 1];
updated[updated.length - 1] = {
...last,
content: last.content + tokens,
};
return updated;
});
rafRef.current = null;
}, []);
const appendToken = useCallback(
(token: string) => {
tokenBufferRef.current += token;
if (!rafRef.current) {
rafRef.current = requestAnimationFrame(flushTokens);
}
},
[flushTokens]
);
const send = useCallback(
async (content: string) => {
const controller = new AbortController();
controllerRef.current = controller;
setIsStreaming(true);
setTtft(null);
// Add user message + empty assistant message
const userMsg: Message = {
id: crypto.randomUUID(),
role: "user",
content,
};
const assistantMsg: Message = {
id: crypto.randomUUID(),
role: "assistant",
content: "",
};
setMessages((prev) => [...prev, userMsg, assistantMsg]);
const start = Date.now();
let sawFirstToken = false; // local flag avoids reading stale ttft state inside this closure
try {
const response = await fetch(endpoint, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages: [{ role: "user", content }] }),
signal: controller.signal,
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop()!;
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
const data = JSON.parse(line.slice(6));
if (data.type === "token") {
if (!sawFirstToken) {
sawFirstToken = true;
setTtft(Date.now() - start);
}
appendToken(data.content);
}
}
}
} catch (err: any) {
if (err.name !== "AbortError") console.error(err);
} finally {
flushTokens(); // Flush remaining tokens
setIsStreaming(false);
controllerRef.current = null;
}
},
[endpoint, appendToken, flushTokens]
);
const cancel = useCallback(() => {
controllerRef.current?.abort();
}, []);
return { messages, isStreaming, ttft, send, cancel };
}
The requestAnimationFrame batching is the critical piece. Without it, you'd trigger a React render per token — 80-100 renders per second on a fast model — causing janky, sluggish UI.
Backpressure: when the client can't keep up
GPT-4o generates 80-100 tokens per second. Each token triggers a DOM update. On a fast laptop, fine. On a budget Android phone rendering Markdown with syntax highlighting, the browser falls behind. Tokens arrive faster than the UI can paint them.
Without handling, you get one of two failures: unbounded buffer growth (eventually crashing the tab) or an unresponsive UI choking on a backlog of queued updates.
Server-side drain handling
When res.write() returns false, the kernel's TCP send buffer is full. This pattern pauses the stream until the client catches up:
import express from "express";
import { ChanlClient } from "@chanl-ai/sdk";
const app = express();
const chanl = new ChanlClient({
apiKey: process.env.CHANL_API_KEY,
model: "claude-sonnet-4-20250514",
});
app.post("/api/chat/stream", async (req, res) => {
res.setHeader("Content-Type", "text/event-stream");
res.flushHeaders();
const stream = chanl.chat.stream(req.body.agentId, req.body.messages);
for await (const chunk of stream) {
if (chunk.type !== "token") continue;
const payload = `data: ${JSON.stringify({ type: "token", content: chunk.content })}\n\n`;
const canContinue = res.write(payload);
if (!canContinue) {
// Buffer is full — wait for the client to catch up
await new Promise<void>((resolve) => res.once("drain", resolve));
}
}
res.end();
});
Client-side batch rendering
Buffer tokens and flush once per frame. This caps DOM writes at 60/second regardless of token rate:
class TokenRenderer {
private buffer = "";
private frameId: number | null = null;
private element: HTMLElement;
constructor(element: HTMLElement) {
this.element = element;
}
append(token: string) {
this.buffer += token;
if (!this.frameId) {
this.frameId = requestAnimationFrame(() => this.flush());
}
}
private flush() {
if (this.buffer) {
// Single DOM write per frame — 60fps max
this.element.textContent += this.buffer;
this.buffer = "";
}
this.frameId = null;
}
}
On a fast connection receiving 100 tokens/second, you batch roughly 1-2 tokens per frame. On a slow device, the buffer absorbs bursts without dropping frames.
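The arithmetic behind that claim, assuming a 60Hz display:

```typescript
// Tokens that accumulate between two animation frames at a given token rate.
function tokensPerFrame(tokensPerSecond: number, fps = 60): number {
  return tokensPerSecond / fps;
}
// 100 tok/s at 60 fps is about 1.7 tokens per flush
```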
Production: load balancers, proxies, and edge cases
Streaming works perfectly on localhost. Then you deploy behind Nginx, Cloudflare, or an AWS ALB, and tokens arrive in batches. The culprit is almost always response buffering.
Nginx configuration for SSE
Nginx buffers upstream responses by default. Every directive here matters — miss one and tokens batch up:
location /api/chat/stream {
proxy_pass http://backend:3000;
# Required for SSE streaming
proxy_buffering off;
proxy_cache off;
# X-Accel-Buffering is a response header your app sends, not something
# set with proxy_set_header — see the note below the config
# HTTP/1.1 keepalive
proxy_http_version 1.1;
proxy_set_header Connection '';
# Don't timeout long-running streams
proxy_read_timeout 300s;
proxy_send_timeout 300s;
# Disable gzip — it buffers until it has enough data to compress
gzip off;
}
Why each one matters:
- proxy_buffering off — Stops Nginx from waiting for a "full" response before forwarding.
- X-Accel-Buffering: no — A response header your app sends to disable Nginx buffering for that specific response. In Express: res.setHeader("X-Accel-Buffering", "no").
- Connection '' — Enables keepalive. Without it, Nginx may close the connection early.
- gzip off — Gzip buffers data until it has enough to compress. For tiny SSE events, it just adds latency.
Recovering from mid-stream disconnects
When the connection drops at token 47 of 200, the browser's EventSource reconnects automatically — but restarts from scratch, which is wrong for chat. Tag each event with a sequence number so the client can resume:
// Server: tag each event with a sequence number
let seq = 0;
for await (const chunk of stream) {
if (chunk.type === "token") {
res.write(
`id: ${++seq}\ndata: ${JSON.stringify({
type: "token",
content: chunk.content,
seq,
})}\n\n`
);
}
}
// Client: track last received sequence
let lastSeq = 0;
const source = new EventSource(`/api/chat/stream?lastSeq=${lastSeq}`);
source.onmessage = (event) => {
const data = JSON.parse(event.data);
lastSeq = data.seq;
// ... render token
};
// On reconnect, EventSource sends Last-Event-ID header automatically
// Server checks this and skips already-sent events
Timeout gotchas across your stack
Different layers have different defaults. Any one of them being too short kills your stream mid-response — especially during long tool calls when no tokens flow:
| Layer | Default timeout | Fix |
|---|---|---|
| Nginx | 60s proxy_read_timeout | Set to 300s for long streams |
| AWS ALB | 60s idle timeout | Increase to 300s in target group settings |
| Cloudflare | 100s (Free), 600s (Enterprise) | Send periodic : keepalive\n\n comments |
| Browser | No timeout for SSE | N/A — but HTTP/2 connections may timeout |
| Node.js | 300s requestTimeout (Node 18+; server.timeout defaults to 0) | server.requestTimeout = 0 to disable |
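For the Node.js row, it's safest to set the server-level timeouts explicitly rather than rely on defaults, which have changed across Node versions. A sketch (the 60s headersTimeout value is an assumption, not a requirement):

```typescript
import { createServer } from "node:http";

// Sketch: lift Node's server-level timeouts for long-lived streams.
const server = createServer((req, res) => {
  /* streaming handler */
});
server.timeout = 0;             // per-socket inactivity timeout: disabled
server.requestTimeout = 0;      // total request window: disabled
server.headersTimeout = 60_000; // still bound header parsing (assumed 60s)
```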
Cloudflare deserves special attention. It terminates connections that go silent for 100 seconds. If your model is running a tool call that takes 90 seconds, no tokens flow during that window. Send SSE comments as heartbeats:
// Keep Cloudflare alive during tool execution
const heartbeat = setInterval(() => {
res.write(": keepalive\n\n");
}, 30000); // Every 30 seconds
// Clean up when the stream ends. The stream is consumed with
// for await elsewhere, so clear the interval in a finally block:
try {
for await (const chunk of stream) { /* forward chunk to client */ }
} finally {
clearInterval(heartbeat);
}
Pick SSE unless you need WebSockets
After building both, here's the decision framework:
| Factor | SSE | WebSocket |
|---|---|---|
| Direction | Server to client only | Bidirectional |
| Protocol | Standard HTTP | Upgrade to ws:// protocol |
| Reconnection | Automatic (built into EventSource) | Manual — you implement retry logic |
| Proxy/CDN support | Works everywhere (standard HTTP) | Needs explicit proxy support |
| Browser API | EventSource (built-in, 3 lines) | WebSocket (built-in, more setup) |
| HTTP/2 multiplexing | Yes — multiple SSE streams over one TCP connection | No — each WebSocket is a separate TCP connection |
| Auth | Standard HTTP headers (cookies, Bearer tokens) | Auth in query params or first message (no headers on upgrade) |
| Use case | Chat streaming, notifications, real-time updates | Voice AI, collaborative editing, gaming |
Use SSE when: the user sends a message and the server streams a response. This covers chatbots, AI assistants, code completion, search-as-you-type — roughly 90% of AI streaming use cases. If you've built a RAG pipeline or an agent with tools, SSE is your transport.
Use WebSockets when: you need the client to send events during an active stream. Voice AI with interruption detection is the canonical example — the client streams audio while simultaneously receiving the agent's response. If your architecture involves scoring live conversations and feeding results back during the call, WebSockets give you that bidirectional channel.
Skip HTTP/2 Server Push. It was designed for preloading assets, not event streaming, and most browsers have removed support. HTTP/2's multiplexed streams work great with SSE (multiple SSE connections share one TCP connection), but the streaming protocol on top is still SSE.
For most teams building AI chat products: SSE for streaming, a separate REST endpoint for cancellation if you're using EventSource, and HTTP/2 at the transport layer for multiplexing. That combination handles production traffic with minimal complexity.
Streaming touches every layer of your stack: token generation, SSE/WebSocket protocols, backpressure, proxy buffering, partial JSON parsing, and frontend rendering. Each has its own failure mode, and they compound.
Once you've built it right, though, it's a stable foundation. The SSE server here handles ChatGPT-scale token rates. Backpressure patterns prevent buffer bloat on slow clients. The tool call accumulator handles streaming JSON from any provider. And the React patterns — SDK or manual — keep the UI responsive at any speed.
If you're building agents with tools and MCP servers, streaming is how those tool invocations become visible to users in real time. If you're evaluating agent quality, TTFT and tokens-per-second give you operational visibility batch responses can't match.
Start with SSE. Add WebSockets only when you have a bidirectional use case. Handle backpressure from day one.
Sources
- MDN Web Docs — Server-Sent Events (EventSource API) — Browser API reference for SSE, including reconnection behavior and event format.
- MDN Web Docs — Using Readable Streams — Web Streams API reference for reading streaming fetch responses.
- OpenAI API Reference — Streaming — OpenAI's streaming response format, including tool call deltas and finish reasons.
- Anthropic API Reference — Streaming Messages — Anthropic's streaming event types, including content_block_delta and message_stop.
- Node.js Documentation — Stream Backpressure — Official guide to backpressure in Node.js writable streams, including drain event handling.
- WHATWG — HTML Living Standard: Server-Sent Events — The specification for SSE protocol, event format, and reconnection rules.
- RFC 6455 — The WebSocket Protocol — Full WebSocket protocol specification covering the upgrade handshake and frame format.
- Nginx Documentation — Module ngx_http_proxy_module — Proxy buffering configuration that's critical for SSE passthrough.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.