LLM Token Streaming in Production: SSE vs WebSocket vs Polling — Hard-Won Lessons (2026)
After shipping streaming for 6 production AI apps, I learned SSE, WebSocket, and polling each win different battles. Here is when to pick which, with real numbers from our Hostinger stack.
When I shipped streaming token output to BizChat Revenue Assistant last quarter — our internal sales-coaching app powered by Claude Sonnet — I assumed it would be a one-day job. Wrap the SDK call in an HTTP response, flush chunks as they arrive, done. Two weeks later, after wrestling with a corporate proxy that buffered the first 4 KB, an Nginx config that swallowed event boundaries, and a Hostinger shared host that aggressively closed long connections at 30 seconds, I had to admit the obvious: streaming sounds simple, and is not.
Across the six AI products I have shipped at Wardigi Digital Teknologi (SmartExam AI Generator, DiabeCheck Food Scanner, BizChat, DocSumm AI Summarizer, ServiceBot AI Helpdesk, and ContentForge AI Studio), three streaming transports came up repeatedly: Server-Sent Events (SSE), WebSockets, and polling in its short, long, and chunked-HTTP variants. Each one looked correct on paper. Only one of them was right for any given app — and the trade-offs were rarely the ones the blog posts predicted.
This guide compares the three transports for production LLM workloads in 2026, with a decision matrix you can use the same week, real numbers from our small-team operations, and a list of pitfalls I have personally walked into so you do not have to.
The Three Transports at a Glance
Before we dive in, a one-paragraph refresher. Server-Sent Events are a one-way HTTP-native streaming protocol — the server pushes data: ... lines down an open response, the client consumes them via the browser EventSource API or any HTTP client. WebSockets open a persistent full-duplex TCP connection upgraded from HTTP, allowing either side to send framed messages at any time. Polling covers everything from short-poll every-N-seconds requests to long-poll (the server holds the connection until data is ready) to chunked HTTP responses without the SSE event framing.
The market has already voted: SSE is the default. OpenAI's stream=true, Anthropic's Messages API with stream: true, the Vercel AI SDK, LangChain.js streaming, and the Claude Agent SDK all transmit token chunks over SSE or sit directly on top of an SSE stream. So why does this comparison still matter? Because what your API does internally is not the same as what your app exposes to a browser, a mobile client, or a downstream service.
Server-Sent Events: The Default for a Reason
SSE wins by default in any chatbot, summarizer, content generator, or single-turn agent. The reason is operational, not technical: SSE is just HTTP. Your existing load balancer, your existing reverse proxy, your existing observability stack, and your existing auth middleware all work without changes.
Here is what an SSE response looks like once you strip the ceremony:
HTTP/2 200
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"delta":"The "}

data: {"delta":"answer "}

data: {"delta":"is 42."}

data: [DONE]

Each data: line is a discrete event. The client buffers up to the blank line, parses the payload, and either appends a token or finishes the stream. The browser EventSource object handles reconnection automatically. There are exactly three things that can go wrong, and I have hit all of them.
The three things that go wrong with SSE
- Proxy buffering. Nginx with default settings buffers responses up to 32 KB before flushing, which is fine for static assets and catastrophic for token streams. I had to add proxy_buffering off; to the Nginx config and send X-Accel-Buffering: no on the response before our Hostinger VPS would actually emit incremental chunks.
- Idle-connection killers. On the free plan, Cloudflare closes a proxied connection once the origin has sent nothing for 100 seconds; some corporate proxies and mobile carriers kill anything that does not send a packet within 30 seconds. The fix is sending a comment heartbeat (: keepalive\n\n) every 15 seconds. The browser EventSource ignores comments but the wire stays open.
- HTTP/1.1 connection limit. Browsers allow only six concurrent connections per origin under HTTP/1.1. If you have two open SSE streams plus a normal API call, you are already at half the budget. HTTP/2 fixed this through multiplexing, and as long as your endpoint serves HTTP/2, the problem evaporates. Cheap shared hosting still serves HTTP/1.1 in 2026, so verify with curl -I --http2 before assuming.
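The first two fixes in the list above can be sketched as a handful of server-side helpers. This is a minimal sketch, not our production code: the names sseHeaders, formatEvent, heartbeatFrame and startHeartbeat are made up for illustration, and the write callback stands in for whatever your framework uses to write to the response. Only the header names, the comment syntax and the 15-second interval come from the text.

```typescript
// Hypothetical helper names; only the header names and the wire format are standard.

/** Response headers that keep intermediaries from buffering or caching the stream. */
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    // Asks Nginx not to buffer this particular response.
    "X-Accel-Buffering": "no",
  };
}

/** Serialize one event: optional id line, one data line per payload line, then a blank line. */
function formatEvent(data: string, id?: string): string {
  const idLine = id === undefined ? "" : `id: ${id}\n`;
  const dataLines = data
    .split("\n")
    .map((line) => `data: ${line}\n`)
    .join("");
  return `${idLine}${dataLines}\n`;
}

/** A comment frame: lines starting with ":" are ignored by EventSource but keep the wire busy. */
function heartbeatFrame(): string {
  return ": keepalive\n\n";
}

const HEARTBEAT_INTERVAL_MS = 15_000;

/** Start the heartbeat; call the returned function when the stream ends. */
function startHeartbeat(write: (frame: string) => void): () => void {
  const timer = setInterval(() => write(heartbeatFrame()), HEARTBEAT_INTERVAL_MS);
  return () => clearInterval(timer);
}
```

In a request handler you would set the headers from sseHeaders(), call startHeartbeat with a function that writes to the response, and invoke the returned stop function from the connection's close handler so the timer does not leak.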
When I integrated streaming into DocSumm AI Summarizer (our PDF summarization tool), the measured time-to-first-token over SSE on our Hostinger VPS was around 740 ms — almost identical to a non-streamed POST. The reason: the upstream Claude call itself takes 600 ms to return the first token, and SSE adds essentially no overhead on top.
WebSockets: When Two-Way Becomes Necessary
WebSockets stop being optional the moment the client needs to send anything during the response. Three concrete examples from my work:
- ServiceBot AI Helpdesk live tool execution. The agent calls a tool, the UI surfaces the call mid-stream, and the operator can approve, reject, or modify the arguments before the tool actually runs. That mid-stream back-channel needs WebSockets.
- BizChat collaborative coaching mode. Two sales reps share the same conversation and can both inject context as the AI responds. Multi-writer means duplex.
- SmartExam live grading. While the AI generates feedback for a student answer, an instructor can drag a slider that biases the assessment toward strictness or leniency without restarting the response.
For any single-direction chat, WebSockets are a complexity tax. You lose the HTTP-native auth flow (cookies still travel with the upgrade request, but the browser WebSocket API cannot set an Authorization header, and middleware that expects a clean request/response cycle never sees the messages that follow the handshake). You inherit connection pooling, sticky sessions on the load balancer, custom heartbeats, and reconnection logic that EventSource would have given you for free. You also lose easy CDN caching, which matters less for LLM responses but matters a lot for the rest of the surface area.
One operational warning. The WebSocket upgrade handshake skips most CDN caching layers, so request distribution falls on your origin servers. When ServiceBot's WebSocket-mediated agent layer started receiving 80 concurrent sessions during a client demo, our single 1 GB RAM Hostinger VPS hit memory pressure within 90 seconds. The fix was migrating that single service to a 4 GB VPS — but the architectural lesson is that the cost curve for WebSockets is steeper than the request count would suggest, because each open connection costs RAM even when idle.
Polling: Underrated for the Right Workload
Polling is the punchline of streaming articles — and it should not be. For three specific workload shapes, polling is the correct answer:
- Long-running batch jobs. DocSumm processes 50-page PDFs in 25–60 seconds. The user sees a progress bar that ticks processed 12/47 pages every 2 seconds. We use short polling against a Laravel cache key. SSE for that workload would mean holding a 60-second connection open per user; at over 200 concurrent users it becomes a connection-count problem before it becomes a CPU problem. Polling is stateless on the request path and scales horizontally without sticky sessions.
- Webhook-style integrations. ContentForge AI Studio submits an article-generation request via REST and the client polls a /status/{job_id} endpoint every 3 seconds. Reason: the requesting client is often a downstream service (an n8n workflow, a WordPress plugin) that does not parse SSE properly. Polling speaks plain JSON over plain GET.
- Mobile networks with aggressive idle teardown. Our DiabeCheck mobile app runs on Flutter against a backend that does multi-second OCR-then-LLM pipelines. On 4G we measured a 17% rate of dropped SSE streams when the network handed off between cell towers mid-response. Three-second polling never sees that problem because each request is a fresh TCP setup.
Polling has a perception problem: it sounds primitive. Operationally, polling is the easiest of the three to debug, scale, and observe, and that often matters more than perceived latency. A 3-second polling interval adds 1.5 seconds of delay on average and 3 seconds at worst to a 30-second generation, which is well within what users tolerate, and the trade-off in operational simplicity is real.
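The short-polling loop behind those progress bars can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the JobStatus shape, the pollUntilDone name and the maxAttempts guard are hypothetical and not taken from our codebase, and fetchStatus is injected so the loop does not assume any particular HTTP client.

```typescript
// Hypothetical status shape; a real /status/{job_id} endpoint may return different fields.
type JobStatus =
  | { state: "running"; processed: number; total: number }
  | { state: "done"; result: string }
  | { state: "failed"; error: string };

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Poll until the job reaches a terminal state.
 * fetchStatus is injected so the loop works with fetch, another client, or a test double.
 */
async function pollUntilDone(
  fetchStatus: () => Promise<JobStatus>,
  intervalMs: number,
  onProgress: (processed: number, total: number) => void = () => {},
  maxAttempts = 100,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status.state === "done") return status.result;
    if (status.state === "failed") throw new Error(status.error);
    // Still running: report progress, then wait one interval before asking again.
    onProgress(status.processed, status.total);
    await sleep(intervalMs);
  }
  throw new Error(`job still running after ${maxAttempts} polls`);
}
```

Because every iteration is an independent request, a dropped connection costs one failed poll rather than the whole response. A production version would probably also catch transient errors from fetchStatus and retry instead of letting them propagate.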
Production Trade-offs: The Numbers That Actually Matter
The conventional pitch for streaming is "lower latency" — but the metric that matters to users is time-to-first-token (TTFT), not total response time. Across all six of our products, the upstream LLM provider is responsible for over 90% of TTFT. The transport you choose alters that figure by tens of milliseconds, not seconds.
What does meaningfully differ between transports is operational cost per concurrent user:
| Transport | RAM per idle connection | HTTP-stack reuse | Browser support | Bi-directional |
|---|---|---|---|---|
| SSE | ~8–12 KB (HTTP/2) | Full | EventSource (all modern) | No |
| WebSocket | ~25–40 KB | Partial | WebSocket (all modern) | Yes |
| Short polling | 0 (stateless) | Full | fetch (universal) | Request-response only |
| Long polling | ~5 KB during hold | Full | fetch (universal) | Server push only |
The RAM numbers above are from htop readings on our Hostinger 4 GB VPS during synthetic load tests, with Node.js 22 LTS as the runtime — not from theory. Other runtimes will differ; Bun and Rust-based servers are 2–4× more memory-efficient on idle WebSocket connections in my own benchmarks.
The second number that matters: reconnection cost. SSE reconnects automatically: the browser sends the last id: value it saw in a Last-Event-ID request header, and the stream resumes from there if your server cooperates by keeping recent events around. WebSockets force you to write your own reconnect-and-resync logic, usually 80–150 lines of client code. Polling has no reconnect because every request is a fresh hop, which is exactly why it remains the most reliable transport over flaky networks.
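What "the server cooperates" means in practice can be sketched as follows. The StoredEvent shape and both function names are hypothetical, and the sketch assumes numeric, increasing ids and single-line payloads; the only standard parts are the id: field and the Last-Event-ID request header that browsers send when EventSource reconnects.

```typescript
// Hypothetical in-memory record of what has already been sent on one stream.
interface StoredEvent {
  id: number; // assumed to increase monotonically within a stream
  data: string; // already-serialized, single-line payload
}

/** Events the client has not seen yet, given the Last-Event-ID header from the reconnect. */
function eventsToReplay(
  history: StoredEvent[],
  lastEventId: string | undefined,
): StoredEvent[] {
  // First connection: no header, so send everything that is buffered.
  if (lastEventId === undefined) return history;
  const lastSeen = Number.parseInt(lastEventId, 10);
  // Unparseable header: fall back to a full replay rather than dropping events.
  if (Number.isNaN(lastSeen)) return history;
  return history.filter((event) => event.id > lastSeen);
}

/** Wire format for a replayable event; the id line is what the browser echoes back. */
function toFrame(event: StoredEvent): string {
  return `id: ${event.id}\ndata: ${event.data}\n\n`;
}
```

The design choice here is to make replay a pure function of the history and the header, so the same code serves first connections and reconnects. Where the history lives (process memory, a shared cache) is a separate decision that this sketch leaves open.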
What I Actually Use Across the 6 AI Apps
Stripping out all the qualifying language, here is the picture today across the production stack at Wardigi:
- SmartExam AI Generator — SSE for the student-facing chat layer; polling for the long-running PDF batch grader.
- DiabeCheck Food Scanner — Short polling on Flutter mobile. The OCR→LLM→nutrition lookup pipeline averages 6.4 seconds; the polling overhead is negligible.
- BizChat Revenue Assistant — Mixed. SSE for single-rep coaching, WebSockets for the collaborative two-rep mode.
- DocSumm AI Summarizer — SSE for chat-style "ask the document" queries, polling for full-document summarization.
- ServiceBot AI Helpdesk — WebSockets. Mid-stream human approval of tool calls is the entire product differentiator.
- ContentForge AI Studio — Polling. Most callers are integrations (n8n, WordPress, our own aggregator-site daily-import scripts) and they need plain JSON, not event-stream parsing.
The pattern I would extract from that list: SSE is the default for user-facing chat, polling is the default for batch and integration calls, and WebSockets only enter the picture when the client must talk during the stream.
Decision Matrix: Pick the Right Transport in Under 60 Seconds
Answer three questions in order; the first "yes" decides. If every answer is "no", take the default.
- Does the client need to send data during the server's response? → WebSocket.
- Is the operation longer than 60 seconds, or are most clients non-browser integrations? → Polling.
- Is the network conspicuously unreliable (mobile, corporate VPN, locked-down regions)? → Polling for safety, with optional SSE fallback ladder.
- Default case (browser chat, single-turn, sub-60-second response). → SSE.
I have stress-tested this exact decision tree across 14 projects (the six AI products plus tooling for several Wardigi enterprise clients). It is wrong about 5% of the time — usually when a regulatory requirement (Indonesian financial sector PII handling, for example) forces an unexpected on-prem deployment that breaks an SSE-friendly proxy assumption.
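For teams that prefer the tree in code review rather than on a whiteboard, it translates directly into a small function. The Workload field names are my own labels for the questions above, not an established schema, and the 60-second threshold is the one from the list.

```typescript
type Transport = "websocket" | "polling" | "sse";

// Hypothetical field names, one per question in the decision matrix.
interface Workload {
  clientSendsDuringResponse: boolean;
  operationSeconds: number;
  mostlyNonBrowserClients: boolean;
  unreliableNetwork: boolean;
}

/** Evaluate the questions in order; the first match decides, SSE is the fallback. */
function pickTransport(w: Workload): Transport {
  if (w.clientSendsDuringResponse) return "websocket";
  if (w.operationSeconds > 60 || w.mostlyNonBrowserClients) return "polling";
  if (w.unreliableNetwork) return "polling";
  return "sse";
}

// Example: a browser chat with a short response and a stable network.
const chatChoice = pickTransport({
  clientSendsDuringResponse: false,
  operationSeconds: 20,
  mostlyNonBrowserClients: false,
  unreliableNetwork: false,
});
console.log(chatChoice); // "sse"
```

The order of the checks matters: a workload that needs mid-stream input gets WebSockets even when it is also long-running, which matches how the list is meant to be read.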
Common Pitfalls I Have Walked Into
1. Forgetting to flush after every chunk
Bytes do not always leave the server when you think they do. In Express, the status line and headers are held until the first body write unless you call res.flushHeaders(), so call it right after setting the SSE headers. Avoid any middleware that compresses: compression() buffers output to build gzip blocks, which defeats streaming entirely unless you call the res.flush() method it adds after every write. I lost half a day on this with ContentForge before I noticed that disabling gzip recovered the streaming behaviour.
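A minimal sketch of the write path, assuming an Express-style response object. StreamResponse is a hypothetical local interface that lists only the methods used, so the example stays self-contained; flush() is the method that the compression middleware adds to the response, which is why it is optional here.

```typescript
// Structural stand-in for an Express/Node response, declared locally for the sketch.
interface StreamResponse {
  setHeader(name: string, value: string): unknown;
  flushHeaders(): void;
  write(chunk: string): boolean;
  // Present only when the compression middleware is installed.
  flush?: () => void;
}

/** Send the SSE headers immediately instead of waiting for the first body write. */
function openStream(res: StreamResponse): void {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.flushHeaders();
}

/** Write one token as an SSE event and push it past any compression buffer. */
function sendToken(res: StreamResponse, token: string): void {
  res.write(`data: ${JSON.stringify({ delta: token })}\n\n`);
  // No-op when compression is not installed.
  res.flush?.();
}
```

If you control the middleware stack, excluding the streaming route from compression altogether is the simpler option; the flush call is for cases where you cannot.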
2. Authentication on EventSource
The browser EventSource API does not let you set custom headers, so a bearer token cannot travel in an Authorization header. You have three options: send the token as a query parameter (works, but ends up in access logs), set an auth cookie before opening the stream (works same-site out of the box; cross-origin streams need withCredentials: true and matching CORS headers), or use a third-party EventSource polyfill that supports headers. I default to cookies for browser clients and headers for server-to-server.
3. JSON event payloads getting chopped across chunks
If your data: line contains a JSON object and a network or proxy boundary falls mid-payload, a client that parses each chunk as it arrives sees half a JSON document. The fix has two halves: send one complete JSON object per data: line, and on the client buffer incoming bytes until the blank line that ends the event before calling JSON.parse. I learned this on BizChat when long tool-call payloads occasionally produced Unexpected end of JSON input in production but never in dev, where chunks and events were perfectly aligned.
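The browser EventSource does this buffering for you; the bug shows up in hand-rolled clients that read the response body chunk by chunk. A minimal sketch of a parser that only emits complete events is below. The SseParser class is illustrative, handles LF and CRLF line endings only, and ignores the event:, id: and retry: fields for brevity.

```typescript
/** Incremental SSE parser: feed it raw chunks, get back only complete events. */
class SseParser {
  private buffer = "";

  /** Returns the data payload of every event completed by this chunk. */
  push(chunk: string): string[] {
    // Normalize CRLF so the event delimiter is always "\n\n".
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: string[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      const data = rawEvent
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        // Drop the field name and the single optional space after the colon.
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");
      // Comment-only frames such as ": keepalive" carry no data and are skipped.
      if (data !== "") events.push(data);
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}

// A payload split across two chunks yields nothing until the blank line arrives.
const parser = new SseParser();
const afterFirstChunk = parser.push('data: {"delta":"The ');
const afterSecondChunk = parser.push('answer"}\n\n: keepalive\n\ndata: [DONE]\n\n');
```

Only the strings returned by push are safe to hand to JSON.parse; anything still in the buffer is by definition an incomplete event.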
4. WebSocket sticky sessions on load balancers
If you run multiple Node processes behind a round-robin load balancer, a reconnecting WebSocket will land on a different worker and lose all session state. Enable sticky sessions (ip_hash in Nginx) or store the session state in Redis. We hit this on ServiceBot when scaling from one to two workers — the reconnection storm took the cluster down for 90 seconds before sticky sessions were added.
5. Polling intervals that hammer the LLM provider
A 1-second polling loop against an in-memory queue is fine. A 1-second polling loop that re-calls the LLM provider on every poll because nobody added job caching is how you burn through your API budget in a weekend. Use a short-lived cache key (5–10 seconds) for the polled endpoint result.
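The short-lived cache can be sketched as a wrapper around whatever produces the status. This is an in-process illustration only: withTtlCache is a hypothetical helper, the Map never evicts old entries, and a multi-worker deployment would keep the entry in a shared cache such as Redis instead. The clock is injected so the expiry logic can be tested without waiting.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

/** Wrap an expensive lookup so repeated polls within ttlMs reuse the last result. */
function withTtlCache<T>(
  load: (jobId: string) => T,
  ttlMs: number,
  now: () => number = Date.now,
): (jobId: string) => T {
  const cache = new Map<string, CacheEntry<T>>();
  return (jobId) => {
    const hit = cache.get(jobId);
    if (hit !== undefined && hit.expiresAt > now()) return hit.value;
    // Miss or expired: do the expensive work once and remember the result.
    const value = load(jobId);
    cache.set(jobId, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

With a 5-second TTL and a 1-second polling loop, roughly four out of five polls are answered from the cache, whatever the lookup behind it costs.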
FAQ
Is gRPC streaming a viable alternative?
Yes, for server-to-server. No, for browser clients without a gRPC-Web proxy. I have used gRPC-streaming between two backend services where both ends speak protobuf and the throughput is high enough to justify the binary framing. For the public-facing surface, gRPC-Web adds an extra Envoy proxy and most browser-side tooling assumes JSON. The complexity rarely pays off.
Can I run SSE on shared Hostinger hosting?
Mostly. The shared plans I use for blog hosting impose a 60-second connection cap which limits SSE to short responses. For anything serious, move to their VPS tier (the smallest plan handles 100–150 concurrent SSE connections comfortably on Node 22).
How do I make SSE work behind Cloudflare?
Cloudflare proxies SSE by default but kills idle connections at 100 seconds on free and Pro plans. Send a heartbeat comment every 15 seconds and the connection survives. If you need longer-than-100-second responses, switch to Cloudflare Enterprise or bypass Cloudflare for the streaming endpoint via a subdomain that points directly at origin.
What is the actual time-to-first-token I should target?
On Claude Opus 4.7 or Sonnet 4.6 in 2026, target around 600–900 ms TTFT from a user-facing endpoint. On GPT-5.2 we have measured 500–800 ms. Gemini 2.5 Pro is faster on TTFT (350–600 ms) but slower on inter-token latency once streaming starts. The transport contributes at most tens of milliseconds in all three cases; the model and the region you call it from are the dominant factors.
Is HTTP/3 going to change any of this?
Mildly. HTTP/3 over QUIC reduces head-of-line blocking and survives network handoff slightly better, which helps SSE on flaky mobile networks. But the operational story does not change — same buffering pitfalls, same auth quirks. I have not seen HTTP/3 unlock a capability the others do not already offer.
The Recommendation, Stripped Down
If I were starting a new AI product tomorrow and had to make this decision in one sitting:
- Ship SSE for the chat surface. It is what OpenAI, Anthropic, and the major SDKs already produce. Your job is to forward the chunks, not invent a new framing.
- Add polling for the long jobs. Track the job in Redis with a 10-minute TTL, poll the status, return the result when ready.
- Pull in WebSockets only when the product requires the client to type, click, or move sliders mid-stream — and budget extra DevOps time for sticky sessions and reconnection logic.
What I would not do is the thing I see in nearly every "modern AI stack" tutorial: write a WebSocket-everywhere implementation because it sounds more advanced. The implementations are heavier, the failure modes are worse, and the user-perceived latency improvement over SSE is essentially zero for chat workloads. The boring transport is, here as in most of distributed systems, the right one.
If your team is currently rebuilding streaming for an LLM product — particularly if you are migrating from a polling MVP to a real-time experience — start with SSE, instrument the time-to-first-token from the browser, and only escalate to WebSockets when the product feature list explicitly demands client-side mid-stream input. The architecture you save will be your own.