vLLM vs SGLang vs TensorRT-LLM vs Ollama: Self-Hosted Serving 2026
A production-tested comparison of vLLM, SGLang, TensorRT-LLM, and Ollama for self-hosted LLM serving in 2026 — throughput, cold-start, cost math, and decision matrix from running a 4-product AI backend on a shared H100.
When I priced out switching our SmartExam AI Generator from the OpenAI API to a self-hosted Llama 3.3 70B model on an H100 rental, the math was uncomfortable. We were spending roughly $1,840/month on API calls to handle ~120,000 question-generation requests, and a single H100 on a 1-year reserved contract sat around $1,950/month before any utilization. The only way the math turned green was if we picked the right inference engine and pushed concurrent throughput hard enough to amortize the GPU across multiple internal products (SmartExam, DocSumm, BizChat, and ContentForge all run on the same backend).
That month turned into a deep evaluation of vLLM, SGLang, TensorRT-LLM, and Ollama on our actual workload patterns. Here is what we learned — pricing, throughput, cold start, gotchas, and which engine I would now recommend per use case in May 2026.

Why self-hosting got serious in 2026
Two shifts changed the calculus this year. First, open-weight models caught up: Llama 3.3 70B Instruct now sits within roughly 4 points of GPT-4o-mini on MMLU and within 6 points on HumanEval based on the lmsys.org leaderboard I checked last week. Second, NVIDIA H100 rental pricing on RunPod and Lambda dropped from around $2.49/hr in mid-2025 to $1.79–$2.10/hr in May 2026 for on-demand, with reserved 1-year contracts available at $1.49–$1.65/hr.
When per-token API costs hit $0.15–$0.60 per 1M tokens for the cheapest frontier models and $2.50–$15.00 for the premium tiers, any team doing more than roughly 80M tokens/month on premium models can break even on a single H100 — if the inference engine is dialed in. That is the entire point of this comparison. The engine choice is where 30–50% of the cost savings hide.
The four engines and what they actually are
vLLM — the production default
vLLM is the engine I would tell a small team to start with in 2026. Its headline trick is PagedAttention, which manages the KV cache the same way an operating system manages virtual memory pages, so the GPU stops wasting memory on empty slots in batched requests. The current release as of mid-May 2026 is patched on v0.20.0, with full Google Gemma 4 architecture support (MoE, multimodal, reasoning, tool-use) and DeepSeek V4 stabilization rolled in.
What I like in practice: the OpenAI-compatible API server is good enough that swapping our SmartExam app from api.openai.com to our internal vLLM endpoint required exactly four lines of config changes. No client SDK rewrites.
SGLang — RadixAttention for shared prefixes
SGLang is the engine I would pick if your traffic looks like a chatbot, a RAG system, or any agent loop where many requests share the same system prompt or context prefix. Its core innovation is RadixAttention, a KV cache management scheme that stores cached attention activations in a radix tree and reuses them across requests with common prefixes.
The latest stable release, SGLang v0.5.9, added native Anthropic API compatibility and integrated TRT-LLM DSA kernels into its Native Sparse Attention backend for DeepSeek V3.2, claiming 3x–5x speedup on Blackwell hardware. Practical translation: if your team is running an agent loop where the system prompt is 4,000 tokens and you send 50 turns through it, SGLang stops re-attending the prompt on every turn. That matters more than benchmarks suggest in production.
TensorRT-LLM — peak NVIDIA performance, painful setup
TensorRT-LLM is what you reach for if you have exactly one model in long-term production, you own or rent NVIDIA hardware, and total throughput per dollar is the only number that matters. The engine compiles a model-specific binary that squeezes maximum performance out of NVIDIA tensor cores. Once compiled, it leads vLLM at every concurrency level — by about 8% at 1 request, growing to 13% at 50 concurrent requests, and up to 30–50% on total throughput in very high-concurrency scenarios.
The cost: the compilation step in our test took 28 minutes, versus 58 seconds for SGLang and 62 seconds for vLLM. The first time we tried to recompile after switching from Llama 3.3 70B FP16 to a quantized FP8 build, we burned three hours debugging a calibration dataset issue. You commit to TensorRT-LLM; it does not commit to you.
Ollama — the laptop and prototype engine
Ollama is the right choice for one job: getting a model running on a developer laptop or a single-user internal tool in under five minutes. The CLI is clean, the model library is solid, and the GGUF quantization story makes a 70B model fit on consumer hardware (slowly).
It is not a production serving engine. Concurrent throughput is poor — past one or two simultaneous requests, latency climbs steeply because Ollama does not batch the way vLLM and SGLang do. I use it daily on my MacBook for local prototyping; I would never put it behind a production load balancer.
Side-by-side: the numbers that matter
| Dimension | vLLM | SGLang | TensorRT-LLM | Ollama |
|---|---|---|---|---|
| Throughput (Llama 8B, H100) | ~12,500 tok/s | ~16,200 tok/s (+29%) | ~14,100 tok/s (compiled) | ~600 tok/s (1 user) |
| Cold start to ready | ~62 sec | ~58 sec | ~28 min (compile) | ~10 sec |
| Setup difficulty (1–10) | 3 | 4 | 9 | 1 |
| Shared-prefix caching | Prefix caching v2 | RadixAttention | KV reuse (manual) | None meaningful |
| Hardware support | NVIDIA, AMD, Intel, TPU | NVIDIA, AMD (partial) | NVIDIA only | NVIDIA, Apple Silicon, CPU |
| OpenAI API compatibility | Yes (full) | Yes (full + Anthropic) | Via Triton wrapper | Yes (subset) |
| Quantization formats | AWQ, GPTQ, FP8, INT8, BNB | AWQ, GPTQ, FP8 | FP8, INT8, INT4 (compiled) | GGUF (Q4_K_M, Q5, Q8) |
| Best for | General production | Chatbots, RAG, agents | Single model, max throughput | Local dev, single user |
The benchmark figures are aggregated from the H100 SXM5 80GB results published by Spheron, particula.tech, and the n1n.ai inference engine comparison in March 2026. At 70B scale the SGLang-over-vLLM delta shrinks to 3–5% — at that size, prefill is a smaller fraction of total cost, so RadixAttention has less to win.
Decision matrix from production experience
Here is how I would actually pick between them. This is the framework I now use when a Warung Digital Teknologi client asks "should we self-host?" and the answer turns out to be yes.
Pick vLLM if
- You are doing this for the first time and want the shortest path from "GPU rented" to "first token served."
- Your stack is multi-model — you swap between Llama, Qwen, Mistral, and DeepSeek and you cannot afford to recompile each time.
- You need AMD or Intel hardware support, or you might in the next 12 months.
- Your team is two to ten engineers and you do not have a dedicated MLOps person.
Pick SGLang if
- Your workload has heavy prefix overlap. Chatbots with fixed system prompts. RAG systems with templated context. Agent loops with tool definitions repeated every turn.
- You are running Llama 3.x 8B or any small-to-mid model where prefill dominates the budget.
- You are on Blackwell (B100/B200/GB200) and want to ride the kernel improvements that landed in v0.5.x.
- You need both OpenAI and Anthropic API compatibility from the same server.
Pick TensorRT-LLM if
- You serve one specific model with stable weights for 6+ months.
- You own or have a 1-year reserved contract on NVIDIA hardware.
- You have at least one engineer who has shipped TensorRT engines before.
- Throughput per dollar is the metric your VP is measured on.
Pick Ollama if
- It is on a laptop or a developer workstation.
- You are prototyping and concurrent users is one.
- You want a local, air-gapped tool with zero infra (e.g., a paralegal summarizing privileged documents on-prem).

Production gotchas I wish someone had warned me about
1. KV cache memory is the hidden cost driver
On Llama 3.3 70B FP16, every concurrent request at 8K context burns roughly 2.5 GB of KV cache. An H100 with 80GB has maybe 50GB free after weights at FP16, which caps you around 18–20 concurrent 8K-context requests before vLLM starts preempting. Drop to FP8 weights and you double effective concurrency. Across our four internal products sharing a single H100, FP8 quantization was the single biggest lever — bigger than the engine choice itself.
2. Speculative decoding helps less than you think on shared GPUs
vLLM and SGLang both support speculative decoding with a smaller draft model. On our SmartExam workload (generating multiple-choice questions, output mostly structured JSON), speculative decoding gave us about 1.4x speedup on single-user latency but only 1.05x on aggregate throughput at concurrency 20. The draft model eats GPU cycles that batched generation could have used. If your bottleneck is throughput, not latency, speculative decoding is mostly a wash.
3. Continuous batching defeats request-level rate limiting
Both vLLM and SGLang use continuous batching, meaning new requests join the batch on the next token step rather than waiting for the batch to drain. This makes per-request timeout tuning weird. A request that would take 800ms standalone might take 1,400ms when joining a hot batch — not because the engine is slow, but because the model is generating shared tokens across many users. Our DocSumm timeouts at 1.5x of standalone P99 were too tight; we bumped to 2.5x.
4. Tokenizer mismatches will silently corrupt outputs
When we swapped a Mistral 7B variant for Llama 3.3 8B in our ContentForge stack, we forgot that one of our downstream JSON parsers normalized on the Mistral tokenizer's special tokens. Two weeks of subtly broken article outlines passed before someone noticed. Always pin tokenizer and model together, and add an output validation gate. Lesson learned the embarrassing way.
5. Multi-LoRA serving is real and underrated
vLLM's multi-LoRA support lets you serve dozens of fine-tuned adapters from a single base model in the same process. We trained six product-specific LoRAs (one per internal product) on a Llama 3.3 8B base, and now serve all six from one vLLM instance. The alternative — six separate models in six separate processes — would have needed three more H100s. This is the kind of cost lever that does not show up in benchmark tables.
Cost model: when does self-hosting actually beat the API?
The blunt math from our internal sheet, May 2026, 1-year reserved H100 at $1.55/hr:
- H100 monthly: ~$1,131
- Add storage, egress, load balancer: ~$220
- Engineering time (1 day/month maintenance): ~$600 at our blended rate
- Total monthly fully loaded: ~$1,951
At GPT-4o-mini API pricing ($0.15 input / $0.60 output per 1M tokens), and a 1:1 input/output ratio, you need to be doing roughly 5.2 billion tokens/month to break even versus the API. Most teams will not hit that. The break-even gets much friendlier at GPT-4o or Claude Opus pricing — there, roughly 280M tokens/month covers a single H100. If your team is using premium models heavily, self-hosting is closer than you think.
The other dimension that matters: multi-tenancy across your own products. A single H100 running vLLM with multi-LoRA serving 6 internal products amortizes the cost across all of them. That is what made our math work. A single-product team with 40M tokens/month will lose money self-hosting; a 6-product team sharing a GPU at 240M tokens/month wins on cost and gains data control.
What to monitor in production (the metrics that actually matter)
One non-obvious thing I learned the hard way: the metrics your APM tool gives you by default are not the ones that explain self-hosted LLM behavior. After we lost a Saturday morning chasing a phantom slowdown, I built our internal dashboard around these six signals instead:
- Time to first token (TTFT) per request — this is the prefill latency. If it climbs while throughput is stable, you have a context-length problem, not a model problem.
- Tokens per second per request (decode rate) — separate from TTFT. This tells you if the batch is getting crowded.
- Queue depth — requests waiting to enter the batch. vLLM exposes this on the Prometheus endpoint. A queue depth that grows monotonically is the first warning of capacity exhaustion.
- KV cache utilization percentage — once this passes ~85%, preemption kicks in and tail latency explodes. We alert at 75%.
- Cache hit rate on prefix caching / RadixAttention — if this drops below 40% on a workload you expected to be prefix-heavy, your template changed or your cache size is too small.
- GPU SM occupancy from
nvidia-smi dmon— sustained low occupancy (under 60%) means you are compute-starved on something other than the GPU. Usually CPU-side tokenization, sometimes network egress on streaming responses.
I push all of these into Langfuse for the per-request trace view and into a separate Grafana board fed by the vLLM Prometheus exporter for the system-level view. The two views answer different questions; you need both.
Security and data residency
One reason a couple of our enterprise clients are pushing us toward self-hosting harder than the cost numbers alone would justify: data residency and audit. When a client in regulated industries (we have done work for a regional bank, a hospital chain, and an insurance underwriter) asks "where does the prompt text physically go?", "API to OpenAI" is increasingly not an acceptable answer. Self-hosted in a Jakarta-region VPS is.
vLLM, SGLang, and TensorRT-LLM all have decent secrets hygiene — no telemetry phoning home by default, model weights stay on disk, prompt content stays in process memory. Ollama by default sends model pulls through its own registry; if you are in a fully air-gapped environment, you mirror models from Hugging Face and disable the auto-update path explicitly. All four engines are LGPL/Apache/MIT compatible; nothing in the licensing prevents commercial use.
The other side of the same coin: your own logs. If you log full prompts and outputs to a centralized log service for debugging, you have effectively re-created the same data-egress problem you were trying to avoid. Log redaction patterns and PII scrubbing at the logging layer are mandatory in any self-hosted setup that touches user data.
FAQ
Can I run vLLM on a 4090 instead of an H100?
Yes for models up to ~13B at FP16 or up to ~30B at AWQ 4-bit. The 4090's 24GB VRAM is the cap. Throughput per dollar is actually excellent for small models — we run a Qwen 2.5 7B internal classifier on a 4090 box for under $0.40/hr and it handles roughly 60 concurrent requests at 2K context. The catch is no NVLink, so multi-GPU 70B serving across 4090s is not viable.
Does SGLang work on AMD MI300X?
Partially as of v0.5.9. NVIDIA support is the priority path; AMD support is a community port that lags by one or two releases. If AMD ROCm is a hard requirement, vLLM is the safer choice — its ROCm story is more mature.
How long does it take to compile a TensorRT-LLM engine?
On our setup: Llama 3.3 70B FP8 on an H100 SXM5 took 28 minutes for a first build, and roughly 12 minutes for an incremental rebuild after a config tweak. Quantization calibration adds 20–40 minutes depending on dataset size. Budget half a day for the first end-to-end run.
Is Ollama really not production-ready?
It is production-ready for a workload of "one user at a time using a local desktop app." It is not production-ready for "a web service taking concurrent HTTPS requests." The runtime does not batch effectively above 1–2 concurrent generations, and the model loading semantics are tuned for interactive use, not steady-state serving. Use the right tool for the job.
What about MLX, llama.cpp, or LM Studio?
MLX is excellent on Apple Silicon — I run Qwen 2.5 14B on an M3 Max locally via MLX with surprisingly good throughput. llama.cpp is the engine inside Ollama and shares its single-user characteristic. LM Studio is a GUI wrapper around llama.cpp, useful for non-engineers. None of these is a multi-tenant production engine.
Should I wait for the next NVIDIA generation?
If you are planning a 12-month commitment, no — buy/rent now. Blackwell B100/B200 capacity is constrained and pricing is unfavorable until late 2026. H100 supply is healthy and pricing has been stable for two quarters. The performance delta is real but not "wait six months" real, and inference engine improvements (FP4 in vLLM, sparse attention in SGLang) are closing more of the gap than new silicon is.
What I would do today
If I were starting fresh with a small AI product team in May 2026 — say, three engineers, $40K/month infra budget, mixed workload of one user-facing chatbot plus a few internal automation jobs — here is the stack I would put down on day one:
- One H100 on a 6-month reserved contract from Lambda or RunPod (~$1,200/month all-in).
- vLLM v0.20.x with Llama 3.3 8B FP8 as the default model. Multi-LoRA enabled.
- SGLang as a secondary deployment specifically for the chatbot if it shows heavy prefix overlap (test first).
- Skip TensorRT-LLM until you have a single high-traffic model that has not changed in 3 months.
- Ollama only on developer laptops.
- Add Langfuse for observability from day one, because cost surprises in self-hosted are real.
Self-hosting in 2026 is not the heroic project it was two years ago. The engines are mature, the docs are good, and the hardware is rentable by the hour. The decision is no longer "can we?" — it is "is our workload shape worth the operational overhead?" For our four-product backend at Warung Digital Teknologi, the answer turned out to be yes. For a single-product team doing under 80M tokens/month, the answer is almost always no. The engine choice — vLLM by default, SGLang for shared-prefix workloads, TensorRT-LLM only when you have stability and scale, Ollama only on laptops — is the second decision, not the first.
Enjoyed this article?
Get more AI insights — browse our full library of 98+ articles and 373+ ready-to-use AI prompts.