Vapi vs Retell vs ElevenLabs: Voice AI Agents in Production (2026)
Three weeks, 360 simulated calls, $480 in burned credits. Here's what I learned picking a voice agent stack for ServiceBot AI Helpdesk in 2026.
Last quarter I had to choose a voice agent stack for ServiceBot AI Helpdesk — one of the AI products I built at Warung Digital Teknologi. The brief sounded simple: replace tier-1 phone support for a small B2B client handling roughly 600 inbound calls per week. The actual decision took three weeks, two failed prototypes, and about $480 in burned credits across Vapi, Retell AI, and ElevenLabs Conversational AI before I had something I would put my name on in production.
If you are a developer or product owner staring at the same three options in 2026, this is what I learned. No marketing language, no "leading provider" filler — just the tradeoffs that actually matter once you wire one of these into a real phone number, a real CRM, and a real angry customer.
Why these three keep showing up in 2026
The voice AI tooling space exploded in 2024 and contracted hard through 2025. By April 2026, three vendors keep getting shortlisted on every serious project I see: Vapi for developer-first orchestration, Retell AI for telephony-native deployment, and ElevenLabs Conversational AI for voice quality and brand-forward use cases. Bland AI, Deepgram Voice Agent, and Ultravox all show up as honourable mentions, but those three are what end up on the final spreadsheet 90 percent of the time.
The market context matters here. Conversational AI was a $2.4 billion category in 2024 and analysts now project it crossing $47 billion by 2034 at roughly 34.8 percent CAGR. That growth pulls in a lot of vendors making big claims about latency and naturalness, and most of those claims do not survive contact with a real production workload. Pick wrong and you are paying twice — once for the platform, once for the migration.
What I actually tested
For the ServiceBot pilot I ran the same workload across all three platforms over a two-week window. The workload was deliberately representative, not flattering:
- Mixed Indonesian/English support flow with 4 intents (order status, billing question, refund request, escalate-to-human)
- Backend integration with a Laravel API I wrote earlier in the year for the same client's order system
- Real telephony — Twilio inbound number, no in-browser SDK shortcuts
- ~120 simulated calls per platform pulled from anonymized transcripts of the previous quarter
- Manual scoring on three axes: time-to-first-token, completion accuracy, and "did this sound like a person" judged blind by 3 reviewers from the client's CS team
This is the same kind of small-scale benchmark I would recommend any team run before signing an annual contract. Vendor latency claims are measured under perfect conditions — single call, US data center, no LLM tool calls, no interruptions. Your production numbers will be worse. Mine were.
Vapi: developer-first orchestration
Vapi is the option I keep recommending to engineering-led teams, and it is what I picked for ServiceBot. The mental model is straightforward: Vapi is an orchestration layer, not a voice model. You bring your own STT (Deepgram, AssemblyAI, Gladia), your own LLM (OpenAI, Anthropic, Groq, or self-hosted), and your own TTS (ElevenLabs, Cartesia, PlayHT, OpenAI), and Vapi stitches them together with turn-taking, interruption handling, function calling, and webhook hooks.
The headline number is $0.05 per minute for orchestration. That number is misleading on its own, because it excludes every component you actually need. Once you add Deepgram Nova-3 STT (~$0.0043/min), Claude Haiku 4.5 as the LLM (variable but typically $0.04-$0.08/min for support flows), ElevenLabs Flash v2.5 TTS ($0.06-$0.10/min), and Twilio telephony ($0.014/min for US inbound), my real per-minute cost on ServiceBot landed at $0.21 to $0.27 per minute depending on call length and tool-call density. Vapi's documentation is now upfront about this, but the marketing pages still show $0.05 prominently.
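The arithmetic is worth making explicit. Here is a quick sketch of the per-minute stack cost using the rates above; note that the naive sum still comes in under my measured number, because tool-heavy calls stretch LLM token usage:

```python
# Per-minute cost model for a composed Vapi stack. The rates are the
# ones quoted above; swap in your providers' current pricing.
STACK_RATES = {
    "vapi_orchestration": (0.05,   0.05),
    "deepgram_stt":       (0.0043, 0.0043),
    "llm":                (0.04,   0.08),   # varies with tool-call density
    "elevenlabs_tts":     (0.06,   0.10),
    "twilio_inbound":     (0.014,  0.014),
}

def per_minute_range(rates):
    """Sum the (low, high) per-minute rate of every component."""
    low = round(sum(lo for lo, _ in rates.values()), 4)
    high = round(sum(hi for _, hi in rates.values()), 4)
    return low, high

low, high = per_minute_range(STACK_RATES)
# Sums to roughly $0.17-$0.25/min; my measured $0.21-$0.27 runs higher
# because tool calls inflate LLM usage beyond the per-minute estimate.
```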
What I liked, in order of importance:
- Provider swappability. When Cartesia released a new TTS voice that fit the client's brand better, I changed it from a config file. No code change. This sounds trivial until you need it.
- Tool calling that actually works. Vapi exposes function calling at the LLM layer, so my Laravel order-lookup endpoint plugged in cleanly. I also wired in BizChat Revenue Assistant's recommendation API as a tool — the agent could upsell during a billing call when the lookup showed the customer was on a legacy plan.
- End-of-call analysis. The structured-output post-call summary pipes straight into a webhook. I dump it to a MySQL table and into a small dashboard the client's manager checks every morning. Build time: about 90 minutes.
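The sink itself is small. Here is a sketch of the end-of-call handler, using SQLite for portability (production used MySQL); the payload field names are illustrative, so check Vapi's current webhook schema before copying anything:

```python
import json
import sqlite3

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS call_summaries (
        call_id  TEXT PRIMARY KEY,
        intent   TEXT,
        resolved INTEGER,
        raw      TEXT      -- full payload, kept for later debugging
    )""")

def handle_end_of_call(payload, conn):
    """Persist one end-of-call analysis event. The field names here are
    illustrative, not Vapi's exact schema."""
    analysis = payload.get("analysis", {})
    conn.execute(
        "INSERT OR REPLACE INTO call_summaries VALUES (?, ?, ?, ?)",
        (
            payload.get("call_id"),
            analysis.get("intent"),
            int(bool(analysis.get("resolved"))),
            json.dumps(payload),
        ),
    )
    conn.commit()
```

The morning dashboard is then one SELECT with a GROUP BY intent away.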
What hurt:
- Latency variance. P50 was around 480ms, but P95 spiked to 1100ms on calls with multiple tool calls. Anything above 700ms makes the conversation feel awkward; above 1000ms users start talking over the agent. Vapi's published "sub-500ms" claim was technically true for our P50 but useless as a planning number.
- Multi-invoice billing. By month two I had separate invoices from Vapi, Deepgram, OpenAI, ElevenLabs, and Twilio. Reconciling them for the client's finance team was painful enough that I built a small script to normalize them.
- HIPAA is paywalled. Vapi's HIPAA compliance is a $1,000/month add-on. For our use case it was not relevant, but anyone building healthcare voice agents should price this in early.
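The invoice-normalization script was nothing clever, just a mapping step. A sketch of the idea, with hypothetical column names standing in for each vendor's actual CSV export:

```python
import csv
import io

# Each vendor exports usage with different column names; map them onto
# one ledger schema. These column names are hypothetical -- check your
# actual exports before relying on them.
VENDOR_COLUMNS = {
    "vapi":     {"date": "billing_date", "minutes": "usage_minutes", "usd": "amount_usd"},
    "deepgram": {"date": "period_start", "minutes": "audio_minutes", "usd": "total_usd"},
}

def normalize(vendor, csv_text):
    cols = VENDOR_COLUMNS[vendor]
    return [
        {
            "vendor": vendor,
            "date": row[cols["date"]],
            "minutes": float(row[cols["minutes"]]),
            "usd": float(row[cols["usd"]]),
        }
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
```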
Retell AI: telephony-native and compliance-friendly
Retell positions itself as a purpose-built voice agent platform, and the difference shows when you start dealing with real phone-system messiness. Their pricing is clean: $0.07 per minute, all-in for the voice pipeline (STT + LLM + TTS), with telephony as a separate but pre-integrated cost. No multi-vendor invoice nightmare.
I ran the same ServiceBot workload through Retell during week two of the bake-off. The integration was the fastest of the three — about 4 hours from signup to first live call, mostly because Retell pre-wires Twilio and provides a built-in agent designer that handles function definitions in JSON. If you are a non-engineer or a small team without the bandwidth to debug WebRTC turn-taking models, Retell is the path of least resistance.
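Most platforms have converged on OpenAI-style JSON Schema for function definitions; the outer wrapper keys differ per platform, so treat this shape as the generic pattern rather than Retell's verbatim format:

```python
import json

# Illustrative tool definition for the order-status intent. The
# "parameters" block is standard JSON Schema; the outer keys vary
# by platform, so check the docs of whichever one you deploy on.
ORDER_STATUS_TOOL = {
    "name": "lookup_order_status",
    "description": "Fetch the current status of an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The customer's order number.",
            },
        },
        "required": ["order_id"],
    },
}

definition_json = json.dumps(ORDER_STATUS_TOOL, indent=2)
```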
What stood out:
- Warm handoff to humans. This is Retell's killer feature for support use cases. When my agent hit "escalate to human", Retell handled the SIP transfer cleanly with full call context passed via headers. On Vapi I had to build the same thing manually with three webhooks and a Twilio TwiML detour.
- Structured dialog flows. Retell's flow builder is genuinely useful for compliance-heavy domains. For a healthcare or financial-services client where every prompt needs legal sign-off, having a visual flow you can hand to a non-technical reviewer is worth real money.
- HIPAA and SOC 2 are included on standard tiers. No surprise add-on fees.
What pushed me away from Retell for this specific project:
- Latency was the worst of the three. P50 around 620ms, P95 around 1300ms. Retell's own marketing acknowledges sub-600ms as a target rather than a guarantee. For a B2B support workload it was tolerable; for an outbound sales agent calling cold leads, it would feel sluggish.
- Less flexibility on the LLM side. You can pick from a curated list, but you cannot bring a self-hosted Llama 3.3 70B or a fine-tuned model the way you can on Vapi. For us this was not a blocker, but it would be for any team running a custom-trained domain model.
- Voice selection is narrower than ElevenLabs. The included voices are good, not great. If "sounds exactly like our brand" matters, Retell will frustrate you.
ElevenLabs Conversational AI: the voice quality benchmark
ElevenLabs is the voice you have probably already heard in 2025-2026 demos. Their TTS is the reference standard the rest of the industry compares against, and in November 2025 they shipped Conversational AI as a fully integrated end-to-end voice agent platform with sub-100ms latency on their best tier — currently the lowest published number in the category.
I ran ServiceBot's flow on ElevenLabs for the third week. The voice quality was unmistakable. In blind testing with the client's CS reviewers, ElevenLabs scored 4.6/5 on naturalness vs Vapi's 4.1 and Retell's 3.9. Two of the three reviewers thought one specific call recording was a real person; that did not happen with the other platforms.
What ElevenLabs gets right:
- Voice quality. Obvious, but worth restating. If you are building a brand-forward agent — luxury concierge, dating app voice companion, premium outbound sales — the voice itself is half the product, and ElevenLabs wins this on every comparison I have run.
- Latency. P50 of 280ms in my tests, P95 around 540ms. The lowest of the three. The conversation flow felt almost too snappy at first; we had to dial in turn-taking to add small pauses so the agent did not feel pushy.
- A voice library of 11,000+ options, plus instant voice cloning. The cloning quality is genuinely production-ready. I cloned the client's marketing director's voice for a follow-up campaign and they could not pick out the synthetic call from real recordings.
What stopped me from picking it for ServiceBot:
- Pricing. $0.08 to $0.24 per minute depending on tier. The all-in number is similar to Vapi, but the per-minute floor is higher and there is less headroom to optimize. For a 600-call-per-week support workload averaging 4 minutes per call, the ElevenLabs bill projected $920/month vs $560 on Vapi. The voice quality lift was not worth $360/month for B2B support.
- Less mature tool calling. Their function-calling support shipped in early 2026 and works, but the developer ergonomics are still 6-12 months behind Vapi. For an agent that needs to hit 4-5 backend APIs per call, I would still pick Vapi today.
- No native warm handoff. You can build it, but you have to wire it yourself.
Side-by-side comparison
| Dimension | Vapi | Retell AI | ElevenLabs |
|---|---|---|---|
| Headline price | $0.05/min orchestration only | $0.07/min all-in | $0.08-$0.24/min all-in |
| Real production cost | $0.21-$0.27/min (measured) | $0.18-$0.23/min (measured) | $0.28-$0.34/min (measured) |
| P50 latency (my test) | 480ms | 620ms | 280ms |
| P95 latency (my test) | 1100ms | 1300ms | 540ms |
| LLM flexibility | 14+ providers, BYO model | Curated list, no BYO | Curated list, no BYO |
| Voice quality (blind score) | 4.1 / 5 | 3.9 / 5 | 4.6 / 5 |
| Telephony | BYO (Twilio, Vonage) | Native Twilio + SIP | BYO (Twilio) |
| Warm handoff | DIY (3 hooks) | Native, full context | DIY |
| HIPAA / SOC 2 | $1,000/mo add-on | Included on standard | Enterprise tier only |
| Time to first live call | ~6 hours (in my hands) | ~4 hours | ~5 hours |

The 3 things benchmarks always miss
Marketing comparisons fixate on latency and per-minute pricing. After a year of building voice integrations into ServiceBot AI Helpdesk and BizChat Revenue Assistant, here are the three dimensions I have learned to weight more heavily than the headline numbers.
1. Interruption handling under noise
Every platform claims real-time turn-taking. In a quiet test environment they all work. On a real call from a customer in a noisy warehouse, the failure modes diverge sharply. Vapi and Retell both occasionally misinterpreted background noise as user input and cut off the agent mid-sentence. ElevenLabs handled it best — their VAD is tuned aggressively for false-positive suppression, which costs you a hair of perceived snappiness but eliminates the most jarring conversational failures.
Test this with calls from a phone, in a coffee shop, on speaker. The pristine WebRTC-from-laptop demo will lie to you.
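The tuning knob behind this behaviour is worth understanding even if you never touch it directly. Here is a toy energy-threshold VAD with a consecutive-frame requirement; raising `min_speech_frames` suppresses false positives (the tradeoff I observed in the ElevenLabs tuning) at the cost of a slightly slower barge-in. Real VADs are learned models, not fixed thresholds, so this is a conceptual sketch only:

```python
def detect_barge_in(frame_energies, threshold=0.02, min_speech_frames=5):
    """Return the index of the first frame where a barge-in should fire,
    or None if the agent should keep talking. Requiring several
    consecutive loud frames is a crude false-positive suppressor:
    a clattering pallet in a warehouse produces short energy spikes,
    while sustained speech does not."""
    run = 0
    for i, energy in enumerate(frame_energies):
        run = run + 1 if energy > threshold else 0
        if run >= min_speech_frames:
            return i - min_speech_frames + 1
    return None
```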
2. Tool-call retry semantics
When your backend API hiccups during a function call, what does the agent say? Vapi gives you full control via the response handler — I made mine say "let me check that one more time" and silently retry. Retell defaults to a generic "I'm having trouble accessing that" which sounds robotic. ElevenLabs falls in the middle.
For ServiceBot, the difference between "let me check that one more time" and "I'm having trouble" was measured: the former had a 7 percent escalation rate, the latter had 23 percent. That margin alone justified the platform decision for our workload.
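Platform specifics aside, the pattern itself is simple. A platform-agnostic sketch of the retry wrapper, where `say` stands for whatever streams one utterance to the TTS layer in your stack:

```python
import time

FILLER = "Let me check that one more time."
GIVE_UP = "I'm having trouble pulling that up. Let me connect you with someone who can help."

def call_tool_with_retry(tool, args, say, retries=1, backoff=0.5):
    """Invoke a backend tool; on failure, speak a natural filler line
    and retry before escalating. `say` streams text to the TTS layer."""
    for attempt in range(retries + 1):
        try:
            return tool(**args)
        except Exception:
            if attempt < retries:
                say(FILLER)       # keeps the caller engaged during the retry
                time.sleep(backoff)
            else:
                say(GIVE_UP)      # hand off rather than loop forever
                raise
```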
3. Conversation logs that engineers can actually use
You will spend more time debugging a voice agent than writing one. Logs matter. Vapi gives you full transcript plus token-level timing plus tool-call traces in a JSON object you can pipe anywhere. Retell gives you transcript plus high-level events. ElevenLabs gives you transcript plus audio file. For an engineering team trying to reproduce a "the agent said the wrong thing at minute 3" bug, the difference is measured in hours per week.
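What that looks like in practice: with tool-call timing in the trace, you can flag latency-budget violations mechanically instead of scrubbing through audio. The event field names below are illustrative, not any platform's exact export format:

```python
def slow_tool_calls(trace, threshold_ms=700):
    """Scan a call trace for tool invocations that blew the latency
    budget. `trace` is a list of event dicts; the field names here
    are illustrative, not a real platform schema."""
    return [
        (event["tool"], event["duration_ms"])
        for event in trace
        if event.get("type") == "tool_call" and event["duration_ms"] > threshold_ms
    ]
```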
Decision matrix: which one for what
If I had to summarise the picks I would make today across different project briefs:
- B2B support agent with backend integration → Vapi. The orchestration flexibility and tool-calling ergonomics outweigh the multi-invoice tax. This is what I picked for ServiceBot.
- Outbound sales or appointment-setting at scale → Retell. Native warm handoff and telephony maturity matter more than peak latency, and the all-in pricing makes finance happy.
- Healthcare or financial services with compliance scrutiny → Retell. Built-in HIPAA + SOC 2 and structured dialog flows that a compliance reviewer can actually read.
- Brand-forward consumer product → ElevenLabs. Voice quality is the product. Pay the premium.
- Multilingual deployment with non-English voices → ElevenLabs. 70+ language coverage and the best non-English voice quality I have heard.
- Cost-sensitive prototype → Vapi with Deepgram + Groq + Cartesia. You can land at $0.13/min if you optimize the stack aggressively.
FAQ
Can I switch platforms later?
Sort of. The agent prompt and the function definitions port cleanly. Conversation flow logic, webhook integrations, and any platform-specific features (Retell flow builder, Vapi tool config schema, ElevenLabs voice cloning) do not. Budget 2-3 weeks for a real migration. The lesson I learned the hard way: do your pilot on the platform you actually plan to ship, not the cheapest one.
How much should I budget for the pilot?
Plan for $200-$500 in burned credits across the three platforms during a serious bake-off. My ServiceBot evaluation cost roughly $480 over three weeks running 360 simulated calls plus 40 live ones. Less than the cost of a single bad production decision.
Do I need a separate observability stack?
Yes. None of the three give you what a real APM gives you for HTTP services. I pipe Vapi's webhook events into a small ClickHouse instance and run Grafana dashboards over it. For a smaller team, even dumping JSON to PostgreSQL and running daily SQL reports is a meaningful upgrade over the built-in dashboards.
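Even the minimal version pays for itself. A sketch of the daily report over a dumped events table, using SQLite here as a stand-in for PostgreSQL; the schema is hypothetical and should match whatever your webhook dump actually produces:

```python
import sqlite3

# Daily rollup: call volume, escalation rate, average handle time.
# Table and column names are hypothetical -- match your own dump schema.
DAILY_REPORT = """
    SELECT date(created_at)        AS day,
           COUNT(*)                AS calls,
           AVG(escalated)          AS escalation_rate,
           AVG(duration_s) / 60.0  AS avg_minutes
    FROM call_events
    GROUP BY day
    ORDER BY day
"""

def daily_report(conn):
    return conn.execute(DAILY_REPORT).fetchall()
```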
What about Bland AI, Deepgram Voice Agent, or Ultravox?
Bland is genuinely strong for high-volume outbound — if your workload is 10,000+ outbound calls per day, run a pilot there. Deepgram Voice Agent is interesting because it bundles their best-in-class STT with an agent layer, but as of April 2026 it is still earlier in maturity than the big three. Ultravox's open-source angle is compelling for teams who want full self-hosting, but operationally it is more work than I would take on for a client project.
How is voice agent quality going to change in the next 12 months?
Latency floors will keep dropping — sub-200ms P95 will be table stakes by mid-2027. Voice cloning will become indistinguishable to most listeners (some would argue ElevenLabs is already there). The interesting frontier is agent reasoning during the call: when the LLM can think for 200ms mid-conversation without breaking flow, the conversational ceiling lifts dramatically. Watch for that capability to ship in Anthropic and OpenAI's voice-mode APIs over the next two quarters.
Final verdict
There is no single best voice agent platform in 2026. There is a best one for your specific workload, your specific compliance constraints, and your specific tolerance for engineering time vs platform cost.
If you make me pick a default for a developer audience reading this in April 2026, it is Vapi — because the flexibility and ecosystem maturity have compounded faster than the competition's. But the moment your project hits compliance, telephony complexity, or premium voice quality as a hard requirement, the answer changes. Run the bake-off. Use real workloads. Trust your P95 over the marketing P50.
And budget for the migration you are pretending you will not need.