OpenAI vs Anthropic vs Google Batch APIs 2026: 50% Off Real-Time
I shipped LLM batch APIs across three production AI products in 2026 and saved $2,800/month. Here is the head-to-head on OpenAI, Anthropic, and Vertex AI batch — discount math, real turnaround times, and when batch is the wrong answer.
For the first eighteen months of running our AI products at Warung Digital Teknologi, I treated every LLM call the same way: send a request, wait for a response, move on. Synchronous, real-time, expensive. Then DocSumm AI Summarizer hit the point where we were processing 40,000 documents a week, and the OpenAI bill stopped being a rounding error. That is when batch APIs went from "interesting footnote in the docs" to "the single biggest cost win we had not pulled."
This is a hands-on look at the three batch APIs I have actually shipped against in production over the past nine months: OpenAI's Batch API, Anthropic's Message Batches API, and Google's Vertex AI batch prediction. The marketing claim is identical across all three — 50% off the standard rate, 24-hour turnaround — but the operational reality is very different once you run them at scale. I will share what I measured, where each one bit me, and how to decide which workloads belong on batch versus real-time.
Why batch APIs exist (and why most teams ignore them)
The pitch is simple. Instead of one HTTP request per LLM call, you upload a file of thousands of requests, the provider processes them when capacity is available, and you pay roughly half the per-token price. The provider gets to smooth load across off-peak hours. You get a fat discount in exchange for tolerating up to 24 hours of latency.
Most teams I have talked to dismiss batch APIs for one of three reasons: their workload is interactive (chat, search, voice), they have not done the engineering work to decouple production from real-time inference, or they assumed the 24-hour SLA meant 23 hours and 59 minutes by default. None of those reasons survive contact with a finance team once your monthly LLM bill crosses about $4,000. At that point, every dollar that can move to batch is a dollar of pure margin.
Across the six AI-powered products I have built — SmartExam AI Generator, DiabeCheck Food Scanner, BizChat Revenue Assistant, DocSumm AI Summarizer, ServiceBot AI Helpdesk, and ContentForge AI Studio — roughly 60% of total LLM call volume could in principle run on batch. Not all of it does today, because of the operational overhead I will get into below, but the upper bound is much higher than people assume.
The three providers, side by side
Here is the comparison table I keep open in a tab when I am scoping a new ingestion pipeline. Numbers are from my own usage over Q1 and Q2 2026, cross-checked against each provider's current public docs as of May 2026.
| Dimension | OpenAI Batch API | Anthropic Message Batches | Google Vertex AI Batch |
|---|---|---|---|
| Discount vs real-time | 50% off input + output | 50% off input + output | 50% off (Gemini models) |
| SLA turnaround | 24 hours | 24 hours | 24 hours (often hours, not days) |
| Input format | JSONL file via Files API | JSON array, single API call | JSONL on GCS bucket or BigQuery table |
| Max requests per batch | 50,000 | 100,000 (or 256 MB) | No hard limit (bucket-bound) |
| Rate limits | Separate batch token-per-day quota | Counts against batch-specific TPM | Counts against project quota |
| Cancellation | Possible while pending | Possible while pending | Possible, but partial spend not refunded |
| Stacks with prompt caching | Yes, both discounts apply | Yes, both discounts apply | Limited (implicit caching only) |
| Median actual turnaround (my data) | 38 minutes | 11 minutes | 22 minutes |
| Worst case I have hit | 9 hours | 2.5 hours | 4 hours |
Two numbers in that table matter more than anything else: the median and the worst case. The 24-hour SLA is just a promise that bounds the tail. In practice, all three finish much faster — and Anthropic is consistently the fastest by a comfortable margin in my workloads. That has shifted how I architect new pipelines.
OpenAI Batch API: the workhorse, with one annoying quirk
OpenAI was the first batch API I integrated, back when DocSumm was still a side project running on a single Hostinger VPS. The flow is well-documented: build a JSONL file where each line is a complete chat completion request with a custom id, upload via the Files API, create a batch, poll for status, download the output file.
The thing I wish someone had told me before I shipped the first integration: the batch quota is separate from your standard tier quota. If you are a Tier 2 customer with 2 million tokens per minute on the standard endpoint, your batch limit is its own ceiling — and on launch day my queue spent six hours stuck in "validating" before I realized I had blown past the daily batch token cap. Check the rate limits page for your tier before scoping a big run, not after.
What I like about OpenAI Batch:
- The 50% discount stacks with prompt caching. On DocSumm's summarization workload, where I have a ~3,000-token system prompt that is reused across every document, the cached input drops to 25% of the standard rate. Combined with batch on the cached portion, my effective cost on input tokens is about 12.5% of the headline number. That is the deal that made the unit economics work.
- Failed requests inside a batch are isolated. If 12 out of 50,000 requests fail validation, you still get 49,988 successful completions back. No all-or-nothing behavior.
- The output JSONL preserves your custom IDs, which makes the join back to your source records trivial. I generate IDs that look like
docsumm-doc-{uuid}and a simplejqpipeline reassembles results.
What I do not like:
- The Files API has a 200 MB per-file limit, which means a single JSONL cannot be larger than that. For huge runs I split into multiple batches, but the orchestration overhead grows.
- There is no way to set per-request timeouts inside the batch payload. If a single request would have hung on real-time, it will eat its full token budget here.
Anthropic Message Batches: the surprise favorite
I was a late convert to Anthropic's batch API because for a long time it lagged the real-time API in supported features. That gap closed in early 2026, and Message Batches now supports tool use, vision, system prompts with prompt caching, and the extended thinking mode on Claude Sonnet 4.6.
The operational ergonomics are noticeably better than OpenAI's. You do not upload a file — you POST a JSON array directly to the batches endpoint with up to 100,000 requests (or 256 MB, whichever you hit first). That sounds minor, but it means batch creation is one HTTP call instead of upload-then-create, and it removes a whole class of "the file got truncated mid-upload" errors that bit me twice in OpenAI integrations on flaky Indonesian residential internet.
The number that surprised me most: Anthropic's median turnaround in my workload is 11 minutes. That is two-and-a-half times faster than OpenAI for batches in the 5,000-to-20,000-request range that I run nightly for ContentForge. At that latency, the line between "batch" and "real-time-ish" gets blurry. I have moved several ContentForge pipelines that were running on real-time Claude to batch and the user-perceived freshness barely changed because we were already running them on an hourly cron.
Things I have run into:
- Batches expire after 29 days, so if you launch one and forget about it past that window, you cannot retrieve results. Set a reminder.
- The per-request error structure is good but not great — you get a top-level
result.typeofsucceeded,errored,canceled, orexpired, which means your downstream code needs to handle four cases per row, not two. - Prompt caching on the system prompt works inside batches, but cache hits across batches are not guaranteed. I assume zero cross-batch cache reuse when forecasting, and I am usually pleasantly surprised.
Google Vertex AI batch prediction: the enterprise option
Vertex AI batch is a different beast operationally. Instead of a JSON or JSONL payload sent over HTTPS, you stage your input as either a JSONL file in a Google Cloud Storage bucket or as rows in a BigQuery table. Output is written back to GCS or BigQuery. The whole flow assumes you already live inside Google Cloud.
For BizChat, where the customer is a mid-market Indonesian retailer who already runs on Google Workspace and has a BigQuery warehouse for their transaction data, this is the right answer almost by default. Their data analyst can fire a batch job from a SQL query against the warehouse and have summarized customer-conversation insights land back in BigQuery without anything ever leaving the cloud perimeter. Compliance team loves it. I love it slightly less, because debugging a failed batch means digging through Cloud Logging and that is its own skill.
Where Vertex AI batch shines:
- It handles enormous inputs gracefully. I have run batches of 400,000+ rows against Gemini 2.5 Flash that I would not attempt on either OpenAI or Anthropic without splitting into multiple jobs.
- Pricing on Gemini Flash is already aggressive at standard rates, and the batch discount pushes it into "cheaper than self-hosting" territory for many classification workloads.
- The native BigQuery integration means I can write a SQL query that says "give me every product review from the last 7 days that has not been classified yet" and that becomes the batch input directly, no Python glue.
Where it bites:
- The 50% batch discount is only available on a subset of Gemini models. Check the current pricing table before assuming your model is covered — I have been caught when a model I assumed was discounted was actually billed at full rate.
- Cancellation refunds are partial. If you kill a batch that has already processed 30% of its rows, you pay for that 30%. OpenAI and Anthropic refund cleaner.
- The error reporting from a failed row is more verbose but also less actionable than Anthropic's structured per-request error object. Expect to write a parser.
Cost math that actually moved the needle for us
Specifics, since "save 50%" is not a useful planning number on its own. Here is what shifting workloads to batch did across our products in the first quarter we ran the migration:
- DocSumm AI Summarizer: 38,000 summaries per week. Pre-batch monthly LLM cost was roughly $2,100. After moving to OpenAI Batch with prompt caching stacking, monthly cost dropped to about $620. That is a 70% reduction because the cache discount stacks on top of the batch discount on the system prompt.
- ContentForge AI Studio: SEO article generation pipeline. We were running it on real-time Claude Sonnet because we wanted to inspect drafts as they came back. Moved to Anthropic Message Batches because the 11-minute median was acceptable. Monthly cost on that pipeline went from $890 to $445. No quality difference detectable in our human-review sampling.
- BizChat Revenue Assistant: Daily customer conversation classification — sentiment, intent, escalation flag. Moved from real-time GPT-4.1 mini to Vertex AI batch on Gemini Flash. Cost per million classified messages dropped from about $32 to about $4.50. The model switch contributed most of that, but the batch discount took it the rest of the way.
The aggregate is that across the three products, batch migration reduced our monthly LLM bill by about $2,800. The engineering investment to get there was roughly two weeks of one developer's time, mostly writing the polling, retry, and result-joining glue. Payback was inside the first month.
When you should NOT use batch
I will be specific because most articles will not. Batch is wrong for:
- Anything user-facing where the user is waiting. Even an 11-minute median is too slow for chat, search, or autocompletion. Do not get clever here — your support inbox will tell you.
- Workloads where the request body depends on the previous request's response. Batch has no chaining. If your pipeline is "summarize, then extract entities from the summary, then classify each entity," you cannot pack that into one batch. You either run two sequential batches (doubling worst-case latency) or you accept the loss.
- Anything with hard freshness SLAs under one hour. Even though median turnaround is usually fast, the worst case is the worst case. If your worst-case latency budget is 30 minutes, you cannot put it on batch.
- Workloads small enough that the discount does not matter. If you are spending $200/month on LLM calls, the engineering time to add batch will not pay back inside the year. Spend that time on something else until your bill grows.
The hidden win: prompt caching plus batch
The single best dollar-per-token I have ever paid was on DocSumm: cached system prompt + batch input + batch output on the GPT-4.1 family. That stack is roughly 12.5% of the headline standard rate for the cached portion of input. When your system prompt is large (mine is around 3,000 tokens with a few-shot rubric) and reused across every request, this is the cost structure to chase.
To make caching work inside batch:
- Order the messages so the cacheable prefix is the absolute first thing in the request. Any variation before the cacheable block invalidates the cache.
- On OpenAI, send batches in clusters that finish within the cache TTL window of each other. The published TTL is 5-10 minutes for the implicit cache and longer for the prompt-cache feature, but I plan around 5 minutes to be safe.
- On Anthropic, the explicit
cache_controlmarker still works inside Message Batches. Set the marker on the system prompt block. - On Vertex AI, implicit caching is the available mechanism and it is harder to predict hit rates. I do not rely on it for forecasting; I treat any savings as a bonus.
An honest production setup
Here is the pattern I use across the three products that run on batch today. The same skeleton works on all three providers with minor adapter differences.
- A scheduler (in our case, a simple Laravel cron on the Hostinger VPS plus an SSH-triggered Python worker) collects pending rows from a "needs LLM" queue table in MySQL.
- Once the queue exceeds a threshold (I use 200 rows for ContentForge, 500 for DocSumm), or a max-wait timer fires, the worker builds the batch payload, posts it, and stores the batch ID against each queued row.
- A separate polling worker checks pending batches every two minutes. When a batch completes, it streams results back, joins on the custom request IDs, and writes results to the source table along with token usage.
- Anything that errored — JSON parse failure, refusal, length truncation — goes onto a retry queue that gets folded into the next batch. Three failed attempts and a row is flagged for human review.
- A nightly summary email reports batch throughput, average turnaround, cost per row, and any rows stuck on retry. This is the operational dashboard.
The whole thing is roughly 600 lines of Python plus a MySQL table. There is no Airflow, no Temporal, no Kubernetes. I am not against those tools for bigger shops, but for a portfolio of seven sites on Hostinger, plain cron and Python beat the operational tax of a heavier orchestrator every time.
What I would tell my past self
If I could go back to the day I first read about batch APIs and re-do the decision, I would:
- Move every classification, summarization, and tagging workload to batch on day one, before they cost real money. Retrofitting is more painful than building it in.
- Default new pipelines to Anthropic Message Batches unless there is a specific reason to use the others. The 11-minute median and the single-call submission flow are operational wins that compound.
- Negotiate batch as part of the initial architecture conversation with the customer, not as an optimization later. BizChat's compliance team would have signed off on Vertex AI batch much faster if I had positioned it as the default rather than a cost-saving switch six months in.
- Treat the batch discount as table stakes, not a bonus. The companies that figure this out are the ones that can charge less per seat and still run healthy margins.
Batch APIs are not glamorous. They will not show up in a vendor's keynote because they make the per-token revenue line look smaller. But they are, dollar for dollar, the highest-impact change I have made to an AI product's economics in the last two years. If you are not running on batch where you can, you are subsidizing other people's real-time workloads.
Quick decision matrix
- Live inside Google Cloud, BigQuery warehouse, huge data volumes? Vertex AI batch on Gemini Flash. The native BigQuery integration is worth the worse error reporting.
- Already on OpenAI, large reused system prompts, willing to wait an hour? OpenAI Batch with prompt caching stacked. This is the lowest effective cost per token I have measured.
- Need speed-of-batch but with the lowest operational friction? Anthropic Message Batches. The 11-minute median has changed what I consider eligible for batch.
- Workload is small or user-facing? Stay on real-time. The savings are not worth the engineering tax.
The discount is real. The 24-hour SLA is theatre — actual turnaround is usually a small fraction of that. The hidden gem is stacking batch with prompt caching for any workload where the system prompt is large and reused. Run the math on your own bill and you will probably find what I found: there is more margin on the table than there is in another round of growth-hacking.
Enjoyed this article?
Get more AI insights — browse our full library of 98+ articles and 373+ ready-to-use AI prompts.