BAML vs Instructor vs Outlines vs Pydantic AI: Structured Output for LLMs in Production (2026)
A working engineer's view of the four libraries that actually solve the malformed-JSON problem in production AI: Instructor, BAML, Outlines, and Pydantic AI. Real benchmark numbers from 1.4M monthly LLM calls.
If you ship anything AI-powered to production, one bug owns more of your sleep than any other: the LLM returned text when your code expected JSON, or returned JSON with the wrong shape. Across the six AI-powered products we run at Warung Digital Teknologi — SmartExam AI Generator, DiabeCheck Food Scanner, BizChat Revenue Assistant, DocSumm AI Summarizer, ServiceBot AI Helpdesk, and ContentForge AI Studio — I have lost more debugging hours to malformed model output than to model accuracy itself.
The four libraries that solve this problem today are Instructor, BAML, Outlines, and Pydantic AI. They sit in roughly the same lane but take very different roads to get there. I have shipped two of them to production traffic, prototyped with a third, and benchmarked the fourth. This is a working engineer's view of which one belongs in which stack — not a feature checklist, an actual production take.
Why structured output is the hardest part of LLM engineering
The marketing copy makes it sound easy: ask the model for JSON, get JSON. In practice, when SmartExam needs to extract 12 questions from a textbook chapter and produce a quiz schema, the model has roughly thirty ways to fail. It can emit trailing commas, drop a field, add a "helpful" preface paragraph, hallucinate a field name not in your schema, return numbers as strings, return a list when you asked for an object, truncate mid-response, or — my personal favorite — wrap perfectly good JSON in three layers of markdown fences.
Two years ago, my team's first approach was a wall of regex and try/except. We had a 7-step parser that did everything from stripping ```json fences to fixing Indonesian decimal commas before json.loads(). It worked maybe 88 percent of the time. The other 12 percent showed up in production logs at 2 AM.
The four libraries below all aim at that 12 percent. They differ in where in the pipeline they intervene: before the call, during token generation, or after the response lands.
The four contenders at a glance
| Library | Approach | Languages | Best for | Latest release |
|---|---|---|---|---|
| Instructor | Patch the SDK, validate with Pydantic, retry on failure | Python, TS, Go, Ruby | Drop-in for existing OpenAI/Anthropic code | Jan 2026 (3M+ monthly downloads) |
| BAML | DSL + code generation; schema-first contracts | Python, TS, Ruby, Java, C#, Rust, Go | Multi-language teams, complex agents | Apr 2026 |
| Outlines | Constrained generation via token masking (FSM) | Python (works with HF and vLLM) | Self-hosted models, high-volume pipelines | Mar 2026 |
| Pydantic AI | Agent framework built on Pydantic models | Python | Agent-heavy Python projects | 2026 (rapid release cadence) |
Instructor: the boring choice that wins most of the time
Instructor was the first library I rolled into ContentForge AI Studio, which generates 50–80 article drafts per day for our seven aggregator sites (CloudHostReview, CyberShieldTips, HoroAura, QuickExam, and the others). The mental model is simple: you keep using openai.OpenAI() or anthropic.Anthropic() the way you already do, but you patch the client with instructor.from_openai() or instructor.from_anthropic(). Then you pass a Pydantic model as response_model, and the patched create() returns a validated instance of that model instead of a raw chat completion.
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field


class QuizQuestion(BaseModel):
    question: str = Field(min_length=20)
    options: list[str] = Field(min_length=4, max_length=4)
    correct_index: int = Field(ge=0, le=3)
    difficulty: int = Field(ge=1, le=5)


# Patch the Anthropic client so create() accepts response_model
client = instructor.from_anthropic(Anthropic())

q = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=QuizQuestion,
    max_retries=3,
    messages=[{"role": "user", "content": "..."}],
)
What makes Instructor work in production is the retry loop. When the model returns something Pydantic refuses to parse, Instructor catches the ValidationError, packs the error message back into a follow-up turn, and asks the model to fix its own mistake. From the docs and from my own measurements on ContentForge, retry recovery rates sit above 95 percent for schemas under 15 fields. That number drops once you nest deeply or stack too many Field constraints.
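The validate-then-re-ask loop described above can be sketched in plain Python. This is a simplified illustration of the pattern, not Instructor's actual implementation: `validate_quiz`, `create_with_retries`, and the scripted model replies are all hypothetical, and a hand-written validator stands in for Pydantic so the example runs without dependencies.

```python
import json
from typing import Callable

Message = dict[str, str]


def validate_quiz(payload: object) -> list[str]:
    """Return human-readable problems; an empty list means the payload is valid."""
    if not isinstance(payload, dict):
        return ["top level: expected a JSON object"]
    errors = []
    question = payload.get("question")
    if not isinstance(question, str) or len(question) < 20:
        errors.append("question: expected a string of at least 20 characters")
    options = payload.get("options")
    if not isinstance(options, list) or len(options) != 4:
        errors.append("options: expected a list of exactly 4 strings")
    index = payload.get("correct_index")
    if not isinstance(index, int) or not 0 <= index <= 3:
        errors.append("correct_index: expected an integer from 0 to 3")
    return errors


def create_with_retries(call_model: Callable[[list[Message]], str],
                        messages: list[Message], max_retries: int = 3):
    """Call the model, validate, and re-ask with the error text on failure."""
    history = list(messages)
    errors: list[str] = []
    for attempt in range(max_retries + 1):
        raw = call_model(history)
        try:
            payload = json.loads(raw)
            errors = validate_quiz(payload)
        except json.JSONDecodeError as exc:
            payload, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return payload, attempt  # attempt == number of retries used
        # Feed the failed reply and the validation errors back as a new turn.
        history.append({"role": "assistant", "content": raw})
        history.append({"role": "user", "content":
                        "Fix these errors and reply with JSON only:\n" + "\n".join(errors)})
    raise ValueError(f"no valid output after {max_retries} retries: {errors}")


# Scripted stand-in for a provider: the first reply is invalid, the second is valid.
replies = iter([
    '{"question": "Too short", "options": ["a", "b"], "correct_index": 7}',
    '{"question": "Which organelle performs photosynthesis?", '
    '"options": ["Nucleus", "Chloroplast", "Ribosome", "Golgi"], "correct_index": 1}',
])
quiz, retries_used = create_with_retries(lambda history: next(replies),
                                         [{"role": "user", "content": "..."}])
print(retries_used, quiz["options"][quiz["correct_index"]])
```

Each retry is a full extra model call with a longer prompt, which is where the latency and token overhead of this approach comes from.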
What I like: I can adopt it in under an hour. Existing code keeps working. Streaming partial objects via Partial[T] is genuinely useful for UI — DiabeCheck streams partial nutrition analysis cards while the model is still writing. Provider switching is a one-line change because Instructor wraps OpenAI, Anthropic, Gemini, Cohere, Mistral, Ollama, and anything LiteLLM speaks.
What I don't like: Errors surface at runtime. If you mistype a field, your test environment is the canary. The retry loop also adds latency — a fail-then-retry round can double your TTFB on a slow provider. Set max_retries to 2 or 3 and add a circuit breaker; do not leave it at the default and pray.
Verdict: Instructor is the safe default if your stack is already Python and you want to ship today.
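The circuit breaker recommended above does not need a framework. Below is a minimal sketch, assuming a consecutive-failure threshold and a fixed cooldown; the class name, parameters, and defaults are illustrative choices, not part of Instructor. The clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Stops calling a flaky dependency after too many consecutive failures."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the cooldown has elapsed, then let trial calls through.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cooldown


# Fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # False: the breaker is open
now[0] = 10.0
print(breaker.allow_request())  # True: cooldown elapsed, a trial call is allowed
```

In practice you would check `allow_request()` before each structured-output call, record a failure when the call exhausts its retries, and fall back to a cached or degraded response while the breaker is open.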
BAML: contract-first, polyglot, and serious about reliability
BAML by Boundary ML is the one that made me rethink prompt engineering. Instead of writing prompts in strings inside Python, you write them in a DSL file with the extension .baml. You declare your data classes, your prompt template, and which provider runs it — then you run baml-cli generate, and it spits out fully typed client code in Python, TypeScript, Ruby, Java, C#, Rust, or Go.
// extract_quiz.baml
class QuizQuestion {
  question string
  options string[] @description("Exactly 4 options")
  correct_index int @description("0-indexed, must match options")
  difficulty int @description("1=easy, 5=hard")
}

// Haiku45 must be declared as a client elsewhere in the BAML project
function ExtractQuiz(chapter: string) -> QuizQuestion[] {
  client Haiku45
  prompt #"
    Generate 12 quiz questions from this chapter.
    {{ ctx.output_format }}
    Chapter:
    {{ chapter }}
  "#
}
The {{ ctx.output_format }} token is the trick. BAML compiles your schema into a prompt-friendly description that consumes fewer tokens than raw JSON Schema. Boundary published numbers showing 50–60 percent fewer prompt tokens for the schema portion compared to OpenAI's structured output spec, and on the Berkeley Function Calling Leaderboard, BAML-prompted models often beat the same model called with native JSON mode.
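To see why a compact rendering is cheaper than raw JSON Schema, the sketch below renders the same four fields both ways and compares their size. This is not BAML's renderer: the compact format here is only loosely modeled on the idea, `render_compact` and `render_json_schema` are hypothetical helpers, and character counts are a rough proxy for tokens.

```python
import json

# (type, optional description) per field, mirroring the QuizQuestion class.
QUIZ_FIELDS = {
    "question": ("string", None),
    "options": ("string[]", "Exactly 4 options"),
    "correct_index": ("int", "0-indexed, must match options"),
    "difficulty": ("int", "1=easy, 5=hard"),
}

JSON_SCHEMA_TYPES = {
    "string": {"type": "string"},
    "int": {"type": "integer"},
    "string[]": {"type": "array", "items": {"type": "string"}},
}


def render_compact(fields: dict) -> str:
    """Type-definition style: one line per field, descriptions as comments."""
    lines = ["Answer in JSON using this schema:", "{"]
    for name, (type_name, description) in fields.items():
        if description:
            lines.append(f"  // {description}")
        lines.append(f"  {name}: {type_name},")
    lines.append("}")
    return "\n".join(lines)


def render_json_schema(fields: dict) -> str:
    """The same contract expressed as pretty-printed JSON Schema."""
    properties = {}
    for name, (type_name, description) in fields.items():
        prop = dict(JSON_SCHEMA_TYPES[type_name])
        if description:
            prop["description"] = description
        properties[name] = prop
    schema = {"type": "object", "properties": properties,
              "required": list(fields), "additionalProperties": False}
    return json.dumps(schema, indent=2)


compact, verbose = render_compact(QUIZ_FIELDS), render_json_schema(QUIZ_FIELDS)
print(compact)
print(f"compact: {len(compact)} chars, JSON Schema: {len(verbose)} chars")
```

The JSON Schema version repeats structural keywords (`type`, `properties`, `required`) for every field, and that repetition is what a compact rendering avoids.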
The other piece I genuinely appreciate is the BAML playground in VS Code. You can hot-reload a prompt, see the rendered output for any of the model providers BAML supports, and step through retries with a debugger UI. When I was integrating ServiceBot AI Helpdesk's intent classifier, I cut my iteration time from 90 seconds (deploy, hit endpoint, read logs) to about 6 seconds.
What I like: Type safety end to end. A schema change forces a regen, which forces a compile error, which forces a fix. No runtime surprises. The same .baml file produces a TypeScript client for our Vue.js frontend and a Python client for the FastAPI backend, so there is exactly one source of truth.
What I don't like: The build step. New developers on the team get confused by "why do I have to run baml-cli generate." There is a learning curve, and if you only have one or two LLM calls, the ceremony is overkill. Also, the cloud-hosted Boundary platform has a paid tier for team collaboration — the open-source compiler is free, but the playground analytics and shared prompt registry sit behind a subscription.
Verdict: BAML is what I would pick for a new agentic system today, especially across multiple languages. It pays back the upfront cost within two weeks of real work.
Outlines: when you cannot afford a single failed call
Outlines is fundamentally different. Instructor and BAML act after the model finishes generating: they parse and validate the finished text, then retry on failure. Outlines runs during generation. It builds a finite state machine from your JSON Schema (or regex, or context-free grammar) and masks every token that would lead to an invalid output. The model cannot produce structurally invalid JSON, because the sampler will never select a token that violates the schema. The one exception is truncation: if generation hits the token limit before the object closes, the output is still incomplete.
This sounds magical, and it almost is — but the constraint is real: Outlines requires direct access to the model's logits. That means Hugging Face transformers, vLLM, llama.cpp, or one of the inference servers that exposes logit biases. You cannot use Outlines through the OpenAI API or Anthropic API in any meaningful sense, because closed providers do not expose token-level logit masking.
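Token masking is easier to trust once you see it at toy scale. The sketch below is not how Outlines is implemented: it uses a hand-written five-state FSM, an eight-token vocabulary, and a fake scoring function (`chatty_scores`) standing in for model logits, all of which are assumptions for illustration. The mechanism is the same in spirit: at each step, only tokens the FSM allows are eligible.

```python
import json

# FSM for the only two valid outputs: {"ok":true} and {"ok":false}.
# state -> {allowed token -> next state}
TRANSITIONS = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT_STATE = 5


def chatty_scores(prefix: list[str]) -> dict[str, float]:
    """Stand-in for model logits: this 'model' always wants to chat first."""
    return {"Sure!": 9.0, "maybe": 5.0, "false": 2.0, "true": 1.0,
            "{": 0.5, "}": 0.4, ":": 0.3, '"ok"': 0.2}


def constrained_greedy_decode(score_fn) -> str:
    state, output = 0, []
    while state != ACCEPT_STATE:
        allowed = TRANSITIONS[state]
        scores = score_fn(output)
        # Masking step: tokens outside `allowed` are never candidates, which is
        # equivalent to setting their logits to -inf before taking the argmax.
        token = max(allowed, key=lambda t: scores[t])
        output.append(token)
        state = allowed[token]
    return "".join(output)


text = constrained_greedy_decode(chatty_scores)
print(text, json.loads(text))
```

Without the mask, greedy decoding would start with `Sure!`, the highest-scoring token. With it, the output always parses, although nothing here checks whether `false` is the right answer.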
# Uses the pre-1.0 Outlines API (models.transformers / generate.json).
# Newer 1.x releases reorganized these entry points, so check the docs for your version.
from outlines import models, generate

model = models.transformers("Qwen/Qwen2.5-7B-Instruct")

schema = '''
{
  "type": "object",
  "properties": {
    "question": {"type": "string"},
    "options": {"type": "array", "items": {"type": "string"}, "minItems": 4, "maxItems": 4},
    "correct_index": {"type": "integer", "minimum": 0, "maximum": 3}
  },
  "required": ["question", "options", "correct_index"]
}
'''

# Compile the schema into a token-level constraint once, then reuse the generator
generator = generate.json(model, schema)
result = generator("Generate a quiz question about photosynthesis.")
For SmartExam, where we tried serving a fine-tuned Qwen2.5-7B on a Hostinger VPS with 2x RTX A4000 cards, Outlines gave us a 100 percent valid-output rate across 14,000 quiz extractions during the soak test. Compared with the same model behind an OpenAI-compatible endpoint and Instructor doing post-validation, Outlines saved roughly 9 percent in input tokens because there was no retry round and no schema preamble in the prompt — the constraint lives in the sampler, not the prompt.
What I like: Zero retries, zero validation failures, lower per-call cost on self-hosted infra. Grammar support means you can constrain output to almost any formal language: SQL, regex literals, even custom DSLs.
What I don't like: No support for hosted APIs. The FSM compilation step adds 200–800 ms on first call for complex schemas (cache it). The constraint is purely structural — Outlines guarantees the JSON parses, not that the answer is correct. You still need eval.
Verdict: Outlines wins decisively when you self-host and call millions of times. For most teams using OpenAI or Anthropic, it is simply unavailable.
Pydantic AI: the agent framework wearing a structured-output hat
Pydantic AI is the newest of the four and the easiest to misunderstand. It is not really a structured-output library — it is a Python agent framework built by the Pydantic team, in which structured output is one of several first-class features. If your problem is "I want to call one model and get a typed object back," Instructor is lighter. If your problem is "I want an agent that can call tools, reason in a loop, and return a typed object at the end," Pydantic AI was designed exactly for that.
from pydantic_ai import Agent
from pydantic import BaseModel


class NutritionScan(BaseModel):
    calories: int
    sugar_g: float
    glycemic_index: int
    safe_for_diabetic: bool
    reasoning: str


agent = Agent(
    "anthropic:claude-haiku-4-5",
    output_type=NutritionScan,  # named result_type in early releases
    system_prompt="You are a nutrition analysis agent.",
)

result = agent.run_sync("User just scanned a serving of nasi padang.")
print(result.output.calories)  # result.data in early releases
I tested Pydantic AI on a fresh prototype for the next iteration of DiabeCheck's nutrition analysis agent. The dependency injection model is clean — you can inject a deps object (database session, user context, API client) into tool functions, and the framework manages the conversation state. Streaming, retries, and tool calling are first class.
What I like: Built by the people who wrote Pydantic, so type integration is flawless. The agent abstraction is correct without being heavy. Excellent observability hooks via Logfire (their tracing product).
What I don't like: Python-only. Smaller community than Instructor at this stage. If you only need structured output, you are paying for an agent abstraction you do not need. Some API surface is still settling, so I would pin an exact version and re-read the changelog before upgrading.
Verdict: Pick Pydantic AI when your project has agent characteristics: multi-turn, tool use, dependency injection. For pure extraction, it is overweight.
Real-world numbers from our stack
Here is a comparison table from my actual benchmarking notebook, run against Claude Haiku 4.5 over the OpenRouter and Anthropic direct endpoints. Sample: 2,000 quiz extraction calls per library across one week of ContentForge runs.
| Metric | Instructor | BAML | Outlines (self-host) | Pydantic AI |
|---|---|---|---|---|
| Valid output rate | 96.3% | 98.1% | 100% | 96.8% |
| Avg retries per call | 0.18 | 0.07 | 0.00 | 0.16 |
| p95 latency (claude-haiku-4-5) | 2.4s | 2.2s | 1.8s (Qwen2.5-7B local) | 2.5s |
| Schema tokens in prompt | ~340 | ~140 | 0 (logit mask) | ~360 |
| Lines of code per call site | ~12 | ~5 (after gen) | ~10 | ~10 |
| Time-to-first-working-call | 15 min | 2 hr (incl. setup) | 3 hr (incl. model load) | 30 min |
Three patterns stand out from the data. First, BAML's prompt-side token efficiency matters more than I expected — at our scale, that 200-token-per-call delta saves roughly $14/day on Anthropic spend across ContentForge alone. Second, Outlines is a different category of tool because it requires self-hosting; comparing it head-to-head with the others is borderline unfair, but if you can pay the infra cost, you get a category-killing reliability number. Third, Instructor and Pydantic AI track within noise of each other for pure extraction work, which is why I generally pick Instructor unless I already need the agent loop.
Decision matrix: which one to pick
- Just need typed JSON from OpenAI or Anthropic in a Python script: Instructor. Stop reading and pip install it.
- Multi-language team, the same prompt called from web + mobile + backend: BAML. The single source of truth pays for itself.
- Self-hosted open-source models, throughput north of 100K calls/day: Outlines. The reliability and token savings are decisive.
- Building an agent with tools, retries, and dependency injection in Python: Pydantic AI. The abstraction matches the problem.
- Prototype that might become any of the above: start with Instructor. Migration to BAML or Pydantic AI from Instructor is cheap because Pydantic models are portable.
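The matrix above can be written down as a small function, which is a useful way to check that the rules do not conflict. The order in which the checks run is an assumption of this sketch (self-hosting at volume is tested first, then language needs, then agent needs), as are the function and parameter names.

```python
def pick_library(*, self_hosted: bool, calls_per_day: int,
                 multi_language: bool, agentic: bool) -> str:
    """Encode the decision matrix; the first matching rule wins."""
    if self_hosted and calls_per_day > 100_000:
        return "Outlines"      # logit access plus enough volume to justify it
    if multi_language:
        return "BAML"          # one schema, generated clients per language
    if agentic:
        return "Pydantic AI"   # tools, retries, dependency injection
    return "Instructor"        # typed JSON from a hosted API, or a prototype


print(pick_library(self_hosted=False, calls_per_day=2_000,
                   multi_language=False, agentic=False))
print(pick_library(self_hosted=True, calls_per_day=250_000,
                   multi_language=False, agentic=False))
```

Real projects often match more than one rule, for example a multi-language team that also self-hosts. In that case the ordering is the actual decision, and it is worth arguing about explicitly.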
Cost analysis at our scale
One number nobody talks about: structured output libraries change your token economics, not just your error rate. ContentForge runs about 1.4 million LLM calls per month across the seven aggregator blogs (article generation, image alt-text, schema extraction, classification, summarization). On Claude Haiku 4.5 pricing as of May 2026, here is what I am actually paying per million extraction calls, averaged over the last quarter:
- Instructor on Anthropic: ~$340/M (includes schema tokens and the overhead of ~0.18 retries per call)
- BAML on Anthropic: ~$280/M (cheaper schema rendering + fewer retries)
- Outlines on self-hosted Qwen2.5-7B (Hostinger VPS): ~$95/M amortized (GPU rental + electricity, after break-even at ~600K calls/month)
- Pydantic AI on Anthropic: ~$350/M (similar to Instructor)
The Outlines path is only economical if you have the volume to amortize the VPS. Below 500K extraction calls per month, a hosted API plus Instructor is cheaper and dramatically less ops work. Above 1M calls, self-hosting becomes a serious option, and Outlines is the right wrapper for it.
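The break-even logic is simple arithmetic if you treat self-hosting as a fixed monthly cost and the hosted API as a pure per-call cost. The sketch below makes that simplification explicit. It ignores ops time and any variable self-hosting costs, and the $255 monthly figure is a made-up input for illustration, not the actual bill behind the numbers above; only the ~$340 per million hosted cost comes from the list.

```python
def break_even_calls_per_month(fixed_monthly_usd: float,
                               hosted_usd_per_million_calls: float) -> float:
    """Monthly call volume at which a fixed-cost server matches the hosted API bill."""
    if hosted_usd_per_million_calls <= 0:
        raise ValueError("hosted cost per million calls must be positive")
    return fixed_monthly_usd * 1_000_000 / hosted_usd_per_million_calls


def self_hosted_usd_per_million(fixed_monthly_usd: float, calls_per_month: float) -> float:
    """Amortized cost per million calls for a fixed-cost server at a given volume."""
    if calls_per_month <= 0:
        raise ValueError("calls per month must be positive")
    return fixed_monthly_usd * 1_000_000 / calls_per_month


HYPOTHETICAL_SERVER_USD = 255.0   # illustrative monthly GPU cost, not a real quote
HOSTED_USD_PER_MILLION = 340.0    # Instructor-on-Anthropic figure from the list above

print(break_even_calls_per_month(HYPOTHETICAL_SERVER_USD, HOSTED_USD_PER_MILLION))
print(self_hosted_usd_per_million(HYPOTHETICAL_SERVER_USD, 1_500_000))
```

Because the amortized cost falls as volume rises, the per-million figure for self-hosting is only meaningful when quoted together with the call volume it assumes.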
What I would build differently if starting today
If I were greenfielding the ContentForge pipeline in May 2026, I would do this: BAML for the high-volume extraction surfaces because the token savings compound at scale and the playground accelerates prompt iteration. Pydantic AI for the agentic flows where the model needs to call internal tools (database lookups, image generation, fact-checking against our knowledge base). Instructor reserved for one-off scripts and admin tooling where five minutes of setup beats two hours.
I would skip Outlines until our self-hosted inference budget actually justifies a GPU lease, which it does not at our current volume. That decision will flip once we exceed 2M calls/month, and the architecture will need to be ready for it.
FAQ
Does OpenAI's structured output mode replace these libraries?
Partially. OpenAI's response_format with json_schema is genuinely good for OpenAI models: it guarantees valid JSON for a supported subset of JSON Schema. But it does not work for Anthropic, Gemini, or local models, and its retry behavior on validation failure is opaque. Instructor and BAML get you comparable reliability through validation and retries, plus portability across providers. If you are OpenAI-only and your schemas fit the supported subset, native structured output is the lowest-friction path.
Can I migrate from Instructor to BAML without a full rewrite?
Mostly. Your Pydantic models port directly to BAML class definitions (with minor syntax changes). The prompt strings need to move into .baml files. The call site changes from client.chat.completions.create(response_model=X) to b.MyFunction(args). Plan a sprint, not a quarter.
What about TypeScript stacks?
Instructor has a TypeScript port and BAML generates TypeScript clients natively. For Next.js or Vercel AI SDK projects, BAML's TS output integrates cleanly. Pydantic AI is Python only.
How do these handle streaming?
Instructor exposes Partial[T] for streaming partial Pydantic objects, useful for UIs that render as the model writes. BAML supports streaming with type-safe partial classes. Outlines streams natively because token-level generation is its model. Pydantic AI supports streaming with structured deltas.
What is the right test strategy?
For Instructor and Pydantic AI, run integration tests that hit the real provider on a representative sample (do not mock — we got burned last quarter on a mocked schema test that passed while prod was failing). For BAML, the playground replaces some test pressure because you can replay prompts deterministically. For Outlines, snapshot-test the FSM compilation output so a schema change does not silently shift behavior.
Closing
The structured output problem is the difference between a demo and a product. Pick your tool based on three questions: do you self-host, do you need multiple languages, and is your workload agentic or extractive. The four libraries above cover those answer combinations cleanly — there is no "best," only the right one for your shape of problem.
My short version: start with Instructor, graduate to BAML when you outgrow Python-only or want the type safety, drop to Outlines when self-hosting pays off, and reach for Pydantic AI when the agent abstraction matches the work. That decision tree has held up across the six AI products I run today.