DSPy vs TextGrad vs GEPA: Automatic Prompt Optimization in 2026

A hands-on 2026 comparison of DSPy, TextGrad, and GEPA for automatic prompt optimization — what each one optimizes, the published benchmarks, real production costs, and a decision matrix from running all three on live AI products.

By Fanny Engriana · May 30, 2026

If you are still hand-tuning prompts by editing strings and re-running them in a notebook, you are doing the most expensive version of a job that three open-source frameworks now automate. DSPy, TextGrad, and GEPA each take a different route to the same destination: stop guessing at prompt wording and let an optimizer search for it against a metric you actually care about. I have run all three against real workloads on our internal stack at Warung Digital Teknologi (wardigi.com), and the differences are not cosmetic — they change how you structure the whole pipeline.

This is a hands-on comparison written for engineers who ship, not for people collecting framework names. I will cover what each tool optimizes, the published benchmark numbers, where each one breaks down in production, and a decision matrix you can use this week. By the end you should know which one to reach for given your data, your budget, and your tolerance for moving parts.

Why automatic prompt optimization stopped being optional

When I built the first version of SmartExam AI Generator — our exam-question generator that runs on the OpenAI API — the prompt was a 600-word block of instructions, three few-shot examples, and a formatting schema. It worked until it didn't. Every time we swapped models or added a question type, the prompt drifted and accuracy fell. I was spending more time babysitting prompt strings than building features.

That is the exact problem these frameworks solve. They treat the prompt as a parameter to be searched, not a sentence to be wordsmithed. You define a program, a metric, and a small dataset, and the optimizer proposes, scores, and keeps the prompts that win. The payoff is concrete: published results from the GEPA paper show a DSPy ChainOfThought program jumping from 67% to 93% accuracy on the MATH benchmark — a 26-point gain — from instruction refinement alone, with no few-shot examples and no fine-tuning. That is the kind of lift you do not get by staring at the prompt harder.

The three tools below represent three philosophies: DSPy compiles whole pipelines, TextGrad backpropagates natural-language feedback through a computation graph, and GEPA evolves instructions with a reflective genetic search. Let us take them in turn.

DSPy: the compiler for LLM pipelines

DSPy (from Stanford NLP) is the most established of the three. The core idea is that you write a program out of modules (dspy.Predict, dspy.ChainOfThought, dspy.ReAct) and declare what each step does with a signature like question -> answer. You never write the actual prompt. Instead, an optimizer (older DSPy releases called them "teleprompters") compiles your program against a training set and a metric, bootstrapping few-shot demonstrations and rewriting instructions to maximize the score.
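A minimal sketch of what "declare a signature, never write the prompt" can look like, assuming the dspy package is installed and an API key for the chosen model is configured. The model name, the `exact_match` metric, and the `compile_qa_program` helper are placeholders for illustration, not the SmartExam setup. dspy is imported inside the function only so the metric can be exercised without the dependency.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None):
    """Toy metric: 1.0 when the predicted answer equals the gold answer."""
    return float(example.answer.strip().lower() == pred.answer.strip().lower())


def compile_qa_program(pairs, model_name="openai/gpt-4o-mini"):
    """Compile a one-module program from (question, answer) pairs.

    Needs dspy plus a working API key; model_name is an assumption.
    """
    import dspy  # lazy import: the metric above has no dspy dependency

    dspy.configure(lm=dspy.LM(model_name))

    # The signature states what the step does; DSPy generates the prompt.
    program = dspy.ChainOfThought("question -> answer")

    # Mark which field is the input so the rest is treated as the label.
    trainset = [
        dspy.Example(question=q, answer=a).with_inputs("question")
        for q, a in pairs
    ]

    optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
    return optimizer.compile(program, trainset=trainset)


# The metric only reads attributes, so it can be checked without any LLM call.
gold = SimpleNamespace(answer="Paris")
print(exact_match(gold, SimpleNamespace(answer=" paris ")))  # 1.0
```

Exact match is used here only because it is short; it is a weak metric for anything beyond a smoke test.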

DSPy 3, released in 2025, cleaned up the API, made async a first-class citizen, added native support for tool calls, and — most importantly — shipped the GEPA optimizer alongside the existing workhorse, MIPROv2.

MIPROv2, the default optimizer

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) is DSPy's primary optimizer. It jointly tunes two things for every predictor in your program: the instruction text and the few-shot demonstration set. It proposes candidate instructions, bootstraps candidate demonstrations from your data, and uses Bayesian optimization to find the combination that scores best. For most teams starting out, MIPROv2 is the sensible default: it is well documented and battle-tested.
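The "jointly tunes two things" idea can be pictured as scoring (instruction, demonstration set) pairs together rather than one at a time. The sketch below is a toy, not MIPROv2 itself: it swaps the Bayesian search for an exhaustive one over a tiny grid, and `toy_score` is a made-up stand-in for running the program against a dataset.

```python
from itertools import combinations, product


def joint_search(instructions, demos, score_fn, demos_per_prompt=2):
    """Score every (instruction, demo subset) pair and return the best one.

    Exhaustive search stands in for Bayesian optimization here; the point
    is only that the instruction and the demos are chosen together.
    """
    demo_sets = list(combinations(demos, demos_per_prompt))
    best = max(product(instructions, demo_sets), key=lambda pair: score_fn(*pair))
    return best, score_fn(*best)


def toy_score(instruction, demo_set):
    """Hypothetical scorer: rewards one instruction style and one demo."""
    score = 5 if "step by step" in instruction else 2
    score += 3 if "hard example" in demo_set else 0
    return score


instructions = ["Answer the question.", "Reason step by step, then answer."]
demos = ["easy example", "medium example", "hard example"]

(best_instruction, best_demos), best_score = joint_search(instructions, demos, toy_score)
print(best_instruction, best_demos, best_score)
```

With real candidates the grid grows quickly, which is why a guided search is used instead of trying every pair.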

Where DSPy shines and where it hurts

DSPy is the right answer when you are building a multi-step pipeline that has to work reliably across many inputs and you have labeled data to compile against. When I rebuilt SmartExam's question pipeline as a DSPy program — a generation module feeding a validation module feeding a difficulty-scoring module — I stopped editing prompts entirely. Re-compiling against 80 labeled examples gave me a more reliable system than the hand-tuned prompt it replaced, and swapping the underlying model became a re-compile instead of a rewrite.

The pain is the learning curve. DSPy asks you to think in signatures and modules, which feels alien if you are used to f-strings. The abstraction also hides the final prompt, so debugging means inspecting compiled artifacts rather than reading a single template. And you need data — without at least a few dozen labeled examples and a metric, there is nothing for the compiler to optimize against. If you cannot write a metric function, DSPy has nothing to bite on.

TextGrad: autograd, but for text

TextGrad takes the deep-learning analogy literally. Published in Nature in 2025 (Yuksekgonul et al., "Optimizing generative AI by backpropagating language model feedback," Nature vol. 639, pp. 609–616), it builds a computation graph where variables are pieces of text — prompts, code, even molecule descriptions — and "gradients" are natural-language critiques generated by an LLM. You define a loss as a textual evaluation, call .backward(), and an optimizer LLM rewrites each variable using the feedback that flowed back through the graph.
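The evaluate, feed back, rewrite loop described above can be sketched with plain functions standing in for the LLM calls. This is not the textgrad API: `TextVariable`, `critique`, and `apply_feedback` are hypothetical names, and the critique rule is hard-coded so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class TextVariable:
    """A piece of text being optimized, plus the feedback that reached it."""
    value: str
    feedback: list = field(default_factory=list)


def critique(text):
    """Stand-in for the LLM that writes a textual 'gradient'."""
    if "units" not in text:
        return "State the units of the final answer."
    return ""


def apply_feedback(text, note):
    """Stand-in for the optimizer LLM that rewrites the variable."""
    return f"{text} {note}"


def optimize(variable, max_steps=3):
    """Evaluate, pass feedback backward, rewrite; stop when nothing is criticized."""
    for _ in range(max_steps):
        note = critique(variable.value)       # loss evaluation + gradient text
        if not note:
            break
        variable.feedback.append(note)        # the "backward" pass
        variable.value = apply_feedback(variable.value, note)  # the "step"
    return variable


prompt = optimize(TextVariable("Solve the physics problem."))
print(prompt.value)
```

In the real library each of the stand-in functions is an LLM call, which is where the per-instance cost comes from.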

The framing is elegant. Where DSPy optimizes a pipeline at compile time against a dataset, TextGrad optimizes individual instances at inference time using LLM-generated feedback. It is built to squeeze maximum performance out of a single hard problem — the kind of instance-level refinement that suits coding tasks, scientific question answering, and the unusual domains the Nature paper highlighted, including radiotherapy treatment plans and molecule generation.

The production tradeoff I measured

TextGrad's instance-level loop is also its cost problem. Each optimization step is multiple LLM calls: one to evaluate, one to produce the textual gradient, one to apply it. On a DocSumm AI Summarizer experiment — our document-summarization product — I pointed TextGrad at a batch of hard legal-document summaries. The quality lift on the difficult instances was real, but the per-document call count roughly tripled versus a single forward pass. For a high-traffic endpoint that math does not close. For a low-volume, high-stakes task where each output matters more than its cost, it does.
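A back-of-the-envelope version of that math, as a sketch. The request volumes and the price of one cent per call are invented for illustration; only the jump from one call to roughly three calls per document comes from the experiment above.

```python
def daily_calls(requests_per_day, calls_per_request):
    """Total LLM calls per day for an endpoint."""
    return requests_per_day * calls_per_request


def extra_daily_cost_cents(requests_per_day, baseline_calls, optimized_calls,
                           cents_per_call):
    """Added daily spend (in cents) caused by the extra calls per request."""
    extra_calls = daily_calls(requests_per_day, optimized_calls - baseline_calls)
    return extra_calls * cents_per_call


# Hypothetical volumes: a hot path versus a low-volume, high-stakes queue.
hot_path_cents = extra_daily_cost_cents(50_000, 1, 3, cents_per_call=1)
offline_cents = extra_daily_cost_cents(50, 1, 3, cents_per_call=1)

print(f"hot path: +${hot_path_cents / 100:.2f} per day")
print(f"offline:  +${offline_cents / 100:.2f} per day")
```

The multiplier is the same in both cases; what changes is whether the absolute number is worth paying for the quality lift.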

My honest read: TextGrad is a research-grade scalpel. Use it to refine the hard 5% of cases, or to optimize a system offline where latency and token cost do not matter. Do not put its inference-time loop on a hot path serving thousands of requests a day.

GEPA: reflective evolution that beats reinforcement learning

GEPA (Genetic-Pareto) is the newest of the three and, on paper, the most impressive. Introduced in "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 2025, arXiv:2507.19457), which was accepted as an oral presentation at ICLR 2026, it treats prompt optimization as an evolutionary search guided by reflection. Instead of brute-forcing candidates, GEPA reads the traces of failed runs, reflects in natural language on why they failed, and mutates instructions accordingly, keeping a Pareto front of diverse high-performers rather than collapsing to a single candidate.
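One way to read "a Pareto front of diverse high-performers" is: keep every candidate that is best on at least one example, even if its average is not the highest. The sketch below shows only that selection step under that reading; the prompt names and scores are made up, and the reflection and mutation steps are left out.

```python
def pareto_front(scores):
    """Keep each candidate that is best (or tied for best) on some example.

    scores maps candidate name -> list of per-example scores.
    """
    n_examples = len(next(iter(scores.values())))
    front = set()
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        front.update(name for name, per_example in scores.items()
                     if per_example[i] == best)
    return sorted(front)


# All three candidates have the same total (24), so a plain "keep the best
# average" rule cannot tell them apart.
scores = {
    "prompt_a": [10, 2, 3, 9],   # wins examples 1 and 4
    "prompt_b": [4, 9, 8, 3],    # wins examples 2 and 3
    "prompt_c": [6, 6, 6, 6],    # never the best on any example
}

print(pareto_front(scores))
```

Keeping both specialists gives the next round of mutations two different strengths to build on, where collapsing to a single candidate would discard one of them.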

The numbers that make GEPA worth a look

Three published figures stand out:

+13% aggregate over MIPROv2 across all tested tasks and models — a meaningful margin against DSPy's own default optimizer.
Up to +20% over GRPO (a reinforcement-learning method) while using up to 35x fewer rollouts. That rollout efficiency is the headline: RL approaches are notoriously sample-hungry, and GEPA reaches better results with a fraction of the budget.
93% on MATH vs 67% baseline for a ChainOfThought program — the 26-point lift mentioned earlier, achieved with instruction-only optimization.

Crucially, GEPA focuses on instruction-only optimization — it does not bootstrap few-shot demonstrations the way MIPROv2 does. That keeps the resulting prompts shorter and cheaper to run at inference, which matters more than people admit when your prompts ship to production thousands of times a day.

How to actually use it

GEPA ships two ways. It is integrated into DSPy as dspy.GEPA, so if you already have a DSPy program you can swap your optimizer line and re-compile. It also ships as a standalone library: pip install gepa (or pip install "gepa[confidence]" for the confidence-interval extras), which lets you optimize text artifacts outside the DSPy module system. That dual packaging is smart — you can adopt GEPA without buying into the full DSPy abstraction.

Head-to-head comparison

Dimension	DSPy (MIPROv2)	TextGrad	GEPA
Optimization unit	Whole pipeline, compile-time	Single instance, inference-time	Instructions, evolutionary search
Mechanism	Bayesian instruction + few-shot search	Textual gradients via backprop	Reflective genetic-Pareto evolution
Optimizes few-shot demos?	Yes (instructions + demos)	No (refines the variable text)	No (instruction-only)
Needs labeled dataset?	Yes — few dozen+ examples	Minimal; works per-instance	Yes — for the metric/traces
Relative cost to run	Moderate (compile once)	High (multi-call per instance)	Low rollouts vs RL; efficient
Published headline	DSPy's tested default	Nature 2025 publication	+13% vs MIPROv2, ICLR 2026 oral
Best for	Reliable multi-step systems	Hard single-instance refinement	Maximum metric lift, cheap prompts
Install	`pip install dspy`	`pip install textgrad`	`pip install gepa` or `dspy.GEPA`

A decision matrix for real teams

Frameworks are easy to list and hard to choose between. Here is the logic I actually use when scoping a project at wardigi.com:

You have labeled data and a multi-step pipeline → start with DSPy and MIPROv2. It gives you structure, reproducibility, and a clean re-compile path when models change. This is the safest default for a team shipping a product.
You have a DSPy program and want more accuracy for the same prompt length → swap MIPROv2 for dspy.GEPA. The instruction-only output keeps inference cheap, and the published +13% margin is worth a re-compile that costs you an afternoon.
You have a hard, low-volume, high-stakes task and no real training set → reach for TextGrad. Per-instance refinement is its whole point, and the extra token cost is justified when one output matters more than throughput.
You are not in the DSPy ecosystem but want to optimize a single critical prompt → use standalone GEPA. pip install gepa, define a metric, and evolve the instruction without adopting modules and signatures.
You cannot write a metric function → none of these will help yet. Build the evaluation first. Every one of these tools is only as good as the metric you give it; a vague "make it better" optimizes toward nothing.
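The rules above can be written down as a small function. This is a sketch: the flag names are hypothetical, and the precedence between rules is an assumption, since the list does not rank them.

```python
def pick_optimizer(*, has_metric, has_labeled_data=False, multi_step=False,
                   has_dspy_program=False, high_stakes_low_volume=False):
    """Return a starting recommendation for a project."""
    if not has_metric:
        # Nothing to optimize against until an evaluation exists.
        return "none yet: build the evaluation first"
    if has_dspy_program:
        # Already compiled with DSPy: swap the optimizer and re-compile.
        return "dspy.GEPA"
    if has_labeled_data and multi_step:
        return "DSPy + MIPROv2"
    if high_stakes_low_volume and not has_labeled_data:
        return "TextGrad"
    # Single critical prompt outside the DSPy ecosystem.
    return "standalone GEPA"


print(pick_optimizer(has_metric=True, has_labeled_data=True, multi_step=True))
print(pick_optimizer(has_metric=True, high_stakes_low_volume=True))
print(pick_optimizer(has_metric=False))
```

The metric check comes first on purpose: every other branch assumes there is something to score against.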

Setting up your first optimization run

The frameworks differ, but the workflow that gets results is the same across all three. Here is the sequence I follow on every new project, learned the hard way after burning a weekend optimizing against a broken metric on BizChat Revenue Assistant.

Build the evaluation set before the program. Collect 30–80 examples with known-good outputs. This is the slowest step and the one teams skip — and skipping it is why their optimization "doesn't work." Without ground truth, the optimizer has no signal.
Write a metric that resists gaming. Exact-match is tempting and almost always wrong. For SmartExam I moved to a rubric metric that scored relevance, difficulty calibration, and format separately, then summed them. The moment the metric stopped being trivially exploitable, the optimized prompts stopped being clever-but-useless.
Split your data. Hold out a validation set the optimizer never sees. Prompt optimizers overfit just like model training does — a prompt tuned to 100% on the training examples can crater on unseen inputs.
Start cheap, then escalate. Run MIPROv2 first to get a baseline number. If you want more, swap in GEPA and compare against the same held-out set. Only reach for TextGrad's per-instance loop on the specific cases that still fail.
Re-compile, do not re-write, when the model changes. The entire value of this approach is that swapping from one model to another becomes a re-run of the optimizer, not a manual prompt rewrite. Treat the compiled prompt as a build artifact, not source code you edit by hand.
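Two of the steps above, the rubric metric and the held-out split, are easy to sketch in plain Python. The scoring rules here are crude keyword checks invented for illustration, not the SmartExam rubric, and the field names (`text`, `topic`, `level`, `target_level`) are hypothetical.

```python
import random


def score_format(item):
    """1 if the text is phrased as a question, else 0."""
    return int(item["text"].strip().endswith("?"))


def score_relevance(item):
    """1 if the target topic is mentioned, else 0 (a crude keyword check)."""
    return int(item["topic"].lower() in item["text"].lower())


def score_difficulty(item):
    """1 if the labeled difficulty matches the requested level, else 0."""
    return int(item["level"] == item["target_level"])


def rubric_metric(item):
    """Score each dimension separately, then sum (range 0..3)."""
    return score_format(item) + score_relevance(item) + score_difficulty(item)


def train_val_split(examples, val_fraction=0.25, seed=0):
    """Shuffle deterministically and hold out a validation slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


good = {"text": "What is the derivative of x^2?", "topic": "derivative",
        "level": 2, "target_level": 2}
weak = {"text": "Derivative?", "topic": "derivative",
        "level": 1, "target_level": 2}
print(rubric_metric(good), rubric_metric(weak))  # 3 2

train, val = train_val_split(range(40))
print(len(train), len(val))  # 30 10
```

Scoring the dimensions separately also makes regressions easier to read: a drop in the total can be traced to the dimension that moved.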

Two pitfalls catch almost everyone. The first is optimizing against too few examples — under about 20, the search overfits and the gains evaporate on real traffic. The second is forgetting to version your compiled prompts; when a re-compile produces a regression, you want the previous artifact to roll back to. I keep optimized prompts in Git alongside the program, tagged with the dataset hash they were compiled against, so I can always reproduce which data produced which prompt.
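Both pitfalls can be guarded in a few lines. This is a sketch of the idea rather than the exact tooling described above: `dataset_hash` and `prompt_artifact` are hypothetical helpers, and the threshold of 20 simply mirrors the rule of thumb in the paragraph.

```python
import hashlib
import json


def dataset_hash(examples):
    """Stable short hash of the examples a prompt was compiled against.

    Keys are sorted so dict ordering does not matter; the order of the
    examples in the list still does.
    """
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def prompt_artifact(prompt_text, examples, min_examples=20):
    """Bundle a compiled prompt with the fingerprint of its training data."""
    if len(examples) < min_examples:
        raise ValueError(
            f"only {len(examples)} examples; need at least {min_examples}"
        )
    return {
        "prompt": prompt_text,
        "dataset_hash": dataset_hash(examples),
        "n_examples": len(examples),
    }


examples = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(30)]
artifact = prompt_artifact("Reason step by step, then answer.", examples)
print(artifact["dataset_hash"], artifact["n_examples"])
```

Writing the resulting dictionary to a JSON file next to the program gives Git something small and diffable to version.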

The lesson nobody puts in the README

Across the AI products we run — SmartExam, DocSumm, BizChat Revenue Assistant, ServiceBot AI Helpdesk — the single biggest predictor of whether prompt optimization paid off was not which framework I picked. It was whether I had a trustworthy metric. When I gave DSPy a sloppy exact-match metric on SmartExam, it happily optimized the prompt to game that metric and produced worse questions. When I rewrote the metric to score on a rubric the optimizer could not trivially exploit, the same DSPy run produced a genuinely better pipeline.

So the order of operations is: build the evaluation set and metric first, pick the framework second. A mediocre optimizer with a sharp metric beats a brilliant optimizer with a fuzzy one every time. That is the part that does not show up in the benchmark tables, and it is the part that decided every one of my projects.

Frequently asked questions

Can I use DSPy and GEPA together?

Yes — GEPA is integrated into DSPy as dspy.GEPA, so it is a drop-in optimizer for a DSPy program. You write the program once and choose your optimizer at compile time. GEPA also ships standalone via pip install gepa if you want it outside the DSPy module system.

Is TextGrad too expensive for production?

On a high-traffic endpoint, usually yes — its inference-time loop makes several LLM calls per instance, which roughly tripled my per-document call count in a DocSumm test. It earns its cost on low-volume, high-stakes tasks or as an offline refinement pass on the hardest cases, not on a hot path.

Do I really need labeled data?

For DSPy and GEPA, effectively yes — they optimize against a metric, and a metric needs ground truth or a scoring rubric to compute. TextGrad can refine a single instance with minimal data, but even then you need some way to evaluate whether an output got better.

What makes GEPA's results stand out?

Two things: it beat DSPy's own MIPROv2 by 13% aggregate, and it beat the reinforcement-learning method GRPO by up to 20% while using up to 35x fewer rollouts. The rollout efficiency matters because it means you reach strong results without the enormous sample budgets RL usually demands. It was accepted as an ICLR 2026 oral.

Which one should a solo developer start with?

Start with DSPy plus MIPROv2 if you have any labeled data, because the structure pays off as your project grows. If you are optimizing a single prompt and want the cleanest path to a measurable lift, standalone GEPA is the lighter commitment.

Final take

These three tools are not really competitors — they are complementary stages of a maturing practice. Compile your pipeline with DSPy, switch the optimizer to GEPA when you want maximum lift for minimal prompt length, and keep TextGrad in your back pocket for the handful of hard instances that need bespoke refinement. The frameworks have moved prompt engineering from an art into something closer to a measurable, repeatable discipline.

The most important shift is mental: stop editing prompt strings by hand. Define the program, write an honest metric, and let the optimizer search. In my experience running seven content and AI sites, that one change saved more engineering time than any model upgrade. The tools are ready — the question is whether your evaluation is.

Sources: DSPy documentation (dspy.ai); GEPA — Agrawal et al., arXiv:2507.19457 (ICLR 2026 oral); TextGrad — Yuksekgonul et al., Nature vol. 639, pp. 609–616 (2025).
