Duplicate 3 Layers in a 24B LLM and Logical Deduction Jumps from 0.22 to 0.76 — No Training Required

I was halfway through debugging a fine-tuning script last night when I saw this pop up on Hacker News: someone replicated an existing research method, duplicated three specific layers in a 24-billion parameter model, and boosted logical deduction scores from 0.22 to 0.76. No training. No weight changes. Just routing hidden states through the same circuit twice.

I read it three times because I genuinely did not believe it the first two.

How Duplicating Transformer Layers Boosts LLM Reasoning

The project is called LLM Circuit Finder, built by a researcher named Alain who extended David Ng's RYS (Repeat Your Steps) method. The core idea is almost stupidly simple: certain contiguous blocks of layers inside transformers act as indivisible cognitive units — think of them as reasoning circuits. If you duplicate those specific blocks in the forward pass, the model gets a second pass through its reasoning pipeline.

Same weights. Same parameters. No gradient updates. No merging. Just... run the same three layers twice.
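In code, the trick is just repeating a slice of the decoder stack in the execution order. Here is a minimal sketch of that reordering; the function below is my illustration, not the toolkit's actual API. With a Hugging Face Llama/Mistral-style model you would apply it to `model.model.layers` and wrap the result back in `torch.nn.ModuleList`.

```python
def duplicate_block(layers, start, end):
    """Repeat layers[start..end] (inclusive) in the execution order.

    The repeated entries are the *same* objects, so weights stay shared:
    no parameters are copied, the hidden states simply pass through the
    block a second time."""
    block = list(layers[start:end + 1])
    return list(layers[:end + 1]) + block + list(layers[end + 1:])

# A 40-layer stack with layers 12-14 duplicated becomes 43 entries long,
# with positions 12-14 and 15-17 pointing at the same three layers.
order = duplicate_block(list(range(40)), 12, 14)
```

Because the duplicated entries alias the originals, parameter memory does not grow at all; only per-token compute does.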

The results on Devstral-24B (a 40-layer model) are frankly ridiculous:

  • BBH Logical Deduction: 0.22 → 0.76 (+245%)
  • GSM8K (strict math): 0.48 → 0.64 (+33%)
  • MBPP (code generation): 0.72 → 0.78 (+8%)
  • GSM8K (flexible): 0.82 → 0.86 (+5%)
  • Nothing degraded across any benchmark

That last point matters a lot. We are used to tradeoffs in ML — you improve one capability and something else gets worse. Here, the average improvement across all metrics was +8% with zero degradation. That almost never happens.

The Science Behind Reasoning Circuits in Transformers

During training, transformers apparently organize themselves into functional circuits — multi-layer processing units that perform complete cognitive operations. These circuits are not something engineers designed intentionally. They emerged. The model figured out on its own that layers 12 through 14 (in Devstral's case) work together as a reasoning unit.

And the boundaries are surprisingly sharp. Shift the duplicated block by one layer in either direction and the improvement mostly disappears. It is not that any three layers will do — it has to be the right three layers.

Different models have their reasoning circuits in different places:

  • Devstral-24B (40 layers): reasoning circuit at layers 12-14
  • Qwen2.5-32B (64 layers): reasoning circuit at layers 7-9

Finding the right layers requires probing — the Circuit Finder toolkit includes tools to scan for these blocks. But once you find them, applying the trick takes minutes.
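I have not dug into the toolkit's own scanning API, but a brute-force probe is easy to sketch: slide a three-layer window over the stack, duplicate each candidate block, and score the result on a probe suite. Everything below, including the `score_fn` hook, is hypothetical illustration rather than the Circuit Finder's real interface.

```python
def scan_for_circuits(num_layers, score_fn, window=3):
    """Duplicate each contiguous `window`-layer block in turn and record
    the probe score for the modified execution order.

    `score_fn(order)` is a placeholder for "rebuild the model with this
    layer order and run the probe suite"; it returns one accuracy number.
    Returns (start_layer, score) pairs, best first."""
    base = list(range(num_layers))
    results = []
    for start in range(num_layers - window + 1):
        end = start + window - 1
        # Same reordering as the duplication trick itself.
        order = base[:end + 1] + base[start:end + 1] + base[end + 1:]
        results.append((start, score_fn(order)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The sharp boundaries described above are exactly what a scan like this would surface: one window scores well clear of its immediate neighbors.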

Two Consumer GPUs, One Evening

Here is the part that made my jaw drop. This was not done on some H100 cluster at a research lab. Alain ran everything on two AMD consumer GPUs — an RX 7900 XT and an RX 6950 XT — in a single evening. The whole thing. Discovery, implementation, benchmarking.

My friend Tom, who has been trying to squeeze better performance out of local models for months, messaged me at midnight: "Are you telling me I could have been getting 33% better math scores this whole time by just... copying three layers?"

Yes, Tom. That is exactly what I am telling you.

The toolkit validates results using the standard lm-evaluation-harness at n=50, so these are not cherry-picked examples. And it includes custom probe suites for reasoning (causal + logic + navigation), emotional intelligence, and math. On the combined reasoning probes, the score jumped from 76.5% to 94.1%, a 23% relative improvement.

Why This Matters for the AI Industry

We spend billions of dollars training larger and larger models. We build massive clusters with thousands of GPUs. We argue about whether scaling laws are hitting a wall or whether we just need more data. And then someone comes along and says: actually, your model already knows how to reason better, you just need to let it think twice.

This connects to something Mistral has been exploring with Leanstral — the idea that AI reasoning capabilities might be more about architecture and inference strategy than raw parameter count. If you can get a 24B model to perform like a much larger one on reasoning tasks just by duplicating three layers, what does that say about our approach to scaling?

It also raises questions about chain-of-thought prompting and test-time compute. OpenAI's o1 and o3 models already use extended thinking at inference time. But those approaches require special training. The circuit duplication method works on off-the-shelf models with zero modification to the weights.

The Limitations Nobody Should Ignore

Before everyone starts duplicating layers in production, a few caveats. The benchmarks were run at n=50, which is a decent sample size for initial validation but not exactly large-scale. The method has been tested on two models so far — Devstral-24B and Qwen2.5-32B. We do not know if it generalizes to GPT-class models, Llama variants, or anything trained with RLHF.

There is also the latency question. Duplicating layers means more compute per forward pass: three extra layers on a 40-layer model is roughly 7-8% more work per token. On a model that already takes 200ms per token, that pushes it to around 215ms. For some applications that matters. For others, a 245% improvement in logical deduction is worth an extra 15ms.

And crucially, this does not make the model better at everything. Causal judgment and instruction following did not improve at all. The gains are concentrated in logical reasoning, math, and code — which are exactly the capabilities where LLMs still struggle the most. That makes this feel less like a universal trick and more like a targeted intervention for specific cognitive tasks.

How to Try LLM Circuit Finder Right Now

The entire toolkit is open source on GitHub and runs on consumer hardware. You need:

  • Python 3.8+ with PyTorch
  • A GPU with enough VRAM to load your target model (the 24B model needs around 48GB across two cards with quantization, or a single 80GB card)
  • The lm-evaluation-harness for benchmarking

The workflow is: load model → run the circuit finder to identify reasoning blocks → duplicate those blocks → benchmark. The whole process took the author one evening, and most of that was waiting for benchmarks to complete.
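To make the "duplicate those blocks" step concrete, here is how it might look on a loaded model. This is my sketch under the assumption of a Llama/Mistral-style layout where the decoder stack lives at `model.model.layers`; `patch_model` is a name I made up, not the toolkit's actual function.

```python
import torch.nn as nn

def patch_model(model, start, end):
    """Duplicate decoder layers [start, end] in place.

    Slicing an nn.ModuleList keeps the same module objects, so the
    duplicated block shares weights with the original: parameter memory
    does not grow, only per-token compute does."""
    layers = model.model.layers
    new_order = (list(layers[:end + 1])
                 + list(layers[start:end + 1])
                 + list(layers[end + 1:]))
    model.model.layers = nn.ModuleList(new_order)
    # Keep the config consistent for code that reads layer counts.
    model.config.num_hidden_layers = len(new_order)
    return model
```

After patching, generation works as usual via `model.generate(...)`, just with the extra passes through the duplicated block, and the benchmarks run against the patched model like any other.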

If you have been spending money on API calls to larger models for reasoning tasks, this might be worth testing. Running a modified 24B model locally could give you reasoning performance closer to much larger models at a fraction of the cost.

What Comes Next

The research community is going to run with this. I expect to see papers within weeks testing circuit duplication on Llama 3, Gemma, Phi, and whatever else people have lying around. Someone will probably try it on a 7B model and see if you can get smaller models to punch above their weight class.

The really interesting question is whether this trick can be combined with other inference-time techniques — chain of thought, self-consistency, tree of thoughts. If duplicating reasoning circuits works independently of prompting strategy, stacking them could produce even larger gains.

For now, I am going to finish that fine-tuning script I was working on. But I am also going to set aside this weekend to try circuit duplication on a few models I have been testing. Because a 245% improvement in logical deduction from zero training is not something you just scroll past.

For more on AI model improvements, see our Leanstral formal proof comparison and Anthropic 1M context announcement.
