Last Thursday, I was scrolling through my bookmarks — the graveyard where interesting links go to die — when my former labmate Sam sent me a message: "Elena, drop everything. SkyPilot just published the autoresearch scaling results."
I did not drop everything. I was eating pad thai and was not about to let it get cold. But I did open the tab, and twenty minutes of reading later the pad thai was cold anyway.
Here is the short version: they took Andrej Karpathy's autoresearch project — a system where a coding agent autonomously improves a neural network training script — and gave it access to 16 GPUs on a Kubernetes cluster. Over 8 hours, it submitted approximately 910 experiments and reduced validation loss (val_bpb) from 1.003 to 0.974. A roughly 2.9% improvement over baseline.
That does not sound dramatic until you realize the sequential baseline would have taken 72 hours to reach the same result. The parallel version got there in 8. That is a 9x speedup.
But the really interesting part is not the speedup. It is how the agent changed its behavior when it had more resources.
What Is Autoresearch, and Why Should You Care?
Autoresearch is deceptively simple. It has three files:
- prepare.py — downloads data, trains a tokenizer, provides the dataloader. The agent cannot touch this.
- train.py — the GPT model, optimizer, and training loop. This is the only file the agent modifies.
- program.md — instructions telling the agent what to change, how to evaluate, and when to keep or discard changes.
The constraint: a fixed 5-minute wall-clock training budget per experiment. The agent's job is to minimize validation bits per byte within that window. Architecture, hyperparameters, optimizer settings, batch size, model depth — everything in train.py is fair game.
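To make the setup concrete, here is a minimal sketch of that fixed-budget loop. The function names and the toy training step are mine, not autoresearch's, and the budget is shrunk from 5 minutes to a fraction of a second so the sketch runs without a GPU:

```python
import time

def run_with_budget(train_step, evaluate, budget_s=300.0):
    """Run training steps until the wall-clock budget expires,
    then return (steps completed, final validation metric)."""
    deadline = time.monotonic() + budget_s
    steps = 0
    while time.monotonic() < deadline:
        train_step()
        steps += 1
    return steps, evaluate()

# Toy stand-ins so the sketch runs anywhere: each "step" nudges a
# fake val_bpb down toward an asymptote, mimicking a loss curve.
state = {"val_bpb": 1.003}

def fake_train_step():
    state["val_bpb"] = 0.95 + (state["val_bpb"] - 0.95) * 0.999

def fake_evaluate():
    return state["val_bpb"]

steps, val_bpb = run_with_budget(fake_train_step, fake_evaluate, budget_s=0.05)
print(f"{steps} steps, val_bpb={val_bpb:.4f}")
```

Everything the agent does amounts to editing what happens inside `train_step` so that, given the same deadline, the final metric comes out lower.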
In Karpathy's first overnight run with one GPU, the agent found roughly 20 improvements that stacked to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard. Impressive for an unattended overnight session.
But one GPU means one experiment at a time. The agent spends most of its time waiting.
The Sequential Bottleneck (Or: Why One GPU Is Like Cooking Dinner With One Burner)
A typical cycle with one GPU looks like this:
1. Agent edits train.py (~30 seconds)
2. Training runs (~5 minutes)
3. Agent reads the result, plans the next experiment (~30 seconds)
Steps 1 and 3 are fast. Step 2 dominates. And during step 2, the agent is idle. It could be preparing the next experiment, or the next ten, but it has nowhere to run them.
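The arithmetic behind that bottleneck, using the timings from the cycle above: the parallel figure is an idealized upper bound, and the measured ~114 experiments per hour falls short of it because real waves are never perfectly packed.

```python
# Assumed per-cycle timings from the three steps above.
EDIT_S, TRAIN_S, REVIEW_S = 30, 300, 30
cycle_s = EDIT_S + TRAIN_S + REVIEW_S      # 360 s per experiment
seq_per_hour = 3600 / cycle_s              # one GPU, one at a time

def wave_per_hour(n_gpus):
    """Idealized throughput: n runs overlap, and the agent's
    think time is amortized across the whole wave."""
    wave_s = EDIT_S + TRAIN_S + REVIEW_S
    return n_gpus * 3600 / wave_s

print(seq_per_hour, wave_per_hour(16))
```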
My colleague Josh, who does hyperparameter optimization for a biotech startup in Boston, put it this way over coffee: "It is like having a chef who can plan ten dishes simultaneously but only has one burner. By the time the risotto is done, they have forgotten what they wanted to do with the salmon."
The bigger problem is combinatorial. Say the agent discovers that lower weight decay helps. It also finds that a different Adam beta helps. It wants to test them together. With sequential execution, that is another 5-minute wait. With 16 GPUs, it tests the combination alongside a dozen other combinations in the same wave.
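Sketching that combinatorial step with hypothetical values (neither list is from the actual run):

```python
from itertools import product

# Hypothetical knobs the agent found individually helpful.
weight_decay = [0.1, 0.05, 0.01]
adam_beta2 = [0.999, 0.99, 0.95]

# Sequentially, each combination costs another ~6-minute cycle.
# With 16 GPUs, one wave covers the full 3x3 grid with room to spare.
grid = list(product(weight_decay, adam_beta2))
print(len(grid))  # 9 combinations in a single wave
```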
What 16 GPUs Actually Changed
SkyPilot gave the Claude Code agent access to a Kubernetes cluster with 16 GPUs — a mix of H100s and H200s. The agent used SkyPilot to launch and manage jobs, which it learned by reading a skill file (basically documentation that teaches the agent how to use the tool).
Here is what happened:
1. Factorial Grids Instead of Greedy Search
With one GPU, the agent does greedy hill-climbing: try one thing, check, repeat. With 16, it ran factorial grids of 10–13 experiments per wave. In one early wave, it tested six different model widths simultaneously, saw the trend in a single round, and zeroed in on the winner.
Sequential search would have required six rounds for the same insight. That is 30 minutes versus 5.
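A sketch of one such wave, with hypothetical widths and learning rates standing in for whatever the agent actually swept:

```python
from itertools import product

widths = [384, 512, 640, 768, 896, 1024]  # hypothetical model widths
lrs = [3e-4, 6e-4]

experiments = list(product(widths, lrs))  # 12 configs: fits one wave of 16

def waves(configs, n_gpus=16):
    """Chunk configs into waves that fit the cluster."""
    return [configs[i:i + n_gpus] for i in range(0, len(configs), n_gpus)]

print(len(experiments), len(waves(experiments)))
```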
2. It Discovered Hardware Arbitrage on Its Own
This is the part that made me put down my fork. The agent noticed it had access to both H100s and H200s. Without being told to, it developed a strategy: screen ideas on cheaper H100s, then promote winners to H200s for validation.
Nobody programmed this. Nobody even hinted at it. The agent figured out that H200s were faster, reasoned that it should save them for validation runs where accuracy matters most, and used H100s for exploratory experiments where speed-per-dollar is the priority.
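The strategy itself is simple to sketch. This is my reconstruction, not the agent's actual code, and `fake_run` is a toy stand-in for a 5-minute training job:

```python
def screen_then_promote(configs, run, n_promote=3):
    """Screen every config on the cheaper tier, then re-validate
    only the best few on the faster, scarcer tier."""
    screened = sorted(configs, key=lambda cfg: run(cfg, gpu="H100"))
    finalists = screened[:n_promote]
    return [(run(cfg, gpu="H200"), cfg) for cfg in finalists]

# Toy proxy: pretend val_bpb improves with width (lower is better).
def fake_run(cfg, gpu):
    return 1.1 - cfg["width"] / 10000

configs = [{"width": w} for w in (384, 512, 768, 1024)]
best = screen_then_promote(configs, fake_run, n_promote=2)
print(best)
```

The economics fall out of the structure: exploratory runs are numerous and tolerant of noise, validation runs are few and worth the premium hardware.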
I have worked with human ML engineers who do not think about GPU-cost optimization this carefully. I am not proud to admit that, but it is true.
3. Interaction Effects That Sequential Search Misses
The biggest finding from the parallel runs: model width matters more than any single hyperparameter. But this was only visible because the agent could test multiple dimensions simultaneously and see how they interacted.
In sequential mode, the agent might have found that increasing width helps, and separately that adjusting learning rate helps, but never tested the combination in the same wave to see the interaction. Parallelism changes the topology of the search — it is not just faster, it is qualitatively different.
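A 2x2 factorial shows why. With hypothetical numbers (not the run's actual results), the interaction is the gap between what the two main effects predict additively and what the combination actually achieves:

```python
# Hypothetical val_bpb results for (width, lr) combinations.
results = {
    ("narrow", "low_lr"): 1.003,
    ("narrow", "high_lr"): 0.998,
    ("wide", "low_lr"): 0.992,
    ("wide", "high_lr"): 0.975,  # better than either change alone predicts
}

# Additive prediction for (wide, high_lr) from the two main effects:
base = results[("narrow", "low_lr")]
lr_effect = results[("narrow", "high_lr")] - base
width_effect = results[("wide", "low_lr")] - base
predicted = base + lr_effect + width_effect
interaction = results[("wide", "high_lr")] - predicted
print(f"predicted {predicted:.3f}, "
      f"actual {results[('wide', 'high_lr')]:.3f}, "
      f"interaction {interaction:+.3f}")
```

Greedy sequential search only ever measures the main effects; a wave measures all four cells at once, so the interaction term is visible for free.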
The Numbers
| Metric | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Total experiments | ~720 (estimated) | ~910 |
| Time to best val_bpb | ~72 hours (estimated) | ~8 hours |
| Best val_bpb | 0.974 (projected) | 0.974 (achieved) |
| Experiments per hour | ~10 | ~114 |
| Key discovery | Width helps | Width matters most + interaction effects |
| Hardware strategy | Use what you have | Self-discovered H100/H200 arbitrage |
What This Means If You Are Not Running a GPU Cluster
I know what you are thinking. "Cool, Elena. I do not have 16 GPUs. I have a 3060 Ti that I bought during the mining crash and a dream."
Fair. But there are a few takeaways that matter even for smaller setups:
Cloud GPU Costs Are Dropping
SkyPilot supports spot instances across AWS, GCP, and Azure. You can run this on spot H100s for roughly $2/GPU/hour. Sixteen GPUs for 8 hours at spot rates: roughly $250. That is not pocket change, but for a research lab or a startup doing serious model development, it is an evening's worth of compute for results that would otherwise take three days.
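The arithmetic, for the skeptical. The $2/GPU/hour spot price is a rough assumption that varies by region and provider:

```python
SPOT_H100_PER_HOUR = 2.00  # rough spot price; varies by region/provider
n_gpus, hours = 16, 8
total = n_gpus * hours * SPOT_H100_PER_HOUR
print(total)  # 256.0
```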
The Agent Framework Matters
Autoresearch works because the task is structured: a clear objective (val_bpb), a fixed budget (5 minutes), and a small search space (one file). Not every research question fits this mold. But for hyperparameter sweeps, architecture search, and training recipe optimization? This pattern is immediately applicable.
You Can Start With Two GPUs
You do not need 16. Even two GPUs let the agent test pairs of hypotheses simultaneously, catching interactions that greedy search misses. The marginal value of each additional GPU decreases, but going from 1 to 2 is the biggest jump.
If you have a machine with two GPUs — and plenty of people have dual-GPU workstations — you can run a simplified version of this today by forking autoresearch and pointing it at a local SkyPilot cluster.
How to Actually Try This
If you want to replicate the setup (or a scaled-down version of it):
- Clone autoresearch: `git clone https://github.com/karpathy/autoresearch`
- Install SkyPilot: `pip install "skypilot-nightly[aws,gcp,kubernetes]"`
- Configure your cloud credentials — SkyPilot's quickstart walks you through this in about five minutes.
- Point Claude Code at the repo with the SkyPilot skill enabled.
- Start small: 2–4 GPUs, spot instances, overnight run. Check results in the morning.
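If you want a feel for what the agent actually issues, here is a dry-run sketch that only builds `sky launch` command strings without executing anything. The flags follow SkyPilot's CLI, but the script path and hyperparameters are placeholders:

```python
# Dry-run: build the `sky launch` commands for a tiny two-experiment wave.
# Flags (-c cluster name, --gpus, --use-spot, -d detach, -y no-confirm)
# follow SkyPilot's CLI; train.py and --lr are placeholder arguments.
def launch_cmd(idx, lr, gpus="H100:1"):
    return (f"sky launch -c exp-{idx} --gpus {gpus} --use-spot -d -y "
            f"-- python train.py --lr {lr}")

cmds = [launch_cmd(i, lr) for i, lr in enumerate([3e-4, 6e-4])]
for c in cmds:
    print(c)
```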
My friend Ravi, who teaches ML at a mid-tier university in Ohio, ran a 4-GPU version over spring break and got results in about 18 hours that would have taken his students the entire semester to sweep manually. He described it as "the most productive thing I have done with AWS credits since I stopped accidentally leaving p4d instances running."
The Uncomfortable Implication
There is a thing nobody in the AI research community wants to say out loud, so I will: this kind of automated research loop, scaled to serious compute, could generate more experimental results in a weekend than a PhD student produces in a year.
That does not make PhD students obsolete — someone still needs to form hypotheses, interpret results, and decide what to try next at the strategic level. But the tactical work of "run experiment, check result, tweak hyperparameter, repeat"? An agent with access to a cluster already does that better than a human can.
I do not think this replaces researchers. I think it turns every researcher into a research director. And that shift is happening faster than most academic incentive structures are prepared for.
What to Watch For
- Autoresearch v2: Karpathy has hinted at expanding the scope beyond training recipes to architecture design. That is when things get really interesting.
- Cost-aware agents: The H100/H200 arbitrage behavior was emergent. Future agents with explicit cost objectives could be even more efficient.
- Multi-objective optimization: val_bpb is one metric. What about latency, model size, and inference cost simultaneously?
If you are building ML infrastructure, working on AI research, or just curious about what happens when you let agents run unsupervised with real compute — this is the most interesting paper-that-is-not-a-paper published this month. And unlike most ML blog posts, it comes with code you can actually run.
Now if you will excuse me, I need to go see if my AWS credits are still valid.