Andrej Karpathy just open-sourced something that quietly broke my brain.
It's ~630 lines of Python. It runs on a single GPU. It does more useful ML research in one night than most people accomplish in a week. And the most intellectually demanding part - the part where you actually think about what experiments to run - is written in a Markdown file.
He called it AutoResearch. Let's break down what it is, why it works, and why it probably matters more than most things released this year.
The Problem With How We Do ML Research
Here's the dirty secret of machine learning research: a huge fraction of it is vibes-driven trial and error.
You have a hypothesis - "maybe ReLU² activations are faster than GeLU here." You write the code change, kick off a training run, wait hours, squint at a loss curve, say "hm, maybe," and try something else. Repeat for months. This is what people at frontier labs do. Brilliant, expensive people. Doing something that is, at its core, a combinatorial search problem over hyperparameter space.
The community has a name for this process: "grad student descent." It's a joke. But Karpathy stopped finding it funny.
The thing is, the intuition part - knowing which experiment to run next, knowing when something smells wrong, knowing which 2025 papers are worth stealing from - is genuinely hard, and that's where human expertise earns its keep. But the execution part? Running the experiment, checking if val_loss went down, committing the result, moving on? That's just a loop. And loops are what computers are for.
So the question became: what's the minimal scaffolding needed to make that loop run by itself?
The Architecture: Three Files, One Idea
The answer turned out to be remarkably simple. Three files. One principle.
The principle: separate what's fixed (the scientific apparatus) from what's variable (the hypothesis being tested) from what's intentional (the research strategy).
autoresearch/
├── prepare.py ← LOCKED. The lab apparatus. Agent never touches this.
├── train.py ← MUTABLE. The hypothesis space. Agent rewrites this.
└── program.md ← YOURS. Plain English research strategy. You write this.
prepare.py is the fixed apparatus. It downloads data, trains the BPE tokenizer, and sets up the evaluation pipeline. The agent is strictly forbidden from touching it, and this constraint is load-bearing. If an optimizer can touch the thing that measures success, it will find ways to make the metric look good without actually improving anything. This is called reward hacking, and it's both a real RL failure mode and, honestly, a pretty good description of some published academic papers. So prepare.py is sacred. Off limits.
train.py is the mutable hypothesis. It's a ~630-line nanoGPT-style decoder-only transformer: model definition, optimizer configuration, training loop, logging. Every single experiment is a diff on this one file. The agent might change the attention pattern, swap an activation function, adjust weight decay, or try a new LR schedule. One change at a time. Like a real scientist.
program.md is the interesting one. This is where you live as a researcher. It's a plain Markdown file that tells the agent what you care about, which papers to draw from, what counts as an acceptable experiment, what your aesthetic standards are. One of Karpathy's favorite lines from his own version: "A small improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome." That's not code. That's a philosophy. And the agent actually internalizes it.
The Loop (The Beautiful Mechanical Heart)
Here's what happens, over and over, all night long:
# The agentic research loop — simplified pseudocode
while True:
    # Agent reads everything it knows
    context = (
        read("program.md")                    # your research strategy
        + git_log_summary()                   # what's been tried before
        + read("train.py")                    # current state of the code
        + f"best val_bpb so far: {best_bpb}"
    )

    # Agent proposes a change
    proposal = LLM_call(
        context,
        prompt="Propose one improvement that will lower val_bpb. "
               "Output a clean git patch. Follow program.md exactly."
    )

    # Apply the change and train for exactly 5 minutes
    apply_patch(proposal)
    output = run_with_timeout("uv run train.py", timeout=300)  # hard wall

    # Parse the one metric that matters
    new_bpb = extract_float(r"val_bpb: (\d+\.\d+)", output)

    if new_bpb < best_bpb:
        # Success — immortalize it
        delta = best_bpb - new_bpb
        git_commit(f"[AutoResearch] {proposal.summary} | Δbpb={delta:.4f}")
        best_bpb = new_bpb
        append_to_results_tsv(proposal, new_bpb, "success")
    else:
        # Failure — erase it like it never happened
        git_checkout("HEAD", "train.py")
        append_to_results_tsv(proposal, new_bpb, "rejected")
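The extract_float helper in the loop above is only a few lines in practice. Here's a minimal sketch - the fallback-to-infinity behavior is my assumption, not necessarily the repo's actual code, but it makes crashed runs score infinitely bad and therefore get auto-rejected:

```python
import re

def extract_float(pattern, text):
    # Take the last match so we read the final validation report;
    # a crashed run that never printed val_bpb scores float("inf"),
    # which guarantees it loses to the current best and gets reverted.
    matches = re.findall(pattern, text)
    return float(matches[-1]) if matches else float("inf")

log = "step 900 val_bpb: 0.8412\nstep 1000 val_bpb: 0.8391\n"
print(extract_float(r"val_bpb: (\d+\.\d+)", log))  # → 0.8391
```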
The agent reads program.md, looks at the git history to understand what's already been tried, reads current train.py, and proposes a change. It writes the diff, applies it, and runs training.
Every single run gets exactly five minutes. Not four, not six. Five. Hard wall-clock limit.
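Enforcing that hard wall needs nothing fancier than a subprocess timeout. A minimal sketch of what a run_with_timeout helper like the one in the pseudocode above might look like (this implementation is mine, for illustration):

```python
import subprocess

def run_with_timeout(cmd, timeout=300):
    # Hard wall-clock cap: if the run overruns the budget, the process
    # is killed and we return empty output, so no val_bpb gets parsed
    # and the experiment is automatically rejected.
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout
    except subprocess.TimeoutExpired:
        return ""

print(run_with_timeout("echo val_bpb: 0.9123").strip())  # → val_bpb: 0.9123
```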
At the end of five minutes, the script reads exactly one number: val_bpb. Bits per byte. How many bits does the model need to encode the next byte of text? Lower is better. It's information-theoretically clean, tokenizer-independent, and ruthless.
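Concretely, if your training loop reports cross-entropy in nats, converting to bits per byte is one line. A sketch (the helper name is mine, not the repo's):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    # Total negative log-likelihood over the validation set, in nats,
    # divided by the number of raw text bytes, converted to bits.
    # Tokenizer-independent: a coarser tokenizer gets fewer but harder tokens.
    return total_nll_nats / (total_bytes * math.log(2))

# e.g. a summed NLL of 2.0e6 nats over 3.0e6 bytes of validation text
print(round(bits_per_byte(2.0e6, 3.0e6), 4))  # → 0.9618
```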
If val_bpb went down? git commit. Git becomes the lab notebook. If val_bpb didn't improve, or if the run crashed, or if it finished in 4:30 instead of 5:00? git checkout HEAD -- train.py. Revert. Try something else.
This loop runs at about 10–12 experiments per hour. Leave it overnight and you get ~100 fully documented, reproducible, honestly-evaluated experiments by morning.
Why Five Minutes? (And Why This Actually Works)
Wall-clock time →
|----1min----|----2min----|----3min----|----4min----|----5min----|
      ↑                                                    ↑
Signal visible                                       Hard cutoff
(better arch shows up                        (git commit or revert)
 early in the loss trajectory)
The five-minute constraint seems aggressively arbitrary, but it's doing real work.
First, it normalizes for efficiency, not just raw accuracy. An architecture that needs 10x more compute per step to get marginally better gradients will fail inside the budget. The metric rewards models that are fast and good, which is almost always what you actually want in the real world.
Second, it creates enough signal-to-noise ratio to distinguish good ideas from bad ones. This only works because the baseline code is already insanely fast - it's descended from the NanoGPT speedrun community project, which compressed the time to train GPT-2 from 45 minutes (early 2024) to under 90 seconds (early 2026) through a cascade of radical improvements. Because the starting point is already blazing, five minutes is a substantial fraction of a full training run. You see real convergence behavior, not just noise.
Third, it forces the agent to reason about the trajectory of learning, not just terminal loss. A better initialization or a more stable optimizer shows up in the first few minutes. The agent learns to read these early signals.
The Optimizer Under the Hood (For the Technically Curious)
The baseline train.py uses a hybrid optimizer that's worth understanding, because the agent often discovers its best improvements around it.
For most weight matrices, Karpathy uses Muon, short for Momentum Orthogonalized by Newton-Schulz. Here's the intuition for why it's interesting.
Standard Adam treats every weight in a matrix as an isolated scalar. It tracks that scalar's gradient history individually and adjusts its learning rate accordingly. This works reasonably well, but it ignores something important: weight matrices aren't bags of scalars. They're geometric objects that transform vector spaces. When you optimize them element-wise, you tend to produce updates that collapse the matrix into a low-rank subspace - like a 100×100 matrix that's only doing the real work of a 5×5 one. You're burning parameters on nothing.
Adam update on a weight matrix W:
──────────────────────────────────
W (100×100) → treated as 10,000 isolated scalars
Each weight updated independently based on its own gradient history
Problem: updates tend to align along a few dominant directions
Result: effective rank collapses - you waste most of W's capacity

┌─────────────────────────────┐
│ W (full rank: 100)          │
│ After Adam for many steps   │
│ Effective rank: ~5          │  ← wasted capacity
└─────────────────────────────┘
Muon says: what if the update direction was always an orthogonal matrix? Mathematically, if your gradient momentum is M = UΣVᵀ (its SVD), the ideal update throws away Σ entirely and just uses UVᵀ. Equal magnitude in every direction. Full use of the matrix's dimensional capacity. No collapse.
import torch

def muon_update(M):
    # Step 1: Scale so spectral norm ≤ 1 (required for NS convergence).
    # Dividing by the Frobenius norm guarantees this, since it upper-bounds
    # the spectral norm.
    X = M / max(1.0, torch.linalg.matrix_norm(M, ord='fro'))
    # Step 2: Newton-Schulz iterations — each one roughly doubles the precision.
    # This drives X toward the orthogonal polar factor of M,
    # i.e., the U @ V.T from M's SVD — without ever computing the SVD!
    for _ in range(5):  # typically 2-5 iterations is enough
        X = 0.5 * X @ (3 * torch.eye(X.shape[1]) - X.T @ X)
    # X is now approximately orthogonal: X.T @ X ≈ I
    return X  # apply this as the weight update direction
The catch is that computing a full SVD at every training step is prohibitively expensive. So Muon uses the Newton-Schulz iteration instead - a numerical trick from 1950s quantum chemistry that approximates the orthogonal polar factor using only matrix multiplications, the exact operation modern GPUs are purpose-built to do at massive throughput. The overhead ends up being about 0.7% of total compute. Essentially free, for a meaningfully better update.
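You can verify the claim with nothing but the standard library. The sketch below runs the same iteration on a 2×2 matrix built as a scaled rotation, whose orthogonal polar factor is exactly the rotation - all names here are mine, for illustration:

```python
import math

def matmul(A, B):
    # 2x2 matrix product on plain lists
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def newton_schulz(M, steps=10):
    # Normalize by the Frobenius norm so every singular value is <= 1,
    # then iterate X <- 0.5 * X @ (3I - X.T @ X), which pushes every
    # singular value toward 1 while leaving the singular vectors alone.
    fro = math.sqrt(sum(x * x for row in M for x in row))
    X = [[x / fro for x in row] for row in M]
    for _ in range(steps):
        XtX = matmul(transpose(X), X)
        B = [[3.0 * (i == j) - XtX[i][j] for j in range(2)] for i in range(2)]
        X = [[0.5 * x for x in row] for row in matmul(X, B)]
    return X

# M = 2.5 x (rotation by 30 degrees); its orthogonal polar factor is the rotation
t = math.radians(30)
R = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
M = [[2.5 * x for x in row] for row in R]
X = newton_schulz(M)
print(all(abs(X[i][j] - R[i][j]) < 1e-9 for i in range(2) for j in range(2)))  # → True
```

Ten iterations recover the rotation to machine precision, using only matrix multiplications - exactly the operation GPUs are fastest at.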
For 1D parameters (biases, LayerNorm scales) and for the embedding matrices, standard AdamW is retained: Muon's orthogonalization math doesn't make geometric sense for vectors, and embeddings act more like lookup tables than linear maps. The hybrid deployment looks like this:
# Hybrid optimizer setup in train.py (simplified)
muon_params = [p for n, p in model.named_parameters()
               if p.ndim == 2 and "embed" not in n]  # attention/MLP matrices
adam_params = [p for n, p in model.named_parameters()
               if p.ndim != 2 or "embed" in n]       # biases, norms, embeddings
optimizers = [
    Muon(muon_params, lr=0.02, momentum=0.95),
    AdamW(adam_params, lr=3e-4, betas=(0.9, 0.95)),
]
What Actually Happened (The Results)
In Karpathy's production setup, he uses a depth-12 proxy model to find improvements, then verifies they transfer to a depth-24 target. The agent found and stacked several non-obvious wins over ~700 experiments across two days:
A QK-norm scale multiplier that sharpened attention. A value-embedding regularization term. Less conservative banded attention patterns. AdamW beta adjustments. Weight-decay schedule tweaks during warmup. None individually spectacular. Each a small step. But they compounded.
Net result on the "Time to GPT-2" leaderboard: 11% faster. 2.02 hours down to 1.80 hours on identical hardware and data. Real, transferable, additive algorithmic insights that would have taken weeks to find manually. The agent found them overnight and left a clean git history to read in the morning.
Tobi Lütke (Shopify CEO) ran a similar experiment on a proprietary query-expansion model - 37 experiments overnight. Result: 19% quality improvement, with an optimized 0.8B model outperforming the previous hand-tuned 1.6B baseline. Cheaper and better. The agent figured out the extra parameters weren't earning their keep.
The Philosophical Part (Bear With Me)
It's worth being honest about what AutoResearch is and isn't.
It is not an AI that invents new paradigms. It's a greedy hill-climber in a fixed search space. It won't discover the Transformer architecture, or Mamba, or whatever conceptual leap comes next. For that kind of discontinuous breakthrough, you need to cross a valley: accept temporarily worse performance to reach a higher peak. A loop that only accepts improvements will get stuck in local optima. Future versions of program.md will likely need to instruct the agent on something like simulated annealing: "occasionally try something structurally radical even if it makes things worse in the short run."
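A future program.md might spell that out as an acceptance rule rather than a hard "only commit improvements" gate. Here's a hedged sketch of what a Metropolis-style criterion could look like - this is my illustration, not anything in the repo:

```python
import math
import random

def accept(new_bpb, best_bpb, temperature):
    # Always accept improvements; accept regressions with probability
    # exp(-delta / T). A higher temperature tolerates bolder, temporarily
    # worse experiments - the "cross the valley" behavior.
    delta = new_bpb - best_bpb
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)

print(accept(0.93, 0.95, temperature=0.01))  # improvement → True
```

Annealing the temperature toward zero over the night would recover the current greedy behavior in the final hours.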
What AutoResearch is, is a proof-of-concept for something bigger. ML research has two components: (1) having good taste - knowing what's worth trying - and (2) execution - actually running the experiments. This framework fully automates component 2. And it turns out component 2 was the bottleneck far more often than anyone admitted.
The residual role for humans is writing program.md. That's the new job description. You encode your research taste, your theoretical priors, your sense of what's elegant versus ugly. You define the objective and the constraints. The agent handles the rest.
How to Run It Yourself
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# uv is a fast Python package manager — install it if you don't have it
uv sync
# One-time setup: downloads data, trains tokenizer, sets up eval
uv run prepare.py
Then write your program.md. This is the part that matters most - be specific. Tell it what papers to draw from. Tell it what you consider bad taste. Tell it to ablate one variable at a time. Something like:
# Research Program
## Goal
Minimize val_bpb on FineWeb-EDU using a single H100 with 5-min experiments.
## Strategy
- Ablate ONE variable per commit. No bundled changes.
- Draw from 2025–2026 transformer papers (Muon, QK-norm, RoPE variants).
- Complexity penalty: if a change adds >20 lines for <0.001 bpb gain, reject it.
- Removing something and matching performance is a great result.
## Hard Constraints
- Never touch prepare.py
- Every run must complete in 300s ± 15s or it's invalid
- Only commit if val_bpb strictly improves over current best
Point a strong coding agent (Claude 3.5 Sonnet in Cursor works great) at the repo, say "follow program.md exactly and begin autonomous research," and go to sleep.
In the morning, run git log --oneline and read what happened.
The Bigger Picture
Every frontier lab is going to run something like this internally. The ones that figure out how to write good program.md equivalents - how to encode research taste and strategic direction into agent instructions - will have a meaningful edge. It's a new skill, and it's not really coding. It's closer to being a good PI: writing clear research proposals, setting smart constraints, defining what counts as evidence.
The most interesting second-order effect is democratization. A researcher with one GPU and a well-crafted program.md can now run 100 experiments overnight. The gap between "I have an idea" and "I have evidence for my idea" shrinks from weeks to hours. That changes which questions are even worth asking.
The long-term vision Karpathy gestures at - a bit sci-fi, but not as far off as it sounds - is something like SETI@home for AI research: thousands of agents, each with a GPU, each running a slightly different program.md, sharing results via git branches and GitHub Discussions. A distributed, autonomous research collective that discovers SOTA recipes while the humans sleep.
We're not there yet. But AutoResearch is the first working prototype of that idea. And it already works today, on consumer hardware, for about $1 per experiment.
The repo is at github.com/karpathy/autoresearch. Read the program.md carefully - it's the most interesting file in there. Then write your own.
The future of ML research is not going to be written in Python. It's going to be written in plain English. Get good at that.