Autoresearch, explained

What is autoresearch?

Autoresearch is an autonomous loop in which an agent forms a hypothesis, runs an experiment, measures the result against a baseline, and keeps only the changes that measurably beat it, then repeats. EVO turns that loop into infrastructure for your whole stack.

The short answer

Autoresearch replaces guesswork with measured iteration. Instead of asking one agent for one big change, an autoresearch system runs many small experiments in a loop, keeps the ones that measurably improve a metric, and throws away the rest — compounding real gains over time.

01. Hypothesize

An agent studies your system, reads what's failed before, and proposes a concrete change to try: many small, measured attempts rather than one big rewrite.

02. Measure against a baseline

Every attempt is scored by a benchmark you define (speed, accuracy, cost, eval score), so a result is a number, not an opinion.

03. Keep only verified wins

Changes that beat the baseline and pass your safety checks are kept. Everything else is discarded. The baseline rises, and the loop runs again.
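
The three steps above can be sketched as a small loop. This is a minimal illustration, not EVO's implementation: `benchmark`, `propose`, and `autoresearch` are hypothetical names, and a toy numeric config stands in for a real system. Safety checks are left out here to keep the accept rule visible.

```python
import random
from typing import Callable, Dict

Config = Dict[str, float]


def benchmark(config: Config) -> float:
    """Toy benchmark (an assumption): higher is better, peaking at batch=32, lr=0.1."""
    return -((config["batch"] - 32.0) ** 2) / 100.0 - ((config["lr"] - 0.1) ** 2) * 100.0


def propose(config: Config, rng: random.Random) -> Config:
    """Hypothesize: make one small, isolated change to a single setting."""
    candidate = dict(config)  # copy, so the current best is never mutated
    key = rng.choice(sorted(candidate))
    step = 4.0 if key == "batch" else 0.02
    candidate[key] += rng.choice([-step, step])
    return candidate


def autoresearch(start: Config, score: Callable[[Config], float],
                 rounds: int, seed: int = 0) -> Config:
    rng = random.Random(seed)
    best, baseline = start, score(start)
    for _ in range(rounds):
        candidate = propose(best, rng)
        result = score(candidate)      # measure against the baseline
        if result > baseline:          # keep only measured wins
            best, baseline = candidate, result
        # otherwise the attempt is discarded and `best` is unchanged
    return best


start = {"batch": 8.0, "lr": 0.3}
tuned = autoresearch(start, benchmark, rounds=200)
print(benchmark(start), "->", benchmark(tuned))
```

Because a candidate replaces the baseline only when its score is strictly higher, the final score can never be worse than the starting one, whatever the proposals look like.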

How autoresearch works

A loop that compounds with every run.

  1. Define what "better" means

    Pick the metric to move and the direction: higher accuracy, lower latency, cheaper tokens. A benchmark turns that into a number the loop can chase.

  2. Propose a change

    An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.

  3. Run the experiment

    The change runs in its own sandbox or worktree against the benchmark and is scored independently, so attempts never collide with each other or corrupt your main branch.

  4. Gate for safety

    Regression tests and invariants run as gates. An experiment that breaks something is discarded, even if its score went up.

  5. Keep or discard

    Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.

  6. Raise the baseline and repeat

    A kept win becomes the new baseline, and the loop runs again, exploring multiple directions and compounding verified progress.
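
The keep-or-discard decision in steps 4 to 6 can be sketched as follows. This is a hedged illustration under stated assumptions, not EVO's API: `Ledger`, `judge`, the two gate functions, and the candidate fields (`correct`, `memory_mb`) are all hypothetical names invented for the example. The metric here is latency in milliseconds, so lower is better.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Gate = Callable[[dict], bool]


@dataclass
class Ledger:
    """Holds the current baseline plus a record of kept and discarded attempts."""
    baseline: float
    higher_is_better: bool = True
    kept: List[str] = field(default_factory=list)
    discarded: List[str] = field(default_factory=list)

    def improves(self, score: float) -> bool:
        if self.higher_is_better:
            return score > self.baseline
        return score < self.baseline

    def judge(self, name: str, candidate: dict, score: float, gates: List[Gate]) -> bool:
        # Gates run first: a broken change is discarded even if its score improved.
        failed = [gate.__name__ for gate in gates if not gate(candidate)]
        if failed:
            self.discarded.append(f"{name}: failed gates {failed}")
            return False
        if not self.improves(score):
            self.discarded.append(f"{name}: {score} did not beat baseline {self.baseline}")
            return False
        self.kept.append(name)
        self.baseline = score  # a kept win becomes the new baseline
        return True


def output_is_correct(candidate: dict) -> bool:
    return candidate["correct"]


def within_memory_budget(candidate: dict) -> bool:
    return candidate["memory_mb"] <= 512


ledger = Ledger(baseline=120.0, higher_is_better=False)  # latency in ms
gates = [output_is_correct, within_memory_budget]

ledger.judge("cache-lookups", {"correct": True, "memory_mb": 300}, 95.0, gates)
ledger.judge("skip-validation", {"correct": False, "memory_mb": 280}, 60.0, gates)
ledger.judge("bigger-buffer", {"correct": True, "memory_mb": 310}, 101.0, gates)
```

In this run only `cache-lookups` is kept. `skip-validation` is the fastest attempt but fails the correctness gate, and `bigger-buffer` passes the gates but is slower than the raised baseline of 95.0, so both end up in `ledger.discarded` with a reason attached.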

Autoresearch vs. a plain hill-climb

| Criterion | Plain autoresearch | EVO |
| --- | --- | --- |
| Search strategy | Greedy hill-climb on a single branch | Tree search: multiple directions fork from any committed win |
| Parallelism | One experiment at a time | Parallel subagents, each in its own isolated git worktree |
| Safety | No guardrails; a faster wrong answer can win | Gates discard any change that breaks a test, even if the score rises |
| Shared learning | Each attempt starts cold | Failure traces and discarded hypotheses are shared across agents |
| Observability | Read the logs yourself | Live dashboard tracks every experiment, result, and decision |
| Setup | Hand-build your own harness | Discover explores the repo and instruments the benchmark for you |
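
The search-strategy row can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not EVO's algorithm: the change names and scores are invented, and "committed win" is taken to mean any change that beats its own parent. It runs sequentially, whereas the comparison above describes parallel subagents.

```python
from typing import Dict, List

# Hypothetical search space: each node is a set of changes, the changes
# reachable from it, and its benchmark score (higher is better).
CHILDREN: Dict[str, List[str]] = {
    "root": ["inline-hot-path", "new-data-layout"],
    "inline-hot-path": ["inline-more"],
    "new-data-layout": ["layout+vectorize"],
    "inline-more": [],
    "layout+vectorize": [],
}
SCORE: Dict[str, float] = {
    "root": 1.00,
    "inline-hot-path": 1.30,
    "inline-more": 1.25,
    "new-data-layout": 1.10,
    "layout+vectorize": 1.80,
}


def greedy(start: str) -> str:
    """Single branch: move to the best improving child, stop when none improves."""
    current = start
    while True:
        better = [c for c in CHILDREN[current] if SCORE[c] > SCORE[current]]
        if not better:
            return current
        current = max(better, key=lambda c: SCORE[c])


def tree_search(start: str) -> str:
    """Every committed win stays on the frontier and can be forked later."""
    frontier, best = [start], start
    while frontier:
        parent = frontier.pop(0)
        for child in CHILDREN[parent]:
            if SCORE[child] > SCORE[parent]:  # a win relative to its parent
                frontier.append(child)        # committed, so it can be forked
                if SCORE[child] > SCORE[best]:
                    best = child
    return best


print(greedy("root"))       # inline-hot-path
print(tree_search("root"))  # layout+vectorize
```

The greedy climber commits to `inline-hot-path` because it is the bigger first step, then stalls when the follow-up change scores lower. The tree search also keeps the smaller win `new-data-layout` and forks from it, which is where the best result in this toy space lives.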

Frequently asked questions

Is autoresearch the same as AutoML?

No. AutoML automates model and hyperparameter selection for machine learning. Autoresearch is broader — it optimizes anything you can score with a benchmark, including code, agents, prompts, configs, and infrastructure, by running measured experiments in a loop.

Where did the term come from?

The idea was popularized by Andrej Karpathy's "autoresearch," where an LLM runs experiments autonomously to beat its own best score. EVO builds a structured version on top of that idea — adding tree search, parallel agents, shared state, and safety gates.

What can EVO optimize?

Anything with a measurable target: code performance, agent task success, model eval scores and cost, configuration and runtime settings, website A/B conversion, and outreach response rates.

Is it safe to run on my codebase?

Yes. Your main branch is never touched — setup and every experiment run in isolated copies of the repo. Changes that fail your tests or safety gates are discarded, and winning changes are kept as diffs you review before merging.
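
As a rough illustration of the isolation idea, the sketch below applies a change to a throwaway copy of a directory and returns a diff for review. It is a stand-in under stated assumptions: a plain directory copy substitutes for a git worktree or remote sandbox, and `run_isolated`, `raise_workers`, and `settings.py` are hypothetical names, not part of EVO.

```python
import difflib
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_isolated(repo: Path, change: Callable[[Path], None], target: str) -> str:
    """Apply `change` to a temporary copy of `repo` and return a unified diff
    of the file `target`. The original directory is only ever read."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "sandbox"
        shutil.copytree(repo, sandbox)
        change(sandbox)
        before = (repo / target).read_text().splitlines(keepends=True)
        after = (sandbox / target).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(before, after, f"a/{target}", f"b/{target}"))


with tempfile.TemporaryDirectory() as demo:
    repo = Path(demo) / "repo"
    repo.mkdir()
    (repo / "settings.py").write_text("WORKERS = 2\n")

    def raise_workers(root: Path) -> None:
        path = root / "settings.py"
        path.write_text(path.read_text().replace("WORKERS = 2", "WORKERS = 8"))

    diff = run_isolated(repo, raise_workers, "settings.py")
    untouched = (repo / "settings.py").read_text()

print(diff)
```

The original `settings.py` still reads `WORKERS = 2` after the run; the proposed change exists only as the returned diff, which a reviewer can inspect before anything is merged.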

Which coding agents does EVO work with?

EVO runs on Claude Code, Codex, Cursor, OpenClaw, Hermes, Opencode, and Pi. Experiments run locally in git worktrees or on remote sandboxes like Modal, E2B, Daytona, AWS, and Azure.

Is EVO open source?

Yes. The EVO plugin and CLI are open source under Apache-2.0, installable with two commands, and the project is citable with a DOI.
