Autoresearch, explained

What is autoresearch?

Autoresearch is an autonomous loop where an agent forms a hypothesis, runs an experiment, measures the result against a baseline, and keeps only the changes that actually prove out — then repeats. EVO turns that loop into infrastructure for your whole stack.


The short answer

Autoresearch replaces guesswork with measured iteration. Instead of asking one agent for one big change, an autoresearch system runs many small experiments in a loop, keeps the ones that measurably improve a metric, and throws away the rest — compounding real gains over time.

Hypothesize

An agent studies your system, reads what has failed before, and proposes a concrete change to try. The goal is many small, measured attempts rather than one big rewrite.

Measure against a baseline

Every attempt is scored by a benchmark you define — speed, accuracy, cost, eval score — so a result is a number, not an opinion.

Keep only verified wins

Changes that beat the baseline and pass your safety checks are kept. Everything else is discarded. The baseline rises, and the loop runs again.
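The keep-or-discard rule above can be written down in a few lines. This is a minimal sketch, not EVO's code: the `Metric` class, the `keep` function, and the `p95_latency_ms` metric name are all hypothetical names chosen for illustration. It assumes a single scalar score and a strict improvement over the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A benchmark target: a name plus the direction that counts as 'better'."""
    name: str
    higher_is_better: bool

    def improves(self, candidate: float, baseline: float) -> bool:
        # Strict comparison: a tie with the baseline is not a win.
        if self.higher_is_better:
            return candidate > baseline
        return candidate < baseline


def keep(metric: Metric, candidate: float, baseline: float, gates_passed: bool) -> bool:
    """Keep a change only if it beats the baseline AND clears every gate."""
    return gates_passed and metric.improves(candidate, baseline)


# Hypothetical example metric: lower latency is better.
latency = Metric(name="p95_latency_ms", higher_is_better=False)

faster_and_safe = keep(latency, candidate=180.0, baseline=200.0, gates_passed=True)
faster_but_broken = keep(latency, candidate=150.0, baseline=200.0, gates_passed=False)
print(faster_and_safe, faster_but_broken)
```

The second call is the case the text warns about: the score improved, but a gate failed, so the change is still discarded.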

How autoresearch works

A loop that compounds with every run.

1. Define what "better" means
Pick the metric to move and the direction: higher accuracy, lower latency, cheaper tokens. A benchmark turns that into a number the loop can chase.
2. Propose a change
An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.
3. Run the experiment
The change runs in its own sandbox or worktree against the benchmark, scored independently so attempts never collide or corrupt your main branch.
4. Gate for safety
Regression tests and invariants run as gates. An experiment that breaks something is discarded, even if its score went up.
5. Keep or discard
Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.
6. Raise the baseline and repeat
A kept win becomes the new baseline, and the loop runs again, exploring multiple directions and compounding verified progress.
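The six steps can be sketched as a single-branch loop over a toy system. Everything here is an assumption made for illustration: the system is a small config dictionary, the "agent" is a fixed list of proposed changes, the benchmark is a made-up formula where lower is better, and `run_loop` is a hypothetical name. It is not how EVO is implemented.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, int]


def run_loop(
    baseline: Config,
    proposals: Iterable[Config],
    benchmark: Callable[[Config], float],
    gates: List[Callable[[Config], bool]],
) -> Tuple[Config, float, List[str]]:
    """Greedy keep-or-discard loop. Lower benchmark score is better in this sketch."""
    best, best_score = dict(baseline), benchmark(baseline)
    log: List[str] = []
    for change in proposals:
        candidate = {**best, **change}   # step 2: one isolated change on the current baseline
        score = benchmark(candidate)     # step 3: score it
        if not all(gate(candidate) for gate in gates):   # step 4: gates veto regardless of score
            log.append(f"discard {change}: gate failed")
        elif score < best_score:         # step 5: keep only real improvements
            best, best_score = candidate, score          # step 6: the baseline rises
            log.append(f"keep {change}: score {score}")
        else:
            log.append(f"discard {change}: no improvement")
    return best, best_score, log


def toy_latency(cfg: Config) -> float:
    # Made-up benchmark: more workers help, every retry costs time.
    return 120.0 / cfg["workers"] + cfg["retries"] * 5.0


def must_retry(cfg: Config) -> bool:
    # Made-up invariant: the system must retry at least once.
    return cfg["retries"] >= 1


best, score, log = run_loop(
    baseline={"workers": 2, "retries": 3},
    proposals=[{"workers": 4}, {"retries": 0}, {"workers": 3}, {"retries": 1}],
    benchmark=toy_latency,
    gates=[must_retry],
)
for line in log:
    print(line)
print(best, score)
```

In this run the `{"retries": 0}` proposal scores better than the baseline at that point but violates the invariant, so it is discarded. The log plays the role of the recorded reasoning for each decision.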

Autoresearch vs. a plain hill-climb

Criterion	Plain autoresearch	EVO
Search strategy	Greedy hill-climb on a single branch	Tree search — multiple directions fork from any committed win
Parallelism	One experiment at a time	Parallel subagents, each in its own isolated git worktree
Safety	No guardrails — a faster wrong answer can win	Gates discard any change that breaks a test, even if the score rises
Shared learning	Each attempt starts cold	Failure traces and discarded hypotheses are shared across agents
Observability	Read the logs yourself	Live dashboard tracks every experiment, result, and decision
Setup	Hand-build your own harness	Discover explores the repo and instruments the benchmark for you
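The "Search strategy" row is easier to see with a toy example. The sketch below forks every committed win instead of following a single branch. It is an illustration under stated assumptions, not EVO's algorithm: the `Node` class, `expand`, `tree_search`, the score table, and the breadth-first expansion policy are all invented for this example, and parallelism is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Config = Dict[str, int]


@dataclass
class Node:
    """One committed win in the search tree."""
    config: Config
    score: float
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def expand(node: Node, changes: List[Config],
           benchmark: Callable[[Config], float],
           gate: Callable[[Config], bool]) -> List[Node]:
    """Fork one child per change; commit only children that pass the gate and beat this node."""
    committed = []
    for change in changes:
        cfg = {**node.config, **change}
        score = benchmark(cfg)
        if gate(cfg) and score < node.score:   # lower is better in this sketch
            child = Node(cfg, score, parent=node)
            node.children.append(child)
            committed.append(child)
    return committed


def tree_search(root_config: Config, changes: List[Config],
                benchmark: Callable[[Config], float],
                gate: Callable[[Config], bool], rounds: int) -> Node:
    """Expand every committed win each round and return the best node found."""
    root = Node(dict(root_config), benchmark(root_config))
    committed, frontier = [root], [root]
    for _ in range(rounds):
        frontier = [child for node in frontier
                    for child in expand(node, changes, benchmark, gate)]
        if not frontier:
            break
        committed.extend(frontier)
    return min(committed, key=lambda n: n.score)


# Made-up scores for (algo, cache) combinations; lower is better.
SCORES = {(0, 0): 100.0, (1, 0): 80.0, (2, 0): 90.0,
          (0, 1): 95.0, (1, 1): 70.0, (2, 1): 50.0}

best = tree_search(
    root_config={"algo": 0, "cache": 0},
    changes=[{"algo": 1}, {"algo": 2}, {"cache": 1}],
    benchmark=lambda c: SCORES[(c["algo"], c["cache"])],
    gate=lambda c: True,
    rounds=2,
)
print(best.config, best.score, best.parent.config)
```

Within two rounds, the best result here is reached by forking a first-round win that was not the top scorer at the time. A real system would also deduplicate configurations and bound how many nodes it expands, which this sketch does not do.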

Frequently asked questions

Is autoresearch the same as AutoML?

No. AutoML automates model and hyperparameter selection for machine learning. Autoresearch is broader — it optimizes anything you can score with a benchmark, including code, agents, prompts, configs, and infrastructure, by running measured experiments in a loop.

Where did the term come from?

The idea was popularized by Andrej Karpathy's "autoresearch," where an LLM runs experiments autonomously to beat its own best score. EVO builds a structured version on top of that idea — adding tree search, parallel agents, shared state, and safety gates.

What can EVO optimize?

Anything with a measurable target: code performance, agent task success, model eval scores and cost, configuration and runtime settings, website A/B conversion, and outreach response rates.

Is it safe to run on my codebase?

Yes. Your main branch is never touched — setup and every experiment run in isolated copies of the repo. Changes that fail your tests or safety gates are discarded, and winning changes are kept as diffs you review before merging.
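For local runs, one common way to get this kind of isolation is `git worktree`, which checks out a new branch in a separate directory while leaving the main checkout alone. The sketch below shows the general technique only. The function names, branch name, and paths are hypothetical, and nothing here is taken from EVO's source.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_cmd(path: Path, branch: str, base: str = "HEAD") -> List[str]:
    """argv that creates a new branch at `base` and checks it out in a separate directory."""
    return ["git", "worktree", "add", "-b", branch, str(path), base]


def worktree_remove_cmd(path: Path) -> List[str]:
    """argv that deletes the worktree directory, even if it holds uncommitted changes."""
    return ["git", "worktree", "remove", "--force", str(path)]


def run_in_worktree(repo: Path, path: Path, branch: str, benchmark_cmd: List[str]) -> int:
    """Run a benchmark command inside a throwaway worktree and return its exit code."""
    subprocess.run(worktree_add_cmd(path, branch), cwd=repo, check=True)
    try:
        # The experiment only ever sees the isolated copy at `path`.
        return subprocess.run(benchmark_cmd, cwd=path).returncode
    finally:
        # The directory is removed; the experiment branch itself is left for review.
        subprocess.run(worktree_remove_cmd(path), cwd=repo, check=True)


# Hypothetical names, shown without executing git:
print(worktree_add_cmd(Path("exp-1"), "exp/1"))
print(worktree_remove_cmd(Path("exp-1")))
```

Calling `run_in_worktree` requires git on the PATH and an existing repository. The two command builders are kept separate so the exact git invocations can be inspected without running anything.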

Which coding agents does EVO work with?

EVO runs on Claude Code, Codex, Cursor, OpenClaw, Hermes, Opencode, and Pi. Experiments run locally in git worktrees or on remote sandboxes like Modal, E2B, Daytona, AWS, and Azure.

Is EVO open source?

Yes. The EVO plugin and CLI are open source under Apache-2.0, installable with two commands, and the project is citable with a DOI.

Where this goes next

Product

Autoresearch as infrastructure

For teams who own an eval suite and hand-tune a complex harness. EVO runs the experiments in parallel and ships only measurable wins.

Claude Code

Autoresearch on Claude Code

EVO instruments your existing Claude Code prompts, tools, and subagent graphs as the search space — and proves which changes beat baseline.

Ship improvements. Not guesses.

Join teams turning every change into measurable, verified progress.
