Autoresearch

Autoresearch, as infrastructure for your stack

Your team already runs autoresearch by hand — proposing a harness tweak, running the eval, reading the scoreboard, keeping what wins. EVO turns that loop into infrastructure: agents propose changes, prove which ones beat your baseline, and ship only what's measurably better — across code, agents, models, configs, and prompts. Continuously, with a receipt for every change.


What autoresearch gives your team

If you own an eval suite and hand-tune a complex harness, autoresearch is the difference between iterating on intuition and iterating on evidence. EVO runs the experiments you don't have time to run — in parallel, gated by your safety checks — and only the changes that move your metric survive.

More experiments than you can run by hand

Parallel subagents explore many directions at once, each in its own isolated git worktree. You stop rationing experiments to whatever fits in a sprint.
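
As a rough illustration of what "one worktree per experiment" can look like, the sketch below builds the git commands for creating and cleaning up an isolated checkout. The `.experiments` directory, the `exp/<id>` branch naming, and the `worktree_commands` helper are assumptions made for this example, not EVO's actual layout; only `git worktree add` and `git worktree remove` are real git subcommands.

```python
from pathlib import Path
from typing import List, Tuple


def worktree_commands(repo: Path, experiment_id: str,
                      base_ref: str = "HEAD") -> Tuple[List[str], List[str]]:
    """Return (create, cleanup) git argv lists for one isolated experiment.

    Hypothetical layout: the worktree lives under <repo>/.experiments/<id>
    on a new branch named exp/<id>, forked from base_ref.
    """
    path = repo / ".experiments" / experiment_id
    branch = f"exp/{experiment_id}"
    create = ["git", "-C", str(repo), "worktree", "add",
              "-b", branch, str(path), base_ref]
    cleanup = ["git", "-C", str(repo), "worktree", "remove",
               "--force", str(path)]
    return create, cleanup


create, cleanup = worktree_commands(Path("repo"), "0042")
print(" ".join(create))
print(" ".join(cleanup))
```

Each argv list can be handed to `subprocess.run(..., check=True)`. Because every experiment gets its own directory and branch, parallel attempts do not touch each other's files or your main checkout.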

Every win is gated and provable

A change only lands if it beats your baseline and clears your regression tests and invariants. A faster wrong answer never wins.

A receipt for every change

Each kept improvement comes with the experiment, the score delta, and the discarded alternatives — so progress is auditable, not a black box.
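
A receipt like the one described above could be modeled as a small record. This is a minimal sketch under assumptions: the `Receipt` class and every field name are hypothetical and chosen for illustration, not taken from EVO's real schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Receipt:
    """Hypothetical audit record for one kept improvement."""
    experiment_id: str
    hypothesis: str
    baseline_score: float
    new_score: float
    gates_passed: List[str]
    discarded_alternatives: List[str] = field(default_factory=list)

    @property
    def score_delta(self) -> float:
        # Positive means the change scored above the baseline.
        return self.new_score - self.baseline_score

    def to_json(self) -> str:
        record = asdict(self)
        record["score_delta"] = self.score_delta
        return json.dumps(record, indent=2)


receipt = Receipt(
    experiment_id="exp-0042",
    hypothesis="Caching retrieval results improves accuracy",
    baseline_score=0.5,
    new_score=0.75,
    gates_passed=["regression-suite", "latency-invariant"],
    discarded_alternatives=["larger context window: no improvement"],
)
print(receipt.to_json())
```

Storing the discarded alternatives next to the winning change is what makes the record auditable: a reviewer can see what was tried, not just what was kept.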

How EVO runs autoresearch

The same loop you run manually — automated, parallelized, and safety-gated.

1. Define what "better" means
Pick the metric to move and the direction: higher accuracy, lower latency, lower token cost. A benchmark turns that into a number the loop can chase.
2. Propose a change
An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.
3. Run the experiment
Each change runs against the benchmark in its own sandbox or worktree and is scored independently, so attempts never collide with each other or corrupt your main branch.
4. Gate for safety
Regression tests and invariants run as gates. An experiment that breaks something is discarded, even if its score went up.
5. Keep or discard
Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.
6. Raise the baseline and repeat
A kept win becomes the new baseline, and the loop runs again, exploring multiple directions and compounding verified progress.
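
The six steps above can be sketched as a short loop. This is a simplified, sequential illustration under assumptions: `Candidate`, `run_loop`, and the `propose` callback are hypothetical names invented for this example, the scores are made up, and the parallelism and sandboxing described above are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    description: str
    score: float     # benchmark score measured for this change
    gates_ok: bool   # True only if every regression test and invariant passed


def run_loop(baseline: float,
             propose: Callable[[float, List[str]], Candidate],
             rounds: int,
             higher_is_better: bool = True) -> Tuple[float, List[str], List[str]]:
    """Keep a candidate only if it beats the baseline AND clears every gate."""
    kept: List[str] = []
    discarded: List[str] = []
    for _ in range(rounds):
        # The proposer sees the current baseline and past discards.
        cand = propose(baseline, discarded)
        improved = (cand.score > baseline if higher_is_better
                    else cand.score < baseline)
        if cand.gates_ok and improved:
            kept.append(cand.description)
            baseline = cand.score  # a kept win becomes the new baseline
        else:
            reason = "failed gate" if not cand.gates_ok else "no improvement"
            discarded.append(f"{cand.description}: {reason}")
    return baseline, kept, discarded


# Scripted proposals stand in for an agent; the numbers are illustrative.
scripted = iter([
    Candidate("cache retrieval results", 0.72, True),
    Candidate("skip validation step", 0.90, False),  # higher score, broken gate
    Candidate("shorter system prompt", 0.70, True),
])
final, kept, discarded = run_loop(0.65, lambda b, d: next(scripted), rounds=3)
print(final, kept, discarded)
```

Note that the second candidate has the highest score but is still discarded, because the gate check comes before the score comparison can count as a win.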

Hand-tuned harness vs. autoresearch as infrastructure

Criterion	Your DIY harness	EVO
Throughput	One experiment at a time, between other work	Parallel subagents, each in its own isolated git worktree
Search strategy	Greedy hill-climb on whatever you tried last	Tree search — multiple directions fork from any committed win
Safety	You remember to run the regression suite — usually	Gates discard any change that breaks a test, even if the score rises
Shared learning	Each attempt and dead-end lives in someone's head	Failure traces and discarded hypotheses are shared across agents
Scope	You tune the one layer you have time for	Compounds across code, agents, models, configs, prompts, and A/B tests
Setup	Hand-build and babysit your own harness	Discover explores the repo and instruments the benchmark for you
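
To make the "Search strategy" row concrete, here is a toy sketch of forking from more than one committed win. It is an assumption-laden illustration: `Node`, `tree_search`, the experiment names, and the scores are all hypothetical, and this simple breadth-first variant (fork every newly committed win in the next round) is only one of many possible tree-search policies, not a description of EVO's internals.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    score: float
    parent: Optional["Node"] = None


def tree_search(root: Node,
                expand: Callable[[Node], List[Node]],
                rounds: int) -> List[Node]:
    """Fork from every newly committed win; commit children that beat their parent."""
    committed = [root]
    frontier = [root]
    for _ in range(rounds):
        next_frontier = []
        for parent in frontier:
            for child in expand(parent):
                if child.score > parent.score:
                    child.parent = parent
                    committed.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return committed


# Illustrative experiment outcomes, keyed by the node they fork from.
outcomes = {
    "baseline": [("prompt-v2", 0.70), ("model-swap", 0.68)],
    "prompt-v2": [("prompt-v3", 0.69)],
    "model-swap": [("model-swap+cache", 0.74)],
}


def expand(node: Node) -> List[Node]:
    return [Node(name, score) for name, score in outcomes.get(node.name, [])]


committed = tree_search(Node("baseline", 0.65), expand, rounds=2)
best = max(committed, key=lambda n: n.score)

# Walk back through parents to show which branch produced the best result.
lineage = []
node: Optional[Node] = best
while node is not None:
    lineage.append(node.name)
    node = node.parent
print(best.name, best.score, lineage)
```

In this toy data, a greedy hill-climb that only continued from the best first-round result (`prompt-v2`) would stall at 0.70, while keeping the second branch alive reaches 0.74 through `model-swap`.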

Ship improvements.
Not guesses.

Join teams turning every change into measurable, verified progress.

Further reading

What is autoresearch?

Autoresearch on Claude Code
