Autoresearch

Autoresearch, as infrastructure for your stack

Your team already runs autoresearch by hand — proposing a harness tweak, running the eval, reading the scoreboard, keeping what wins. EVO turns that loop into infrastructure: agents propose changes, prove which ones beat your baseline, and ship only what's measurably better — across code, agents, models, configs, and prompts. Continuously, with a receipt for every change.

What autoresearch gives your team

If you own an eval suite and hand-tune a complex harness, autoresearch is the difference between iterating on intuition and iterating on evidence. EVO runs the experiments you don't have time to run — in parallel, gated by your safety checks — and only the changes that move your metric survive.

01

More experiments than you can run by hand

Parallel subagents explore many directions at once, each in its own isolated git worktree. You stop rationing experiments to whatever fits in a sprint.
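To make the isolation idea concrete, here is a minimal sketch of giving each experiment its own worktree with the standard `git worktree add` command. It is an illustration, not EVO's implementation: the helper names, the `exp/<id>` branch scheme, and the sibling `worktrees` directory are all assumptions.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, experiment_id: str, base_ref: str = "HEAD") -> list[str]:
    """Build the git command that creates an isolated worktree for one experiment.

    The branch name and the sibling "worktrees" directory are hypothetical choices.
    """
    branch = f"exp/{experiment_id}"
    path = repo.parent / "worktrees" / experiment_id
    # git worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base_ref]


def create_worktree(repo: Path, experiment_id: str, base_ref: str = "HEAD") -> None:
    """Run the command; raises CalledProcessError if git rejects it."""
    subprocess.run(worktree_command(repo, experiment_id, base_ref), check=True)
```

Because every experiment gets its own branch and directory, parallel attempts do not share a working tree with each other or with your main checkout.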

02

Every win is gated and provable

A change only lands if it beats your baseline and clears your regression tests and invariants. A faster wrong answer never wins.

03

A receipt for every change

Each kept improvement comes with the experiment, the score delta, and the discarded alternatives — so progress is auditable, not a black box.
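One way to picture such a receipt is as a small immutable record. This is a sketch under assumptions: the class and every field name are hypothetical, chosen to mirror the three items named above (the experiment, the score delta, and the discarded alternatives).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """Audit record for one kept change. All field names are hypothetical."""
    experiment_id: str
    hypothesis: str
    baseline_score: float
    new_score: float
    discarded_alternatives: tuple[str, ...] = ()

    @property
    def score_delta(self) -> float:
        # Positive when the new score is numerically higher than the baseline.
        return self.new_score - self.baseline_score
```

Freezing the record means a receipt cannot be edited after the fact, which is what makes it useful for an audit.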

How EVO runs autoresearch

The same loop you run manually — automated, parallelized, and safety-gated.

  1. Define what "better" means

    Pick the metric to move and the direction: higher accuracy, lower latency, cheaper tokens. A benchmark turns that into a number the loop can chase.

  2. Propose a change

    An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.

  3. Run the experiment

    The change runs against the benchmark in its own sandbox or worktree and is scored independently, so attempts never collide with each other or corrupt your main branch.

  4. Gate for safety

    Regression tests and invariants run as gates. An experiment that breaks something is discarded, even if its score went up.

  5. Keep or discard

    Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.

  6. Raise the baseline and repeat

    A kept win becomes the new baseline, and the loop runs again, exploring multiple directions and compounding verified progress.
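The keep-or-discard logic in the steps above can be sketched in a few lines. This is a simplified illustration, not EVO's code: candidates arrive already scored and gated (in a real loop those values would come from the benchmark run and the regression suite), and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Candidate:
    name: str
    score: float        # benchmark result for this change
    gates_passed: bool  # outcome of regression tests and invariants


def is_better(score: float, baseline: float, higher_is_better: bool) -> bool:
    """Step 1: the metric direction decides what counts as an improvement."""
    return score > baseline if higher_is_better else score < baseline


def run_loop(baseline: float, candidates: Iterable[Candidate], higher_is_better: bool = True):
    """Steps 4 to 6: gate, keep or discard, then raise the baseline."""
    kept, discarded = [], []
    for c in candidates:
        if not c.gates_passed:
            # A gate failure discards the change even if its score went up.
            discarded.append((c.name, "failed a gate"))
        elif not is_better(c.score, baseline, higher_is_better):
            discarded.append((c.name, "did not beat baseline"))
        else:
            kept.append(c.name)
            baseline = c.score  # the kept win becomes the new baseline
    return baseline, kept, discarded
```

Note that the gate check comes before the score comparison, so a high-scoring change that breaks a test never reaches the baseline. For a latency metric, pass `higher_is_better=False`.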

Hand-tuned harness vs. autoresearch as infrastructure

| Criterion | Your DIY harness | EVO |
| --- | --- | --- |
| Throughput | One experiment at a time, between other work | Parallel subagents, each in its own isolated git worktree |
| Search strategy | Greedy hill-climb on whatever you tried last | Tree search: multiple directions fork from any committed win |
| Safety | You remember to run the regression suite, usually | Gates discard any change that breaks a test, even if the score rises |
| Shared learning | Each attempt and dead-end lives in someone's head | Failure traces and discarded hypotheses are shared across agents |
| Scope | You tune the one layer you have time for | Compounds across code, agents, models, configs, prompts, and A/B tests |
| Setup | Hand-build and babysit your own harness | Discover explores the repo and instruments the benchmark for you |

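To illustrate the difference between the two search strategies in the comparison above, here is a toy sketch of tree search in which every committed win stays available as a fork point. It is an assumption-laden illustration, not EVO's algorithm: `Node`, `tree_search`, and the `propose` callback are hypothetical names, and higher scores are assumed to be better.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    name: str
    score: float
    children: list["Node"] = field(default_factory=list)


def tree_search(root: Node, propose: Callable[[Node], list[Node]], rounds: int) -> Node:
    """Fork from every committed win, not only from the most recent one."""
    committed = [root]
    for _ in range(rounds):
        new_wins = []
        for parent in committed:
            for child in propose(parent):
                # A child is committed only if it beats the node it forked from.
                if child.score > parent.score:
                    parent.children.append(child)
                    new_wins.append(child)
        committed.extend(new_wins)
    return max(committed, key=lambda n: n.score)
```

A greedy hill-climb would keep only the latest win in `committed`. Keeping all of them lets a later round return to an older win and try a different direction from it.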
Further reading

Ship improvements.
Not guesses.

Join teams turning every change into measurable, verified progress.