More experiments than you can run by hand
Parallel subagents explore many directions at once, each in its own isolated git worktree. You stop rationing experiments to whatever fits in a sprint.
Your team already runs autoresearch by hand — proposing a harness tweak, running the eval, reading the scoreboard, keeping what wins. EVO turns that loop into infrastructure: agents propose changes, prove which ones beat your baseline, and ship only what's measurably better — across code, agents, models, configs, and prompts. Continuously, with a receipt for every change.
If you own an eval suite and hand-tune a complex harness, autoresearch is the difference between iterating on intuition and iterating on evidence. EVO runs the experiments you don't have time to run — in parallel, gated by your safety checks — and only the changes that move your metric survive.
A change only lands if it beats your baseline and clears your regression tests and invariants. A faster wrong answer never wins.
Each kept improvement comes with the experiment, the score delta, and the discarded alternatives — so progress is auditable, not a black box.
The same loop you run manually — automated, parallelized, and safety-gated.
Pick the metric to move and the direction — higher accuracy, lower latency, cheaper tokens. A benchmark turns that into a number the loop can chase.
An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.
The change runs in its own sandbox or worktree against the benchmark, scored independently so attempts never collide or corrupt your main branch.
Regression tests and invariants run as gates. An experiment that breaks something is discarded — even if its score went up.
Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.
A kept win becomes the new baseline, and the loop runs again — exploring multiple directions and compounding verified progress.
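The steps above can be sketched in a few lines of Python. This is a minimal illustration, not EVO's actual implementation: `Receipt`, `improves`, and `run_loop` are hypothetical names, candidates are evaluated one after another rather than in parallel worktrees, and the benchmark and gates are plain callables you would supply.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Receipt:
    """Record kept for every experiment, whether it won or was discarded."""
    hypothesis: str
    score: float
    delta: float   # score minus the baseline at the time of the experiment
    kept: bool
    reason: str


def improves(score: float, baseline: float, higher_is_better: bool) -> bool:
    """Direction-aware comparison: accuracy goes up, latency or cost goes down."""
    return score > baseline if higher_is_better else score < baseline


def run_loop(
    baseline: float,
    candidates: List[Tuple[str, Any]],       # (hypothesis, change) pairs
    benchmark: Callable[[Any], float],       # scores one change
    gates: List[Callable[[Any], bool]],      # regression tests and invariants
    higher_is_better: bool = True,
) -> Tuple[float, List[Receipt]]:
    receipts: List[Receipt] = []
    for hypothesis, change in candidates:
        score = benchmark(change)
        delta = score - baseline
        if not all(gate(change) for gate in gates):
            # A gate failure discards the change even if the score went up.
            receipts.append(Receipt(hypothesis, score, delta, False, "failed a gate"))
        elif not improves(score, baseline, higher_is_better):
            receipts.append(Receipt(hypothesis, score, delta, False, "did not beat baseline"))
        else:
            receipts.append(Receipt(hypothesis, score, delta, True, "beat baseline, cleared gates"))
            baseline = score  # a kept win becomes the new baseline
    return baseline, receipts


# Toy data (made up for illustration): each "change" carries its own results.
demo_candidates = [
    ("cache tool results", {"accuracy": 0.82, "tests_pass": True}),
    ("skip input validation", {"accuracy": 0.90, "tests_pass": False}),
    ("reorder prompt sections", {"accuracy": 0.80, "tests_pass": True}),
]
final_baseline, demo_receipts = run_loop(
    0.80,
    demo_candidates,
    benchmark=lambda c: c["accuracy"],
    gates=[lambda c: c["tests_pass"]],
)
```

In this toy run only the first candidate is kept. The second scores highest but fails a gate, and the third no longer beats the raised baseline.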
| Criterion | Your DIY harness | EVO |
|---|---|---|
| Throughput | One experiment at a time, between other work | Parallel subagents, each in its own isolated git worktree |
| Search strategy | Greedy hill-climb on whatever you tried last | Tree search — multiple directions fork from any committed win |
| Safety | You remember to run the regression suite — usually | Gates discard any change that breaks a test, even if the score rises |
| Shared learning | Each attempt and dead-end lives in someone's head | Failure traces and discarded hypotheses are shared across agents |
| Scope | You tune the one layer you have time for | Compounds across code, agents, models, configs, prompts, and A/B tests |
| Setup | Hand-build and babysit your own harness | Discover explores the repo and instruments the benchmark for you |
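To make the "Search strategy" row concrete, here is a rough sketch of a search that forks from every committed win instead of following only the latest one. It is an assumption-laden toy, not EVO's algorithm: `Node`, `path`, and `tree_search` are hypothetical names, scores are integer points, and each idea adds a fixed gain regardless of what came before.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A committed win: one change applied on top of its parent."""
    name: str
    score: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def path(node: Node) -> List[str]:
    """Names of the changes from the root baseline down to this node."""
    names: List[str] = []
    while node.parent is not None:
        names.append(node.name)
        node = node.parent
    return list(reversed(names))


def tree_search(root: Node, gains: Dict[str, int], depth: int) -> List[Node]:
    """Expand level by level; every committed win can be forked again."""
    committed = [root]
    frontier = [root]
    for _ in range(depth):
        next_frontier: List[Node] = []
        for parent in frontier:
            used = set(path(parent))
            for idea, gain in gains.items():
                if idea in used:
                    continue  # this idea is already applied on this branch
                score = parent.score + gain  # toy evaluation, assumed additive
                if score > parent.score:     # commit only if it beats its parent
                    child = Node(idea, score, parent)
                    parent.children.append(child)
                    committed.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return committed


# Toy ideas with made-up gains in score points; "b" is a regression.
demo_gains = {"a": 2, "b": -1, "c": 3}
demo_wins = tree_search(Node("baseline", 80), demo_gains, depth=2)
demo_best = max(demo_wins, key=lambda n: n.score)
```

A greedy hill-climb would keep a single line of changes. Here both `a` and `c` stay committed after the first level, and each is extended independently at the second.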
The category, defined: an autonomous loop that proposes changes, scores them against a baseline, and keeps only what measurably wins.
Claude Code: EVO instruments your existing Claude Code prompts, tools, and subagent graphs as the search space, and proves which changes beat baseline.
Join teams turning every change into measurable, verified progress.