Autoresearch on the workflows you already run
EVO instruments your existing Claude Code agents, prompts, and subagent graphs as the search space — no rewrite, no separate harness. The workflow you ship is the thing being optimized.
If your team builds dynamic agent workflows on Claude Code, you already tune them by hand — rewriting prompts, reordering subagents, swapping tools, then eyeballing whether it got better. EVO turns that into autoresearch: agents propose changes to your Claude Code workflows, prove which ones beat your baseline, and keep only what's measurably better — gated by your safety checks, with a receipt for every change.
Claude Code makes it easy to build dynamic, multi-step agent workflows. It does not tell you which version is actually better. EVO runs that experiment for you — porting autoresearch onto the workflows you already ship — so prompt, tool, and orchestration changes compete on evidence instead of intuition.
A proposed change to a prompt or orchestration step only survives if it beats your baseline on your eval and clears your regression checks. A faster wrong answer never ships.
Each kept improvement comes with the experiment, the score delta, and the discarded alternatives — so you can hand a prospect a clear before/after, not a vibe.
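As a rough illustration, the receipt described above could be modeled as a small record. This is a sketch under assumptions: the `Receipt` class and its field names are hypothetical, chosen to mirror the three items named in the text (experiment, score delta, discarded alternatives), and do not describe EVO's actual output format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Receipt:
    """Hypothetical record for one kept change (field names are assumptions)."""
    experiment: str                 # what was changed, in plain words
    baseline_score: float           # eval score before the change
    new_score: float                # eval score after the change
    discarded_alternatives: List[str] = field(default_factory=list)

    @property
    def score_delta(self) -> float:
        # Positive means the change beat the baseline (higher is better here).
        return self.new_score - self.baseline_score

    def to_json(self) -> str:
        # Serialize the stored fields plus the derived delta for sharing.
        record = asdict(self)
        record["score_delta"] = round(self.score_delta, 4)
        return json.dumps(record, indent=2)


receipt = Receipt(
    experiment="tightened planner prompt",
    baseline_score=0.75,
    new_score=0.82,
    discarded_alternatives=["swapped search tool"],
)
print(receipt.to_json())
```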
The same loop your team runs by hand on a Claude Code workflow — automated, parallelized, and safety-gated.
Connect the Claude Code workflow you want to improve and the eval that defines "better" — task success, latency, or token cost. EVO maps prompts, tools, and subagent steps as the search space.
An agent reads the workflow, prior runs, and discarded ideas, then makes one concrete change — a tightened prompt, a reordered subagent, a swapped tool — to test against the baseline.
Each candidate runs against your eval in its own isolated git worktree and is scored independently, so parallel candidates never collide with each other or touch your main branch.
Your regression tests and invariants run as gates. A change that breaks a tool call or a guardrail is discarded — even if its score went up.
Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded for the next round.
A kept win becomes the new baseline, and the loop forks again — compounding verified improvements across your Claude Code workflows continuously.
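The keep-or-discard step in the loop above can be sketched in a few lines. This is a minimal illustration, assuming a higher-is-better eval score; `Candidate`, `triage`, and `tool_calls_intact` are hypothetical names, not part of EVO's API, and in a real run each candidate would already have been scored inside its own git worktree before reaching this step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    description: str        # e.g. "tightened planner prompt"
    score: float            # eval score, higher is better (assumption)
    breaks_tool_call: bool  # stand-in for a real regression test result


Gate = Callable[[Candidate], bool]


def tool_calls_intact(candidate: Candidate) -> bool:
    """Example gate: passes only if no tool call regressed."""
    return not candidate.breaks_tool_call


def triage(
    baseline_score: float,
    candidates: List[Candidate],
    gates: List[Gate],
) -> Tuple[List[Candidate], List[Tuple[str, str]]]:
    """Split candidates into kept changes and (description, reason) discards."""
    kept, discarded = [], []
    for cand in candidates:
        failed = [gate.__name__ for gate in gates if not gate(cand)]
        if failed:
            # A failed gate discards the change even if its score went up.
            discarded.append((cand.description, "failed gate: " + ", ".join(failed)))
        elif cand.score <= baseline_score:
            discarded.append((cand.description, "did not beat baseline"))
        else:
            kept.append(cand)
    return kept, discarded


candidates = [
    Candidate("tightened prompt", 0.82, False),
    Candidate("swapped tool", 0.90, True),       # highest score, but breaks a tool call
    Candidate("reordered subagent", 0.70, False),
]
kept, discarded = triage(0.75, candidates, [tool_calls_intact])
```

Gates are checked before the score comparison, so the highest-scoring candidate here is still discarded, and each discard carries a reason that the next round can read.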
| Criterion | Hand-tuning in Claude Code | EVO autoresearch |
|---|---|---|
| How changes are tested | Edit a prompt or step, rerun, eyeball the difference | Every change runs against your eval and is scored against the baseline |
| Throughput | One workflow variant at a time, between other work | Parallel subagents, each in its own isolated git worktree |
| Search strategy | Greedy tweaks on whatever you tried last | Tree search — multiple directions fork from any committed win |
| Safety | You remember to recheck guardrails — usually | Gates discard any change that breaks a test, even if the score rises |
| Scope | You tune the one prompt or step you have time for | Compounds across prompts, tools, subagent graphs, and configs |
| Evidence to share | "It feels better" — no audit trail | A receipt per change: experiment, score delta, discarded alternatives |
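
The difference between greedy tweaks and tree search in the table above can be shown with a small data structure. This is a sketch only: `Node`, `fork`, `committed_wins`, and `lineage` are hypothetical names, the scores are made up for illustration, and nothing here describes how EVO stores its search tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; avoids parent/child cycles
class Node:
    name: str
    score: float
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def fork(self, name: str, score: float) -> Optional["Node"]:
        """Commit a child only if it beats this node's score; else return None."""
        if score <= self.score:
            return None
        child = Node(name, score, parent=self)
        self.children.append(child)
        return child


def committed_wins(root: Node) -> List[Node]:
    """Every committed node, depth-first; any of them is a valid fork point."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(reversed(node.children))
    return found


def lineage(node: Optional[Node]) -> List[str]:
    """Names from the root baseline down to the given node."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return list(reversed(names))


root = Node("baseline", 0.70)
a = root.fork("tightened prompt", 0.78)
b = root.fork("swapped tool", 0.74)
c = b.fork("reordered subagent", 0.81)  # best result grows from the weaker branch
print(lineage(c))
```

A greedy pass would have continued only from the 0.78 branch; keeping every committed win as a fork point is what lets the 0.74 branch lead to the best result in this example.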
The category, defined: an autonomous loop that proposes changes, scores them against a baseline, and keeps only what measurably wins.
For teams who own an eval suite and hand-tune a complex harness. EVO runs the experiments in parallel and ships only measurable wins.
Join teams turning every change into measurable, verified progress.