EVO × Claude Code

EVO ports autoresearch onto Claude Code dynamic workflows

If your team builds dynamic agent workflows on Claude Code, you already tune them by hand: rewriting prompts, reordering subagents, swapping tools, then eyeballing whether the results improved. EVO turns that into autoresearch: agents propose changes to your Claude Code workflows, prove which ones beat your baseline, and keep only what's measurably better, gated by your safety checks, with a receipt for every change.

What EVO gives your Claude Code workflows

Claude Code makes it easy to build dynamic, multi-step agent workflows. It does not tell you which version is actually better. EVO runs that experiment for you — porting autoresearch onto the workflows you already ship — so prompt, tool, and orchestration changes compete on evidence instead of intuition.

01

Autoresearch on the workflows you already run

EVO instruments your existing Claude Code agents, prompts, and subagent graphs as the search space — no rewrite, no separate harness. The workflow you ship is the thing being optimized.

02

Only measurable wins land

A proposed change to a prompt or orchestration step only survives if it beats your baseline on your eval and clears your regression checks. A faster wrong answer never ships.

03

A receipt for every change

Each kept improvement comes with the experiment, the score delta, and the discarded alternatives — so you can hand a prospect a clear before/after, not a vibe.
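
To make the receipt concrete, here is a minimal sketch of what such a record could look like. It is illustrative only: the `Receipt` class, its field names, and the sample values are assumptions made for this example, not EVO's actual schema or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """Hypothetical record of one kept change (all names are assumptions)."""

    experiment: str                    # what was changed and tested
    baseline_score: float              # eval score before the change
    candidate_score: float             # eval score after the change
    discarded: tuple[str, ...] = ()    # alternatives that were tried and dropped

    @property
    def score_delta(self) -> float:
        # Positive means the candidate beat the baseline.
        return self.candidate_score - self.baseline_score

    def summary(self) -> str:
        return (
            f"{self.experiment}: {self.baseline_score:.2f} -> "
            f"{self.candidate_score:.2f} ({self.score_delta:+.2f}), "
            f"{len(self.discarded)} alternatives discarded"
        )


# Made-up values, purely to show the shape of a before/after.
receipt = Receipt(
    experiment="tighten planner prompt",
    baseline_score=0.71,
    candidate_score=0.78,
    discarded=("reorder subagents", "swap search tool"),
)
print(receipt.summary())
```

Keeping the delta as a derived property rather than a stored field means the before and after scores stay the single source of truth.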

How EVO runs autoresearch on Claude Code

The same loop your team runs by hand on a Claude Code workflow — automated, parallelized, and safety-gated.

  1. Point EVO at your workflow

    Connect the Claude Code workflow you want to improve and the eval that defines "better" — task success, latency, or token cost. EVO maps prompts, tools, and subagent steps as the search space.

  2. Propose a change

    An agent reads the workflow, prior runs, and discarded ideas, then makes one concrete change — a tightened prompt, a reordered subagent, a swapped tool — to test against the baseline.

  3. Run it in parallel

    Each candidate runs against your eval in its own isolated git worktree and is scored independently, so parallel candidates never collide with each other or corrupt your main branch.

  4. Gate for safety

    Your regression tests and invariants run as gates. A change that breaks a tool call or a guardrail is discarded — even if its score went up.

  5. Keep or discard

    Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded for the next round.

  6. Raise the baseline and repeat

    A kept win becomes the new baseline, and the loop forks again — compounding verified improvements across your Claude Code workflows continuously.
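
The steps above can be sketched as a small loop. This is a toy model under stated assumptions, not EVO's implementation: `Workflow`, `propose`, `evaluate`, `gate_guardrail`, `run_round`, and `optimize` are hypothetical names, and the eval is a stand-in formula. Candidates run sequentially here for simplicity, and choosing the best survivor as the next baseline is also an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Workflow:
    """Toy stand-in; a real workflow would hold prompts, tools, and steps."""

    prompt_quality: float  # hypothetical knob that the eval rewards
    guardrail_on: bool     # hypothetical invariant protected by a gate


def propose(base: Workflow, rng: random.Random) -> Workflow:
    # One change per candidate; some candidates accidentally drop the guardrail.
    return Workflow(
        prompt_quality=base.prompt_quality + rng.uniform(-0.1, 0.1),
        guardrail_on=base.guardrail_on and rng.random() > 0.2,
    )


def evaluate(wf: Workflow) -> float:
    # Toy eval: dropping the guardrail inflates the score, which is
    # exactly the kind of "win" a gate has to reject.
    return wf.prompt_quality + (0.0 if wf.guardrail_on else 0.5)


def gate_guardrail(wf: Workflow) -> bool:
    return wf.guardrail_on


def run_round(baseline, gates, rng, n_candidates=4):
    """Propose, score, gate, then keep or discard. Returns (baseline, discarded)."""
    baseline_score = evaluate(baseline)
    survivors, discarded = [], []
    for _ in range(n_candidates):
        candidate = propose(baseline, rng)
        score = evaluate(candidate)
        if not all(gate(candidate) for gate in gates):
            discarded.append((candidate, "failed a gate"))
        elif score <= baseline_score:
            discarded.append((candidate, "did not beat baseline"))
        else:
            survivors.append((score, candidate))
    if survivors:
        # Assumption: the best surviving candidate becomes the new baseline.
        baseline = max(survivors, key=lambda pair: pair[0])[1]
    return baseline, discarded


def optimize(baseline: Workflow, rounds: int, seed: int = 0) -> Workflow:
    rng = random.Random(seed)
    gates: list[Callable[[Workflow], bool]] = [gate_guardrail]
    for _ in range(rounds):
        baseline, _ = run_round(baseline, gates, rng)
    return baseline


start = Workflow(prompt_quality=0.5, guardrail_on=True)
final = optimize(start, rounds=5)
print(final.guardrail_on, evaluate(final) >= evaluate(start))  # True True
```

The gate check comes before the score comparison on purpose: a candidate that scores higher by breaking the guardrail is discarded without ever being compared to the baseline.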

Hand-tuning a Claude Code workflow vs. autoresearch on it

Criterion | Hand-tuning in Claude Code | EVO autoresearch
How changes are tested | Edit a prompt or step, rerun, eyeball the difference | Every change runs against your eval and is scored against the baseline
Throughput | One workflow variant at a time, between other work | Parallel subagents, each in its own isolated git worktree
Search strategy | Greedy tweaks on whatever you tried last | Tree search: multiple directions fork from any committed win
Safety | You remember to recheck guardrails, usually | Gates discard any change that breaks a test, even if the score rises
Scope | You tune the one prompt or step you have time for | Compounds across prompts, tools, subagent graphs, and configs
Evidence to share | "It feels better," with no audit trail | A receipt per change: experiment, score delta, discarded alternatives
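
The difference between greedy tweaks and tree search in the comparison above can be illustrated with a short sketch. Everything here is hypothetical: `Node`, `tree_search`, and `lineage` are names invented for the example, the score change is random noise standing in for a real eval run, and picking the parent uniformly at random is an assumed policy, not necessarily the one EVO uses.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A committed win; `parent` is the win it was forked from."""

    score: float
    parent: Optional["Node"] = None


def tree_search(root_score: float, steps: int, seed: int = 0) -> list[Node]:
    rng = random.Random(seed)
    committed = [Node(root_score)]
    for _ in range(steps):
        # Fork from ANY committed win, not only the most recent one.
        parent = rng.choice(committed)
        # Toy stand-in for applying a change and running the eval.
        child_score = parent.score + rng.uniform(-0.05, 0.05)
        if child_score > parent.score:
            committed.append(Node(child_score, parent))
    return committed


def lineage(node: Node) -> list[float]:
    """Scores along the path from the root to `node`."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current.score)
        current = current.parent
    return path[::-1]


committed = tree_search(root_score=0.70, steps=20)
best = max(committed, key=lambda n: n.score)
print(lineage(best))
```

A greedy tuner corresponds to always choosing the last committed node as the parent. Allowing any committed node to be forked lets the search go back to an earlier win when the latest branch stalls.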

Further reading

Ship improvements.
Not guesses.

Join teams turning every change into
measurable, verified progress.