Autoresearch is an autonomous loop: an agent forms a hypothesis, runs an experiment, measures the result against a baseline, keeps only the changes that beat it, and repeats. EVO turns that loop into infrastructure for your whole stack.
Autoresearch replaces guesswork with measured iteration. Instead of asking one agent for one big change, an autoresearch system runs many small experiments in a loop, keeps the ones that measurably improve a metric, and throws away the rest — compounding real gains over time.
An agent studies your system, reads what's failed before, and proposes a concrete change to try — not one big rewrite, many small measured attempts.
Every attempt is scored by a benchmark you define — speed, accuracy, cost, eval score — so a result is a number, not an opinion.
Changes that beat the baseline and pass your safety checks are kept. Everything else is discarded. The baseline rises, and the loop runs again.
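The three steps above (hypothesize, measure, keep) can be sketched as a small greedy loop. This is only an illustration under stated assumptions, not EVO's implementation: `benchmark`, `propose`, and `run_loop` are hypothetical names, the "system" is reduced to a single number, and a random nudge stands in for the agent's hypothesis.

```python
import random
from typing import List, Tuple


def benchmark(x: float) -> float:
    """Toy benchmark (an assumption): higher is better, with a peak at x = 3."""
    return -(x - 3.0) ** 2


def propose(current: float, rng: random.Random) -> float:
    """Stand-in for the agent's hypothesis: one small change to the current best."""
    return current + rng.uniform(-0.5, 0.5)


def run_loop(start: float, rounds: int, seed: int = 0) -> Tuple[float, float, List[tuple]]:
    rng = random.Random(seed)
    best, best_score = start, benchmark(start)  # the initial baseline
    history = []  # one (candidate, score, kept) record per attempt
    for _ in range(rounds):
        candidate = propose(best, rng)   # hypothesize
        score = benchmark(candidate)     # measure
        kept = score > best_score        # keep only if it beats the baseline
        history.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score  # the baseline rises
    return best, best_score, history
```

Calling `run_loop(0.0, 200)` returns the best candidate found, its score, and the full attempt history, so discarded attempts stay inspectable rather than vanishing.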
A loop that compounds with every run.
Pick the metric to move and the direction — higher accuracy, lower latency, cheaper tokens. A benchmark turns that into a number the loop can chase.
An agent reads the codebase, prior traces, and discarded ideas, then forms a hypothesis and makes one concrete, isolated change to test.
The change runs in its own sandbox or worktree against the benchmark, scored independently so attempts never collide or corrupt your main branch.
Regression tests and invariants run as gates. An experiment that breaks something is discarded — even if its score went up.
Only changes that beat the baseline and clear every gate survive. The rest are thrown away, with the reasoning recorded.
A kept win becomes the new baseline, and the loop runs again — exploring multiple directions and compounding verified progress.
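The keep-or-discard decision in the steps above could look roughly like the sketch below. It assumes a benchmark is just a name plus a direction, and that gates are zero-argument checks returning `True` on pass; `Benchmark`, `Decision`, and `decide` are hypothetical names chosen for illustration, not EVO's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Benchmark:
    name: str
    higher_is_better: bool  # the direction you chose for the metric

    def beats(self, score: float, baseline: float) -> bool:
        if self.higher_is_better:
            return score > baseline
        return score < baseline


@dataclass
class Decision:
    kept: bool
    reason: str  # recorded for kept and discarded attempts alike


def decide(bench: Benchmark, baseline: float, score: float,
           gates: Dict[str, Callable[[], bool]]) -> Decision:
    # Gates run first: a broken experiment is discarded even if its score improved.
    failed = [name for name, check in gates.items() if not check()]
    if failed:
        return Decision(False, "gate failed: " + ", ".join(failed))
    if not bench.beats(score, baseline):
        return Decision(False, f"{bench.name}={score} did not beat baseline {baseline}")
    return Decision(True, f"{bench.name} moved from {baseline} to {score}")


latency = Benchmark("latency_ms", higher_is_better=False)
# Faster, but a regression test fails, so the change is thrown away.
fast_but_broken = decide(latency, 120.0, 95.0, {"regression_tests": lambda: False})
# A smaller win that clears every gate becomes the new baseline.
safe_win = decide(latency, 120.0, 110.0, {"regression_tests": lambda: True})
```

Checking gates before the score comparison is a design choice in this sketch: it guarantees that a discarded attempt's recorded reason names the broken invariant rather than the score.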
| Criterion | Plain autoresearch | EVO |
|---|---|---|
| Search strategy | Greedy hill-climb on a single branch | Tree search — multiple directions fork from any committed win |
| Parallelism | One experiment at a time | Parallel subagents, each in its own isolated git worktree |
| Safety | No guardrails — a faster wrong answer can win | Gates discard any change that breaks a test, even if the score rises |
| Shared learning | Each attempt starts cold | Failure traces and discarded hypotheses are shared across agents |
| Observability | Read the logs yourself | Live dashboard tracks every experiment, result, and decision |
| Setup | Hand-build your own harness | Discover explores the repo and instruments the benchmark for you |
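To make the "Search strategy" row concrete, here is a minimal sketch of forking from any committed win instead of only the latest one. It assumes higher scores are better and that a child is committed only if it beats the node it forked from; `Node` and `ExperimentTree` are hypothetical names, and EVO's actual data model may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    id: int
    parent: Optional[int]
    score: float


class ExperimentTree:
    def __init__(self, root_score: float) -> None:
        # Node 0 is the starting baseline.
        self.nodes: Dict[int, Node] = {0: Node(0, None, root_score)}

    def try_fork(self, parent_id: int, score: float) -> Optional[Node]:
        """Commit a child only if it beats the node it forked from."""
        if score <= self.nodes[parent_id].score:
            return None  # discarded: no node is added
        node = Node(len(self.nodes), parent_id, score)
        self.nodes[node.id] = node
        return node

    def best(self) -> Node:
        return max(self.nodes.values(), key=lambda n: n.score)

    def lineage(self, node_id: int) -> List[int]:
        """Chain of committed wins from the root down to node_id."""
        path: List[int] = []
        current: Optional[int] = node_id
        while current is not None:
            path.append(current)
            current = self.nodes[current].parent
        return path[::-1]


tree = ExperimentTree(root_score=0.60)
a = tree.try_fork(0, 0.65)     # first direction from the root
b = tree.try_fork(0, 0.62)     # second direction, also forked from the root
c = tree.try_fork(b.id, 0.70)  # the weaker branch leads to the best result
```

A greedy hill-climb would have continued only from `a`; keeping `b` available as a fork point is what lets the search reach `c`.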
Autoresearch is not the same as AutoML. AutoML automates model and hyperparameter selection for machine learning. Autoresearch is broader: it optimizes anything you can score with a benchmark, including code, agents, prompts, configs, and infrastructure, by running measured experiments in a loop.
The idea was popularized by Andrej Karpathy's "autoresearch," where an LLM runs experiments autonomously to beat its own best score. EVO builds a structured version on top of that idea, adding tree search, parallel agents, shared state, and safety gates.
The loop can optimize anything with a measurable target: code performance, agent task success, model eval scores and cost, configuration and runtime settings, website A/B conversion, and outreach response rates.
EVO is safe to run against your repository: your main branch is never touched. Setup and every experiment run in isolated copies of the repo. Changes that fail your tests or safety gates are discarded, and winning changes are kept as diffs you review before merging.
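One common way to get an isolated copy per experiment is `git worktree`. The sketch below only builds the command lines and does not run them. The directory layout, the `exp/<id>` branch naming, and the function name `worktree_commands` are assumptions for illustration, not EVO's actual conventions.

```python
from pathlib import Path
from typing import Dict, List


def worktree_commands(repo: str, experiment_id: str, base: str = "main") -> Dict[str, List[str]]:
    """Build git commands that give one experiment its own checkout and branch."""
    repo_path = Path(repo)
    # Hypothetical layout: a sibling directory next to the main checkout.
    worktree = str(repo_path.parent / f"{repo_path.name}-exp-{experiment_id}")
    branch = f"exp/{experiment_id}"
    return {
        # git worktree add -b <new-branch> <path> <commit-ish>
        "create": ["git", "-C", repo, "worktree", "add", "-b", branch, worktree, base],
        # Removes only the checkout; the branch remains, so a winning diff can be reviewed.
        "remove": ["git", "-C", repo, "worktree", "remove", "--force", worktree],
    }


commands = worktree_commands("/srv/app", "42")
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Because the experiment commits to its own branch in its own directory, nothing it does modifies the checkout of `main`.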
EVO runs on Claude Code, Codex, Cursor, OpenClaw, Hermes, Opencode, and Pi. Experiments run locally in git worktrees or on remote sandboxes like Modal, E2B, Daytona, AWS, and Azure.
The EVO plugin and CLI are open source under Apache-2.0, installable with two commands, and the project is citable with a DOI.
For teams who own an eval suite and hand-tune a complex harness. EVO runs the experiments in parallel and ships only measurable wins.
Claude Code
EVO instruments your existing Claude Code prompts, tools, and subagent graphs as the search space, and proves which changes beat baseline.
Join teams turning every change into measurable, verified progress.