Original repair tasks
Tasks are written for this corpus rather than lifted from merged patches, which keeps the answer out of the visible prompt.
Measuring AI coding agents on long-horizon Nix repository repair tasks with hidden shell evaluators and concrete worktree diffs.
| Run | Effort | Pass@1 | Score | Agent time | Failed |
|---|---|---|---|---|---|
| GPT-5.5 via Codex CLI · 26-task corpus · 20260625T072711Z-e484ea0f | low | 81% | 2100 / 2600 | 19m 25s | 5 · 26/26 |
| GPT-5.5 via Codex CLI · 26-task corpus · 20260625T073226Z-3fce189c | medium | 73% | 1900 / 2600 | 22m 15s | 7 · 26/26 |
| GPT-5.5 via Codex CLI · 26-task corpus · 20260625T073227Z-167ae812 | high | 73% | 1900 / 2600 | 25m 59s | 7 · 26/26 |
| GPT-5.5 via Codex CLI · 26-task corpus · 20260624T182835Z-4ad8b555 (+2) | xhigh | 85% | 2200 / 2600 | 41m 03s | 4 · 26/26 |
| GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073231Z-84de082a | low | 77% | 2000 / 2600 | 17m 45s | 6 · 26/26 |
| GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073227Z-76c2964d | medium | 81% | 2100 / 2600 | 20m 55s | 5 · 26/26 |
| GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073228Z-a5a4a383 | high | 85% | 2200 / 2600 | 28m 16s | 4 · 26/26 |
| GPT-5.4 via Codex CLI · 26-task corpus · 20260624T190640Z-fa04a19c (+2) | xhigh | 81% | 2100 / 2600 | 40m 02s | 5 · 26/26 |
| GPT-5.4 mini via Codex CLI · 26-task corpus · 20260624T194359Z-268b0abe (+2) | xhigh | 73% | 1900 / 2600 | 39m 46s | 7 · 26/26 |
| Claude Opus 4.8 · 26-task corpus · 20260624T202141Z-881ef1e9 (+2) | default | 81% | 2100 / 2600 | 25m 24s | 5 · 26/26 |
Current rows are local artifacts from results/. The model comparison is summarized in run notes.
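The Pass@1 and Score columns are consistent with a simple rule: each of the 26 tasks is worth 100 points, and Pass@1 is the share of tasks passed, rounded to a whole percent. The page does not state this rule, so the sketch below treats it as an assumption inferred from the table; `summarize` and `POINTS_PER_TASK` are hypothetical names.

```python
TASKS = 26             # corpus size shown in the table
POINTS_PER_TASK = 100  # assumption: inferred from the 2600-point maximum


def summarize(passed: int, tasks: int = TASKS) -> dict:
    """Derive the Pass@1 and Score cells from a count of passed tasks."""
    if not 0 <= passed <= tasks:
        raise ValueError("passed must be between 0 and the task count")
    return {
        "failed": tasks - passed,
        "pass_at_1": f"{round(100 * passed / tasks)}%",
        "score": f"{passed * POINTS_PER_TASK} / {tasks * POINTS_PER_TASK}",
    }
```

Under this assumption, 21 passed tasks gives 81% and 2100 / 2600, and 19 passed tasks gives 73% and 1900 / 2600, matching the rows above.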
The benchmark gives agents a copied starter tree, a prompt, and no access to the hidden evaluator. It rewards final worktree behavior, not a fluent explanation of what the code should do.
The corpus covers flakes, modules, overlays, derivations, fetchers, Home Manager, shell escaping, and package contracts.
Each task has a shell evaluator that checks behavior with small fake package sets and libraries instead of relying on LLM judging.
Every run records logs, timings, pass state, score JSON, and the final diff so failures can be inspected after the benchmark ends.
Task examples
Keep module outputs separated instead of leaking options across systems.
Repair Python/CUDA packaging without falling back to generic Linux path guesses.
Build paths with Nix values while avoiding string interpolation traps.
Explain the observed NixOS service behavior without chasing a plausible but wrong network diagnosis.
Use Home Manager file and XDG options rather than imperative setup.
Preserve the fixed-output fetcher contract with a commit pin and SRI hash.
easy tasks for syntax, lookup, stale options, and small contracts
medium repairs across flakes, containers, issue reports, overlays, and packaging
hard tasks for modules, overlays, and Python/CUDA package inputs
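As an illustration of the fetcher task above, a check on the shape of a pin might look like the following sketch. It assumes a 40-character hexadecimal SHA-1 commit id and a `sha256-` SRI string; the function names are hypothetical and this is not the benchmark's evaluator. It validates string format only: the hash a Nix fetcher expects is computed over the fetched output, which is not necessarily the raw download, so `sri_sha256` only shows how the encoding works.

```python
import base64
import hashlib
import re


def sri_sha256(data: bytes) -> str:
    """Encode a SHA-256 digest in SRI form: 'sha256-' plus base64."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def is_commit_pin(rev: str) -> bool:
    """True for a full 40-hex commit id; branch names and short ids fail."""
    return re.fullmatch(r"[0-9a-f]{40}", rev) is not None


def is_sri_sha256(value: str) -> bool:
    """True when value is 'sha256-' followed by base64 of exactly 32 bytes."""
    prefix = "sha256-"
    if not value.startswith(prefix):
        return False
    try:
        raw = base64.b64decode(value[len(prefix):], validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    return len(raw) == 32
```

A pin such as `main` or an abbreviated commit id fails `is_commit_pin`, which is the kind of drift a fixed-output contract is meant to rule out.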
Methodology
Starter files and the prompt enter a clean temporary workdir.
The agent reads NIXBENCH_PROMPT.md and modifies only local files.
A hidden shell evaluator scores the final tree after the agent exits.
Logs, timing, score JSON, and the final diff are written under results/.
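The four steps above can be sketched as a small harness. This is a minimal sketch, not the project's implementation: the evaluator calling convention (workdir path as the first argument, exit status 0 meaning pass), the output file names, and the function names are all assumptions made for illustration.

```python
import difflib
import json
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

PROMPT_NAME = "NIXBENCH_PROMPT.md"


def tree_diff(before: Path, after: Path) -> str:
    """Unified diff of text files between two trees, ignoring the prompt."""
    names = sorted({p.relative_to(root).as_posix()
                    for root in (before, after)
                    for p in root.rglob("*") if p.is_file()} - {PROMPT_NAME})
    out = []
    for name in names:
        old = (before / name).read_text().splitlines(keepends=True) \
            if (before / name).is_file() else []
        new = (after / name).read_text().splitlines(keepends=True) \
            if (after / name).is_file() else []
        out.extend(difflib.unified_diff(old, new, f"a/{name}", f"b/{name}"))
    return "".join(out)


def run_task(starter: Path, prompt: Path, agent_cmd: list,
             evaluator: Path, results: Path) -> dict:
    results.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "work"
        # Step 1: starter files and the prompt enter a clean workdir.
        shutil.copytree(starter, work)
        shutil.copy(prompt, work / PROMPT_NAME)
        # Step 2: the agent runs inside the workdir; the evaluator stays outside.
        started = time.monotonic()
        agent = subprocess.run(agent_cmd, cwd=work,
                               capture_output=True, text=True)
        elapsed = time.monotonic() - started
        # Step 3: score the final tree only after the agent has exited.
        check = subprocess.run(["sh", str(evaluator.resolve()), str(work)],
                               capture_output=True, text=True)
        # Step 4: write logs, timing, score JSON, and the final diff.
        record = {"passed": check.returncode == 0,
                  "agent_seconds": round(elapsed, 2)}
        (results / "agent.log").write_text(agent.stdout + agent.stderr)
        (results / "evaluator.log").write_text(check.stdout + check.stderr)
        (results / "score.json").write_text(json.dumps(record, indent=2))
        (results / "final.diff").write_text(tree_diff(starter, work))
    return record
```

Keeping the evaluator outside the workdir until the agent has exited is what makes it hidden in this sketch: nothing the agent can read describes the checks that will be run.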
Run your agent
Use the local harness to run a CLI agent under the same contract as the runs above: a copied worktree, a prompt file, and a hidden evaluator that scores the final tree.