NixBench | Agentic Nix Benchmark

Run plot

26 tasks · updated June 25, 2026

[Plot: NixBench score, pass rate by agent time]

Run	Effort	Pass@1	Score	Agent time	Failed
GPT-5.5 via Codex CLI · 26-task corpus · 20260625T072711Z-e484ea0f	low	81%	2100 / 2600	19m 25s	5 · 26/26
GPT-5.5 via Codex CLI · 26-task corpus · 20260625T073226Z-3fce189c	medium	73%	1900 / 2600	22m 15s	7 · 26/26
GPT-5.5 via Codex CLI · 26-task corpus · 20260625T073227Z-167ae812	high	73%	1900 / 2600	25m 59s	7 · 26/26
GPT-5.5 via Codex CLI · 26-task corpus · 20260624T182835Z-4ad8b555 (+2)	xhigh	85%	2200 / 2600	41m 03s	4 · 26/26
GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073231Z-84de082a	low	77%	2000 / 2600	17m 45s	6 · 26/26
GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073227Z-76c2964d	medium	81%	2100 / 2600	20m 55s	5 · 26/26
GPT-5.4 via Codex CLI · 26-task corpus · 20260625T073228Z-a5a4a383	high	85%	2200 / 2600	28m 16s	4 · 26/26
GPT-5.4 via Codex CLI · 26-task corpus · 20260624T190640Z-fa04a19c (+2)	xhigh	81%	2100 / 2600	40m 02s	5 · 26/26
GPT-5.4 mini via Codex CLI · 26-task corpus · 20260624T194359Z-268b0abe (+2)	xhigh	73%	1900 / 2600	39m 46s	7 · 26/26
Claude Opus 4.8 · 26-task corpus · 20260624T202141Z-881ef1e9 (+2)	default	81%	2100 / 2600	25m 24s	5 · 26/26

Current rows are local artifacts from results/. The model comparison is summarized in run notes.
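
The Pass@1, Score, and Failed columns appear to be views of the same pass count. The sketch below reproduces them from the number of failed tasks. It assumes each task is worth a flat 100 points (inferred from 2600 / 26, not stated on the page) and that a task either passes or fails outright. The name `summarize` is hypothetical.

```python
TASKS = 26             # corpus size stated on the page
POINTS_PER_TASK = 100  # assumption: inferred from the 2600-point maximum


def summarize(failed: int) -> dict:
    """Derive the Pass@1 and Score cells from a failed-task count."""
    passed = TASKS - failed
    return {
        "pass_at_1": f"{round(100 * passed / TASKS)}%",
        "score": f"{passed * POINTS_PER_TASK} / {TASKS * POINTS_PER_TASK}",
    }


# Failed counts taken from the table rows above.
for failed in (4, 5, 6, 7):
    print(failed, summarize(failed))
```

Under these assumptions, a row with 5 failed tasks yields 81% and 2100 / 2600, matching the table.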

NixBench exists because plausible Nix often fails at evaluation time.

The benchmark gives agents a copied starter tree, a prompt, and no access to the hidden evaluator. It rewards final worktree behavior, not a fluent explanation of what the code should do.

contamination

Original repair tasks

Tasks are written for this corpus rather than lifted from merged patches, which keeps the answer out of the visible prompt.

scope

Nix-specific failure surfaces

The corpus covers flakes, modules, overlays, derivations, fetchers, Home Manager, shell escaping, and package contracts.

verification

Hand-written checks

Each task has a shell evaluator that checks behavior with small fake package sets and libraries instead of relying on LLM judging.

artifacts

Diff-backed runs

Every run records logs, timings, pass state, score JSON, and the final diff so failures can be inspected after the benchmark ends.

Task examples

Twenty-six small repositories, one hidden evaluator each.

Respect NixOS, Home Manager, and nix-darwin boundaries

Keep module outputs separated instead of leaking options across systems.

Patch Python CUDA package inputs

Repair Python/CUDA packaging without falling back to generic Linux path guesses.

Compose module paths from arguments

Build paths with Nix values while avoiding string interpolation traps.

Debug network symptoms without false leads

Explain the observed NixOS service behavior without chasing a plausible but wrong network diagnosis.

Manage home files declaratively

Use Home Manager file and XDG options rather than imperative setup.

Pin a GitHub source fetcher

Preserve the fixed-output fetcher contract with a commit pin and SRI hash.
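
As a rough illustration of the last task's contract, the sketch below shows the shape of the two values involved: a full 40-character commit pin and an SRI-style `sha256-<base64>` string. The helper names are hypothetical and this is not the benchmark's evaluator. Note also that Nix fetchers such as `fetchFromGitHub` hash a serialization of the unpacked source tree, so hashing raw bytes as done here demonstrates only the string format, not the value Nix would expect.

```python
import base64
import hashlib
import re


def sri_sha256(data: bytes) -> str:
    """Format a SHA-256 digest as an SRI string: 'sha256-' + base64(digest)."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def is_full_commit_pin(rev: str) -> bool:
    """True for a full 40-hex-character commit id, false for branches or tags."""
    return re.fullmatch(r"[0-9a-f]{40}", rev) is not None


def is_sri_sha256(value: str) -> bool:
    """True if the value has the shape of an SRI SHA-256 hash.

    A 32-byte digest encodes to 44 base64 characters ending in one '='.
    """
    return re.fullmatch(r"sha256-[A-Za-z0-9+/]{43}=", value) is not None


print(sri_sha256(b""))
print(is_full_commit_pin("main"))
```

A check along these lines would reject a branch name such as `main` as a revision, since a branch can move while the fixed-output hash cannot.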

easy tasks for syntax, lookup, stale options, and small contracts

medium repairs across flakes, containers, issue reports, overlays, and packaging

hard tasks for modules, overlays, and Python/CUDA package inputs

Methodology

The agent edits a worktree. The evaluator scores the result.

copy
Starter files and the prompt enter a clean temporary workdir.
edit
The agent reads NIXBENCH_PROMPT.md and modifies only local files.
check
A hidden shell evaluator scores the final tree after the agent exits.
record
Logs, timing, score JSON, and the final diff are written under results/.
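
The four steps above can be sketched as a single function. This is a minimal illustration and not the project's harness: `run_task`, its parameters, and the output file names `score.json`, `agent.log`, and `evaluator.log` are hypothetical. It also assumes the evaluator signals a pass through a zero exit code, and it leaves out the final diff that the real runs record.

```python
import json
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def run_task(starter: Path, prompt: str, agent_cmd: list[str],
             evaluator_cmd: list[str], results_dir: Path) -> dict:
    """Hypothetical harness step: copy, edit, check, record."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "worktree"

        # copy: starter files and the prompt enter a clean temporary workdir
        shutil.copytree(starter, workdir)
        (workdir / "NIXBENCH_PROMPT.md").write_text(prompt)

        # edit: the agent runs inside the workdir and changes local files
        started = time.monotonic()
        agent = subprocess.run(agent_cmd, cwd=workdir,
                               capture_output=True, text=True)
        agent_seconds = time.monotonic() - started

        # check: the evaluator lives outside the workdir and only runs
        # after the agent has exited (assumption: exit code 0 means pass)
        check = subprocess.run(evaluator_cmd + [str(workdir)],
                               capture_output=True, text=True)

        # record: logs, timing, and score JSON are written to results_dir
        results_dir.mkdir(parents=True, exist_ok=True)
        result = {
            "passed": check.returncode == 0,
            "agent_exit": agent.returncode,
            "agent_seconds": round(agent_seconds, 3),
        }
        (results_dir / "score.json").write_text(json.dumps(result, indent=2))
        (results_dir / "agent.log").write_text(agent.stdout + agent.stderr)
        (results_dir / "evaluator.log").write_text(check.stdout + check.stderr)
    return result
```

Keeping the evaluator command outside the copied tree is what keeps it hidden in this sketch: the agent only ever sees the starter files and `NIXBENCH_PROMPT.md`.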

Run your agent

Add another row to the benchmark.

Use the local harness to run a CLI agent under the same contract as the rows above: a copied worktree to edit and a hidden evaluator of the same shape.
