On nine discovery-heavy tasks from CodeScaleBench, Sourcegraph's benchmark for coding agents on large multi-repository codebases, Claude Sonnet 4.6 with the Sourcegraph MCP server and no source code on disk scored a mean reward of 0.698. Claude Fable 5, working from the whole repository checked out locally with no retrieval at all, scored 0.568 and cost about 1.8 times as much for each point of quality ($1.83 versus $1.02). The cheaper model with the tool won six of the nine tasks, tied two, and lost one.
Fable 5 is the first generally available model in Anthropic's Mythos class, the tier that sits above Opus, and at release it posted state-of-the-art numbers on most public benchmarks, including the Cursor and Cognition coding evaluations. That is the model that a cheaper, faster Sonnet 4.6 with Sourcegraph outscored here, on the kind of task where finding the right code across a sprawling codebase is most of the job.
One caveat on scope: this is nine tasks, not the full CodeScaleBench. Fable 5 access was restricted within days of release, so these nine are what finished with valid Fable runs before that window closed. Treat the result as an early signal, not a settled benchmark.
A cheaper model with retrieval scored higher and cost less
| Setup | Mean reward | $/quality point |
|---|---|---|
| Sonnet 4.6 + Sourcegraph MCP server | 0.698 | $1.02 |
| Fable 5, local checkout | 0.568 | $1.83 |
Sonnet 4.6 with Sourcegraph carries no source code on disk and reaches for the codebase through search, symbol resolution, and reference following. Fable 5 has the whole repository checked out locally and reads it directly. The cheaper model came out ahead on score and on cost, and the margin shows up most on the tasks where finding the right code across repositories is the hard part.
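To put numbers on "ahead on score and on cost", here is a minimal sketch that recomputes the margins from the two summary rows above. The dictionary layout and variable names are illustrative and not part of CodeScaleBench. The post does not spell out how $/quality point is derived, so the sketch takes it as a given figure rather than recomputing it.

```python
# Summary figures copied from the table above; the structure is illustrative.
summary = {
    "sonnet_sourcegraph": {"mean_reward": 0.698, "usd_per_point": 1.02},
    "fable_local": {"mean_reward": 0.568, "usd_per_point": 1.83},
}

sonnet = summary["sonnet_sourcegraph"]
fable = summary["fable_local"]

# Absolute and relative difference in mean reward.
reward_margin = sonnet["mean_reward"] - fable["mean_reward"]
relative_margin = reward_margin / fable["mean_reward"]

# How many times more Fable 5 costs per point of quality.
cost_ratio = fable["usd_per_point"] / sonnet["usd_per_point"]

print(f"reward margin:   {reward_margin:+.3f}")
print(f"relative margin: {relative_margin:.1%}")
print(f"cost ratio:      {cost_ratio:.2f}x")
```

On these figures the reward margin is 0.130, roughly 23% above the Fable 5 score, and the cost ratio is about 1.79.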
The gap is cross-repository discovery
The margin is not spread evenly across the nine tasks. It concentrates on the tasks that turn on finding where a symbol is actually used across repositories: ccx-crossorg-217 (+0.591) and ccx-vuln-remed-135 (+0.380), a cross-organization dependency task and a vulnerability-tracing task, carry most of it. Tasks that both setups already solve, a config trace or a clean migration, tie or nearly tie near the top, since there is no room to gain on a task that already scores well. The incident task ties at the other end, with both setups at 0.172. On ccx-crossorg-288, one of the cross-organization dependency tasks, the frontier model on a full local checkout came out ahead by 0.098, which suggests the comparison is measuring something real rather than leaning one way by construction.
| Task | Sonnet + Sourcegraph | Fable, no tool | Gap |
|---|---|---|---|
| ccx-crossorg-217 | 0.695 | 0.104 | +0.591 |
| ccx-vuln-remed-135 | 0.611 | 0.231 | +0.380 |
| ccx-migration-289 | 0.681 | 0.574 | +0.107 |
| ccx-agentic-223 | 0.625 | 0.537 | +0.088 |
| ccx-migration-274 | 1.000 | 0.928 | +0.072 |
| ccx-vuln-remed-126 | 0.759 | 0.735 | +0.024 |
| ccx-config-trace-010 | 1.000 | 1.000 | 0.000 |
| ccx-incident-145 | 0.172 | 0.172 | 0.000 |
| ccx-crossorg-288 | 0.735 | 0.833 | -0.098 |
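The headline numbers can be checked against the per-task rows. The sketch below recomputes the mean rewards, the win/tie/loss count, and the share of the positive gap carried by the two largest tasks. The scores are copied from the table above. The `summarize` helper and its field names are hypothetical, written for this illustration rather than taken from the benchmark's tooling.

```python
from statistics import mean

# (task, Sonnet + Sourcegraph, Fable with no tool), copied from the table above.
RESULTS = [
    ("ccx-crossorg-217", 0.695, 0.104),
    ("ccx-vuln-remed-135", 0.611, 0.231),
    ("ccx-migration-289", 0.681, 0.574),
    ("ccx-agentic-223", 0.625, 0.537),
    ("ccx-migration-274", 1.000, 0.928),
    ("ccx-vuln-remed-126", 0.759, 0.735),
    ("ccx-config-trace-010", 1.000, 1.000),
    ("ccx-incident-145", 0.172, 0.172),
    ("ccx-crossorg-288", 0.735, 0.833),
]


def summarize(results):
    """Hypothetical helper: aggregate per-task scores into headline figures."""
    # Round gaps to three decimals to match the table's precision.
    gaps = {task: round(s - f, 3) for task, s, f in results}
    positive = sorted((g for g in gaps.values() if g > 0), reverse=True)
    return {
        "sonnet_mean": round(mean([s for _, s, _ in results]), 3),
        "fable_mean": round(mean([f for _, _, f in results]), 3),
        "wins": sum(g > 0 for g in gaps.values()),
        "ties": sum(g == 0 for g in gaps.values()),
        "losses": sum(g < 0 for g in gaps.values()),
        # Fraction of the total positive gap that the two largest tasks carry.
        "top_two_share": round(sum(positive[:2]) / sum(positive), 2),
        "gaps": gaps,
    }


stats = summarize(RESULTS)
for key in ("sonnet_mean", "fable_mean", "wins", "ties", "losses", "top_two_share"):
    print(f"{key}: {stats[key]}")
```

The means come out at 0.698 and 0.568 with six wins, two ties, and one loss, and the two largest gaps account for about 77% of the summed positive gap, which is the concentration described above.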
Read the runs yourself
The results explorer puts the runs for any task side by side, the cheaper model with Sourcegraph and the frontier model on a local checkout, each with the full prompt, the conversation, and every tool call, so you can watch where one run finds the file that another never reaches.
The practical read, for anyone wiring up an agent against a budget: a cheaper, faster model with good code retrieval beats a more expensive frontier model without it, at least on the cross-repository work this benchmark is built from. Where that stops holding, on tasks that need deep single-file reasoning rather than navigation, is the next thing worth measuring. If you are building or evaluating agent tooling on real codebases, I'd love to chat! You can reach me at [email protected].