An alternate title for this post could have been 'existing benchmarks are awful at evaluating how well agents can perform development tasks, not to mention universally misinterpreted, and I mostly can't use them to evaluate software development capabilities, and neither can you, so I had to build my own benchmark from scratch, UGH' but that's kind of long. In a post in January, I wrote (ranted?) about my many issues with coding agent benchmarks and my research journey (here's my paper library) to try to figure out a better way to approach them.
My problem with benchmarks
Unfortunately, no existing benchmark meets all of my criteria.
What I Want in an Enterprise-Scale Coding Agent Benchmark:
Has at least some huge codebases (ideally 1M+ lines of code)
Multiple coding languages (I love Python, but it's primarily a data analysis / ML scripting language; banks aren't building their legacy codebases in dynamically typed languages like that, yet the vast majority of SWE-Bench-style repos use Python)
Has tasks that require navigating across multiple repositories
Has tasks that cover the full software development lifecycle, not just one narrow part of it (looking at you, bug-fix-only benchmarks)
So I made a benchmark that covers enterprise-scale tasks across the software development lifecycle and organizational use cases. CodeScaleBench is a living benchmark (I'm still working on it) with 370 software engineering tasks (and counting), divided into two parts. CodeScaleBench-SDLC currently has 150 tasks spanning the full SDLC; it uses the patch-based verifier method popularized by SWE-Bench and also includes a corresponding ground_truth.json file, produced by a curator agent, for context-retrieval metrics. CodeScaleBench-Org currently has 220 tasks, separated into development tasks that require organization-level codebase navigation and understanding. This subset uses what I call an 'artifact' verifier: the agent produces an answer.json file that is then compared against the curator agent's solution.
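To make the 'artifact' verifier idea concrete, here's a minimal sketch of the comparison step, assuming a flat key/value answer.json. The real verifier's schema and scoring live in the technical report; the matched-fraction reward below is purely my illustration.

```python
# Hypothetical artifact check: score the agent's answer.json against the
# curator's solution by exact field matches. The flat schema and the
# matched-fraction reward are illustrative assumptions, not the real verifier.

def artifact_reward(agent_answer: dict, curator_solution: dict) -> float:
    """Fraction of curator fields the agent answered exactly."""
    if not curator_solution:
        return 0.0
    matched = sum(
        1 for key, value in curator_solution.items()
        if agent_answer.get(key) == value
    )
    return matched / len(curator_solution)

# In practice the two dicts would come from json.load() on the agent's
# answer.json and the curator's solution file.
```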
And by the way, building a benchmark for coding agents while using coding agents is a fun way to find new failure modes. We all know agents are sneaky and manipulative genies, and that's also why I think benchmark results should ship with full agent transcripts for auditing (more on that later, I know I'm asking a lot of you, but I promise if you like benchmarks, this is interesting and also explains why you read this far).
Side note: I'm going to mostly call the agent runs that used the Sourcegraph MCP 'MCP'; but I want to make it clear that this isn't commentary on the impact of MCP generally, but rather an investigation of the effect of code understanding and navigation tools on software development tasks completed by coding agents.
The setup
The same agent (starting with Claude Code + Haiku 4.5) runs the same task under two conditions:
Baseline: Full local source code. Standard tools (grep, file, read, etc.). No MCP.
Sourcegraph MCP-augmented: Source code isn't there. The agent gets 13 Sourcegraph MCP tools (semantic search, symbol resolution, dependency tracing, cross-repo navigation, etc.) and has to use them to find what it needs. To make this work, I mirrored all benchmark repos to a GitHub organization at pinned commits (~180 mirrors), so Sourcegraph indexes the exact version each task targets. (It took me an embarrassingly long time to realize this was necessary for a valid comparison; initially I was just pointing the tools at the repo HEAD, and it took some trace detective work to spot the mistake.)
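The mirroring step can be sketched roughly like this. Everything here is illustrative: the branch name, URLs, and exact git invocations are my assumptions, not the actual pipeline.

```python
# Rough sketch of mirroring a benchmark repo at a pinned commit so
# Sourcegraph indexes exactly the version a task targets. The mirror URL,
# branch name, and flags are hypothetical stand-ins.
import subprocess

def mirror_plan(src_url: str, pinned_sha: str, mirror_url: str, workdir: str):
    """Return the git commands that would publish `pinned_sha` of `src_url`
    as the mirror's default branch."""
    return [
        ["git", "clone", src_url, workdir],
        ["git", "-C", workdir, "push", "--force", mirror_url,
         f"{pinned_sha}:refs/heads/main"],
    ]

def run_mirror(*args):
    for cmd in mirror_plan(*args):
        subprocess.run(cmd, check=True)
```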
Giving the baseline agent access to all code locally makes this a conservative test. In real enterprise settings, the agent wouldn't have full local access to every relevant repo or to the tens of millions of lines of a monolithic monster. But this benchmark tests whether differences in context-retrieval approaches, given access to the same information, affect SDLC task outcomes. A future post will cover tasks that a baseline agent can't do at all without the Sourcegraph MCP. That said, I also found tasks where local tools were insufficient even with all the code available locally; those were only possible with these retrieval tools. A little later, I show examples of agents without these tools getting lost in massive codebases like Kubernetes or getting confused about refactoring in Java repos.
I documented the full methodology, evaluation layers, and information retrieval analysis pipeline in a draft technical report.
What I used (and what I threw out)
Most of CSB-SDLC and all of CSB-Org's tasks are original and not pulled from an existing benchmark. However, each one is grounded in a real repository at a pinned commit, targeting a real development scenario pulled from GitHub issues, PRs, and codebase analysis. I designed the CSB-Org tasks using a custom use-case registry and artifact-evaluation setup for cross-repository code intelligence; check out the technical report for more details on the 'direct' SWE-bench-style verifier mode for code modifications vs. an 'artifact' answer.json approach.
I also created an agentic benchmark checklist pipeline (inspired by this paper) to audit each task before it enters a suite. It runs automated checks across three dimensions: Task Validity, Outcome Validity, and Reporting, and flags issues as PASS/FAIL/WARN/SKIP with severity-aware grading (A-F) based on critical and essential criteria. It catches many structural and verifier-quality problems; it's complementary to a separate preflight runtime validation check I put in place in my (arguably semi-futile) attempts to eliminate all failure modes (more on that in the QA section).
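As a rough illustration of the severity-aware grading: the thresholds and severity names below are my own stand-ins, not the checklist pipeline's actual rules.

```python
# Illustrative severity-aware grading: map PASS/FAIL/WARN/SKIP check
# results to an A-F grade. The cutoffs and severity labels are hypothetical.

def grade(results):
    """results: list of (status, severity) tuples with status in
    {"PASS", "FAIL", "WARN", "SKIP"} and severity in
    {"critical", "essential", "minor"}."""
    fails = [sev for status, sev in results if status == "FAIL"]
    if "critical" in fails:
        return "F"                     # any critical failure sinks the task
    if fails.count("essential") >= 2:
        return "D"
    if "essential" in fails:
        return "C"
    warns = sum(1 for status, _ in results if status == "WARN")
    return "B" if warns else "A"
```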
Some benchmark highlights
Understanding impact in a large codebase: The baseline agent hit its nearly 2-hour timeout navigating the Kubernetes monorepo and couldn't complete the task. MCP completed it in 89s with a reward of 0.90/1.0. The MCP agent used 8 keyword searches, 6 semantic searches, and 1 find_references call to map the DRA allocation impact chain across cross-package dependencies. This task was infeasible with only local retrieval tools.
A refactor task: Hard cross-file Java refactoring in the Strata finance library. Both configs took ~17 min. Baseline made minimal changes (6 lines added, 6 removed across 2 files), reward 0.32. MCP identified all affected files for a complete refactoring (725 lines added) that passed all verifier tests; reward: 0.80.
Another hard cross-file refactoring: Baseline made 96 tool calls over 84 min (including 6 backtracks) for a reward of 0.32. MCP made 5 tool calls in 4.4 min, earning a reward of 0.68. The MCP agent searched for RecordAccumulator and related symbols, read 3 files, and completed the task with over double the reward score.
Where Sourcegraph MCP doesn't seem to help
All of the other SDLC and Org task suites with negative deltas were effectively flat (the error bars cross 0, so the effect isn't distinguishable, and the deltas are all smaller than 0.05 anyway). However, there is more to look into there: the task counts may not be sufficient for adequate power; I need to dig further into the traces for more detective work. Codebase size and the MCP preamble are additional factors to look into further.
Improving context retrieval also isn't the bottleneck for every software development situation. Codebase size, harness, language, task type, and prompt content all contribute. The technical report covers the full per-suite breakdown.
Retrieval differences
I built an information retrieval evaluation pipeline alongside task scoring to measure how agents find information across codebases, and whether they actually use what they find to complete their tasks.
These preliminary results show that MCP retrieval substantially improves the amount and quality of context retrieved compared to the baseline. Across the combined dataset, file recall increases from 0.127 to 0.277, Precision@5 rises from 0.140 to 0.478, and F1@5 improves from 0.099 to 0.262, indicating that MCP retrieves both more relevant files and a larger fraction of the ground truth within the top results.
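For reference, here's how those metrics can be computed in a minimal form; the report's exact definitions may differ (e.g. around duplicates or what counts as a relevant file).

```python
# Minimal definitions of the IR metrics above: Precision@k over the top-k
# retrieved files, recall over the ground-truth set, and their harmonic
# mean as F1@k. My own simplified formulas, not necessarily the report's.

def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved files that are relevant
    (divides by k even if fewer than k files were retrieved)."""
    return sum(1 for f in retrieved[:k] if f in relevant) / k

def recall(retrieved, relevant):
    """Fraction of the ground-truth files that were retrieved."""
    return sum(1 for f in set(retrieved) if f in relevant) / len(relevant)

def f1_at_k(retrieved, relevant, k=5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall(retrieved[:k], relevant)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```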
Auditable results (transcripts!)
I mentioned earlier that benchmark results should ship with full agent transcripts. Here's how I approached it for this benchmark framework.
Every task run in CodeScaleBench produces two artifacts beyond the score: a structured result.json with task metadata, pass/fail status, rewards, and timing, plus a full tool-usage transcript showing how the agent interacted with tools, including MCPs. These transcripts are how I found the git history bypass hack, what Claude Code called MCP death spirals, verifier failures, and every other issue in this post. Without them, those issues could persist, undermining the validity of the results.
All results described here, including full traces, tool breakdowns, and IR metrics, are published in the repo here.
The results explorer
In addition to navigating the results via the markdown files, if you clone the repo and run:
`python3 scripts/export_official_results.py --serve`
You get a local results explorer that lets you browse every task run. It shows task results across all suites, configs, and runs.
The Official Results Browser lets you filter by suite, task run, config, and status. Every row links to the task's repo, benchmark definition, trajectory, and audit trail.
Drilling into a specific task, here's a baseline run of an onboarding task from CSB-Org where the agent needs to map data flow across the Python libraries numpy, pandas, and scipy.
And an example MCP-augmented run in the CSB-SDLC Fix suite. The agent resolves a bug in the massive Kubernetes repo, earning a reward score of 0.74 with ~99s of agent task time.
Each task detail view includes expandable sections for the tool breakdown, context metrics/IR analysis, and the complete conversation history. You can verify not only whether the agent succeeded, but also how it approached the task, what tools it used, and where it went right or wrong.
Benchmark QA is SUPER IMPORTANT
To mitigate the fragility in my own setup, I had Claude develop some QA and other benchmarking helper skills and built an agentic benchmark checklist pipeline (the one I mentioned earlier, inspired by this paper). Every run also passes through automated validation across six dimensions before being promoted to official status; it detects instruction contamination, broken verifiers, reproducibility issues, ghost runs, misclassified errors, and tool-effectiveness problems. Run outputs go straight to staging and are promoted to 'official' once they pass several quality gates.
The six dimensions:
- Task Validity -- instruction quality, Dockerfile correctness, task metadata
- Outcome Validity -- verifier soundness, scoring accuracy, fail2pass checks
- Reporting -- result.json completeness, metrics extraction, audit trail
- Reproducibility -- deterministic environments, pinned commits, verifier idempotence
- Tool Effectiveness -- MCP adoption rates, zero-tool detection, death spiral flagging
- Statistical Validity -- sufficient sample size, paired comparison integrity, CI coverage
A couple of QA-detective highlights:
MCP Preamble: This is the instruction text prepended to each task telling the MCP agent about its tools. I went through five iterations:
V1 and V2 were too subtle, so the agent never used MCP tools. At that time I was also putting the local code into the Docker container alongside the MCP, which further confused it. V3 overcorrected with a bunch of all-caps reinforcement text, which got 90%+ adoption but caused what Claude Code coined the "MCP death spiral": when a mirror was broken or a repo name was wrong, the agent would spend its entire context window retrying failed MCP queries, scoring 0.0 on tasks where the baseline scored 1.0. (That was primarily due to environmental issues separate from the preamble.) For V4, I reverted to "soft guidance," and adoption dropped to 40%. V5 finally produced acceptable results with "These files are not present locally; you must use MCP tools to access source code," which forced adoption without mandating a specific workflow (though the improvement was largely driven by removing the local code entirely).
Honestly, I still don't think the preamble is perfect, and I think one of the main takeaways for folks using the Sourcegraph MCP is that you need to experiment with which agent prompt works best in your codebase. The value you get from it can depend strongly on that setup.
Claude Gaming the System with Git Treachery: There are more examples of Claude doing weird stuff that took detective work to find, but one particularly memorable one was a git history bypass bug. Claude was gaming the truncation I had set up in the MCP Docker environments: it figured out that `git show HEAD:filename` could recover the complete source from git history, completely defeating the experimental setup. The fix (recommitting the truncated state as a new commit so `git show HEAD:` returns empty files) was straightforward enough, but finding it required actually reading through agent transcripts (or rather, asking another Claude to). A reminder that systematic QA during benchmark design, especially with AI-generated infrastructure, is non-negotiable.
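A minimal sketch of that recommit fix, assuming (my assumption, not spelled out in the post) that the mirror's earlier history isn't otherwise reachable by the agent, e.g. via a shallow or squashed clone:

```python
# Sketch of sealing a truncated working tree: re-commit it so that
# `git show HEAD:<file>` returns the truncated contents instead of the
# full original source. Helper name and message are hypothetical.
import subprocess

def seal_truncation(repo_dir: str) -> None:
    """Commit the current (truncated) working tree as a new HEAD."""
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", repo_dir, "commit", "--allow-empty",
         "-m", "seal truncated state for benchmark"],
        check=True,
    )
```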
Takeaways
I built CodeScaleBench because I couldn't find a benchmark that measured what actually matters in enterprise software development: large codebases, multi-repo work, and tasks beyond bug fixing. What the data says so far:
MCP helps most on cross-repo, high-complexity work.
The clearest gains are in org-scale tasks like incident debugging and security, where context is fragmented across repos and symbol chains.
Tasks with the MCP are completed faster and cheaper.
MCP runs are consistently faster in both wall time and agent execution time, and are cheaper on average per task. This especially matters when running background agents in bulk.
Retrieval quality improved a lot with the MCP, but retrieval alone isn't everything.
Better recall/precision/F1 doesn't always translate into a huge reward jump. Harness behavior, verifier design, codebase shape, and task type still matter.
Agents mostly do lexical search unless pushed.
Even with better tools available, they default to keyword search and underuse semantic/deep retrieval. Tool capability is one thing; tool adoption is another.
Benchmark QA is extremely time-consuming and extremely important.
Most time went into finding silent failures, verifier bugs, instruction contamination, and agent exploit paths. Without aggressive QA + transcript audits, benchmark numbers are easy to overtrust.
This is directional evidence, not a final verdict.
I still need more runs, more balanced task distributions, and more harnesses. But the current signal is strong: retrieval tooling matters most where enterprise development is hardest.
The technical report includes the full methodology, statistical analysis, and details of the evaluation pipeline.
What's next
Many, but not all, of the tasks have multiple runs, and I have more tasks to run to better account for agent non-determinism. I'm expanding the benchmark framework to support six agent harnesses (Claude Code, Codex, Cursor, Gemini, Copilot, OpenHands). Running the full suite across multiple agents will separate the MCP tool's effectiveness from agent-specific strengths.
I'm also planning Deep Search-focused and other MCP tool-combination-focused experiments, SCIP-indexed codebase comparisons (compiler-accurate code navigation vs. text search), and evaluations of alternative MCP providers, such as the GitHub MCP server. The benchmark is provider-agnostic, and the standardized MCP protocol means you can swap providers with just a config change.
If you're building or evaluating tools for agents working on software development (or just interested in that stuff) and want to check out the benchmark, the repo is public. I'd love to get thoughts and feedback from folks (you can email me at [email protected]). I'll write up another post soon linking to the white paper with the finalized benchmark design, data, and interpretations.