Why coding agents fail in large codebases (and what to do about it)
Data from 1,281 agent runs across 40+ large open source repos reveals five repeatable failure patterns in coding agents, and the infrastructure fixes for each.

A coding agent is tasked with analyzing how a specific resource allocation mechanism propagates decisions across multiple packages in the Kubernetes source code. This requires tracing the Dynamic Resource Allocation (DRA) system through 1.4 million lines of Go spread across 22,000 files.
The agent starts working. It greps for the DRA allocator. Hundreds of results across dozens of directories. It reads a file. That file imports from another package. It navigates there. That package references three more. Each step is locally reasonable. The whole trajectory leads nowhere.
At the 6,000-second mark, an hour and 40 minutes in, the agent has produced nothing. No analysis. No output. Zero score.
Now the same agent gets the same task with one difference: instead of local file access, it has code search tools backed by a proper index that include keyword search, semantic search, and find-references. Eight keyword searches and six semantic searches identify the relevant packages. One find-references call maps the cross-package dependency chain. In 89 seconds, the agent produces a comprehensive analysis and scores 0.90 out of 1.0.
Same model. Same task. Different context infrastructure. The difference between complete failure and near-perfect completion wasn't intelligence; it was efficient access to context.
The failure patterns described here are drawn from CodeScaleBench, a benchmark of software engineering tasks evaluated across 40+ of the largest open source repositories spanning 9 programming languages. The observations come from 1,281 scored agent runs, enough to distinguish systematic patterns from noise and to assign specific causes to each failure mode.
Understanding these patterns matters because each one maps to a specific infrastructure intervention. If you skip the diagnosis, you build the wrong thing.
Agent failures in large codebases are not random. They cluster into five repeatable patterns. Knowing which one you're looking at determines what to fix.
The most common failure mode is getting lost in the codebase: the agent navigates endlessly, following references and reading files, but never converges on a plan or produces output. It burns its entire timeout on exploration.
The mechanism is straightforward. The agent's basic strategy is to read a file, follow its imports, and read the next file. This works fine when the search space is bounded, but in a codebase with 22,000 files, this strategy produces an exploration tree that branches faster than the agent can prune it. Every file references three more. The agent has no way to identify which branches matter without reading them all.
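A back-of-the-envelope sketch makes the scale concrete. The branching factor of three comes from the observation above; the depth range is an illustrative assumption:

```python
# Unguided import-following with a branching factor of ~3:
# how many files become reachable at each traversal depth.
branching = 3
frontier, reachable = 1, 1
for depth in range(1, 11):
    frontier *= branching
    reachable += frontier
    print(f"depth {depth:2d}: ~{reachable:,} files reachable")
# By depth 9 the tree (~29,500 files) already exceeds the 22,000-file repo,
# so the agent cannot prune it by reading alone within any realistic budget.
```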
The CodeScaleBench data shows this correlates directly with codebase size. Agents with only local tools (grep, file read, glob) begin to struggle systematically when codebases exceed roughly 400,000 lines of code. The reward delta when code intelligence tools are added:
| LOC Range | Reward Delta | What's happening |
|---|---|---|
| < 400K | −0.080 | Tools add overhead; grep typically works fine |
| 400K–2M | +0.259 | Strongest positive effect |
This isn't a reasoning failure brought on by codebase complexity; the agent's per-step reasoning is correct. It's a search infrastructure failure: the agent lacks tools to narrow the search space before committing to a traversal path.
The second pattern is wrong file, wrong symbol: the agent finds code matching its search terms but selects the wrong result. In a large codebase, common symbol names appear in dozens of files. A grep for `allocate` in Kubernetes returns hundreds of matches across test files, deprecated code, utility functions, and the actual allocation logic.
```
# Lexical search: 47 matches, no ranking by structural relevance
$ grep -rn "allocate" pkg/ | wc -l
47

# Structural navigation: 1 definition, 12 call sites, ranked by role
$ find-references --symbol "DRAAllocator.allocate" --scope "pkg/"
→ Definition: pkg/scheduler/framework/plugins/dynamicresources/allocator.go:142
→ 12 call sites across 4 packages
```
In the benchmark data, keyword search was the most-used tool modality: 7,993 calls across all runs, compared to 2,449 semantic search calls and 57 deep-search calls. Agents overwhelmingly prefer the simplest tool that works. But lexical search alone can't distinguish a symbol's definition from the 47 matches spread across call sites, test mocks, and documentation mentions. When every package has a `handler.go` or every module has an `__init__.py`, text search produces many results with no way to rank them by structural relevance.
The fix is structural code navigation: go-to-definition, find-references, and type-hierarchy resolution that leverage the compiler's understanding of the code to distinguish definitions from call sites.
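As a rough illustration of what the compiler's view buys you, here is a minimal Python sketch that separates a symbol's definition from its call sites using the standard `ast` module. A production tool would rely on full compiler or LSP data rather than a per-file AST walk, and `allocate` is just a stand-in symbol name:

```python
import ast
from pathlib import Path

def definitions_and_call_sites(root: str, symbol: str):
    """Walk Python files under root, separating definitions of `symbol` from calls to it."""
    definitions, call_sites = [], []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == symbol:
                definitions.append((str(path), node.lineno))      # structural role: definition
            elif isinstance(node, ast.Call):
                callee = node.func
                name = getattr(callee, "attr", None) or getattr(callee, "id", None)
                if name == symbol:
                    call_sites.append((str(path), node.lineno))   # structural role: call site
    return definitions, call_sites

defs, calls = definitions_and_call_sites("pkg/", "allocate")
print(f"{len(defs)} definition(s), {len(calls)} call site(s)")
```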
The third pattern is partial completion: the agent finds some of the relevant code but misses the rest. In a cross-file refactoring task in the Strata finance library, the baseline agent modified 2 of 7 affected files and scored 0.32. The agent using the Sourcegraph MCP identified all affected files and produced a complete refactoring, scoring 0.80.
The baseline agent wasn't wrong about the files it found. Its changes were locally correct. It simply didn't find the other five files that needed to be changed. In a tightly coupled codebase, a partial refactoring is often worse than no refactoring because it leaves the code in an inconsistent state where the API contract at the changed site diverges from the unchanged call sites.
This is the subtlest failure mode because the run appears partially successful. The problem compounds in multi-repository tasks, where the CodeScaleBench data shows a +0.209 F1 delta compared to +0.085 in single-repository tasks. When the code an agent needs spans repository boundaries, the likelihood of finding all of it drops substantially without specialized tooling.
The fourth pattern is tool thrashing: the agent makes excessive tool calls and repeatedly backtracks through trial and error. In one refactoring task, the baseline agent made 96 tool calls over 84 minutes (including 6 complete reversals of approach) and scored 0.32. The agent with the Sourcegraph MCP used 5 targeted search calls in 4.4 minutes and scored 0.68.
The 96-to-5 ratio is not an outlier. When an agent lacks efficient search tools, its strategy for finding code degenerates: grep for a term, read the result, realize it's wrong, grep for a different term, read another file, backtrack, try a different directory. Each cycle consumes tokens and time.
MCP-augmented agents, those with structured search interfaces backed by proper indexes, are 30% cheaper ($0.51 per task vs. $0.73 baseline) and 38% faster on average. The cost and time reduction comes almost entirely from eliminating thrashing. And because retrieval isn't bottlenecked by reasoning, it also widens the range of models teams can use while still getting retrieval good enough for downstream agentic tasks.
Tool thrashing isn't just slower. It's structurally worse. Each backtrack leaves residue in the conversation history, file contents that are no longer relevant but still consume context. By the time the agent finds the right files, it may have less context to produce output than it would have had if it had found them on the first try.
The fifth pattern is context overflow: the agent reads too much irrelevant code and loses focus. Even when it finds the right files, it often reads them in their entirety, and hundreds of lines of irrelevant code dilute the signal.
Here's the counterintuitive finding: providing agents with more tools sometimes made this worse, likely because they weren't given sufficient strategic information on when and how to use them. On certain tasks, agents with code search tools took longer than agents with only local tools. In these instances, the agent used search to find additional code to read, spent time understanding it, and still had to do the same local work as the baseline agent.
This has a well-established cognitive mechanism. Research on long-context language models shows they struggle to use information in the middle of long contexts. When an agent stuffs its context with search results, the most relevant information may end up in the worst position for the model's attention. More retrieved code doesn't improve performance if the model can't effectively access it. We're actively working to improve tool instructions to steer agents toward more effective use of their available capabilities.
The solution is not a bigger context window; it's a smarter selection of what goes into the window, which is entirely controllable through code intelligence tooling.
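A minimal sketch of that selection step, assuming snippets arrive pre-scored by a retriever. The chars-per-token estimate and the edge-first ordering (a simple counter to the "lost in the middle" effect) are illustrative choices, not benchmark settings:

```python
def pack_context(snippets, budget_tokens=8_000):
    """snippets: list of (relevance_score, text) pairs from a retriever."""
    ranked = sorted(snippets, key=lambda s: s[0], reverse=True)
    chosen, used = [], 0
    for _, text in ranked:
        cost = len(text) // 4          # rough chars-per-token estimate
        if used + cost > budget_tokens:
            break                      # stop instead of truncating mid-snippet
        chosen.append(text)
        used += cost
    # Place the strongest evidence at the start and end of the prompt,
    # weaker material toward the middle, where attention is weakest.
    head, tail = chosen[0::2], chosen[1::2]
    return "\n\n".join(head + tail[::-1])
```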
A common reaction to this list is to ask when models will get good enough that these problems disappear. Most of these problems won't disappear with better models, because they aren't model capability problems.
Consider "lost in the codebase." The agent's per-step reasoning is correct; it reads a file, identifies relevant imports, and follows them. The problem is that this strategy has exponential branching in large codebases. A smarter model would follow the same strategy more efficiently, but it would still face the same combinatorial explosion. The solution isn't a smarter agent; it's a search index that lets the agent skip exploration and go directly to relevant code.
Or consider tool thrashing. The 96 tool calls aren't 96 mistakes. They are 96 attempts to find information using the only tools available. Each individual search is reasonable, but those tools can't distinguish structural relevance from textual co-occurrence, so the agent tries multiple approaches. A smarter model would still be constrained by the same tools.
This distinction matters for how you invest engineering effort. If you believe the problem is model capability, you wait for the next model release. If you understand it as infrastructure, you incorporate code search systems, structured indexes, and retrieval pipelines. The latter approach produces compounding returns: the infrastructure you build for today's agents makes tomorrow's agents better, too.
Each failure mode points to a specific infrastructure investment.
Lost in the codebase and wrong file, wrong symbol: Code search and indexing infrastructure. The +0.259 reward delta at 400K–2M LOC is the empirical case for investing here. Structural navigation (go-to-definition, find-references, type hierarchies) gives agents what lexical search cannot: the compiler's understanding of which code is structurally related.
Partial completion: Retrieval coverage systems. Hybrid retrieval pipelines that combine keyword, semantic, and structural search maximize the probability of finding all affected files, not just the obvious ones.
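One simple way to combine those modalities is reciprocal rank fusion over the separate result lists. The sketch below uses hard-coded, hypothetical file names in place of real search backends:

```python
from collections import defaultdict

def rrf_merge(result_lists, k=60):
    """Fuse several best-first ranked lists into one ranking (reciprocal rank fusion)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, path in enumerate(results):
            scores[path] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_merge([
    ["allocator.go", "allocator_test.go", "claims.go"],  # keyword hits
    ["claims.go", "scheduler.go"],                       # semantic hits
    ["allocator.go", "scheduler.go", "api.go"],          # find-references hits
])
print(fused)  # files surfaced by multiple modalities rank first
```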
Tool thrashing and context overflow: Context management. Task-type-aware retrieval that matches strategy to task structure, with empirical thresholds for when more context helps versus hurts.
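A sketch of what task-type-aware selection could look like, using the ~400K-LOC threshold from the table above. The strategy names and the multi-repo rule are illustrative assumptions, not benchmark policy:

```python
def choose_retrieval_strategy(loc: int, repo_count: int, task_type: str) -> str:
    if repo_count > 1:
        return "hybrid"            # cross-repo tasks showed the largest gains from indexed search
    if loc < 400_000:
        return "local"             # grep/read is usually enough; extra tools add overhead
    if task_type == "refactor":
        return "structural-first"  # find-references to enumerate every affected call site
    return "hybrid"                # keyword + semantic + structural for large single repos
```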
The common thread: reliable coding agents in production aren't built by just choosing the right model. They're built by solving the engineering problems between the model and the codebase. Code search, structured indexes, retrieval pipelines, and proper evaluation infrastructure are what turn a model that can reason about code into an agent that reliably does.
