Lessons from Building AI Coding Assistants: Context Retrieval and Evaluation
This work was published as an industry paper at RecSys ’24 and is freely accessible on arXiv.
Intro: from LLM to coding assistant with context
Large language models have proven extremely useful as the cornerstone of the new generation of coding assistants. Developments in this area are happening at a breakneck pace, and it's all but guaranteed that a new, better LLM for coding is just around the corner. However, a regular LLM has no knowledge of your codebase out of the box; you'll need to provide that yourself!
To go from LLM to coding assistant, context is key. Any AI coding assistant should be able to provide highly relevant, contextual answers that are grounded in your codebase. Simply put, without context, an LLM can only provide generic responses based on its training data. With proper context, it can understand and reason about your specific code, architecture, and development practices.
The context engine is the crucial part of Sourcegraph’s chat functionality that lets it provide responses tailored to your codebase. As the name implies, it provides context to the LLM, which can then extract the necessary information and produce a satisfactory answer via in-context learning.
In-context learning is one of the most powerful capabilities of modern LLMs: by including relevant information in the prompt, we can guide the model to solve new tasks without additional training. Think of it as giving the LLM a mini-reference manual for each query. For example, if you ask about error handling in your application, we can include snippets showing your error handling patterns, logging setup, and error types in the prompt. The LLM can then use this information to provide specific, accurate answers about how error handling works in your codebase, rather than giving generic best practices. This ability to learn from context at inference time is what makes LLMs so versatile—they can adapt to your specific use case just by seeing examples in the prompt. However, this also means that the quality of the response heavily depends on our ability to find and provide the right context.
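To make this concrete, here is a minimal sketch of how retrieved snippets could be packed into a prompt. The file paths, snippet contents, and template wording are all invented for illustration; the actual prompt templates look different.

```python
# Minimal in-context learning sketch: retrieved snippets are placed in the
# prompt ahead of the user's question so the LLM can ground its answer in
# them. Paths, contents, and wording are invented for illustration.
def build_prompt(question: str, snippets: list[dict]) -> str:
    context_block = "\n\n".join(
        f"File: {s['path']}\n{s['content']}" for s in snippets
    )
    return (
        "Use the following code and docs from the user's repository to answer.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}"
    )

snippets = [
    {"path": "internal/errors/handler.go", "content": "func Handle(err error) { /* wrap and log */ }"},
    {"path": "docs/error-handling.md", "content": "All errors are wrapped with request context..."},
]
print(build_prompt("How does error handling work in our app?", snippets))
```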
That's where the context engine comes in as our specialized search tool within the broader AI assistant architecture. The assistant can leverage this search tool to actively retrieve relevant information when needed, similar to how a human developer might search through documentation or the codebase. While other components handle tasks like planning, reasoning, and code generation, the context engine focuses specifically on finding relevant information from your codebase. In this post, we explain how the context engine works and detail some of the challenges of evaluating it.
Context engine
The context engine is conceptually very simple: given a user's query, find enough relevant contextual snippets of code or text (we call these context items) to provide a high-quality response. When a user asks a question such as "How does our authorization system work?", the context engine should provide snippets that contain relevant information—these might include implementations of authorization middleware, documentation about backend architecture, client handling of unauthorized responses, and so on.
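As a rough mental model, a context item is a snippet plus enough metadata to trace where it came from and how relevant it is. The fields below are an assumption made for this post, not the exact internal schema.

```python
from dataclasses import dataclass

# Illustrative shape of a context item; the fields are an assumption for this
# post rather than the exact schema used internally.
@dataclass
class ContextItem:
    source: str         # which retriever produced it, e.g. "keyword" or "embeddings"
    path: str           # file or document the snippet came from
    content: str        # the snippet text itself
    score: float = 0.0  # relevance score, filled in later by the ranking stage
```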
The LLM receives these snippets as context in the prompt. Assuming the contents contain the relevant information for the query, the LLM can extract it and reply with a correct and helpful response. For example, if we provide context about authorization middleware, the LLM might explain how requests are validated, what happens when authorization fails, and how tokens are processed.
The context engine operates in two stages: retrieval and ranking. This two-stage architecture is a tried-and-tested approach used throughout the industry (examples: Spotify, YouTube, Facebook) in many large-scale information retrieval systems. This pattern is powerful because it combines the strengths of different techniques: in the retrieval stage, we cast a wide net using fast, approximate methods to find as many potentially relevant context items as possible from various sources. The ranking stage then uses more sophisticated (and computationally intensive) techniques to filter down to only the most relevant items that will fit in our token budget. This separation of concerns allows us to optimize each stage independently.
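At a very high level, the two stages compose like the sketch below. The retriever and ranker objects are placeholders whose internals are discussed in the following sections; this is a simplified illustration, not the production pipeline.

```python
# Simplified two-stage pipeline: cast a wide net first, then let the ranker
# narrow the candidates down to what fits in the prompt. All names here are
# placeholders for illustration.
def get_context(query, retrievers, ranker, token_budget):
    # Stage 1 (retrieval): gather candidates from every context source.
    candidates = []
    for retriever in retrievers:
        candidates.extend(retriever.retrieve(query))

    # Stage 2 (ranking): score candidates and keep only the most relevant
    # ones that fit within the token budget.
    return ranker.rank(query, candidates, token_budget)
```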
Something to keep in mind for the remainder of this post: in real life, we're constrained by latency and costs. This means we can only fit so much context inside the prompt before the LLM response becomes too slow, not to mention costly. To make sure the user experience is smooth, we define a latency SLA for context fetching and a maximum token budget for context size and take care not to exceed them.
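One way to picture the latency constraint: run the retrievers concurrently against a deadline and drop anything that misses it for that request. The SLA value below is made up, and this is a sketch rather than the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor, wait

CONTEXT_FETCH_SLA_SECONDS = 1.5  # made-up number, purely for illustration

def fetch_with_deadline(query, retrievers):
    pool = ThreadPoolExecutor(max_workers=len(retrievers))
    futures = [pool.submit(r.retrieve, query) for r in retrievers]
    done, _not_done = wait(futures, timeout=CONTEXT_FETCH_SLA_SECONDS)

    # Don't block on stragglers: retrievers that missed the deadline are
    # dropped for this request and left to finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)

    results = []
    for future in done:
        results.extend(future.result())
    return results
```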
Retrieval
The first stage of the context engine is retrieval: finding as many relevant items as possible. To find items, we need to be able to access many different sources of information, which we call context sources. The number of sources of context we can use in a coding assistant is vast: local and remote code, source control history, code review tools, CI results, editor state, terminal, documentation, chats, internal Wikis, ticketing systems, observability dashboards, etc. These have wildly different properties such as ease of use, frequency of updating, and durability.
At Sourcegraph, we’re especially focused on sources heavily connected to code. Some approaches we’ve tried include:
- Keyword retriever: Sometimes, simple is better. We use Zoekt, our blazingly fast trigram-based search engine that excels at finding exact matches and close variations of keywords in the code.
- Embedding-based retriever: We use specialized code embedding models to convert chunks of code into numerical vectors. This allows us to perform semantic search—finding code that is conceptually similar to the query, even if it doesn't share the exact same keywords (sketched just after this list).
- Graph-based retriever: This approach uses static analysis to build and traverse a dependency graph of code. It helps us understand how different parts of the codebase are connected—for example, finding all the places where a specific function is called, or tracking down where a particular class is implemented.
- Local context: We also look at what's immediately relevant to the developer—their current editor state (open files, cursor position), recent git history, and other local context that might be pertinent to their query.
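To give a flavor of the embedding-based retriever above, here is a bare-bones semantic search over precomputed chunk embeddings. The generic sentence-embedding model and brute-force similarity scan are stand-ins; in practice a specialized code embedding model and a vector index take their place.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Bare-bones semantic retrieval: embed the query and compare it against
# precomputed chunk embeddings via cosine similarity. The model is a generic
# off-the-shelf choice used only for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "func authorize(r *http.Request) error { /* check token */ }",
    "class TokenValidator { verify(token) { /* ... */ } }",
    "README: how to run the test suite locally",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def semantic_search(query: str, k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    # With normalized vectors, the dot product equals cosine similarity.
    scores = chunk_vectors @ query_vector
    return [chunks[i] for i in np.argsort(-scores)[:k]]

print(semantic_search("how are requests authorized?"))
```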
What makes a multi-pronged approach particularly effective is that retrievers are complementary, meaning each retriever tends to surface different types of relevant information. For example, keyword search might find direct references to a function name, while semantic search could surface conceptually related code that uses different terminology. The code graph retriever might identify important dependencies that neither of the other approaches would catch.
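In our system the merging of these complementary result lists happens in the ranking layer described below, but as a generic illustration, reciprocal rank fusion is one well-known, lightweight way to combine ranked lists from different retrievers:

```python
from collections import defaultdict

# Reciprocal rank fusion (RRF): a generic way to merge ranked lists from
# complementary retrievers. Shown for illustration only; it is not
# necessarily what the context engine's ranking layer does.
def reciprocal_rank_fusion(result_lists, k: int = 60):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["auth/middleware.go", "auth/token.go"]
semantic_hits = ["docs/architecture.md", "auth/middleware.go"]
graph_hits = ["client/retry.ts", "auth/token.go"]
print(reciprocal_rank_fusion([keyword_hits, semantic_hits, graph_hits]))
```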
We're also working on extending our retrieval capabilities beyond just code. Modern software development involves lots of different types of documentation—from GitHub issues to internal wikis—and being able to pull context from these sources can provide information that might not be apparent from the code alone.
Ranking
While the retrieval stage focuses on finding as many potentially relevant items as possible (optimizing for recall), the ranking stage has a different goal: filtering down to only the most relevant items (optimizing for precision). This is crucial because we have strict token budget constraints for what we can feed into the LLM. The retrieval stage might surface thousands of context items, whereas we want a much smaller number for the final prompt.
Our ranking approach differs from that of other similar systems. In most retrieval applications, the ranked items are shown directly to users: think of web search results or video suggestions on YouTube. Those systems can therefore collect direct user feedback on the quality of the rankings: clicks, purchases, watch time, etc. In our case, while users can inspect the context items if they choose to, they are not the primary focus; the LLM response is what matters most. This creates an interesting challenge for evaluation and training. Even though users can see which context items were used and potentially provide feedback, most users focus on whether the final response was helpful rather than scrutinizing the individual context items that led to it. This makes it harder to know whether we ranked the context items optimally. We discuss this further in the next section.
A unique aspect of our ranking problem is that the order of the final selected items doesn't matter as much. We're more concerned with picking the right set of items that fit within our token budget. It's more like solving a knapsack problem: how do we select the most valuable items (most relevant context) while staying within our size constraint (token budget)?
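In knapsack terms, each item's value is its relevance score and its weight is its token count. A classic greedy heuristic picks items by score per token until the budget is exhausted; the sketch below illustrates the framing rather than the actual selection logic.

```python
# Greedy knapsack-style selection: value = relevance score, weight = token
# count. Picking by value-per-weight ratio is a classic heuristic, shown here
# only to illustrate the framing.
def select_within_budget(items, token_budget):
    # items: list of (score, token_count, content) tuples
    by_density = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    selected, used = [], 0
    for score, tokens, content in by_density:
        if used + tokens <= token_budget:
            selected.append(content)
            used += tokens
    return selected

items = [
    (0.92, 300, "authorization middleware implementation"),
    (0.88, 1200, "full backend architecture doc"),
    (0.75, 400, "token validation helper"),
]
print(select_within_budget(items, token_budget=1000))
```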
Ranking is often done with heavyweight ML models, and ours is no exception: we use a transformer (encoder) model trained to predict whether a given context item is relevant to the user’s query. This is also known as pointwise ranking and is one of the simplest ranking approaches. Once we have a score for each item, we can simply rank the items by score and select the top N that fit within our budget.
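For illustration, pointwise ranking with an off-the-shelf cross-encoder looks roughly like the snippet below. The public model named here is only a placeholder; the actual ranker is trained on code-specific relevance data.

```python
from sentence_transformers import CrossEncoder

# Pointwise ranking sketch: a transformer encoder scores each (query, item)
# pair independently. The public model below is a placeholder for a ranker
# trained on code-specific relevance data.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does our authorization system work?"
candidates = [
    "func AuthMiddleware(next http.Handler) http.Handler { /* ... */ }",
    "README: how to run the test suite locally",
    "class TokenValidator { verify(token) { /* ... */ } }",
]

scores = model.predict([(query, c) for c in candidates])
for score, candidate in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {candidate}")
```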
The ranking layer serves both as the place where context items from different retrievers are merged and as the final filter, helping ensure that the context we provide to the LLM is as relevant and focused as possible. This is crucial for getting high-quality responses: if we feed in irrelevant context, we're not only wasting our precious token budget, but we might also confuse the LLM and get worse responses.

Evaluation
Evaluating an AI coding assistant is a complex challenge, especially when it comes to the context engine. But that’s not a reason to avoid it! In general, we need to consider two distinct aspects: how well we're selecting context, and the quality of the final LLM responses. These are related but separate concerns - great context selection doesn't guarantee a good response, and sometimes the LLM can provide helpful answers even with imperfect context. Evaluation can thus be component-specific (retrieval + ranking) or end-to-end (chat response).
One of our biggest challenges in evaluation is the lack of ground truth data for relevant context. What does "good" context look like for a given query? While automated and semi-automated methods exist, the highest-quality ground truth comes from manual annotation by experts who understand both the codebase and the user's intent. This is expensive and time-consuming, making it impractical for large-scale evaluation. We've created some small-scale annotated datasets internally to act as a starting point for any evaluation. However, since we also want to limit repetitive human effort, we test against open-source benchmark datasets (although these are often of low quality) and have found early success generating large-scale synthetic datasets.
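Even a small annotated dataset makes component-level evaluation straightforward: standard retrieval metrics such as recall@k can be computed against it. A sketch, with a dataset format invented for illustration:

```python
# Recall@k over a small hand-annotated dataset. The format is invented for
# illustration: each example maps a query to the files judged relevant.
dataset = [
    {"query": "How does our authorization system work?",
     "relevant": {"auth/middleware.go", "docs/auth.md"}},
    {"query": "Where do we configure logging?",
     "relevant": {"internal/log/config.go"}},
]

def recall_at_k(retrieve, k: int) -> float:
    total = 0.0
    for example in dataset:
        retrieved = set(retrieve(example["query"])[:k])
        total += len(retrieved & example["relevant"]) / len(example["relevant"])
    return total / len(dataset)

# `retrieve` is any function that returns a ranked list of file paths, e.g.
# the retrieval stage under test.
```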
One might think of using “online” (real user interaction) data to evaluate our system. It turns out that this is a significant challenge as well! Since users primarily interact with the LLM's responses rather than the context items themselves, it's hard to know if context retrieval is making a difference. We might get feedback that a response was unhelpful, but was that because of poor context retrieval or because the LLM just couldn’t produce a good response? Furthermore, we would need to distinguish between the effects of the retrieval and ranking stages to be able to judge each one separately. These feedback loops are difficult to establish but can provide a lot of value if set up correctly.
We can also define more narrowly scoped checks. The end goal is to ensure we're providing value to developers, which means different things in different situations. For code generation, we can check whether the code compiles and passes tests. For questions about existing code, we can verify that referenced symbols actually exist. For architectural discussions, we might need to ensure responses align with documented design patterns and best practices. These checks can also be implemented as real-time guardrails to prevent users from seeing bad responses!
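As one concrete example of such a check, "referenced symbols actually exist" can be approximated by extracting identifier-like tokens from a response and looking them up with code search. The regex and the `search_codebase` stand-in below are simplifications made for this sketch; in practice a real code search backend would do the lookup.

```python
import re

# Toy guardrail: pull identifier-like tokens wrapped in backticks out of the
# LLM response and check that each one appears in the codebase.
# `search_codebase` is a stand-in for a real code search backend.
def missing_symbols(response: str, search_codebase) -> list[str]:
    symbols = re.findall(r"`([A-Za-z_][A-Za-z0-9_.]*)`", response)
    return [s for s in symbols if not search_codebase(s)]

response = "Requests are validated in `AuthMiddleware`, which calls `TokenValidator.verify`."
# Pretend the codebase contains AuthMiddleware but not TokenValidator.verify.
print(missing_symbols(response, search_codebase=lambda s: s == "AuthMiddleware"))
```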
This multi-faceted approach to evaluation helps us improve both the context engine and the overall AI chat experience, even without perfect ground truth data.
Conclusion
While the challenges of evaluating context retrieval and ranking are significant, solving them is essential for advancing AI-assisted software development. Our goal remains clear: to build tools that genuinely enhance developer productivity by providing relevant, accurate, and contextual assistance.
In this post, we've explored how context engines work as specialized search components within AI coding assistants, breaking down the key stages of retrieval and ranking, and discussing the unique challenges of evaluation. For a deeper technical dive, check out our RecSys ’24 paper on arXiv. We hope sharing these insights helps push the field forward as we collectively work to make AI coding assistants more useful for developers.