Context Engineering: A Practical Guide for AI Agents (2026)
A practical guide to context engineering for AI agents: the four pillars, how it differs from prompt engineering, and how to build context pipelines that hold up in production.

A practical guide to context engineering for AI agents: the four pillars, how it differs from prompt engineering, and how to build context pipelines that hold up in production.
By the middle of 2025, many experienced AI engineers had figured out that prompt wording was no longer the main bottleneck. The problem they kept hitting was how to feed an agent the right files, the right tool definitions, the right slice of conversation history, and the right retrieved facts at every turn, while keeping the context window from collapsing under its own weight. That problem has a name now: context engineering.
This guide is for the ML engineers, platform engineers, and agent builders who are past the chatbot demo stage and want their agents to hold up in production. We'll define context engineering, contrast it with prompt engineering, lay out the four pillars, walk through a real context pipeline, and look at where coding agents in particular force the discipline to get concrete.
Context engineering is the practice of deliberately designing what a large language model sees on every inference call. That includes the system prompt, the user input, the retrieved documents, the relevant conversation history, the tool definitions, and whatever the AI agent has stored in long-term memory between sessions. Anthropic has a good one-liner for it: "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts." If prompt engineering is about a sentence, context engineering is about the whole pipeline that produces that sentence and everything around it.
The shift matters because AI agents are not chatbots. A chatbot answers a question with whatever context fits in one turn. An AI agent runs in a loop, uses different tools, gathers more state, and tries to make a decision at step 47 with the residue of steps 1 through 46 still in its model's context window. The token budget is finite, the attention budget is finite, and most of the context failures we now see stem from how that budget is spent, not from a bad prompt at the front. Phil Schmid, who helped popularize the term in mid-2025, describes it as designing dynamic systems that give a model the right information and tools, in the right format, at the right time, so it can actually accomplish the task.
Think about a coding agent asked to fix a Kubernetes bug. It usually doesn't fail because the underlying frontier model can't reason. It fails because grep on a million-line monorepo returns 4,000 hits, the agent burns its window reading irrelevant information, and the actual cause never enters the context window. Good context engineering is the work of making sure it does.
The context-engineering vs. prompt-engineering debate is misleading. The two disciplines aren't in opposition; they sit at different layers of the stack. Prompt engineering focuses on how to phrase and structure instructions for the LLM to generate the best results, while effective context engineering is about designing the entire system that feeds the model the right context at the right time, including all the context across multiple interactions.
| Prompt Engineering | Context Engineering | |
|---|---|---|
| Scope | A single instruction string | The full set of tokens at inference time |
| Surface | System prompt + user message | Instructions, retrieved docs, memory, tool defs, history, output schema |
| State | Stateless or single-turn | Stateful, multi-turn, runs for hours |
| Optimization target | Better phrasing, fewer ambiguities | Higher signal-to-noise ratio in the context window |
| Failure mode | Model misunderstands the task | Model has too much, too little, or the wrong information |
| Owner | Anyone writing prompts | Platform team building the agent pipeline |
Most people working on this stuff treat context engineering as the natural progression of prompt engineering rather than a replacement for it. Prompt engineering is still valuable. You still need to know how to write system instructions that don't contradict themselves. But once your agent has tools, memory, and a retrieval layer, the act of writing a good prompt is a tiny fraction of the work. The rest is engineering the context engineering system around it, which mostly comes down to four questions:
The clearest tell that you've crossed from one discipline into the other is whether your improvements come from rewording or from rewiring. If you're swapping nouns and adjectives, you're still doing prompt engineering. If you're changing what data the agent retrieves, in what order, with what re-ranking, and what gets evicted when the context window fills, you're doing context engineering. Prompt engineering is essential for one-off tasks, but context engineering is what matters for complex tasks and agent systems that maintain conversation history and pull in external data across many turns.
Different authors carve the space up slightly differently. The Prompting Guide lists hierarchical layers; Phil Schmid lists seven components; Anthropic walks through system prompts, tools, examples, and message history without enumerating them. A practical way to organize the discipline is around four pillars, each addressing a different question the agent has to answer at every step.
The instruction layer is what the AI model knows about its role, constraints, and output format before it sees the user's first message. The trick is to find a Goldilocks zone between brittle if-else rules and vague hand-waving. You want it specific enough to guide behavior, but flexible enough that the model can still generalize. The most common failure here isn't length; it's a mismatch with the model's training. Detailed instructions that contradict the model's default tool-use heuristics will produce confused behavior on every turn.
Retrieval is how external data enters the context window. That includes classical retrieval-augmented generation (RAG) over a vector database, structured queries against a SQL store, file reads from a filesystem, and increasingly, what Anthropic calls just-in-time retrieval: the system pulls underlying content into context only when it needs it, using lightweight identifiers like file paths or query strings. The output of this pillar is the set of grounded facts the model gets to reason over. Bad retrieval is one of the largest sources of hallucinated or ungrounded answers in production agents.
Agent memory splits into two species. Short-term memory is the conversation history so far, including tool use and tool results. Long-term memory refers to information that persists across sessions, such as user preferences, project conventions, and summaries of past conversations. Production memory systems typically maintain both, with a compaction step that condenses older turns into a summary once the window begins to fill. Anthropic describes a structured note-taking pattern where the model writes its own scratchpad to a file outside the context window as persistent memory and re-reads it when needed.
Available tools are the executable surface area that the agent can call. This is where context engineering gets the most ruthless. Tokens add up fast on the tool layer:
The classic Anthropic observation here, and we've seen this hold up over and over, is that the most common failure mode is "bloated tool sets that cover too much functionality or lead to ambiguous decision points about which tool to use. If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better." The right number of available tools is almost always smaller than what teams ship in their first version, and input parameters should be unambiguous.
Context isn't assembled by hand. It's the output of a pipeline that runs on every turn.
A typical assembly pipeline starts with the user input and user query, runs one or more retrieval steps in parallel (vector search, keyword search, structured lookups, code-graph queries), and merges the results into a candidate set for the current task. The agent's memory layer stores short-term memory for relevant prior turns and persistent notes from previous conversations. System instructions and tool definitions are layered in. The whole context-engineering system then passes the full context to the language model. For multi-agent systems, each subagent may run the same pipeline against a narrower scope before reporting back.
Every token has an opportunity cost. In standard dense attention, cost scales quadratically with sequence length, and even with sparse-attention and flash-attention optimizations, larger context windows still increase latency, spend, and retrieval difficulty. This is the "context rot" problem (well-documented in Chroma's research and one that Anthropic has flagged, too): as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases.
Token budget management is the discipline of cutting low-signal content before it enters the context window, not after. In practice, that looks like:
As context grows, context effectiveness drops if the entire system isn't deliberate about what stays in.
The merge step almost always produces more candidates than the budget allows. A re-ranker, often a smaller cross-encoder or a cheap model, scores each candidate against the user query and keeps only the top-k relevant chunks. Re-ranking is where naive RAG pipelines either start working or stop working. As a rough illustration, a pipeline that retrieves 50 candidates with high recall and re-ranks them down to a precise top-5 is usually better than one that dumps all 50 chunks into the prompt and hopes for the best.
Code is where context engineering gets concrete and measurable. Most general-purpose articles on this topic use calendar examples or Pokémon. Coding agents are a better laboratory because the relevant context is structured, the retrieval problem is well-defined, and failure modes appear in test suites within seconds.
With our Sourcegraph 7.0 release in February 2026, we moved forward with the notion that our code intelligence platform is "the shared intelligence layer for both developers and AI agents." The underlying observation is that AI systems struggle on enterprise codebases for the same reasons humans do: cross-repo dependencies, historical decisions buried in old commits, and architectural patterns that aren't documented anywhere readable. Solving those problems for agents looks a lot like solving them for engineers.
Our MCP server is a concrete answer to the code retrieval pillar. It exposes 13 tools to any MCP-compatible agent (including Claude Code, Cursor, Amp, and Codex). The capabilities cover:
Underneath it is the same code graph that powers our developer products, indexed with SCIP, an open Protobuf-based code intelligence protocol that replaced LSIF and produces compiler-accurate cross-repository navigation.
The reason that matters for context engineering is that when code intelligence is available, SCIP-backed lookups can return definitions and references directly, rather than relying solely on probabilistic text retrieval. When an agent asks for the definition of RecordAccumulator in an indexed repo, it gets the actual definition rather than a top-k of files where the string might appear. Real-world coverage depends on indexing, language support, and permissions, but where structure exists, retrieving it directly turns "here are 50 files that mention the symbol" into "here is the one file where it's defined plus the three call sites." For an agent operating under a strict token budget, that's the difference between burning a window and finishing a task.
Our CodeScaleBench results, published in March 2026, give measured retrieval numbers from running the same agent under two configurations (local grep/file/read vs. Sourcegraph MCP) across 370 enterprise-scale tasks. With MCP, file recall rose from 0.127 to 0.277, Precision@5 from 0.140 to 0.478, and F1@5 from 0.099 to 0.262. The deltas matter less than what they unlocked: several difficult tasks went from timing out on the baseline to completing within benchmark limits. One example from the post is a Kubernetes monorepo task that hit the baseline's two-hour timeout but completed in 89 seconds with MCP, scoring 0.90 out of 1.0. Another is a cross-file refactor that took 96 tool calls and 84 minutes for the baseline but took 5 tool calls and 4.4 minutes with MCP, at double the reward.
Stripe is one public example where Stripe's Minions agents are being "connected to MCP" to "gather context like internal documentation, ticket details, build statuses, code intelligence via Sourcegraph search." The takeaway for anyone building multi-agent systems for code is that the retrieval pillar at production scale is rarely a single vector database. It's a code-aware retrieval layer wired into the agent through a standard protocol, with deterministic structure where structure exists and semantic search where it doesn't.
Context engineering has its own canonical bug list. Recognizing them in your own agent is half the work.
The first thing teams usually try is to dump everything into the LLM's context window, because it feels safe. It isn't. Beyond the obvious cost and latency hit, the model's ability to attend to any single piece degrades as input data grows. We've seen agents perform worse with a 100K-token codebase summary than with a 5K-token targeted retrieval on the same task. Two related failure modes show up here all the time:
Both stem from the same root cause: pushing past context window limitations without curating what's inside. The cure isn't less context, it's better-targeted context, with a re-ranker enforcing a hard cap so the model gets just the right information rather than just what.
Vector indexes go stale when repositories change, and embeddings aren't refreshed, so the chunk you embedded last quarter no longer reflects the function that's running in production. Retrieval systems that don't track freshness will quietly poison the agent's context window with irrelevant data. This is especially nasty for code, where an agent that finds a deprecated API in an old README will confidently call it.
The 2023 Lost in the Middle paper by Liu et al. is now a foundational reference in the field. The headline finding is that model performance "is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models." Practically, that means the order in which you assemble context matters. Put the highest-signal material at the top or the bottom, not buried in the middle of a 30K-token wall of retrieved chunks.
And then there's the boring-but-real cost angle. Every additional retrieval, every additional tool call, every extra round of re-ranking shows up in p95 latency and per-task cost, so it's worth tracking token usage and tool-call counts per task, setting budgets, and alerting when an agent class regularly exceeds them.
The toolchain in 2026 is wider than it was a year ago. Most production stacks pull from four categories.
Weaviate, Pinecone, Qdrant, Milvus, and pgvector are among the more visible managed and open-source options for embedding-based retrieval. The differentiation is mostly operational: hybrid search support, filter performance at scale, and how easily a re-ranker plugs in. For most teams, the bottleneck isn't the database; it's the chunking and embedding strategy that feeds it.
LangChain remains one of the most widely used orchestration frameworks, with LlamaIndex strong on the retrieval side and DSPy gaining ground for teams that want to optimize prompt-and-retrieval pipelines as a unit. None of these are required; many production agents are built directly against model SDKs. The right choice depends on how much of the pipeline your team wants to own.
For coding agents on large or multi-repo codebases, vector search alone is usually insufficient. Code has structure (call graphs, type information, cross-repo references) that text retrieval discards. Sourcegraph's MCP server, backed by SCIP indexing, is one production-grade answer for teams running coding agents on large codebases. If you want a coding agent already wired to Sourcegraph for context, Amp is one example, so teams adopting Amp get the code graph by default. For organizations building multi-agent systems and rolling their own custom tools, the Model Context Protocol is the standardized way to wire any retrieval source to any MCP-compatible client via JSON-RPC 2.0, using natural-language tool descriptions and structured input parameters.
On the memory side, a handful of frameworks have emerged for agents that need to carry state across sessions. mem0 and Letta are two of the more visible ones. Most production teams still build their own thin memory layer over a key-value store, plus a summarization pass, because requirements vary widely across applications.
Prompt engineering taught us how to talk to a single LLM call. Context engineering teaches us how to build the system around it. For any team that's tried to ship an agent past the prototype stage, the discipline isn't optional anymore. The model is no longer the only bottleneck; the pipeline that feeds it matters just as much.
For coding agents specifically, the retrieval pillar runs on code-aware infrastructure. If you're building or evaluating one, the Sourcegraph MCP server is the fastest path to deterministic, cross-repo code context. Schedule a demo to see how it fits into your agent stack.
What is context engineering in AI? Context engineering is the practice of designing the pipeline that assembles, prunes, and orders every token an AI model sees on a given inference call. It covers different aspects of the input: behavioral framing, retrieved data, message history, and tool definitions. Anthropic describes it as managing "the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts."
Is context engineering replacing prompt engineering? No. Context engineering includes prompt engineering as one part of a larger system. Prompt engineering still matters for writing instructions and tool descriptions, but in any agent more complex than a single-turn chatbot, the prompt is one input to a larger context pipeline. The center of gravity has shifted from wording to wiring.
What are the four pillars of context engineering? Instructions (system prompts and behavioral framing), retrieval (RAG and grounded search), memory (short-term conversation and long-term persistent state), and available tools (the function-calling surface, increasingly standardized via MCP).
How is context engineering different from RAG? RAG is one component of the retrieval pillar. Context engineering is the broader practice of deciding what the full context the model sees actually contains, of which retrieval is one input. A team can run a state-of-the-art RAG pipeline and still ship an agent that fails because it has no memory layer, an overloaded tool set, or bad token-budget hygiene.
What is a context engineer? A context engineer is the person (or role on a platform team) responsible for the architecture that feeds AI systems the right context at the right time. The work spans retrieval pipelines, memory systems including long-term memory, tool design, and evaluation; it sits closer to systems engineering than to prompt writing.
Is context engineering the future? For production agentic systems, it's becoming a core engineering discipline. As long as language models have a limited context window, someone has to decide what enters it on every turn. The term context engineering may evolve, but the underlying delicate art of curating model input for effective agents isn't going away.
Is context engineering still relevant when models get bigger context windows? Yes. Larger context windows reduce some pressure, but don't eliminate it. As context size grows, latency and cost grow with it, and the same lost-in-the-middle and context distraction patterns still apply. Even at 2M tokens, you still want the LLM to see just what's useful for the current task.

With Sourcegraph, the code understanding platform for enterprise.
Schedule a demo