Cody Input and Output Token Limits
Learn about Cody's token limits and how to manage them.
For all models, Cody allows up to 4,000 tokens of output, which is approximately 500-600 lines of code. For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:
- The @-mention context is limited to 30,000 tokens (~4,000 lines of code) and can be specified using the @-filename syntax. The user explicitly defines this context, which provides specific information to Cody.
- Conversation context is limited to 15,000 tokens, including user questions, system responses, and automatically retrieved context items. Apart from user questions, Cody generates this context automatically.
All other models are currently capped at 7,000 tokens of shared context between the @-mention context and chat history.
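To make the split concrete, here is a minimal sketch of how a client could model these budgets. The `TokenBudget` type, the `budgetForModel` helper, and the ~4-characters-per-token estimate are illustrative assumptions for this doc, not Cody's actual implementation.

```typescript
// One pool shared by @-mentions and chat history, or two separate pools.
type TokenBudget =
  | { kind: 'shared'; total: number }
  | { kind: 'split'; mention: number; conversation: number };

function budgetForModel(model: string): TokenBudget {
  // Claude 3 / 3.5 Sonnet models track @-mention and conversation context separately.
  if (/claude-3(\.5)?[\s-]sonnet/i.test(model)) {
    return { kind: 'split', mention: 30_000, conversation: 15_000 };
  }
  // All other models share a single 7,000-token pool.
  return { kind: 'shared', total: 7_000 };
}

// Very rough heuristic: ~4 characters per token for English prose and code.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```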
Here's a detailed breakdown of the token limits by model:
| Model | Conversation Context (tokens) | @-mention Context (tokens) | Output (tokens) |
|---|---|---|---|
| gpt-3.5-turbo | 7,000 | shared | 4,000 |
| gpt-4-turbo | 7,000 | shared | 4,000 |
| gpt-4o | 7,000 | shared | 4,000 |
| claude-2.0 | 7,000 | shared | 4,000 |
| claude-2.1 | 7,000 | shared | 4,000 |
| claude-3 Haiku | 7,000 | shared | 4,000 |
| claude-3.5 Haiku | 7,000 | shared | 4,000 |
| claude-3 Sonnet | 15,000 | 30,000 | 4,000 |
| claude-3.5 Sonnet | 15,000 | 30,000 | 4,000 |
| claude-3.5 Sonnet (New) | 15,000 | 30,000 | 4,000 |
| mixtral 8x7B | 7,000 | shared | 4,000 |
| mixtral 8x22B | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Flash | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Pro | 7,000 | shared | 4,000 |
What is a Context Window?
A context window in large language models refers to the maximum number of tokens (words or subwords) the model can process simultaneously. This window determines how much context the model can consider when generating text or code.
Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. Limiting the context window allows the model to operate more efficiently and make predictions in a reasonable amount of time.
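As a rough illustration of what a fixed context window means in practice, a chat client has to drop (or summarize) older content once the window fills up. The sketch below trims the oldest messages first; the `ChatMessage` shape and the ~4-characters-per-token estimate are assumptions for illustration, not how Cody manages its history.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  text: string;
}

// Keep the most recent messages that fit within the model's context window,
// dropping the oldest ones first. Uses the rough ~4 chars/token estimate.
function trimToContextWindow(messages: ChatMessage[], windowTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = Math.ceil(messages[i].text.length / 4);
    if (used + cost > windowTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```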
What is an Output Limit?
Output Limit refers to the maximum number of tokens a large language model can generate in a single response. This limit is typically set to ensure the model's output remains manageable and relevant to the context.
When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could continue.
The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses. It also ensures that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
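Conceptually, the output limit is just a cap on the generation loop. The sketch below assumes a stand-in `predictNextToken` function and a placeholder stop-token id; it is not any particular provider's API.

```typescript
// Stand-in for a real model call that predicts the next token id.
type NextToken = (context: number[]) => number;

const STOP_TOKEN = -1; // placeholder end-of-sequence id

// Generate until the model emits a stop token or the output limit is reached.
function generate(promptTokens: number[], outputLimit: number, predictNextToken: NextToken): number[] {
  const output: number[] = [];
  while (output.length < outputLimit) {
    const next = predictNextToken([...promptTokens, ...output]);
    if (next === STOP_TOKEN) break;
    output.push(next);
  }
  return output;
}
```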
Current Foundation Model Limits
Here is a table with the context window sizes and output limits for each of our supported models.
| Model | Context Window | Output Limit |
|---|---|---|
| gpt-3.5-turbo | 16,385 tokens | 4,096 tokens |
| gpt-4 | 8,192 tokens | 4,096 tokens |
| gpt-4-turbo | 128,000 tokens | 4,096 tokens |
| claude instant | 100,000 tokens | 4,096 tokens |
| claude-2.0 | 100,000 tokens | 4,096 tokens |
| claude-2.1 | 200,000 tokens | 4,096 tokens |
| claude-3 Haiku | 200,000 tokens | 4,096 tokens |
| claude-3.5 Haiku | 200,000 tokens | 4,096 tokens |
| claude-3 Sonnet | 200,000 tokens | 4,096 tokens |
| claude-3 Opus | 200,000 tokens | 4,096 tokens |
| mixtral 8x7B | 32,000 tokens | 4,096 tokens |
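If you want these numbers available programmatically, a plain lookup table mirroring the rows above is enough. The object below is an illustrative snapshot of part of this table with assumed model identifiers; it is not an API Cody exposes.

```typescript
interface ModelLimits {
  contextWindow: number; // maximum input tokens the model accepts
  outputLimit: number;   // maximum tokens the model will generate
}

// Values copied from the table above; the keys are illustrative identifiers.
const MODEL_LIMITS: Record<string, ModelLimits> = {
  'gpt-3.5-turbo':   { contextWindow: 16_385,  outputLimit: 4_096 },
  'gpt-4':           { contextWindow: 8_192,   outputLimit: 4_096 },
  'gpt-4-turbo':     { contextWindow: 128_000, outputLimit: 4_096 },
  'claude-2.1':      { contextWindow: 200_000, outputLimit: 4_096 },
  'claude-3-sonnet': { contextWindow: 200_000, outputLimit: 4_096 },
  'mixtral-8x7b':    { contextWindow: 32_000,  outputLimit: 4_096 },
};
```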
Tradeoffs: Size, Accuracy, Latency, and Cost
So why doesn't Cody use each model's full available context window? We need to weigh a few tradeoffs: context size, retrieval accuracy, latency, and cost.
Context Size
A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG-based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism.
If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not increase response quality. Conversely, some queries require a vast array of documents to synthesize the best response, so increasing the context window would be beneficial. We work to balance these nuances against increased latency and cost tradeoffs for input token lengths.
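One way to picture this tradeoff is as a packing problem: retrieval ranks candidate snippets, and the client greedily packs the highest-ranked ones into whatever token budget is available. The sketch below is a simplified illustration with assumed types and the same rough token estimate as above, not Cody's retrieval pipeline.

```typescript
interface RetrievedSnippet {
  text: string;
  score: number; // retrieval relevance, higher is better
}

// Greedily pack the highest-scoring snippets into the available token budget.
// With precise retrieval, the best snippets fit long before the budget is hit,
// so a larger window adds little; broad queries benefit from a larger budget.
function packContext(snippets: RetrievedSnippet[], budgetTokens: number): RetrievedSnippet[] {
  const ranked = [...snippets].sort((a, b) => b.score - a.score);
  const packed: RetrievedSnippet[] = [];
  let used = 0;
  for (const snippet of ranked) {
    const cost = Math.ceil(snippet.text.length / 4); // ~4 chars per token heuristic
    if (used + cost > budgetTokens) continue;
    packed.push(snippet);
    used += cost;
  }
  return packed;
}
```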
Retrieval Accuracy
Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means it is important to put the relevant information into as few tokens as possible to avoid confusing the underlying LLM.
As foundation models continue to improve, their ability to retrieve information accurately from long contexts is improving as well, making large context windows more viable. We are excited to bring these improvements to Cody.
Latency
With a larger context window, the model must process more data, which can increase latency or response time. End users often experience this as "time to first token": how long they wait before output starts to stream.
In some cases, longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use-case and user-dependent.
Computational Cost
Finally, the cost of processing a prompt scales roughly linearly with the context window size. Cody leverages our expertise in code-based RAG to drive down generation costs while maintaining output quality, providing a high-quality response at a reasonable cost to the user.
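As a back-of-the-envelope illustration of that linear scaling, with made-up per-token prices (not actual provider pricing): doubling the input tokens roughly doubles the input portion of the cost.

```typescript
// Illustrative per-token prices only; real provider pricing varies by model.
const INPUT_PRICE_PER_TOKEN = 3 / 1_000_000;   // e.g. $3 per million input tokens
const OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000; // e.g. $15 per million output tokens

function requestCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_PRICE_PER_TOKEN + outputTokens * OUTPUT_PRICE_PER_TOKEN;
}

// A 7,000-token prompt vs. a 30,000-token prompt, each with a 4,000-token answer:
console.log(requestCost(7_000, 4_000).toFixed(3));  // ≈ 0.081
console.log(requestCost(30_000, 4_000).toFixed(3)); // ≈ 0.150
```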