Cody Input and Output Token Limits
Learn about Cody's token limits and how to manage them.
For all models, Cody allows up to 4,000 tokens of output, which is approximately 500-600 lines of code. For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:
- The @-mention context is limited to 30,000 tokens (~4,000 lines of code) and can be specified using the @-filename syntax. The user explicitly defines this context, which provides specific information to Cody.
- Conversation context is limited to 15,000 tokens, including user questions, system responses, and automatically retrieved context items. Apart from user questions, Cody generates this context automatically.
All other models are currently capped at 7,000 tokens of shared context between the @-mention context and chat history.
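To make the split concrete, here is a minimal sketch of how a client could model these budgets. The `TokenBudget` type, the `budgetForModel` helper, and the ~4-characters-per-token estimate are illustrative assumptions for this doc, not Cody's actual implementation.

```typescript
// One pool shared by @-mentions and chat history, or two separate pools.
type TokenBudget =
  | { kind: 'shared'; total: number }
  | { kind: 'split'; mention: number; conversation: number };

function budgetForModel(model: string): TokenBudget {
  // Claude 3 / 3.5 Sonnet models track @-mention and conversation context separately.
  if (/claude-3(\.5)?[\s-]sonnet/i.test(model)) {
    return { kind: 'split', mention: 30_000, conversation: 15_000 };
  }
  // All other models share a single 7,000-token pool.
  return { kind: 'shared', total: 7_000 };
}

// Very rough heuristic: ~4 characters per token for English prose and code.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```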
Here's a detailed breakdown of the token limits by model:
| Model | Conversation Context (tokens) | @-mention Context (tokens) | Output (tokens) |
|---|---|---|---|
| gpt-3.5-turbo | 7,000 | shared | 4,000 |
| gpt-4-turbo | 7,000 | shared | 4,000 |
| gpt-4o | 7,000 | shared | 4,000 |
| claude-2.0 | 7,000 | shared | 4,000 |
| claude-2.1 | 7,000 | shared | 4,000 |
| claude-3 Haiku | 7,000 | shared | 4,000 |
| claude-3.5 Haiku | 7,000 | shared | 4,000 |
| claude-3 Sonnet | 15,000 | 30,000 | 4,000 |
| claude-3.5 Sonnet | 15,000 | 30,000 | 4,000 |
| claude-3.5 Sonnet (New) | 15,000 | 30,000 | 4,000 |
| mixtral 8x7B | 7,000 | shared | 4,000 |
| mixtral 8x22B | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Flash | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Pro | 7,000 | shared | 4,000 |
What is a Context Window?
A context window in large language models refers to the maximum number of tokens (words or subwords) the model can process simultaneously. This window determines how much context the model can consider when generating text or code.
Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. Limiting the context window allows the model to operate more efficiently and make predictions in a reasonable amount of time.
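As a rough illustration of what a fixed context window means in practice, a chat client has to drop (or summarize) older content once the window fills up. The sketch below trims the oldest messages first; the `ChatMessage` shape and the ~4-characters-per-token estimate are assumptions for illustration, not how Cody manages its history.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  text: string;
}

// Keep the most recent messages that fit within the model's context window,
// dropping the oldest ones first. Uses the rough ~4 chars/token estimate.
function trimToContextWindow(messages: ChatMessage[], windowTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = Math.ceil(messages[i].text.length / 4);
    if (used + cost > windowTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```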
What is an Output Limit?
Output Limit refers to the maximum number of tokens a large language model can generate in a single response. This limit is typically set to ensure the model's output remains manageable and relevant to the context.
When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could continue.
The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses. It also ensures that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
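Conceptually, the output limit is just a cap on the generation loop. The sketch below assumes a stand-in `predictNextToken` function and a placeholder stop-token id; it is not any particular provider's API.

```typescript
// Stand-in for a real model call that predicts the next token id.
type NextToken = (context: number[]) => number;

const STOP_TOKEN = -1; // placeholder end-of-sequence id

// Generate until the model emits a stop token or the output limit is reached.
function generate(promptTokens: number[], outputLimit: number, predictNextToken: NextToken): number[] {
  const output: number[] = [];
  while (output.length < outputLimit) {
    const next = predictNextToken([...promptTokens, ...output]);
    if (next === STOP_TOKEN) break;
    output.push(next);
  }
  return output;
}
```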
Current Foundation Model Limits
Here is a table with the context window sizes and output limits for each of our supported models.
| Model | Context Window | Output Limit |
|---|---|---|
| gpt-3.5-turbo | 16,385 tokens | 4,096 tokens |
| gpt-4 | 8,192 tokens | 4,096 tokens |
| gpt-4-turbo | 128,000 tokens | 4,096 tokens |
| claude instant | 100,000 tokens | 4,096 tokens |
| claude-2.0 | 100,000 tokens | 4,096 tokens |
| claude-2.1 | 200,000 tokens | 4,096 tokens |
| claude-3 Haiku | 200,000 tokens | 4,096 tokens |
| claude-3.5 Haiku | 200,000 tokens | 4,096 tokens |
| claude-3 Sonnet | 200,000 tokens | 4,096 tokens |
| claude-3 Opus | 200,000 tokens | 4,096 tokens |
| mixtral 8x7B | 32,000 tokens | 4,096 tokens |
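If you want these numbers available programmatically, a plain lookup table mirroring the rows above is enough. The object below is an illustrative snapshot of part of this table with assumed model identifiers; it is not an API Cody exposes.

```typescript
interface ModelLimits {
  contextWindow: number; // maximum input tokens the model accepts
  outputLimit: number;   // maximum tokens the model will generate
}

// Values copied from the table above; the keys are illustrative identifiers.
const MODEL_LIMITS: Record<string, ModelLimits> = {
  'gpt-3.5-turbo':   { contextWindow: 16_385,  outputLimit: 4_096 },
  'gpt-4':           { contextWindow: 8_192,   outputLimit: 4_096 },
  'gpt-4-turbo':     { contextWindow: 128_000, outputLimit: 4_096 },
  'claude-2.1':      { contextWindow: 200_000, outputLimit: 4_096 },
  'claude-3-sonnet': { contextWindow: 200_000, outputLimit: 4_096 },
  'mixtral-8x7b':    { contextWindow: 32_000,  outputLimit: 4_096 },
};
```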
Tradeoffs: Size, Accuracy, Latency, and Cost
So why doesn't Cody use each model's full available context window? We need to weigh a few tradeoffs: context size, retrieval accuracy, latency, and cost.
Context Size
A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG-based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism.
If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not increase response quality. Conversely, some queries require a vast array of documents to synthesize the best response, so increasing the context window would be beneficial. We work to balance these nuances against increased latency and cost tradeoffs for input token lengths.
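One way to picture this tradeoff is as a packing problem: retrieval ranks candidate snippets, and the client greedily packs the highest-ranked ones into whatever token budget is available. The sketch below is a simplified illustration with assumed types and the same rough token estimate as above, not Cody's retrieval pipeline.

```typescript
interface RetrievedSnippet {
  text: string;
  score: number; // retrieval relevance, higher is better
}

// Greedily pack the highest-scoring snippets into the available token budget.
// With precise retrieval, the best snippets fit long before the budget is hit,
// so a larger window adds little; broad queries benefit from a larger budget.
function packContext(snippets: RetrievedSnippet[], budgetTokens: number): RetrievedSnippet[] {
  const ranked = [...snippets].sort((a, b) => b.score - a.score);
  const packed: RetrievedSnippet[] = [];
  let used = 0;
  for (const snippet of ranked) {
    const cost = Math.ceil(snippet.text.length / 4); // ~4 chars per token heuristic
    if (used + cost > budgetTokens) continue;
    packed.push(snippet);
    used += cost;
  }
  return packed;
}
```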
Retrieval Accuracy
Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means it is important to put the relevant information into as few tokens as possible to avoid confusing the underlying LLM.
As foundation models continue to improve, their ability to retrieve information accurately from long contexts is improving as well, making large context windows more viable. We are excited to bring these improvements to Cody.
Latency
With a larger context window, the model must process more data, which can increase latency or response time. End users often experience this as "time to first token": how long they wait before output starts to stream.
In some cases, longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use-case and user-dependent.
Computational Cost
Finally, the cost of processing a prompt scales roughly linearly with the context window size. Cody leverages our expertise in code-based RAG to drive down generation costs while maintaining output quality, providing a high-quality response at a reasonable cost to the user.
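As a back-of-the-envelope illustration of that linear scaling, with made-up per-token prices (not actual provider pricing): doubling the input tokens roughly doubles the input portion of the cost.

```typescript
// Illustrative per-token prices only; real provider pricing varies by model.
const INPUT_PRICE_PER_TOKEN = 3 / 1_000_000;   // e.g. $3 per million input tokens
const OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000; // e.g. $15 per million output tokens

function requestCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_PRICE_PER_TOKEN + outputTokens * OUTPUT_PRICE_PER_TOKEN;
}

// A 7,000-token prompt vs. a 30,000-token prompt, each with a 4,000-token answer:
console.log(requestCost(7_000, 4_000).toFixed(3));  // ≈ 0.081
console.log(requestCost(30_000, 4_000).toFixed(3)); // ≈ 0.150
```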