Configuration Approach for LLM Context Tokens

What is Context?

The context of an LLM (Large Language Model) typically refers to the range of prior information the model can consider when generating text. For an LLM, context serves as its "memory," enabling it to decide its next action based on previously seen or generated content. Context can be a single sentence, a paragraph, a document, or even a collection of multiple documents, depending on the model's architecture and design.

Context is crucial for language models as it helps the model understand the current task and generate coherent and relevant responses based on prior information. For instance, in a conversation, the context may include all exchanges up to that point, allowing the model to generate responses aligned with the conversation's theme and tone.

However, due to computational resource and memory limitations, language models usually have a fixed context length. The context length of mainstream models today ranges from approximately 128K to 1M tokens. As LLMs evolve, supporting longer contexts will become a common trend. Yet, from the perspective of token costs, LLM response quality, and response time in commercial scenarios, more accurate and efficient context management is essential.

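A token is the basic unit of text an LLM processes. Depending on the tokenizer, a token can be a whole word, part of a word, a punctuation mark, or another linguistic unit.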

Context Configuration Approach

Because context is the "amount of information an LLM can process in one go," allocating this limited space is a critical aspect of Agent design. GPTBots divides the LLM context into the following types: Long-term Memory, Short-term Memory, Identity Prompts, User Question, Tools Data, and Knowledge Data. A single LLM interaction may include all of these context types, each with a different priority.

Type	Priority Order	Description
User Question	1	The latest input content from the user during a conversation with the Agent. In "System Recognition" mode, content recognized from uploaded documents is treated as the User Question.
Identity Prompts	2	Identity information set for the Agent LLM, i.e., system message or developer message.
Short-term Memory	3	Information from the last X rounds of conversation, where X can be customized in Agent memory.
Knowledge Data	4	Knowledge data retrieved from the Agent's knowledge base via vector search based on user input.
Tools Data	5	Tools data submitted to the LLM and the returned results from tool calls.
Long-term Memory	6	Historical conversation records retrieved via vector search based on user input.
LLM Output	7	Output result data from the LLM. The system provides options for LLM response token length, and this part is not affected by the input length of the above sections.

Note: If the total length of all context types in a single interaction with the LLM exceeds the LLM's limit, GPTBots will truncate the lowest-priority context types to ensure the success rate of LLM calls.
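
The truncation rule in the note above can be illustrated with a minimal sketch. It is not GPTBots' actual implementation: the `ContextPart` class, the `fit_to_budget` function, and all token counts are hypothetical and only show the idea of trimming from the lowest priority upward until the input fits.

```python
from dataclasses import dataclass


@dataclass
class ContextPart:
    name: str
    priority: int  # 1 = highest priority, trimmed last
    tokens: int    # token count of this part (assumed to be precomputed)


def fit_to_budget(parts: list[ContextPart], input_budget: int) -> dict[str, int]:
    """Return the tokens kept per part after trimming low-priority parts first."""
    kept = {p.name: p.tokens for p in parts}
    overflow = sum(kept.values()) - input_budget
    # Walk from the lowest priority (largest number) to the highest.
    for part in sorted(parts, key=lambda p: p.priority, reverse=True):
        if overflow <= 0:
            break
        cut = min(part.tokens, overflow)
        kept[part.name] -= cut
        overflow -= cut
    return kept


# Hypothetical token counts, for illustration only.
parts = [
    ContextPart("user_question", 1, 500),
    ContextPart("identity_prompts", 2, 1000),
    ContextPart("short_term_memory", 3, 2000),
    ContextPart("knowledge_data", 4, 3000),
    ContextPart("tools_data", 5, 1000),
    ContextPart("long_term_memory", 6, 1500),
]
result = fit_to_budget(parts, input_budget=7000)
print(result)
```

With these numbers the input is 2,000 tokens over budget, so Long-term Memory is dropped entirely and Tools Data loses 500 tokens, while the higher-priority types are untouched.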

User Question

The latest input content from the user during a conversation with the Agent includes a variety of message types sent via the input box:

Manually entered Text Messages
Audio Messages recorded via voice input
Image Messages, Video Messages, Document Messages, Audio Messages, and File Messages uploaded as attachments.

Note:
When the "Agent-Configuration-Input-Attachment" option is set to System Recognition, GPTBots converts all uploaded attachments into text content, which is treated as the User Question.
For File-type messages, GPTBots converts the file into a URL link, which is treated as the User Question.

Identity Prompts

The identity information set for the LLM in GPTBots Agent serves as an important principle guiding the AI Agent's work. GPTBots will automatically adapt this to a system message or developer message depending on the model version.

Identity prompts are crucial in business scenarios and should be crafted clearly and comprehensively to guide the AI's work and responses.
Generally, you don't need to worry about the length of identity prompts. Quality matters more than length, and a well-crafted prompt is worth the extra tokens.
When drafting identity prompts, clear expression, correct logic, and precise instructions are critical. A good identity prompt should clearly articulate your goals, principles, skills, and work logic while avoiding ambiguous instructions.
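
As a rough illustration of how an identity prompt might be adapted to a system message or a developer message, consider the sketch below. The chat-style message dictionary, the function name `build_identity_message`, and the `uses_developer_role` flag are assumptions made for this example; the source does not describe how GPTBots decides which role a given model version needs.

```python
def build_identity_message(identity_prompt: str, uses_developer_role: bool) -> dict[str, str]:
    """Wrap the identity prompt in the role the target model expects.

    `uses_developer_role` is a hypothetical flag; a real system would derive
    it from the selected model version.
    """
    role = "developer" if uses_developer_role else "system"
    return {"role": role, "content": identity_prompt}


prompt = "You are a polite customer service assistant for an online store."
system_msg = build_identity_message(prompt, uses_developer_role=False)
developer_msg = build_identity_message(prompt, uses_developer_role=True)
```

The prompt text itself is identical in both cases; only the role label changes.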

Short-term Memory

Information from the last X rounds of conversation between the user and the Agent is carried with each LLM request. If the short-term memory function is turned off or the conversation has just been created, this part will be empty.

When configuring, consider:

If the Agent's business scenario does not require context or if context negatively impacts the AI's response quality, you can turn off short-term memory to save tokens and improve the Agent's performance.
If the Agent heavily relies on context (e.g., needing earlier turns to answer follow-up questions), you should set the memory round count as high as your token budget allows.
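
The "last X rounds" behaviour can be sketched as a simple slice over the conversation history. The function name `last_rounds` and the representation of a round as a `(user_message, agent_reply)` pair are assumptions for illustration, not the GPTBots data model.

```python
def last_rounds(history: list[tuple[str, str]], max_rounds: int) -> list[tuple[str, str]]:
    """Return the most recent `max_rounds` rounds, oldest first.

    `history` holds (user_message, agent_reply) pairs in chronological order.
    A value of 0 models short-term memory being turned off.
    """
    if max_rounds <= 0:
        return []  # guard needed: history[-0:] would return the whole list
    return history[-max_rounds:]


history = [
    ("Hi", "Hello, how can I help?"),
    ("Where is my order?", "Could you share the order number?"),
    ("It is 1001", "Order 1001 shipped yesterday."),
]
recent = last_rounds(history, max_rounds=2)
disabled = last_rounds(history, max_rounds=0)
new_conversation = last_rounds([], max_rounds=5)
```

A newly created conversation and a disabled memory setting both yield an empty list, matching the two cases in which this part of the context is empty.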

Knowledge Data

Based on the user's input, the Agent retrieves knowledge slices from the corresponding knowledge base via vector search. If the Agent does not have knowledge base content or cannot retrieve results, this part will be empty.

When configuring, consider:

If the Agent does not involve knowledge base queries, you can disable knowledge retrieval to save tokens.
If the Agent heavily relies on knowledge base query results (e.g., document Q&A scenarios), you need to configure the maximum knowledge recall count, relevance score, and other parameters in the "Knowledge Base" to ensure the quantity and quality of RAG (retrieval-augmented generation) results.
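
To show how a maximum recall count and a relevance score threshold interact, here is a minimal sketch of vector retrieval. It assumes cosine similarity as the relevance score and uses tiny hand-written vectors; the names `recall_slices`, `max_recall`, and `min_score` are hypothetical and do not correspond to actual GPTBots parameters.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_slices(
    query_vec: list[float],
    slices: list[tuple[str, list[float]]],
    max_recall: int,
    min_score: float,
) -> list[str]:
    """Return up to `max_recall` slice texts scoring at least `min_score`."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in slices]
    relevant = [item for item in scored if item[0] >= min_score]
    relevant.sort(key=lambda item: item[0], reverse=True)  # best match first
    return [text for _, text in relevant[:max_recall]]


# Toy 2-dimensional embeddings, for illustration only.
slices = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [1.0, 1.0]),
    ("company history", [0.0, 1.0]),
]
top_one = recall_slices([1.0, 0.0], slices, max_recall=1, min_score=0.5)
top_five = recall_slices([1.0, 0.0], slices, max_recall=5, min_score=0.5)
```

Raising the threshold improves the quality of what is recalled, while the recall count caps how many tokens the Knowledge Data can consume.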

Long-term Memory

Based on the user's input, the Agent retrieves historical conversation records from all of its conversations via vector search and includes them in the LLM request. If the long-term memory function is turned off or the conversation has just been created, this part will be empty.

When configuring, consider:

If the Agent's business scenario relies on historical conversation content (e.g., virtual characters), you need to enable the long-term memory function.
If the Agent's scenario does not involve using historical conversation content, you can turn it off to save tokens.

Tools Data

When the system submits request data to the LLM, it will include the Tools data selected by the Agent to help the LLM correctly choose the required Tool for invocation. After successfully invoking the Tool, the results returned by the API service need to be submitted again to the LLM. If the Agent disables the Tools function, this section will remain empty.

When configuring, consider:

If the Agent's scenario does not involve using Tools, you can omit adding Tools for the Agent to save tokens.
When a Tool contains multiple APIs, specific unnecessary APIs can be disabled in the Tool configuration. Disabled APIs will not be submitted to the LLM, thus saving tokens.
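
The effect of disabling individual APIs can be sketched as a filter applied before Tools data is submitted to the LLM. The dictionary layout (`name`, `apis`, `enabled`, `description`) and the function `enabled_tool_specs` are hypothetical and only illustrate the idea; they are not the GPTBots Tool schema.

```python
def enabled_tool_specs(tools: list[dict]) -> list[dict]:
    """Collect only the APIs that are still enabled in each Tool's configuration."""
    specs = []
    for tool in tools:
        for api in tool["apis"]:
            # APIs are treated as enabled unless explicitly disabled.
            if api.get("enabled", True):
                specs.append(
                    {
                        "tool": tool["name"],
                        "api": api["name"],
                        "description": api["description"],
                    }
                )
    return specs


tools = [
    {
        "name": "order_service",
        "apis": [
            {"name": "query_order", "description": "Look up an order by ID."},
            {"name": "cancel_order", "description": "Cancel an order.", "enabled": False},
        ],
    }
]
specs = enabled_tool_specs(tools)
```

Only `query_order` reaches the LLM here, so the description of `cancel_order` costs no tokens.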

LLM Output

The token length of the LLM output is determined and reserved when requesting the LLM. GPTBots supports customizable LLM response token lengths, and this reservation is not affected by the length of the input sections above.
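
One way to read "reserved" is that the output length is subtracted from the model's context limit up front, and the input context types share what remains. The sketch below works through that arithmetic under this assumption; the function name `input_budget` and the figures (an 8K limit taken as 8,192 tokens, a 1,024-token response) are illustrative only.

```python
def input_budget(context_limit: int, reserved_output_tokens: int) -> int:
    """Tokens left for all input context types after reserving the output."""
    if reserved_output_tokens >= context_limit:
        raise ValueError("reserved output must be smaller than the context limit")
    return context_limit - reserved_output_tokens


# Assumed figures: 8,192-token limit, 1,024 tokens reserved for the response.
budget = input_budget(context_limit=8192, reserved_output_tokens=1024)
print(budget)  # 7168
```

Because the reservation is fixed before the request is sent, a longer input cannot shrink the space available for the response; it can only trigger truncation of lower-priority input.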

Case Study

Let’s demonstrate the allocation of context tokens with a specific example. Assume we are using an LLM with a context limit of 8K tokens:

Scenario: Customer Service Assistant

An online customer service assistant Agent needs to:

Remember the user's recent conversation content
Query the product knowledge base
Invoke the order query interface
Maintain a professional customer service image

Token Handling Plan

Context Type	Priority	Description
User Question	1	Reserve enough space to handle potentially long user questions, ensuring the user's input is fully processed as the highest priority
Identity Prompts	2	Includes important guidance such as customer service etiquette and dialogue norms that establish the Agent's role; second in priority only to the user question
Short-term Memory	3	Retain the most recent 3-5 rounds of conversation records, which can be appropriately compressed when resources are limited
Knowledge Data	4	Content from the product knowledge base and FAQs, as an important basis for responses, requires a relatively high priority
Tools Data	5	Information and return results from the order query interface, dynamically adjustable based on actual invocation needs
Long-term Memory	6	Summaries of key information from the current session, which can be prioritized for truncation when necessary

Optimization Suggestions

Dynamic Adjustment
- When the user's query is short, the saved tokens can be automatically allocated to knowledge data.
- When Tools are not used, the reserved space can be automatically allocated to short-term memory.
Priority Execution
- When the total tokens exceed the limit, truncate according to priority:
  - Retain: User Question, Identity Prompts, Short-term Memory, Knowledge Data, Tools Data
  - Compress/Remove: Long-term Memory
Effect Assurance
- Ensure the completeness of core functions (e.g., order query) when tokens are insufficient.
- Long-term memory can be sacrificed to ensure response quality.
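
The two Dynamic Adjustment rules above can be sketched as a small reallocation step. The planned budgets below are hypothetical numbers chosen to fit an 8K-token model after reserving space for the output; the source does not give concrete figures, and `reallocate` is an illustrative function rather than a GPTBots feature.

```python
def reallocate(planned: dict[str, int], used_question: int, tools_in_use: bool) -> dict[str, int]:
    """Shift unused budget according to the two dynamic-adjustment rules."""
    budget = dict(planned)  # copy so the plan itself is not modified
    # Rule 1: tokens a short user question did not need go to knowledge data.
    spare = max(budget["user_question"] - used_question, 0)
    budget["user_question"] -= spare
    budget["knowledge_data"] += spare
    # Rule 2: if no Tools are used, their reserve goes to short-term memory.
    if not tools_in_use:
        budget["short_term_memory"] += budget["tools_data"]
        budget["tools_data"] = 0
    return budget


# Hypothetical plan totalling 7,000 input tokens.
planned = {
    "user_question": 1000,
    "identity_prompts": 1000,
    "short_term_memory": 1500,
    "knowledge_data": 2000,
    "tools_data": 1000,
    "long_term_memory": 500,
}
adjusted = reallocate(planned, used_question=200, tools_in_use=False)
```

The total stays at 7,000 tokens; only the split changes, with knowledge data growing to 2,800 and short-term memory to 2,500 in this example.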

Through such planning, the core functions of the customer service assistant can be realized within the limited token space, while maintaining conversational coherence and professionalism.