The Allocation Idea of LLM Context Tokens
Last updated: 2024-02-21


What is the context?

The context of an LLM (Large Language Model) usually refers to the range of prior information the model can take into account when generating text.

For a language model, the context is like its "memory". It can decide the next step based on the content it has seen or generated before. The context can be a single sentence, a paragraph, a document, or even a collection of multiple documents, depending on the architecture and design of the model.

The context is crucial for the language model because it helps the model understand the current task and generate coherent and relevant answers based on previous information. For example, in a conversation, the context may include all exchanges in the conversation so far, so the model can generate answers that match the topic and emotion of the conversation.

However, due to limits on computational resources and memory, language models usually have a fixed context length. For example, some models can only handle about 2K tokens of context. When the input exceeds this limit, the model cannot consider the information beyond it, which can make the generated content less accurate or relevant.

Note: 1K = 1024. A token can be a word, a punctuation mark, or any other language unit.

For example, OpenAI's well-known gpt-3.5-turbo has a context window of 16K tokens, while gpt-4-turbo reaches 128K tokens. claude-2.1, launched by Anthropic, reaches 200K tokens, and glm-4, developed by the Chinese company Zhipu AI, also has a 128K-token context window.

In a nutshell, the length of the LLM's context window is the amount of information it can "handle at once".

By analogy, a highly trained person (a professor, scientist, or entrepreneur, for example) can take in a large amount of information at once and distill what is valuable from it, while someone without that training cannot.

For this reason, context length is one of the important indicators of an LLM's capability.

Configuration Ideas

Since the context is the "amount of information an LLM can handle at one time", how to allocate this limited space becomes the key to Bot design.

GPTBots divides the LLM context into several categories that together make up the complete context. We can set a "proportion" (minimum 5%) for each part, which determines how much of the context that part may occupy.

Note: if the actual length of a category exceeds the space allocated to it, the system truncates the excess. For example, if you set a 1000-token identity prompt for the Bot but allocate only 800 tokens to the identity prompt category, the system keeps only the first 800 tokens of the prompt.
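This truncation rule can be sketched as follows. The list-of-strings representation and the 800-token budget are assumptions for illustration only; the platform tokenizes with the model's own tokenizer.

```python
# Minimal sketch of per-category truncation: if a category's content
# exceeds its token budget, only the first `budget` tokens survive.

def truncate_to_budget(tokens: list[str], budget: int) -> list[str]:
    """Keep only the first `budget` tokens; discard the rest."""
    return tokens[:budget]

identity_prompt = ["tok"] * 1000  # stands in for a 1000-token identity prompt
kept = truncate_to_budget(identity_prompt, budget=800)
print(len(kept))  # 800 -- the last 200 tokens are dropped
```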

When allocating, consider the following details for each category.

Identity Prompt

The identity information set in the Bot configuration.

When configuring, consider:

  • If the identity prompt you set for the Bot is relatively long, make sure to allocate enough space for it; slightly more than the prompt's own length is sufficient.
  • The identity prompt is very important; it is worth writing it in detail.

Knowledge Data

Based on the user's input, knowledge data is retrieved from the Bot's knowledge base via vector retrieval. If the knowledge base is empty, or retrieval returns no results, this part is empty.

When configuring, consider:

  • If the Bot does not involve querying the knowledge base, this part can be adjusted to the minimum to make room for other parts.
  • If the Bot depends heavily on knowledge-base retrieval (such as document Q&A scenarios), estimate this part's proportion from the maximum knowledge recall number and the length of a single slice configured in the "Knowledge Base" settings, together with how the Bot is actually used.
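A rough sizing sketch of that estimate: with a maximum recall of N slices of up to M tokens each, this category needs at least N × M tokens. The recall count, slice length, and window size below are hypothetical values, not platform defaults.

```python
# Hypothetical worst-case sizing for the knowledge-data category:
# it must hold up to max_recall slices of up to slice_tokens tokens each.

def knowledge_budget(max_recall: int, slice_tokens: int) -> int:
    """Worst-case tokens needed for retrieved knowledge slices."""
    return max_recall * slice_tokens

needed = knowledge_budget(max_recall=5, slice_tokens=600)
window = 16 * 1024                    # an assumed 16K-token context window
share = needed / window               # fraction of the window to allocate
print(needed, round(share * 100, 1))  # 3000 tokens, about 18.3% of 16K
```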

Short-term memory

Information from the most recent rounds of dialogue. If this item is turned off, or the dialogue is newly created, this section is empty.

Note: tool-call responses count as part of short-term memory.

When configuring, consider:

  • If the Bot's scenario does not require conversational context, turn this off to free space for other parts.
  • If the responses of the Tools the Bot calls are very long, allocate more space here.

Long-term memory

The complete record of the dialogue from its beginning. If this item is turned off, or the dialogue is newly created, this section is empty.

When configuring, consider:

  • If the Bot's scenario does not need the full context from the beginning of the dialogue to the present, turn this off to free space for other parts.

Tool

The return result of a Tool after the Bot calls it. If the Bot has no Tools, or the conversation calls none, this part is empty.

When configuring, consider:

  • If the Bot's scenario does not involve Tools, you can leave Tools out of the Bot to free space for other parts.
  • If the Bot needs Tools, consider the number of Tools and the maximum possible length of each one's call result, and set this part's proportion accordingly: some Tools return very short results, while others may return very long ones.
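One simple way to reason about this is to budget for the longest response any single Tool is likely to return. The Tool names and token counts below are invented for illustration; they are not platform values.

```python
# Hypothetical worst-case estimate for the Tool category: size it for
# the largest response any single Tool may return.

tool_max_tokens = {
    "weather_lookup": 150,     # short structured reply
    "tracking_query": 300,     # brief logistics status
    "document_search": 2000,   # can return long passages
}

budget = max(tool_max_tokens.values())  # plan for the longest response
print(budget)  # 2000
```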

User Input

The content input by the user during the conversation with the Bot.

When configuring, consider:

  • The Bot's function and positioning determine how long users' inputs are likely to be. Use this as the basis for this part's proportion.

LLM Output

The result the LLM returns after all the parts above are combined and submitted to it as input.

When configuring, consider:

  • The Bot's function and positioning determine how long its final output is likely to be. Use this as the basis for this part's proportion.

Examples

In short, the clearer the Bot's positioning and the better you understand user scenarios, the more sensibly you can allocate the context.

To help you further understand the logic of context allocation, we list a few examples here.

Corporate Q&A

Taking a simple corporate system Q&A as an example, generally speaking, the scenarios involved are single-round dialogues, such as:

    Q: How many days of annual leave does the company provide for employees?
    A: The number of annual leave days varies for different employees. For example, for managers, the annual leave is 10 days per year. For specific information, please refer to the "Company Benefits" document.
    Q: How do I reimburse expenses incurred on business trips?
    A: You can submit a reimbursement application in the company's OA system and provide the corresponding information as required by the system.


The source of these answers is the Bot's knowledge base, and no Tool is used. Both the questions and the answers are very short, and each round of Q&A is independent, with no dependence on earlier context.

Therefore, we can consider configuring the strategy as follows:

  • Identity prompt: 5%
  • Knowledge data: 65%
  • Short-term memory: 0%
  • Long-term memory: 0%
  • Tool: 0%
  • User input: 10%
  • LLM output: 20%
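To make the percentages concrete, here is an illustrative conversion into token counts. The 16K (16 × 1024 = 16384 token) window size is an assumption for the example, not a platform default.

```python
# Convert the allocation percentages above into token counts for an
# assumed 16K-token context window (16 * 1024 = 16384 tokens).

WINDOW = 16 * 1024

allocation = {
    "identity_prompt": 0.05,
    "knowledge_data": 0.65,
    "short_term_memory": 0.00,
    "long_term_memory": 0.00,
    "tool": 0.00,
    "user_input": 0.10,
    "llm_output": 0.20,
}

assert abs(sum(allocation.values()) - 1.0) < 1e-9  # shares must total 100%
tokens = {part: int(WINDOW * share) for part, share in allocation.items()}
print(tokens["knowledge_data"])  # 10649 tokens for retrieved knowledge
```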

Online Customer Service

Take a courier inquiry consultation scenario as an example:

    Q: Has my package been shipped?
    A: Hello, can I have your tracking number, please?
    Q: ABCD123456
    A: Hello, the package with the tracking number ABCD123456 has been shipped.
    Q: About how long will it take to arrive?
    A: It should arrive in about 2 days.


In this scenario, the answers are obtained by calling a Tool. Both questions and answers are very brief. The Q&A depends on understanding the context, but does not require much of it.

Therefore, we can consider configuring the strategy as follows:

  • Identity prompt: 10%
  • Knowledge data: 5%
  • Short-term memory: 25%
  • Long-term memory: 0%
  • Tool: 30%
  • User input: 10%
  • LLM output: 20%

Art Creation

Take a tool that paints with DALLE-3 as an example: the user enters a description of a scene, the Bot first refines the user's description into a more detailed "text-to-image" prompt, then calls DALLE-3 with that prompt, and DALLE-3 returns an image. For example:

    Q: raining, logo, minimalist, flat style, ratio 1:1
    A: To refine the user's narrative, we could interpret it as follows: "A minimalist, flat-style logo that depicts the concept of rain, designed in a 1:1 ratio."

    Now, let's use this refined narrative to guide the 'DALL E 3 by GPTBots' plugin in image production. Here is the image based on your request: a minimalist, flat-style logo that visually represents the concept of rain. The design is in a square format, following a 1:1 ratio, with clean lines, simplified shapes, and minimal color usage. The aesthetic is modern and effectively communicates the rain theme through its visual elements.

    ![image](https://xxxxx/xxx.png)


In this scenario, image generation requires a Tool. The questions may be very short or somewhat longer (some advanced users describe the scene in great detail). The answers will be relatively short (only some text plus the URL of the generated image). Each question is an independent task and does not depend on context.

Therefore, we can consider configuring the strategy as follows:

  • Identity prompt: 5%
  • Knowledge data: 5%
  • Short-term memory: 0%
  • Long-term memory: 0%
  • Tool: 30%
  • User input: 40%
  • LLM output: 20%

Article Creation

Take a simple marketing copywriting scenario as an example:

    Question: You are a copywriter, please write a marketing copy according to the following requirements:
    Copy theme: Heritage and Innovation: The Story of Osmanthus Watch
    Copy content: Combine the historical background of the Osmanthus watch and modern design to write an engaging story
    Answer: Heritage and Innovation: The Story of Osmanthus Watch
    In the long river of time, the Osmanthus watch tells a story about heritage and innovation with its unique charm and exquisite craftsmanship. This story flows slowly from the long river of history and extends to the future.
    The history of the Osmanthus watch can be traced back to a golden autumn of the last century. At that time, the osmanthus was in full bloom, and the air was filled with a faint fragrance. A master watchmaker in our country, on a chance occasion, had a burst of inspiration and decided to design a unique watch inspired by osmanthus. Thus, the Osmanthus watch was born and became a treasure in our country's watchmaking industry.
    ......


In this scenario, neither the knowledge base nor Tools are used. The question can be long or short (depending on the article requirements), and the answer will be relatively long (it is article writing). Each question is an independent task and does not depend on context.

Therefore, we can consider configuring the strategy as follows:

  • Identity prompt: 5%
  • Knowledge data: 5%
  • Short-term memory: 0%
  • Long-term memory: 0%
  • Tool: 0%
  • User input: 30%
  • LLM output: 60%