Large language models do not “remember” a conversation the way humans do. They work from a context window: the set of tokens the model can consider at one time when generating a response. Tokens are chunks of text, often parts of words, full words, spaces, or punctuation. OpenAI’s documentation explains that models process text as tokens, and that a context window is the total token budget available for inputs, outputs, and in some cases reasoning tokens. (source here)
Tokenizer tool - OpenAI API
For a better understanding regarding how text is translated into tokens, OpenAPI provides a Tokenizer tool that allows you to check with real examples.
URL: https://platform.openai.com/tokenizer
Note: A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words). (source here)

In the above example I tested the Tokenizer tool and we can see that the following “I love test automation.” text is converted into 5 tokens using GPT-5.x
The context window is the AI's working memory
Think of the context window as the model’s short-term workspace. Every message in a chat – system instructions, user prompts, assistant replies, tool outputs, and uploaded text that is included in the prompt – takes up tokens. As a conversation grows, more of that window is consumed.
Anthropic’s Claude documentation describes this as progressive token accumulation: as the conversation advances, user and assistant messages accumulate inside the context window, and context usage grows over time. (source here)
Once the total token count approaches the model’s limit, the application has to decide what to do. The model itself cannot use more context than its maximum window allows.
What happens when the window fills up?
In real AI applications, developers usually manage a full context window in one of three ways:
- Truncation: remove older messages.
- Summarization: compress older conversation history into a shorter summary.
- Retrieval or external memory: store older information elsewhere and bring back only what is relevant.
Microsoft’s Semantic Kernel documentation describes these exact chat-history reduction strategies: older messages can be removed, condensed into a summary, or reduced based on token limits. (source here)
That means the common idea that “the AI creates a summary snapshot and starts a new context” is close, but needs one technical correction: this is usually an application-level strategy, not a guaranteed behavior of every model by itself. The chat product, agent framework, or developer code may summarize the earlier conversation, insert that summary into a new prompt, and continue from there.
The "summary snapshot" pattern
A summary snapshot is a compressed version of the earlier context. Instead of carrying thousands of previous tokens forward, the system asks a model – or another summarization process – to preserve the important facts, decisions, user preferences, open tasks, constraints, and recent state.
The new context may then contain something like:
“Summary so far: The user is writing a technical blog about context windows. They want only information from valid sources. We have established that tokens fill the context window, and summarization is a common context-management strategy.”
That summary becomes a lightweight replacement for the earlier conversation. The model can continue with useful continuity, but the original details may no longer be present unless they were preserved in the summary.
Why this matters
Context compression is powerful, but it is not perfect. A summary can omit nuance, lose exact wording, or preserve a mistaken interpretation. This is why long-running AI agents need careful context engineering: deciding what to keep, what to summarize, what to retrieve, and what to discard.
Anthropic’s engineering writing describes context engineering as the practice of curating and maintaining the right set of tokens during inference, while also noting that long-running agents often need compression and memory mechanisms when conversations exceed standard context limits. (source here)
A simple mental model
A context window is not permanent memory. It is more like a whiteboard.
At the beginning of a task, the whiteboard is mostly empty. As the conversation continues, the board fills with instructions, examples, code, documents, and prior answers. When it gets crowded, the system may erase older sections, rewrite them as a smaller summary, and keep working on a fresh board.
The AI still appears continuous because the summary carries forward the important state. But technically, the original context may have been compressed, trimmed, or replaced.
The takeaway
AI systems have limited working memory measured in tokens. When that memory fills up, modern AI applications often use context-management techniques such as truncation, summarization, or retrieval. A “summary snapshot” is one practical way to preserve continuity while freeing space for new conversation.
The important point is this: the AI does not remember everything forever. It only reasons over what is currently inside the context window – or what the surrounding application chooses to bring back into it.