## Understanding LLM Memory

Despite what some people think (even some researchers I’ve met), language models don’t have any memory. The confusion comes, I suppose, from the fact that most people interact with models through some user interface like www.chatgpt.com, which handles the memory for them (incidentally, in some interesting ways as explained below).
Let’s explore this further. Let’s use `transformers_js_py`, which allows us to use language models in the browser.
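The setup is just an installation and an import; the library itself is loaded asynchronously:

```python
# Installation:
# %pip install transformers_js_py
import marimo as mo
import numpy as np
from transformers_js_py import import_transformers_js

transformers = await import_transformers_js()
```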
Let’s download a small model and its tokenizer (it takes a few minutes):
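```python
model_id = "onnx-community/Qwen3-0.6B-ONNX"

AutoTokenizer = transformers.AutoTokenizer
tokenizer = await AutoTokenizer.from_pretrained(model_id)

AutoModelForCausalLM = transformers.AutoModelForCausalLM
with mo.status.spinner(title="Loading...") as _spinner:
    model = await AutoModelForCausalLM.from_pretrained(model_id)
```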
We can ask the question `What's 2 + 2?` and get the following response:
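```python
user_input_0 = mo.ui.text(value="What's 2 + 2?")

messages_0 = [
    {"role": "user", "content": user_input_0.value},
]
prompt_0 = tokenizer.apply_chat_template(
    messages_0, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs_0 = tokenizer(prompt_0)
input_tokens_0 = np.array(inputs_0['input_ids'].tolist(), dtype=np.uint32)
input_len_0 = input_tokens_0.shape[1]
outputs_0 = await model.generate(**inputs_0, max_new_tokens=50, do_sample=False)
tokens_as_uint32_0 = np.array(outputs_0.tolist(), dtype=np.uint32)
assistant_response_0 = tokenizer.decode(
    [int(token) for token in tokens_as_uint32_0[0]][input_len_0:-1]
)
response_0 = mo.md(f"**Assistant response:**\n\n{assistant_response_0}").callout("info")
mo.vstack([user_input_0, response_0])
```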
If we ask the model to add 2 more to the result, we get the following response:
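```python
user_input_1 = mo.ui.text(value="Add to the result 2 more.", full_width=True)

messages_1 = [
    {"role": "user", "content": user_input_1.value},
]
prompt_1 = tokenizer.apply_chat_template(
    messages_1, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs_1 = tokenizer(prompt_1)
input_tokens_1 = np.array(inputs_1['input_ids'].tolist(), dtype=np.uint32)
input_len_1 = input_tokens_1.shape[1]
outputs_1 = await model.generate(**inputs_1, max_new_tokens=50, do_sample=False)
tokens_as_uint32_1 = np.array(outputs_1.tolist(), dtype=np.uint32)
assistant_response_1 = tokenizer.decode(
    [int(token) for token in tokens_as_uint32_1[0]][input_len_1:-1]
)
response_1 = mo.md(f"**Assistant response:**\n\n{assistant_response_1}").callout("info")
mo.vstack([user_input_1, response_1])
```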
The model doesn’t have any recollection of our previous conversation. Now, that’s completely expected, since we haven’t given the model any way to access that information.
## Handling Memory
The simplest way to handle memory is to provide our previous conversation within a list of messages:
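```python
messages_2 = messages_0 + [
    {"role": "assistant", "content": assistant_response_0},
] + messages_1
prompt_2 = tokenizer.apply_chat_template(
    messages_2, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```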
With all these messages we obtain the following response:
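```python
inputs_2 = tokenizer(prompt_2)
input_tokens_2 = np.array(inputs_2['input_ids'].tolist(), dtype=np.uint32)
input_len_2 = input_tokens_2.shape[1]
outputs_2 = await model.generate(**inputs_2, max_new_tokens=50, do_sample=False)
tokens_as_uint32_2 = np.array(outputs_2.tolist(), dtype=np.uint32)
assistant_response_2 = tokenizer.decode(
    [int(token) for token in tokens_as_uint32_2[0]][input_len_2:-1]
)
response_2 = mo.md(f"**Assistant response:**\n\n{assistant_response_2}").callout("info")
mo.vstack([user_input_1, response_2])
```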
That works! The problem with this approach is that we need to be mindful of the model’s context length. We could, for example, keep only the last 10 messages or so, plus the system message if there is one (in this example, we don’t have one).
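A minimal sketch of that sliding-window idea (the `truncate_history` helper below is illustrative, not part of the code above):

```python
def truncate_history(messages, max_messages=10):
    # Keep the system message (if any) plus the most recent messages.
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]


messages_2_trimmed = truncate_history(messages_2, max_messages=10)
```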
More sophisticated approaches could be to store the messages and their responses in a vector database and retrieve the ones most closely related to the new query. Similarly, we could store the information as a graph in a graph database and retrieve the most closely related nodes and edges.
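As a rough sketch of the vector-database idea, where `embed` stands in for any sentence-embedding model and a plain Python list stands in for a real vector database:

```python
import numpy as np

memory_store = []  # list of (embedding, message) pairs


def remember(message, embed):
    # Store the message alongside an embedding of its content.
    memory_store.append((np.asarray(embed(message["content"])), message))


def recall(query, embed, k=3):
    # Return the k stored messages whose embeddings are closest (cosine similarity).
    q = np.asarray(embed(query))
    scores = [
        float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
        for e, _ in memory_store
    ]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [memory_store[i][1] for i in top]
```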
ChatGPT’s memory feature is very interesting because it uses memory as a tool. Let’s take a look at that.
We can define the tools as follows:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "biography",
            "description": "The biography tool allows you to persist user information across conversations. Use this tool and write whatever user information you want to remember. The information will appear in the model set context below in future conversations.",
            "parameters": {
                "properties": {
                    "user_information": {
                        "description": "Information from the user you want to remember across conversations.",
                        "type": "string",
                    }
                },
                "required": ["user_information"],
                "type": "object",
            },
        },
    }
]
```
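If we then tell the model something worth remembering and prefill the prompt with `<tool_call>` to steer it toward calling the tool, we get the following response:

```python
user_input_3 = mo.ui.text(
    value="Hello! In our conversations, remember that I prefer Polars to Pandas.",
    full_width=True,
)

messages = [
    {"role": "user", "content": user_input_3.value},
]
prompt_3 = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools, enable_thinking=False
)
prompt_3 += "<tool_call>"
inputs_3 = tokenizer(prompt_3)
input_tokens_3 = np.array(inputs_3['input_ids'].tolist(), dtype=np.uint32)
input_len_3 = input_tokens_3.shape[1]
outputs_3 = await model.generate(**inputs_3, max_new_tokens=50, do_sample=False)
tokens_as_uint32_3 = np.array(outputs_3.tolist(), dtype=np.uint32)
assistant_response_3 = tokenizer.decode(
    [int(token) for token in tokens_as_uint32_3[0]][input_len_3:-1]
)
assistant_response_3_md = assistant_response_3.replace("<", "&lt;").replace(">", "&gt;")
response_3 = mo.md(f"**Assistant response:**\n\n{assistant_response_3_md}").callout("info")
mo.vstack([user_input_3, response_3])
```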
We can then store the content of the tool call in a text file and provide it in the context of future conversations.
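A minimal sketch of what that could look like, assuming the model emits the usual `{"name": ..., "arguments": ...}` payload inside the `<tool_call>` tags (the file name and the follow-up question are illustrative):

```python
import json

# The prompt was prefilled with "<tool_call>", so the response starts with the
# JSON payload and ends with "</tool_call>".
payload = assistant_response_3.split("</tool_call>")[0]
tool_call = json.loads(payload)
user_information = tool_call["arguments"]["user_information"]

# Persist it across conversations...
with open("model_set_context.txt", "a") as f:
    f.write(user_information + "\n")

# ...and provide it in the context of the next conversation.
with open("model_set_context.txt") as f:
    model_set_context = f.read()

messages_4 = [
    {"role": "system", "content": f"Model set context:\n{model_set_context}"},
    {"role": "user", "content": "Show me how to read a CSV file in Python."},
]
```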
I really like that idea. It’s very simple and clean!
## Source Code
````
---
title: "Understanding LLM Memory"
author: "Alonso Silva"
date: "2025-06-28"
categories:
  - code
  - analysis
format:
  html:
    code-tools: true
    code-fold: true
    code-summary: "Show the code"
filters:
  - marimo-team/marimo
---

## Understanding LLM Memory

Despite what some people think (even some researchers I've met), language models don't have any memory.
The confusion comes, I suppose, from the fact that most people interact with models
through some user interface like [www.chatgpt.com](https://www.chatgpt.com), which handles the memory for them
(incidentally, in some interesting ways as explained below).

Let's explore this further. Let's use [`transformers_js_py`](https://github.com/whitphx/transformers.js.py) which allows us to use language models in the browser.

```python {.marimo}
#| echo: true
# Installation:
# %pip install transformers_js_py
import marimo as mo
import numpy as np
from transformers_js_py import import_transformers_js

transformers = await import_transformers_js()
```

Let's download a small model and its tokenizer (it takes a few minutes):

```python {.marimo}
#| echo: true
model_id = "onnx-community/Qwen3-0.6B-ONNX"
AutoTokenizer = transformers.AutoTokenizer
tokenizer = await AutoTokenizer.from_pretrained(model_id)
```

```python {.marimo}
#| echo: true
AutoModelForCausalLM = transformers.AutoModelForCausalLM
with mo.status.spinner(title="Loading...") as _spinner:
    model = await AutoModelForCausalLM.from_pretrained(model_id)
```

We can ask the question `What's 2 + 2?` and get the following response:

```python {.marimo}
#| echo: true
user_input_0 = mo.ui.text(value="What's 2 + 2?")
```

```python {.marimo}
#| echo: true
messages_0 = [
    {"role": "user", "content": user_input_0.value},
]
prompt_0 = tokenizer.apply_chat_template(
    messages_0, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs_0 = tokenizer(prompt_0)
input_tokens_0 = np.array(inputs_0['input_ids'].tolist(), dtype=np.uint32)
input_len_0 = input_tokens_0.shape[1]
outputs_0 = await model.generate(**inputs_0, max_new_tokens=50, do_sample=False)
tokens_as_uint32_0 = np.array(outputs_0.tolist(), dtype=np.uint32)
assistant_response_0 = tokenizer.decode([int(token) for token in tokens_as_uint32_0[0]][input_len_0:-1])
response_0 = mo.md(f"**Assistant response:**\n\n{assistant_response_0}").callout("info")
mo.vstack([user_input_0, response_0])
```

If we ask the model to add to the result 2 more, we get the following response:

```python {.marimo}
#| echo: true
user_input_1 = mo.ui.text(value="Add to the result 2 more.", full_width=True)
```

```python {.marimo}
#| echo: true
messages_1 = [
    {"role": "user", "content": user_input_1.value},
]
prompt_1 = tokenizer.apply_chat_template(messages_1, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs_1 = tokenizer(prompt_1)
input_tokens_1 = np.array(inputs_1['input_ids'].tolist(), dtype=np.uint32)
input_len_1 = input_tokens_1.shape[1]
outputs_1 = await model.generate(**inputs_1, max_new_tokens=50, do_sample=False)
tokens_as_uint32_1 = np.array(outputs_1.tolist(), dtype=np.uint32)
assistant_response_1 = tokenizer.decode([int(token) for token in tokens_as_uint32_1[0]][input_len_1:-1])
response_1 = mo.md(f"**Assistant response:**\n\n{assistant_response_1}").callout("info")
mo.vstack([user_input_1, response_1])
```

The model doesn't have any recollection of our previous conversation.
Now, that's completely expected since we haven't provided the model any way to access that information.

## Handling Memory

The simplest way to handle memory is to provide our previous conversation within a list of messages:

```python {.marimo}
#| echo: true
messages_2 = messages_0 + [
    {"role": "assistant", "content": assistant_response_0},
] + messages_1
prompt_2 = tokenizer.apply_chat_template(messages_2, tokenize=False, add_generation_prompt=True, enable_thinking=False)
```

With all these messages we obtain the following response:

```python {.marimo}
#| echo: true
inputs_2 = tokenizer(prompt_2)
input_tokens_2 = np.array(inputs_2['input_ids'].tolist(), dtype=np.uint32)
input_len_2 = input_tokens_2.shape[1]
outputs_2 = await model.generate(**inputs_2, max_new_tokens=50, do_sample=False)
tokens_as_uint32_2 = np.array(outputs_2.tolist(), dtype=np.uint32)
assistant_response_2 = tokenizer.decode([int(token) for token in tokens_as_uint32_2[0]][input_len_2:-1])
response_2 = mo.md(f"**Assistant response:**\n\n{assistant_response_2}").callout("info")
mo.vstack([user_input_1, response_2])
```

That works! The problem with that approach is that we need to be mindful of the context length of the model.
We could for example store only the last 10 messages or so and perhaps the system message if there is one (in this example, we don't have one).

More sophisticated approaches could be to store the messages and their responses in a vector database and retrieve the most closely related ones.
Similarly, we could store the information as a graph in a graph database and retrieve the nodes and edges most closely related.

ChatGPT's memory feature is very interesting because it uses memory as a tool.
Let's take a look at that.

We can define the tools as follows:

```
tools = [
    {
        "type": "function",
        "function": {
            "name": "biography",
            "description": "The biography tool allows you to persist user information across conversations. Use this tool and write whatever user information you want to remember. The information will appear in the model set context below in future conversations.",
            "parameters": {
                "properties": {
                    "user_information": {
                        "description": "Information from the user you want to remember across conversations.",
                        "type": "string",
                    }
                },
                "required": ["user_information"],
                "type": "object",
            },
        },
    }
]
```

```python {.marimo}
#| echo: true
user_input_3 = mo.ui.text(value="Hello! In our conversations, remember that I prefer Polars to Pandas.", full_width=True)
```

```python {.marimo}
#| echo: true
messages = [
    {"role": "user", "content": user_input_3.value},
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "biography",
            "description": "The biography tool allows you to persist user information across conversations. Use this tool and write whatever user information you want to remember. The information will appear in the model set context below in future conversations.",
            "parameters": {
                "properties": {
                    "user_information": {
                        "description": "Information from the user you want to remember across conversations.",
                        "type": "string",
                    }
                },
                "required": ["user_information"],
                "type": "object",
            },
        },
    }
]
prompt_3 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, tools=tools, enable_thinking=False)
prompt_3 += "<tool_call>"
inputs_3 = tokenizer(prompt_3)
input_tokens_3 = np.array(inputs_3['input_ids'].tolist(), dtype=np.uint32)
input_len_3 = input_tokens_3.shape[1]
outputs_3 = await model.generate(**inputs_3, max_new_tokens=50, do_sample=False)
tokens_as_uint32_3 = np.array(outputs_3.tolist(), dtype=np.uint32)
assistant_response_3 = tokenizer.decode([int(token) for token in tokens_as_uint32_3[0]][input_len_3:-1])
assistant_response_3_md = assistant_response_3.replace("<", "&lt;").replace(">", "&gt;")
response_3 = mo.md(f"**Assistant response:**\n\n{assistant_response_3_md}").callout("info")
mo.vstack([user_input_3, response_3])
```

We can then store the tool call as a text file and provide it in the context.

I really like that idea. It's very simple and clean!
````