AI Memory Is Not Chat Logs—It Is Context Budget Management
Why long conversations need truncation, summarization, and retrieval layers—and how to assemble prompts for production agents.
Early chatbots, RAG assistants, and agents often treat memory as naive persistence: save every turn, reload it next time. That is half right.
In production you immediately hit two constraints:
- Context windows are finite — more history means cost, latency, and failures.
- Not all history deserves the prompt — some turns need verbatim text, some need summaries, some should stay in storage only.
Memory is not “store messages.” It is deciding what enters the current context, when, and in what shape.
For most apps the default is not truncation-only or retrieval-only—it is layered memory: truncation for the short window, summarization for continuity, retrieval for long-range facts.
That framing clears up common mistakes: stuffing full history forever, dismissing summarization as “lossy,” or letting retrieval replace the last few raw turns.
Why every AI app hits a memory wall
LLMs are stateless per request. “Memory” is the app rebuilding context each call:
- Maintain a messages list
- Append new HumanMessage
- Keep AIMessage/ToolMessage from tool loops
- Send the bundle to the model
It works until conversations grow:
- Tokens and cost climb
- Latency rises
- Tool chains blow structural limits
- Stale noise hurts answer quality
Goal: high-quality retention, not maximum retention.
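To make the growth concrete, here is a minimal sketch of the naive loop described above. The `{ role, content }` message shape, the `callModel` stub, and the 4-characters-per-token estimate are all assumptions for illustration; none of them come from a real library.

```javascript
// Hypothetical stand-in for a real model call; it just echoes the last message.
function callModel(messages) {
  const last = messages[messages.length - 1];
  return { role: "ai", content: `echo: ${last.content}` };
}

// Rough token estimate (assumption: about 4 characters per token).
function estimateTokens(messages) {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Naive memory: append every turn and resend the whole list on each call.
function naiveTurn(history, userText) {
  history.push({ role: "human", content: userText });
  const promptTokens = estimateTokens(history); // size of what this call sends
  history.push(callModel(history));
  return promptTokens;
}

const history = [];
const promptSizes = [];
for (let i = 0; i < 5; i++) {
  promptSizes.push(naiveTurn(history, "Here is another long log excerpt ".repeat(20)));
}
console.log(promptSizes); // each entry is larger than the one before
```

Every turn sends the same-sized user input, yet the prompt keeps growing because all earlier turns ride along.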
Storage layer vs strategy layer
Two orthogonal questions:
Where is history stored? (persistence)
- In-process memory — demos
- Files — simple tools
- Redis — live sessions
- SQL — audit, multi-tenant
- Object storage — archives
What goes into this turn’s prompt? (strategy)
- Truncation — recent window only
- Summarization — compress older turns
- Retrieval — fetch relevant past facts
Perfect storage with “concatenate everything” still degrades over time.
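One way to keep the two questions separate in code is to give each its own seam. The sketch below is illustrative only: `InMemoryHistoryStore`, `buildPromptMessages`, and the strategy functions are hypothetical names, and the Map-backed store stands in for Redis or SQL.

```javascript
// Persistence: where history lives. A Redis- or SQL-backed store could expose
// the same two operations (hypothetical interface for illustration).
class InMemoryHistoryStore {
  constructor() {
    this.sessions = new Map();
  }
  append(sessionId, message) {
    const list = this.sessions.get(sessionId) ?? [];
    list.push(message);
    this.sessions.set(sessionId, list);
  }
  load(sessionId) {
    return [...(this.sessions.get(sessionId) ?? [])];
  }
}

// Strategy: which stored messages reach the prompt this turn.
const concatenateEverything = (messages) => messages;
// Message-count window; a real version must also keep tool pairs intact.
const lastN = (n) => (messages) => messages.slice(-n);

function buildPromptMessages(store, sessionId, strategy) {
  return strategy(store.load(sessionId));
}

const store = new InMemoryHistoryStore();
for (let i = 1; i <= 50; i++) {
  store.append("s1", { role: "human", content: `message ${i}` });
}
const everything = buildPromptMessages(store, "s1", concatenateEverything);
const recent = buildPromptMessages(store, "s1", lastN(6));
console.log(everything.length, recent.length); // 50 6
```

The store never loses anything; only the strategy decides what the model sees, so either side can be swapped without touching the other.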
Three strategies
Truncation — cost and stability guardrails
Keep the latest N messages or tokens; drop the rest. It feels blunt, but it is usually the first hard boundary you need.
By message count: simple but uneven. One message may be a single line and the next a large JSON blob, and tool-call/tool-result pairs must stay intact.
By token budget: tokens are what providers bill and what the context window actually counts:
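import { trimMessages } from "@langchain/core/messages";

// Keep the newest messages that fit within the history token budget.
async function buildRecentMessages(messages, tokenCounter) {
  return trimMessages(messages, {
    maxTokens: 3000, // budget for history only
    strategy: "last", // keep the most recent messages
    startOn: "human", // the window must begin on a human turn
    endOn: ["human", "tool"], // and end on a human or tool message
    tokenCounter, // function that counts tokens for a list of messages
  });
}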
Reserve budget for the system prompt, tool definitions, the current question, and model output, not only history. Example: a 32k-token context window might cap history at 8–12k tokens.
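The arithmetic behind that cap can be written down directly. This is a sketch with made-up reserve sizes; `historyBudget` and every number passed to it are assumptions for illustration, and real values should come from measuring your own prompts.

```javascript
// Tokens left for conversation history after fixed costs are reserved.
// Parameter names and defaults are hypothetical.
function historyBudget({
  contextWindow,
  systemPrompt,
  toolSchemas,
  currentInput,
  outputReserve,
  retrievedMemory = 0,
  safetyMargin = 0,
  historyCap = Infinity, // policy cap, independent of what technically fits
}) {
  const reserved =
    systemPrompt + toolSchemas + currentInput + outputReserve + retrievedMemory + safetyMargin;
  const remaining = contextWindow - reserved;
  if (remaining <= 0) {
    throw new Error(`No room for history: ${reserved} tokens reserved of ${contextWindow}`);
  }
  return Math.min(remaining, historyCap);
}

const budget = historyBudget({
  contextWindow: 32000,
  systemPrompt: 1500,
  toolSchemas: 3000,
  currentInput: 2000,
  outputReserve: 4000,
  retrievedMemory: 2000,
  safetyMargin: 1500,
  historyCap: 10000,
});
console.log(budget); // 10000 (18000 would fit, but the policy cap wins)
```

Keeping a cap below what technically fits leaves headroom for unusually large tool results and keeps per-call cost predictable.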
Summarization — continuity without full logs
Truncation alone forgets early facts (tenant, environment, steps already tried). Summaries compress low-frequency, high-value signals:
- User profile and role
- Long-running task goals and constraints
- Confirmed facts and outcomes
- Decisions (“chose A over B”)
Keep recent tool traces and structured results verbatim when precision matters. Summaries are lossy, so trigger summarization on thresholds (for example, when unsummarized history passes a token or turn limit) rather than on every turn.
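A threshold trigger can be as small as the sketch below. The limits, the `{ role, content }` message shape, and both function names are assumptions for illustration, not values the article prescribes.

```javascript
// Summarize only once unsummarized history crosses a limit (illustrative defaults).
function shouldSummarize(
  { unsummarizedTokens, unsummarizedTurns },
  limits = { maxTokens: 6000, maxTurns: 20 },
) {
  return unsummarizedTokens >= limits.maxTokens || unsummarizedTurns >= limits.maxTurns;
}

// Split history into the part to compress and the part to keep verbatim.
// The cut moves backward until the verbatim window starts on a human turn,
// so a tool result is never separated from the turn that produced it.
function splitForSummary(messages, keepRecent) {
  let cut = Math.max(0, messages.length - keepRecent);
  while (cut > 0 && messages[cut].role !== "human") cut--;
  return {
    toSummarize: messages.slice(0, cut),
    keepVerbatim: messages.slice(cut),
  };
}

const sample = [
  { role: "human", content: "Deploy fails on staging." },
  { role: "ai", content: "Which service?" },
  { role: "human", content: "The billing worker." },
  { role: "ai", content: "Fetching logs." },
  { role: "tool", content: "ERROR: upstream timeout" },
  { role: "ai", content: "The logs show a timeout." },
  { role: "human", content: "What next?" },
  { role: "ai", content: "Check the retry config." },
];
const split = splitForSummary(sample, 4);
console.log(split.toSummarize.length, split.keepVerbatim.length); // 2 6
```

Asking for 4 verbatim messages yields 6 here, because the cut backs up to the preceding human turn instead of starting the window on a tool result.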
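import { SystemMessage, HumanMessage } from "@langchain/core/messages";

// Compress older messages into a short memory summary using the given chat model.
export async function summarizeConversation(model, messages) {
  // Flatten to "type: text" lines. Non-string content (e.g. content blocks)
  // is serialized so it does not render as "[object Object]".
  const transcript = messages
    .map((msg) => {
      const text =
        typeof msg.content === "string" ? msg.content : JSON.stringify(msg.content);
      return `${msg.getType()}: ${text}`;
    })
    .join("\n");

  const prompt = [
    new SystemMessage(
      [
        "You compress conversation memory.",
        "Keep only what later turns need:",
        "1. User identity and preferences",
        "2. Confirmed facts",
        "3. Completed actions and results",
        "4. Open tasks and constraints",
        "Drop small talk and reasoning chains.",
      ].join("\n"),
    ),
    new HumanMessage(`Summarize:\n\n${transcript}`),
  ];

  const response = await model.invoke(prompt);
  return response.content;
}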
Aim for memory objects (who, what task, what’s done, what’s open). Structured JSON often beats prose paragraphs in production, since individual fields can be updated and checked without rewriting the whole summary.
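As a sketch of what such a memory object might look like, the block below defines one possible shape and a merge step. The field names (`user`, `task`, `confirmedFacts`, `completedActions`, `openItems`) and the exact-string matching are assumptions for illustration; a real schema would depend on the product.

```javascript
// Hypothetical memory object; field names are illustrative.
const emptyMemory = () => ({
  user: {},
  task: null,
  confirmedFacts: [],
  completedActions: [],
  openItems: [],
});

// Fold a new summary update into the existing memory without losing old fields.
function mergeMemory(previous, update) {
  const union = (a = [], b = []) => [...new Set([...a, ...b])];
  const completedActions = union(previous.completedActions, update.completedActions);
  return {
    user: { ...previous.user, ...update.user },
    task: update.task ?? previous.task,
    confirmedFacts: union(previous.confirmedFacts, update.confirmedFacts),
    completedActions,
    // An item stops being open once it appears as completed (exact string match).
    openItems: union(previous.openItems, update.openItems).filter(
      (item) => !completedActions.includes(item),
    ),
  };
}

let memory = emptyMemory();
memory = mergeMemory(memory, {
  user: { role: "SRE", tenant: "acme" },
  task: "Debug failing deploy",
  confirmedFacts: ["Environment is staging"],
  openItems: ["Restart worker pool"],
});
memory = mergeMemory(memory, {
  completedActions: ["Restart worker pool"],
  confirmedFacts: ["Restart did not fix the error"],
});
console.log(JSON.stringify(memory, null, 2));
```

After the second update the restart moves from open to completed, while the user profile and task carry over untouched.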
Retrieval — long-range, precise recall
Summaries miss detail when the user asks:
What was the API key rotation plan we agreed on last week?
That may not be in the recent window or the rolling summary. Retrieval memory mirrors RAG:
- Embed salient history units
- Store in a vector DB or search index
- Embed the new query
- Inject top hits into the prompt
Do not dump raw chat line-by-line into the index—noise hurts recall. Better units:
- Turn-pair summaries
- Confirmed “fact cards”
- Milestone summaries per task
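The embed, store, query, inject loop can be shown end to end with a toy index. Everything here is a stand-in: `embed` uses plain word counts instead of an embedding model, `FactIndex` is an in-memory array instead of a vector DB, and the fact cards are invented examples.

```javascript
// Toy "embedding": lowercase word counts. A real system would call an
// embedding model; this stand-in only makes the sketch runnable.
function embed(text) {
  const vector = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    vector.set(word, (vector.get(word) ?? 0) + 1);
  }
  return vector;
}

// Cosine similarity between two sparse word-count vectors.
function cosine(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical in-memory index over fact cards.
class FactIndex {
  constructor() {
    this.entries = [];
  }
  add(text) {
    this.entries.push({ text, vector: embed(text) });
  }
  search(query, k = 2) {
    const q = embed(query);
    return this.entries
      .map((entry) => ({ text: entry.text, score: cosine(q, entry.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}

const index = new FactIndex();
index.add("Decision: API key rotation plan is every 90 days, starting with staging.");
index.add("User role: SRE for tenant acme, enterprise tier.");
index.add("Tried clearing the CDN cache; error persisted.");

const hits = index.search("What was the API key rotation plan we agreed on last week?", 1);
console.log(hits[0].text); // the rotation decision card
```

The indexed units are short fact cards rather than raw chat lines, which is why a single hit is enough to answer the question.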
Default recommendation
Truncation + summarization + retrieval
- Truncation — hard budget
- Summarization — thread continuity
- Retrieval — distant facts on demand
Truncation only — short, independent sessions (simple bots, one-shot forms).
Summarization only — risky; loses exact wording and branch state.
Retrieval only — risky; recent turns should stay raw; retrieval is for long range.
Demo shape: internal support agent
Users debug over many turns with logs, configs, and tickets; sessions span days; they say “where did we leave off?”
Suggested layers:
- History store — durable raw messages (Redis/DB)
- Recent window — token-trimmed short memory
- Summary memory — compressed older dialogue
- Retrieval memory — vectors over summaries and fact cards
Prompt assembly order:
System prompt
+ session summary
+ retrieved long-term memories
+ recent raw messages
+ current user input
Put recent raw messages closest to the current question; summaries and retrieval sit as background.
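That ordering translates into a small assembly function. This is a sketch using plain `{ role, content }` objects; `assemblePrompt` and the section labels are hypothetical, and some providers accept only one leading system message, in which case the three background blocks would need to be merged into one.

```javascript
// Build the message list in priority order: background first, recency last.
function assemblePrompt({
  systemPrompt,
  sessionSummary,
  retrievedMemories = [],
  recentMessages = [],
  userInput,
}) {
  const messages = [{ role: "system", content: systemPrompt }];
  if (sessionSummary) {
    messages.push({ role: "system", content: `Session summary:\n${sessionSummary}` });
  }
  if (retrievedMemories.length > 0) {
    const bullets = retrievedMemories.map((memory) => `- ${memory}`).join("\n");
    messages.push({ role: "system", content: `Relevant long-term memory:\n${bullets}` });
  }
  messages.push(...recentMessages); // raw turns, already trimmed to budget
  messages.push({ role: "human", content: userInput }); // current question goes last
  return messages;
}

const prompt = assemblePrompt({
  systemPrompt: "You are an internal support agent.",
  sessionSummary: "User is debugging a failing staging deploy for tenant acme.",
  retrievedMemories: ["Decision: API key rotation plan is every 90 days."],
  recentMessages: [
    { role: "human", content: "The worker still times out." },
    { role: "ai", content: "Let's check the retry config." },
  ],
  userInput: "Where did we leave off?",
});
console.log(prompt.map((m) => m.role).join(" > "));
// system > system > system > human > ai > human
```

Empty layers are skipped rather than sent as blank sections, so a brand-new session collapses to system prompt plus the current input.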
Recent window with js-tiktoken
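import { trimMessages } from "@langchain/core/messages";
import { encodingForModel } from "js-tiktoken";

// Create the tokenizer once and reuse it across calls.
const encoder = encodingForModel("gpt-4o-mini");

// Approximate token count for a list of messages. This counts content only:
// per-message overhead and tool-call arguments on AI messages are not
// included, so leave some headroom in maxTokens.
function countTokens(messages) {
  return messages.reduce((total, message) => {
    const content =
      typeof message.content === "string"
        ? message.content
        : JSON.stringify(message.content);
    return total + encoder.encode(content).length;
  }, 0);
}

// Keep the newest messages that fit in the recent-window budget.
export async function buildRecentWindow(messages) {
  return trimMessages(messages, {
    maxTokens: 4000,
    strategy: "last",
    startOn: "human",
    endOn: ["human", "tool"],
    tokenCounter: countTokens,
  });
}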
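Setting startOn: "human" and endOn: ["human", "tool"] keeps the trimmed window from beginning or ending in the middle of a tool-call exchange. A naive slice(-8) gives no such guarantee: it can cut between an assistant tool call and its tool result, and it ignores how large each message is.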
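The failure mode of a count-based slice is easy to reproduce. The sketch below uses plain objects with hypothetical `toolCalls` and `toolCallId` fields rather than LangChain message classes, and `findOrphanedToolMessages` is an illustrative helper, not a library function.

```javascript
// Return tool messages whose originating AI tool call is not in the window.
function findOrphanedToolMessages(messages) {
  const seenCallIds = new Set();
  const orphans = [];
  for (const message of messages) {
    if (message.role === "ai") {
      for (const call of message.toolCalls ?? []) seenCallIds.add(call.id);
    }
    if (message.role === "tool" && !seenCallIds.has(message.toolCallId)) {
      orphans.push(message);
    }
  }
  return orphans;
}

const conversation = [
  { role: "human", content: "Why is the deploy failing?" },
  { role: "ai", content: "", toolCalls: [{ id: "call_1", name: "fetch_logs" }] },
  { role: "tool", toolCallId: "call_1", content: "ERROR: upstream timeout" },
  { role: "ai", content: "The logs show an upstream timeout." },
  { role: "human", content: "Can you check the config?" },
];

const naiveWindow = conversation.slice(-3); // starts on the tool result
console.log(findOrphanedToolMessages(conversation).length); // 0
console.log(findOrphanedToolMessages(naiveWindow).length); // 1
```

A window that begins with an orphaned tool result is the kind of input many chat APIs reject, which is why the trim should align to turn boundaries instead of a fixed message count.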
Summary
Treat memory as context orchestration, not a single “save chat” API. Layer truncation, summarization, and retrieval; persist raw history separately; assemble prompts with explicit priority. That is what keeps agents usable, affordable, and honest across long-running work—exactly what clients hiring for AI engineering expect to see in production thinking.