Mingqi Hou

AI Memory Is Not Chat Logs—It Is Context Budget Management

Why long conversations need truncation, summarization, and retrieval layers—and how to assemble prompts for production agents.

Early chatbots, RAG assistants, and agents often treat memory as naive persistence: save every turn, reload it next time. That is half right.

In production you immediately hit:

  1. Context windows are finite — more history means cost, latency, and failures.
  2. Not all history deserves the prompt — some turns need verbatim text, some need summaries, some should stay in storage only.

Memory is not “store messages.” It is deciding what enters the current context, when, and in what shape.

For most apps the default is not truncation-only or retrieval-only—it is layered memory: truncation for the short window, summarization for continuity, retrieval for long-range facts.

That framing clears up common mistakes: stuffing full history forever, dismissing summarization as “lossy,” or letting retrieval replace the last few raw turns.

Why every AI app hits a memory wall

LLMs are stateless per request. “Memory” is the app rebuilding context each call:

  1. Maintain a messages list
  2. Append new HumanMessage
  3. Keep AIMessage / ToolMessage from tool loops
  4. Send the bundle to the model
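The loop above can be sketched without any framework. This is a minimal illustration, not LangChain's API: `fakeModel` and `chatTurn` are hypothetical names, and plain `{ role, content }` objects stand in for HumanMessage and AIMessage.

```javascript
// Hypothetical stand-in for a chat model: it only sees what is passed in.
async function fakeModel(messages) {
  return {
    role: "assistant",
    content: `I received ${messages.length} message(s).`,
  };
}

// The app, not the model, carries state between calls.
async function chatTurn(history, userText, model = fakeModel) {
  // Rebuild the full context for this request.
  const messages = [...history, { role: "user", content: userText }];
  const reply = await model(messages);
  // Return a new history; the caller decides how to persist it.
  return [...messages, reply];
}
```

Because `chatTurn` returns a new array, the caller stays in control of what is persisted and what is sent on the next call.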

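It works until conversations grow: every added turn raises token cost and latency, and eventually the request exceeds the context window and fails.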

Goal: high-quality retention, not maximum retention.

Storage layer vs strategy layer

Two orthogonal questions:

Where is history stored? (persistence)

What goes into this turn’s prompt? (strategy)

Perfect storage with “concatenate everything” still degrades over time.
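One way to keep the two questions separate in code is to give each its own seam. The sketch below is illustrative only: `InMemoryHistoryStore`, `keepEverything`, `keepLast`, and `buildContext` are hypothetical names, and a Map stands in for a real database. `keepLast` is deliberately naive; it only shows that the strategy can change without touching storage.

```javascript
// Storage layer: answers "where is history stored?"
// (an in-memory Map as a stand-in for Redis or a database).
class InMemoryHistoryStore {
  constructor() {
    this.sessions = new Map();
  }
  append(sessionId, message) {
    const list = this.sessions.get(sessionId) ?? [];
    list.push(message);
    this.sessions.set(sessionId, list);
  }
  load(sessionId) {
    // Return a copy so strategies cannot mutate stored history.
    return [...(this.sessions.get(sessionId) ?? [])];
  }
}

// Strategy layer: answers "what goes into this turn's prompt?"
const keepEverything = (history) => history;
const keepLast = (n) => (history) => (n > 0 ? history.slice(-n) : []);

function buildContext(store, sessionId, strategy) {
  return strategy(store.load(sessionId));
}
```

Swapping `keepEverything` for another strategy changes the prompt, while the stored history stays complete.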

Three strategies

Truncation — cost and stability guardrails

Keep the latest N messages or tokens; drop the rest. Feels blunt; it is usually the first hard boundary you need.

By message count — simple but uneven (one line vs a JSON blob; tool pairs must stay intact).
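A count-based trim that respects tool pairs might look like the sketch below. It assumes plain `{ role, content }` messages with roles "user", "assistant", and "tool"; `trimByCount` is a hypothetical helper, not a library function. After taking the last N messages it moves the start forward to the next user message, so the window never opens on an orphaned tool result.

```javascript
function trimByCount(messages, maxMessages) {
  if (maxMessages <= 0) return [];
  let start = Math.max(0, messages.length - maxMessages);
  // Advance to a user message so the window does not begin
  // in the middle of a tool-call / tool-result exchange.
  while (start < messages.length && messages[start].role !== "user") {
    start += 1;
  }
  return messages.slice(start);
}
```

The window can come back shorter than N. That is the trade-off: a smaller but well-formed window instead of a full but broken one.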

By token budget — what models actually bill:

import { trimMessages } from "@langchain/core/messages";

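async function buildRecentMessages(messages, tokenCounter) {
  return trimMessages(messages, {
    maxTokens: 3000, // budget for history only, not the whole prompt
    strategy: "last", // keep the most recent messages
    startOn: "human", // the kept window must begin with a human message
    endOn: ["human", "tool"], // and end on a human or tool message
    tokenCounter, // token-counting function supplied by the caller
  });
}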

Reserve budget for the system prompt, tool definitions, the current question, and the model's output, not only for history. For example, with a 32k-token context window you might cap history at 8–12k tokens.
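The budgeting arithmetic can be made explicit. The numbers below are illustrative assumptions, not measurements, and `historyBudget` is a hypothetical helper: it subtracts everything else that must fit in the window, then applies a separate cap on history.

```javascript
function historyBudget({ contextWindow, reserved, historyCap }) {
  // Sum every non-history consumer of the context window.
  const totalReserved = Object.values(reserved).reduce((sum, n) => sum + n, 0);
  const remaining = Math.max(0, contextWindow - totalReserved);
  // History gets what is left, but never more than its own cap.
  return Math.min(remaining, historyCap);
}

const budget = historyBudget({
  contextWindow: 32000,
  reserved: {
    systemPrompt: 1500,
    toolSchemas: 2500,
    retrievedMemory: 3000,
    currentInput: 2000,
    modelOutput: 4000,
  },
  historyCap: 10000,
});
// 32000 - 13000 = 19000 remaining, capped at 10000
```

Keeping the cap below the remaining space leaves slack for unusually long inputs or tool results.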

Summarization — continuity without full logs

Truncation alone forgets early facts (tenant, environment, steps already tried). Summaries compress these low-frequency, high-value signals so they survive after the raw turns are dropped.

Keep recent tool traces and structured results verbatim when precision matters. Summaries are lossy—trigger on thresholds, not every turn.
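A threshold trigger can be sketched as a small planning function. Everything here is an assumption for illustration: the name `planSummarization`, the default thresholds, and the plain `{ role, content }` message shape. It summarizes only when history exceeds a token threshold, keeps the most recent messages verbatim, and moves the cut back to a user message so no tool exchange is split.

```javascript
function planSummarization(
  messages,
  countTokens,
  { tokenThreshold = 6000, keepRecent = 6 } = {},
) {
  const total = countTokens(messages);
  // Below the threshold, or too short to split: summarize nothing.
  if (total < tokenThreshold || messages.length <= keepRecent) {
    return { summarize: [], keep: messages };
  }
  let cut = messages.length - keepRecent;
  // Move the cut back so the verbatim part starts on a user message.
  while (cut > 0 && messages[cut].role !== "user") cut -= 1;
  return { summarize: messages.slice(0, cut), keep: messages.slice(cut) };
}
```

The `summarize` slice is what you would pass to a summarizer; the `keep` slice goes into the prompt unchanged.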

import { SystemMessage, HumanMessage } from "@langchain/core/messages";

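export async function summarizeConversation(model, messages) {
  // Flatten each message to "type: text". Non-string content (for example,
  // content blocks) is serialized so it does not render as "[object Object]".
  const transcript = messages
    .map((msg) => {
      const text =
        typeof msg.content === "string"
          ? msg.content
          : JSON.stringify(msg.content);
      return `${msg.getType()}: ${text}`;
    })
    .join("\n");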

  const prompt = [
    new SystemMessage(
      [
        "You compress conversation memory.",
        "Keep only what later turns need:",
        "1. User identity and preferences",
        "2. Confirmed facts",
        "3. Completed actions and results",
        "4. Open tasks and constraints",
        "Drop small talk and reasoning chains.",
      ].join("\n"),
    ),
    new HumanMessage(`Summarize:\n\n${transcript}`),
  ];

  return (await model.invoke(prompt)).content;
}

Aim for memory objects (who, what task, what’s done, what’s open)—structured JSON often beats prose paragraphs in production.
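A memory object along those lines could be handled as in the sketch below. The field names (`user`, `facts`, `done`, `open`) and the helpers `parseMemory` and `renderMemory` are hypothetical. The sketch assumes the summarizer was asked to reply in JSON and falls back to the previous memory when the reply does not parse.

```javascript
// Hypothetical memory shape: who, confirmed facts, what is done, what is open.
const EMPTY_MEMORY = { user: {}, facts: [], done: [], open: [] };

const listOr = (value, fallback) => (Array.isArray(value) ? value : fallback);

// Merge the model's JSON reply into the previous memory.
function parseMemory(raw, previous = EMPTY_MEMORY) {
  try {
    const parsed = JSON.parse(raw);
    return {
      user: { ...previous.user, ...(parsed.user ?? {}) },
      facts: listOr(parsed.facts, previous.facts),
      done: listOr(parsed.done, previous.done),
      open: listOr(parsed.open, previous.open),
    };
  } catch {
    // Malformed output: keep what we already had.
    return previous;
  }
}

// Render the object as compact text for the prompt.
function renderMemory(memory) {
  const line = (label, items) =>
    `${label}: ${items.length ? items.join("; ") : "none"}`;
  return [
    `User: ${JSON.stringify(memory.user)}`,
    line("Confirmed facts", memory.facts),
    line("Done", memory.done),
    line("Open", memory.open),
  ].join("\n");
}
```

A structured object is easy to validate, merge, and diff, which is harder to do with a prose paragraph.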

Retrieval — long-range, precise recall

Summaries miss detail when the user asks:

What was the API key rotation plan we agreed on last week?

That may not be in the recent window or the rolling summary. Retrieval memory mirrors RAG:

  1. Embed salient history units
  2. Store in a vector DB or search index
  3. Embed the new query
  4. Inject top hits into the prompt
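The four steps can be shown end to end with a toy index. This is a sketch under loud assumptions: `embed` is a term-frequency stand-in for a real embedding model, `MemoryIndex` is an in-memory stand-in for a vector DB, and the stored texts are invented examples. Only the flow (embed, store, embed query, take top hits) carries over to production.

```javascript
// Toy embedding: sparse term-frequency vector (stand-in for a real model).
function embed(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

function cosine(a, b) {
  let dot = 0;
  for (const [token, count] of a) dot += count * (b.get(token) ?? 0);
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

class MemoryIndex {
  constructor() {
    this.items = [];
  }
  // Steps 1-2: embed a memory unit and store it.
  add(text) {
    this.items.push({ text, vector: embed(text) });
  }
  // Steps 3-4: embed the query and return the top-k hits.
  search(query, k = 2) {
    const q = embed(query);
    return this.items
      .map((item) => ({ text: item.text, score: cosine(q, item.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}

const index = new MemoryIndex();
index.add("Agreed plan: rotate the API key every 90 days");
index.add("User prefers dark mode in the dashboard");
index.add("Staging database was migrated on Tuesday");
const hits = index.search("What was the API key rotation plan?", 1);
```

The returned `hits` would be formatted and injected into the prompt as background, not as conversation turns.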

Do not dump raw chat line-by-line into the index; noise hurts recall. Better units are distilled ones, such as summaries and fact cards.

Default recommendation

Truncation + summarization + retrieval

Truncation only — short, independent sessions (simple bots, one-shot forms).
Summarization only — risky; loses exact wording and branch state.
Retrieval only — risky; recent turns should stay raw; retrieval is for long range.

Demo shape: internal support agent

Users debug over many turns with logs, configs, and tickets; sessions span days; they say “where did we leave off?”

Suggested layers:

  1. History store — durable raw messages (Redis/DB)
  2. Recent window — token-trimmed short memory
  3. Summary memory — compressed older dialogue
  4. Retrieval memory — vectors over summaries and fact cards

Prompt assembly order:

System prompt
+ session summary
+ retrieved long-term memories
+ recent raw messages
+ current user input

Put recent raw messages closest to the current question; summaries and retrieval sit as background.
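The assembly order above can be written as one function. This is a sketch with hypothetical names (`assemblePrompt` and its parameters) and plain `{ role, content }` messages. It assumes the target model accepts more than one system message; if yours does not, concatenate the background sections into a single one.

```javascript
function assemblePrompt({
  systemPrompt,
  summary,
  retrieved = [],
  recentMessages = [],
  userInput,
}) {
  const messages = [{ role: "system", content: systemPrompt }];
  // Background first: summary, then retrieved long-term memories.
  if (summary) {
    messages.push({ role: "system", content: `Session summary:\n${summary}` });
  }
  if (retrieved.length > 0) {
    const bullets = retrieved.map((m) => `- ${m}`).join("\n");
    messages.push({
      role: "system",
      content: `Relevant long-term memory:\n${bullets}`,
    });
  }
  // Recent raw turns sit closest to the current question.
  messages.push(...recentMessages);
  messages.push({ role: "user", content: userInput });
  return messages;
}
```

Empty layers are skipped, so a brand-new session degrades to system prompt plus user input.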

Recent window with tiktoken

import { trimMessages } from "@langchain/core/messages";
import { encodingForModel } from "js-tiktoken";

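// Tokenizer for the target model; change this to the model you actually call.
const encoder = encodingForModel("gpt-4o-mini");

// Approximate count: only message content is encoded. Per-message overhead
// and tool-call arguments are not included, so leave headroom in maxTokens.
function countTokens(messages) {
  return messages.reduce((total, message) => {
    const content =
      typeof message.content === "string"
        ? message.content
        : JSON.stringify(message.content);
    return total + encoder.encode(content).length;
  }, 0);
}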

export async function buildRecentWindow(messages) {
  return trimMessages(messages, {
    maxTokens: 4000,
    strategy: "last",
    startOn: "human",
    endOn: ["human", "tool"],
    tokenCounter: countTokens,
  });
}

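startOn: "human" and endOn: ["human", "tool"] keep the window from starting or ending in the middle of a tool exchange. A naive slice(-8) offers no such guarantee: it can cut between an assistant tool call and its tool result, and it ignores size, so eight long assistant payloads can still exceed the budget.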
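To see the failure concretely, a small check can flag tool results whose originating call was cut off. The message shape (`toolCalls` with `id`, `toolCallId`) and the helper name `findOrphanedToolResults` are assumptions for this sketch, not LangChain's types.

```javascript
// Returns the ids of tool results that have no matching
// assistant tool call earlier in the same window.
function findOrphanedToolResults(messages) {
  const seenCallIds = new Set();
  const orphans = [];
  for (const msg of messages) {
    if (msg.role === "assistant") {
      for (const call of msg.toolCalls ?? []) seenCallIds.add(call.id);
    }
    if (msg.role === "tool" && !seenCallIds.has(msg.toolCallId)) {
      orphans.push(msg.toolCallId);
    }
  }
  return orphans;
}

const history = [
  { role: "user", content: "check the queue depth" },
  { role: "assistant", content: "", toolCalls: [{ id: "call_1" }] },
  { role: "tool", content: "{\"depth\": 42}", toolCallId: "call_1" },
  { role: "assistant", content: "Queue depth is 42." },
  { role: "user", content: "is that normal?" },
];

const naive = findOrphanedToolResults(history.slice(-3)); // ["call_1"]
const intact = findOrphanedToolResults(history); // []
```

A check like this is cheap to run as an assertion before every model call, whichever trimming method produced the window.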

Summary

Treat memory as context orchestration, not a single “save chat” API. Layer truncation, summarization, and retrieval; persist raw history separately; assemble prompts with explicit priority. That is what keeps agents usable, affordable, and honest across long-running work—exactly what clients hiring for AI engineering expect to see in production thinking.