Your support agent handles a customer complaint about a delayed shipment. The customer mentions they're preparing for their daughter's birthday party this Saturday. The agent resolves the issue, expedites the package, confirms the new delivery date. Great interaction.
Two days later, the same customer calls back. Different question entirely — they want to add an item to their order. Your agent has no idea who they are. No memory of the birthday. No awareness that there's a time-sensitive delivery in progress. The customer repeats everything. The magic is gone.
This is the gap that separates a chatbot from an agent. Memory — the ability to retain, organize, and retrieve information across interactions — is what transforms a stateless language model into something that genuinely learns about the people it serves. And building it well is harder than it looks.
We'll build a working memory system from scratch in TypeScript, covering every layer: from simple conversation buffers to vector-powered semantic recall. Along the way, we'll explore the cognitive science behind memory types, examine production architectures from MemGPT to Zep, understand where memory and RAG converge, and tackle the privacy constraints that shape what your agent should — and shouldn't — remember.
Prerequisites and Setup
You'll need Node.js 20+, TypeScript, and familiarity with async/await patterns. Some sections reference vector embeddings and similarity search — if those concepts are new, start with RAG from Scratch for the foundations.
npm install openai uuid
npm install -D typescript @types/node

We'll use OpenAI's API for embeddings and text generation. The architectural patterns work with any LLM provider — swap in Anthropic, Ollama, or whatever you prefer.
The code examples build on each other progressively. Each is self-contained enough to run independently, but they're designed to show how simple memory evolves into production-grade systems.
Why Memory Matters: The Stateless Problem
Every LLM call is stateless by default — the model receives a prompt, generates a response, and forgets everything. Without external memory, agents can't learn from past interactions, recognize returning users, or build context over time. This fundamental limitation means even the most capable model starts every conversation from zero.
The numbers make this concrete. Claude's context window holds 200,000 tokens. GPT-4o supports 128,000. Gemini 2.0 Pro reaches 2 million. These sound enormous until you calculate what they actually hold. A typical customer service interaction runs 2,000-4,000 tokens. A user with 100 past conversations has generated 200,000-400,000 tokens of history — already exceeding most context windows, and that's just one user.
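A quick back-of-the-envelope helper makes the budget arithmetic concrete. The 3,000-token session size and 8,000-token reserve for system prompt and response are illustrative assumptions, not measured figures:

```typescript
// Rough token budgeting: how many past sessions fit in a context window?
// tokensPerSession and reservedTokens are illustrative defaults.
function sessionsThatFit(
  contextWindow: number,
  tokensPerSession = 3_000,
  reservedTokens = 8_000
): number {
  const available = contextWindow - reservedTokens;
  return Math.max(0, Math.floor(available / tokensPerSession));
}

console.log(sessionsThatFit(128_000)); // GPT-4o-class window: 40 sessions
console.log(sessionsThatFit(200_000)); // Claude-class window: 64 sessions
```

A power user with a few hundred sessions blows past either budget, which is exactly why retrieval has to be selective.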
Even if you could fit everything in, you wouldn't want to. Research consistently shows that LLM performance degrades in the middle of long contexts — a phenomenon researchers call the "lost in the middle" problem. A model advertising 200K tokens typically becomes unreliable around 130K, with sudden accuracy drops rather than gradual degradation. Stuffing the full history into every prompt isn't just expensive. It actively hurts quality.
Memory systems solve this by acting as an intelligent filter between raw conversation history and the model's context window. Instead of "here's everything that ever happened," memory says "here's what's relevant right now."
The Four Types of Agent Memory
Cognitive science gives us a surprisingly useful framework for thinking about AI agent memory. The human memory system — studied for over a century — maps cleanly onto the challenges agents face. Four types matter most: working memory for the current conversation, episodic memory for specific past events, semantic memory for distilled knowledge, and procedural memory for learned behaviors.
This isn't just an analogy. The December 2025 survey "Memory in the Age of AI Agents" from Tsinghua University and CMU explicitly argues that traditional short-term/long-term taxonomies are insufficient, proposing a function-based taxonomy that mirrors cognitive science categories. Let's break each one down with TypeScript implementations.
Working Memory (Session Context)
Working memory holds the information needed for the current task — the active conversation, recent tool calls, and immediate context. It's fast, bounded, and disposable. When the session ends, working memory can be discarded or consolidated into longer-term storage.
Every chat application you've used implements working memory, even if it doesn't call it that. It's the message history that gets prepended to each LLM call.
Here's the simplest possible implementation — a bounded buffer that keeps the last N messages:
interface Message {
role: 'user' | 'assistant' | 'system';
content: string;
timestamp: Date;
}
class WorkingMemory {
private messages: Message[] = [];
private maxMessages: number;
constructor(maxMessages: number = 20) {
this.maxMessages = maxMessages;
}
add(message: Message): void {
this.messages.push(message);
// Evict oldest messages when buffer is full
if (this.messages.length > this.maxMessages) {
this.messages = this.messages.slice(-this.maxMessages);
}
}
getContext(): Message[] {
return [...this.messages];
}
getTokenEstimate(): number {
// Rough estimate: 1 token ≈ 4 characters
return this.messages.reduce(
(sum, m) => sum + Math.ceil(m.content.length / 4), 0
);
}
clear(): void {
this.messages = [];
}
}

This works for short conversations, but it has an obvious flaw: once a message falls outside the window, it's gone. The customer's birthday mention from message #3 disappears after message #23. That's where the other memory types come in.
Episodic Memory (What Happened)
Episodic memory records specific events with their context — when they happened, who was involved, what the outcome was. Think of it as the agent's autobiography. Stanford's 2023 "Generative Agents" paper demonstrated this powerfully: agents that maintained episodic memory could autonomously organize a Valentine's Day party by recalling who they'd invited, what conversations they'd had, and when the event was scheduled.
The key insight is that episodic memories carry temporal and contextual metadata. It's not just "the customer likes email" — it's "on March 5, 2026, during a billing dispute about invoice #4821, the customer explicitly said they prefer email communication over phone calls."
This implementation stores episodes with rich metadata and retrieves them by recency and relevance:
import { randomUUID } from 'crypto';
interface Episode {
id: string;
userId: string;
sessionId: string;
event: string; // What happened
context: string; // Surrounding circumstances
outcome?: string; // How it resolved
importance: number; // 1-10 scale
timestamp: Date;
tags: string[];
embedding?: number[]; // For semantic search (added later)
}
class EpisodicMemory {
private episodes: Map<string, Episode[]> = new Map();
async store(episode: Omit<Episode, 'id'>): Promise<string> {
const id = randomUUID();
const stored: Episode = { ...episode, id };
const userEpisodes = this.episodes.get(episode.userId) || [];
userEpisodes.push(stored);
this.episodes.set(episode.userId, userEpisodes);
return id;
}
// Retrieve by recency — most recent episodes first
getRecent(userId: string, limit: number = 10): Episode[] {
const episodes = this.episodes.get(userId) || [];
return episodes
.sort((a, b) => b.timestamp.getTime() - a.timestamp.getTime())
.slice(0, limit);
}
// Retrieve by importance — highest importance first
getMostImportant(userId: string, limit: number = 5): Episode[] {
const episodes = this.episodes.get(userId) || [];
return episodes
.sort((a, b) => b.importance - a.importance)
.slice(0, limit);
}
// Combined retrieval: weighted score of recency + importance
getRelevant(
userId: string,
limit: number = 5,
recencyWeight: number = 0.4,
importanceWeight: number = 0.6
): Episode[] {
const episodes = this.episodes.get(userId) || [];
if (episodes.length === 0) return [];
const now = Date.now();
const maxAge = Math.max(
...episodes.map(e => now - e.timestamp.getTime())
);
return episodes
.map(episode => {
const age = now - episode.timestamp.getTime();
const recencyScore = 1 - (age / (maxAge || 1));
const importanceScore = episode.importance / 10;
const score =
recencyWeight * recencyScore +
importanceWeight * importanceScore;
return { episode, score };
})
.sort((a, b) => b.score - a.score)
.slice(0, limit)
.map(({ episode }) => episode);
}
}

The getRelevant method adapts the scoring approach from the Stanford generative agents paper, which combines recency, importance, and relevance to determine which memories surface. Our version uses the first two signals; production systems add the third: relevance to the current query, computed via embedding similarity.
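As a sketch of what folding in that third signal looks like, here's a three-signal scorer with query relevance included. The weights and field names are our own, loosely following the generative-agents formulation:

```typescript
// Three-signal memory scoring: recency, importance, and query relevance,
// each normalized to [0, 1]. The relevance field stands in for a cosine
// similarity between the memory's embedding and the current query's.
interface ScoredMemory {
  text: string;
  importance: number; // 1-10 scale, as in the Episode interface
  ageMs: number;      // how old the memory is
  relevance: number;  // similarity to the current query, 0-1
}

function threeSignalScore(
  m: ScoredMemory,
  maxAgeMs: number,
  weights = { recency: 0.3, importance: 0.3, relevance: 0.4 }
): number {
  const recency = 1 - m.ageMs / (maxAgeMs || 1);
  const importance = m.importance / 10;
  return (
    weights.recency * recency +
    weights.importance * importance +
    weights.relevance * m.relevance
  );
}
```

A fresh, important, highly relevant memory scores near 1.0; an old, unimportant, off-topic one scores near 0.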
Semantic Memory (What the Agent Knows)
Semantic memory stores distilled facts and knowledge — not specific events, but the patterns and preferences extracted from them. While episodic memory says "the customer called about billing on March 5," semantic memory says "this customer frequently has billing questions and prefers email resolution."
The distinction matters because semantic memories are more compact, more generalizable, and more useful for shaping agent behavior. They're the result of consolidation — the process of converting raw experiences into reusable knowledge.
Here's how to extract semantic memories from conversations using an LLM:
import OpenAI from 'openai';
interface SemanticMemory {
id: string;
userId: string;
fact: string; // The distilled knowledge
confidence: number; // 0-1, how certain we are
source: string; // Which episode(s) this came from
category: string; // preference, fact, behavior, relationship
lastAccessed: Date;
accessCount: number;
createdAt: Date;
updatedAt: Date;
}
class SemanticMemoryExtractor {
private openai: OpenAI;
private memories: Map<string, SemanticMemory[]> = new Map();
constructor(apiKey: string) {
this.openai = new OpenAI({ apiKey });
}
async extractFromConversation(
userId: string,
conversation: string,
sessionId: string
): Promise<SemanticMemory[]> {
const existing = this.memories.get(userId) || [];
const existingFacts = existing.map(m => m.fact).join('\n');
const response = await this.openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.1,
response_format: { type: 'json_object' },
messages: [
{
role: 'system',
content: `Extract factual knowledge about the user from this conversation.
Return JSON: { "memories": [{ "fact": "...", "confidence": 0.0-1.0, "category": "preference|fact|behavior|relationship" }] }
Rules:
- Only extract information explicitly stated or strongly implied
- Confidence 0.9+ for direct statements, 0.5-0.8 for inferences
- Skip transient information (current mood, one-time requests)
- If a fact contradicts existing knowledge, include it with the updated information
Existing knowledge about this user:
${existingFacts || 'None yet'}`
},
{ role: 'user', content: conversation }
]
});
const parsed = JSON.parse(
response.choices[0].message.content || '{"memories":[]}'
);
const newMemories: SemanticMemory[] = parsed.memories.map(
(m: { fact: string; confidence: number; category: string }) => ({
id: randomUUID(),
userId,
fact: m.fact,
confidence: m.confidence,
source: sessionId,
category: m.category,
lastAccessed: new Date(),
accessCount: 0,
createdAt: new Date(),
updatedAt: new Date(),
})
);
// Merge with existing — update if contradicting, add if new
this.mergeMemories(userId, newMemories);
return newMemories;
}
private mergeMemories(
userId: string,
newMemories: SemanticMemory[]
): void {
const existing = this.memories.get(userId) || [];
for (const newMem of newMemories) {
const conflict = existing.findIndex(
e => e.category === newMem.category &&
this.isContradiction(e.fact, newMem.fact)
);
if (conflict >= 0 && newMem.confidence > existing[conflict].confidence) {
// Replace lower-confidence memory with higher-confidence one
existing[conflict] = { ...newMem, updatedAt: new Date() };
} else if (conflict < 0) {
existing.push(newMem);
}
}
this.memories.set(userId, existing);
}
private isContradiction(a: string, b: string): boolean {
// Simplified — production systems use embedding similarity
// to detect semantic overlap, then LLM to judge contradiction
const normalize = (s: string) => s.toLowerCase().trim();
return normalize(a).includes(normalize(b).split(' ')[0]);
}
getMemories(
userId: string,
category?: string
): SemanticMemory[] {
const all = this.memories.get(userId) || [];
if (category) {
return all.filter(m => m.category === category);
}
return all;
}
}

Notice the merge logic: when new information contradicts existing memories, the higher-confidence version wins. This prevents the classic problem where an outdated preference overrides a recent correction ("I actually moved to Portland last month — please stop sending things to Seattle").
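The production-grade contradiction check mentioned in the isContradiction comment can be sketched as a two-stage pipeline: first gate on embedding similarity (do the two facts even talk about the same thing?), then ask a model to judge actual contradiction. The threshold and function names below are illustrative; stage two is shown only as a comment since it needs a model call:

```typescript
// Stage 1 of a contradiction check: only facts that are semantically
// close can contradict each other. cosine() is a toy stand-in for
// comparing real embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mightContradict(
  embA: number[],
  embB: number[],
  threshold = 0.8
): boolean {
  // Stage 2 (not shown): for pairs that pass this gate, send both facts
  // to an LLM with a "do these statements contradict?" prompt and parse
  // a yes/no answer before replacing the older memory.
  return cosine(embA, embB) >= threshold;
}
```

The gate keeps LLM calls cheap: most fact pairs are unrelated and never reach stage two.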
Procedural Memory (How to Do Things)
Procedural memory captures learned processes and strategies — not what happened or what's true, but how to accomplish tasks. In cognitive science, this is the memory type that lets you ride a bike without thinking about it. For AI agents, it's the memory that captures successful problem-solving patterns.
Recent research like "Remember Me, Refine Me" (2025) demonstrates agents that evolve their procedures based on experience. An agent that has successfully resolved 50 billing disputes develops a procedural memory for the optimal resolution flow — check account status, verify charge, offer appropriate resolution based on customer tier.
Here's a practical implementation that records and retrieves successful action sequences:
interface Procedure {
id: string;
name: string;
description: string;
steps: ProcedureStep[];
successRate: number;
timesUsed: number;
context: string; // When to apply this procedure
lastUsed: Date;
createdAt: Date;
}
interface ProcedureStep {
action: string;
parameters?: Record<string, unknown>;
expectedOutcome: string;
fallback?: string; // What to do if this step fails
}
class ProceduralMemory {
private procedures: Procedure[] = [];
// Record a successful action sequence as a procedure
recordProcedure(
name: string,
description: string,
steps: ProcedureStep[],
context: string
): Procedure {
const existing = this.procedures.find(p => p.name === name);
if (existing) {
// Reinforce existing procedure
existing.timesUsed++;
existing.successRate =
(existing.successRate * (existing.timesUsed - 1) + 1) /
existing.timesUsed;
existing.lastUsed = new Date();
return existing;
}
const procedure: Procedure = {
id: randomUUID(),
name,
description,
steps,
successRate: 1.0,
timesUsed: 1,
context,
lastUsed: new Date(),
createdAt: new Date(),
};
this.procedures.push(procedure);
return procedure;
}
// Record a failure to adjust success rate
recordFailure(procedureId: string): void {
const proc = this.procedures.find(p => p.id === procedureId);
if (proc) {
proc.timesUsed++;
proc.successRate =
(proc.successRate * (proc.timesUsed - 1)) / proc.timesUsed;
}
}
// Find the best procedure for a given context
findProcedure(context: string): Procedure | null {
// Simple keyword matching — production uses embedding similarity
const candidates = this.procedures.filter(p =>
context.toLowerCase().includes(p.context.toLowerCase()) ||
p.context.toLowerCase().includes(context.toLowerCase())
);
if (candidates.length === 0) return null;
// Prefer high success rate, then recency (normalized by age in days)
const now = Date.now();
const score = (p: Procedure) => {
const ageDays = (now - p.lastUsed.getTime()) / 86_400_000;
const recency = 1 / (1 + ageDays); // 1 when just used, approaching 0 with age
return p.successRate * 0.7 + recency * 0.3;
};
return candidates.sort((a, b) => score(b) - score(a))[0];
}
// Format procedure as instructions for the LLM
toPromptInstructions(procedure: Procedure): string {
const steps = procedure.steps
.map((s, i) => `${i + 1}. ${s.action}${
s.fallback ? ` (if this fails: ${s.fallback})` : ''
}`)
.join('\n');
return `Recommended approach (${Math.round(procedure.successRate * 100)}% success rate, used ${procedure.timesUsed} times):
${procedure.description}
Steps:
${steps}`;
}
}

Procedural memory is the least commonly implemented of the four types, but it's arguably the most powerful for agents that handle repeated workflows. Instead of figuring out the billing dispute resolution process from scratch every time, the agent recalls: "Last 47 times this happened, here's what worked."
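The success-rate bookkeeping in recordProcedure and recordFailure is a running average. Pulled out as a standalone function (the name is ours), the update rule looks like this:

```typescript
// Running-average update for a procedure's success rate: with prior
// rate r over n uses, a success yields (r*n + 1)/(n + 1), a failure
// (r*n)/(n + 1) — the same arithmetic as recordProcedure/recordFailure.
function updateSuccessRate(
  rate: number,
  timesUsed: number,
  succeeded: boolean
): { rate: number; timesUsed: number } {
  const n = timesUsed + 1;
  const newRate = (rate * timesUsed + (succeeded ? 1 : 0)) / n;
  return { rate: newRate, timesUsed: n };
}

// One success then one failure: the rate drops from 1.0 to 0.5.
let s = { rate: 1.0, timesUsed: 1 };
s = updateSuccessRate(s.rate, s.timesUsed, false);
console.log(s.rate); // 0.5
```

Because every outcome is weighted equally, old evidence never expires; a production variant might use an exponentially weighted average so recent outcomes count more.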
Memory Architectures: From Simple to Production
Now that we understand the memory types, how do you actually structure a memory system? Three architectures dominate production systems, each with different tradeoffs around complexity, cost, and retrieval quality. Most production deployments combine multiple approaches.
Buffer Memory
Buffer memory is the simplest architecture — a sliding window of recent messages passed directly as context to the LLM. No retrieval, no embeddings, no external storage. You already saw this in the working memory implementation above.
It works well for short, focused interactions. The problem surfaces when conversations get long or span multiple sessions: the oldest context silently disappears as new messages push it out of the buffer.
A common refinement is the windowed buffer with token awareness:
class TokenAwareBuffer {
private messages: Message[] = [];
private maxTokens: number;
constructor(maxTokens: number = 4000) {
this.maxTokens = maxTokens;
}
add(message: Message): void {
this.messages.push(message);
this.trim();
}
private trim(): void {
let totalTokens = this.estimateTokens(this.messages);
while (totalTokens > this.maxTokens && this.messages.length > 1) {
this.messages.shift();
totalTokens = this.estimateTokens(this.messages);
}
}
private estimateTokens(msgs: Message[]): number {
return msgs.reduce(
(sum, m) => sum + Math.ceil(m.content.length / 4) + 4, // +4 for role tokens
0
);
}
getMessages(): Message[] {
return [...this.messages];
}
}

When to use buffer memory: Prototyping, single-session interactions, and as the working memory layer within a larger system. Don't use it alone if your agent needs to remember anything between sessions.
Summary Memory
Summary memory addresses the buffer's main weakness by compressing old messages into summaries before discarding them. Instead of losing information entirely, the system condenses it into a shorter representation that captures the essential points.
The idea is straightforward: when the buffer fills up, summarize the oldest messages, replace them with the summary, and continue. The LLM sees a compressed version of history plus the recent full messages.
Here's how to build one that progressively summarizes as conversations grow:
class SummaryMemory {
private recentMessages: Message[] = [];
private summary: string = '';
private maxRecentMessages: number;
private openai: OpenAI;
constructor(apiKey: string, maxRecentMessages: number = 10) {
this.openai = new OpenAI({ apiKey });
this.maxRecentMessages = maxRecentMessages;
}
async add(message: Message): Promise<void> {
this.recentMessages.push(message);
if (this.recentMessages.length > this.maxRecentMessages) {
// Take the oldest messages and summarize them
const toSummarize = this.recentMessages.splice(
0,
Math.floor(this.maxRecentMessages / 2)
);
await this.updateSummary(toSummarize);
}
}
private async updateSummary(messages: Message[]): Promise<void> {
const conversation = messages
.map(m => `${m.role}: ${m.content}`)
.join('\n');
const response = await this.openai.chat.completions.create({
model: 'gpt-4o-mini', // Cheaper model for summarization
temperature: 0,
messages: [
{
role: 'system',
content: `Progressively summarize the conversation, adding to the existing summary.
Include: key facts, user preferences, unresolved issues, action items, and any commitments made.
Be concise but don't drop important details.`
},
{
role: 'user',
content: `Existing summary:\n${this.summary || '(none yet)'}\n\nNew messages:\n${conversation}`
}
]
});
this.summary = response.choices[0].message.content || this.summary;
}
getContext(): { summary: string; recentMessages: Message[] } {
return {
summary: this.summary,
recentMessages: [...this.recentMessages],
};
}
// Format for injection into LLM prompt
toPromptContext(): string {
const parts: string[] = [];
if (this.summary) {
parts.push(`Previous conversation summary:\n${this.summary}`);
}
if (this.recentMessages.length > 0) {
parts.push('Recent messages:');
for (const msg of this.recentMessages) {
parts.push(`${msg.role}: ${msg.content}`);
}
}
return parts.join('\n\n');
}
}

Summary memory makes a tradeoff: you preserve the gist of old conversations at the cost of specific details. The customer's exact words about their daughter's birthday might get summarized to "customer has a time-sensitive delivery" — which captures the urgency but loses the personal context. For many applications, that's an acceptable tradeoff. For others, you need the next architecture.
Vector-Based Retrieval Memory
Vector retrieval memory stores every memory as a vector embedding and retrieves entries by semantic similarity to the current query. Instead of keeping everything in a buffer or summarizing down to a fixed size, you search through the full memory store for what's actually relevant.
This is where memory intersects with RAG — the same embedding and retrieval techniques covered in RAG from Scratch apply directly. The difference is the data source: RAG retrieves from documents; memory retrieves from the agent's own experience.
Here's a complete implementation with cosine similarity search:
class VectorMemoryStore {
private entries: Array<{
id: string;
text: string;
embedding: number[];
metadata: Record<string, unknown>;
timestamp: Date;
}> = [];
private openai: OpenAI;
constructor(apiKey: string) {
this.openai = new OpenAI({ apiKey });
}
async store(
text: string,
metadata: Record<string, unknown> = {}
): Promise<string> {
const embedding = await this.embed(text);
const id = randomUUID();
this.entries.push({
id,
text,
embedding,
metadata,
timestamp: new Date(),
});
return id;
}
async search(
query: string,
topK: number = 5,
filter?: (meta: Record<string, unknown>) => boolean
): Promise<Array<{ text: string; score: number; metadata: Record<string, unknown> }>> {
const queryEmbedding = await this.embed(query);
let candidates = this.entries;
if (filter) {
candidates = candidates.filter(e => filter(e.metadata));
}
const scored = candidates.map(entry => ({
text: entry.text,
score: this.cosineSimilarity(queryEmbedding, entry.embedding),
metadata: entry.metadata,
}));
return scored
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
private async embed(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
});
return response.data[0].embedding;
}
private cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, normA = 0, normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Decay old memories — reduce their retrieval priority over time
applyDecay(halfLifeDays: number = 30): void {
const now = Date.now();
for (const entry of this.entries) {
const ageDays =
(now - entry.timestamp.getTime()) / (1000 * 60 * 60 * 24);
const decayFactor = Math.pow(0.5, ageDays / halfLifeDays);
// Store decay factor in metadata for retrieval scoring
entry.metadata._decayFactor = decayFactor;
}
}
}

In production, you'd replace the in-memory store with a vector database — Pinecone, Qdrant, pgvector, or Weaviate. The API surface is essentially the same: embed, store, search. The vector database handles efficient approximate nearest neighbor search at scale, which matters once you have thousands or millions of memory entries.
The decay mechanism deserves attention. Without it, ancient memories compete equally with recent ones during retrieval. The applyDecay method implements exponential decay with a configurable half-life — a memory from 30 days ago scores at 50% of its original relevance. Note that search as written ignores the stored _decayFactor; for decay to take effect, retrieval scoring has to multiply similarity by it. Mem0 does something similar, calling it "dynamic forgetting," which helps keep retrieved context current and relevant.
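A minimal sketch of applying that factor at query time, using the same half-life form as applyDecay (the function name is ours):

```typescript
// Decay-aware retrieval scoring: multiply the raw cosine similarity by
// an exponential decay factor so older memories rank lower. With a
// 30-day half-life, a 30-day-old memory scores at half its similarity.
function decayedScore(
  similarity: number,
  ageDays: number,
  halfLifeDays = 30
): number {
  const decay = Math.pow(0.5, ageDays / halfLifeDays);
  return similarity * decay;
}

console.log(decayedScore(0.9, 30)); // 0.45
console.log(decayedScore(0.9, 0));  // 0.9 — fresh memories are unaffected
```

In the VectorMemoryStore above, this would replace the bare cosineSimilarity call inside search, reading the age from each entry's timestamp.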
Putting It All Together: A Unified Memory System
A production memory system doesn't use just one architecture — it combines all four memory types into a unified layer that the agent queries before every response. The key is making this transparent to the application code: the agent asks "what do I know about this user and this situation?" and gets back a curated context block.
Here's a unified memory manager that orchestrates the pieces:
interface MemoryContext {
workingMemory: Message[];
relevantEpisodes: Episode[];
semanticFacts: SemanticMemory[];
suggestedProcedure: Procedure | null;
summary: string;
}
class UnifiedMemoryManager {
private working: WorkingMemory;
private episodic: EpisodicMemory;
private semantic: SemanticMemoryExtractor;
private procedural: ProceduralMemory;
private vectorStore: VectorMemoryStore;
private summaryMemory: SummaryMemory;
constructor(apiKey: string) {
this.working = new WorkingMemory(20);
this.episodic = new EpisodicMemory();
this.semantic = new SemanticMemoryExtractor(apiKey);
this.procedural = new ProceduralMemory();
this.vectorStore = new VectorMemoryStore(apiKey);
this.summaryMemory = new SummaryMemory(apiKey);
}
// Called on every user message
async processMessage(
userId: string,
sessionId: string,
message: Message
): Promise<void> {
// Update working memory
this.working.add(message);
// Update summary
await this.summaryMemory.add(message);
// Store in vector memory for future retrieval
await this.vectorStore.store(message.content, {
userId,
sessionId,
role: message.role,
timestamp: message.timestamp.toISOString(),
});
}
// Build full context for LLM prompt
async buildContext(
userId: string,
currentQuery: string
): Promise<MemoryContext> {
// Parallel retrieval for speed
const [vectorResults, episodes, facts] = await Promise.all([
this.vectorStore.search(currentQuery, 5, (meta) =>
meta.userId === userId
),
Promise.resolve(this.episodic.getRelevant(userId, 3)),
Promise.resolve(this.semantic.getMemories(userId)),
]);
// Find applicable procedure
const procedure = this.procedural.findProcedure(currentQuery);
const { summary, recentMessages } = this.summaryMemory.getContext();
return {
workingMemory: recentMessages,
relevantEpisodes: episodes,
semanticFacts: facts,
suggestedProcedure: procedure,
summary,
};
}
// Format context for injection into system prompt
formatForPrompt(context: MemoryContext): string {
const sections: string[] = [];
if (context.summary) {
sections.push(
`## Conversation History\n${context.summary}`
);
}
if (context.semanticFacts.length > 0) {
const facts = context.semanticFacts
.map(f => `- ${f.fact} (confidence: ${f.confidence})`)
.join('\n');
sections.push(`## What You Know About This User\n${facts}`);
}
if (context.relevantEpisodes.length > 0) {
const episodes = context.relevantEpisodes
.map(e => `- [${e.timestamp.toLocaleDateString()}] ${e.event}${
e.outcome ? ` → ${e.outcome}` : ''
}`)
.join('\n');
sections.push(`## Relevant Past Interactions\n${episodes}`);
}
if (context.suggestedProcedure) {
sections.push(
`## Suggested Approach\n${this.procedural.toPromptInstructions(
context.suggestedProcedure
)}`
);
}
return sections.join('\n\n');
}
// End-of-session consolidation
async consolidateSession(
userId: string,
sessionId: string,
conversation: string
): Promise<void> {
// Extract semantic memories from the full conversation
await this.semantic.extractFromConversation(
userId,
conversation,
sessionId
);
// Store key events as episodes
// (In production, use LLM to identify notable events)
await this.episodic.store({
userId,
sessionId,
event: `Conversation session ${sessionId}`,
context: conversation.slice(0, 500),
importance: 5,
timestamp: new Date(),
tags: ['conversation'],
});
// Clear working memory
this.working.clear();
}
}

The buildContext method runs retrieval in parallel — vector search, episodic lookup, and semantic fact retrieval happen simultaneously. This keeps latency manageable even with multiple memory sources. In production, the formatForPrompt output goes into the system message, giving the LLM everything it needs to respond with full context.
The consolidateSession method runs when a conversation ends. It's the bridge between ephemeral working memory and persistent long-term storage — extracting the valuable signal from raw conversation and storing it where future sessions can find it.
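Wired into a request handler, the per-turn lifecycle becomes: retrieve, inject, generate, write back. This sketch uses a simplified Memory interface of our own (not the manager's exact signatures) and a callModel stand-in for the actual OpenAI call:

```typescript
// Per-turn lifecycle: build memory context, inject it into the system
// prompt, call the model, then record both sides of the exchange.
interface Memory {
  buildContext(userId: string, query: string): Promise<string>;
  recordTurn(userId: string, role: string, content: string): Promise<void>;
}

async function handleTurn(
  memory: Memory,
  callModel: (system: string, user: string) => Promise<string>,
  userId: string,
  userMessage: string
): Promise<string> {
  // Retrieve curated context before generating
  const memoryBlock = await memory.buildContext(userId, userMessage);
  const system = `You are a support agent.\n\n${memoryBlock}`;
  const reply = await callModel(system, userMessage);
  // Write the turn back so future retrievals can find it
  await memory.recordTurn(userId, 'user', userMessage);
  await memory.recordTurn(userId, 'assistant', reply);
  return reply;
}
```

With UnifiedMemoryManager, buildContext would wrap buildContext plus formatForPrompt, and recordTurn would wrap processMessage.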

Where Memory Meets RAG
Memory and RAG share infrastructure — embeddings, vector stores, similarity search — but serve fundamentally different purposes. Understanding the boundary prevents architectural confusion and helps you build systems where both work together effectively.
RAG retrieves from organizational knowledge: product documentation, FAQs, policy documents, knowledge base articles. This information exists independently of any particular user or conversation. It's the same for everyone.
Memory retrieves from experiential knowledge: what happened in past conversations, what this specific user prefers, how similar situations resolved before. This information is generated through interaction and is unique to each user or agent.
In practice, the overlap shows up at the infrastructure layer: a single vector database often hosts both. The namespace or collection separates them: RAG documents live in one collection, memory entries in another. The retrieval layer queries both, and the results get merged into a single context block.
Here's how that merger works in practice:
interface RetrievalResult {
text: string;
score: number;
source: 'rag' | 'memory';
metadata: Record<string, unknown>;
}
async function hybridRetrieval(
query: string,
userId: string,
ragStore: VectorMemoryStore,
memoryStore: VectorMemoryStore,
options: {
ragTopK?: number;
memoryTopK?: number;
ragWeight?: number;
memoryWeight?: number;
} = {}
): Promise<RetrievalResult[]> {
const {
ragTopK = 3,
memoryTopK = 3,
ragWeight = 0.5,
memoryWeight = 0.5,
} = options;
const [ragResults, memResults] = await Promise.all([
ragStore.search(query, ragTopK),
memoryStore.search(query, memoryTopK, (meta) =>
meta.userId === userId
),
]);
const combined: RetrievalResult[] = [
...ragResults.map(r => ({
...r,
score: r.score * ragWeight,
source: 'rag' as const,
})),
...memResults.map(r => ({
...r,
score: r.score * memoryWeight,
source: 'memory' as const,
})),
];
// Sort by weighted score, interleave sources
return combined.sort((a, b) => b.score - a.score);
}

The weighting between RAG and memory results depends on the use case. Customer support agents might weight memory higher (the customer's history matters more than generic docs). A technical support agent might weight RAG higher (the answer is in the documentation, memory provides context). If you're working with knowledge bases, Chanl's memory and knowledge base features handle this retrieval orchestration, letting you configure the balance per agent.
Production Memory Architectures
Several open-source and commercial systems have emerged specifically for agent memory. Understanding their approaches helps you decide whether to build or integrate — and what patterns to borrow if you build your own.
MemGPT / Letta: OS-Inspired Tiering
The MemGPT paper (arXiv:2310.08560) introduced the idea of treating the LLM's context window like an operating system treats RAM. Just as an OS pages data between fast memory and disk, MemGPT pages information between the LLM's context (core memory) and external storage (archival memory).
Letta, the production system built from the MemGPT research, implements three tiers:
- Core memory — Always in the context window. Contains the agent's persona, key facts about the current user, and active conversation state. Size-limited, like RAM.
- Recall memory — Searchable conversation history. The agent can query past sessions by issuing memory retrieval function calls. Analogous to a disk cache.
- Archival memory — Long-term storage for any information the agent wants to preserve. Backed by a vector database. Analogous to disk.
The breakthrough insight is that the agent manages its own memory through tool calls. Instead of a separate system deciding what to store, the agent itself calls core_memory_append, archival_memory_insert, or conversation_search as tools. This gives the agent agency over what it remembers — a form of metacognition.
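Concretely, the tool surface exposed to the agent might look like the sketch below. The function names come from the MemGPT paper; the parameter schemas here are illustrative assumptions, not Letta's actual API:

```typescript
// Illustrative memory-tool definitions for a self-managing agent.
// Names follow MemGPT; schemas are simplified assumptions.
type MemoryTool = {
  name: string;
  description: string;
  parameters: Record<string, { type: string; description: string }>;
};

const memoryTools: MemoryTool[] = [
  {
    name: 'core_memory_append',
    description: 'Append a fact to always-in-context core memory.',
    parameters: {
      section: { type: 'string', description: 'Which core block: persona or human.' },
      content: { type: 'string', description: 'The fact to remember.' },
    },
  },
  {
    name: 'archival_memory_insert',
    description: 'Write durable information to vector-backed archival storage.',
    parameters: {
      content: { type: 'string', description: 'Information to archive.' },
    },
  },
  {
    name: 'conversation_search',
    description: 'Search past conversation history (recall memory).',
    parameters: {
      query: { type: 'string', description: 'Search query.' },
    },
  },
];
```

In practice these definitions are passed as tools on each LLM call, so the model itself decides when a detail is worth promoting to core memory or archiving for later.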
Zep: Temporal Knowledge Graphs
Zep takes a different approach entirely. Instead of flat vector retrieval, it builds a temporal knowledge graph from conversations — nodes represent entities (people, products, events), edges represent relationships, and every element carries temporal metadata showing when it was true.
Their January 2025 paper (arXiv:2501.13956) reports up to 18.5% accuracy improvements over baseline implementations and 90% lower latency. The temporal component is the key differentiator: Zep can answer "what was the customer's address before they moved?" because the graph preserves historical state, not just current state.
This matters for customer experience agents where context evolves over time. A customer's preferences, addresses, account details, and relationships change — a flat memory store overwrites history, while a temporal graph preserves the full timeline.
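The core idea can be sketched in a few lines, assuming each edge carries a validity interval. This is the concept, not Zep's actual schema:

```typescript
// A fact as a temporal edge: true from validFrom until validTo.
// validTo === null means the fact is currently true.
interface TemporalEdge {
  subject: string;      // e.g. 'customer:42'
  predicate: string;    // e.g. 'has_address'
  object: string;       // e.g. '123 Pine St, Seattle'
  validFrom: Date;
  validTo: Date | null;
}

// "What was the customer's address before they moved?" becomes a
// point-in-time lookup instead of an overwrite-and-forget read.
function factAt(
  edges: TemporalEdge[],
  subject: string,
  predicate: string,
  at: Date,
): string | undefined {
  return edges.find(e =>
    e.subject === subject &&
    e.predicate === predicate &&
    e.validFrom <= at &&
    (e.validTo === null || at < e.validTo)
  )?.object;
}
```

Updating a fact means closing the old edge's interval and opening a new one, so history is preserved rather than destroyed.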
Mem0: Intelligent Extraction and Consolidation
Mem0 takes the approach closest to human memory consolidation. Instead of storing raw conversation chunks, it uses an LLM to extract meaningful memories from interactions, consolidate overlapping information, and decay irrelevant entries over time.
Their April 2025 paper reports 26% improvement in LLM-as-judge quality metrics over raw retrieval, 91% lower P95 latency, and over 90% token cost savings. The key is selectivity — not every sentence in a conversation deserves to become a memory. Mem0's extraction pipeline filters for information that's reusable across future interactions.
The architecture supports three memory scopes: user memory (persists across all sessions with a person), session memory (within a single conversation), and agent memory (specific to a particular agent instance). This mapping aligns naturally with the four memory types we built earlier.
Choosing an Architecture
| Approach | Best for | Tradeoff |
|---|---|---|
| Buffer | Prototypes, single-session | Loses context between sessions |
| Summary | Long conversations, cost-sensitive | Loses specific details |
| Vector retrieval | Cross-session recall, large history | Requires embedding infrastructure |
| OS-tiered (Letta) | Agents needing metacognition | Complexity, tool call overhead |
| Knowledge graph (Zep) | Temporal relationships, entity tracking | Graph infrastructure complexity |
| Intelligent extraction (Mem0) | Personalization at scale | Extraction quality varies |
Most teams should start with buffer + summary for working memory, add vector retrieval for cross-session recall, and only adopt knowledge graphs or full memory frameworks when the simpler approaches hit measurable limits.
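That starting point fits in a few lines: keep the last N turns verbatim and fold evicted turns into a running summary. Here `summarize` is a stand-in for an LLM summarization call:

```typescript
// Minimal working memory: verbatim recent turns plus a rolling summary.
interface Turn { role: 'user' | 'assistant'; content: string }

class BufferWithSummary {
  private buffer: Turn[] = [];
  private summary = '';

  constructor(
    private maxTurns: number,
    // In production this is an LLM call; injected here for testability.
    private summarize: (summary: string, evicted: Turn[]) => string,
  ) {}

  add(turn: Turn): void {
    this.buffer.push(turn);
    if (this.buffer.length > this.maxTurns) {
      const evicted = this.buffer.splice(0, this.buffer.length - this.maxTurns);
      this.summary = this.summarize(this.summary, evicted);
    }
  }

  // Context for the next LLM call: summary first, then recent turns verbatim.
  context(): string {
    const recent = this.buffer.map(t => `${t.role}: ${t.content}`).join('\n');
    return this.summary ? `Summary: ${this.summary}\n${recent}` : recent;
  }
}
```

Cross-session recall then layers on top: the same `context()` string gets prepended with retrieved memories instead of replacing them.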
Memory Consolidation: From Raw to Refined
Memory consolidation — the process of converting raw experiences into durable, retrievable knowledge — is where the magic happens. Without it, your memory store becomes an ever-growing pile of conversation fragments. With it, the agent genuinely learns and improves over time.
The consolidation pipeline runs asynchronously after each session, extracting structured knowledge from unstructured conversation:
interface ConsolidationResult {
newFacts: SemanticMemory[];
updatedFacts: SemanticMemory[];
episodes: Episode[];
procedures: Procedure[];
}
class MemoryConsolidator {
private openai: OpenAI;
constructor(apiKey: string) {
this.openai = new OpenAI({ apiKey });
}
async consolidate(
userId: string,
sessionId: string,
transcript: string,
existingFacts: SemanticMemory[]
): Promise<ConsolidationResult> {
const response = await this.openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.1,
response_format: { type: 'json_object' },
messages: [
{
role: 'system',
content: `Analyze this conversation and extract structured memory.
Return JSON with:
{
"facts": [{ "text": "...", "confidence": 0-1, "category": "preference|fact|behavior", "updates": "id-of-existing-fact-if-updating|null" }],
"episodes": [{ "event": "...", "importance": 1-10, "outcome": "..." }],
"procedures": [{ "name": "...", "description": "...", "steps": ["step1", "step2"], "context": "when-to-use" }]
}
Existing facts about this user:
${existingFacts.map(f => `[${f.id}] ${f.fact}`).join('\n') || 'None'}
Rules:
- Update existing facts when new information supersedes them
- Only extract episodes that are noteworthy (importance >= 5)
- Only extract procedures from successful resolution patterns
- Confidence 0.9+ for explicit statements, 0.5-0.8 for inferences`
},
{ role: 'user', content: transcript }
]
});
const parsed = JSON.parse(
response.choices[0].message.content || '{}'
);
// Transform into typed results
const result: ConsolidationResult = {
newFacts: [],
updatedFacts: [],
episodes: [],
procedures: [],
};
for (const fact of parsed.facts || []) {
const memory: SemanticMemory = {
id: fact.updates || randomUUID(),
userId,
fact: fact.text,
confidence: fact.confidence,
source: sessionId,
category: fact.category,
lastAccessed: new Date(),
accessCount: 0,
createdAt: new Date(),
updatedAt: new Date(),
};
if (fact.updates) {
result.updatedFacts.push(memory);
} else {
result.newFacts.push(memory);
}
}
for (const ep of parsed.episodes || []) {
result.episodes.push({
id: randomUUID(),
userId,
sessionId,
event: ep.event,
context: transcript.slice(0, 200),
outcome: ep.outcome,
importance: ep.importance,
timestamp: new Date(),
tags: [],
});
}
// The prompt also requests procedures; map them in, or they're silently lost.
// Field names follow the extraction prompt above.
for (const proc of parsed.procedures || []) {
result.procedures.push({
id: randomUUID(),
userId,
name: proc.name,
description: proc.description,
steps: proc.steps,
context: proc.context,
} as Procedure);
}
return result;
}
}
This consolidation pipeline mirrors what cognitive scientists call "memory consolidation during sleep" — the brain replays experiences and extracts patterns. The agent's version runs after each session, distilling raw conversation into the three long-term memory types: semantic facts, episodic events, and procedural knowledge.
Scoring and Ranking Retrieved Memories
Retrieving memories is only half the problem. The other half is ranking them — deciding which memories deserve to occupy the limited space in the LLM's context window. The Stanford generative agents paper established the standard approach: a weighted combination of recency, importance, and relevance.
Here's a production-grade retrieval scorer:
interface ScoredMemory {
content: string;
recencyScore: number;
importanceScore: number;
relevanceScore: number;
finalScore: number;
source: 'episodic' | 'semantic' | 'vector';
}
function scoreMemories(
candidates: Array<{
content: string;
timestamp: Date;
importance: number; // 1-10
similarityScore: number; // 0-1 (from vector search)
source: 'episodic' | 'semantic' | 'vector';
}>,
weights: {
recency: number;
importance: number;
relevance: number;
} = { recency: 0.3, importance: 0.3, relevance: 0.4 }
): ScoredMemory[] {
const now = Date.now();
return candidates
.map(candidate => {
// Recency: exponential decay, most recent = 1.0
const age = now - candidate.timestamp.getTime();
const recencyScore = Math.exp(-age / (7 * 24 * 60 * 60 * 1000)); // decays to 1/e after 7 days
// Importance: normalize to 0-1
const importanceScore = candidate.importance / 10;
// Relevance: already 0-1 from vector similarity
const relevanceScore = candidate.similarityScore;
const finalScore =
weights.recency * recencyScore +
weights.importance * importanceScore +
weights.relevance * relevanceScore;
return {
content: candidate.content,
recencyScore,
importanceScore,
relevanceScore,
finalScore,
source: candidate.source,
};
})
.sort((a, b) => b.finalScore - a.finalScore);
}
The weight distribution matters. For customer support agents, relevance should dominate (0.5+) because the agent needs to surface contextually appropriate memories. For personal assistant agents, recency matters more — the user's recent requests override older patterns. For compliance-sensitive applications, importance should be weighted higher to ensure critical facts (like consent status or account restrictions) always surface.
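Those recommendations can be captured as presets fed into the scorer's `weights` parameter. The exact numbers below are illustrative starting points, not tuned values:

```typescript
// Per-use-case weight presets. Each must sum to 1 so final scores
// stay comparable across configurations. Values are assumptions to
// tune against your own eval data.
const WEIGHT_PRESETS = {
  customerSupport: { recency: 0.2, importance: 0.25, relevance: 0.55 },
  personalAssistant: { recency: 0.45, importance: 0.2, relevance: 0.35 },
  compliance: { recency: 0.15, importance: 0.5, relevance: 0.35 },
} as const;
```

Treat the preset as a per-agent configuration value rather than a hardcoded constant, so you can adjust it without redeploying the scorer.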
If you're building prompt systems that integrate memory context, the scoring weights become part of your prompt engineering — they determine what the model sees and therefore how it responds.
Privacy-First Memory Design
Memory creates a tension: the more your agent remembers, the more useful it becomes, and the more privacy risk it carries. Building memory without privacy controls isn't just a regulatory problem — it's a trust problem. When 82% of consumers see AI data handling as a serious threat, getting this right is a competitive advantage.
Spain's data protection authority (AEPD) published a 71-page guide in February 2026 specifically addressing AI agent memory risks. They identify four critical dimensions: relevance (what gets stored must be controlled), consistency (stored data must be accurate), retention (data must not persist beyond necessity), and integrity (stored information must resist manipulation).
What to Store and What Not To
Not all conversation content deserves to become a memory. The principle of data minimization — required by GDPR, HIPAA, and CCPA — means storing only what's adequate, relevant, and necessary for the stated purpose.
A practical classification framework:
type MemoryTier = 'transient' | 'short-term' | 'long-term' | 'never-store';
interface MemoryClassification {
tier: MemoryTier;
retentionDays: number | null; // null = until explicit deletion
requiresConsent: boolean;
piiCategory?: string;
}
function classifyForStorage(content: string): MemoryClassification {
// Tier 1: Never store — sensitive PII, health, financial details
const neverStorePatterns = [
/\b\d{3}-?\d{2}-?\d{4}\b/, // SSN
/\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/, // Credit card
/\b(?:password|ssn|social security)\b/i,
];
if (neverStorePatterns.some(p => p.test(content))) {
return {
tier: 'never-store',
retentionDays: 0,
requiresConsent: false,
piiCategory: 'sensitive',
};
}
// Tier 2: Transient — current session only
const transientPatterns = [
/\b(?:hold on|one moment|let me check)\b/i,
/\b(?:yes|no|okay|sure|thanks)\b/i,
];
if (transientPatterns.some(p => p.test(content)) && content.length < 50) {
return {
tier: 'transient',
retentionDays: 1,
requiresConsent: false,
};
}
// Tier 3: Short-term — operational retention (30-90 days)
// Default for conversation content not matching other categories
return {
tier: 'short-term',
retentionDays: 90,
requiresConsent: false,
};
}
This is a starting point — production systems use LLM-based classification for nuance that regex can't capture. The principle holds: classify before storing, apply retention policies automatically, and default to shorter retention when uncertain.
Memory Deletion and the Right to Erasure
GDPR Article 17 grants individuals the right to request data erasure. For AI memory systems, this means you need to be able to delete all memories associated with a specific user without destroying other users' data or breaking the system.
The architecture decision that makes this possible: keep memory in retrieval databases (vector stores, key-value stores) rather than baking user data into fine-tuned models. You can delete a vector entry. You can't un-train a model.
class PrivacyAwareMemoryStore {
private store: VectorMemoryStore;
constructor(apiKey: string) {
this.store = new VectorMemoryStore(apiKey);
}
async storeWithConsent(
text: string,
userId: string,
consentBasis: 'explicit' | 'legitimate-interest' | 'contract',
retentionDays: number
): Promise<string> {
const classification = classifyForStorage(text);
if (classification.tier === 'never-store') {
// Log the rejection for audit, don't store the content
console.log(`Rejected storage: sensitive content detected for user ${userId}`);
return 'rejected';
}
const expiresAt = new Date();
expiresAt.setDate(
expiresAt.getDate() +
Math.min(retentionDays, classification.retentionDays || retentionDays)
);
return this.store.store(text, {
userId,
consentBasis,
retentionDays,
expiresAt: expiresAt.toISOString(),
storedAt: new Date().toISOString(),
});
}
// Honor right to erasure — delete ALL memories for a user
async deleteUserMemories(userId: string): Promise<number> {
// In production, this queries the vector DB with a metadata filter
// and deletes all matching entries
let deleted = 0;
// ... deletion logic against your vector database
return deleted;
}
// Automated retention enforcement — run daily
async enforceRetention(): Promise<number> {
const now = new Date().toISOString();
// Delete all entries where expiresAt < now
let expired = 0;
// ... expiration logic against your vector database
return expired;
}
}
The New America Foundation's 2025 report on AI agents and memory highlights a critical point: when agents interact with external services via protocols like MCP, personal data may flow to third parties without the user's awareness. Memory governance isn't just about what your agent stores — it's about what gets transmitted during tool execution and external API calls.
Audit Trails
Every memory operation — creation, access, modification, deletion — should produce an audit log. This isn't just compliance overhead. When a customer asks "why did your agent say that?", the audit trail tells you which memories informed the response.
interface MemoryAuditEntry {
action: 'create' | 'read' | 'update' | 'delete';
memoryId: string;
userId: string;
agentId: string;
sessionId: string;
timestamp: Date;
reason: string;
metadata?: Record<string, unknown>;
}
Production monitoring and analytics systems should track memory access patterns alongside conversation quality metrics. If an agent's quality score drops, checking which memories it retrieved (or failed to retrieve) is often the fastest path to diagnosis — a pattern covered in depth in How to Evaluate AI Agents.
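A minimal in-memory sketch of such a log follows; the entry type is repeated so the snippet stands alone, and a production system would write to durable, tamper-evident storage instead:

```typescript
// Append-only audit log for memory operations. Mirrors the
// MemoryAuditEntry interface above, trimmed for a runnable example.
interface AuditEntry {
  action: 'create' | 'read' | 'update' | 'delete';
  memoryId: string;
  userId: string;
  sessionId: string;
  timestamp: Date;
  reason: string;
}

class MemoryAuditLog {
  private entries: AuditEntry[] = [];

  record(entry: AuditEntry): void {
    this.entries.push(entry);
  }

  // Answering "why did the agent say that?" means listing every
  // memory read that happened in the session that produced the reply.
  readsForSession(sessionId: string): AuditEntry[] {
    return this.entries.filter(
      e => e.sessionId === sessionId && e.action === 'read'
    );
  }
}
```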
Common Pitfalls and How to Avoid Them
Building memory systems teaches hard lessons. Here are the ones that cost the most time.
Storing Everything
The temptation is strong: disk is cheap, embeddings are cheap, so why not store every message? Because retrieval quality degrades when the memory store fills with noise. "Hello" and "can you hold on a second" don't need to be searchable memories. Filter before storing — a simple relevance threshold (embedding similarity to the conversation topic > 0.3) eliminates most noise.
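That pre-storage filter is a few lines, assuming you already have embeddings for the message and the conversation topic. The 0.3 threshold is a starting point to tune per domain:

```typescript
// Relevance gate: only store messages whose embedding is sufficiently
// similar to the conversation topic. Embeddings come from your
// embedding API; plain number[] vectors here.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against zero vectors to avoid division by zero
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function shouldStore(
  messageEmbedding: number[],
  topicEmbedding: number[],
  threshold = 0.3, // tune per domain; stricter thresholds store less noise
): boolean {
  return cosineSimilarity(messageEmbedding, topicEmbedding) > threshold;
}
```

Run the gate before the embedding ever reaches the vector store, so filler turns never become searchable memories in the first place.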
Ignoring Memory Conflicts
When a customer says "I moved to Portland" but their account still shows Seattle, you have a conflict. Naive memory systems store both, and the agent becomes confused — sometimes using one, sometimes the other. Always implement conflict resolution: newer high-confidence memories should update or override older conflicting ones.
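A sketch of that resolution rule: a newer fact overrides only when its confidence is at least as high; otherwise the contradiction is kept and flagged for review. The `Fact` shape here is a simplified assumption:

```typescript
// Newest-wins-with-confidence conflict resolution.
interface Fact {
  subject: string;     // what the fact is about, e.g. 'address'
  value: string;
  confidence: number;  // 0-1
  updatedAt: Date;
}

function resolveConflict(
  existing: Fact,
  incoming: Fact,
): { keep: Fact; flagged: boolean } {
  const newer = incoming.updatedAt >= existing.updatedAt;
  const asConfident = incoming.confidence >= existing.confidence;
  if (newer && asConfident) {
    // e.g. "I moved to Portland" supersedes the Seattle record
    return { keep: incoming, flagged: false };
  }
  // Older or lower-confidence contradiction: keep the existing fact
  // but flag the pair for human or LLM review instead of storing both.
  return { keep: existing, flagged: true };
}
```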
Missing Temporal Context
"The customer prefers email" is less useful than "the customer said they prefer email on March 5, 2026, during a billing dispute." Temporal metadata makes memories auditable, debuggable, and deletable. Store timestamps on everything.
One-Size-Fits-All Retrieval
Different questions need different memory types. "What's this customer's preferred contact method?" needs semantic memory. "What happened last time they called?" needs episodic memory. "How should I handle a refund request?" needs procedural memory. Route queries to the appropriate memory type instead of searching everything uniformly.
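A keyword-heuristic router illustrates the idea; production systems typically classify the query with a small LLM call instead of regexes:

```typescript
// Route a query to the memory type most likely to answer it.
// The regexes are rough illustrative heuristics, not a real classifier.
type MemoryType = 'semantic' | 'episodic' | 'procedural';

function routeQuery(query: string): MemoryType {
  // "How should I handle a refund request?" -> procedural
  if (/\b(how (do|should) i|steps to|procedure)\b/i.test(query)) {
    return 'procedural';
  }
  // "What happened last time they called?" -> episodic
  if (/\b(last time|previously|happened|history)\b/i.test(query)) {
    return 'episodic';
  }
  // "What's this customer's preferred contact method?" -> semantic
  return 'semantic';
}
```

The payoff is narrower, cheaper searches: each memory type lives in its own store, and routing means you query one of them instead of all three.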
Overlooking Memory in Eval
If you're evaluating your agent (and you should be — see how to build an eval framework), memory needs to be part of the eval. Test cases should cover: Does the agent correctly recall information from a previous session? Does it handle updated preferences? Does it avoid surfacing information the user asked to forget? Memory is a feature, and features need tests.
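Memory-focused eval cases might look like the sketch below, with explicit fields for what must and must not surface. The shape is an assumption for illustration, not a standard format:

```typescript
// Eval cases targeting memory behavior: recall, updates, and erasure.
interface MemoryEvalCase {
  name: string;
  seedMemories: string[];       // stored before the test turn runs
  forgottenMemories?: string[]; // deleted on user request; must NOT surface
  query: string;
  mustInclude: string[];        // facts the response should recall
  mustExclude: string[];        // e.g. erased or superseded facts
}

const memoryEvalCases: MemoryEvalCase[] = [
  {
    name: 'recalls prior-session preference',
    seedMemories: ['Customer prefers email contact'],
    query: 'How should we contact this customer?',
    mustInclude: ['email'],
    mustExclude: [],
  },
  {
    name: 'honors erasure request',
    seedMemories: [],
    forgottenMemories: ['Customer shipping address: 123 Pine St'],
    query: 'Where does the customer live?',
    mustInclude: [],
    mustExclude: ['123 Pine St'],
  },
];
```

Running these means seeding the memory store, sending the query through the full agent, and string-matching (or LLM-judging) the response against the include and exclude lists.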
What's Next for Agent Memory
The field is moving fast. Three directions stand out.
Memory as a service. Dedicated memory layers — Mem0, Zep, Letta — are becoming infrastructure that agents connect to rather than build from scratch. The same way you don't build your own database, you may not need to build your own memory system. But understanding the internals helps you evaluate which approach fits your use case and debug when things go wrong.
Graph-based memory. Flat vector stores treat each memory as independent. Knowledge graphs capture relationships between memories — this customer is connected to that account, which uses this product, which had that issue. Zep's temporal knowledge graph approach reports significant accuracy improvements for tasks requiring cross-session synthesis. Expect graph-based memory to become the default for complex, relationship-heavy domains.
Memory governance standards. The EU AI Act becomes fully applicable in August 2026. Spain's AEPD has already published detailed guidance on agent memory. ICLR 2026 accepted a workshop specifically on "Memory for LLM-Based Agentic Systems." The regulatory and research communities are converging on the position that agent memory isn't just a feature — it's a trust and safety concern that needs governance frameworks.
Memory is what separates agents that answer questions from agents that build relationships. The technical foundations are straightforward: buffers for the current conversation, summaries for compression, vector stores for retrieval, and extraction pipelines for consolidation. The hard part is the design decisions — what to remember, what to forget, and how to surface the right context at the right moment. Start simple, measure what matters, and let your users' needs drive the complexity.
References and Further Reading
- Memory in the Age of AI Agents: A Survey — Tsinghua, CMU et al. (arXiv:2512.13564, Dec 2025)
- MemGPT: Towards LLMs as Operating Systems — Packer et al. (arXiv:2310.08560)
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory (arXiv:2501.13956, Jan 2025)
- Generative Agents: Interactive Simulacra of Human Behavior — Park et al. (Stanford, 2023)
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (arXiv:2504.19413)
- Memory for AI Agents: A New Paradigm of Context Engineering — The New Stack
- Beyond Short-term Memory: The 3 Types of Long-term Memory AI Agents Need — Machine Learning Mastery
- AI Agent Memory: What, Why and How It Works — Mem0
- Agent Memory: How to Build Agents that Learn and Remember — Letta
- ChatGPT Memory and New Controls — OpenAI (April 2025 update)
- Engineering GDPR Compliance in the Age of Agentic AI — IAPP
- Spain AEPD: Hidden GDPR Risks of Agentic AI (Feb 2026)
- AI Agents and Memory: Privacy and Power in the MCP Era — New America Foundation
- The False Promise of Massive Context Windows — Yusef Ulum (Jan 2026)
- Conversational Memory for LLMs with LangChain — Pinecone
- Vector Databases vs. Graph RAG for Agent Memory — Machine Learning Mastery
- Minding Mindful Machines: AI Agents and Data Protection — Future of Privacy Forum
- What Is Memory Governance and Why It Matters for AI Security — Acuvity