
AI Agent Memory: From Session Context to Long-Term Knowledge

Build AI agent memory systems from scratch in TypeScript. Covers memory types (session, episodic, semantic, procedural), architectures (buffer, summary, vector retrieval), RAG intersection, and privacy-first design.

Dean Grover, Co-founder
March 10, 2026
25 min read
Illustration: watercolor rendering of interconnected memory nodes forming a knowledge network in sage and olive tones

Your support agent handles a customer complaint about a delayed shipment. The customer mentions they're preparing for their daughter's birthday party this Saturday. The agent resolves the issue, expedites the package, confirms the new delivery date. Great interaction.

Two days later, the same customer calls back. Different question entirely — they want to add an item to their order. Your agent has no idea who they are. No memory of the birthday. No awareness that there's a time-sensitive delivery in progress. The customer repeats everything. The magic is gone.

This is the gap that separates a chatbot from an agent. Memory — the ability to retain, organize, and retrieve information across interactions — is what transforms a stateless language model into something that genuinely learns about the people it serves. And building it well is harder than it looks.

We'll build a working memory system from scratch in TypeScript, covering every layer: from simple conversation buffers to vector-powered semantic recall. Along the way, we'll explore the cognitive science behind memory types, examine production architectures from MemGPT to Zep, understand where memory and RAG converge, and tackle the privacy constraints that shape what your agent should — and shouldn't — remember.

Prerequisites and Setup

You'll need Node.js 20+, TypeScript, and familiarity with async/await patterns. Some sections reference vector embeddings and similarity search — if those concepts are new, start with RAG from Scratch for the foundations.

bash
npm install openai uuid
npm install -D typescript @types/node

We'll use OpenAI's API for embeddings and text generation. The architectural patterns work with any LLM provider — swap in Anthropic, Ollama, or whatever you prefer.

The code examples build on each other progressively. Each is self-contained enough to run independently, but they're designed to show how simple memory evolves into production-grade systems.

Why Memory Matters: The Stateless Problem

Every LLM call is stateless by default — the model receives a prompt, generates a response, and forgets everything. Without external memory, agents can't learn from past interactions, recognize returning users, or build context over time. This fundamental limitation means even the most capable model starts every conversation from zero.

The numbers make this concrete. Claude's context window holds 200,000 tokens. GPT-4o supports 128,000. Gemini 2.0 Pro reaches 2 million. These sound enormous until you calculate what they actually hold. A typical customer service interaction runs 2,000-4,000 tokens. A user with 100 past conversations has generated 200,000-400,000 tokens of history — already exceeding most context windows, and that's just one user.

Even if you could fit everything in, you wouldn't want to. Research consistently shows that LLM performance degrades in the middle of long contexts — a phenomenon researchers call the "lost in the middle" problem. A model advertising 200K tokens typically becomes unreliable around 130K, with sudden accuracy drops rather than gradual degradation. Stuffing the full history into every prompt isn't just expensive. It actively hurts quality.

Memory systems solve this by acting as an intelligent filter between raw conversation history and the model's context window. Instead of "here's everything that ever happened," memory says "here's what's relevant right now."

Diagram: the stateless problem. Without memory, each session's user message goes to a stateless LLM and Session 2 starts with no context from Session 1; with a memory system in the loop, relevant context is injected into every call.

The Four Types of Agent Memory

Cognitive science gives us a surprisingly useful framework for thinking about AI agent memory. The human memory system — studied for over a century — maps cleanly onto the challenges agents face. Four types matter most: working memory for the current conversation, episodic memory for specific past events, semantic memory for distilled knowledge, and procedural memory for learned behaviors.

This isn't just an analogy. The December 2025 survey "Memory in the Age of AI Agents" from Tsinghua University and CMU explicitly argues that traditional short-term/long-term taxonomies are insufficient, proposing a function-based taxonomy that mirrors cognitive science categories. Let's break each one down with TypeScript implementations.

Working Memory (Session Context)

Working memory holds the information needed for the current task — the active conversation, recent tool calls, and immediate context. It's fast, bounded, and disposable. When the session ends, working memory can be discarded or consolidated into longer-term storage.

Every chat application you've used implements working memory, even if it doesn't call it that. It's the message history that gets prepended to each LLM call.

Here's the simplest possible implementation — a bounded buffer that keeps the last N messages:

typescript
interface Message {
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: Date;
}
 
class WorkingMemory {
  private messages: Message[] = [];
  private maxMessages: number;
 
  constructor(maxMessages: number = 20) {
    this.maxMessages = maxMessages;
  }
 
  add(message: Message): void {
    this.messages.push(message);
    // Evict oldest messages when buffer is full
    if (this.messages.length > this.maxMessages) {
      this.messages = this.messages.slice(-this.maxMessages);
    }
  }
 
  getContext(): Message[] {
    return [...this.messages];
  }
 
  getTokenEstimate(): number {
    // Rough estimate: 1 token ≈ 4 characters
    return this.messages.reduce(
      (sum, m) => sum + Math.ceil(m.content.length / 4), 0
    );
  }
 
  clear(): void {
    this.messages = [];
  }
}

This works for short conversations, but it has an obvious flaw: once a message falls outside the window, it's gone. The customer's birthday mention from message #3 disappears after message #23. That's where the other memory types come in.
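A scaled-down run makes the eviction concrete. This sketch uses a window of 3 instead of 20, with hypothetical message contents:

```typescript
// Minimal bounded buffer (window of 3) to show silent eviction.
type Msg = { role: string; content: string };

class TinyBuffer {
  private messages: Msg[] = [];
  constructor(private maxMessages: number) {}

  add(m: Msg): void {
    this.messages.push(m);
    // Same eviction rule as WorkingMemory: keep only the newest N
    if (this.messages.length > this.maxMessages) {
      this.messages = this.messages.slice(-this.maxMessages);
    }
  }

  contents(): string[] {
    return this.messages.map(m => m.content);
  }
}

const buf = new TinyBuffer(3);
for (const content of ['birthday mention', 'msg 2', 'msg 3', 'msg 4']) {
  buf.add({ role: 'user', content });
}

// The first message is gone: the birthday detail was silently evicted.
console.log(buf.contents()); // [ 'msg 2', 'msg 3', 'msg 4' ]
```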

Episodic Memory (What Happened)

Episodic memory records specific events with their context — when they happened, who was involved, what the outcome was. Think of it as the agent's autobiography. Stanford's 2023 "Generative Agents" paper demonstrated this powerfully: agents that maintained episodic memory could autonomously organize a Valentine's Day party by recalling who they'd invited, what conversations they'd had, and when the event was scheduled.

The key insight is that episodic memories carry temporal and contextual metadata. It's not just "the customer likes email" — it's "on March 5, 2026, during a billing dispute about invoice #4821, the customer explicitly said they prefer email communication over phone calls."

This implementation stores episodes with rich metadata and retrieves them by recency and relevance:

typescript
import { randomUUID } from 'crypto';
 
interface Episode {
  id: string;
  userId: string;
  sessionId: string;
  event: string;           // What happened
  context: string;         // Surrounding circumstances
  outcome?: string;        // How it resolved
  importance: number;      // 1-10 scale
  timestamp: Date;
  tags: string[];
  embedding?: number[];    // For semantic search (added later)
}
 
class EpisodicMemory {
  private episodes: Map<string, Episode[]> = new Map();
 
  async store(episode: Omit<Episode, 'id'>): Promise<string> {
    const id = randomUUID();
    const stored: Episode = { ...episode, id };
 
    const userEpisodes = this.episodes.get(episode.userId) || [];
    userEpisodes.push(stored);
    this.episodes.set(episode.userId, userEpisodes);
 
    return id;
  }
 
  // Retrieve by recency — most recent episodes first
  getRecent(userId: string, limit: number = 10): Episode[] {
    const episodes = this.episodes.get(userId) || [];
    return episodes
      .sort((a, b) => b.timestamp.getTime() - a.timestamp.getTime())
      .slice(0, limit);
  }
 
  // Retrieve by importance — highest importance first
  getMostImportant(userId: string, limit: number = 5): Episode[] {
    const episodes = this.episodes.get(userId) || [];
    return episodes
      .sort((a, b) => b.importance - a.importance)
      .slice(0, limit);
  }
 
  // Combined retrieval: weighted score of recency + importance
  getRelevant(
    userId: string,
    limit: number = 5,
    recencyWeight: number = 0.4,
    importanceWeight: number = 0.6
  ): Episode[] {
    const episodes = this.episodes.get(userId) || [];
    if (episodes.length === 0) return [];
 
    const now = Date.now();
    const maxAge = Math.max(
      ...episodes.map(e => now - e.timestamp.getTime())
    );
 
    return episodes
      .map(episode => {
        const age = now - episode.timestamp.getTime();
        const recencyScore = 1 - (age / (maxAge || 1));
        const importanceScore = episode.importance / 10;
        const score =
          recencyWeight * recencyScore +
          importanceWeight * importanceScore;
        return { episode, score };
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map(({ episode }) => episode);
  }
}

The getRelevant method implements two of the three signals from the Stanford generative agents scoring model: recency and importance. Production systems add the third signal, relevance to the current query, computed via embedding similarity.
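A minimal sketch of the full three-signal score follows. The weights, the 7-day decay constant, and the precomputed similarity values are illustrative assumptions, not values from the paper:

```typescript
// Three-signal memory scoring: recency + importance + relevance.
// The similarity field stands in for a cosine similarity to the
// current query that production systems would compute from embeddings.
interface ScoredMemory {
  text: string;
  ageDays: number;     // time since the memory was created
  importance: number;  // 1-10 scale, as in the Episode interface
  similarity: number;  // cosine similarity to the current query, 0-1
}

function scoreMemory(
  m: ScoredMemory,
  weights = { recency: 0.3, importance: 0.3, relevance: 0.4 }
): number {
  // Exponential recency decay with an illustrative 7-day scale
  const recency = Math.exp(-m.ageDays / 7);
  return (
    weights.recency * recency +
    weights.importance * (m.importance / 10) +
    weights.relevance * m.similarity
  );
}

const memories: ScoredMemory[] = [
  { text: 'billing dispute resolved', ageDays: 2, importance: 7, similarity: 0.82 },
  { text: 'asked about store hours', ageDays: 1, importance: 2, similarity: 0.15 },
];

const ranked = [...memories].sort((a, b) => scoreMemory(b) - scoreMemory(a));
console.log(ranked[0].text); // billing dispute resolved
```

The older but more important and more relevant memory outranks the newer trivial one, which is exactly the behavior a pure recency sort cannot produce.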

Semantic Memory (What the Agent Knows)

Semantic memory stores distilled facts and knowledge — not specific events, but the patterns and preferences extracted from them. While episodic memory says "the customer called about billing on March 5," semantic memory says "this customer frequently has billing questions and prefers email resolution."

The distinction matters because semantic memories are more compact, more generalizable, and more useful for shaping agent behavior. They're the result of consolidation — the process of converting raw experiences into reusable knowledge.

Here's how to extract semantic memories from conversations using an LLM:

typescript
import OpenAI from 'openai';
 
interface SemanticMemory {
  id: string;
  userId: string;
  fact: string;              // The distilled knowledge
  confidence: number;        // 0-1, how certain we are
  source: string;            // Which episode(s) this came from
  category: string;          // preference, fact, behavior, relationship
  lastAccessed: Date;
  accessCount: number;
  createdAt: Date;
  updatedAt: Date;
}
 
class SemanticMemoryExtractor {
  private openai: OpenAI;
  private memories: Map<string, SemanticMemory[]> = new Map();
 
  constructor(apiKey: string) {
    this.openai = new OpenAI({ apiKey });
  }
 
  async extractFromConversation(
    userId: string,
    conversation: string,
    sessionId: string
  ): Promise<SemanticMemory[]> {
    const existing = this.memories.get(userId) || [];
    const existingFacts = existing.map(m => m.fact).join('\n');
 
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0.1,
      response_format: { type: 'json_object' },
      messages: [
        {
          role: 'system',
          content: `Extract factual knowledge about the user from this conversation.
Return JSON: { "memories": [{ "fact": "...", "confidence": 0.0-1.0, "category": "preference|fact|behavior|relationship" }] }
 
Rules:
- Only extract information explicitly stated or strongly implied
- Confidence 0.9+ for direct statements, 0.5-0.8 for inferences
- Skip transient information (current mood, one-time requests)
- If a fact contradicts existing knowledge, include it with the updated information
 
Existing knowledge about this user:
${existingFacts || 'None yet'}`
        },
        { role: 'user', content: conversation }
      ]
    });
 
    const parsed = JSON.parse(
      response.choices[0].message.content || '{"memories":[]}'
    );
 
    const newMemories: SemanticMemory[] = parsed.memories.map(
      (m: { fact: string; confidence: number; category: string }) => ({
        id: randomUUID(),
        userId,
        fact: m.fact,
        confidence: m.confidence,
        source: sessionId,
        category: m.category,
        lastAccessed: new Date(),
        accessCount: 0,
        createdAt: new Date(),
        updatedAt: new Date(),
      })
    );
 
    // Merge with existing — update if contradicting, add if new
    this.mergeMemories(userId, newMemories);
 
    return newMemories;
  }
 
  private mergeMemories(
    userId: string,
    newMemories: SemanticMemory[]
  ): void {
    const existing = this.memories.get(userId) || [];
 
    for (const newMem of newMemories) {
      const conflict = existing.findIndex(
        e => e.category === newMem.category &&
             this.isContradiction(e.fact, newMem.fact)
      );
 
      if (conflict >= 0 && newMem.confidence > existing[conflict].confidence) {
        // Replace lower-confidence memory with higher-confidence one
        existing[conflict] = { ...newMem, updatedAt: new Date() };
      } else if (conflict < 0) {
        existing.push(newMem);
      }
    }
 
    this.memories.set(userId, existing);
  }
 
  private isContradiction(a: string, b: string): boolean {
    // Simplified — production systems use embedding similarity
    // to detect semantic overlap, then LLM to judge contradiction
    const normalize = (s: string) => s.toLowerCase().trim();
    return normalize(a).includes(normalize(b).split(' ')[0]);
  }
 
  getMemories(
    userId: string,
    category?: string
  ): SemanticMemory[] {
    const all = this.memories.get(userId) || [];
    if (category) {
      return all.filter(m => m.category === category);
    }
    return all;
  }
}

Notice the merge logic: when new information contradicts existing memories, the higher-confidence version wins. This prevents the classic problem where an outdated preference overrides a recent correction ("I actually moved to Portland last month — please stop sending things to Seattle").
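The conflict-resolution rule can be sketched in isolation. In this sketch, the Fact type and the trivially stubbed contradiction predicate are illustrative stand-ins for the embedding-plus-LLM check a production system would use:

```typescript
// Confidence-based conflict resolution: when two facts in the same
// category contradict, the higher-confidence fact wins.
interface Fact { fact: string; category: string; confidence: number }

function mergeFacts(
  existing: Fact[],
  incoming: Fact[],
  contradicts: (a: Fact, b: Fact) => boolean
): Fact[] {
  const out = [...existing];
  for (const f of incoming) {
    const i = out.findIndex(
      e => e.category === f.category && contradicts(e, f)
    );
    if (i >= 0 && f.confidence > out[i].confidence) {
      out[i] = f;        // newer, more confident fact replaces the old one
    } else if (i < 0) {
      out.push(f);       // genuinely new fact
    }                    // else: keep the existing, higher-confidence fact
  }
  return out;
}

const current = [{ fact: 'lives in Seattle', category: 'fact', confidence: 0.7 }];
const update = [{ fact: 'lives in Portland', category: 'fact', confidence: 0.95 }];
// Stubbed contradiction check; production delegates this to embeddings + LLM
const sameTopic = (a: Fact, b: Fact) =>
  a.fact.startsWith('lives in') && b.fact.startsWith('lives in');

console.log(mergeFacts(current, update, sameTopic)[0].fact); // lives in Portland
```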

Procedural Memory (How to Do Things)

Procedural memory captures learned processes and strategies — not what happened or what's true, but how to accomplish tasks. In cognitive science, this is the memory type that lets you ride a bike without thinking about it. For AI agents, it's the memory that captures successful problem-solving patterns.

Recent research like "Remember Me, Refine Me" (2025) demonstrates agents that evolve their procedures based on experience. An agent that has successfully resolved 50 billing disputes develops a procedural memory for the optimal resolution flow — check account status, verify charge, offer appropriate resolution based on customer tier.

Here's a practical implementation that records and retrieves successful action sequences:

typescript
interface Procedure {
  id: string;
  name: string;
  description: string;
  steps: ProcedureStep[];
  successRate: number;
  timesUsed: number;
  context: string;         // When to apply this procedure
  lastUsed: Date;
  createdAt: Date;
}
 
interface ProcedureStep {
  action: string;
  parameters?: Record<string, unknown>;
  expectedOutcome: string;
  fallback?: string;       // What to do if this step fails
}
 
class ProceduralMemory {
  private procedures: Procedure[] = [];
 
  // Record a successful action sequence as a procedure
  recordProcedure(
    name: string,
    description: string,
    steps: ProcedureStep[],
    context: string
  ): Procedure {
    const existing = this.procedures.find(p => p.name === name);
 
    if (existing) {
      // Reinforce existing procedure
      existing.timesUsed++;
      existing.successRate =
        (existing.successRate * (existing.timesUsed - 1) + 1) /
        existing.timesUsed;
      existing.lastUsed = new Date();
      return existing;
    }
 
    const procedure: Procedure = {
      id: randomUUID(),
      name,
      description,
      steps,
      successRate: 1.0,
      timesUsed: 1,
      context,
      lastUsed: new Date(),
      createdAt: new Date(),
    };
 
    this.procedures.push(procedure);
    return procedure;
  }
 
  // Record a failure to adjust success rate
  recordFailure(procedureId: string): void {
    const proc = this.procedures.find(p => p.id === procedureId);
    if (proc) {
      proc.timesUsed++;
      proc.successRate =
        (proc.successRate * (proc.timesUsed - 1)) / proc.timesUsed;
    }
  }
 
  // Find the best procedure for a given context
  findProcedure(context: string): Procedure | null {
    // Simple keyword matching — production uses embedding similarity
    const candidates = this.procedures.filter(p =>
      context.toLowerCase().includes(p.context.toLowerCase()) ||
      p.context.toLowerCase().includes(context.toLowerCase())
    );
 
    if (candidates.length === 0) return null;
 
    // Prefer high success rate, with recency as a secondary signal.
    // Recency is normalized over a 30-day window so it contributes
    // meaningfully (lastUsed / now would always be ≈1).
    const score = (p: Procedure): number => {
      const ageDays =
        (Date.now() - p.lastUsed.getTime()) / (1000 * 60 * 60 * 24);
      const recency = Math.max(0, 1 - ageDays / 30);
      return p.successRate * 0.7 + recency * 0.3;
    };
    return candidates.sort((a, b) => score(b) - score(a))[0];
  }
 
  // Format procedure as instructions for the LLM
  toPromptInstructions(procedure: Procedure): string {
    const steps = procedure.steps
      .map((s, i) => `${i + 1}. ${s.action}${
        s.fallback ? ` (if this fails: ${s.fallback})` : ''
      }`)
      .join('\n');
 
    return `Recommended approach (${Math.round(procedure.successRate * 100)}% success rate, used ${procedure.timesUsed} times):
${procedure.description}
 
Steps:
${steps}`;
  }
}

Procedural memory is the least commonly implemented of the four types, but it's arguably the most powerful for agents that handle repeated workflows. Instead of figuring out the billing dispute resolution process from scratch every time, the agent recalls: "Last 47 times this happened, here's what worked."
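The success-rate bookkeeping in recordProcedure and recordFailure is a running average over binary outcomes; a standalone sketch makes the arithmetic easy to verify:

```typescript
// Running-average update for procedure success rate.
// After n trials at rate r, a success gives (r*n + 1)/(n + 1);
// a failure gives (r*n + 0)/(n + 1) — the same formulas used above.
function updateRate(
  rate: number,
  timesUsed: number,
  success: boolean
): { rate: number; timesUsed: number } {
  const n = timesUsed + 1;
  return {
    rate: (rate * timesUsed + (success ? 1 : 0)) / n,
    timesUsed: n,
  };
}

let state = { rate: 1.0, timesUsed: 1 };                 // first recorded success
state = updateRate(state.rate, state.timesUsed, true);   // 2 successes / 2 uses
state = updateRate(state.rate, state.timesUsed, false);  // 2 / 3
state = updateRate(state.rate, state.timesUsed, true);   // 3 / 4
console.log(state.timesUsed, state.rate.toFixed(2)); // 4 0.75
```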

Memory Architectures: From Simple to Production

Now that we understand the memory types, how do you actually structure a memory system? Three architectures dominate production systems, each with different tradeoffs around complexity, cost, and retrieval quality. Most production deployments combine multiple approaches.

Buffer Memory

Buffer memory is the simplest architecture — a sliding window of recent messages passed directly as context to the LLM. No retrieval, no embeddings, no external storage. You already saw this in the working memory implementation above.

It works well for short, focused interactions. The problem surfaces when conversations get long or span multiple sessions: the oldest context silently disappears as new messages push it out of the buffer.

A common refinement is the windowed buffer with token awareness:

typescript
class TokenAwareBuffer {
  private messages: Message[] = [];
  private maxTokens: number;
 
  constructor(maxTokens: number = 4000) {
    this.maxTokens = maxTokens;
  }
 
  add(message: Message): void {
    this.messages.push(message);
    this.trim();
  }
 
  private trim(): void {
    let totalTokens = this.estimateTokens(this.messages);
    while (totalTokens > this.maxTokens && this.messages.length > 1) {
      this.messages.shift();
      totalTokens = this.estimateTokens(this.messages);
    }
  }
 
  private estimateTokens(msgs: Message[]): number {
    return msgs.reduce(
      (sum, m) => sum + Math.ceil(m.content.length / 4) + 4, // +4 for role tokens
      0
    );
  }
 
  getMessages(): Message[] {
    return [...this.messages];
  }
}

When to use buffer memory: Prototyping, single-session interactions, and as the working memory layer within a larger system. Don't use it alone if your agent needs to remember anything between sessions.

Summary Memory

Summary memory addresses the buffer's main weakness by compressing old messages into summaries before discarding them. Instead of losing information entirely, the system condenses it into a shorter representation that captures the essential points.

The idea is straightforward: when the buffer fills up, summarize the oldest messages, replace them with the summary, and continue. The LLM sees a compressed version of history plus the recent full messages.

Here's how to build one that progressively summarizes as conversations grow:

typescript
class SummaryMemory {
  private recentMessages: Message[] = [];
  private summary: string = '';
  private maxRecentMessages: number;
  private openai: OpenAI;
 
  constructor(apiKey: string, maxRecentMessages: number = 10) {
    this.openai = new OpenAI({ apiKey });
    this.maxRecentMessages = maxRecentMessages;
  }
 
  async add(message: Message): Promise<void> {
    this.recentMessages.push(message);
 
    if (this.recentMessages.length > this.maxRecentMessages) {
      // Take the oldest messages and summarize them
      const toSummarize = this.recentMessages.splice(
        0,
        Math.floor(this.maxRecentMessages / 2)
      );
      await this.updateSummary(toSummarize);
    }
  }
 
  private async updateSummary(messages: Message[]): Promise<void> {
    const conversation = messages
      .map(m => `${m.role}: ${m.content}`)
      .join('\n');
 
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',  // Cheaper model for summarization
      temperature: 0,
      messages: [
        {
          role: 'system',
          content: `Progressively summarize the conversation, adding to the existing summary.
Include: key facts, user preferences, unresolved issues, action items, and any commitments made.
Be concise but don't drop important details.`
        },
        {
          role: 'user',
          content: `Existing summary:\n${this.summary || '(none yet)'}\n\nNew messages:\n${conversation}`
        }
      ]
    });
 
    this.summary = response.choices[0].message.content || this.summary;
  }
 
  getContext(): { summary: string; recentMessages: Message[] } {
    return {
      summary: this.summary,
      recentMessages: [...this.recentMessages],
    };
  }
 
  // Format for injection into LLM prompt
  toPromptContext(): string {
    const parts: string[] = [];
 
    if (this.summary) {
      parts.push(`Previous conversation summary:\n${this.summary}`);
    }
 
    if (this.recentMessages.length > 0) {
      parts.push('Recent messages:');
      for (const msg of this.recentMessages) {
        parts.push(`${msg.role}: ${msg.content}`);
      }
    }
 
    return parts.join('\n\n');
  }
}

Summary memory makes a tradeoff: you preserve the gist of old conversations at the cost of specific details. The customer's exact words about their daughter's birthday might get summarized to "customer has a time-sensitive delivery" — which captures the urgency but loses the personal context. For many applications, that's an acceptable tradeoff. For others, you need the next architecture.

Vector-Based Retrieval Memory

Vector retrieval memory stores every memory as a vector embedding and retrieves entries by semantic similarity to the current query. Instead of keeping everything in a buffer or summarizing down to a fixed size, you search through the full memory store for what's actually relevant.

This is where memory intersects with RAG — the same embedding and retrieval techniques covered in RAG from Scratch apply directly. The difference is the data source: RAG retrieves from documents; memory retrieves from the agent's own experience.

Here's a complete implementation with cosine similarity search:

typescript
class VectorMemoryStore {
  private entries: Array<{
    id: string;
    text: string;
    embedding: number[];
    metadata: Record<string, unknown>;
    timestamp: Date;
  }> = [];
  private openai: OpenAI;
 
  constructor(apiKey: string) {
    this.openai = new OpenAI({ apiKey });
  }
 
  async store(
    text: string,
    metadata: Record<string, unknown> = {}
  ): Promise<string> {
    const embedding = await this.embed(text);
    const id = randomUUID();
 
    this.entries.push({
      id,
      text,
      embedding,
      metadata,
      timestamp: new Date(),
    });
 
    return id;
  }
 
  async search(
    query: string,
    topK: number = 5,
    filter?: (meta: Record<string, unknown>) => boolean
  ): Promise<Array<{ text: string; score: number; metadata: Record<string, unknown> }>> {
    const queryEmbedding = await this.embed(query);
 
    let candidates = this.entries;
    if (filter) {
      candidates = candidates.filter(e => filter(e.metadata));
    }
 
    const scored = candidates.map(entry => ({
      text: entry.text,
      score: this.cosineSimilarity(queryEmbedding, entry.embedding),
      metadata: entry.metadata,
    }));
 
    return scored
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
 
  private async embed(text: string): Promise<number[]> {
    const response = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  }
 
  private cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
 
  // Decay old memories — reduce their retrieval priority over time
  applyDecay(halfLifeDays: number = 30): void {
    const now = Date.now();
    for (const entry of this.entries) {
      const ageDays =
        (now - entry.timestamp.getTime()) / (1000 * 60 * 60 * 24);
      const decayFactor = Math.pow(0.5, ageDays / halfLifeDays);
      // Store decay factor in metadata for retrieval scoring
      entry.metadata._decayFactor = decayFactor;
    }
  }
}

In production, you'd replace the in-memory store with a vector database — Pinecone, Qdrant, pgvector, or Weaviate. The API surface is essentially the same: embed, store, search. The vector database handles efficient approximate nearest neighbor search at scale, which matters once you have thousands or millions of memory entries.

The decay mechanism deserves attention. Without it, ancient memories compete equally with recent ones during retrieval. The applyDecay method implements exponential decay with a configurable half-life — a memory from 30 days ago scores at 50% of its original relevance. Mem0 does something similar, calling it "dynamic forgetting," which helps keep retrieved context current and relevant.
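The half-life math is compact enough to verify by hand. In this sketch, combining the decay factor with a similarity score at retrieval time is an illustrative assumption, not part of applyDecay itself:

```typescript
// Exponential decay with a 30-day half-life: a memory's weight halves
// every 30 days, matching the applyDecay method above.
function decayFactor(ageDays: number, halfLifeDays: number = 30): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

console.log(decayFactor(0));   // 1     (brand new memory, full weight)
console.log(decayFactor(30));  // 0.5   (one half-life old)
console.log(decayFactor(90));  // 0.125 (three half-lives old)

// Applied at retrieval time: similarity * decay gives the final score
const similarity = 0.8;
const finalScore = similarity * decayFactor(60); // 0.8 * 0.25 = 0.2
```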

Putting It All Together: A Unified Memory System

A production memory system doesn't use just one architecture — it combines all four memory types into a unified layer that the agent queries before every response. The key is making this transparent to the application code: the agent asks "what do I know about this user and this situation?" and gets back a curated context block.

Here's a unified memory manager that orchestrates the pieces:

typescript
interface MemoryContext {
  workingMemory: Message[];
  relevantEpisodes: Episode[];
  semanticFacts: SemanticMemory[];
  suggestedProcedure: Procedure | null;
  summary: string;
}
 
class UnifiedMemoryManager {
  private working: WorkingMemory;
  private episodic: EpisodicMemory;
  private semantic: SemanticMemoryExtractor;
  private procedural: ProceduralMemory;
  private vectorStore: VectorMemoryStore;
  private summaryMemory: SummaryMemory;
 
  constructor(apiKey: string) {
    this.working = new WorkingMemory(20);
    this.episodic = new EpisodicMemory();
    this.semantic = new SemanticMemoryExtractor(apiKey);
    this.procedural = new ProceduralMemory();
    this.vectorStore = new VectorMemoryStore(apiKey);
    this.summaryMemory = new SummaryMemory(apiKey);
  }
 
  // Called on every user message
  async processMessage(
    userId: string,
    sessionId: string,
    message: Message
  ): Promise<void> {
    // Update working memory
    this.working.add(message);
 
    // Update summary
    await this.summaryMemory.add(message);
 
    // Store in vector memory for future retrieval
    await this.vectorStore.store(message.content, {
      userId,
      sessionId,
      role: message.role,
      timestamp: message.timestamp.toISOString(),
    });
  }
 
  // Build full context for LLM prompt
  async buildContext(
    userId: string,
    currentQuery: string
  ): Promise<MemoryContext> {
    // Parallel retrieval for speed
    const [vectorResults, episodes, facts] = await Promise.all([
      this.vectorStore.search(currentQuery, 5, (meta) =>
        meta.userId === userId
      ),
      Promise.resolve(this.episodic.getRelevant(userId, 3)),
      Promise.resolve(this.semantic.getMemories(userId)),
    ]);
 
    // Find applicable procedure
    const procedure = this.procedural.findProcedure(currentQuery);
 
    const { summary, recentMessages } = this.summaryMemory.getContext();
 
    return {
      workingMemory: recentMessages,
      relevantEpisodes: episodes,
      semanticFacts: facts,
      suggestedProcedure: procedure,
      summary,
    };
  }
 
  // Format context for injection into system prompt
  formatForPrompt(context: MemoryContext): string {
    const sections: string[] = [];
 
    if (context.summary) {
      sections.push(
        `## Conversation History\n${context.summary}`
      );
    }
 
    if (context.semanticFacts.length > 0) {
      const facts = context.semanticFacts
        .map(f => `- ${f.fact} (confidence: ${f.confidence})`)
        .join('\n');
      sections.push(`## What You Know About This User\n${facts}`);
    }
 
    if (context.relevantEpisodes.length > 0) {
      const episodes = context.relevantEpisodes
        .map(e => `- [${e.timestamp.toLocaleDateString()}] ${e.event}${
          e.outcome ? ` (outcome: ${e.outcome})` : ''
        }`)
        .join('\n');
      sections.push(`## Relevant Past Interactions\n${episodes}`);
    }
 
    if (context.suggestedProcedure) {
      sections.push(
        `## Suggested Approach\n${this.procedural.toPromptInstructions(
          context.suggestedProcedure
        )}`
      );
    }
 
    return sections.join('\n\n');
  }
 
  // End-of-session consolidation
  async consolidateSession(
    userId: string,
    sessionId: string,
    conversation: string
  ): Promise<void> {
    // Extract semantic memories from the full conversation
    await this.semantic.extractFromConversation(
      userId,
      conversation,
      sessionId
    );
 
    // Store key events as episodes
    // (In production, use LLM to identify notable events)
    await this.episodic.store({
      userId,
      sessionId,
      event: `Conversation session ${sessionId}`,
      context: conversation.slice(0, 500),
      importance: 5,
      timestamp: new Date(),
      tags: ['conversation'],
    });
 
    // Clear working memory
    this.working.clear();
  }
}

The buildContext method runs retrieval in parallel — vector search, episodic lookup, and semantic fact retrieval happen simultaneously. This keeps latency manageable even with multiple memory sources. In production, the formatForPrompt output goes into the system message, giving the LLM everything it needs to respond with full context.

The consolidateSession method runs when a conversation ends. It's the bridge between ephemeral working memory and persistent long-term storage — extracting the valuable signal from raw conversation and storing it where future sessions can find it.

Illustration: a customer-service agent's memory panel. Four memories recalled for Sarah Chen (Premium tier): last call two days ago, a preference for email follow-up, and a session memory ("Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday.") surfaced at 85% relevance.

Where Memory Meets RAG

Memory and RAG share infrastructure — embeddings, vector stores, similarity search — but serve fundamentally different purposes. Understanding the boundary prevents architectural confusion and helps you build systems where both work together effectively.

RAG retrieves from organizational knowledge: product documentation, FAQs, policy documents, knowledge base articles. This information exists independently of any particular user or conversation. It's the same for everyone.

Memory retrieves from experiential knowledge: what happened in past conversations, what this specific user prefers, how similar situations resolved before. This information is generated through interaction and is unique to each user or agent.

The practical overlap looks like this:

User Query → Retrieval Layer → Vector Database → RAG Results (product docs, FAQs, policies) + Memory Results (past conversations, preferences, episodes) → Combined Context → LLM Generation
Memory and RAG: shared infrastructure, different data sources

In production, a single vector database often hosts both. The namespace or collection separates them: RAG documents live in one collection, memory entries in another. The retrieval layer queries both, and the results get merged into a single context block.

Here's how that merger works in practice:

typescript
interface RetrievalResult {
  text: string;
  score: number;
  source: 'rag' | 'memory';
  metadata: Record<string, unknown>;
}
 
async function hybridRetrieval(
  query: string,
  userId: string,
  ragStore: VectorMemoryStore,
  memoryStore: VectorMemoryStore,
  options: {
    ragTopK?: number;
    memoryTopK?: number;
    ragWeight?: number;
    memoryWeight?: number;
  } = {}
): Promise<RetrievalResult[]> {
  const {
    ragTopK = 3,
    memoryTopK = 3,
    ragWeight = 0.5,
    memoryWeight = 0.5,
  } = options;
 
  const [ragResults, memResults] = await Promise.all([
    ragStore.search(query, ragTopK),
    memoryStore.search(query, memoryTopK, (meta) =>
      meta.userId === userId
    ),
  ]);
 
  const combined: RetrievalResult[] = [
    ...ragResults.map(r => ({
      ...r,
      score: r.score * ragWeight,
      source: 'rag' as const,
    })),
    ...memResults.map(r => ({
      ...r,
      score: r.score * memoryWeight,
      source: 'memory' as const,
    })),
  ];
 
  // Sort by weighted score, interleave sources
  return combined.sort((a, b) => b.score - a.score);
}

The weighting between RAG and memory results depends on the use case. Customer support agents might weight memory higher (the customer's history matters more than generic docs). A technical support agent might weight RAG higher (the answer is in the documentation, memory provides context). If you're working with knowledge bases, Chanl's memory and knowledge base features handle this retrieval orchestration, letting you configure the balance per agent.

Production Memory Architectures

Several open-source and commercial systems have emerged specifically for agent memory. Understanding their approaches helps you decide whether to build or integrate — and what patterns to borrow if you build your own.

MemGPT / Letta: OS-Inspired Tiering

The MemGPT paper (arXiv:2310.08560) introduced the idea of treating the LLM's context window like an operating system treats RAM. Just as an OS pages data between fast memory and disk, MemGPT pages information between the LLM's context (core memory) and external storage (archival memory).

Letta, the production system built from the MemGPT research, implements three tiers:

  • Core memory — Always in the context window. Contains the agent's persona, key facts about the current user, and active conversation state. Size-limited, like RAM.
  • Recall memory — Searchable conversation history. The agent can query past sessions by issuing memory retrieval function calls. Analogous to a disk cache.
  • Archival memory — Long-term storage for any information the agent wants to preserve. Backed by a vector database. Analogous to disk.

The breakthrough insight is that the agent manages its own memory through tool calls. Instead of a separate system deciding what to store, the agent itself calls core_memory_append, archival_memory_insert, or conversation_search as tools. This gives the agent agency over what it remembers — a form of metacognition.
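The tool-call pattern can be sketched as function definitions handed to the model. The tool names mirror MemGPT's memory functions, but the schema shape here is a generic OpenAI-style tool definition, not Letta's actual API:

```typescript
// Sketch: exposing memory operations as tools the agent can call.
// Names follow MemGPT's conventions; the JSON Schema shape is a
// generic function-tool definition, not Letta's real interface.
interface MemoryTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for arguments
}

const memoryTools: MemoryTool[] = [
  {
    name: 'core_memory_append',
    description: 'Append a fact to always-in-context core memory.',
    parameters: {
      type: 'object',
      properties: {
        section: { type: 'string', enum: ['persona', 'user'] },
        content: { type: 'string' },
      },
      required: ['section', 'content'],
    },
  },
  {
    name: 'archival_memory_insert',
    description: 'Store a memory in long-term archival storage.',
    parameters: {
      type: 'object',
      properties: { content: { type: 'string' } },
      required: ['content'],
    },
  },
  {
    name: 'conversation_search',
    description: 'Search past conversation history (recall memory).',
    parameters: {
      type: 'object',
      properties: { query: { type: 'string' } },
      required: ['query'],
    },
  },
];
```

When the model emits a call to one of these tools, your runtime executes the memory operation and feeds the result back — the agent decides *what* to remember, the runtime handles *how*.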

Zep: Temporal Knowledge Graphs

Zep takes a different approach entirely. Instead of flat vector retrieval, it builds a temporal knowledge graph from conversations — nodes represent entities (people, products, events), edges represent relationships, and every element carries temporal metadata showing when it was true.

Their January 2025 paper (arXiv:2501.13956) reports up to 18.5% accuracy improvements over baseline implementations and 90% lower latency. The temporal component is the key differentiator: Zep can answer "what was the customer's address before they moved?" because the graph preserves historical state, not just current state.

This matters for customer experience agents where context evolves over time. A customer's preferences, addresses, account details, and relationships change — a flat memory store overwrites history, while a temporal graph preserves the full timeline.
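The temporal idea can be sketched with validity windows on each fact. This illustrates the concept, not Zep's actual data model:

```typescript
// Sketch: a temporally-scoped fact. Instead of overwriting, asserting
// a new value closes the old edge's validity window and opens a new one.
interface TemporalFact {
  subject: string;      // e.g. a customer ID
  predicate: string;    // e.g. 'lives_at'
  object: string;       // e.g. an address
  validFrom: Date;
  validTo: Date | null; // null = currently true
}

function assertFact(
  history: TemporalFact[],
  fact: Omit<TemporalFact, 'validTo'>
): TemporalFact[] {
  // Close the currently-open edge for this subject/predicate
  const updated = history.map(f =>
    f.subject === fact.subject &&
    f.predicate === fact.predicate &&
    f.validTo === null
      ? { ...f, validTo: fact.validFrom }
      : f
  );
  return [...updated, { ...fact, validTo: null }];
}

// "What was true at time t?" — the query a flat store can't answer
function factAt(
  history: TemporalFact[],
  subject: string,
  predicate: string,
  t: Date
): string | undefined {
  return history.find(
    f =>
      f.subject === subject &&
      f.predicate === predicate &&
      f.validFrom <= t &&
      (f.validTo === null || f.validTo > t)
  )?.object;
}
```

Asserting a move from Seattle to Portland closes the Seattle edge and opens a Portland one, so `factAt` can answer both "where do they live now?" and "where did they live last year?".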

Mem0: Intelligent Extraction and Consolidation

Mem0 takes the approach closest to human memory consolidation. Instead of storing raw conversation chunks, it uses an LLM to extract meaningful memories from interactions, consolidate overlapping information, and decay irrelevant entries over time.

Their April 2025 paper reports 26% improvement in LLM-as-judge quality metrics over raw retrieval, 91% lower P95 latency, and over 90% token cost savings. The key is selectivity — not every sentence in a conversation deserves to become a memory. Mem0's extraction pipeline filters for information that's reusable across future interactions.

The architecture supports three memory scopes: user memory (persists across all sessions with a person), session memory (within a single conversation), and agent memory (specific to a particular agent instance). This mapping aligns naturally with the four memory types we built earlier.
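The scoping can be sketched as namespaced storage keys. The scope names follow Mem0's terminology; the key format itself is an illustrative choice:

```typescript
// Sketch: routing a memory to user, session, or agent scope by
// building a namespaced key. Scope names follow Mem0's terminology;
// the key scheme is illustrative, not Mem0's actual storage layout.
type MemoryScope = 'user' | 'session' | 'agent';

function memoryKey(
  scope: MemoryScope,
  ids: { userId?: string; sessionId?: string; agentId?: string }
): string {
  switch (scope) {
    case 'user':    return `user:${ids.userId}`;       // persists across sessions
    case 'session': return `session:${ids.sessionId}`; // single conversation
    case 'agent':   return `agent:${ids.agentId}`;     // per agent instance
  }
}
```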

Choosing an Architecture

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| Buffer | Prototypes, single-session | Loses context between sessions |
| Summary | Long conversations, cost-sensitive | Loses specific details |
| Vector retrieval | Cross-session recall, large history | Requires embedding infrastructure |
| OS-tiered (Letta) | Agents needing metacognition | Complexity, tool call overhead |
| Knowledge graph (Zep) | Temporal relationships, entity tracking | Graph infrastructure complexity |
| Intelligent extraction (Mem0) | Personalization at scale | Extraction quality varies |

Most teams should start with buffer + summary for working memory, add vector retrieval for cross-session recall, and only adopt knowledge graphs or full memory frameworks when the simpler approaches hit measurable limits.

Memory Consolidation: From Raw to Refined

Memory consolidation — the process of converting raw experiences into durable, retrievable knowledge — is where the magic happens. Without it, your memory store becomes an ever-growing pile of conversation fragments. With it, the agent genuinely learns and improves over time.

The consolidation pipeline runs asynchronously after each session, extracting structured knowledge from unstructured conversation:

typescript
interface ConsolidationResult {
  newFacts: SemanticMemory[];
  updatedFacts: SemanticMemory[];
  episodes: Episode[];
  procedures: Procedure[];
}
 
class MemoryConsolidator {
  private openai: OpenAI;
 
  constructor(apiKey: string) {
    this.openai = new OpenAI({ apiKey });
  }
 
  async consolidate(
    userId: string,
    sessionId: string,
    transcript: string,
    existingFacts: SemanticMemory[]
  ): Promise<ConsolidationResult> {
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0.1,
      response_format: { type: 'json_object' },
      messages: [
        {
          role: 'system',
          content: `Analyze this conversation and extract structured memory.
 
Return JSON with:
{
  "facts": [{ "text": "...", "confidence": 0-1, "category": "preference|fact|behavior", "updates": "id-of-existing-fact-if-updating|null" }],
  "episodes": [{ "event": "...", "importance": 1-10, "outcome": "..." }],
  "procedures": [{ "name": "...", "description": "...", "steps": ["step1", "step2"], "context": "when-to-use" }]
}
 
Existing facts about this user:
${existingFacts.map(f => `[${f.id}] ${f.fact}`).join('\n') || 'None'}
 
Rules:
- Update existing facts when new information supersedes them
- Only extract episodes that are noteworthy (importance >= 5)
- Only extract procedures from successful resolution patterns
- Confidence 0.9+ for explicit statements, 0.5-0.8 for inferences`
        },
        { role: 'user', content: transcript }
      ]
    });
 
    const parsed = JSON.parse(
      response.choices[0].message.content || '{}'
    );
 
    // Transform into typed results
    const result: ConsolidationResult = {
      newFacts: [],
      updatedFacts: [],
      episodes: [],
      procedures: [],
    };
 
    for (const fact of parsed.facts || []) {
      const memory: SemanticMemory = {
        id: fact.updates || randomUUID(),
        userId,
        fact: fact.text,
        confidence: fact.confidence,
        source: sessionId,
        category: fact.category,
        lastAccessed: new Date(),
        accessCount: 0,
        createdAt: new Date(),
        updatedAt: new Date(),
      };
 
      if (fact.updates) {
        result.updatedFacts.push(memory);
      } else {
        result.newFacts.push(memory);
      }
    }
 
    for (const ep of parsed.episodes || []) {
      result.episodes.push({
        id: randomUUID(),
        userId,
        sessionId,
        event: ep.event,
        context: transcript.slice(0, 200),
        outcome: ep.outcome,
        importance: ep.importance,
        timestamp: new Date(),
        tags: [],
      });
    }
 
    return result;
  }
}

This consolidation pipeline mirrors what cognitive scientists call "memory consolidation during sleep" — the brain replays experiences and extracts patterns. The agent's version runs after each session, distilling raw conversation into the three long-term memory types: semantic facts, episodic events, and procedural knowledge.

Scoring and Ranking Retrieved Memories

Retrieving memories is only half the problem. The other half is ranking them — deciding which memories deserve to occupy the limited space in the LLM's context window. The Stanford generative agents paper established the standard approach: a weighted combination of recency, importance, and relevance.

Here's a production-grade retrieval scorer:

typescript
interface ScoredMemory {
  content: string;
  recencyScore: number;
  importanceScore: number;
  relevanceScore: number;
  finalScore: number;
  source: 'episodic' | 'semantic' | 'vector';
}
 
function scoreMemories(
  candidates: Array<{
    content: string;
    timestamp: Date;
    importance: number;       // 1-10
    similarityScore: number;  // 0-1 (from vector search)
    source: 'episodic' | 'semantic' | 'vector';
  }>,
  weights: {
    recency: number;
    importance: number;
    relevance: number;
  } = { recency: 0.3, importance: 0.3, relevance: 0.4 }
): ScoredMemory[] {
  const now = Date.now();
 
  return candidates
    .map(candidate => {
      // Recency: exponential decay, most recent = 1.0
      const age = now - candidate.timestamp.getTime();
      const recencyScore = Math.exp(-age / (7 * 24 * 60 * 60 * 1000)); // 7-day decay constant: score falls to 1/e after a week
 
      // Importance: normalize to 0-1
      const importanceScore = candidate.importance / 10;
 
      // Relevance: already 0-1 from vector similarity
      const relevanceScore = candidate.similarityScore;
 
      const finalScore =
        weights.recency * recencyScore +
        weights.importance * importanceScore +
        weights.relevance * relevanceScore;
 
      return {
        content: candidate.content,
        recencyScore,
        importanceScore,
        relevanceScore,
        finalScore,
        source: candidate.source,
      };
    })
    .sort((a, b) => b.finalScore - a.finalScore);
}

The weight distribution matters. For customer support agents, relevance should dominate (0.5+) because the agent needs to surface contextually appropriate memories. For personal assistant agents, recency matters more — the user's recent requests override older patterns. For compliance-sensitive applications, importance should be weighted higher to ensure critical facts (like consent status or account restrictions) always surface.

If you're building prompt systems that integrate memory context, the scoring weights become part of your prompt engineering — they determine what the model sees and therefore how it responds.
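Those guidelines can be captured as named presets. The specific numbers are illustrative starting points to tune against your evals, not benchmarked values:

```typescript
// Sketch: per-agent-type scoring weight presets. The numbers are
// illustrative starting points, meant to be tuned against evals.
type ScoringWeights = { recency: number; importance: number; relevance: number };

const weightPresets: Record<string, ScoringWeights> = {
  customerSupport:   { recency: 0.2, importance: 0.3, relevance: 0.5 }, // relevance dominates
  personalAssistant: { recency: 0.5, importance: 0.2, relevance: 0.3 }, // recent requests win
  compliance:        { recency: 0.1, importance: 0.5, relevance: 0.4 }, // critical facts surface
};

// Each preset should sum to 1 so final scores stay comparable
for (const [name, w] of Object.entries(weightPresets)) {
  const total = w.recency + w.importance + w.relevance;
  if (Math.abs(total - 1) > 1e-9) {
    throw new Error(`preset ${name} weights must sum to 1`);
  }
}
```

Passing `weightPresets.customerSupport` into `scoreMemories` gives the relevance-heavy behavior described above; swapping presets changes ranking without touching the scorer.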

Privacy-First Memory Design

Memory creates a tension: the more your agent remembers, the more useful it becomes, and the more privacy risk it carries. Building memory without privacy controls isn't just a regulatory problem — it's a trust problem. When 82% of consumers see AI data handling as a serious threat, getting this right is a competitive advantage.

Spain's data protection authority (AEPD) published a 71-page guide in February 2026 specifically addressing AI agent memory risks. They identify four critical dimensions: relevance (what gets stored must be controlled), consistency (stored data must be accurate), retention (data must not persist beyond necessity), and integrity (stored information must resist manipulation).

What to Store and What Not To

Not all conversation content deserves to become a memory. The principle of data minimization — required by GDPR, HIPAA, and CCPA — means storing only what's adequate, relevant, and necessary for the stated purpose.

A practical classification framework:

typescript
type MemoryTier = 'transient' | 'short-term' | 'long-term' | 'never-store';
 
interface MemoryClassification {
  tier: MemoryTier;
  retentionDays: number | null;  // null = until explicit deletion
  requiresConsent: boolean;
  piiCategory?: string;
}
 
function classifyForStorage(content: string): MemoryClassification {
  // Tier 1: Never store — sensitive PII, health, financial details
  const neverStorePatterns = [
    /\b\d{3}-?\d{2}-?\d{4}\b/,              // SSN
    /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/, // Credit card
    /\b(?:password|ssn|social security)\b/i,
  ];
 
  if (neverStorePatterns.some(p => p.test(content))) {
    return {
      tier: 'never-store',
      retentionDays: 0,
      requiresConsent: false,
      piiCategory: 'sensitive',
    };
  }
 
  // Tier 2: Transient — current session only
  const transientPatterns = [
    /\b(?:hold on|one moment|let me check)\b/i,
    /\b(?:yes|no|okay|sure|thanks)\b/i,
  ];
 
  if (transientPatterns.some(p => p.test(content)) && content.length < 50) {
    return {
      tier: 'transient',
      retentionDays: 1,
      requiresConsent: false,
    };
  }
 
  // Tier 3: Short-term — operational retention (30-90 days)
  // Default for conversation content not matching other categories
  return {
    tier: 'short-term',
    retentionDays: 90,
    requiresConsent: false,
  };
}

This is a starting point — production systems use LLM-based classification for nuance that regex can't capture. The principle holds: classify before storing, apply retention policies automatically, and default to shorter retention when uncertain.
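A sketch of what LLM-based classification might look like. The `complete` parameter is an injected completion function (a hypothetical seam so the classifier is testable without a live API; in production it wraps a chat-completion call), and the prompt wording is illustrative:

```typescript
// Sketch: LLM-based storage classification. `complete` is injected so
// the classifier can be tested without a live API call. The tiers
// mirror the regex-based classifier above.
type Tier = 'never-store' | 'transient' | 'short-term' | 'long-term';

async function classifyWithLLM(
  content: string,
  complete: (prompt: string) => Promise<string> // hypothetical: wraps your LLM client
): Promise<Tier> {
  const prompt =
    `Classify this message for memory storage as one of: ` +
    `never-store (sensitive PII), transient (conversational filler), ` +
    `short-term (routine content), long-term (durable preference or fact).\n\n` +
    `Message: ${content}\n\nAnswer with the tier name only.`;
  const answer = (await complete(prompt)).trim().toLowerCase();
  const valid: Tier[] = ['never-store', 'transient', 'short-term', 'long-term'];
  // When the model's answer is malformed, default to the shortest retention
  return valid.includes(answer as Tier) ? (answer as Tier) : 'transient';
}
```

The fallback encodes the principle from above: when uncertain, default to shorter retention.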

Memory Deletion and the Right to Erasure

GDPR Article 17 grants individuals the right to request data erasure. For AI memory systems, this means you need to be able to delete all memories associated with a specific user without destroying other users' data or breaking the system.

The architecture decision that makes this possible: keep memory in retrieval databases (vector stores, key-value stores) rather than baking user data into fine-tuned models. You can delete a vector entry. You can't un-train a model.

typescript
class PrivacyAwareMemoryStore {
  private store: VectorMemoryStore;
 
  constructor(apiKey: string) {
    this.store = new VectorMemoryStore(apiKey);
  }
 
  async storeWithConsent(
    text: string,
    userId: string,
    consentBasis: 'explicit' | 'legitimate-interest' | 'contract',
    retentionDays: number
  ): Promise<string> {
    const classification = classifyForStorage(text);
 
    if (classification.tier === 'never-store') {
      // Log the rejection for audit, don't store the content
      console.log(`Rejected storage: sensitive content detected for user ${userId}`);
      return 'rejected';
    }
 
    const expiresAt = new Date();
    expiresAt.setDate(
      expiresAt.getDate() +
      Math.min(retentionDays, classification.retentionDays || retentionDays)
    );
 
    return this.store.store(text, {
      userId,
      consentBasis,
      retentionDays,
      expiresAt: expiresAt.toISOString(),
      storedAt: new Date().toISOString(),
    });
  }
 
  // Honor right to erasure — delete ALL memories for a user
  async deleteUserMemories(userId: string): Promise<number> {
    // In production, this queries the vector DB with a metadata filter
    // and deletes all matching entries
    let deleted = 0;
    // ... deletion logic against your vector database
    return deleted;
  }
 
  // Automated retention enforcement — run daily
  async enforceRetention(): Promise<number> {
    const now = new Date().toISOString();
    // Delete all entries where expiresAt < now
    let expired = 0;
    // ... expiration logic against your vector database
    return expired;
  }
}

The New America Foundation's 2025 report on AI agents and memory highlights a critical point: when agents interact with external services via protocols like MCP, personal data may flow to third parties without the user's awareness. Memory governance isn't just about what your agent stores — it's about what gets transmitted during tool execution and external API calls.

Audit Trails

Every memory operation — creation, access, modification, deletion — should produce an audit log. This isn't just compliance overhead. When a customer asks "why did your agent say that?", the audit trail tells you which memories informed the response.

typescript
interface MemoryAuditEntry {
  action: 'create' | 'read' | 'update' | 'delete';
  memoryId: string;
  userId: string;
  agentId: string;
  sessionId: string;
  timestamp: Date;
  reason: string;
  metadata?: Record<string, unknown>;
}
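A minimal sketch of an append-only log over entries shaped like `MemoryAuditEntry`. The class is generic over the entry type; in production, entries would go to durable, tamper-evident storage rather than an in-memory array:

```typescript
// Sketch: append-only audit log with the two lookups that matter most.
// Generic over the entry shape so it accepts MemoryAuditEntry-style
// records; in production, back this with durable storage.
class MemoryAuditLog<E extends { action: string; sessionId: string; userId: string }> {
  private entries: E[] = [];

  record(entry: E): void {
    this.entries.push(entry);
  }

  // "Why did the agent say that?" — memories read during a session
  readsForSession(sessionId: string): E[] {
    return this.entries.filter(
      e => e.sessionId === sessionId && e.action === 'read'
    );
  }

  // Evidence for an erasure request: every touch of a user's data
  historyForUser(userId: string): E[] {
    return this.entries.filter(e => e.userId === userId);
  }
}
```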

Production monitoring and analytics systems should track memory access patterns alongside conversation quality metrics. If an agent's quality score drops, checking which memories it retrieved (or failed to retrieve) is often the fastest path to diagnosis — a pattern covered in depth in How to Evaluate AI Agents.

Common Pitfalls and How to Avoid Them

Building memory systems teaches hard lessons. Here are the ones that cost the most time.

Storing Everything

The temptation is strong: disk is cheap, embeddings are cheap, so why not store every message? Because retrieval quality degrades when the memory store fills with noise. "Hello" and "can you hold on a second" don't need to be searchable memories. Filter before storing — a simple relevance threshold (embedding similarity to the conversation topic > 0.3) eliminates most noise.
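The filter can be sketched with plain cosine similarity against an embedding of the conversation topic. The 0.3 cutoff is the starting threshold from the text, to be tuned per embedding model:

```typescript
// Sketch: gate storage on cosine similarity between a message
// embedding and the conversation-topic embedding. The 0.3 threshold
// is a starting point to tune, not a universal constant.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function shouldStore(
  messageEmbedding: number[],
  topicEmbedding: number[],
  threshold = 0.3
): boolean {
  return cosineSimilarity(messageEmbedding, topicEmbedding) > threshold;
}
```

Filler like "hello" lands far from the topic embedding and never reaches the store; substantive messages pass.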

Ignoring Memory Conflicts

When a customer says "I moved to Portland" but their account still shows Seattle, you have a conflict. Naive memory systems store both, and the agent becomes confused — sometimes using one, sometimes the other. Always implement conflict resolution: newer high-confidence memories should update or override older conflicting ones.
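A sketch of that override rule. Conflict detection here is a naive same-category match; production systems would compare entities or ask an LLM, but the resolution logic keeps the same shape:

```typescript
// Sketch: newer high-confidence facts supersede older conflicting
// ones. Conflicts are detected by category match here; production
// systems would use entity comparison or an LLM judge.
interface Fact {
  fact: string;
  category: string;   // e.g. 'home_city'
  confidence: number; // 0-1
  updatedAt: Date;
}

function resolveConflict(existing: Fact, incoming: Fact): Fact {
  const isNewer = incoming.updatedAt > existing.updatedAt;
  const isConfident = incoming.confidence >= 0.7; // illustrative bar
  return isNewer && isConfident ? incoming : existing;
}

function upsertFact(store: Fact[], incoming: Fact): Fact[] {
  const i = store.findIndex(f => f.category === incoming.category);
  if (i === -1) return [...store, incoming];
  const winner = resolveConflict(store[i], incoming);
  return store.map((f, j) => (j === i ? winner : f));
}
```

With this in place, "I moved to Portland" replaces the Seattle fact instead of coexisting with it.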

Missing Temporal Context

"The customer prefers email" is less useful than "the customer said they prefer email on March 5, 2026, during a billing dispute." Temporal metadata makes memories auditable, debuggable, and deletable. Store timestamps on everything.

One-Size-Fits-All Retrieval

Different questions need different memory types. "What's this customer's preferred contact method?" needs semantic memory. "What happened last time they called?" needs episodic memory. "How should I handle a refund request?" needs procedural memory. Route queries to the appropriate memory type instead of searching everything uniformly.
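A rough keyword router illustrates the idea. Production routing would use an LLM classifier, but the output shape is the same:

```typescript
// Sketch: route a query to the memory type most likely to answer it.
// The keyword heuristics are illustrative; production systems would
// use an LLM classifier returning the same MemoryType.
type MemoryType = 'semantic' | 'episodic' | 'procedural';

function routeQuery(query: string): MemoryType {
  const q = query.toLowerCase();
  // Questions about past events → episodic
  if (/\b(last time|previous|happened|history|before)\b/.test(q)) {
    return 'episodic';
  }
  // How-to / process questions → procedural
  if (/\b(how (should|do) i|handle|process|procedure|steps)\b/.test(q)) {
    return 'procedural';
  }
  // Default: facts and preferences → semantic
  return 'semantic';
}
```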

Overlooking Memory in Eval

If you're evaluating your agent (and you should be — see how to build an eval framework), memory needs to be part of the eval. Test cases should cover: Does the agent correctly recall information from a previous session? Does it handle updated preferences? Does it avoid surfacing information the user asked to forget? Memory is a feature, and features need tests.
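Memory eval cases can be encoded as simple fixtures plus a checker. The case structure and contents here are illustrative:

```typescript
// Sketch: memory-specific eval fixtures. Each case seeds the store,
// asks a question, and lists what a response must (or must not)
// surface. Contents are illustrative.
interface MemoryEvalCase {
  name: string;
  seedMemories: string[]; // stored before the test turn
  query: string;
  mustInclude: string[];  // expected recall
  mustExclude: string[];  // e.g. erased or superseded content
}

const memoryEvalCases: MemoryEvalCase[] = [
  {
    name: 'recalls prior session',
    seedMemories: ['Customer ordered a blue kayak on 2026-03-01'],
    query: 'What did I order?',
    mustInclude: ['blue kayak'],
    mustExclude: [],
  },
  {
    name: 'honors updated preference',
    seedMemories: ['Prefers phone (2025)', 'Prefers email (2026)'],
    query: 'How should we contact you?',
    mustInclude: ['email'],
    mustExclude: ['phone'],
  },
];

function checkResponse(c: MemoryEvalCase, response: string): boolean {
  const r = response.toLowerCase();
  return (
    c.mustInclude.every(s => r.includes(s.toLowerCase())) &&
    c.mustExclude.every(s => !r.includes(s.toLowerCase()))
  );
}
```

Running these against the live agent catches regressions in recall, preference updates, and erasure handling before customers do.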

What's Next for Agent Memory

The field is moving fast. Three directions stand out.

Memory as a service. Dedicated memory layers — Mem0, Zep, Letta — are becoming infrastructure that agents connect to rather than build from scratch. The same way you don't build your own database, you may not need to build your own memory system. But understanding the internals helps you evaluate which approach fits your use case and debug when things go wrong.

Graph-based memory. Flat vector stores treat each memory as independent. Knowledge graphs capture relationships between memories — this customer is connected to that account, which uses this product, which had that issue. Zep's temporal knowledge graph approach reports significant accuracy improvements for tasks requiring cross-session synthesis. Expect graph-based memory to become the default for complex, relationship-heavy domains.

Memory governance standards. The EU AI Act becomes fully applicable in August 2026. Spain's AEPD has already published detailed guidance on agent memory. ICLR 2026 accepted a workshop specifically on "Memory for LLM-Based Agentic Systems." The regulatory and research communities are converging on the position that agent memory isn't just a feature — it's a trust and safety concern that needs governance frameworks.

Memory is what separates agents that answer questions from agents that build relationships. The technical foundations are straightforward: buffers for the current conversation, summaries for compression, vector stores for retrieval, and extraction pipelines for consolidation. The hard part is the design decisions — what to remember, what to forget, and how to surface the right context at the right moment. Start simple, measure what matters, and let your users' needs drive the complexity.

Build AI agents with persistent memory

Chanl handles memory management, retrieval, and privacy controls — so your agents remember what matters and forget what they should.

Start building free
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

One lesson a week — practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed

Frequently Asked Questions