
Build Your Own AI Agent Memory System (Then Learn What Breaks at Scale)

Build a complete memory system for customer-facing AI agents — session context, persistent recall, semantic search. Then learn what breaks when real customers start returning.

Dean Grover, Co-founder
March 10, 2026
20 min read
AI agent memory architecture with semantic search vectors

A customer calls your support line for the third time this week. Each time, they explain the same problem from scratch. Each time, the agent says "I'd be happy to help!" as if they've never spoken before.

Your agent has amnesia — and your customer has run out of patience.

This isn't a hypothetical. It's the default behavior of every AI agent without a memory system. The LLM powering your agent doesn't remember anything between API calls. Each conversation starts from absolute zero. For internal tools and one-off chatbots, that's fine. For customer-facing agents — support, sales, appointments, account management — it's a dealbreaker.

Memory is what separates a chatbot from a relationship. It's the difference between "How can I help you today?" and "Last time we spoke, we were working on getting your billing issue resolved — let me check the status." One of those builds trust. The other erodes it.

In this tutorial, you'll build a complete memory system from scratch: session context, persistent recall across conversations, and semantic search over everything. Real TypeScript, runnable code. Then we'll break it — deliberately — by walking through the five failure modes that hit every team shipping memory to production. By the end, you'll understand what memory does, how it works at every layer, and why most teams eventually stop building it themselves.

Prerequisites & Setup

You'll need Node.js 18+, an Anthropic API key (for summarization and extraction), and an OpenAI API key (for embeddings). Install both SDKs:

bash
npm install @anthropic-ai/sdk openai

Create a .env file:

text
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

We use Claude for conversational reasoning and memory extraction, and OpenAI's text-embedding-3-small for vector embeddings. If you haven't worked with embeddings before, the RAG from Scratch tutorial covers the fundamentals.

The Three Memory Problems

Every customer-facing agent faces three distinct memory challenges. They operate at different timescales, require different storage strategies, and break in different ways.

Problem 1: Session Continuity. The customer says "as I mentioned earlier" and the agent has no idea what "earlier" means. LLMs have finite context windows. Once the conversation exceeds that window, older messages get dropped. The customer mentioned their account number 40 messages ago — gone. They described the error message in detail at the start of the call — evicted to make room for recent turns.

Problem 2: Cross-Session Recall. The customer called last week about a billing issue. They're calling back today for a follow-up. The agent has zero context on the previous interaction. The customer repeats everything — their name, their problem, the steps they already tried, the resolution they were promised. Every returning customer starts from scratch.

Problem 3: Learned Preferences. Over time, patterns emerge. This customer always asks for Spanish-speaking agents. That one prefers email over phone. Another has a complex enterprise account setup that requires specific context every time. The agent never learns any of this. Each interaction is an island.

These three problems map to three memory layers, each operating at a different timescale:

  • Session Memory (minutes): current conversation context. 'As I mentioned earlier...'
  • Persistent Memory (days / weeks): facts extracted after conversations. 'Last time we discussed...'
  • Semantic Memory (months / forever): searchable, embedded knowledge. 'You prefer Spanish speakers...'

Three memory layers: from minutes to months

Let's build each one.

Build It: Session Memory

Session memory is the simplest layer — keeping track of the current conversation so the agent doesn't lose its thread halfway through.

The naive approach is a sliding window: store every message, and when you hit the token budget, drop the oldest ones. Here's a basic implementation:

typescript
interface Message {
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: number;
}
 
class SessionMemory {
  private messages: Message[] = [];
  private maxTokens: number;
 
  constructor(maxTokens = 4096) {
    this.maxTokens = maxTokens;
  }
 
  add(message: Message) {
    this.messages.push(message);
    this.trim();
  }
 
  getContext(): Message[] {
    return [...this.messages];
  }
 
  private trim() {
    // Simple sliding window — drop oldest messages when over budget
    while (this.estimateTokens() > this.maxTokens && this.messages.length > 2) {
      this.messages.shift();
    }
  }
 
  private estimateTokens(): number {
    // Rough estimate: ~4 characters per token
    return this.messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  }
}

This works for short conversations. But here's the problem: when the window fills up and you start dropping old messages, you lose information permanently. The customer mentioned their order number in message 3. By message 40, that order number is gone. The agent asks for it again. The customer sighs.

A smarter approach: summarization-based compression. Instead of silently dropping old messages, use the LLM to summarize them into a condensed context block. The raw messages disappear, but the key facts survive.

typescript
import Anthropic from '@anthropic-ai/sdk';
 
const anthropic = new Anthropic();
 
class SmartSessionMemory {
  private messages: Message[] = [];
  private summary: string | null = null;
  private maxTokens: number;
  private summaryThreshold: number;
 
  constructor(maxTokens = 4096, summaryThreshold = 3000) {
    this.maxTokens = maxTokens;
    this.summaryThreshold = summaryThreshold;
  }
 
  async add(message: Message) {
    this.messages.push(message);
 
    if (this.estimateTokens() > this.summaryThreshold) {
      await this.compress();
    }
  }
 
  getContext(): Message[] {
    const context: Message[] = [];
 
    if (this.summary) {
      context.push({
        role: 'system',
        content: `Previous conversation summary:\n${this.summary}`,
        timestamp: Date.now(),
      });
    }
 
    context.push(...this.messages);
    return context;
  }
 
  private async compress() {
    // Take the oldest half of messages and summarize them
    const splitPoint = Math.floor(this.messages.length / 2);
    const toSummarize = this.messages.slice(0, splitPoint);
    const toKeep = this.messages.slice(splitPoint);
 
    const formatted = toSummarize
      .map((m) => `${m.role}: ${m.content}`)
      .join('\n');
 
    const existing = this.summary
      ? `Existing summary:\n${this.summary}\n\nNew messages to incorporate:\n`
      : '';
 
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-5-20250929',
      max_tokens: 512,
      system:
        'Summarize this conversation excerpt into a concise paragraph. Preserve: customer name, account/order numbers, specific problems described, commitments made, and any action items. Drop pleasantries and filler.',
      messages: [{ role: 'user', content: `${existing}${formatted}` }],
    });
 
    this.summary =
      response.content[0].type === 'text' ? response.content[0].text : '';
    this.messages = toKeep;
  }
 
  private estimateTokens(): number {
    const messageTokens = this.messages.reduce(
      (sum, m) => sum + Math.ceil(m.content.length / 4),
      0
    );
    const summaryTokens = this.summary
      ? Math.ceil(this.summary.length / 4)
      : 0;
    return messageTokens + summaryTokens;
  }
}

The tradeoff is clear. Sliding window is fast and free but lossy — critical details vanish without warning. Summarization preserves key facts but costs an extra LLM call every time you compress. For a support agent handling a ten-minute call, that's maybe two compressions. For a voice agent on a 45-minute enterprise troubleshooting session, you might compress five or six times, and each compression introduces a small risk of losing a detail the LLM deemed unimportant.

Neither approach solves the cross-session problem. When this conversation ends, everything — raw messages and summaries alike — disappears. Tomorrow's call starts from zero. That's where persistent memory comes in.

Build It: Persistent Memory

This is where things get interesting for customer-facing agents. The customer calls back tomorrow. What do you remember about them?

Persistent memory stores facts about entities — customers, accounts, conversations — that outlast any single session. After each conversation, you extract the important bits and store them where future sessions can find them.

First, the storage layer:

typescript
import { randomUUID } from 'crypto';
 
interface MemoryEntry {
  id: string;
  entityType: 'customer' | 'session' | 'agent';
  entityId: string; // e.g., "customer_123"
  content: string; // "Customer prefers email communication"
  source: 'conversation' | 'manual' | 'extraction';
  createdAt: number;
  accessCount: number;
  lastAccessedAt: number;
  ttl?: number; // Expiration timestamp, optional
}
 
class PersistentMemory {
  private store: Map<string, MemoryEntry[]> = new Map();
 
  async save(
    entry: Omit<MemoryEntry, 'id' | 'accessCount' | 'lastAccessedAt'>
  ): Promise<MemoryEntry> {
    const full: MemoryEntry = {
      ...entry,
      id: randomUUID(),
      accessCount: 0,
      lastAccessedAt: Date.now(),
    };
 
    const key = `${entry.entityType}:${entry.entityId}`;
    const existing = this.store.get(key) || [];
    existing.push(full);
    this.store.set(key, existing);
 
    return full;
  }
 
  async recall(
    entityType: string,
    entityId: string
  ): Promise<MemoryEntry[]> {
    const key = `${entityType}:${entityId}`;
    const entries = this.store.get(key) || [];
 
    // Filter expired entries
    const now = Date.now();
    const valid = entries.filter((e) => !e.ttl || e.ttl > now);
 
    // Update access tracking
    valid.forEach((e) => {
      e.accessCount++;
      e.lastAccessedAt = now;
    });
 
    // Sort by recency and access frequency
    return valid.sort(
      (a, b) =>
        b.lastAccessedAt - a.lastAccessedAt ||
        b.accessCount - a.accessCount
    );
  }
 
  async deleteByEntity(entityType: string, entityId: string): Promise<number> {
    const key = `${entityType}:${entityId}`;
    const count = (this.store.get(key) || []).length;
    this.store.delete(key);
    return count;
  }
}

Storage is the easy part. The hard part is deciding what to store. You don't want to dump entire transcripts into memory — that's a knowledge base, not a memory system. You want the distilled facts: preferences, problems, commitments, account details.

The automatic extraction pattern uses the LLM itself to pull memorable facts from completed conversations:

typescript
import Anthropic from '@anthropic-ai/sdk';
 
const anthropic = new Anthropic();
 
async function extractMemories(
  conversation: Message[]
): Promise<string[]> {
  const formatted = conversation
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');
 
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 512,
    system: `Extract key facts worth remembering about this customer for future conversations.
 
Focus on:
- Stated preferences (language, communication channel, time zones)
- Account details mentioned (tier, plan, company name)
- Problems described and their resolution status
- Commitments made by the agent ("I'll follow up by Friday")
- Customer sentiment and any escalation triggers
- Action items that remain open
 
Return a JSON array of strings. Each string should be a self-contained fact.
Only include facts that would be useful if this customer contacts us again.
Do NOT include generic observations like "customer had a question."`,
    messages: [
      {
        role: 'user',
        content: `Extract memorable facts from this conversation:\n\n${formatted}`,
      },
    ],
  });
 
  const text =
    response.content[0].type === 'text' ? response.content[0].text : '[]';
 
  // Extract JSON array from response (handle markdown code blocks)
  const jsonMatch = text.match(/\[[\s\S]*\]/);
  return jsonMatch ? JSON.parse(jsonMatch[0]) : [];
}

Wire extraction into the end of every conversation:

typescript
async function onConversationEnd(
  customerId: string,
  conversation: Message[],
  memory: PersistentMemory
): Promise<void> {
  const facts = await extractMemories(conversation);
 
  for (const fact of facts) {
    await memory.save({
      entityType: 'customer',
      entityId: customerId,
      content: fact,
      source: 'extraction',
      createdAt: Date.now(),
      // Action items expire after 2 weeks; preferences last 90 days
      ttl: fact.toLowerCase().includes('follow up')
        ? Date.now() + 14 * 24 * 3600 * 1000
        : Date.now() + 90 * 24 * 3600 * 1000,
    });
  }
}

A few design decisions worth calling out:

Entity scoping. Memories belong to a specific customer, not a global pool. When customer_123 calls, you recall their memories — not a random assortment from across your entire customer base. This seems obvious, but plenty of early memory implementations got this wrong.

TTL (time-to-live). "Customer is waiting for a callback" is actionable today, useless in two weeks. "Customer prefers Spanish" is valid for months. Different facts expire at different rates. Without TTL, your memory store fills up with zombie facts that are technically true but operationally misleading.

Source tracking. Was this memory extracted automatically from a conversation, or did a human agent manually add it? Extraction is convenient but noisy — the LLM might extract something wrong. Manual entries carry higher confidence. Tracking the source lets you weight memories differently.

Deduplication. The customer mentions they prefer email in ten different conversations. Without dedup, you store ten near-identical memories. When you recall them, they crowd out other useful facts. We'll handle this properly in the semantic layer — but even at the persistent layer, a simple string-similarity check before insertion helps.
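That pre-insertion check can be sketched with token-level Jaccard similarity: plain set overlap, no API calls. The helper names and the 0.8 threshold below are illustrative, not part of the PersistentMemory class above:

```typescript
// Token-level Jaccard similarity: cheap set overlap, no API calls.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const intersection = [...tokensA].filter((t) => tokensB.has(t)).length;
  const union = new Set([...tokensA, ...tokensB]).size;
  return union === 0 ? 0 : intersection / union;
}

// Skip insertion when an existing memory for the same entity is near-identical.
function isDuplicate(
  candidate: string,
  existing: string[],
  threshold = 0.8
): boolean {
  return existing.some((e) => jaccardSimilarity(candidate, e) >= threshold);
}
```

This catches exact and near-exact rephrasings, but not the paraphrase problem (the "shipping issue" vs. "Order #4821 delayed" gap) — that needs the semantic layer.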

Build It: Semantic Search Memory

Here's where the real power shows up. You have hundreds of memories stored for a customer. The agent needs to find the relevant ones for the current conversation — not dump everything into the context window.

If a customer asks "Can you check on that shipping issue?" you need to find the memory about their delayed package, not the one about their billing preferences. Keyword matching won't cut it — the customer said "shipping issue" but the memory says "Order #4821 delayed, expected delivery pushed to March 15." No overlapping words, but clearly the right memory.

Semantic search solves this with embeddings. Convert text into vectors — numerical representations of meaning — and find memories whose vectors are closest to the current query.

typescript
import OpenAI from 'openai';
 
const openai = new OpenAI();
 
async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}
 
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
 
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
 
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Those two functions — embed and compare — are the foundation of every semantic search system. If you've read RAG from Scratch, this will look familiar. RAG retrieves from organizational knowledge (docs, FAQs). Memory retrieves from experiential knowledge (past conversations, preferences). Same infrastructure, different data.

Now build the semantic memory layer:

typescript
interface MemorySearchResult extends MemoryEntry {
  score: number;
}
 
class SemanticMemory {
  private entries: Array<MemoryEntry & { embedding: number[] }> = [];
 
  async add(entry: MemoryEntry): Promise<void> {
    const embedding = await embedText(entry.content);
    this.entries.push({ ...entry, embedding });
  }
 
  async search(
    query: string,
    entityId: string,
    limit = 5
  ): Promise<MemorySearchResult[]> {
    const queryEmbedding = await embedText(query);
 
    return this.entries
      .filter((e) => e.entityId === entityId)
      .map((e) => ({
        ...e,
        score: cosineSimilarity(queryEmbedding, e.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .filter((e) => e.score > 0.3) // Minimum relevance threshold
      .slice(0, limit);
  }
 
  async update(id: string, content: string): Promise<void> {
    const entry = this.entries.find((e) => e.id === id);
    if (!entry) return;
 
    entry.content = content;
    entry.embedding = await embedText(content);
  }
 
  async deleteByEntity(entityId: string): Promise<number> {
    const before = this.entries.length;
    this.entries = this.entries.filter((e) => e.entityId !== entityId);
    return before - this.entries.length;
  }
}

The 0.3 threshold is critical. Without it, every query returns results — even when nothing in memory is relevant. A customer asking about pricing gets back their shipping memories at a 0.1 similarity score. That noise in the context window confuses the agent. Set a floor and enforce it.

Now combine all three layers into a unified context builder. This is the function that runs before every agent response, assembling everything the LLM needs to know:

typescript
async function buildAgentContext(
  sessionMemory: SmartSessionMemory,
  persistentMemory: PersistentMemory,
  semanticMemory: SemanticMemory,
  customerId: string,
  currentQuery: string
): Promise<string> {
  // 1. Current conversation (session memory handles its own compression)
  const conversation = sessionMemory.getContext();
 
  // 2. All persistent facts about this customer
  const customerFacts = await persistentMemory.recall('customer', customerId);
 
  // 3. Semantically relevant memories for the current query
  const relevant = await semanticMemory.search(currentQuery, customerId);
 
  // 4. Assemble context — most specific first
  const sections: string[] = [];
 
  if (customerFacts.length > 0) {
    sections.push(
      `## Customer Context\n${customerFacts
        .slice(0, 10) // Cap to avoid flooding the window
        .map((f) => `- ${f.content}`)
        .join('\n')}`
    );
  }
 
  if (relevant.length > 0) {
    sections.push(
      `## Relevant Past Interactions\n${relevant
        .map(
          (r) => `- (${(r.score * 100).toFixed(0)}% match) ${r.content}`
        )
        .join('\n')}`
    );
  }
 
  sections.push(
    `## Current Conversation\n${conversation
      .map((m) => `${m.role}: ${m.content}`)
      .join('\n')}`
  );
 
  return sections.join('\n\n');
}

That buildAgentContext output goes straight into the system message of your LLM call. The agent now knows who the customer is, what happened last time, and what's relevant to the current question. For the first time, it can say "I see we were working on your shipping delay last week — let me check the latest status" instead of asking the customer to start from scratch.

You've built the full stack. Session memory compresses the current conversation. Persistent memory stores facts across sessions. Semantic memory retrieves the right facts at the right time.

Ship it to production. Watch it break.

Watch It Break: Production Failure Modes

Everything above works beautifully in a demo. One developer, one test customer, a handful of memories, controlled inputs. Production is different. Here are five ways your memory system will fail when real customers start using it — and why each failure is harder to fix than it looks.

Memory Pollution

Your agent has a tough call. The customer is frustrated, maybe angry. The extraction pipeline dutifully records: "Customer expressed strong dissatisfaction with support experience. Customer was hostile and confrontational."

Three months later, the customer calls back about an unrelated question. The agent retrieves that old memory and opens with "I see you've had some frustrations with us in the past — I want to make sure we get this right for you." The customer, who had completely moved on, is now reminded of a bad experience. Or worse, the agent's tone shifts to be overly cautious and apologetic when the customer just wants a simple answer.

Memory pollution is stale emotional context poisoning future interactions. The fix isn't just TTL — some factual memories (account details, preferences) should persist indefinitely. You need a way to classify memories: factual vs. sentiment, actionable vs. historical. And you need decay logic that erodes sentiment memories faster than factual ones. None of that was in our simple implementation.
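One possible shape for that decay logic, sketched with exponential half-lives. The category field and the specific numbers are assumptions for illustration, not part of the MemoryEntry interface above:

```typescript
type MemoryCategory = 'factual' | 'sentiment';

// Illustrative half-lives: sentiment fades in weeks, facts persist much longer.
const HALF_LIFE_MS: Record<MemoryCategory, number> = {
  factual: 365 * 24 * 3600 * 1000,
  sentiment: 14 * 24 * 3600 * 1000,
};

// Exponential decay: a memory's retrieval weight halves every half-life.
function decayedWeight(
  category: MemoryCategory,
  createdAt: number,
  now: number = Date.now()
): number {
  const age = Math.max(0, now - createdAt);
  return Math.pow(0.5, age / HALF_LIFE_MS[category]);
}
```

Multiply this weight into the retrieval score and three-month-old frustration sinks below fresh factual context instead of leading the context window.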

Privacy and Compliance

A customer says: "I have a medical condition that affects my speech, so I might need extra time to respond."

Your extraction pipeline records this helpfully. Now you've got Protected Health Information in your memory store. Depending on your industry, that could be a HIPAA violation, a GDPR special-categories violation, or both.

It gets worse. Under GDPR's right to erasure, a customer can request deletion of all their data. Your deleteByEntity method handles the memory store, but what about the embeddings? The LLM summaries that mentioned the customer's details? The backup of your database from two weeks ago?

Privacy-aware memory needs content classification at ingestion time — detecting and flagging sensitive information before it hits persistent storage. It needs retention policies that vary by content category. It needs audit logs of who accessed what memory and when. Our implementation has none of this.
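As a rough illustration of ingestion-time screening, a keyword/regex pass like the sketch below catches only the obvious cases. The patterns are illustrative; a production system needs a real PII/PHI classifier and compliance review, not a keyword list:

```typescript
type SensitivityFlag = 'health' | 'financial' | 'none';

// Illustrative patterns only. A production system needs a proper
// PII/PHI classifier, not a keyword list.
const SENSITIVE_PATTERNS: Array<{ flag: SensitivityFlag; pattern: RegExp }> = [
  {
    flag: 'health',
    pattern: /\b(medical|diagnosis|condition|medication|disability)\b/i,
  },
  // Card-number-like runs of 13-16 digits, with optional spaces or dashes.
  { flag: 'financial', pattern: /\b(?:\d[ -]?){13,16}\b/ },
];

function classifySensitivity(content: string): SensitivityFlag {
  for (const { flag, pattern } of SENSITIVE_PATTERNS) {
    if (pattern.test(content)) return flag;
  }
  return 'none';
}
```

Anything flagged non-'none' can be rejected, redacted, or routed to a short-retention store before it ever reaches persistent memory.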

Cross-Agent Contamination

Your support agent stores a memory: "Customer mentioned they're evaluating competitor products." Your sales agent retrieves this in a later conversation and pivots to a retention pitch: "I hear you've been looking at alternatives — let me tell you about our new features."

The customer didn't bring this up. They feel surveilled. The support conversation was confidential, and now sales is using it against them.

This is the agent-scoping problem. In a multi-agent system, memories shouldn't flow freely between agents with different roles. Support memories are support context. Sales memories are sales context. Some memories can be shared (account details, billing status), but emotional context and competitive intelligence should be isolated by default.

Our implementation stores memories by entity type and entity ID, but there's no concept of which agent created the memory or which agents should be allowed to read it. Adding agent scoping after the fact means migrating every existing memory — and deciding, for each one, who gets access.
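A minimal sketch of what that scoping could look like, assuming a hypothetical ScopedEntry shape with an explicit allow-list (none of this exists in the classes built earlier):

```typescript
// Hypothetical shape: who wrote the memory, and who may read it.
interface ScopedEntry {
  content: string;
  createdByAgentId: string;
  // Explicit allow-list; empty means private to the creating agent.
  sharedWithAgentIds: string[];
}

function canRead(entry: ScopedEntry, readerAgentId: string): boolean {
  return (
    entry.createdByAgentId === readerAgentId ||
    entry.sharedWithAgentIds.includes(readerAgentId)
  );
}

function visibleTo(entries: ScopedEntry[], readerAgentId: string): ScopedEntry[] {
  return entries.filter((e) => canRead(e, readerAgentId));
}
```

Private-by-default with explicit sharing is the safer direction: a support memory only reaches the sales agent if someone deliberately put 'agent_sales' on the allow-list.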

Scale and Cost

Run the numbers. You have 10,000 customers. Each has an average of 50 memories. That's 500,000 embedded memory entries.

When customer_42 calls, semantic search compares their query embedding against all of customer_42's memories. That's maybe 50 comparisons — fast. But the embedding call itself takes 50-100ms. On a voice call with sub-300ms latency requirements, that's a significant chunk of your budget.

Now factor in the extraction pipeline. Every completed conversation triggers an LLM call to extract facts, then an embedding call for each extracted fact. At 200 conversations per day, that's 200 extraction calls (Claude Sonnet at ~$0.003 per call) plus maybe 600 embedding calls (text-embedding-3-small at ~$0.00002 per call). The per-call cost is tiny. But it adds up, and more importantly, our brute-force cosine similarity over an in-memory array doesn't scale. At 10,000 memories per customer — think a high-touch enterprise account manager — search latency becomes noticeable.

Production systems use approximate nearest neighbor (ANN) indexes — HNSW, IVF — that trade a small amount of accuracy for massive speed improvements. Vector databases like Pinecone, pgvector, or Qdrant handle this. Our in-memory Array.filter().map().sort() does not.

The Deduplication Nightmare

The customer mentions their shipping address in 15 conversations across three months. Each extraction produces a slightly different phrasing:

  • "Customer's shipping address is 123 Main St, Austin TX"
  • "Ships to 123 Main Street, Austin, Texas 78701"
  • "Delivery address: 123 Main St, Austin"
  • "Customer confirmed shipping to Main Street address in Austin"

Semantic search returns three or four of these for any address-related query. They all say the same thing but each one consumes context window space that could hold more useful information.

Deduplication sounds simple — just check if a similar memory exists before inserting. But "similar" is fuzzy. Exact string matching misses the variants above. Cosine similarity with a high threshold (0.95+) catches most duplicates but also merges legitimately different memories that happen to be semantically close. And what do you do when a memory updates? The customer moves. Now "Ships to 456 Oak Ave, Dallas TX" should replace the Austin address, not sit alongside it.

Production dedup needs: similarity-based duplicate detection before insertion, merge logic for near-duplicates, update/supersede logic for contradictory facts, and a way to distinguish "same fact, different phrasing" from "two different facts that sound alike." Our implementation doesn't attempt any of this.
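To make the supersede case concrete, here is a sketch of one piece of it, operating on precomputed embedding vectors. The 0.9 threshold and the "newer fact wins" rule are illustrative choices; real systems tune both per domain:

```typescript
interface EmbeddedFact {
  content: string;
  embedding: number[];
  createdAt: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// On insert: if an existing fact is a near-match, keep only the newer
// of the two (the update supersedes); otherwise append as a new fact.
function upsertFact(
  store: EmbeddedFact[],
  incoming: EmbeddedFact,
  supersedeThreshold = 0.9
): EmbeddedFact[] {
  const matchIdx = store.findIndex(
    (f) => cosine(f.embedding, incoming.embedding) >= supersedeThreshold
  );
  if (matchIdx === -1) return [...store, incoming];
  const kept =
    store[matchIdx].createdAt >= incoming.createdAt ? store[matchIdx] : incoming;
  return store.map((f, i) => (i === matchIdx ? kept : f));
}
```

Even this sketch shows the core tension: the same threshold that merges "123 Main St, Austin TX" rephrasings will also merge the Austin and Dallas addresses if their embeddings land too close together.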

What Production Actually Looks Like

Every problem above has a solution. The question is whether you want to build and maintain those solutions yourself, or use infrastructure that already handles them.

Here's what the Chanl memory system handles out of the box:

typescript
import { PlatformClient } from '@chanl-ai/platform-sdk';
 
const client = new PlatformClient({ apiKey: process.env.CHANL_API_KEY });
 
// Store a memory with TTL and agent scoping
await client.memory.create({
  entityType: 'customer',
  entityId: 'cust_123',
  content: 'Prefers Spanish-speaking agents. Account is on Business tier.',
  source: 'extraction',
  agentId: 'agent_support', // Scoped to support agent
  ttlSeconds: 90 * 24 * 3600, // 90-day expiration
});
 
// Semantic search — find relevant memories for this query
const results = await client.memory.search({
  entityType: 'customer',
  entityId: 'cust_123',
  query: 'What language does this customer prefer?',
  agentId: 'agent_support',
  includeAgentScoped: true,
  minScore: 0.3,
  limit: 5,
});
 
// results: [{ content: "Prefers Spanish-speaking agents...", score: 0.89, priority: 0.91 }]

Compare that to the DIY version. Same capability, but with the production infrastructure baked in:

  • Auto-embedding on creation (and re-embedding when content changes) — no separate embedding pipeline to maintain
  • Entity-scoped memories (customer, session, agent, conversation) — cross-tenant isolation by default
  • Agent-scoped access control — support agent can't see sales agent's memories unless explicitly shared
  • TTL with automatic expiration — no cron jobs cleaning up zombie memories
  • Access tracking (accessCount, lastAccessedAt) — built-in priority scoring for retrieval
  • Source tracking (conversation, manual, extraction) — confidence weighting by provenance
  • Bulk delete with filters — GDPR right-to-erasure in one API call
  • Cosine similarity with configurable thresholds — no brute-force in-memory search at scale

For teams building agent dashboards, the React hooks make memory visible to human operators:

typescript
import { useMemorySearch, useCreateMemory } from '@chanl-ai/platform-sdk/react';
 
function CustomerContext({ customerId, query }: {
  customerId: string;
  query: string;
}) {
  const { data: memories, isLoading } = useMemorySearch({
    entityType: 'customer',
    entityId: customerId,
    query,
    limit: 5,
  });
 
  const createMemory = useCreateMemory();
 
  if (isLoading) return <div>Loading context...</div>;
 
  return (
    <div className="space-y-2">
      <h3 className="text-sm font-medium">Customer Memory</h3>
      {memories?.map((m) => (
        <div key={m.id} className="flex items-center justify-between text-sm">
          <span>{m.content}</span>
          <span className="text-muted-foreground">
            {(m.score * 100).toFixed(0)}% match
          </span>
        </div>
      ))}
      {memories?.length === 0 && (
        <p className="text-sm text-muted-foreground">
          No relevant memories found
        </p>
      )}
    </div>
  );
}

Human agents can see what the AI remembers, correct wrong memories, and add context the extraction pipeline missed. That feedback loop is how memory systems get better over time — which brings us back to the monitoring question. Memory without observability is a black box you can't improve.

[Dashboard mockup: a human agent's customer-memory panel showing "4 memories recalled" for Sarah Chen (Premium tier; last call 2 days ago; prefers email follow-up), plus a session memory at 85% relevance: "Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday."]

When to Build vs. Buy

Not every project needs a managed memory system. Here's the honest breakdown.

Build it yourself when:

  • You're learning. This tutorial exists for a reason — understanding the internals makes you a better architect even if you never ship your own implementation.
  • You're prototyping a single-agent, single-customer demo. The code above works fine for that.
  • You have fewer than 100 memories total and no compliance requirements.
  • Your agents are internal tools, not customer-facing. The stakes are lower.

Use a platform when:

  • You're customer-facing. The production failure modes above aren't theoretical — they'll hit you within weeks of launch.
  • You run multiple agents that need isolated but occasionally shared memory.
  • You need privacy controls, TTL, or GDPR compliance. These aren't features you add later — they're architectural decisions that affect your storage layer.
  • You need semantic search at scale. The jump from in-memory cosine similarity to a proper vector index is a significant engineering project.
  • You need to monitor agent quality and memory is part of that picture. A memory that hurts agent performance is worse than no memory at all.

The build teaches you what matters. The platform handles what scales. Those aren't competing goals — they're sequential stages of the same journey.

Wrapping Up

Memory is the difference between a stranger and a relationship. Every time your agent says "as we discussed last time" — that's memory working. Every time it doesn't — that's the gap your customers feel.

You now understand all three layers: session memory that keeps the current conversation coherent, persistent memory that carries facts across conversations, and semantic memory that retrieves the right context at the right moment. You've seen the code. You've seen what breaks.

If you want to go deeper on the retrieval side, RAG from Scratch covers the embedding and vector search infrastructure in detail — memory search is fundamentally RAG over customer data. For prompt engineering, the techniques for injecting retrieved context into system messages apply directly to memory context assembly. And if you're wondering how to test whether memory actually improves your agent's performance, the eval framework tutorial shows you how to measure it with numbers instead of gut feeling.

For teams building production tools alongside memory — because an agent that remembers but can't act is only half useful — the combination of memory context and MCP tool execution is where things get powerful. An agent that remembers the customer's preferred resolution method and can execute it through connected tools? That's the goal.

Build the memory system. Learn what breaks. Then decide what to build and what to buy.

Give your agents a memory

Chanl handles memory storage, semantic search, privacy controls, and TTL — so your agents remember what matters and forget what they should.

Explore Chanl Memory