
Multi-Agent AI Systems: Build an Agent Orchestrator Without a Framework

Build a multi-agent system from scratch — delegation, planning loops, and inter-agent communication — before reaching for LangGraph or CrewAI.

Dean Grover, Co-founder
March 8, 2026
20 min read

Gartner projects a 1,445% surge in AI agent deployments between 2024 and 2028. Most of those agents will start as single-purpose systems — one agent, one job. But the moment you need an agent to research a topic, write a report about it, and fact-check the result, you hit a wall. Not because the model isn't smart enough, but because one agent can't hold competing instructions without drifting.

This guide walks you through building multi-agent systems from scratch. You'll implement an orchestrator that coordinates specialized agents, learn the four orchestration patterns that cover nearly every use case, and build a working three-agent research team — all before reaching for LangGraph, CrewAI, or any framework.

| What you'll learn | Why it matters |
| --- | --- |
| When one agent isn't enough | Recognize the symptoms of a task that needs splitting |
| Four orchestration patterns | Sequential, parallel, hierarchical, debate — pick the right one |
| Build a 3-agent research team | Working orchestrator in ~150 lines, no framework |
| Shared state between agents | How agents remember what others found |
| Google's A2A protocol | How agent-to-agent communication is being standardized |
| Framework comparison | LangGraph, CrewAI, OpenAI Swarm — when to adopt one |
| Production failure modes | What breaks at scale and how to test for it |

What you'll need

Environment:

  • Node.js 20+
  • A Chanl account (free tier works for everything here)

Install the SDK:

bash
npm install @chanl-ai/sdk

Get your API key:

  1. Go to app.chanl.ai/settings/api-keys
  2. Create a new API key
  3. Set it as an environment variable:
bash
export CHANL_API_KEY="your-api-key-here"

All code examples are complete and runnable — copy-paste and go.

When one agent isn't enough

A single agent fails when competing instructions cause context pollution — the model tries to be a researcher, writer, and fact-checker simultaneously, and quality degrades across all three roles.

Here's the problem in practice. You ask one agent to research market data on AI infrastructure spending, write a 500-word analysis, and fact-check its own claims:

typescript
import { ChanlClient } from "@chanl-ai/sdk";
 
const chanl = new ChanlClient({
  apiKey: process.env.CHANL_API_KEY,
  model: "claude-sonnet-4-20250514",
});
 
// One agent trying to do everything at once
const agent = await chanl.agents.create({
  name: "Do-Everything Agent",
  instructions: `You are a research analyst. Your job:
1. Research AI infrastructure market data for 2026
2. Write a 500-word analysis of the findings
3. Fact-check every claim in your analysis
4. Flag any claims you can't verify`,
});
 
const response = await chanl.chat.send(agent.id, [
  {
    role: "user",
    content: "Write a market analysis of AI infrastructure spending in 2026.",
  },
]);
 
console.log(response.content);

This looks fine on the surface. Three things go wrong:

Context pollution. The model is trying to research, write, and critique simultaneously. Its "research" phase has no tools — it's pulling from training data with no way to verify recency. The "fact-checking" phase can't honestly evaluate claims it just made up.

Instruction drift. By paragraph three, the writing instruction dominates. The model stops flagging uncertain claims because "write a polished analysis" outweighs "flag what you can't verify." It optimizes for fluency over accuracy.

Role confusion. A researcher's job is to find data. A writer's job is to synthesize it compellingly. A fact-checker's job is to poke holes. These are adversarial roles — you want tension between them, not compromise.

The fix: split each role into its own agent with focused instructions. Notice how each agent below gets exactly one job and only the tools appropriate to that role:

typescript
import { ChanlClient } from "@chanl-ai/sdk";
 
const chanl = new ChanlClient({
  apiKey: process.env.CHANL_API_KEY,
  model: "claude-sonnet-4-20250514",
});
 
// Each agent gets ONE job with focused tools and instructions.
// (webSearchTool and urlReaderTool are assumed to have been created earlier.)
const researcher = await chanl.agents.create({
  name: "Market Researcher",
  instructions:
    "You are a research specialist. Find relevant data, cite sources, and flag confidence levels. Never synthesize or editorialize — just report findings.",
  toolIds: [webSearchTool.id, urlReaderTool.id],
});
 
const writer = await chanl.agents.create({
  name: "Analysis Writer",
  instructions:
    "You are an analyst writer. Transform raw research findings into clear, structured analysis. Use only the data provided — never add claims beyond what the research supports.",
});
 
const factChecker = await chanl.agents.create({
  name: "Fact Checker",
  instructions:
    "You are a fact-checker. Review the analysis against the original research data. Flag any claim that lacks a source, contradicts the data, or overstates confidence.",
  toolIds: [webSearchTool.id],
});

Now there's natural tension between roles. The researcher has search tools. The writer has no tools — just the research output as input. The fact-checker has search tools to independently verify claims. That tension is exactly what produces high-quality output.

The four multi-agent patterns

Nearly every multi-agent system maps to one of four orchestration patterns. The right choice depends on whether your subtasks have dependencies and whether you need speed or quality.

| Pattern | How it works | Best for | Tradeoff |
| --- | --- | --- | --- |
| Sequential Pipeline | Agent A output feeds into Agent B | Tasks with natural order (research, then write, then edit) | Slowest — linear execution |
| Parallel Fan-out | Multiple agents work simultaneously | Independent subtasks that can merge later | Needs a merge step; inconsistency risk |
| Hierarchical Delegation | Orchestrator assigns tasks to workers | Complex projects with many subtasks | Orchestrator becomes bottleneck |
| Debate/Consensus | Multiple agents solve same problem | High-stakes decisions needing verification | Most expensive — runs N agents per task |

Pattern 1: Sequential pipeline

The simplest pattern — and often the right one to start with. Each agent's output becomes the next agent's input, like a Unix pipe.

Sequential pipeline: each agent's output feeds directly into the next

The key thing to notice: runAgent is just a chat call with a system prompt. The orchestration complexity lives entirely in how you chain them together:

typescript
import { ChanlClient } from "@chanl-ai/sdk";
 
const chanl = new ChanlClient({
  apiKey: process.env.CHANL_API_KEY,
  model: "claude-sonnet-4-20250514",
});
 
interface AgentConfig {
  name: string;
  systemPrompt: string;
}
 
async function runAgent(
  agent: AgentConfig,
  input: string
): Promise<string> {
  const agentInstance = await chanl.agents.create({
    name: agent.name,
    instructions: agent.systemPrompt,
  });
  const response = await chanl.chat.send(agentInstance.id, [
    { role: "user", content: input },
  ]);
  console.log(`[${agent.name}] completed`);
  return response.content;
}
 
async function sequentialPipeline(
  agents: AgentConfig[],
  initialInput: string
): Promise<string> {
  let output = initialInput;
  for (const agent of agents) {
    output = await runAgent(agent, output);
  }
  return output;
}
 
// Usage
const result = await sequentialPipeline(
  [
    {
      name: "Researcher",
      systemPrompt:
        "Find and list key data points about the given topic. Output structured findings with sources.",
    },
    {
      name: "Writer",
      systemPrompt:
        "Transform the research findings into a clear 500-word analysis. Use only the data provided.",
    },
    {
      name: "Editor",
      systemPrompt:
        "Review the analysis for clarity, accuracy, and flow. Fix issues and return the polished version.",
    },
  ],
  "AI infrastructure spending trends in 2026"
);

Sequential is the default choice. Use it when each step depends on the previous one's output.

Pattern 2: Parallel fan-out

When subtasks are independent, run agents concurrently and merge their outputs. Three analysts working simultaneously instead of waiting in line. The critical piece is the merger step at the end:

Parallel fan-out: independent agents work simultaneously, then a merger synthesizes results

Promise.all runs all agents at the same time. Without the merger agent, you'd get three separate outputs instead of one coherent result:

typescript
async function parallelFanOut(
  agents: AgentConfig[],
  task: string,
  mergerPrompt: string
): Promise<string> {
  // Run all agents concurrently
  const results = await Promise.all(
    agents.map(async (agent) => ({
      name: agent.name,
      output: await runAgent(agent, task),
    }))
  );
 
  // Merge results
  const mergeInput = results
    .map((r) => `## ${r.name} Output\n${r.output}`)
    .join("\n\n---\n\n");
 
  return runAgent(
    { name: "Merger", systemPrompt: mergerPrompt },
    mergeInput
  );
}
 
const report = await parallelFanOut(
  [
    {
      name: "Market Analyst",
      systemPrompt:
        "Analyze market size, growth rates, and key players for the given topic.",
    },
    {
      name: "Technical Analyst",
      systemPrompt:
        "Analyze technical trends, architectures, and infrastructure shifts.",
    },
    {
      name: "Financial Analyst",
      systemPrompt:
        "Analyze investment patterns, funding, and financial projections.",
    },
  ],
  "AI infrastructure in 2026",
  "Synthesize the following specialist analyses into one coherent report. Resolve contradictions by noting them explicitly."
);

Parallel is fastest when subtasks are genuinely independent. The merger step is critical — without it, you get three separate outputs instead of one coherent result.

Pattern 3: Hierarchical delegation

An orchestrator decides which agents to call and in what order, dynamically adapting based on intermediate results. This is the most flexible pattern because the orchestrator can change the plan mid-execution — if the researcher finds something unexpected, the orchestrator can add a new step or skip one that's no longer relevant.

typescript
interface TaskResult {
  agent: string;
  task: string;
  output: string;
}
 
async function hierarchicalOrchestrator(
  task: string,
  availableAgents: AgentConfig[]
): Promise<string> {
  const agentDescriptions = availableAgents
    .map((a) => `- ${a.name}: ${a.systemPrompt}`)
    .join("\n");
 
  // Step 1: Orchestrator creates a plan
  const plan = await runAgent(
    {
      name: "Orchestrator",
      systemPrompt: `You are a project manager. Given a task and available team members,
create a step-by-step execution plan. For each step, specify which team member should handle it
and what their specific subtask is.
 
Available team members:
${agentDescriptions}
 
Respond in JSON: { "steps": [{ "agent": "name", "task": "specific subtask" }] }`,
    },
    task
  );
 
  // Models often wrap JSON in markdown fences, so strip them before parsing
  const { steps } = JSON.parse(plan.replace(/```(?:json)?/g, "").trim());
  const results: TaskResult[] = [];
 
  // Step 2: Execute each step
  for (const step of steps) {
    const agent = availableAgents.find(
      (a) => a.name === step.agent
    );
    if (!agent) continue;
 
    const context =
      results.length > 0
        ? `\n\nPrevious results:\n${results.map((r) => `[${r.agent}]: ${r.output}`).join("\n")}`
        : "";
 
    const output = await runAgent(
      agent,
      step.task + context
    );
 
    results.push({
      agent: step.agent,
      task: step.task,
      output,
    });
  }
 
  // Step 3: Orchestrator synthesizes
  return runAgent(
    {
      name: "Orchestrator",
      systemPrompt:
        "Synthesize all team outputs into a final deliverable. Ensure coherence and completeness.",
    },
    results
      .map((r) => `## ${r.agent}: ${r.task}\n${r.output}`)
      .join("\n\n")
  );
}

The orchestrator itself becomes a potential bottleneck: if it creates a bad plan, everything downstream suffers.
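One cheap mitigation is to validate the plan before executing it. Below is a minimal sketch: the PlanStep shape matches the JSON format the orchestrator prompt requests, and the step limit is an illustrative bound, not a hard rule.

```typescript
interface PlanStep {
  agent: string;
  task: string;
}

// Reject plans that reference unknown agents, contain empty tasks,
// or are suspiciously long before any worker agent runs.
function validatePlan(
  steps: PlanStep[],
  availableAgents: { name: string }[],
  maxSteps: number = 10
): string[] {
  const errors: string[] = [];
  if (steps.length === 0) errors.push("Plan has no steps");
  if (steps.length > maxSteps)
    errors.push(`Plan has ${steps.length} steps (max ${maxSteps})`);
  const known = new Set(availableAgents.map((a) => a.name));
  for (const step of steps) {
    if (!known.has(step.agent))
      errors.push(`Unknown agent: "${step.agent}"`);
    if (!step.task?.trim())
      errors.push(`Empty task for agent "${step.agent}"`);
  }
  return errors;
}
```

If validation fails, you can re-prompt the orchestrator with the error list instead of letting a bad plan flow downstream.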

Pattern 4: Debate/consensus

Multiple agents solve the same problem independently. A judge evaluates their answers and picks the best one or synthesizes a consensus. This is your go-to for high-stakes decisions where you'd rather spend more compute than risk a single agent's blind spots. Each debater has a deliberately different perspective — that tension is the whole point:

typescript
async function debateConsensus(
  task: string,
  debaters: AgentConfig[],
  judgePrompt: string
): Promise<string> {
  // All agents solve the same problem independently
  const solutions = await Promise.all(
    debaters.map(async (agent) => ({
      name: agent.name,
      solution: await runAgent(agent, task),
    }))
  );
 
  // Judge evaluates all solutions
  const judgeInput = solutions
    .map(
      (s, i) =>
        `## Solution ${i + 1} (by ${s.name})\n${s.solution}`
    )
    .join("\n\n---\n\n");
 
  return runAgent(
    {
      name: "Judge",
      systemPrompt: judgePrompt,
    },
    `Task: ${task}\n\n${judgeInput}`
  );
}
 
const best = await debateConsensus(
  "What is the optimal pricing strategy for an AI SaaS product in 2026?",
  [
    {
      name: "Growth Strategist",
      systemPrompt:
        "You prioritize market share and rapid adoption. Favor lower pricing with usage-based upsell.",
    },
    {
      name: "Revenue Optimizer",
      systemPrompt:
        "You prioritize revenue per customer. Favor value-based pricing with premium tiers.",
    },
    {
      name: "Customer Advocate",
      systemPrompt:
        "You prioritize customer satisfaction and retention. Favor transparent, predictable pricing.",
    },
  ],
  "Evaluate each pricing strategy. Identify strengths, weaknesses, and blind spots. Synthesize the best elements into a recommended approach with clear reasoning."
);

Debate is the most expensive pattern — you're running N agents on the same task. Use it when correctness matters more than speed.
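When the answers are short and discrete (a classification, a yes/no, a ranked option), a cheaper variant skips the LLM judge entirely and takes a majority vote over the debaters' answers. A minimal sketch, assuming answers are comparable after normalization:

```typescript
// Majority vote over short, discrete answers.
// Normalizes by trimming and lowercasing; ties go to the first answer seen.
function majorityVote(answers: string[]): string {
  const counts = new Map<string, number>();
  for (const a of answers) {
    const key = a.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best = answers[0];
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      bestCount = count;
      best = answers.find((a) => a.trim().toLowerCase() === key)!;
    }
  }
  return best;
}
```

This saves one judge call per task, but it only works when answers can be compared mechanically; for open-ended prose, you still need the judge.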

Build along: a research team in 150 lines

Here's a practical three-agent system: a Researcher that gathers data, a Writer that produces analysis, and an Editor that improves the output. This is the hierarchical pattern with a fixed workflow — the most common starting point for real multi-agent systems.

The scratchpad object captures every intermediate result, so when something goes wrong (and it will), you can inspect exactly where the pipeline broke down:

typescript
import { ChanlClient } from "@chanl-ai/sdk";
 
// --- Types ---
 
interface AgentRole {
  name: string;
  systemPrompt: string;
}
 
interface Scratchpad {
  [key: string]: string;
}
 
// --- Core agent runner ---
 
const chanl = new ChanlClient({
  apiKey: process.env.CHANL_API_KEY,
  model: "claude-sonnet-4-20250514",
});
 
async function callAgent(
  role: AgentRole,
  input: string
): Promise<string> {
  const start = Date.now();
  const agent = await chanl.agents.create({
    name: role.name,
    instructions: role.systemPrompt,
  });
  const response = await chanl.chat.send(agent.id, [
    { role: "user", content: input },
  ]);
  const output = response.content;
  const elapsed = Date.now() - start;
  console.log(
    `  [${role.name}] finished in ${elapsed}ms (${output.length} chars)`
  );
  return output;
}
 
// --- Define the team ---
 
const researcher: AgentRole = {
  name: "Researcher",
  systemPrompt: `You are a research specialist. Given a topic:
1. Identify 5-8 key data points with specific numbers, dates, or percentages
2. Note the source or basis for each data point
3. Rate your confidence in each point (high/medium/low)
4. Flag any data points you're uncertain about
 
Output a structured list. Do NOT write prose — just organized findings.`,
};
 
const writer: AgentRole = {
  name: "Writer",
  systemPrompt: `You are an analysis writer. Given research findings:
1. Transform the data into a clear, structured 500-word analysis
2. Use ONLY the data provided — do not add claims beyond the research
3. Organize with clear sections and topic sentences
4. Note confidence levels from the research where relevant
 
If a finding is flagged as low-confidence, qualify it in your text.`,
};
 
const editor: AgentRole = {
  name: "Editor",
  systemPrompt: `You are a technical editor. Given a draft analysis:
1. Check every factual claim against the original research (provided below)
2. Flag any claim not supported by the research
3. Fix clarity, flow, and structure issues
4. Ensure the analysis is balanced — no overselling uncertain findings
5. Return the polished final version
 
Mark any changes you made with [EDITED: reason].`,
};
 
// --- Orchestrator ---
 
async function runResearchTeam(topic: string): Promise<{
  finalOutput: string;
  scratchpad: Scratchpad;
  timing: Record<string, number>;
}> {
  const scratchpad: Scratchpad = {};
  const timing: Record<string, number> = {};
 
  console.log(`\nResearch Team: "${topic}"\n`);
 
  // Step 1: Research
  let start = Date.now();
  scratchpad.research = await callAgent(
    researcher,
    `Research this topic: ${topic}`
  );
  timing.research = Date.now() - start;
 
  // Step 2: Write (receives research as input)
  start = Date.now();
  scratchpad.draft = await callAgent(
    writer,
    `Write an analysis based on these research findings:\n\n${scratchpad.research}`
  );
  timing.writing = Date.now() - start;
 
  // Step 3: Edit (receives both research AND draft)
  start = Date.now();
  scratchpad.final = await callAgent(
    editor,
    `Edit this draft. Check all claims against the original research.\n\n## Original Research\n${scratchpad.research}\n\n## Draft to Edit\n${scratchpad.draft}`
  );
  timing.editing = Date.now() - start;
 
  const totalTime = Object.values(timing).reduce(
    (a, b) => a + b,
    0
  );
  console.log(`\nTotal pipeline: ${totalTime}ms`);
  console.log(
    `  Research: ${timing.research}ms | Write: ${timing.writing}ms | Edit: ${timing.editing}ms`
  );
 
  return {
    finalOutput: scratchpad.final,
    scratchpad,
    timing,
  };
}
 
// Run it
const { finalOutput, scratchpad } = await runResearchTeam(
  "AI infrastructure spending trends in 2026"
);
 
console.log("\n=== FINAL OUTPUT ===\n");
console.log(finalOutput);

That's about 120 lines of actual logic. The editor receives both the original research and the draft, so it can cross-reference claims against source data.

Here's the same system using Chanl's built-in orchestration. Instead of wiring the pipeline yourself, you declare the agents and let the SDK handle delegation, context passing, and synthesis:

typescript
import { ChanlClient } from "@chanl-ai/sdk";
 
const chanl = new ChanlClient({
  apiKey: process.env.CHANL_API_KEY,
  model: "claude-sonnet-4-20250514",
});
 
// Create specialized agents (these persist across runs).
// (webSearchTool and urlReaderTool are assumed to exist from earlier setup.)
const researcher = await chanl.agents.create({
  name: "Market Researcher",
  instructions: `You are a research specialist. Find 5-8 key data points with specific numbers.
Rate confidence. Flag uncertain claims. Output structured findings only.`,
  toolIds: [webSearchTool.id, urlReaderTool.id],
});
 
const writer = await chanl.agents.create({
  name: "Analysis Writer",
  instructions: `Transform research findings into clear 500-word analysis.
Use ONLY provided data. Qualify low-confidence findings.`,
});
 
const editor = await chanl.agents.create({
  name: "Technical Editor",
  instructions: `Cross-reference every claim against original research.
Flag unsupported claims. Fix clarity and flow.`,
});
 
// Run the full pipeline with built-in orchestration
const result = await chanl.orchestrator.run({
  agents: [researcher.id, writer.id, editor.id],
  task: "Find 2026 AI infrastructure market data and produce a polished analysis",
  pattern: "sequential",
});

The SDK version gives you persistent agents, built-in tool integration, and declarative orchestration. The raw version gives you full control. Both produce the same result — the SDK just handles the plumbing.

Shared state: how agents remember what others found

Agents share state through a shared scratchpad — a key-value store that all agents can read from and write to. Without it, you're threading outputs manually between agents, which breaks down the moment your pipeline isn't strictly linear.

In the research team above, we used a simple object as the scratchpad. That works for sequential pipelines. But what about parallel agents? Or when an agent needs to reference something from two steps ago, not just the immediately previous output?

You need a shared state store accessible by any agent at any point. This implementation adds authorship tracking and timestamps for debugging:

The raw pattern

typescript
class SharedScratchpad {
  private entries: Map<
    string,
    { value: string; author: string; timestamp: number }
  > = new Map();
 
  write(
    key: string,
    value: string,
    author: string
  ): void {
    this.entries.set(key, {
      value,
      author,
      timestamp: Date.now(),
    });
    console.log(`  [Scratchpad] ${author} wrote "${key}"`);
  }
 
  read(key: string): string | undefined {
    return this.entries.get(key)?.value;
  }
 
  readByAuthor(author: string): string[] {
    return Array.from(this.entries.values())
      .filter((e) => e.author === author)
      .map((e) => e.value);
  }
 
  // Get everything — useful for the final synthesis step
  dump(): Record<string, string> {
    const result: Record<string, string> = {};
    for (const [key, entry] of this.entries) {
      result[`${key} (by ${entry.author})`] = entry.value;
    }
    return result;
  }
}

Now each agent reads from and writes to the scratchpad instead of passing strings directly. The editor can access the original research findings directly — it doesn't have to rely on the writer passing everything through accurately:

typescript
const scratchpad = new SharedScratchpad();
 
// Researcher writes findings
const findings = await callAgent(researcher, topic);
scratchpad.write("research-findings", findings, "Researcher");
 
// Writer reads researcher's work, writes draft
const researchData = scratchpad.read("research-findings")!;
const draft = await callAgent(
  writer,
  `Write analysis from:\n\n${researchData}`
);
scratchpad.write("draft", draft, "Writer");
 
// Editor reads BOTH research and draft
const originalResearch = scratchpad.read("research-findings")!;
const draftText = scratchpad.read("draft")!;
const final = await callAgent(
  editor,
  `Edit this draft against the research.\n\nResearch:\n${originalResearch}\n\nDraft:\n${draftText}`
);
scratchpad.write("final-output", final, "Editor");
 
// Debug: see the full collaboration history
console.log(scratchpad.dump());

With Chanl's Memory API

Chanl's Memory API provides a persistent, searchable scratchpad that survives across sessions. Future runs can reference past research. The semantic search is the key advantage over a simple key-value store — agents can find relevant context by meaning rather than exact key names:

typescript
// Researcher writes findings to shared memory
await chanl.memory.create({
  key: "research-findings",
  value: researcherOutput,
  entityType: "agent",
  entityId: orchestrator.id,
  metadata: { source: "researcher", task: "market-data", topic },
});
 
// Writer reads researcher's findings via semantic search
const findings = await chanl.memory.search({
  query: "market data findings",
  entityType: "agent",
  entityId: orchestrator.id,
});
 
// Editor reads both research and draft
const allContext = await chanl.memory.search({
  query: "research findings and draft analysis",
  entityType: "agent",
  entityId: orchestrator.id,
});

The researcher writes about "AI infrastructure spending." The writer searches for "market data findings." The memory system connects these by semantic similarity — no need to agree on exact key names up front.

Google's A2A protocol

A2A (Agent-to-Agent) is Google's open protocol for agents to discover and communicate with other agents, the same way MCP lets agents discover and use tools. MCP is for agent-to-tool. A2A is for agent-to-agent. Together, they form the two halves of an interoperable agent ecosystem.

Google released the A2A spec in April 2025 with backing from over 50 partners including Salesforce, SAP, and Deloitte. The core idea: every agent publishes an Agent Card — a JSON document describing who it is and what it can do — at a well-known URL. Other agents read the card, decide if this agent can help, and send a structured task request.

Here's what an Agent Card looks like:

json
{
  "name": "Market Research Agent",
  "description": "Researches market data, competitive landscapes, and industry trends",
  "url": "https://agents.example.com/researcher",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": false
  },
  "skills": [
    {
      "id": "market-research",
      "name": "Market Research",
      "description": "Gathers and analyzes market data for a given industry or topic",
      "inputModes": ["text/plain"],
      "outputModes": ["text/plain", "application/json"]
    },
    {
      "id": "competitive-analysis",
      "name": "Competitive Analysis",
      "description": "Compares competitors in a given market segment",
      "inputModes": ["text/plain"],
      "outputModes": ["application/json"]
    }
  ]
}

The flow is straightforward: discover, then delegate. Agent A fetches the other agent's card, then sends a structured task request if there's a match:

A2A discovery and task delegation flow
typescript
// Basic A2A task request
interface A2ATask {
  id: string;
  message: {
    role: "user";
    parts: Array<{ type: "text"; text: string }>;
  };
  metadata?: Record<string, string>;
}
 
// Minimal shape of an Agent Card (matches the JSON example above)
interface AgentCard {
  name: string;
  description: string;
  url: string;
  version: string;
  skills: Array<{ id: string; name: string; description: string }>;
}
 
// Discover an agent
async function discoverAgent(
  agentUrl: string
): Promise<AgentCard> {
  const response = await fetch(
    `${agentUrl}/.well-known/agent.json`
  );
  return response.json();
}
 
// Send a task to a discovered agent
async function sendA2ATask(
  agentUrl: string,
  task: string
): Promise<string> {
  const response = await fetch(`${agentUrl}/tasks`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      id: crypto.randomUUID(),
      message: {
        role: "user",
        parts: [{ type: "text", text: task }],
      },
    }),
  });
  const result = await response.json();
  return result.artifacts?.[0]?.parts?.[0]?.text ?? "";
}

MCP + A2A: complementary standards

Think of it this way: MCP gives your agent hands (tools to interact with the world). A2A gives your agent colleagues (other agents to collaborate with). A research agent might use MCP to call a web search tool, then use A2A to delegate the writing task to a specialized writer agent running on a completely different platform.

MCP handles tool connections, A2A handles agent-to-agent collaboration

A2A is still early — the spec is evolving and real-world adoption is experimental. But the direction is clear: agents will discover and call other agents the same way they discover and call tools today. Even if you don't adopt A2A formally, designing your agents with clear Agent Card-style descriptions (name, skills, input/output formats) is good practice.
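Even without adopting the protocol, generating card-style metadata from the agent configs used throughout this guide is straightforward. A sketch follows; the output shape loosely mirrors the Agent Card example above, and the slugged skill id and default text modes are assumptions, not part of any spec:

```typescript
interface AgentConfig {
  name: string;
  systemPrompt: string;
}

// Derive a card-style description from an AgentConfig.
// The skill id is a slug of the name; input/output modes are assumed defaults.
function toAgentCard(agent: AgentConfig, url: string) {
  return {
    name: agent.name,
    description: agent.systemPrompt,
    url,
    version: "1.0.0",
    skills: [
      {
        id: agent.name.toLowerCase().replace(/\s+/g, "-"),
        name: agent.name,
        description: agent.systemPrompt,
        inputModes: ["text/plain"],
        outputModes: ["text/plain"],
      },
    ],
  };
}
```

Writing agents this way today means you can expose them over A2A later without redesigning their interfaces.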

LangGraph, CrewAI, OpenAI Swarm: when frameworks earn their weight

Frameworks earn their weight when you need features beyond basic orchestration — persistent state machines, visual debugging, human-in-the-loop, or team collaboration. For 2-4 agents with straightforward collaboration, building from scratch gives you more control and fewer dependencies.

| Framework | Architecture | Best for | Watch out |
| --- | --- | --- | --- |
| LangGraph | Agents as nodes in a state graph, with typed state and conditional edges | Complex workflows with branching, retries, and human-in-the-loop | Steep learning curve; graph mental model isn't intuitive for linear pipelines |
| CrewAI | Role-based agents with natural language task descriptions and processes | Teams that think in roles ("researcher," "writer") rather than code | Magic strings in role definitions can be fragile; less control over execution |
| OpenAI Swarm | Lightweight handoffs between agents with function-calling-based routing | Simple agent-to-agent handoffs within OpenAI's ecosystem | Experimental; no persistence, no state management, no production guarantees |
| AutoGen | Conversation-based multi-agent with group chat patterns | Research prototyping and multi-agent dialogue simulation | Verbose setup; better for exploration than production |
| Raw code | Direct LLM calls with your own orchestration logic | Full control, minimal dependencies, custom patterns | You own all the plumbing — retries, state, logging, error handling |

Decision matrix

Build from scratch when:

  • You have 2-4 agents with a clear, fixed workflow
  • You need full control over prompts, timing, and error handling
  • Your team understands the underlying LLM APIs
  • You want to minimize dependencies and vendor lock-in

Adopt a framework when:

  • You need persistent state machines (conversations that pause and resume)
  • You need visual debugging of agent workflows (LangGraph Studio)
  • You need human-in-the-loop approval steps
  • Multiple developers need a shared abstraction for agent workflows
  • You need built-in retry logic, error recovery, and monitoring

The 150-line test. If you can implement your multi-agent system in ~150 lines of raw code (like we did above), you probably don't need a framework. If you find yourself reimplementing state management, retry logic, and workflow visualization, it's time to adopt one.

Here's the research team in LangGraph for comparison. The graph abstraction adds typed state and conditional edges — useful if you want the editor to reject a draft and loop it back to the writer, but overhead if your pipeline is always linear:

typescript
import { StateGraph, Annotation } from "@langchain/langgraph";
import { ChatAnthropic } from "@langchain/anthropic";
 
// Define typed state
const ResearchState = Annotation.Root({
  topic: Annotation<string>,
  research: Annotation<string>({ default: () => "" }),
  draft: Annotation<string>({ default: () => "" }),
  finalOutput: Annotation<string>({ default: () => "" }),
});
 
const model = new ChatAnthropic({
  modelName: "claude-sonnet-4-20250514",
});
 
// Define agent nodes
async function researchNode(
  state: typeof ResearchState.State
) {
  const result = await model.invoke([
    {
      role: "system",
      content: "You are a research specialist. Find key data points.",
    },
    {
      role: "user",
      content: `Research: ${state.topic}`,
    },
  ]);
  return { research: result.content };
}
 
async function writeNode(
  state: typeof ResearchState.State
) {
  const result = await model.invoke([
    {
      role: "system",
      content: "Transform research into a 500-word analysis.",
    },
    {
      role: "user",
      content: state.research,
    },
  ]);
  return { draft: result.content };
}
 
async function editNode(
  state: typeof ResearchState.State
) {
  const result = await model.invoke([
    {
      role: "system",
      content: "Edit the draft against the research. Fix issues.",
    },
    {
      role: "user",
      content: `Research:\n${state.research}\n\nDraft:\n${state.draft}`,
    },
  ]);
  return { finalOutput: result.content };
}
 
// Build the graph
const workflow = new StateGraph(ResearchState)
  .addNode("researcher", researchNode)
  .addNode("writer", writeNode)
  .addNode("editor", editNode)
  .addEdge("__start__", "researcher")
  .addEdge("researcher", "writer")
  .addEdge("writer", "editor")
  .addEdge("editor", "__end__");
 
const app = workflow.compile();
const result = await app.invoke({
  topic: "AI infrastructure spending in 2026",
});

LangGraph gives you typed state, visual debugging, and conditional edges for revision loops like the editor rejecting a draft. The tradeoff is the dependency and the graph mental model. For a linear pipeline, our 150-line version is simpler. For workflows with branching and loops, LangGraph starts earning its keep.
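To make that loop concrete, here's a sketch of the routing function that LangGraph's `addConditionalEdges` would call after the editor node. The `routeAfterEdit` name, the `APPROVED:` output convention, and the `revisions` counter are all illustrative assumptions, not part of LangGraph's API:

```typescript
// Assumed convention: the editor agent is prompted to prefix its output
// with "APPROVED:" when the draft passes; anything else means revise.
interface ReviewState {
  finalOutput: string;
  revisions: number;
}

function routeAfterEdit(
  state: ReviewState,
  maxRevisions: number = 2
): "done" | "revise" {
  if (state.finalOutput.startsWith("APPROVED:")) return "done";
  // Always terminate eventually, even if the editor never approves.
  if (state.revisions >= maxRevisions) return "done";
  return "revise";
}

// Wired into the graph roughly like:
//   workflow.addConditionalEdges("editor", (s) => routeAfterEdit(s), {
//     revise: "writer",
//     done: "__end__",
//   });
```

The cap on revisions matters: without it, a picky editor and a stubborn writer can ping-pong forever, which is the same infinite-loop failure mode covered below.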

Production multi-agent: what breaks at scale

Multi-agent systems fail in five predictable ways. Testing each agent in isolation and then the full pipeline end-to-end catches most of these before production.

Failure mode 1: context loss between handoffs

The most common failure. Agent B doesn't receive all the context Agent A produced, or the context gets truncated because it exceeds the next agent's context window.

Fix: Structured handoffs with explicit context packaging. Don't pass raw output — pass a structured summary with the key data points the next agent needs.
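One way to sketch that packaging step — the `Handoff` shape and `packageHandoff` helper are illustrative names, not part of any SDK:

```typescript
// Hypothetical handoff envelope: the next agent receives a bounded,
// structured payload instead of a raw transcript.
interface Handoff {
  fromAgent: string;
  toAgent: string;
  summary: string;        // capped summary of the work done
  keyFindings: string[];  // the data points the next agent needs
  openQuestions: string[];
}

function packageHandoff(
  fromAgent: string,
  toAgent: string,
  summary: string,
  keyFindings: string[],
  openQuestions: string[] = [],
  maxSummaryChars: number = 4000
): Handoff {
  // Truncate deliberately here, at a boundary we control, instead of
  // letting the next agent's context window truncate silently.
  const clipped =
    summary.length > maxSummaryChars
      ? summary.slice(0, maxSummaryChars) + " [truncated]"
      : summary;
  return { fromAgent, toAgent, summary: clipped, keyFindings, openQuestions };
}
```

The key design choice is that truncation is explicit and flagged — the downstream agent can see `[truncated]` and ask for more, rather than building on a summary it doesn't know is incomplete.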

Failure mode 2: cascading errors

Agent A produces subtly wrong output. Agent B takes it as ground truth and builds on it. By the final output, the original mistake has been amplified into something confidently wrong.

Fix: Validation steps between agents. Each agent should check the quality of its input before processing. The debate/consensus pattern catches this by independently verifying claims.
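A minimal sketch of such a gate, run before each agent processes its input — the checks here are illustrative, not exhaustive:

```typescript
// Reject obviously bad input before an agent builds on it.
function validateInput(input: string): { ok: boolean; reason?: string } {
  if (input.trim().length === 0) {
    return { ok: false, reason: "empty input" };
  }
  if (input.trim().length < 50) {
    return { ok: false, reason: "input too short to be real output" };
  }
  // Common LLM refusal/error phrasings that an upstream agent may have
  // emitted and that got passed along as if they were content.
  const errorMarkers = ["i cannot", "as an ai", "error:"];
  const lower = input.toLowerCase();
  for (const marker of errorMarkers) {
    if (lower.startsWith(marker)) {
      return { ok: false, reason: `input looks like a refusal or error (starts with "${marker}")` };
    }
  }
  return { ok: true };
}
```

On a failed check, re-run the upstream agent rather than continuing — a retry at the source is far cheaper than untangling a confidently wrong final report.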

Failure mode 3: infinite loops

Agent A delegates to Agent B, which delegates back to Agent A. This happens more often than you'd think with hierarchical orchestrators that have vague delegation criteria. The fix is straightforward — track the delegation chain and bail on cycles:

typescript
// Returns true if the delegation chain exceeds the depth cap
// or revisits an agent (a cycle).
function detectLoop(
  chain: string[],
  maxDepth: number = 5
): boolean {
  // Depth cap: runaway delegation gets cut off even without a cycle.
  if (chain.length > maxDepth) return true;
  // Cycle check: any agent appearing twice means A -> ... -> A.
  const seen = new Set<string>();
  for (const agent of chain) {
    if (seen.has(agent)) return true;
    seen.add(agent);
  }
  return false;
}
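Wiring the check into the delegation path might look like this — `guardDelegation` is a hypothetical wrapper, and `detectLoop` is restated in condensed form so the sketch runs standalone:

```typescript
// Condensed restatement of detectLoop from above.
const detectLoop = (chain: string[], maxDepth: number = 5): boolean =>
  chain.length > maxDepth || new Set(chain).size !== chain.length;

// Hypothetical guard: extend the chain with the next agent and fail
// fast if doing so creates a cycle or exceeds the depth cap.
function guardDelegation(chain: string[], nextAgent: string): string[] {
  const next = [...chain, nextAgent];
  if (detectLoop(next)) {
    throw new Error(`Delegation aborted, cycle or depth limit: ${next.join(" -> ")}`);
  }
  return next;
}
```

The orchestrator calls `guardDelegation` before every handoff and passes the returned chain along with the task, so the full delegation history travels with the work.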

Failure mode 4: role boundary violations

The writer starts fact-checking. The researcher starts writing prose. Agents drift outside their specialization, producing lower-quality output in areas they weren't designed for.

Fix: Tight system prompts with explicit boundaries. "You are a researcher. Output structured findings ONLY. Do NOT write analysis or recommendations." Negative prompting — as covered in prompt engineering — is your best tool here.
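Prompts alone don't guarantee compliance, so a cheap post-hoc check helps catch drift. Both the prompt constant and the `looksLikeFindings` heuristic below are illustrative assumptions — tune the check to whatever output format your researcher is instructed to use:

```typescript
// Hypothetical boundary-enforcing system prompt for the researcher.
const RESEARCHER_PROMPT = `You are a research specialist.
Output structured findings ONLY, as a bulleted list of facts with sources.
Do NOT write analysis, recommendations, or prose paragraphs.`;

// Cheap structural check: did the output stay in bulleted-findings form,
// or did the agent drift into prose?
function looksLikeFindings(output: string): boolean {
  const lines = output
    .trim()
    .split("\n")
    .filter((l) => l.trim().length > 0);
  if (lines.length === 0) return false;
  const bulleted = lines.filter((l) => /^[-*•]/.test(l.trim()));
  return bulleted.length / lines.length > 0.5;
}
```

A failed check triggers a single retry with the boundary restated — usually enough, and far cheaper than letting the drift propagate downstream.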

Failure mode 5: observability gaps

The final output is bad, but you can't tell which agent caused the problem. Was the research wrong? Did the writer misinterpret it? Did the editor introduce an error?

Fix: Log every agent's input and output. The scratchpad pattern gives you this for free. In production, add latency tracking, token usage, and quality scores per agent.
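A minimal sketch of that per-agent logging, assuming agent calls can be modeled as async string-to-string functions — the `AgentTrace` shape and `traced` wrapper are illustrative names:

```typescript
// Per-agent trace record: enough to pinpoint which agent degraded.
interface AgentTrace {
  agent: string;
  input: string;
  output: string;
  latencyMs: number;
  timestamp: string;
}

// Wrap any agent call so its input, output, and timing land in a
// shared trace array the orchestrator owns.
async function traced(
  agent: string,
  input: string,
  call: (input: string) => Promise<string>,
  traces: AgentTrace[]
): Promise<string> {
  const start = Date.now();
  const output = await call(input);
  traces.push({
    agent,
    input,
    output,
    latencyMs: Date.now() - start,
    timestamp: new Date().toISOString(),
  });
  return output;
}
```

When a bad final output comes back, you replay the trace array in order: the first entry whose output doesn't follow from its input is your culprit agent.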

Testing multi-agent pipelines

Test each agent in isolation first, then the full pipeline. Chanl's scenario testing lets you run the complete workflow against predefined inputs and score each agent's contribution independently:

typescript
// Test the full multi-agent pipeline
const execution = await chanl.scenarios.execute(scenarioId, {
  agentId: orchestrator.id,
  mode: "text",
  variables: { task: "Write a market analysis report" },
});
 
// Score each agent's contribution independently
const scores = await chanl.scorecards.evaluate({
  interactionId: execution.interactionId,
  scorecardId: multiAgentScorecard.id,
});
 
// Run regression tests across multiple scenarios
const batchResults = await chanl.scenarios.executeBatch({
  scenarioIds: [
    researchScenario.id,
    writingScenario.id,
    editingScenario.id,
  ],
  agentId: orchestrator.id,
});

For more on building eval frameworks from scratch, see our guide on evaluating AI agents.

What to monitor in production

| Metric | What it tells you | Alert threshold |
| --- | --- | --- |
| Per-agent latency | Which agent is the bottleneck | > 2x historical mean |
| Handoff success rate | Are agents receiving valid input | < 95% |
| Context window utilization | Are you hitting token limits | > 80% of context window |
| Output quality per agent | Which agent is degrading | Score drop > 10% vs baseline |
| Delegation depth | Are you in a loop | > configured max |
| Total pipeline latency | End-to-end user experience | > SLA threshold |
| Token cost per pipeline run | Budget tracking | > 2x budget per task |
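The "> 2x historical mean" style of threshold is simple to implement. A sketch, with the window size and factor as illustrative defaults:

```typescript
// Alert when the current value exceeds a multiple of the historical mean.
// Works for per-agent latency, token cost, or any other scalar metric.
function thresholdAlert(
  historical: number[],
  current: number,
  factor: number = 2
): boolean {
  // No history yet: don't alert on the first runs.
  if (historical.length === 0) return false;
  const mean = historical.reduce((a, b) => a + b, 0) / historical.length;
  return current > factor * mean;
}
```

In practice you'd keep a rolling window per agent (say, the last 100 runs) so the baseline tracks gradual, legitimate drift instead of alerting on it.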

Start simple, scale to multi-agent

The most common mistake with multi-agent systems isn't building them wrong — it's building them when you don't need them. A single agent with well-crafted prompts and the right tools handles 80% of use cases. Multi-agent earns its complexity when tasks genuinely require different expertise, when you need parallelism, or when adversarial verification is worth the cost.

Start with one agent. Push it until it fails. When you can articulate why it's failing — context pollution, instruction drift, role confusion — split into specialized agents using the patterns here.

Build the orchestrator from scratch first. If you find yourself reimplementing state management, retry logic, and workflow visualization, adopt a framework. If 150 lines of direct LLM calls solve the problem, keep it simple.

MCP standardizes how agents use tools. A2A will standardize how agents talk to each other. RAG gives agents access to your data. Multi-agent orchestration ties it all together into systems that are greater than the sum of their parts.

Build something. Start with two agents. See what they can do together that neither could do alone.
