A support agent looked up a customer's order, found the right tracking number, and confidently quoted a return policy that had been updated two months ago. The tool worked. The API returned data. The customer heard a clear, specific, wrong answer — and filed a chargeback the next day.
This is the thing about tools for AI agents: they're not optional features. They're the difference between a chatbot that talks about doing things and an agent that actually does them. An agent without tools can sympathize with your shipping delay. An agent with tools can look up your tracking number, check the carrier's status API, and reschedule delivery — all in the same conversation.
But a bad tool system is worse than no tools at all. A chatbot that says "I don't have access to that information" is annoying. An agent that confidently acts on stale data, or calls the wrong API, or hangs for eight seconds waiting on a timeout — that's a customer you just lost.
This article teaches you to build a tool system from scratch. You'll write real, runnable TypeScript. By the end, you'll understand exactly how tool registries, execution loops, credential management, and monitoring work. Then we'll break it — intentionally — by looking at what happens when your three-tool prototype meets 200 concurrent customers in production.
Prerequisites & Setup
You'll need Node.js 18+, an Anthropic API key for Claude, and a terminal. All code is TypeScript — install the SDK and you're ready:
npm install @anthropic-ai/sdk
Create a .env file:
ANTHROPIC_API_KEY=sk-ant-...
We'll use Claude's tool_use API for the execution loop. If you're new to how LLMs select and call tools, the function calling deep dive covers the mechanics in detail.
What Is a Tool System?
A tool system is the infrastructure that lets an AI agent take actions in the real world — not just generate text about them. It has four layers: a registry of available tools, an execution engine that calls them, an authentication layer that manages credentials, and monitoring that tracks what happened.
For customer-facing agents, the stakes on each layer are higher than for internal tools. When your internal Slack bot fails to fetch a Jira ticket, a developer shrugs and retries. When your support agent fails to look up an order while a frustrated customer is on the phone, that customer tells Twitter about it.
Picture a single tool call end to end: a customer asks "Where's my order?", the agent selects a lookup tool, the tool hits a real API with real credentials, and the result flows back into the conversation as the basis for the agent's answer.
Three things happen that you don't get for free: the agent discovers what tools exist (registry), the tool call hits a real API with real credentials (execution + auth), and someone knows it happened (monitoring). Each of those is a system you need to build.
Build It: A Minimal Tool Registry
The tool registry is the foundation — it's where your agent learns what it can do. Every tool needs a name, a description the LLM can read, a schema defining what arguments it accepts, and a function that actually executes it.
Let's start with the interface:
interface ToolResult {
success: boolean;
data?: unknown;
error?: string;
}
interface Tool {
name: string;
description: string;
parameters: {
type: "object";
properties: Record<string, {
type: string;
description: string;
enum?: string[];
}>;
required: string[];
};
execute: (params: Record<string, unknown>) => Promise<ToolResult>;
}
The description field is doing more work than it appears to. The LLM reads it to decide whether to call the tool — if the description is vague, the agent picks the wrong tool. If it's too narrow, the agent never picks it when it should. This is the same clarity principle from prompt engineering: precision in the instruction determines precision in the output.
Now let's define three tools a real customer-facing agent would use:
const lookupOrder: Tool = {
name: "lookup_order",
description:
"Retrieves the current status, tracking number, carrier, and estimated " +
"delivery date for a customer order given its order ID. Use this when a " +
"customer asks about shipping, delivery, or order status.",
parameters: {
type: "object",
properties: {
orderId: {
type: "string",
description: "The order ID (format: ORD-XXXXX)",
},
},
required: ["orderId"],
},
execute: async (params) => {
const orderId = params.orderId as string;
// In production: call your order management API
const response = await fetch(
`https://api.example.com/orders/${orderId}/status`,
{ headers: { Authorization: `Bearer ${process.env.ORDER_API_KEY}` } }
);
if (!response.ok) {
return { success: false, error: `Order ${orderId} not found` };
}
const data = await response.json();
return { success: true, data };
},
};
const checkInventory: Tool = {
name: "check_inventory",
description:
"Checks real-time inventory availability for a product by SKU. Returns " +
"current stock count and whether the item is available for immediate " +
"shipping. Use when a customer asks if something is in stock.",
parameters: {
type: "object",
properties: {
sku: {
type: "string",
description: "Product SKU identifier",
},
},
required: ["sku"],
},
execute: async (params) => {
const sku = params.sku as string;
const response = await fetch(
`https://api.example.com/inventory/${sku}`,
{ headers: { Authorization: `Bearer ${process.env.INVENTORY_API_KEY}` } }
);
if (!response.ok) {
return { success: false, error: `SKU ${sku} not found` };
}
const data = await response.json();
return { success: true, data };
},
};
const scheduleCallback: Tool = {
name: "schedule_callback",
description:
"Schedules a callback from a human support agent at a specific date and " +
"time. Use when the customer's issue requires human assistance or when " +
"the customer explicitly asks to speak with a person.",
parameters: {
type: "object",
properties: {
customerPhone: {
type: "string",
description: "Customer phone number in E.164 format",
},
preferredTime: {
type: "string",
description: "Preferred callback time in ISO 8601 format",
},
reason: {
type: "string",
description: "Brief summary of why the callback is needed",
},
},
required: ["customerPhone", "preferredTime", "reason"],
},
execute: async (params) => {
const response = await fetch("https://api.example.com/callbacks", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${process.env.CALLBACK_API_KEY}`,
},
body: JSON.stringify(params),
});
if (!response.ok) {
return { success: false, error: "Failed to schedule callback" };
}
const data = await response.json();
return { success: true, data };
},
};
Now, the registry itself — just a Map with registration and lookup:
class ToolRegistry {
private tools = new Map<string, Tool>();
register(tool: Tool): void {
this.tools.set(tool.name, tool);
}
get(name: string): Tool | undefined {
return this.tools.get(name);
}
listForLLM(): Array<{
name: string;
description: string;
input_schema: Tool["parameters"];
}> {
return Array.from(this.tools.values()).map((tool) => ({
name: tool.name,
description: tool.description,
input_schema: tool.parameters,
}));
}
}
const registry = new ToolRegistry();
registry.register(lookupOrder);
registry.register(checkInventory);
registry.register(scheduleCallback);
The listForLLM() method outputs tool definitions in the format Anthropic's API expects. Now let's wire it up to Claude and build the execution loop — the Reason, Select, Execute, Observe cycle:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function handleCustomerMessage(
userMessage: string
): Promise<string> {
const tools = registry.listForLLM();
const messages: Anthropic.MessageParam[] = [
{ role: "user", content: userMessage },
];
// Keep looping until the agent produces a final text response
while (true) {
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system:
"You are a customer support agent. Use the available tools to " +
"look up real data before answering. Never guess — if you can't " +
"find the information, say so honestly.",
tools,
messages,
});
// If the model wants to call tools, execute them and feed results back
if (response.stop_reason === "tool_use") {
// Add the assistant's response (includes tool_use blocks)
messages.push({ role: "assistant", content: response.content });
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const block of response.content) {
if (block.type === "tool_use") {
const tool = registry.get(block.name);
if (!tool) {
toolResults.push({
type: "tool_result",
tool_use_id: block.id,
content: `Tool "${block.name}" not found`,
is_error: true,
});
continue;
}
const result = await tool.execute(
block.input as Record<string, unknown>
);
toolResults.push({
type: "tool_result",
tool_use_id: block.id,
content: JSON.stringify(result),
is_error: !result.success,
});
}
}
messages.push({ role: "user", content: toolResults });
continue;
}
// Model is done — extract the text response
// Use a type predicate so TypeScript narrows the block to a TextBlock
const textBlock = response.content.find(
(block): block is Anthropic.TextBlock => block.type === "text"
);
return textBlock?.text ?? "I wasn't able to process that request.";
}
}
// Try it
const answer = await handleCustomerMessage(
"Where's my order ORD-48291?"
);
console.log(answer);
That's a working tool system. The agent receives a customer message, examines the available tools, decides which one to call, executes it, reads the result, and formulates a response. If the tool fails, the error gets passed back to the agent as an observation, and it can decide how to handle it — retry, try a different tool, or explain the situation to the customer.
Build It: Authentication and Secrets
The code above has a problem you might have noticed: credential lookups are scattered across tool definitions as ad-hoc process.env references. That works for a demo. In production with customer-facing agents, you need credentials that are centrally managed, rotatable without redeploying, and scoped so the Shopify tool can't accidentally access your Stripe keys.
Here's a credential store that separates secrets from tool definitions:
interface Credential {
id: string;
name: string;
headers: Record<string, string>; // e.g., { Authorization: "Bearer sk-..." }
expiresAt?: Date;
}
class CredentialStore {
private credentials = new Map<string, Credential>();
add(credential: Credential): void {
this.credentials.set(credential.id, credential);
}
get(id: string): Credential | undefined {
const cred = this.credentials.get(id);
if (!cred) return undefined;
// Check expiration
if (cred.expiresAt && cred.expiresAt < new Date()) {
console.warn(
`Credential "${cred.name}" expired at ${cred.expiresAt.toISOString()}`
);
return undefined;
}
return cred;
}
rotate(id: string, newHeaders: Record<string, string>): void {
const existing = this.credentials.get(id);
if (!existing) throw new Error(`Credential ${id} not found`);
this.credentials.set(id, { ...existing, headers: newHeaders });
}
}
Now tools reference credential IDs instead of raw secrets:
interface SecureTool extends Tool {
credentialId: string; // Reference to credential store, not a raw key
}
async function executeSecureTool(
tool: SecureTool,
params: Record<string, unknown>,
credentialStore: CredentialStore
): Promise<ToolResult> {
const credential = credentialStore.get(tool.credentialId);
if (!credential) {
return {
success: false,
error: `Authentication unavailable for tool "${tool.name}"`,
};
}
// Inject credential headers into the execution context.
// Caveat: patching globalThis.fetch is not concurrency-safe — if tools
// execute in parallel, pass headers into the tool explicitly instead.
const originalFetch = globalThis.fetch;
globalThis.fetch = async (input, init) => {
const headers = new Headers(init?.headers);
for (const [key, value] of Object.entries(credential.headers)) {
headers.set(key, value);
}
return originalFetch(input, { ...init, headers });
};
try {
return await tool.execute(params);
} finally {
// Restore original fetch
globalThis.fetch = originalFetch;
}
}
This buys you three things. First, key rotation is a one-liner — credentialStore.rotate("shopify-prod", newHeaders) — and every tool using that credential picks up the change immediately. Second, you get expiration checking, so an expired key returns a clean error instead of a mysterious 401 from a downstream API. Third, tools and credentials are decoupled, which means the same tool definition can use different credentials per workspace or per customer.
In production, this store would be backed by an encrypted vault — AWS Secrets Manager, HashiCorp Vault, or a database with at-rest encryption. The pattern is the same regardless of backend: tools reference credential IDs, and auth headers are resolved at execution time. If you're running multi-tenant agents where different customers need different API keys, you'd add scope-based resolution — try the customer-specific credential first, fall back to the workspace default. This is exactly the secret management pattern that production agent platforms implement.
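As a sketch of that scope-based resolution — the "baseId:customerId" key convention and the names here are illustrative, not a real vault API:

```typescript
// Hypothetical scope-based credential resolution: customer-specific
// credentials live under "<baseId>:<customerId>", with the workspace
// default stored under the bare baseId.
interface ScopedCredential {
  id: string;
  headers: Record<string, string>;
}

class ScopedCredentialResolver {
  private credentials = new Map<string, ScopedCredential>();

  add(credential: ScopedCredential): void {
    this.credentials.set(credential.id, credential);
  }

  // Try the customer-specific credential first, then fall back
  // to the workspace default.
  resolve(baseId: string, customerId?: string): ScopedCredential | undefined {
    if (customerId) {
      const scoped = this.credentials.get(`${baseId}:${customerId}`);
      if (scoped) return scoped;
    }
    return this.credentials.get(baseId);
  }
}
```

The same lookup order works against a real vault backend: the resolver owns the fallback logic, and the tool never knows which credential actually won.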
Build It: Execution Tracking
When a customer-facing agent calls a tool during a live conversation, you need to know what happened. Not eventually, not in a log file someone checks next week — right now, in a dashboard you can filter by tool, by time range, by success rate.
Here's the minimum tracking wrapper:
interface ExecutionRecord {
id: string;
toolName: string;
params: Record<string, unknown>;
result: ToolResult;
latencyMs: number;
timestamp: Date;
}
const executions: ExecutionRecord[] = [];
async function executeWithTracking(
tool: Tool,
params: Record<string, unknown>
): Promise<ToolResult> {
const start = Date.now();
const id = crypto.randomUUID(); // global in Node 19+; on Node 18, import { randomUUID } from "node:crypto"
try {
const result = await tool.execute(params);
const record: ExecutionRecord = {
id,
toolName: tool.name,
params,
result,
latencyMs: Date.now() - start,
timestamp: new Date(),
};
executions.push(record);
return result;
} catch (error) {
const errorResult: ToolResult = {
success: false,
error: error instanceof Error ? error.message : "Unknown error",
};
executions.push({
id,
toolName: tool.name,
params,
result: errorResult,
latencyMs: Date.now() - start,
timestamp: new Date(),
});
return errorResult;
}
}
Now aggregate that into something useful:
interface ToolStats {
toolName: string;
totalCalls: number;
successfulCalls: number;
failedCalls: number;
successRate: number;
averageLatencyMs: number;
p95LatencyMs: number;
lastCalledAt: Date | null;
}
function getToolStats(toolName: string): ToolStats {
const records = executions.filter(
(e) => e.toolName === toolName
);
const successful = records.filter((e) => e.result.success);
const latencies = records.map((e) => e.latencyMs).sort(
(a, b) => a - b
);
return {
toolName,
totalCalls: records.length,
successfulCalls: successful.length,
failedCalls: records.length - successful.length,
successRate:
records.length > 0
? successful.length / records.length
: 0,
averageLatencyMs:
latencies.length > 0
? latencies.reduce((a, b) => a + b, 0) / latencies.length
: 0,
p95LatencyMs:
latencies.length > 0
? latencies[Math.floor(latencies.length * 0.95)]
: 0,
lastCalledAt:
records.length > 0
? records[records.length - 1].timestamp
: null,
};
}
With tracking in place, you can answer questions that matter: "Is the order lookup tool slower today than yesterday?" "What percentage of inventory checks are failing?" "Which tool hasn't been called in 30 days?" These are the metrics that production monitoring dashboards surface — we're building the plumbing that feeds them.
The key insight is that every execution record ties back to a conversation. In a real system, you'd include conversationId and agentId fields too, so when a customer complains about wrong information, you can trace exactly which tool call returned what data, how long it took, and whether the result was stale.
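A minimal sketch of that extension — the conversationId and agentId field names are illustrative:

```typescript
// ExecutionRecord extended with hypothetical trace fields, plus a
// query that reconstructs everything that happened in one
// conversation, in chronological order.
interface TracedExecutionRecord {
  id: string;
  conversationId: string; // which customer conversation triggered the call
  agentId: string; // which agent made the call
  toolName: string;
  latencyMs: number;
  timestamp: Date;
}

function traceConversation(
  records: TracedExecutionRecord[],
  conversationId: string
): TracedExecutionRecord[] {
  return records
    .filter((r) => r.conversationId === conversationId)
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
}
```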
Watch It Break: Production Failure Modes
Everything you've built so far works. Three tools, a registry, credential management, execution tracking — if you're running a single agent with a handful of customers, this holds up fine.
Then you grow.
Here are five ways this system fails when it meets real production traffic. Each one is a problem that teams discover after they've already shipped, usually at the worst possible time.
Problem 1: Tool Sprawl
You start with 3 tools. Product wants the agent to handle returns, so you add initiate_return and get_return_status. Marketing wants it to answer pricing questions, so you add get_pricing and check_promotions. The shipping team adds track_package, estimate_delivery, and change_delivery_address. Six months later, you have 47 tools in the registry.
The LLM now reads 47 descriptions before every decision. Tool selection accuracy drops. A customer asks "Can I return this?" and the agent calls check_inventory instead of initiate_return because the descriptions are similar enough to confuse the model. Another customer asks about a promotion and the agent calls get_pricing when it should have called check_promotions.
The fix is toolsets — curated groups of related tools assigned to specific contexts. Your support agent gets the Support toolset (order lookup, returns, callbacks). Your sales agent gets the Sales toolset (pricing, promotions, inventory). Each agent sees 8-12 tools instead of 47, and selection accuracy goes back up.
This is the same principle as prompt engineering instruction hierarchy: less noise means better decisions. In practice, tool selection accuracy tends to degrade once a single context exposes more than roughly 15-20 tools. Toolsets keep you under that threshold without sacrificing capability.
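A toolset-aware registry is a small extension of the one we built. Here's a sketch, with Tool simplified to just name and description; the full interface from earlier works the same way:

```typescript
interface ToolDef {
  name: string;
  description: string;
}

class ToolsetRegistry {
  private tools = new Map<string, ToolDef>();
  private toolsets = new Map<string, string[]>(); // toolset name -> tool names

  register(tool: ToolDef): void {
    this.tools.set(tool.name, tool);
  }

  defineToolset(name: string, toolNames: string[]): void {
    this.toolsets.set(name, toolNames);
  }

  // An agent assigned a toolset only ever sees that subset — the LLM
  // chooses from 8-12 descriptions instead of 47.
  listForAgent(toolsetName: string): ToolDef[] {
    const names = this.toolsets.get(toolsetName) ?? [];
    return names
      .map((n) => this.tools.get(n))
      .filter((t): t is ToolDef => t !== undefined);
  }
}
```

The registry still holds every tool; scoping happens at list time, so the same tool can belong to several toolsets without duplication.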
Problem 2: Versioning Hell
You update lookup_order to add a returnEligible field to the response. The API starts returning the new schema. But agents currently mid-conversation have already seen the old tool definition — they don't know returnEligible exists, so they don't mention it. Or worse: you change the required parameters, and an agent mid-conversation sends the old parameter format to the new API. The customer gets a cryptic error or a partial response.
In the DIY system we built, there's no concept of tool versions. When you update a tool, the change is live instantly for everyone — including agents that started their conversation with the old version.
Production systems handle this with tool versioning and graceful deprecation. A tool update creates a new version. Existing conversations keep using the version they started with. New conversations get the latest. Old versions are deprecated with a sunset date, not deleted without warning.
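One way to sketch that version pinning, under the assumption that a tool definition is immutable once published:

```typescript
// Hypothetical versioned tool store: publishing creates a new
// immutable version; a conversation pins whatever version is current
// on first access and keeps it for the rest of the session.
interface VersionedTool {
  name: string;
  version: number;
  description: string;
}

class VersionedToolStore {
  private versions = new Map<string, VersionedTool[]>(); // name -> all versions
  private pins = new Map<string, Map<string, number>>(); // conversationId -> name -> version

  publish(tool: Omit<VersionedTool, "version">): VersionedTool {
    const history = this.versions.get(tool.name) ?? [];
    const next: VersionedTool = { ...tool, version: history.length + 1 };
    this.versions.set(tool.name, [...history, next]);
    return next;
  }

  // First access pins the latest version; later accesses return
  // that same pinned version even after new publishes.
  getForConversation(conversationId: string, name: string): VersionedTool | undefined {
    const history = this.versions.get(name);
    if (!history || history.length === 0) return undefined;
    let convPins = this.pins.get(conversationId);
    if (!convPins) {
      convPins = new Map();
      this.pins.set(conversationId, convPins);
    }
    const pinned = convPins.get(name) ?? history.length;
    convPins.set(name, pinned);
    return history[pinned - 1];
  }
}
```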
Problem 3: Auth Rotation
Your Shopify API key expires. In the credential store we built, credentialStore.get("shopify-prod") returns undefined because the expiration check catches it. Good — you get a clean error instead of a mysterious 401.
But clean error or not, every tool that uses Shopify just broke simultaneously. If you have 200 customers mid-conversation and four of your tools depend on Shopify, that's potentially 200 people hearing "I'm sorry, I'm having trouble looking that up right now."
The fix is multi-layered: automated rotation alerts before keys expire (not after), fallback credentials that activate on primary failure, and circuit breakers that detect widespread auth failures and switch to graceful degradation ("Let me connect you with a human agent who can look that up") instead of repeated failed attempts.
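The circuit-breaker part fits in a few lines. The threshold and degradation message here are illustrative:

```typescript
// Sketch of a failure-counting circuit breaker for auth errors:
// once repeated auth failures cross the threshold, skip the doomed
// API call and degrade gracefully instead of failing 200 times.
class AuthCircuitBreaker {
  private failures = 0;

  constructor(private threshold: number = 5) {}

  recordSuccess(): void {
    this.failures = 0; // any success closes the breaker
  }

  recordAuthFailure(): void {
    this.failures += 1;
  }

  isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  degradedResponse(): string {
    return "Let me connect you with a human agent who can look that up.";
  }
}
```

A production breaker would also add a cool-down window and a half-open probe state; this sketch only shows the open/closed decision.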
Problem 4: Latency Cascading
A tool takes 8 seconds because the upstream API is slow. In a chat interface, 8 seconds is annoying. On a voice call, 8 seconds is an eternity of silence. The customer says "Hello? Are you still there?" The agent, still waiting for the API, says nothing. The customer hangs up.
The executeWithTracking wrapper we built measures latency, but it doesn't enforce it. There's no timeout. No fallback. No way to tell the customer "I'm looking that up for you, one moment" while the tool runs.
Production systems need latency budgets: this tool must respond in under 3 seconds for voice, under 5 seconds for chat. If it misses the budget, return a partial result or a fallback response. For voice agents especially, filler responses ("Let me check on that for you...") need to fire while the tool is still executing — which means the tool execution has to be non-blocking. The relationship between latency and customer satisfaction isn't linear — it's a cliff. Research consistently shows that response times above 5 seconds in voice interactions cause dramatic drops in completion rates.
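A minimal timeout-with-fallback wrapper, reusing the ToolResult shape from earlier (redeclared here so the sketch is self-contained; the budget values are illustrative):

```typescript
// Same shape as the ToolResult defined earlier in the article.
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Race the tool call against its latency budget; if the budget is
// blown, return the fallback instead of leaving the customer hanging.
async function executeWithBudget(
  run: () => Promise<ToolResult>,
  budgetMs: number,
  fallback: ToolResult
): Promise<ToolResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(() => resolve(fallback), budgetMs);
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    if (timer) clearTimeout(timer); // don't leave the timer pending
  }
}
```

Note that Promise.race doesn't cancel the slow call — for that you'd thread an AbortController through to fetch. But it does cap how long the customer waits, which is the metric that matters on a voice call.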
Problem 5: The Monitoring Gap
A tool silently returns stale data for three days. The API it calls is up, returning 200 OK, but the data behind it hasn't been refreshed since a migration that broke the sync job. Your execution tracking shows 100% success rate and normal latency. Everything looks green.
Then a customer tweets: "Your AI told me a product was in stock and I drove 45 minutes to the store. It hasn't been in stock for a week."
Success rate and latency aren't enough. You need result validation — sanity checks on what tools return, not just whether they return. Is the lastUpdated timestamp recent? Does the inventory count match a reasonable range? Are the prices non-negative? These are domain-specific checks that your generic tracking wrapper doesn't handle, and they're the difference between monitoring that catches outages and monitoring that catches wrong answers.
This is where agent evaluation meets tool monitoring. You can't evaluate whether an agent gave the right answer without knowing whether its tools gave it the right data.
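Here's what a domain-specific validator for the inventory tool might look like — the result fields and thresholds are illustrative, not a fixed schema:

```typescript
// Hypothetical shape of an inventory tool's response.
interface InventoryResult {
  sku: string;
  stockCount: number;
  price: number;
  lastUpdated: string; // ISO 8601 timestamp
}

interface ValidationIssue {
  field: string;
  message: string;
}

// Sanity-check what the tool returned, not just whether it returned.
// A non-empty result should page someone even if the HTTP call was 200 OK.
function validateInventoryResult(
  result: InventoryResult,
  maxStalenessMs: number = 24 * 60 * 60 * 1000
): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  const updated = Date.parse(result.lastUpdated);
  if (Number.isNaN(updated)) {
    issues.push({ field: "lastUpdated", message: "not a valid timestamp" });
  } else if (Date.now() - updated > maxStalenessMs) {
    issues.push({ field: "lastUpdated", message: "data is stale" });
  }
  if (result.stockCount < 0 || result.stockCount > 1_000_000) {
    issues.push({ field: "stockCount", message: "outside sane range" });
  }
  if (result.price < 0) {
    issues.push({ field: "price", message: "negative price" });
  }
  return issues;
}
```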
What Production Actually Looks Like
You've built the fundamentals. You understand how tool registries work, how execution loops chain tool calls into conversations, how credentials need to be managed, and what goes wrong at scale. Now let's look at what a production system provides on top of those foundations.
Chanl's SDK and CLI wrap the same concepts you built — registry, execution, auth, monitoring — in infrastructure that handles the scaling problems we just described.
Tool management via CLI:
# List all tools in your workspace
chanl tools list
# Create a tool from a JSON definition
chanl tools create -f lookup-order.json
# Group tools into a toolset for your support agent
chanl toolsets create -n "Support Tools" --tools tool_abc,tool_def,tool_ghi
# Assign the toolset to an agent
chanl agents update agent_123 --toolsets ts_456
# Test a tool with sample input before going live
chanl tools execute tool_abc --input '{"orderId": "ORD-48291"}'
SDK integration for programmatic access:
import { PlatformClient } from "@chanl-ai/platform-sdk";
const client = new PlatformClient({
apiKey: process.env.CHANL_API_KEY,
});
// List tools — filtered by type, status
const tools = await client.tools.list({
type: "http",
isEnabled: true,
});
// Create a tool from code
await client.tools.create({
name: "lookup_order",
description: "Retrieves order status by order ID",
type: "http",
inputSchema: {
type: "object",
properties: {
orderId: { type: "string", description: "Order ID" },
},
required: ["orderId"],
},
configuration: {
http: {
method: "GET",
url: "https://api.example.com/orders/{{orderId}}/status",
headers: {
Authorization: "Bearer {{ORDER_API_KEY}}",
},
},
},
});
// Group tools into a toolset
await client.toolsets.create({
name: "Support Tools",
description: "Tools for customer support agents",
toolIds: ["tool_abc", "tool_def", "tool_ghi"],
});
What you get that you didn't build:
Centralized credential store with rotation. Secrets are encrypted at rest and resolved at execution time. When a key rotates, every tool using that credential picks up the change — no redeployment, no conversation interruption.
Toolset grouping. Assign different tool bundles to different agents. Your support agent sees support tools. Your sales agent sees sales tools. The LLM never has to pick from 47 options. We covered the full toolset architecture in AI Agent Tools: MCP, OpenAPI, and Tool Management.
Built-in execution tracking with latency metrics. Every tool call is recorded with timing, success/failure, input, and output. Aggregated into dashboards with success rate trends, latency percentiles, and anomaly detection.
OpenAPI import. Point at an OpenAPI spec and get tools auto-generated — one tool per operation, with schemas, descriptions, and auth configuration derived from the spec. No manual JSON authoring for existing APIs.
MCP integration. Tools are exposed through the Model Context Protocol, so any MCP-compatible client — Claude, ChatGPT, VS Code, your custom agent — can discover and call them without custom integration. If you're new to MCP, the MCP Explained tutorial walks through building an MCP server from scratch.
Enable/disable without deletion. Flip a tool off and it immediately disappears from agent conversations. Flip it back on and it's restored with all configuration intact. Instant rollback when something goes wrong.
Persistent memory across conversations. Tools can read and write to agent memory, so your support agent remembers that this customer called about the same issue yesterday and already tried the suggested fix.
The frame here isn't "throw away what you built." It's: you understand every piece now. The registry, the execution loop, the credential pattern, the monitoring layer. When you use a platform that handles these at scale, you know what it's doing under the hood — and you can tell when it's doing it wrong.
When to Build vs. Buy
Build it yourself when you're learning how tool systems work (which you just did), when you're prototyping with fewer than 5 tools, when you have a single agent, or when the agent is internal-only and the cost of failure is low.
Use a platform when your agent is customer-facing, when you're managing more than 10 tools across multiple agents, when you need centralized auth management and key rotation, when you need monitoring with alerting, or when you need versioning so tool updates don't break mid-conversation sessions.
The honest truth: most teams start by building. It's faster for a prototype, and you learn things about your domain that no platform documentation teaches. The transition happens when you realize you're spending more time maintaining tool infrastructure than building agent capabilities. That usually happens somewhere around the 10-tool, 3-agent mark — when you're debugging why the inventory API failed at 2am instead of improving how your agent handles refund conversations.
The build teaches you what matters. The platform handles what scales.
Where to Go from Here
Tools are what separate a chatbot from an agent. Without them, an AI can only talk about the world. With them, it can act in it — look up orders, check availability, schedule callbacks, initiate returns. The infrastructure underneath — registry, execution, auth, monitoring — determines whether those actions are reliable enough to face real customers.
You've built each layer from scratch. You understand tool selection, credential injection, execution tracking, and the five failure modes that hit when you scale. Whether you continue building your own system or use production infrastructure, that understanding makes you a better builder. You'll write better tool descriptions because you know how the LLM selects from them. You'll design better error handling because you've seen what latency cascading does to a voice call. You'll ask the right questions about auth and monitoring because you've built the naive versions and watched them fail.
If you're continuing in the Learning AI series, RAG from Scratch covers the knowledge retrieval layer that complements tools — giving agents access to your documents, not just your APIs. And AI Agent Evals shows how to measure whether your tools-equipped agent actually gives correct answers.
Build agents that actually do things
Chanl handles tool management, credential storage, MCP hosting, and execution monitoring — so you focus on building agent capabilities, not infrastructure.
Explore Chanl Tools
Further reading:
- Anthropic Tool Use Documentation — Claude API Reference
- OpenAI Function Calling Guide — GPT API Documentation
- Model Context Protocol Specification — MCP Documentation
- MCP Adoption Statistics — MCP Manager (2025-2026 data)
- A Year of MCP: 97M+ Monthly Downloads — Pento (2025 Review)
- Gartner: >40% of Agentic AI Projects Will Be Canceled by 2027
- OWASP Top 10 for Agentic Applications (2026)
- State of MCP Server Security 2025 — Astrix
- JSON Schema Specification — json-schema.org
- Latency and Customer Satisfaction in Voice AI — Sub-300ms Standard
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.