
A 1B Model Just Matched the 70B. Here's How.

How to distill frontier LLMs into small, cheap models that retain 98% accuracy on agent tasks. The teacher-student pattern, NVIDIA's data flywheel, and the Plan-and-Execute architecture that cuts agent costs by 90%.

Dean Grover, Co-founder
March 20, 2026
16 min read
[Figure: a large teacher model transferring knowledge to a compact student model]

Our AI agent cost $4,200 a month. It answered customer questions, called three tools, and routed conversations to the right department. Nothing exotic. The bill came from running every single interaction through a 70B-parameter model because that was what scored highest in our evaluations.

Then we distilled it. Same three tools, same routing logic, same evaluation scores within 2%. The bill dropped to $84 a month. A 1B-parameter model was doing 98% of the work a model 70 times its size had been doing.

Conventional wisdom says you need the biggest model you can afford. The data says you need the biggest model to teach, then the smallest model to run.

NVIDIA published the benchmark: a fine-tuned Llama-3.2-1B achieved 98% of the tool-calling accuracy of Llama-3.3-70B in their Data Flywheel Blueprint. TensorZero documented 5-30x cost reductions across multiple model families. Amazon Bedrock's managed distillation ships models that are 500% faster and 75% cheaper with under 2% accuracy loss.

The pattern behind all of these is the same: use an expensive model to generate training data, fine-tune a cheap model on that data, and replace the expensive model for the tasks where the cheap one matches it.

This article walks through how to do it for AI agent workflows specifically, with code you can run today.

Table of contents

| Section | What you'll learn |
|---|---|
| Why agents are distillation's best use case | Agent tasks are narrow and repetitive -- the ideal distillation target |
| The teacher-student pipeline | Collect production data, curate it, fine-tune a student model |
| Plan-and-Execute: the 90% cost cut | Expensive model plans, cheap model executes |
| NVIDIA's data flywheel | Continuous distillation from production traffic |
| Managed distillation platforms | OpenAI, Bedrock, and Vertex AI -- no GPUs required |
| When distillation fails | The tasks where small models still can't compete |
| The economics, visualized | Real cost comparisons across model tiers |

Why agents are distillation's best use case

Most agent interactions are boring. Not boring in a bad way -- boring in the way that makes them perfect for distillation.

A customer service agent handles tool calls: look up an order, check inventory, route to billing. A voice agent extracts entities: name, account number, intent. A routing agent picks from 5-10 categories. These are classification and structured-output tasks wearing an "AI agent" costume.

Frontier models are overqualified for this work. GPT-4o costs $2.50 per million input tokens. GPT-4o-mini costs $0.15 -- that's 16x cheaper -- and for narrow tool-calling tasks, the accuracy gap is negligible once you fine-tune it on your specific tools.

Here's the economic reality for a production agent handling 50,000 conversations per month:

| Model | Cost / 1M input tokens | Cost / 1M output tokens | Monthly estimate |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$4,200 |
| GPT-4o-mini (vanilla) | $0.15 | $0.60 | ~$250 |
| GPT-4o-mini (distilled) | $0.15 | $0.60 | ~$250, with GPT-4o accuracy |
| Llama-3.2-1B (self-hosted) | ~$0.02 | ~$0.05 | ~$84 |

The distilled mini model costs the same as the vanilla one to run. The difference is accuracy -- and distillation closes that gap.

The teacher-student pipeline

Distillation for agents follows three steps: collect, curate, and train.

Step 1: Collect production outputs

Every time your frontier model handles a real conversation, save the input-output pair. OpenAI makes this trivial with the store: true flag:

typescript
import OpenAI from "openai";
 
const openai = new OpenAI();
 
// Step 1: Capture frontier model outputs during production
// store: true saves input-output pairs for later distillation
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  store: true,                    // <-- Saves to Stored Completions
  metadata: {
    task: "tool-routing",         // Tag for filtering later
    agent: "customer-support",
  },
  messages: [
    {
      role: "system",
      content: "You are a customer support agent. Route requests to the correct tool.",
    },
    { role: "user", content: "I need to check my order status for #A1234" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "check_order_status",
        description: "Look up an order by order ID",
        parameters: {
          type: "object",
          properties: {
            order_id: { type: "string", description: "The order ID" },
          },
          required: ["order_id"],
        },
      },
    },
    // ... other tools
  ],
});

Run your agent in production for a week or two. You need volume -- OpenAI recommends at least 50-100 examples per task, but more is better. TensorZero's research showed that programmatic curation (filtering for the best outputs) matters more than raw volume. Remember our $4,200/month agent? This collection phase is where you extract the training gold from those expensive calls.

Step 2: Curate the training data

Not every production output is worth training on. Filter for:

python
# Step 2: Filter production data for high-quality training examples
# Bad outputs teach bad habits -- curation is the most important step
 
VALID_TOOLS = {"check_order_status", "get_inventory", "route_to_billing"}
 
def curate_distillation_dataset(stored_completions):
    curated = []
    for completion in stored_completions:
        # Only keep turns where the teacher actually called a tool
        # (tool-call turns finish with "tool_calls", not "stop")
        if not completion.tool_calls or completion.finish_reason != "tool_calls":
            continue
        # Skip hallucinated tool names
        tool_names = {tc.function.name for tc in completion.tool_calls}
        if not tool_names.issubset(VALID_TOOLS):
            continue
        # Skip outputs where the model hedged or apologized
        # (content is often None on tool-call turns, so guard against that)
        if "I'm not sure" in (completion.content or ""):
            continue
        curated.append(completion)
 
    return curated  # Typically 60-80% of raw data survives curation

TensorZero's key finding: curated datasets consistently outperformed raw datasets, even when the raw datasets were 3-5x larger. Quality beats quantity for distillation.
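Once curated, each example has to be serialized into OpenAI's chat fine-tuning JSONL format -- one JSON object per line, with the assistant's tool calls preserved as the training target. A minimal sketch (the helper name and the exact fields of your stored completion objects are assumptions; adapt to your own storage):

```python
import json

def to_finetune_jsonl_line(system_prompt, user_message, completion, tools):
    """Serialize one curated example into a chat fine-tuning JSONL line.

    The assistant message reproduces the teacher's tool calls so the
    student learns to emit the same structured output.
    """
    assistant_message = {
        "role": "assistant",
        "tool_calls": [
            {
                "id": tc.id,
                "type": "function",
                "function": {
                    "name": tc.function.name,
                    "arguments": tc.function.arguments,  # JSON-encoded string
                },
            }
            for tc in completion.tool_calls
        ],
    }
    example = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            assistant_message,
        ],
        "tools": tools,  # same tool schemas the teacher saw
    }
    return json.dumps(example)
```

Write one line per curated example into `curated_tool_routing.jsonl` and you have the training file for the next step.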

Step 3: Fine-tune the student

With curated data, fine-tune the smaller model:

python
# Step 3: Fine-tune the student model on curated teacher outputs
# OpenAI handles the training infrastructure -- no GPUs needed
 
from openai import OpenAI
client = OpenAI()
 
# Upload curated training data (JSONL format)
training_file = client.files.create(
    file=open("curated_tool_routing.jsonl", "rb"),
    purpose="fine-tune"
)
 
# Launch distillation fine-tune job
# gpt-4o-mini learns GPT-4o's tool-calling patterns
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # Student: 16x cheaper than teacher (fine-tuning requires a dated snapshot)
    hyperparameters={
        "n_epochs": 3,            # 3 epochs is usually enough
        "batch_size": 8,
    },
)
 
# Monitor: typically completes in 15-45 minutes
print(f"Job ID: {job.id}, Status: {job.status}")

When the job completes, you get a model ID like ft:gpt-4o-mini:your-org:tool-routing:abc123. Drop it into your agent and you're running at mini prices with 4o accuracy.

Plan-and-Execute: the 90% cost cut

Distillation handles the execution layer. But what about the hard parts -- understanding ambiguous requests, breaking complex tasks into steps, deciding which tools to call in what order?

The Plan-and-Execute pattern solves this by splitting your agent into two models:

  • Planner: An expensive frontier model that analyzes the request and creates a step-by-step plan
  • Executor: A cheap distilled model that executes each step

This works because planning is 10-20% of total tokens (one LLM call to create the plan) while execution is 80-90% (many LLM calls to run each step). Distilling the executor is where the money is.

typescript
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
 
const anthropic = new Anthropic();
const openai = new OpenAI();
 
// Planner: expensive model handles the hard thinking
// This is 10-20% of total tokens -- worth paying full price
async function planWithFrontier(userRequest: string) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Break this customer request into executable steps.
Each step should specify exactly one tool call with parameters.
Return JSON array of steps.
 
Request: "${userRequest}"`,
      },
    ],
  });
 
  const block = response.content[0];
  if (block.type !== "text") throw new Error("Planner did not return text");
  return JSON.parse(block.text);
}
 
// Executor: distilled mini model handles the repetitive work
// This is 80-90% of total tokens -- 16x cheaper per call
async function executeWithDistilled(step: any) {
  const response = await openai.chat.completions.create({
    model: "ft:gpt-4o-mini:your-org:tool-executor:abc123",
    messages: [
      {
        role: "system",
        content: "Execute the given tool call. Return structured output only.",
      },
      {
        role: "user",
        content: JSON.stringify(step),
      },
    ],
    tools: agentTools,   // Same tools as the planner knows about
  });
 
  return response.choices[0].message;
}
 
// Orchestrator: frontier plans, distilled executes
async function handleRequest(userRequest: string) {
  const plan = await planWithFrontier(userRequest);  // 1 expensive call
  const results = [];
 
  for (const step of plan) {
    const result = await executeWithDistilled(step);  // N cheap calls
    results.push(result);
  }
 
  return results;
}

The math is straightforward. If your agent averages 8 LLM calls per conversation:

| Architecture | Planner cost | Execution cost | Total |
|---|---|---|---|
| All frontier | -- | 8 x $2.50/M = $20.00/M | $20.00/M |
| Plan-and-Execute | 1 x $3.00/M | 7 x $0.15/M = $1.05/M | $4.05/M |
| Savings | | | ~80% |

With a well-distilled executor, accuracy stays within 2-3% of the all-frontier approach. LangChain's blog on planning agents confirms: by separating planning from execution, "enterprises gain higher efficiency and lower costs by cutting down repeated LLM calls."

NVIDIA's data flywheel

The examples above are one-shot distillation -- you collect data, train once, deploy. NVIDIA's Data Flywheel Blueprint turns this into a continuous loop.

The idea: your agent runs in production, generates interaction logs, and those logs automatically feed the next round of distillation. Each cycle, the student model gets better and cheaper.

[Diagram: NVIDIA Data Flywheel -- continuous distillation from production traffic. The production agent (70B teacher) emits logs; logs become curated data; a candidate student model is fine-tuned and evaluated against the teacher. Candidates that pass the accuracy threshold are deployed (1B student) to serve new traffic; candidates that fail loop back for more data collection.]

NVIDIA's published results are striking:

  • Llama-3.3-70B to Llama-3.2-1B: 98% tool-calling accuracy retained
  • Inference cost reduction: Over 98% (70B requires 2 GPUs; 1B requires 1 GPU)
  • Use case: Tool routing across a small set of functions -- exactly what most agents do

The flywheel works because production data is the best training data. Synthetic examples miss the distribution of real user behavior -- the typos, the ambiguous requests, the edge cases. Each production cycle captures more of that distribution and trains it into the student.

The flywheel blueprint is open-source on GitHub. It uses NeMo Customizer for fine-tuning, NeMo Evaluator for quality checks, and NeMo Deployment Manager for hot-swapping models -- all orchestrated by a Flywheel Orchestrator Service. The core loop is straightforward: collect successful production logs, fine-tune a candidate student, evaluate it against a held-out set with a 95%+ accuracy threshold, and deploy only if it passes.
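The gating logic at the heart of that loop is simple enough to sketch in plain Python. This is an illustrative skeleton, not NeMo's API -- the `fine_tune`, `evaluate`, and `deploy` callables stand in for whatever your stack provides:

```python
def should_deploy(candidate_accuracy: float, teacher_accuracy: float,
                  threshold: float = 0.95) -> bool:
    """Promote the student only if it retains at least `threshold`
    of the teacher's accuracy on the held-out set."""
    return candidate_accuracy >= threshold * teacher_accuracy

def flywheel_cycle(logs, fine_tune, evaluate, deploy):
    """One pass of the loop: curate -> train -> evaluate -> maybe deploy.

    `fine_tune(curated)` returns a candidate model handle,
    `evaluate(model)` returns held-out accuracy,
    `deploy(model)` hot-swaps the serving model.
    """
    curated = [log for log in logs if log.get("success")]
    candidate = fine_tune(curated)
    candidate_acc = evaluate(candidate)
    teacher_acc = evaluate("teacher")
    if should_deploy(candidate_acc, teacher_acc):
        deploy(candidate)
        return True
    return False  # keep collecting; retrain next cycle
```

The only decision the orchestrator ever makes is the threshold comparison; everything else is data plumbing.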

Managed distillation platforms

You don't need GPUs or training infrastructure to distill. Three major platforms offer managed distillation today:

OpenAI: Stored Completions + Fine-tuning

The simplest path. Set store: true on your production API calls, filter the dataset, and fine-tune GPT-4o-mini on GPT-4o outputs. The entire workflow stays within OpenAI's API -- no data export, no training scripts.

Best for: Teams already on OpenAI who want minimal operational overhead.

Amazon Bedrock: Model Distillation

Bedrock's managed distillation supports function calling specifically -- you can distill a large model's tool-use behavior into a smaller one with data augmentation for agent use cases. Their published numbers: 500% faster inference, 75% cost reduction, under 2% accuracy loss for RAG and tool-calling tasks.

Best for: AWS-native teams who want distillation integrated with their existing Bedrock agent infrastructure.

Google Vertex AI: Gemini Distillation

Vertex AI lets you distill Gemini Pro into Gemini Flash or Flash Lite. Flash Lite achieves 24.1x lower cost per success than GPT-4o on benchmarks, and Gemini Flash scores within 2.3 percentage points of Gemini Pro on MMLU (78.9% vs 81.2%).

Best for: Teams building on Google Cloud who want to leverage the Gemini model family's efficiency.

Open-source: PyTorch torchtune + DeepSeek R1

For full control, PyTorch's torchtune library provides a complete recipe for distilling Llama 3.1 8B into Llama 3.2 1B. DeepSeek demonstrated the approach at scale: their R1-Distill-Qwen-32B outperforms OpenAI o1-mini on MATH-500, scoring 94.3%, and even the 7B distilled version hits 92.8%.

Best for: Teams with GPU access who want to self-host and avoid vendor lock-in.

When distillation fails

Distillation is not magic. It transfers narrow skills, not general intelligence. Here's where it breaks down:

Open-ended reasoning. If the task requires creative problem-solving or synthesizing information the model hasn't seen, the frontier model's reasoning ability doesn't fully transfer. The student learns patterns, not understanding.

Rare edge cases. Distillation captures the common distribution of your production traffic. The long tail -- the 2% of requests that are unusual -- is exactly where the student struggles. For a customer service agent, this might be a request that combines three intents in one sentence.

Multi-step planning. This is precisely why the Plan-and-Execute pattern exists. Distilled models execute well but plan poorly. Keep the expensive model for planning -- which is exactly what brought our $4,200 bill down to $84 instead of down to zero.

Rapidly changing tool schemas. If your agent's tools change frequently, the distilled model's training data becomes stale. The flywheel approach addresses this -- but you need enough production volume per schema version to retrain.

A practical rule of thumb: if a task can be described as a classification, extraction, or routing problem, distill it. If it requires open-ended judgment, keep the frontier model.

| Task type | Distillation fit | Why |
|---|---|---|
| Tool routing (pick from N tools) | Excellent | Classification problem -- 98% accuracy at 1B |
| Entity extraction (name, ID, intent) | Excellent | Structured output -- well-defined target |
| Response generation (open-ended) | Poor | Requires broad knowledge and reasoning |
| Multi-step planning | Poor | Small models lose coherence over long plans |
| Sentiment + escalation routing | Good | Binary/multi-class classification |
| FAQ with knowledge base | Moderate | Works if KB coverage is high; struggles on novel questions |

The economics, visualized

Let's put real numbers on the three architectures for an agent handling 50,000 conversations per month, averaging 800 tokens per conversation (input + output combined):

The numbers tell a clear story:

  1. All-frontier: $4,200/month. The accuracy baseline. Economically brutal at scale.
  2. Plan-and-Execute: ~$840/month. Frontier planner + distilled executor. 80% savings; planning quality is unchanged and execution accuracy stays within 2-3%.
  3. Full distillation: ~$84/month. Self-hosted 1B model. 98% cost reduction. Requires training infrastructure and monitoring.
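To sanity-check estimates like these against your own traffic, a back-of-envelope cost model is enough. The per-call token counts below are illustrative placeholders -- substitute your own measurements:

```python
def monthly_cost(conversations: int, calls_per_conv: int,
                 input_tokens_per_call: int, output_tokens_per_call: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Back-of-envelope monthly LLM bill for an agent workload."""
    calls = conversations * calls_per_conv
    input_cost = calls * input_tokens_per_call / 1_000_000 * price_in_per_m
    output_cost = calls * output_tokens_per_call / 1_000_000 * price_out_per_m
    return input_cost + output_cost

# Illustrative: 50k conversations/month, 8 calls each,
# ~2,000 input + 300 output tokens per call (assumed, not measured)
frontier = monthly_cost(50_000, 8, 2_000, 300, 2.50, 10.00)   # GPT-4o pricing
distilled = monthly_cost(50_000, 8, 2_000, 300, 0.15, 0.60)   # mini pricing
```

Plug in your real token counts and call frequencies; the ratio between tiers matters more than the absolute numbers.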

Most teams should start with Plan-and-Execute. It's the highest ROI with the lowest risk -- you keep the frontier model for the hard work and distill only the repetitive execution. As your monitoring confirms accuracy parity, you can progressively move more tasks to the distilled model.

The distillation decision tree

Here's how to decide what to distill in your agent:

[Decision tree: which agent tasks to distill. Start from an agent task. Is it narrow and well-defined? If not, keep the frontier model. If yes, do you have 100+ production examples? If not, collect more data (store: true). If yes, fine-tune and check: is accuracy within 2% of the teacher? If yes, deploy the distilled model. If not, can you curate better training data? If yes, re-curate and retrain; if not, keep the frontier model.]

The typical production agent has 3-5 distinct task types. Of those, 2-3 are narrow enough to distill. That's enough to cut your bill by 60-80% while keeping the frontier model for the tasks that genuinely need it.

Getting started today

If you want to distill your first agent task this week, here's the fastest path:

  1. Identify your highest-volume, narrowest task. Tool routing is almost always the best first candidate.

  2. Enable data collection. Set store: true (OpenAI), enable interaction logging (Bedrock), or log to your own dataset.

  3. Run for 1-2 weeks. You need at least 100-500 examples of the specific task. More is better, but curation matters more than volume.

  4. Curate aggressively. Keep only the examples where the teacher model got it right and the output is clean. Discard hedging, errors, and edge cases.

  5. Fine-tune and evaluate. Use the managed platform's fine-tuning API. Compare against a held-out test set at the same accuracy threshold you'd use for any model swap.

  6. Deploy with a kill switch. Route 10% of traffic to the distilled model, compare quality scores side-by-side, and ramp up as confidence grows.
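Step 6 can be a deterministic traffic splitter with a config-driven kill switch. A minimal sketch -- the model IDs are placeholders, and hashing the conversation ID keeps each conversation pinned to one model:

```python
import hashlib

DISTILLED_MODEL = "ft:gpt-4o-mini:your-org:tool-routing:abc123"  # placeholder ID
FRONTIER_MODEL = "gpt-4o"

def pick_model(conversation_id: str, canary_fraction: float = 0.10,
               kill_switch: bool = False) -> str:
    """Route a stable fraction of conversations to the distilled model.

    The kill switch instantly reverts all traffic to the frontier model
    if quality scores dip below your bar.
    """
    if kill_switch:
        return FRONTIER_MODEL
    # Stable hash -> bucket in [0, 100); same conversation, same model
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    return DISTILLED_MODEL if bucket < canary_fraction * 100 else FRONTIER_MODEL
```

Ramp `canary_fraction` from 0.10 toward 1.0 as side-by-side quality scores confirm parity.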

The DeepSeek R1 distillation results proved something important: even reasoning ability -- the hardest capability to transfer -- can be distilled effectively when you have the right training data and enough of it. Their 32B distilled model beat o1-mini on mathematical reasoning. A year ago, that would have sounded impossible.

Our agent still calls the same three tools and routes conversations to the same departments. The customers can't tell the difference. The only thing that changed is the bill: $4,200 became $84, because a 1B-parameter student learned everything the 70B teacher knew about our narrow set of tasks. The frontier models keep getting better. Distillation is how you capture that intelligence and run it at prices that make production economics work.


Building AI agents and want to monitor how distilled models perform against frontier ones? Chanl's scorecards let you compare model quality side-by-side across every conversation.


Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
