Chanl
Knowledge & Memory

Your RAG Returns Wrong Answers. Upgrading the Model Won't Help

Most RAG quality problems are retrieval problems, not model problems. Bad chunking, wrong embeddings, and missing re-ranking cause more hallucinations than model capability gaps.

Dean Grover, Co-founder
March 26, 2026
7 min read
Person examining documents through a magnifying glass

A team I spoke with last month was running GPT-4o for their customer support agent. Their knowledge base had 400 documents. The agent kept giving wrong answers about their refund policy.

Their fix? Upgrade to Claude Opus. Triple the per-token cost. The wrong answers got more articulate. They didn't get more correct.

This happens constantly. A RAG system returns bad answers, and the first instinct is to blame the model. Swap it for a bigger one. Fine-tune it. Throw more parameters at the problem. It almost never works, because the model was never the problem.

The Model Isn't Your Problem

Generation is the last step in the RAG pipeline, and it's downstream of everything else. The model receives chunks of text and synthesizes an answer from them. That's it. If the chunks don't contain the right information, the model has two options: say "I don't know" or fill the gap with something plausible. Most models choose the second option, because that's what they're optimized to do.

This is the part teams miss. When your RAG agent confidently states the wrong refund policy, it's not because the model lacks capability. It's because the retrieval pipeline handed it a chunk from a 2024 policy document instead of the current one. Or it handed the model a chunk that was cut off mid-sentence, missing the critical exception clause.

Upgrading from GPT-4o to Opus or Gemini won't fix that. You'll just get wrong answers delivered with better grammar.

Enterprise deployments bear this out. RAG reduces the 40-60% factual error rate seen in vanilla LLM chatbots to under 10%, but nearly all remaining errors trace back to retrieval failures, not generation failures. The model works fine when it gets the right context. The pipeline upstream is what breaks.

The Real Culprits

Three retrieval problems cause the vast majority of RAG quality failures. None of them are solved by a better model.

Bad chunking destroys context before it reaches the model

Fixed-size chunking is the default in most RAG tutorials and frameworks. Set a token limit (usually 500), split the document, add some overlap, embed, done.

The problem is that documents aren't written in 500-token units. A refund policy might have a main rule in one paragraph and a critical exception two paragraphs later. Fixed chunking splits them into separate chunks. The model retrieves the rule without the exception and delivers a confidently incomplete answer.

This is worse than a wrong answer. It's a partially correct answer that sounds authoritative, so nobody questions it until a customer gets burned.

Recursive chunking, which splits first by sections, then paragraphs, then sentences, preserves more structure. Semantic chunking goes further by detecting topic boundaries using embedding similarity, keeping conceptually unified text together. Recent benchmarks show recursive splitting at 512 tokens with overlap remains a strong default, but the right strategy depends entirely on your content. Legal docs need section-aware splitting. FAQ pages need question-answer pair preservation. One-size-fits-all chunking is the single biggest quality destroyer in most RAG systems.
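As a concrete sketch, here is a minimal recursive chunker: it splits on progressively finer separators (sections, then lines, then sentences) and only descends when a piece exceeds the budget, so a rule and its exception tend to stay together. The separator list, word-count budget (a rough token proxy), and sample policy text are all illustrative, not a production implementation.

```python
SEPARATORS = ["\n\n", "\n", ". "]

def recursive_chunk(text, max_words=120, seps=SEPARATORS):
    """Split on progressively finer separators until pieces fit the budget."""
    if len(text.split()) <= max_words or not seps:
        return [text.strip()] if text.strip() else []
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate.split()) <= max_words:
            buf = candidate          # still fits: keep growing this chunk
        else:
            if buf:
                chunks.append(buf.strip())
            if len(piece.split()) > max_words:
                # piece alone is too big: recurse with the next separator
                chunks.extend(recursive_chunk(piece, max_words, finer))
                buf = ""
            else:
                buf = piece
    if buf.strip():
        chunks.append(buf.strip())
    return chunks

doc = (
    "Refunds are available within 30 days of purchase.\n\n"
    "Exception: digital goods are non-refundable once downloaded.\n\n"
    "Contact support to start a refund request."
)
chunks = recursive_chunk(doc, max_words=20)
# The rule and its exception land in the same chunk; a blind
# fixed-size split could easily cut between them.
```

Run on the toy policy above, the rule paragraph and its exception stay in one chunk because together they still fit the budget; only the unrelated contact paragraph is split off.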

Wrong embedding model misses your domain

Your embedding model converts text into vectors. If the model doesn't understand your domain's vocabulary, semantically similar content won't land near each other in vector space.

A general-purpose embedding model trained on web text will treat "ARM processor" and "arm injury" as related. It'll put "401(k) rollover" far from "retirement account transfer" because the surface text looks different. (Our Learning AI series covers how embeddings and vector similarity actually work if you want the technical foundations.) Domain-specific content requires either a domain-tuned embedding model or, at minimum, one trained on diverse enough data to handle your terminology.

This is especially brutal for industries with heavy jargon: legal, medical, financial services, manufacturing. The embedding model is the lens through which your entire knowledge base is perceived. A blurry lens means blurry retrieval, no matter how sharp the model on the other end.
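One cheap gate before committing to an embedding model: score a handful of known domain synonym pairs against known false friends and check which side wins. The "embedding" below is a deliberately crude character-trigram stand-in (not a real model), and it fails the check exactly the way a surface-level model would; swap in your actual embedding call and keep the same pairs.

```python
import math
from collections import Counter

def embed(text):
    """Toy character-trigram 'embedding' standing in for a real model."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Pairs that SHOULD sit close in your domain, and pairs that shouldn't.
synonym_pairs = [("401(k) rollover", "retirement account transfer")]
false_friends = [("ARM processor", "arm injury")]

for a, b in synonym_pairs + false_friends:
    print(f"{a!r} vs {b!r}: {cosine(embed(a), embed(b)):.2f}")
```

The stand-in scores the false friends above the true synonyms, which is the failure mode to screen for: any candidate model that does the same on your vocabulary will blur your whole knowledge base.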

Pure vector search fails on exact terms

Vector search is excellent at "vibes." Query about customer complaints and it'll find documents about user feedback, support tickets, and satisfaction surveys. That semantic flexibility is the whole point.

But users don't always search by vibes. They search for "error code E-4012." They search for "Model X Pro 2026 warranty." They search for "Section 7.3.2 of the service agreement."

Pure vector search handles these terribly. The embedding for "E-4012" doesn't reliably land near the document that mentions error code E-4012, because there's no semantic relationship to exploit. It's a literal string match problem being solved by a semantic similarity tool.

This is why production RAG systems are moving to hybrid search: vector similarity for meaning, BM25 keyword matching for exact terms. The numbers are hard to argue with. Studies show BM25 alone retrieves relevant documents 62% of the time. Vector search alone hits 71%. Hybrid search with re-ranking reaches 87%.
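A minimal illustration of the hybrid idea, assuming a tiny invented corpus: a small Okapi BM25 scorer nails the exact error code, the vector ranking (hard-coded here to mimic semantic search burying it) does not, and reciprocal rank fusion merges the two rank lists without needing score calibration. Corpus, query, and `vector_rank` are all made up for the sketch.

```python
import math
from collections import Counter

docs = [
    "Error code E-4012 indicates a failed firmware update.",
    "Customer complaints about billing are routed to support.",
    "Warranty terms for Model X Pro 2026 cover two years.",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over whitespace tokens; enough for a demo corpus."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = [0.0] * n
    for term in query.lower().split():
        df = sum(term in toks for toks in tokenized)
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        for i, toks in enumerate(tokenized):
            tf = toks.count(term)
            denom = tf + k1 * (1 - b + b * len(toks) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / denom
    return scores

def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge rank lists without score calibration."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

query = "error code E-4012"
kw = bm25_scores(query, docs)
keyword_rank = sorted(range(len(docs)), key=lambda i: -kw[i])
vector_rank = [1, 2, 0]   # invented: semantic search buries the exact code
fused = rrf([keyword_rank, vector_rank])
```

BM25 puts the E-4012 document first, and fusion keeps it near the top of the combined list even though the vector ranking placed it last. That recovery of exact-term matches is the whole point of running both retrievers.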

The Fixes That Actually Work

Each failure point in the retrieval pipeline has a well-understood fix. None of them require a bigger model.

| Problem | Default Approach | Better Approach | Improvement |
| --- | --- | --- | --- |
| Chunking | Fixed 500-token splits | Recursive or semantic chunking at natural boundaries | Preserves context, reduces partial-answer hallucinations |
| Embeddings | General-purpose model (e.g., text-embedding-ada-002) | Domain-tuned or multilingual model matched to content | Better clustering of domain-specific concepts |
| Search | Pure vector similarity (top-K) | Hybrid search: vector + BM25 keyword matching | 20-30% retrieval accuracy improvement |
| Ranking | Return top-K results as-is | Two-stage: retrieve top-20, re-rank with cross-encoder, keep top-3 | Up to 28% NDCG improvement, catches misranked relevant docs |
| Freshness | Re-index manually when someone remembers | Version-controlled docs with freshness weighting | Prevents stale content from outranking current policy |

Re-ranking deserves special attention because it's the highest-impact fix most teams haven't tried. A cross-encoder re-ranker (like Cohere Rerank or an open-source BGE model) examines each query-document pair together, not independently. This catches cases where a document is highly relevant but didn't score well in initial retrieval because the query phrasing didn't match. Adding re-ranking to an existing pipeline typically improves accuracy 20-35% with only 200-500ms of additional latency.
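The two-stage pattern itself is simple enough to sketch. Both scorers below are toy stand-ins based on token overlap: in production, `cheap_score` would be your vector index and `cross_encoder_score` would be a real pairwise model (Cohere Rerank, a BGE cross-encoder). The corpus and query are invented.

```python
def cheap_score(query, doc):
    """First-stage proxy: shared-word count (your vector similarity here)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def cross_encoder_score(query, doc):
    """Placeholder pairwise scorer: overlap normalized by doc length."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (1 + len(d))

def two_stage_search(query, docs, first_k=20, final_k=3):
    # Stage 1: cheap retrieval over the whole corpus.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:first_k]
    # Stage 2: expensive pairwise re-ranking on the short list only.
    reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                      reverse=True)
    return reranked[:final_k]

docs = [
    "Refund requests must be made within 30 days.",
    "Digital goods are non-refundable once downloaded.",
    "Our office is closed on public holidays.",
    "Refunds for digital goods follow a separate policy.",
]
top = two_stage_search("refund policy for digital goods", docs,
                       first_k=4, final_k=2)
```

The shape is what matters: the expensive scorer only ever sees `first_k` candidates, which is why re-ranking adds hundreds of milliseconds rather than seconds even over large corpora.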

The catch is that these fixes are cumulative. Hybrid search alone helps. Hybrid search plus re-ranking helps more. But if your chunks are shredding context at the source, better search and ranking just surface broken chunks more efficiently.

Fix chunking first. Then search. Then ranking. That's the order. (Chanl's knowledge base handles all three layers so you're not building custom retrieval infrastructure from scratch.)

How to Diagnose Before You Spend

Before upgrading your model, spending a weekend on re-ranking infrastructure, or switching vector databases, do this: look at what your retrieval pipeline actually returns.

Pull the last 50 queries where users reported wrong answers. For each one, inspect the chunks that were retrieved. Not the final generated answer. The chunks. Ask two questions:

First, does the retrieved content contain the correct answer? If the right information isn't in the chunks, no model on earth will generate the right answer. This is a retrieval problem, full stop.

Second, is the correct chunk present but ranked too low? If the answer appears in chunk number 8 but you're only passing the top 3 to the model, you have a ranking problem. Re-ranking fixes this.

If the right chunk is retrieved and ranked highly, and the model still generates a wrong answer, then you have a generation problem. Upgrade the model. But in my experience, this is the root cause less than 20% of the time.
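The triage above is mechanical enough to script. A rough sketch, assuming you can log retrieved chunks per query and supply a snippet that a correct chunk must contain (the sample data is invented):

```python
def diagnose(failures, top_k=3):
    """Classify failed queries by where the pipeline broke.

    Each failure is (query, retrieved_chunks, answer_snippet): chunks in
    ranked order, plus a snippet a correct chunk must contain.
    """
    counts = {"retrieval": 0, "ranking": 0, "generation": 0}
    for _query, chunks, snippet in failures:
        hit = next((i for i, c in enumerate(chunks) if snippet in c), None)
        if hit is None:
            counts["retrieval"] += 1    # right info never retrieved
        elif hit >= top_k:
            counts["ranking"] += 1      # retrieved, but below the cutoff
        else:
            counts["generation"] += 1   # model had it and still failed
    return counts

failures = [
    ("refund window?", ["shipping info", "holiday hours"], "30 days"),
    ("refund window?", ["a", "b", "c", "refunds within 30 days"], "30 days"),
    ("refund window?", ["refunds within 30 days", "b"], "30 days"),
]
counts = diagnose(failures, top_k=3)
```

Run this over your 50 failed queries and the histogram tells you which fix to fund first, before any money goes to a bigger model.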

Separate retrieval quality measurement from generation quality measurement. Tools like analytics dashboards and automated scorecards help here, giving you visibility into what's happening at each stage of the pipeline rather than just measuring the end result. Without that separation, you're debugging a five-stage pipeline by looking only at the output.

We covered the mechanics of building these retrieval pipelines in RAG from Scratch, including chunking strategies and embedding selection. And if you're running into problems where RAG alone isn't covering your production needs, the knowledge base bottleneck article digs into the broader architecture around freshness, governance, and structured data as a complementary retrieval tier.

Fix Retrieval First. Upgrade Models Second.

GPT-4o-mini with great retrieval will beat Claude Opus with bad retrieval. Every time.

The math is simple. If your retrieval pipeline returns the right chunks 90% of the time, even a mid-tier model generates correct answers for most queries. If your pipeline returns the right chunks 50% of the time, Opus will hallucinate for the other 50% just as confidently as 4o-mini would.
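That back-of-the-envelope math can be written down directly. The generation rates here are illustrative assumptions (a capable model answers correctly ~95% of the time given the right chunks, and almost never guesses right without them), not measured figures:

```python
def expected_accuracy(retrieval_hit_rate, gen_acc_good=0.95, gen_acc_bad=0.05):
    # P(correct) ~= P(right chunks) * P(model uses them)
    #             + P(wrong chunks) * P(lucky guess)
    return retrieval_hit_rate * gen_acc_good + (1 - retrieval_hit_rate) * gen_acc_bad

print(expected_accuracy(0.9))   # strong retrieval, mid-tier model
print(expected_accuracy(0.5))   # weak retrieval, any model
```

With these assumptions, 90% retrieval yields roughly 86% end-to-end accuracy while 50% retrieval caps you near 50%, and no generation-side improvement can close that gap.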

The model is the last mile. Retrieval is the road. Build the road first.

Once retrieval is solid, pairing it with tools that verify facts against live data sources and persistent memory that tracks what your agent has learned across conversations closes the remaining gaps. But none of that matters if the chunks feeding your model are wrong.

Stop blaming the model. Start inspecting your chunks.

And if you're also wondering why your agent can call 50 tools but can't remember what a customer said yesterday, that's a different gap entirely. If you're wrestling with tool calling differences across providers, MCP is solving that fragmentation at the protocol level.

RAG That Works Out of the Box

Chanl's knowledge base handles chunking, hybrid search, and retrieval quality so you don't have to build the pipeline yourself.

Try Knowledge Base
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

One lesson a week — practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed
