Naive RAG reached ~60% accuracy in our evals. Contextual RAG (semantic chunking + hybrid search + reranking) reached 88–92%. Hybrid search (70% semantic, 30% keyword) improved retrieval recall by 15–20% on keyword-heavy queries. Cross-encoder reranking added 5–8% accuracy at 100–200ms latency cost. Use hybrid + lightweight reranking by default; add cross-encoder only for high-stakes flows.
Naive RAG—embed, retrieve top-K, generate—struggled in production. Semantic similarity didn't guarantee relevance; keyword-heavy queries (product names, error codes) underperformed. Here's what moved the needle.
How we got here
We ran our naive RAG setup for several months before hitting a wall. User feedback was consistent: answers were often wrong or incomplete when the question required information from multiple places. We traced the failures: retrieval was returning semantically similar chunks, but "similar" didn't mean "contains the answer." A query about "refund policy for enterprise customers" might retrieve chunks about "refund" and "enterprise" separately, but the specific combination—enterprise refund policy—lived in a third chunk that didn't rank highly for either term alone.
We also noticed that keyword-heavy queries (e.g., product names, error codes) performed poorly with pure semantic search. Embedding models smooth over exact matches; "ERR_502" and "502 error" might not embed close enough to surface the right troubleshooting doc. Pure keyword search, on the other hand, missed paraphrased questions. Users don't always use the same words as the documentation.
This raised a larger question: can we combine semantic and keyword signals without doubling latency and complexity?
Hybrid search: semantic + keyword
Hybrid search (70% semantic, 30% keyword) improved retrieval recall by 15–20% on our eval set. Latency overhead: 20–50ms. Biggest gains on product names, error codes, technical identifiers.
Hybrid search combines vector similarity with BM25/full-text. Semantic catches paraphrased queries; keyword catches exact matches. We maintain both embedding and full-text indexes. We recommend hybrid for technical docs, support KBs, and product documentation; semantic-only suffices for conversational prose.
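The merge step can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes you already have per-chunk scores from each retriever, min-max normalizes both score sets onto a common scale, and applies the 70/30 weighting from above. (Function and parameter names are ours; reciprocal rank fusion is a common alternative to score normalization.)

```python
def hybrid_merge(semantic_scores, keyword_scores, w_semantic=0.7, w_keyword=0.3):
    """Merge {doc_id: score} dicts from two retrievers via min-max
    normalization and a weighted sum (70% semantic, 30% keyword)."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem, kw = normalize(semantic_scores), normalize(keyword_scores)
    merged = {
        doc: w_semantic * sem.get(doc, 0.0) + w_keyword * kw.get(doc, 0.0)
        for doc in set(sem) | set(kw)
    }
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

Normalization matters because cosine similarities and BM25 scores live on different scales; summing them raw lets one retriever silently dominate.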
Reranking: filtering before generation
Cross-encoder reranking added 5–8% accuracy over hybrid alone, at 100–200ms latency cost. Lightweight keyword-overlap reranking is fast; catches obvious mismatches (e.g., "billing" chunk for "shipping" query).
Retrieve more (e.g., top-20), rerank to 3–5 before the LLM. Cross-encoders: slower, more accurate. Keyword overlap: fast, less effective for conceptual relevance. Training-free reranking (LLM confidence signals) reports 10–20% NDCG gains in benchmarks, but it adds LLM calls that may negate the latency advantage—we haven't deployed it yet.
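The lightweight variant is almost trivial. A sketch of a keyword-overlap reranker (names are illustrative; a real implementation would at least stem tokens and drop stopwords):

```python
def keyword_overlap_rerank(query, chunks, keep=5):
    """Order retrieved chunks by token overlap with the query, then
    keep only the top `keep` for the LLM context window."""
    query_tokens = set(query.lower().split())

    def overlap(chunk):
        return len(query_tokens & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:keep]
```

This catches the "billing chunk for a shipping query" class of mismatch at negligible latency, which is why it's our default before reaching for a cross-encoder.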
Contextual retrieval: the full stack
Full stack (semantic chunking + hybrid + reranking) raised accuracy from ~60% to ~88–92% in our eval. Query expansion helps with ambiguous questions but multiplies cost and noise; use it selectively, e.g., for support Q&A.
Contextual retrieval means: semantic chunking, hybrid search, reranking, optional query expansion. Multiple indexes and merge logic add complexity. Numbers depend on domain and data quality.
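Of these layers, semantic chunking is the least standardized, so a sketch helps. One common approach (which we're illustrating here, not prescribing) starts a new chunk whenever the embedding similarity between adjacent sentences drops below a threshold; the `embed` callable and the 0.75 cutoff are assumptions for the example, not tuned values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.75):
    """Greedy semantic chunking: break whenever similarity between
    adjacent sentence embeddings falls below `threshold`."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            current.append(cur)  # same topic: extend the chunk
        else:
            chunks.append(" ".join(current))  # topic shift: close it
            current = [cur]
    chunks.append(" ".join(current))
    return chunks
```

Unlike fixed-size chunking, this keeps "enterprise refund policy" sentences together even when a size boundary would have split them.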
The Meterra approach
We've landed on a pipeline that defaults to hybrid search + lightweight reranking. We add cross-encoder reranking only for high-stakes flows (e.g., legal or compliance Q&A) where latency is less critical. We use semantic chunking everywhere; we've deprecated fixed-size chunking for new deployments.
The key is treating each layer as optional. Start with hybrid search. If recall is good but precision is low, add reranking. If queries are ambiguous, experiment with query expansion. Measure at each step—don't add complexity without evidence it helps.
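That layering can be made explicit in the pipeline's shape. A hedged sketch (the callables and names are placeholders, not our actual interfaces): hybrid retrieval is the always-on baseline, while reranking and query expansion are optional plug-ins you enable only when evals justify them.

```python
def answer_pipeline(query, semantic_search, keyword_search,
                    rerank=None, expand=None, top_n=5):
    """Hybrid retrieval is the baseline; `rerank` and `expand` are
    optional layers, added only with eval evidence that they help."""
    queries = [query] + (expand(query) if expand else [])
    candidates = []
    for q in queries:
        candidates += semantic_search(q) + keyword_search(q)
    # De-duplicate while preserving retrieval order.
    seen, merged = set(), []
    for chunk in candidates:
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    if rerank:
        merged = rerank(query, merged)
    return merged[:top_n]
```

Keeping the optional layers as injectable callables also makes A/B evals cheap: swap one argument, rerun the eval set, compare.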
What we recommend
Given how RAG has evolved, we recommend:
1. Move to hybrid search if you haven't.
The recall gains for keyword-heavy content are substantial. Most vector databases and search platforms support it natively.
2. Add reranking before generation.
Retrieve more, rerank to fewer. Even a simple keyword-overlap reranker helps. Cross-encoders are worth it when accuracy matters more than latency.
3. Evaluate retrieval in isolation.
Measure recall@K and precision@K before and after each change. If retrieval is the bottleneck, improving the model or prompt won't help.
4. Document your pipeline.
Chunking strategy, retrieval config, reranking logic—version and document it. When answers degrade, you need to know what changed.
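The retrieval metrics from recommendation 3 are simple enough to keep in your eval harness directly. A minimal sketch (standard definitions; the function names are ours):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are actually relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k
```

Run both before and after each pipeline change: a reranker should raise precision@K without hurting recall@K, and a retrieval-stage change that moves neither metric isn't worth its complexity.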
RAG in 2026 isn't "RAG is dead"—it's "naive RAG is dead." The systems that work are more sophisticated, but the principles are the same: retrieve well, chunk sensibly, and design for retrieval failure.