NEXUS
AI · RAG · 11 min read

RAG, evaluated honestly

Retrieval is a search problem dressed as an AI problem. Evals are the only thing that matters — and most teams don't have them.

By Andrea Ventura · Engineering

Most RAG projects we audit have one thing in common: nobody can answer the question 'is this any good?' with a number. That's not a sin — until the LLM is in front of customers and the team can't tell whether yesterday's prompt change made it better or worse.

Build the eval set first

Before you pick an embedding model, before you pick a vector store, before you write a single prompt — sit down with the people who'll use the system, write fifty real queries, and grade the answers you'd want each to return. That's the artifact you measure against for the rest of the engagement.
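One way to keep that artifact honest is to give it a schema and a check that fails loudly when it's incomplete. The sketch below is a minimal illustration, not a prescribed format: the `EvalCase` fields, the `validate_eval_set` helper, and the idea of labeling relevant document IDs alongside each graded answer are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One real user query plus its graded target (hypothetical schema)."""
    query: str
    reference_answer: str              # the answer the graders want returned
    relevant_doc_ids: frozenset[str]   # assumption: documents that should be retrieved


def validate_eval_set(cases: list[EvalCase], min_cases: int = 50) -> list[str]:
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    if len(cases) < min_cases:
        problems.append(f"only {len(cases)} cases, expected at least {min_cases}")

    seen_queries = set()
    for i, case in enumerate(cases):
        if not case.query.strip():
            problems.append(f"case {i}: empty query")
        if not case.reference_answer.strip():
            problems.append(f"case {i}: no graded answer")
        if not case.relevant_doc_ids:
            problems.append(f"case {i}: no relevant documents labeled")

        # Near-identical queries inflate the count without adding coverage.
        key = case.query.strip().lower()
        if key in seen_queries:
            problems.append(f"case {i}: duplicate query")
        seen_queries.add(key)
    return problems
```

Running `validate_eval_set` in CI before any retrieval or prompt work starts is one way to enforce "eval set first" mechanically rather than by good intentions.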

Two metrics, no more

Recall@k for retrieval. Faithfulness for generation. Anything else is decoration unless you have a specific failure mode to instrument. Resist the urge to track 17 metrics nobody looks at.
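Both metrics fit in a few lines, which is part of the argument for stopping at two. The sketch below makes some assumptions worth naming: recall@k is taken as the fraction of labeled relevant documents found in the top k (some teams use a hit-rate variant instead), and faithfulness is taken as the share of answer claims a judge marks as supported by the retrieved context. The `is_supported` judge is a hypothetical callable you would supply (an LLM grader, an NLI model, or a human); how an answer gets split into claims is left out of scope.

```python
from typing import Callable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Sequence[tuple[Sequence[str], set[str]]], k: int
) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


def faithfulness(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims the judge finds supported by the context."""
    if not claims:
        # Convention chosen for this sketch: no claims means nothing unfaithful.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


if __name__ == "__main__":
    # Toy judge for demonstration only: substring match is NOT a real grader.
    def toy_judge(claim: str, context: str) -> bool:
        return claim.lower() in context.lower()

    print(recall_at_k(["d3", "d7", "d1"], {"d1", "d7"}, k=2))
    print(faithfulness(["free shipping"], "Returns within 30 days.", toy_judge))
```

Scoring the same eval set before and after every prompt or chunking change gives you the single number to compare, which is the whole point.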

Recall@5 went from 41% to 87% by fixing the chunking strategy. The model never changed.

From a recent retail engagement

What we tell clients

If the eval set isn't ready, we delay the engagement. We've turned down two prospects this year for this reason. Both came back six months later with proper data.

Cite this article

Andrea Ventura. “RAG, evaluated honestly.” NEXUS Journal. http://localhost:3000/en/journal/rag-evaluated-honestly
