RAG, evaluated honestly
Retrieval is a search problem dressed as an AI problem. Evals are the only thing that matters — and most teams don't have them.
Most RAG projects we audit have one thing in common: nobody can answer the question 'is this any good?' with a number. That's not a sin — until the LLM is in front of customers and the team can't tell whether yesterday's prompt change made it better or worse.
Build the eval set first
Before you pick an embedding model, before you pick a vector store, before you write a single prompt, sit down with the people who'll use the system, write fifty real queries, and record the answer you'd want each one to return. That's the artifact you measure against for the rest of the engagement.
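One way to keep that artifact honest is to store it as plain data that lives in version control next to the prompts. The sketch below is a minimal illustration, not a prescribed schema: the `EvalCase` fields, the JSON Lines layout, and the sample query and document id are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class EvalCase:
    """One graded query in the eval set (field names are illustrative)."""
    query: str                   # a real question, in the user's own words
    expected_answer: str         # the answer the users agreed they want back
    relevant_doc_ids: List[str]  # documents that must be retrieved to answer it


def dump_cases(cases: List[EvalCase]) -> str:
    """Serialize cases as JSON Lines: one case per line, easy to diff in review."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


def load_cases(text: str) -> List[EvalCase]:
    """Parse JSON Lines back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical sample case; a real set would hold the fifty queries.
cases = [
    EvalCase(
        query="How do I reset my password?",
        expected_answer="Use the reset link on the sign-in page.",
        relevant_doc_ids=["kb-017"],
    ),
]

# Round-trip check: what is written to disk is exactly what is read back.
assert load_cases(dump_cases(cases)) == cases
```

Recording the relevant document ids alongside the expected answer is a design choice: it lets retrieval be scored on its own, before any generation step exists.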
Two metrics, no more
Recall@k for retrieval. Faithfulness for generation. Anything else is decoration unless you have a specific failure mode to instrument. Resist the urge to track 17 metrics nobody looks at.
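Both metrics reduce to simple ratios once the eval set exists. The sketch below assumes each case lists the ids of its relevant documents and that the answer has already been split into individual claims. The `is_supported` judge is a hypothetical hook: in practice it would be a human grader or a model-based check, and the `substring_judge` here is only a toy stand-in so the example runs.

```python
from typing import Callable, List, Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each eval case needs at least one relevant document")
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of the answer's claims that the judge finds supported by the context."""
    if not claims:
        return 1.0  # assumption: an answer that asserts nothing cannot be unfaithful
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy stand-in for a real judge: exact, case-insensitive containment."""
    return claim.lower() in context.lower()


# Hypothetical data for one eval case.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d3"}
print(recall_at_k(retrieved, relevant, k=5))  # 0.5: d2 is in the top 5, d3 is not

context = "Refunds are issued within 14 days. Shipping is free over 50 EUR."
claims = ["refunds are issued within 14 days", "shipping is always free"]
print(faithfulness(claims, context, substring_judge))  # 0.5: one of two claims supported
```

Averaging each score over the whole eval set, once before and once after a change, is what turns "is this any good?" into a number.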
Recall@5 went from 41% to 87% by fixing the chunking strategy. The model never changed.
What we tell clients
If the eval set isn't ready, we delay the engagement. We've turned down two prospects this year for this reason. Both came back six months later with proper data.
Cite this article
Andrea Ventura. “RAG, evaluated honestly.” NEXUS Journal. http://localhost:3000/en/journal/rag-evaluated-honestly