A RAG evaluation playbook before your document assistant goes live
A step-by-step tutorial for building grounded RAG test sets, citation checks, retrieval diagnostics, freshness rules, and regression gates.
A RAG assistant should not go live just because three sample questions worked in a demo. Retrieval-augmented generation needs an evaluation loop that tests source coverage, answer grounding, citation quality, freshness, and refusal behavior, and that catches regressions after every prompt, chunking, or model change.
Why this matters now
RAG evaluation tooling has matured around complementary layers. Ragas focuses on metrics for retrieval and generation quality. LangSmith provides tracing and evaluation for chains and agents. LlamaIndex documents evaluation patterns inside its framework. Haystack offers a production-oriented pipeline model. The tools differ, but the core questions are stable: did we retrieve the right evidence, did the model use it faithfully, and can users trust the answer path?
Selection frame
Start with a small but realistic test set. Include questions whose answer is in one document, questions that require combining documents, questions with stale or conflicting information, and questions that should be refused. Label the expected source, acceptable answer, and unacceptable answer. This is slow work, but it prevents the team from optimizing a chatbot around happy-path examples.
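One way to make those labels concrete is a small record per question. The sketch below is an assumed shape, not a format that any of the tools named above require. The class and field names (`EvalCase`, `expected_sources`, `should_refuse`) and the sample questions are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionType(Enum):
    SINGLE_DOC = "single_doc"      # answer lives in one document
    MULTI_DOC = "multi_doc"        # answer requires combining documents
    CONFLICTING = "conflicting"    # stale or conflicting sources exist
    UNANSWERABLE = "unanswerable"  # the assistant should refuse


@dataclass
class EvalCase:
    case_id: str
    question: str
    question_type: QuestionType
    expected_sources: list[str] = field(default_factory=list)
    acceptable_answer: str = ""
    unacceptable_answer: str = ""

    @property
    def should_refuse(self) -> bool:
        return self.question_type is QuestionType.UNANSWERABLE

    def validate(self) -> list[str]:
        """Return labelling problems; an empty list means the case is usable."""
        problems = []
        if not self.question.strip():
            problems.append("question is empty")
        if self.should_refuse and self.expected_sources:
            problems.append("unanswerable case should not list expected sources")
        if not self.should_refuse and not self.expected_sources:
            problems.append("answerable case needs at least one expected source")
        if (self.question_type is QuestionType.MULTI_DOC
                and len(self.expected_sources) < 2):
            problems.append("multi-document case needs two or more expected sources")
        return problems


# Illustrative cases only; the questions, paths, and answers are invented.
cases = [
    EvalCase("q1", "What is the refund window?", QuestionType.SINGLE_DOC,
             ["policies/refunds.md"], "30 days from delivery", "90 days"),
    EvalCase("q2", "What is the CEO's home address?", QuestionType.UNANSWERABLE),
]

for case in cases:
    print(case.case_id, case.validate())
```

Running `validate()` over the whole set before each evaluation run catches labelling gaps early, so a missing source label does not show up later as a false retrieval failure.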
Practical implementation path
- Collect real questions. Use support tickets, sales questions, onboarding docs, analyst notes, or search logs. Synthetic questions are useful later, but the first set should come from the workflow users actually run.
- Label evidence, not just answers. For each question, mark the source document or passage that supports the answer. A RAG system that answers correctly from the wrong evidence is still brittle.
- Track retrieval diagnostics. Before scoring any generated answer, measure whether the right document appears in the top retrieved results. If retrieval fails, answer scoring will not tell you how to fix chunking, metadata, filters, or embeddings.
- Score citations manually at first. Check whether citations support claims and whether the answer includes unsupported details. Automate later after the rubric is stable.
- Run regression gates. Every model, prompt, parser, embedding, reranker, or chunking change should run the same tests. Store results so the team can see drift.
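The retrieval-diagnostics and regression-gate steps above can be sketched with two standard ranking measures, hit rate at k and mean reciprocal rank, plus a simple threshold check. This is a minimal illustration that assumes retrieval results are available as ordered lists of document IDs per question. The function names, the default `k`, and the `tolerance` value are assumptions, not settings from any particular tool.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  expected: dict[str, set[str]],
                  k: int = 5) -> float:
    """Fraction of questions with at least one expected source in the top k."""
    if not expected:
        raise ValueError("expected labels are empty")
    hits = 0
    for case_id, sources in expected.items():
        top_k = retrieved.get(case_id, [])[:k]
        if any(doc_id in sources for doc_id in top_k):
            hits += 1
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: dict[str, list[str]],
                         expected: dict[str, set[str]]) -> float:
    """Average of 1/rank of the first expected source (0 when none is retrieved)."""
    if not expected:
        raise ValueError("expected labels are empty")
    total = 0.0
    for case_id, sources in expected.items():
        for rank, doc_id in enumerate(retrieved.get(case_id, []), start=1):
            if doc_id in sources:
                total += 1.0 / rank
                break
    return total / len(expected)


def regression_gate(baseline: float, candidate: float,
                    tolerance: float = 0.02) -> bool:
    """Pass when the candidate score has not dropped by more than `tolerance`."""
    return candidate >= baseline - tolerance


# Invented example: labels, a stored baseline run, and a run after a chunking change.
expected = {"q1": {"refunds.md"}, "q2": {"shipping.md", "returns.md"}}
baseline_run = {"q1": ["refunds.md", "faq.md"], "q2": ["faq.md", "shipping.md"]}
candidate_run = {"q1": ["faq.md", "pricing.md"], "q2": ["shipping.md", "faq.md"]}

baseline_score = hit_rate_at_k(baseline_run, expected, k=2)
candidate_score = hit_rate_at_k(candidate_run, expected, k=2)
print(baseline_score, candidate_score)
print("gate passed:", regression_gate(baseline_score, candidate_score))
```

Only answerable questions belong in `expected`; questions that should be refused have no source to retrieve and would otherwise drag both scores down. Storing `baseline_score` with each run is what lets the team see drift over time.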
Evaluation checklist
- Answer faithfulness. Does the answer stay within retrieved evidence instead of adding plausible but unsupported claims?
- Retrieval recall. Do expected source documents appear early enough for the generator to use them?
- Citation usefulness. Can a user click the citation and verify the claim quickly?
- Refusal quality. Does the assistant say when the corpus does not contain enough evidence?
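The refusal item in the checklist above can be tallied once each response has been judged as a refusal or an answer attempt. The sketch below assumes that judgment already exists, made by a reviewer or a separate classifier, and only counts outcomes. `RefusalResult` and the four outcome names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RefusalResult:
    case_id: str
    should_refuse: bool  # label from the test set
    did_refuse: bool     # judged by a reviewer or a separate classifier


def refusal_report(results: list[RefusalResult]) -> dict[str, int]:
    """Count the four combinations of expected and observed refusal behavior."""
    report = {
        "correct_refusal": 0,         # refused when the corpus lacks evidence
        "missed_refusal": 0,          # answered when it should have refused
        "over_refusal": 0,            # refused although evidence exists
        "correct_answer_attempt": 0,  # answered an answerable question
    }
    for result in results:
        if result.should_refuse and result.did_refuse:
            report["correct_refusal"] += 1
        elif result.should_refuse:
            report["missed_refusal"] += 1
        elif result.did_refuse:
            report["over_refusal"] += 1
        else:
            report["correct_answer_attempt"] += 1
    return report


# Invented judgments for three test cases.
results = [
    RefusalResult("q1", should_refuse=False, did_refuse=False),
    RefusalResult("q2", should_refuse=True, did_refuse=False),
    RefusalResult("q3", should_refuse=True, did_refuse=True),
]
print(refusal_report(results))
```

Missed refusals and over-refusals are kept apart because they point to different fixes: the first usually means the prompt or retrieval threshold is too permissive, the second that it is too strict.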
Common failure modes
- Only measuring final answers. Final answer scores hide retrieval problems and make fixes slower.
- Over-tuning to tiny evals. A ten-question set can catch regressions but should not become the only definition of quality.
- Forgetting freshness. Document assistants become risky when policies, prices, or procedures change faster than the index updates.
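The freshness failure mode can be turned into a simple rule: flag any document that has never been indexed, or whose source changed after the last index build and has stayed unindexed longer than an agreed lag. The sketch assumes the team can obtain a last-modified timestamp per source document and a last-indexed timestamp per document. The 24-hour default and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone


def stale_documents(source_modified: dict[str, datetime],
                    indexed_at: dict[str, datetime],
                    now: datetime,
                    max_lag: timedelta = timedelta(hours=24)) -> list[str]:
    """Return IDs of documents that are unindexed or stale for longer than max_lag."""
    stale = []
    for doc_id, modified in source_modified.items():
        indexed = indexed_at.get(doc_id)
        if indexed is None:
            stale.append(doc_id)  # never indexed at all
        elif modified > indexed and now - modified > max_lag:
            stale.append(doc_id)  # changed after indexing and not refreshed in time
    return sorted(stale)


# Invented timestamps for illustration.
utc = timezone.utc
now = datetime(2024, 1, 10, tzinfo=utc)
source_modified = {
    "pricing.md": datetime(2024, 1, 5, tzinfo=utc),
    "faq.md": datetime(2024, 1, 9, 18, tzinfo=utc),
    "policy.md": datetime(2024, 1, 1, tzinfo=utc),
    "new.md": datetime(2024, 1, 9, tzinfo=utc),
}
indexed_at = {
    "pricing.md": datetime(2024, 1, 2, tzinfo=utc),
    "faq.md": datetime(2024, 1, 8, tzinfo=utc),
    "policy.md": datetime(2024, 1, 3, tzinfo=utc),
}
print(stale_documents(source_modified, indexed_at, now))
```

In this example `faq.md` changed after indexing but only six hours ago, so it is still within the allowed lag. A tighter `max_lag` makes sense for prices or policies that change often.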
Working decision record
Before choosing a vendor or open-source project for this workflow, write a one-page decision record. It should name the business owner, user group, data involved, expected output, review owner, and the reason the workflow belongs in the tutorials lane rather than a neighboring category. Add the source links that shaped the decision, including the Ragas documentation, the LangSmith evaluation concepts, and the LlamaIndex evaluation guide, and note which claims came from vendor documentation and which came from your own pilot. This prevents a future reviewer from mistaking a marketing claim for field evidence.
The record should also state what will not be automated in the first release. That boundary is easy to skip, but it is often the most useful part of the document. For a workflow that depends on retrieval and answer quality, write down the situations where the tool should ask for clarification, hand off to a person, or stop. Those negative cases make adoption safer and give the team a way to compare tools like Ragas, LlamaIndex, Haystack, and Promptfoo without being distracted by polished demos.
Pilot plan
Run the first pilot with a narrow group and a fixed task set. A good pilot lasts long enough to see repeated behavior but short enough to shut down quickly if quality is poor. Use ten to twenty representative tasks, keep the source material stable, and capture every failure in the same format: user goal, input, tool response, expected response, severity, suspected cause, and proposed fix. If a tool requires special setup, include setup time in the score. A system that performs well only after undocumented tuning will be hard to hand to another team.
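Capturing every failure in the same format is easier when the format is fixed in code. The sketch below uses the seven fields listed above and writes them to CSV with the standard library. The class name, the severity scale, and the sample row are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PilotFailure:
    user_goal: str
    user_input: str
    tool_response: str
    expected_response: str
    severity: str  # assumed scale: "low", "medium", "high"
    suspected_cause: str
    proposed_fix: str


def failures_to_csv(failures: list[PilotFailure]) -> str:
    """Serialize failures to CSV text with one header row and one row per failure."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(PilotFailure)]
    )
    writer.writeheader()
    for failure in failures:
        writer.writerow(asdict(failure))
    return buffer.getvalue()


# Invented example row.
log = [
    PilotFailure(
        user_goal="Find the refund window",
        user_input="How long do I have to return an item?",
        tool_response="You have 90 days.",
        expected_response="30 days from delivery",
        severity="high",
        suspected_cause="stale index",
        proposed_fix="re-index policies folder",
    )
]
print(failures_to_csv(log))
```

A flat file like this is enough for a ten-to-twenty task pilot and can be sorted by `severity` or `suspected_cause` during review.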
At the end of the pilot, make a decision using evidence rather than enthusiasm. Keep a small table with quality, latency, cost, review burden, data exposure, integration work, and maintenance owner. If the tool wins on quality but loses on governance or operations, that is not a failure; it is a signal that the first deployment should stay narrower. If the tool loses on the core task, do not rescue it with a broader roadmap. Move on and preserve the lessons in the decision record.
Procurement and maintenance notes
For commercial tools, ask how data is stored, how model providers are selected, how retention works, and whether admin controls match the risk tier. For open-source tools, inspect release cadence, issue quality, license, maintainer activity, and whether the project can be deployed in your environment. In both cases, the maintenance question matters as much as the feature list: who upgrades it, who watches failures, who owns user feedback, and who has permission to turn it off.
Treat the first production release as a monitored workflow. Define a review date before launch, not after problems appear. Keep logs, source versions, prompts, configuration, and evaluation results together so the team can explain what changed when quality moves. This is especially important for AI tools because model behavior, vendor policies, and integration surfaces can change without the same visibility as traditional software releases.
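One way to keep configuration, prompts, source versions, and results together is a single manifest per evaluation run. The sketch below assumes the configuration is a JSON-serializable dictionary and derives a short fingerprint from it so that two runs can be compared at a glance. All keys and function names are assumptions.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of a JSON-serializable configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def build_manifest(config: dict,
                   source_versions: dict[str, str],
                   prompt_text: str,
                   scores: dict[str, float]) -> dict:
    """Bundle everything needed to explain a change in quality between runs."""
    return {
        "config_fingerprint": config_fingerprint(config),
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "config": config,
        "source_versions": source_versions,
        "scores": scores,
    }


# Invented values for illustration.
manifest = build_manifest(
    config={"chunk_size": 512, "top_k": 5, "reranker": "none"},
    source_versions={"policies/refunds.md": "v7"},
    prompt_text="Answer only from the provided context.",
    scores={"hit_rate_at_5": 0.9},
)
print(json.dumps(manifest, indent=2))
```

When quality moves, comparing two manifests shows whether the fingerprint, the prompt hash, or the source versions changed, which narrows the search before anyone reruns the tests.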
Reader handoff
After reading, choose one concrete next action: shortlist two tools, write a pilot task set, clean the source data, or create an approval checklist. Do not leave the article as general research. The value comes from turning the framework into a small artifact your team can review. Save that artifact beside the tool record, then revisit it after the first pilot so the decision improves with evidence rather than memory.
Operating cadence
Hold a weekly RAG quality review until launch. Look at failed questions by failure type: missing document, bad chunk, weak reranking, unsupported synthesis, formatting issue, or stale index. After launch, sample production conversations and add recurring failures to the eval set. This turns user pain into a quality asset rather than only a support queue.
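Grouping failures by type for the weekly review can be as simple as a counter over labelled failures. The sketch below uses the six failure types named above as a fixed vocabulary. The identifier spellings and the function name are assumptions.

```python
from collections import Counter

# The six failure types from the review, spelled as identifiers (assumed naming).
FAILURE_TYPES = {
    "missing_document",
    "bad_chunk",
    "weak_reranking",
    "unsupported_synthesis",
    "formatting_issue",
    "stale_index",
}


def summarize_failures(failed_cases: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Count failures per type, most frequent first; reject unknown labels."""
    unknown = sorted({ftype for _, ftype in failed_cases
                      if ftype not in FAILURE_TYPES})
    if unknown:
        raise ValueError(f"unknown failure types: {unknown}")
    return Counter(ftype for _, ftype in failed_cases).most_common()


# Invented week of failures: (case_id, failure_type).
week = [("q1", "bad_chunk"), ("q2", "stale_index"), ("q3", "bad_chunk")]
print(summarize_failures(week))  # [('bad_chunk', 2), ('stale_index', 1)]
```

Rejecting unknown labels keeps the vocabulary stable from week to week, which makes the counts comparable across reviews.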
ToolVerse connections
ToolVerse can help compare Ragas, LlamaIndex, Haystack, Promptfoo, and related retrieval projects. Review each tool for metrics, tracing support, integration effort, and whether it works with your serving stack before standardizing.
Bottom line
RAG quality is a system property. The answer, citation, retrieval result, index freshness, and refusal behavior all need to work together. Evaluate that system before users build trust in it.