RAG Evaluation Guide: Metrics, Frameworks, and Python Examples

What RAG evaluation is

RAG evaluation measures whether a retrieval-augmented generation system retrieves the right evidence, uses that evidence faithfully, cites it correctly, refuses unsupported questions, and performs reliably in production. A complete evaluation framework separates retrieval metrics, generation metrics, end-to-end task success, and operational reliability.

Retrieval-Augmented Generation, or RAG, is useful because it can connect a language model to a controlled knowledge source before the model writes an answer. That extra retrieval step creates a new responsibility: the answer should be traceable to the evidence that was retrieved.

A good RAG answer is not only fluent. It should answer the question, cite the right sources, avoid unsupported claims, acknowledge missing evidence, and remain stable as documents, users, indexes, prompts, and models change.

This is why RAG evaluation belongs inside the broader discipline of LLM evaluation, but it deserves its own workflow. General LLM evaluation asks whether a language model system meets the product contract. RAG evaluation asks a more specialized question: did the system find and use the right evidence?

Figure: RAG evaluation framework from user query to retriever, retrieved chunks, reranker, LLM generation, and final answer, with metrics attached to each stage. The evaluation surface spans the retriever, retrieved evidence, reranker, generator, final answer, and operational behavior.

Why RAG evaluation should not be reduced to one score

The most important idea in this article is that RAG evaluation is not a single score. A single score can summarize a dashboard, but it cannot explain which part of the pipeline failed.

A bad answer can come from several different places. The retriever may miss the correct document. The reranker may bury the best chunk. The generator may ignore the evidence. The answer may cite a source that only partly supports the claim. The system may be correct offline but too slow, expensive, or stale in production.

Those failures need different fixes. Retrieval problems often require indexing, chunking, metadata, embedding, query rewriting, or reranking work. Generation problems often require prompt, context layout, refusal, citation, or grader changes. Production problems often require monitoring, cache, timeout, index refresh, and release-gate changes.

A useful RAG evaluation framework therefore keeps separate measurements for separate failure modes. Teams can still create a release summary, but the summary should preserve enough detail for debugging.

RAG evaluation architecture

A practical architecture starts by logging the full evaluation record. Store the user question, generated query, retrieved document IDs, retrieved chunk text or references, reranker scores, final context, answer, citations, expected behavior, model and prompt versions, retrieval index version, latency, and cost.
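As a rough illustration of the record described above, the sketch below stores one interaction as a dataclass and serializes it to a JSON line. The class name `EvalRecord`, every field name, and the sample values are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class EvalRecord:
    """One logged RAG interaction. Field names are illustrative only."""
    question: str
    generated_query: str
    retrieved_ids: list[str]
    reranker_scores: list[float]
    final_context_ids: list[str]
    answer: str
    citations: list[str]
    expected_behavior: str  # assumed labels: "answer", "refuse", "clarify"
    model_version: str
    prompt_version: str
    index_version: str
    latency_ms: float
    cost_usd: float

    def to_json_line(self) -> str:
        """Serialize the record as one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = EvalRecord(
    question="What is the refund window?",
    generated_query="refund window policy",
    retrieved_ids=["policy-v3#2", "faq#7"],
    reranker_scores=[0.91, 0.40],
    final_context_ids=["policy-v3#2"],
    answer="Refunds are available within 30 days.",
    citations=["policy-v3#2"],
    expected_behavior="answer",
    model_version="model-a",
    prompt_version="prompt-12",
    index_version="index-2024-05",
    latency_ms=840.0,
    cost_usd=0.0021,
)
line = record.to_json_line()
```

Keeping the index, prompt, and model versions on every record is what makes it possible to compare two runs later and attribute a change to one component.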

Then evaluate the record in layers. The scorecard below is a compact way to decide what each layer should measure and what failures it is meant to catch.

Layer	What is evaluated	Example metrics	Common failures
Retrieval	Whether the right evidence was found	Recall@K, Precision@K, MRR, nDCG	Missing source, weak chunk ranking
Evidence quality	Whether evidence is current and authoritative	Freshness, version match, duplicate rate	Stale policy, wrong document version
Generation	Whether the answer follows the evidence	Faithfulness, groundedness, completeness	Unsupported claims, missing caveats
Citation and refusal	Whether claims are supported and unsupported questions are refused	Citation precision, citation coverage, no-answer accuracy	Fake or partial citations
Production	Whether the pipeline is stable and efficient	Latency, cost, timeout rate, drift	Cost spikes, stale indexes, unstable responses

This architecture is deliberately diagnostic. It helps answer questions like: did the system retrieve the right source but fail to quote it, or did it never retrieve the source at all?
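One way to make that diagnostic question mechanical is a small triage function over logged IDs. This is a minimal sketch, assuming each record carries the known relevant IDs, the retrieved IDs, the IDs that reached the final context, and the cited IDs. The function name `triage` and the layer labels are hypothetical.

```python
def triage(relevant_ids, retrieved_ids, context_ids, cited_ids):
    """Return the first pipeline layer at which the relevant evidence was lost."""
    relevant = set(relevant_ids)
    if not relevant & set(retrieved_ids):
        return "retrieval"   # the source was never retrieved
    if not relevant & set(context_ids):
        return "reranking"   # retrieved, but dropped before the final context
    if not relevant & set(cited_ids):
        return "generation"  # in context, but the answer did not cite it
    return "ok"


never_found = triage(["d1"], ["d4", "d5"], ["d4"], ["d4"])
found_not_cited = triage(["d1"], ["d1", "d4"], ["d1", "d4"], ["d4"])
```

A label such as "generation" only narrows the search. It says the evidence reached the model, not why the model failed to use it.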

Building a golden evaluation dataset

A golden evaluation dataset is a curated set of cases that represents what the system must handle. It should include ordinary successful questions, edge cases, conflict cases, stale-source cases, ambiguous questions, missing-evidence cases, and production failures that should never regress.

Each case should make the expected behavior explicit. For RAG, that usually means source IDs, distractor source IDs, expected refusal or clarification behavior, and a failure label that explains why the case exists.

The downloadable JSONL file of synthetic examples that accompanies this article covers stale documents, conflicting sources, missing evidence, ambiguous acronyms, partial citation support, correct refusals, clarification requests, duplicate chunks, multi-document synthesis, wrong policy versions, incorrect retrieval, and correct retrieval with incorrect generation risk.

Do not treat a golden set as a static benchmark. The best evaluation sets evolve. Every meaningful production failure should either map to an existing case or become a new one.
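To make the case fields described in this section concrete, here is a hedged sketch of a single JSONL case and a small validator. The field names, the allowed behavior labels, and the sample case are assumptions for illustration. They are not the schema of the article's JSONL file.

```python
import json

# Hypothetical schema: these names are illustrative, not the article's file format.
REQUIRED_FIELDS = {
    "case_id", "question", "relevant_source_ids",
    "distractor_source_ids", "expected_behavior", "failure_label",
}
ALLOWED_BEHAVIORS = {"answer", "refuse", "clarify"}


def parse_case(line):
    """Parse one JSONL line and check that the expected behavior is explicit."""
    case = json.loads(line)
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if case["expected_behavior"] not in ALLOWED_BEHAVIORS:
        raise ValueError(f"unknown behavior: {case['expected_behavior']}")
    return case


example_line = json.dumps({
    "case_id": "stale-policy-001",
    "question": "What is the current refund window?",
    "relevant_source_ids": ["policy-v3"],
    "distractor_source_ids": ["policy-v2"],
    "expected_behavior": "answer",
    "failure_label": "stale_document",
})
case = parse_case(example_line)
```

Validating cases at load time keeps a malformed or half-labeled case from silently passing every check.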

Retrieval evaluation metrics

Retrieval metrics ask whether the system found useful evidence before generation. They can be computed without calling an LLM when each case has a set of relevant document or chunk IDs.

Recall@K

Recall@K measures how much of the known relevant evidence appears in the top K retrieved results. It is helpful when missing the correct source is the main risk.

Precision@K

Precision@K measures how much of the top K context is actually relevant. It is helpful when the generator is being distracted by noisy chunks.

Reciprocal rank and MRR

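Reciprocal rank is 1 divided by the position of the first relevant result, or 0 when no relevant result is retrieved. For example, if the first relevant result appears at position 3, the reciprocal rank is 1/3. Mean Reciprocal Rank, or MRR, averages that value across cases. It rewards systems that rank useful evidence early.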

nDCG

nDCG is useful when relevance is graded rather than binary. It gives more credit when highly relevant documents appear earlier in the ranking.
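A minimal sketch of nDCG@K follows, assuming linear gain, a log2 position discount, unique retrieved IDs, and a dictionary that maps each document ID to a graded relevance value. The function names and the sample grades are illustrative. Some implementations use an exponential gain instead, so scores are not always comparable across tools.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(retrieved_ids, graded_relevance, k):
    """nDCG@k with linear gain. Unjudged documents count as relevance 0."""
    gains = [graded_relevance.get(doc_id, 0.0) for doc_id in retrieved_ids]
    ideal_gains = sorted(graded_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this case
    return dcg_at_k(gains, k) / ideal_dcg


grades = {"a": 3, "b": 2, "c": 1}
best_order = ndcg_at_k(["a", "b", "c"], grades, k=3)
worst_order = ndcg_at_k(["c", "b", "a"], grades, k=3)
```

Here the ideal ordering scores 1.0, and reversing it lowers the score because the most relevant document is discounted at position 3.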

The complete source file is available at assets/rag-evaluation/retrieval_metrics.py. Here is the core excerpt:

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Return the share of relevant documents retrieved in the top k results."""
    relevant = _validate_relevant(relevant_ids)
    top_k = _validate_retrieved(retrieved_ids, k)
    if not top_k:
        return 0.0
    hits = len(set(top_k) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Return the share of top k retrieved documents that are relevant."""
    relevant = _validate_relevant(relevant_ids)
    top_k = _validate_retrieved(retrieved_ids, k)
    if not top_k:
        return 0.0
    hits = len(set(top_k) & relevant)
    return hits / len(top_k)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank for the first relevant result, or 0.0 when none is found."""
    relevant = _validate_relevant(relevant_ids)
    _validate_retrieved(retrieved_ids, max(len(retrieved_ids), 1))
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

Run it locally with:

python assets/rag-evaluation/retrieval_metrics.py

Generation evaluation metrics

Generation evaluation asks whether the final answer is correct and supported by the retrieved evidence. This is separate from retrieval. A system can retrieve the right source and still write the wrong answer.

Faithfulness

Faithfulness checks whether answer claims are supported by the retrieved context. It is one of the most important generation checks for RAG because unsupported claims are often written in confident language.

Groundedness

Groundedness checks whether important answer statements can be traced back to evidence. In some systems, this can be evaluated claim by claim.

Completeness

Completeness asks whether the answer covers the key parts of the question. A short answer can be faithful but incomplete.

Answer relevance

Answer relevance asks whether the response directly addresses the user request. This catches cases where the model stays on topic but does not solve the task.

LLM-as-judge methods can help with semantic review, but they should be calibrated with examples and paired with deterministic checks where possible. Use exact source IDs, schemas, regexes, and business rules before asking another model to make a subjective judgment.
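The ordering suggested above, deterministic checks first and semantic judgment second, can be sketched as follows. The sketch assumes per-claim verdicts already exist, produced by a reviewer or a calibrated judge. The names `ClaimVerdict`, `citations_outside_context`, and `faithfulness_score` are hypothetical, and the ratio is one simple aggregation rather than the definition used by any particular framework.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # assumed to come from a reviewer or a calibrated judge


def citations_outside_context(cited_ids, context_ids):
    """Deterministic pre-check: cited IDs that never appeared in the final context."""
    return sorted(set(cited_ids) - set(context_ids))


def faithfulness_score(verdicts):
    """Share of answer claims marked as supported by the retrieved context."""
    if not verdicts:
        return None  # nothing to score, for example when the answer is a refusal
    return sum(v.supported for v in verdicts) / len(verdicts)


verdicts = [
    ClaimVerdict("Refunds are available within 30 days.", True),
    ClaimVerdict("Refunds require a receipt.", True),
    ClaimVerdict("Refunds are paid in cash.", False),
]
score = faithfulness_score(verdicts)
bad_citations = citations_outside_context(["doc-1", "doc-9"], ["doc-1", "doc-2"])
```

The pre-check needs no model call, so it can run on every case and catch citations to sources the generator never saw before any judge is asked for an opinion.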

Citation and no-answer evaluation

Citations are not decoration. In RAG systems, citations are part of the trust contract. Citation evaluation should check whether each cited source exists, whether it supports the claim, whether important claims have coverage, and whether the answer avoids citing irrelevant or partial evidence.

No-answer evaluation is equally important. A RAG system should refuse or ask for clarification when the corpus does not support an answer. This is not a model personality preference; it is a product safety behavior.

Useful checks include:

Citation precision: cited sources actually support the claims they are attached to.
Citation coverage: important answer claims have supporting citations.
No-answer accuracy: unsupported questions are refused instead of answered from weak evidence.
Clarification accuracy: ambiguous questions trigger a clarifying question rather than a guessed answer.
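The four checks above reduce to simple ratios once each case has been labeled. The sketch below assumes reviewed labels are available in the shapes shown. The function names, dictionary keys, and behavior labels are illustrative.

```python
def citation_precision(citations):
    """citations: list of (source_id, supports_claim) pairs from review."""
    if not citations:
        return None
    return sum(1 for _, supports in citations if supports) / len(citations)


def citation_coverage(claims):
    """claims: dicts with boolean 'important' and 'has_supporting_citation' keys."""
    important = [c for c in claims if c["important"]]
    if not important:
        return None
    return sum(c["has_supporting_citation"] for c in important) / len(important)


def behavior_accuracy(cases, expected):
    """Share of cases expecting `expected` where the observed behavior matched."""
    target = [c for c in cases if c["expected"] == expected]
    if not target:
        return None
    return sum(c["observed"] == expected for c in target) / len(target)


cases = [
    {"expected": "refuse", "observed": "refuse"},
    {"expected": "refuse", "observed": "answer"},
    {"expected": "clarify", "observed": "clarify"},
    {"expected": "answer", "observed": "answer"},
]
no_answer_accuracy = behavior_accuracy(cases, "refuse")
clarification_accuracy = behavior_accuracy(cases, "clarify")
precision = citation_precision([("doc-1", True), ("doc-2", False)])
coverage = citation_coverage([
    {"important": True, "has_supporting_citation": True},
    {"important": True, "has_supporting_citation": False},
    {"important": False, "has_supporting_citation": False},
])
```

Returning `None` for an empty denominator keeps "no cases of this type" distinct from "every case of this type failed".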

End-to-end task evaluation

Layered metrics explain failures, but users experience the whole workflow. End-to-end task evaluation asks whether the RAG system solved the user problem under the constraints of the product.

For a support assistant, that might mean the user got the correct policy answer with a source and no private information leaked. For an internal search assistant, it might mean the user found the right document and did not need to run a second query. For a document workflow, it might mean the answer was accurate enough for a reviewer to approve or escalate.

End-to-end evaluation should include pass/fail outcomes, severity labels, reviewer notes, and links to the lower-level retrieval and generation measurements. The point is not to hide details behind a final score. The point is to connect detailed diagnostics to a release decision.
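A release gate that keeps the details visible might look like the sketch below. The severity scale, the 5% failure threshold, and the function name `release_decision` are assumptions for illustration. Real thresholds depend on the product contract.

```python
# Hypothetical severity scale; adjust to the labels your reviewers use.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def release_decision(results, max_failed_share=0.05, blocking_severity="high"):
    """results: dicts with 'case_id', 'passed', and 'severity' keys."""
    failures = [r for r in results if not r["passed"]]
    blockers = [
        r["case_id"] for r in failures
        if SEVERITY_ORDER[r["severity"]] >= SEVERITY_ORDER[blocking_severity]
    ]
    failed_share = len(failures) / len(results) if results else 0.0
    ship = not blockers and failed_share <= max_failed_share
    # Return the evidence along with the decision so it stays debuggable.
    return {"ship": ship, "failed_share": failed_share, "blockers": blockers}


results = [
    {"case_id": "c1", "passed": True, "severity": "low"},
    {"case_id": "c2", "passed": False, "severity": "critical"},
    {"case_id": "c3", "passed": True, "severity": "medium"},
    {"case_id": "c4", "passed": True, "severity": "high"},
]
decision = release_decision(results)
```

Returning the blocking case IDs alongside the boolean lets a reviewer go straight from the release decision to the failing cases.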

Production monitoring

Offline evaluation catches regressions before release. Production monitoring catches drift after release. RAG systems are especially sensitive to changing documents, changing user language, changing models, and stale indexes.

Monitor at least these dimensions:

Latency, cost, timeout rate, and retry rate.
Index freshness, source version, and stale-source usage.
Citation quality, citation coverage, and unsupported-claim reports.
No-answer and clarification behavior.
User corrections, escalations, thumbs-down feedback, and reviewer disagreement.
Retrieval drift by query type, source family, or embedding/index version.

When monitoring finds a real failure, convert it into a golden dataset case. That closes the loop between production evidence and future release gates.
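Several of the operational dimensions listed above can be summarized per time window from logged events. This is a minimal sketch, assuming each event is a dictionary with the keys shown. The key names and the nearest-rank percentile choice are illustrative, and a production system would more likely rely on its metrics backend.

```python
def summarize_window(events):
    """events: dicts with 'latency_ms', 'timed_out', 'cost_usd', 'used_stale_source'."""
    n = len(events)
    if n == 0:
        return None
    latencies = sorted(e["latency_ms"] for e in events)
    p95_rank = (95 * n + 99) // 100  # nearest-rank percentile: ceil(0.95 * n)
    return {
        "p95_latency_ms": latencies[p95_rank - 1],
        "timeout_rate": sum(e["timed_out"] for e in events) / n,
        "stale_source_rate": sum(e["used_stale_source"] for e in events) / n,
        "mean_cost_usd": sum(e["cost_usd"] for e in events) / n,
    }


# Synthetic window: 20 requests, two timeouts, one stale-source answer.
events = [
    {
        "latency_ms": 100 * i,
        "timed_out": i in (19, 20),
        "cost_usd": 0.01,
        "used_stale_source": i == 7,
    }
    for i in range(1, 21)
]
summary = summarize_window(events)
```

Comparing these summaries across index or prompt versions is one simple way to notice drift before users report it.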

Framework and tool comparison

Different RAG evaluation tools are useful for different jobs. The table below is intentionally high-level and source-linked. Metrics with similar names across frameworks are not necessarily directly comparable because implementations, prompts, judges, reference requirements, and aggregation choices can differ.

Approach or tool	Useful when	What to verify before relying on it	Official source
RAGAS	You want RAG-focused metrics such as context precision, context recall, response relevancy, and faithfulness.	Check which metrics require references, which use LLM calls, and how prompts or models are configured.	RAGAS metrics docs
DeepEval	You want an evaluation framework that treats the retriever and generator as separate parts of a RAG pipeline.	Check metric definitions, required inputs, judge model settings, and CI behavior for your test suite.	DeepEval RAG evaluation guide
RAGChecker	You want fine-grained diagnostic evaluation across the full pipeline, retriever, and generator.	Check input format requirements, model dependencies, and whether its diagnostic metrics match your review needs.	RAGChecker repository
LangSmith	You want datasets, experiments, evaluators, traces, and production feedback loops around RAG applications.	Check how your stack sends traces, how datasets are curated, and which evaluators are offline or online.	LangSmith RAG tutorial
Promptfoo	You want declarative tests for retrieval and output generation, including custom assertions that match your application.	Check provider setup, assertion semantics, and whether tests cover retrieval separately from final output.	Promptfoo RAG guide

Use framework scores as measurement tools, not as universal truth. Keep a small hand-reviewed calibration set so that metric changes remain interpretable.
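One way to use such a calibration set is to compare an automated grader's labels with the hand-reviewed labels. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The function names and the sample labels are illustrative.

```python
from collections import Counter


def agreement_rate(human_labels, judge_labels):
    """Share of calibration cases where the judge matches the human label."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def cohens_kappa(human_labels, judge_labels):
    """Agreement corrected for chance. Returns None when it is undefined."""
    n = len(human_labels)
    observed = agreement_rate(human_labels, judge_labels)
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return None  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
raw = agreement_rate(human, judge)
kappa = cohens_kappa(human, judge)
```

In this example the raw agreement is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance given how often each rater uses each label.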

A practical implementation workflow

Define the product contract. Decide which sources are authoritative, which answers require citations, and when the system must refuse or ask for clarification.
Create a golden dataset. Include relevant source IDs, distractors, expected behavior, failure type, and review notes.
Evaluate retrieval first. Measure Recall@K, Precision@K, MRR, nDCG, duplicate rate, freshness, and version match.
Evaluate generation second. Check faithfulness, groundedness, completeness, answer relevance, citations, refusals, and clarifications.
Add end-to-end task gates. Connect layered scores to release decisions for the actual workflow.
Monitor production. Track drift, stale sources, latency, cost, timeouts, corrections, and escalation patterns.
Feed failures back into tests. Turn meaningful incidents into new cases before the next release.

A strong RAG evaluation system should make improvement easier. When a case fails, the team should know whether to inspect retrieval, chunking, reranking, prompt context, generation, citation mapping, refusal behavior, or production infrastructure.

Frequently asked questions

What is RAG evaluation?

RAG evaluation measures whether a retrieval-augmented generation system finds the right evidence, uses that evidence faithfully, cites it correctly, refuses unsupported questions, and stays reliable in production.

What metrics should be used for RAG evaluation?

Use separate metrics for retrieval, generation, citation behavior, refusal behavior, task success, and operations. Common retrieval metrics include Recall@K, Precision@K, MRR, and nDCG. Generation checks often include faithfulness, groundedness, completeness, and answer relevance.

How do you evaluate retrieval separately from generation?

Evaluate retrieval by comparing returned document IDs or chunks against a known relevant set. Evaluate generation by checking whether the final answer is correct, complete, supported by the retrieved evidence, and properly cited.

What is a golden dataset for RAG?

A golden dataset for RAG is a curated set of questions, relevant source IDs, distractor sources, expected behavior, and failure labels used to test retrieval, generation, citations, refusals, and clarifications before release.

Can RAG evaluation be automated?

RAG evaluation can be partly automated with retrieval metrics, schema checks, citation checks, deterministic validators, and LLM-based graders. Human review is still useful for ambiguous, high-risk, or newly discovered failure cases.

What should be monitored in production?

Production RAG monitoring should track answer success, citation quality, refusal accuracy, stale-source usage, retrieval drift, latency, cost, timeout rate, index freshness, user corrections, and failure clusters.
