LLM Evaluation

A practical pillar page on evaluating LLM systems, RAG workflows, Copilot-style agents, and structured extraction, with attention to uncertainty handling, schema validation, and production monitoring.

Start here

LLM evaluation should begin with the product risk, not with a leaderboard metric. A chat assistant, a retrieval answer system, a document extraction workflow, and an agent that calls tools all fail differently. The evaluation design has to match the failure mode. Yangming Li's writing emphasizes repeatable test sets, explicit output contracts, deterministic validators, human review, and monitoring after launch.

Start with a small golden set: ordinary examples that must pass, edge cases that reveal ambiguity, and no-answer cases that punish invention. Then add graders only after the contract is clear. Model-based graders can help with semantic comparison, but schema validation, exact matching, source checks, and business rules should carry the first layer of trust.
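As a minimal sketch of such a golden set, the snippet below stores the three kinds of cases and grades them deterministically. The case contents, the field names, and the convention that a declined answer is represented by None are assumptions made for illustration, not a prescribed format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    kind: str                 # "ordinary", "edge", or "no_answer"
    question: str
    expected: Optional[str]   # None means the correct behavior is to decline


# Hypothetical cases; a real set would come from the product's own domain.
GOLDEN_SET = [
    GoldenCase("g1", "ordinary", "What is the refund window?", "30 days"),
    GoldenCase("g2", "edge", "What is the refund window for gift cards?", "not refundable"),
    GoldenCase("g3", "no_answer", "What is the refund window on Mars?", None),
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting does not fail a case."""
    return " ".join(text.lower().split())


def grade(case, answer):
    """Exact match after normalization; a no-answer case passes only on a decline."""
    if case.expected is None:
        return answer is None
    return answer is not None and normalize(answer) == normalize(case.expected)


def pass_rate_by_kind(cases, answers):
    """Report pass rates per case kind instead of one blended score."""
    totals, passed = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case.kind] += 1
        if grade(case, answers[case.case_id]):
            passed[case.kind] += 1
    return {kind: passed[kind] / totals[kind] for kind in totals}
```

Reporting per kind keeps an invented answer on a no-answer case from being averaged away by a large number of easy ordinary cases.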

A useful LLM evaluation framework

A useful framework has four layers. The first layer is task definition: what the user is trying to accomplish, what sources are allowed, what the model may produce, and when it must refuse or escalate. The second layer is test data: stable fixtures, expected outputs, scenario labels, and reasons each case exists. The third layer is grading: exact match for canonical fields, schema validation for structure, retrieval checks for grounding, semantic comparison for meaning, and custom rules for product constraints. The fourth layer is operations: dashboards, regression history, sampled human review, incident notes, and release decisions.

This structure keeps evaluation from becoming a single score detached from reality. A system can be fluent and still unsafe. It can retrieve a relevant source and still misread it. It can produce valid JSON and still choose the wrong category. Good evaluation makes those differences visible.

RAG evaluation

RAG evaluation needs to inspect both retrieval and generation. Retrieval quality asks whether the system found the right evidence: source recall, chunk relevance, metadata quality, freshness, deduplication, and access control. Generation quality asks whether the answer used the evidence correctly: factual consistency, citation coverage, answer completeness, uncertainty handling, and refusal behavior when the corpus does not support an answer.
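Two of these checks are easy to compute once each test case is labeled with the sources that should be retrieved. The sketch below assumes hypothetical inputs: lists of source IDs, and an answer already split into (claim, cited source IDs) pairs. How claims are extracted from an answer is out of scope here.

```python
def source_recall(retrieved_ids, relevant_ids):
    """Fraction of the labeled relevant sources that appear in the retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def citation_coverage(claims):
    """Fraction of answer claims that cite at least one source.

    `claims` is a list of (claim_text, cited_source_ids) pairs.
    """
    if not claims:
        return 1.0  # an answer with no claims (for example a refusal) has nothing uncited
    cited = sum(1 for _, sources in claims if sources)
    return cited / len(claims)


def unsupported_citations(claims, retrieved_ids):
    """Cited source IDs that were never retrieved, which suggests an invented citation."""
    retrieved = set(retrieved_ids)
    return sorted({s for _, sources in claims for s in sources if s not in retrieved})
```

These functions only measure whether evidence was found and cited. Whether a cited source actually supports the claim still needs a semantic grader or a human reviewer.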

The most common production failure modes are quiet: nothing errors out, and the answer looks plausible. The system may answer from an outdated chunk, cite a source that only partially supports the claim, summarize across conflicting documents without saying so, or retrieve from the right domain but the wrong policy version. A RAG test set should include stale documents, conflicting sources, missing evidence, ambiguous acronyms, and questions where the correct answer is to ask for clarification.

Schema validation for AI agents

Agent evaluation becomes much clearer when the output is structured. If an agent extracts events, writes tickets, updates a tracker, or routes a request, the response should pass a contract before it can affect downstream systems. JSON Schema validation can check required fields, types, enums, null handling, and array lengths, and it can check date formats when the validator is configured to enforce the "format" keyword. Field-level descriptions are not validated, but they document what each field means for reviewers and for the model. Deterministic checks can then enforce business rules that are not purely syntactic.
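The two-step contract can be sketched for a hypothetical ticket-writing agent. To stay dependency-free, this example hand-rolls the structural checks rather than using a JSON Schema library; the field names, the priority enum, the label limit, and the "no past due dates" rule are all assumptions for illustration.

```python
from datetime import date

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical enum
REQUIRED_FIELDS = {"title": str, "priority": str, "due_date": str, "labels": list}
MAX_LABELS = 5  # hypothetical array-length limit


def validate_ticket(ticket, today):
    """Return a list of violations; an empty list means the ticket may proceed."""
    errors = []

    # Structural checks: required fields and their types.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in ticket:
            errors.append(f"missing field: {field}")
        elif not isinstance(ticket[field], expected_type):
            errors.append(f"wrong type for {field}")
    if errors:
        return errors  # skip value checks when the structure is already wrong

    if ticket["priority"] not in ALLOWED_PRIORITIES:
        errors.append("priority not in enum")
    if len(ticket["labels"]) > MAX_LABELS:
        errors.append("too many labels")

    try:
        due = date.fromisoformat(ticket["due_date"])
    except ValueError:
        errors.append("due_date is not an ISO date")
        return errors

    # Business rule (not purely syntactic): a due date cannot be in the past.
    if due < today:
        errors.append("due_date is in the past")
    return errors
```

Passing `today` as an argument rather than reading the clock inside the function keeps the check deterministic, so the same fixture produces the same result on every run.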

This is not only a backend concern. The schema is also a product artifact. It tells reviewers what the agent is trying to do, shows where uncertainty belongs, and makes human corrections reusable as evaluation data. The agent can draft; the reviewer decides.

Common production failure modes

Evaluation should deliberately search for near misses. In RAG, a near miss might retrieve the right policy family but the wrong version. In extraction, it might choose a plausible category that does not match the approved taxonomy. In an agent workflow, it might call a tool successfully while using a stale assumption. These failures are dangerous because they look professional in a demo and only become obvious when a reviewer checks evidence.

A good LLM evaluation framework therefore includes negative cases, no-answer cases, conflict cases, malformed inputs, and examples where the correct behavior is escalation. It also keeps track of why a case exists. Without that reason, future maintainers may delete the exact scenario that protects the system from repeating an old incident.
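One way to protect these properties is to lint the test set itself. The sketch below assumes cases are stored as dictionaries with hypothetical "id", "scenario", and "reason" keys, and that the scenario labels match the names in REQUIRED_SCENARIOS. It flags cases that have lost their rationale and required scenarios that no case covers.

```python
# Hypothetical scenario labels mirroring the case types named above.
REQUIRED_SCENARIOS = {"negative", "no_answer", "conflict", "malformed_input", "escalation"}


def lint_test_set(cases):
    """Report cases without a rationale and required scenarios with no case."""
    missing_reason = [c["id"] for c in cases if not (c.get("reason") or "").strip()]
    covered = {c.get("scenario") for c in cases}
    uncovered = sorted(REQUIRED_SCENARIOS - covered)
    return {"missing_reason": missing_reason, "uncovered_scenarios": uncovered}
```

Running a check like this in continuous integration would make deleting the last case for a scenario a visible event rather than a silent one.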

Architecture notes

For a lightweight implementation, store evaluation cases as versioned data with fields for scenario, input, expected behavior, grading method, and failure rationale. Run deterministic checks first because they are cheap and explainable. Then use model-based graders for semantic comparison where exact matching is too brittle. Keep the results linked to prompt versions, model settings, retrieval index versions, and schema versions so a regression can be traced to a real change.
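A minimal sketch of that flow follows. The class and field names are assumptions, and `semantic_grader` stands in for whatever model-based grader a team uses: here it is just a callable that takes the expected and actual text and returns a truthy value on a match.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Versions a result is linked to, so a regression can be traced to a change."""
    prompt_version: str
    model_settings: str
    index_version: str
    schema_version: str


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    scenario: str
    input_text: str
    expected: str
    grading_method: str       # "exact" or "semantic"
    failure_rationale: str    # why this case exists


def evaluate(case, output, context, semantic_grader=None):
    """Grade one output: deterministic check first, model-based grader only if needed."""
    if output.strip() == case.expected.strip():
        passed, graded_by = True, "exact"
    elif case.grading_method == "semantic" and semantic_grader is not None:
        passed, graded_by = bool(semantic_grader(case.expected, output)), "semantic"
    else:
        passed, graded_by = False, "exact"
    # Flatten the context into the record so every result carries its versions.
    return {"case_id": case.case_id, "passed": passed, "graded_by": graded_by, **asdict(context)}
```

Recording which grader produced the verdict makes it possible to audit the model-graded results separately from the deterministic ones.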

What to monitor after launch