Copilot Agent Golden Test Set: Scenarios, Graders, and Regression Checks

A Copilot agent becomes easier to trust when its important behaviors are written down as repeatable test cases. This page shows how to design that test set without duplicating the broader Copilot agent evaluation guide.

Contents

Why a golden test set matters
A practical test case schema
Scenario types to include
Grader design
Release gates
FAQ

Why a golden test set matters

Copilot and AI agents fail differently from ordinary chatbots. Because they call tools, parse JSON, update records, retrieve sources, ask for clarification, and escalate to a human, any of those steps can go wrong, sometimes without the final reply showing it. A golden test set turns the expected behavior at each step into a repeatable check.

The goal is not to write a huge benchmark. The goal is to protect the workflows that matter most: common user requests, fragile tool paths, known incidents, ambiguous inputs, no-answer cases, and high-risk actions.

A practical test case schema

Keep each case structured enough that a script, reviewer, or CI job can understand it.

{
  "case_id": "agent_001",
  "scenario": "expense_policy_lookup",
  "user_message": "Can a contractor expense a home office monitor?",
  "expected_behavior": "Answer from contractor policy and cite the policy source.",
  "required_tool_calls": ["search_policy"],
  "forbidden_tool_calls": ["create_reimbursement"],
  "expected_source_ids": ["contractor_expense_policy"],
  "requires_refusal": false,
  "requires_clarification": false,
  "failure_type": "wrong_policy_or_source"
}
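
To make the schema enforceable, a loader can reject malformed cases before any grading runs. The following is a minimal sketch, assuming cases are stored as JSON objects with exactly the fields shown above; the names GoldenCase and load_case are hypothetical, and the rule that a tool cannot be both required and forbidden is an added sanity check rather than part of the schema itself.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenCase:
    """One golden test case; field names mirror the JSON example above."""
    case_id: str
    scenario: str
    user_message: str
    expected_behavior: str
    failure_type: str
    required_tool_calls: list[str] = field(default_factory=list)
    forbidden_tool_calls: list[str] = field(default_factory=list)
    expected_source_ids: list[str] = field(default_factory=list)
    requires_refusal: bool = False
    requires_clarification: bool = False


def load_case(raw: str) -> GoldenCase:
    """Parse one JSON case and reject contradictory definitions."""
    data = json.loads(raw)
    # Raises TypeError on unknown keys or missing required fields.
    case = GoldenCase(**data)
    # Assumed sanity rule: a tool cannot be both required and forbidden.
    overlap = set(case.required_tool_calls) & set(case.forbidden_tool_calls)
    if overlap:
        raise ValueError(
            f"{case.case_id}: tools both required and forbidden: {sorted(overlap)}"
        )
    return case
```

Running load_case over every case file in CI catches typos in field names early, before they silently disable a check.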

Scenario types to include

Scenario	What it catches	Example grader
Happy path	The agent fails a normal, well-specified task.	Expected answer and source are present.
Ambiguous request	The agent guesses instead of asking a question.	Clarification required.
Missing evidence	The agent invents an answer.	Refusal or escalation required.
Tool error	The agent hides or mishandles failed tool calls.	Retry, fallback, or honest failure message.
High-risk action	The agent performs an action without confirmation.	Draft-only or human approval required.
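
A quick coverage check can confirm that the set includes every scenario type in the table. This is a sketch under one stated assumption: each case carries a scenario_type label, a hypothetical field that is not part of the schema shown earlier. The type names and the coverage_gaps helper are likewise illustrative.

```python
from collections import Counter

# Scenario types from the table above, written as assumed label values.
REQUIRED_TYPES = (
    "happy_path",
    "ambiguous_request",
    "missing_evidence",
    "tool_error",
    "high_risk_action",
)


def coverage_gaps(cases: list[dict], min_per_type: int = 1) -> dict[str, int]:
    """Return each under-covered scenario type with its current case count."""
    counts = Counter(case.get("scenario_type") for case in cases)
    return {t: counts[t] for t in REQUIRED_TYPES if counts[t] < min_per_type}
```

An empty result means every scenario type has at least min_per_type cases; anything else lists the types that still need cases.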

Grader design

Use layered graders. Deterministic checks should come first because they are cheap and explainable: JSON schema validity, required tool calls, forbidden tool calls, source IDs, state changes, and exact refusal flags. Then use semantic graders for answer quality, tone, and completeness.

A strong case usually has a short natural-language expected behavior plus machine-checkable fields. That combination helps reviewers understand the case while still allowing automation.
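
The layering described above might look like the following sketch. It assumes each agent run is recorded as a dictionary with tool_calls, source_ids, refused, asked_clarification, and answer keys; that record shape, the function names, and the semantic_grader callback (for example, a wrapper around an LLM judge) are all hypothetical.

```python
from typing import Callable, Optional


def deterministic_failures(case: dict, run: dict) -> list[str]:
    """Cheap, explainable checks against the machine-checkable case fields."""
    failures = []
    called = set(run.get("tool_calls", []))
    missing = set(case.get("required_tool_calls", [])) - called
    if missing:
        failures.append(f"missing required tool calls: {sorted(missing)}")
    forbidden = set(case.get("forbidden_tool_calls", [])) & called
    if forbidden:
        failures.append(f"forbidden tool calls made: {sorted(forbidden)}")
    missing_sources = set(case.get("expected_source_ids", [])) - set(
        run.get("source_ids", [])
    )
    if missing_sources:
        failures.append(f"missing expected sources: {sorted(missing_sources)}")
    if run.get("refused", False) != case.get("requires_refusal", False):
        failures.append("refusal flag mismatch")
    if run.get("asked_clarification", False) != case.get(
        "requires_clarification", False
    ):
        failures.append("clarification flag mismatch")
    return failures


def grade(
    case: dict,
    run: dict,
    semantic_grader: Optional[Callable[[str, str], bool]] = None,
) -> dict:
    """Run deterministic checks first; call the semantic grader only if they pass."""
    result = {"case_id": case["case_id"], "passed": True, "layer": None, "failures": []}
    failures = deterministic_failures(case, run)
    if failures:
        result.update(passed=False, layer="deterministic", failures=failures)
    elif semantic_grader is not None and not semantic_grader(
        case["expected_behavior"], run.get("answer", "")
    ):
        result.update(
            passed=False,
            layer="semantic",
            failures=["answer does not match expected behavior"],
        )
    return result
```

Recording which layer failed keeps the report explainable: a deterministic failure points at a specific field, while a semantic failure signals that a reviewer may need to read the answer.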

Release gates

Before changing prompts, tool schemas, routing logic, model settings, or retrieval indexes, run the golden set and compare failures by scenario. A practical first release gate can be simple:

No high-risk case regresses.
No forbidden tool call appears.
Schema validity does not drop below the previous run.
Known refusal and clarification cases still pass.
New production incidents become new test cases.
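
The automated part of this gate can be sketched as a comparison between a baseline run and a candidate run. The code below assumes each run is a dictionary keyed by case_id whose values hold passed, scenario, schema_valid, forbidden_tool_calls_made, high_risk, requires_refusal, and requires_clarification; that result shape and the function names are hypothetical. Turning production incidents into new cases remains a manual step and is not covered here.

```python
from collections import Counter


def failures_by_scenario(results: dict) -> Counter:
    """Count failed cases per scenario so two runs can be compared side by side."""
    return Counter(r["scenario"] for r in results.values() if not r["passed"])


def _schema_valid_rate(results: dict) -> float:
    if not results:
        return 1.0
    return sum(1 for r in results.values() if r["schema_valid"]) / len(results)


def release_gate(baseline: dict, candidate: dict) -> list[str]:
    """Return gate violations; an empty list means the candidate may ship."""
    violations = []
    for case_id, new in sorted(candidate.items()):
        old = baseline.get(case_id)
        # A regression is a case that passed in the baseline and fails now.
        regressed = old is not None and old["passed"] and not new["passed"]
        if regressed and new.get("high_risk", False):
            violations.append(f"{case_id}: high-risk case regressed")
        if regressed and (
            new.get("requires_refusal", False)
            or new.get("requires_clarification", False)
        ):
            violations.append(f"{case_id}: refusal or clarification case regressed")
        if new.get("forbidden_tool_calls_made"):
            violations.append(f"{case_id}: forbidden tool call appeared")
    if _schema_valid_rate(candidate) < _schema_valid_rate(baseline):
        violations.append("schema validity rate dropped")
    return violations
```

A CI job could fail the build whenever release_gate returns a non-empty list and print failures_by_scenario for both runs to show where the failures moved.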

This page targets golden test set design. For the broader evaluation workflow, read Testing and Evaluating Copilot Agents and the parent LLM Evaluation Framework.

FAQ

What is a golden test set for a Copilot agent?

A golden test set is a curated set of realistic agent scenarios with expected behavior, required sources or tools, grader fields, and failure labels used to catch regressions before release.

How many cases should an agent test set start with?

Start small with 20 to 50 high-value cases that cover common paths, known failures, no-answer behavior, tool errors, and high-risk workflows. Expand it with production failures over time.

Should Copilot agent tests use only LLM judges?

No. Use deterministic checks for schema validity, tool calls, source IDs, refusal rules, and state changes first. Add LLM judges only for semantic quality that cannot be checked reliably with rules.