Testing and Evaluating Copilot Agents
A Copilot agent becomes easier to trust when its important behaviors are written down as repeatable test cases, a collection often called a golden test set. This page shows how to design that set; the broader Copilot agent evaluation guide covers the rest of the workflow.
Copilot and AI agents fail differently from ordinary chatbots because they do more than produce text. They may call tools, parse JSON, update records, retrieve sources, ask for clarification, or escalate to a human, and each of those steps can go wrong. A golden test set turns the expected behaviors into repeatable checks.
The goal is not to write a huge benchmark. The goal is to protect the workflows that matter most: common user requests, fragile tool paths, known incidents, ambiguous inputs, no-answer cases, and high-risk actions.
Keep each case structured enough that a script, reviewer, or CI job can understand it.
{
  "case_id": "agent_001",
  "scenario": "expense_policy_lookup",
  "user_message": "Can a contractor expense a home office monitor?",
  "expected_behavior": "Answer from contractor policy and cite the policy source.",
  "required_tool_calls": ["search_policy"],
  "forbidden_tool_calls": ["create_reimbursement"],
  "expected_source_ids": ["contractor_expense_policy"],
  "requires_refusal": false,
  "requires_clarification": false,
  "failure_type": "wrong_policy_or_source"
}
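One way to keep cases machine-readable is to check their structure before any grading happens. The sketch below is a minimal illustration in Python. It assumes that every case carries the same fields as the example case; the `validate_case` helper and the overlap rule between required and forbidden tools are hypothetical additions, not part of any Copilot tooling.

```python
# Field names come from the example case; the expected types are assumptions.
REQUIRED_FIELDS = {
    "case_id": str,
    "scenario": str,
    "user_message": str,
    "expected_behavior": str,
    "required_tool_calls": list,
    "forbidden_tool_calls": list,
    "expected_source_ids": list,
    "requires_refusal": bool,
    "requires_clarification": bool,
    "failure_type": str,
}


def _is_string_list(value) -> bool:
    """True when value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case is well-formed."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in case:
            problems.append(f"missing field: {name}")
        elif expected_type is list:
            if not _is_string_list(case[name]):
                problems.append(f"{name} should be a list of strings")
        elif not isinstance(case[name], expected_type):
            problems.append(f"{name} should be of type {expected_type.__name__}")

    # A tool cannot sensibly be both required and forbidden in the same case.
    required = case.get("required_tool_calls")
    forbidden = case.get("forbidden_tool_calls")
    if _is_string_list(required) and _is_string_list(forbidden):
        overlap = sorted(set(required) & set(forbidden))
        if overlap:
            problems.append(f"tools both required and forbidden: {overlap}")
    return problems
```

Running a check like this over the whole set in CI would catch malformed cases early, before they are mistaken for agent failures.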
| Scenario | What it catches | Example grader |
|---|---|---|
| Happy path | The agent fails to complete a normal task. | Expected answer and source are present. |
| Ambiguous request | The agent guesses instead of asking a question. | Clarification required. |
| Missing evidence | The agent invents an answer. | Refusal or escalation required. |
| Tool error | The agent hides or mishandles failed tool calls. | Retry, fallback, or honest failure message. |
| High-risk action | The agent performs an action without confirmation. | Draft-only or human approval required. |
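A quick coverage check can show whether the set actually includes every scenario type in the table. The sketch below assumes each case carries a hypothetical `scenario_type` label, which is not a field in the example case, and the snake_case type names are made up for illustration.

```python
from collections import Counter

# Scenario types mirror the rows of the table; the identifiers are assumptions.
SCENARIO_TYPES = [
    "happy_path",
    "ambiguous_request",
    "missing_evidence",
    "tool_error",
    "high_risk_action",
]


def coverage_report(cases: list[dict], min_per_type: int = 1) -> dict:
    """Count cases per scenario type and list the types that are under-covered."""
    counts = Counter(case.get("scenario_type", "unlabeled") for case in cases)
    gaps = [t for t in SCENARIO_TYPES if counts[t] < min_per_type]
    return {"counts": dict(counts), "gaps": gaps}
```

For example, a set that contains only happy-path and tool-error cases would report `ambiguous_request`, `missing_evidence`, and `high_risk_action` as gaps.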
Use layered graders. Deterministic checks should come first because they are cheap and explainable: JSON schema validity, required tool calls, forbidden tool calls, source IDs, state changes, and exact refusal flags. Then use semantic graders for answer quality, tone, and completeness.
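The layering can be sketched as a function that runs the cheap deterministic checks first and calls a semantic grader only when they pass. Everything about the recorded run is an assumption here: `AgentRun` and its fields are a hypothetical trace format, and `semantic_grader` stands in for whatever rubric or LLM judge a team chooses.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; the field names are assumptions."""
    answer: str
    tool_calls: list[str] = field(default_factory=list)
    source_ids: list[str] = field(default_factory=list)
    refused: bool = False
    asked_clarification: bool = False


def deterministic_failures(case: dict, run: AgentRun) -> list[str]:
    """Rule-based checks against the machine-checkable fields of a case."""
    failures = []
    called = set(run.tool_calls)
    for tool in case["required_tool_calls"]:
        if tool not in called:
            failures.append(f"missing required tool call: {tool}")
    for tool in case["forbidden_tool_calls"]:
        if tool in called:
            failures.append(f"forbidden tool call made: {tool}")
    for source in case["expected_source_ids"]:
        if source not in run.source_ids:
            failures.append(f"missing expected source: {source}")
    if run.refused != case["requires_refusal"]:
        failures.append("refusal flag mismatch")
    if run.asked_clarification != case["requires_clarification"]:
        failures.append("clarification flag mismatch")
    return failures


def grade(
    case: dict,
    run: AgentRun,
    semantic_grader: Optional[Callable[[dict, AgentRun], bool]] = None,
) -> dict:
    """Run deterministic checks first; call the semantic grader only if they pass."""
    failures = deterministic_failures(case, run)
    if failures:
        return {"case_id": case["case_id"], "passed": False,
                "layer": "deterministic", "failures": failures}
    if semantic_grader is not None and not semantic_grader(case, run):
        return {"case_id": case["case_id"], "passed": False,
                "layer": "semantic", "failures": ["semantic grader rejected the answer"]}
    return {"case_id": case["case_id"], "passed": True, "layer": None, "failures": []}
```

Recording which layer failed keeps results explainable: a deterministic failure points at a specific tool call or source, while a semantic failure signals that a reviewer should read the answer.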
A strong case usually has a short natural-language expected behavior plus machine-checkable fields. That combination helps reviewers understand the case while still allowing automation.
Before changing prompts, tool schemas, routing logic, model settings, or retrieval indexes, run the golden set and compare failures by scenario. A practical first release gate can be simple:
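As one illustration of comparing failures by scenario, the sketch below checks a candidate run against a baseline run. The result format (dicts with `scenario` and `passed` keys) and both gate rules are assumptions chosen for the example: no scenario may fail more often than in the baseline, and scenarios listed as blocking must have no failures at all.

```python
from collections import Counter


def failures_by_scenario(results: list[dict]) -> Counter:
    """Count failed cases per scenario; results are assumed to be dicts."""
    return Counter(r["scenario"] for r in results if not r["passed"])


def release_gate(baseline: list[dict], candidate: list[dict], blocking_scenarios=()):
    """Return (ok, reasons) for a candidate run compared with a baseline run."""
    base = failures_by_scenario(baseline)
    cand = failures_by_scenario(candidate)
    reasons = []
    for scenario in sorted(cand):
        if scenario in blocking_scenarios:
            # Hypothetical rule: blocking scenarios tolerate zero failures.
            reasons.append(f"{scenario}: {cand[scenario]} failure(s) in a blocking scenario")
        elif cand[scenario] > base[scenario]:
            # Hypothetical rule: no scenario may regress against the baseline.
            reasons.append(f"{scenario}: failures rose from {base[scenario]} to {cand[scenario]}")
    return (not reasons, reasons)
```

A CI job could fail the build when `ok` is false and print the reasons, so that a regression is reported per scenario instead of as a single pass rate.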
This page targets golden test set design. For the broader evaluation workflow, read Testing and Evaluating Copilot Agents and the parent LLM Evaluation Framework.
A golden test set is a curated set of realistic agent scenarios with expected behavior, required sources or tools, grader fields, and failure labels used to catch regressions before release.
Start small with 20 to 50 high-value cases that cover common paths, known failures, no-answer behavior, tool errors, and high-risk workflows. Expand it with production failures over time.
Not every check needs an LLM judge. Use deterministic checks first for schema validity, tool calls, source IDs, refusal rules, and state changes. Add LLM judges only for semantic quality that rules cannot check reliably.