Testing and Evaluating Copilot Agents

This article uses a fully fictional and generalized example. It does not describe any specific organization, internal process, employee information, or production system.

Why Testing Matters

Building an agent in Microsoft Copilot Studio has become much easier than building a traditional custom chatbot. A maker can define instructions in natural language, connect knowledge sources, expose tools, and test a working interaction in the design surface. That speed is useful, but it also creates a trap: the first successful answer can feel like proof of reliability.

For operational use cases, the harder question is not whether an agent can produce one plausible answer. The harder question is whether it can produce accurate, consistent, auditable output across a distribution of realistic inputs, including inputs that are ambiguous, incomplete, adversarial, or simply boring in the way real business text tends to be.

Copilot Studio gives teams several layers of feedback: test chat during authoring, agent evaluation with reusable test sets, run-level results, activity maps, resource usage, analytics after release, and review surfaces for transcript and activity inspection. Those features are most useful when wrapped in an engineering process: data contracts, test fixtures, deterministic validators, release gates, and human-in-the-loop controls.

A Fictional Structured Extraction System

Imagine a team that receives already approved internal communication text. The team wants help identifying whether the message contains a structured update signal. The communication might mention a person starting a new role, a temporary assignment, a person leaving a role, a responsibility change, or no relevant event at all.

A Copilot-assisted system can read the approved text and draft structured fields for review:

event_type: canonical category, such as new_role, interim_assignment, leaving_role, responsibility_change, or none.
person_reference: name or redacted reference for the person the update concerns.
role_or_responsibility_area: normalized role, function, or portfolio text.
effective_date: ISO 8601 date when explicit, or null when not recoverable.
confidence: coarse confidence class or score.
requires_human_review: a mandatory boolean gate.
uncertainty_notes: concise explanation of missing evidence, ambiguity, or assumptions.

The goal is not to automate business decisions. The goal is to reduce manual tracking effort while keeping reviewers in control. A safe architecture is:

Approved input text
  -> Copilot agent extraction
  -> JSON schema validation
  -> deterministic business-rule checks
  -> human review queue
  -> reviewed tracker update
  -> monitoring and evaluation feedback loop

The important boundary is that the agent drafts; the reviewer decides. That distinction should appear in the agent instructions, the review UI, the evaluation criteria, and the access model.

Start With An Output Contract

For structured extraction, a prose answer is usually the wrong interface. A reviewer can read prose, but downstream systems need stable fields. The first technical artifact should be an output contract that defines allowed categories, required fields, null behavior, and review gates.

A simplified contract might look like this:

{
  "items": [
    {
      "event_type": "new_role",
      "person_reference": "string_or_redacted_reference",
      "role_or_responsibility_area": "string",
      "effective_date": "YYYY-MM-DD or null",
      "confidence": "high | medium | low",
      "requires_human_review": true,
      "uncertainty_notes": "string"
    }
  ]
}

This is not just formatting. The contract gives evaluators something concrete to inspect. It also gives deterministic validators a way to reject malformed responses before a human sees them. For example, a response can fail validation if it invents an unsupported event_type, emits free text instead of JSON, uses a non-ISO date, omits requires_human_review, or returns a final update instead of a draft.

The contract should also define the correct representation of no result:

{
  "items": []
}

That empty array is a valid business outcome. It means "no relevant update found." Without this convention, evaluators and reviewers may treat every blank result as a failure, even when the agent did the right thing.
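One way to make this convention concrete is a small parser that separates three outcomes: a valid draft with items, a valid empty result, and a malformed response. The following is a minimal Python sketch. The function name classify_response and the three outcome labels are illustrative assumptions, not part of Copilot Studio.

```python
import json


def classify_response(raw: str) -> str:
    """Return 'malformed', 'no_update', or 'draft_items' for a raw agent reply."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # Free text, a blank string, or broken JSON all land here.
        return "malformed"
    # The contract requires a top-level object holding a list named "items".
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        return "malformed"
    if not payload["items"]:
        # {"items": []} is a valid business outcome, not a failure.
        return "no_update"
    return "draft_items"
```

With this split, a blank reply and an explicit empty array are never confused: only the second counts as "no relevant update found."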

From Manual Testing To Evaluation Sets

Most agent projects start with manual testing. A builder asks a few questions, checks the output, adjusts the instructions, and tries again. This is useful during authoring because it exposes conversation flow, topic routing, tool availability, and obvious instruction gaps.

Manual testing is not enough for release confidence. It is not repeatable, it is easy to overfit to the last prompt tried, and it rarely captures the edge cases that cause operational damage. The next step is a versioned evaluation set: a collection of input examples, expected outputs, expected tool or capability use, and grading methods.

Smoke Test Set

The smoke test set checks whether the core behavior still works. It should be small, clear, and boring. Good smoke examples include one clear new appointment, one clear interim assignment, one clear leaving-role case, one clear responsibility change, and one clear no-event case.

Every smoke case should pass. If one fails, something foundational is broken: the instructions, schema, knowledge source availability, tool authentication, or routing logic.

Challenge Test Set

The challenge set is where the system earns trust. It should include ambiguity and near misses:

Multiple events in one communication.
An effective date implied by "next quarter" or "after transition" but not explicitly stated.
A title implied by context but never directly named.
A recruitment update where no appointment has occurred.
A portfolio change without a formal role change.
Conflicting names, dates, or departments.
A message that looks important but should produce {"items": []}.

Failures in this set are not automatically bad. They are information. They tell the team where to add clearer instructions, stronger schema validation, better escalation rules, or mandatory review.

Version The Golden Dataset

A test set becomes more valuable when treated like code. Each row should have a stable ID, a scenario label, an input fixture, an expected structured output, evaluation methods, and a reason the case exists. Without that metadata, teams forget why an example was added and may delete exactly the case that guards against an old regression.

case_id: update_signal_017
scenario: ambiguous_effective_date
input_text: "Approved fictional communication text goes here."
expected_output:
  items:
    - event_type: responsibility_change
      effective_date: null
      confidence: low
      requires_human_review: true
      uncertainty_notes: "Timing is implied but not explicit."
evaluators:
  - schema_validation
  - compare_meaning
  - exact_match:event_type
  - custom:human_review_required_when_date_missing
owner_note: "Protects against forcing inferred dates into the tracker."

This style makes the evaluation corpus inspectable. It also supports change review: if someone updates the agent instructions, changes a knowledge source, or modifies an action, the test set can show what improved and what regressed.
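Treating the test set like code also means it can be linted. The sketch below assumes the cases have already been loaded into Python dictionaries with the same keys as the example row above; the function name lint_cases and the specific checks are my own illustration, not a Copilot Studio feature.

```python
REQUIRED_KEYS = {"case_id", "scenario", "input_text",
                 "expected_output", "evaluators", "owner_note"}


def lint_cases(cases: list[dict]) -> list[str]:
    """Return human-readable problems found in a golden dataset."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        # Fall back to the row position when the ID itself is missing.
        label = str(case.get("case_id", f"<row {index}>"))
        missing = sorted(REQUIRED_KEYS - set(case))
        if missing:
            problems.append(f"{label}: missing {', '.join(missing)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate case_id")
        seen_ids.add(label)
        # A present but blank note loses the reason the case exists.
        if "owner_note" in case and not str(case["owner_note"]).strip():
            problems.append(f"{label}: empty owner_note")
    return problems
```

Running a check like this in change review makes it harder to merge a case that has no stable ID or no recorded reason for existing.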

Choose Evaluation Methods By Failure Mode

Different evaluation methods answer different questions. A generic quality score is useful, but it should not be the only gate for structured AI systems. The grading method must match the risk.

Evaluation method	Best use	Risk if used alone
General quality	Broad relevance, completeness, and groundedness.	Can pass a fluent response that misses a required field.
Compare meaning	Semantic equivalence when wording differs from the expected answer.	May tolerate structural differences that matter to a downstream system.
Keyword match	Required labels, role names, policy terms, or event categories.	Can reward keyword presence without verifying correct interpretation.
Text similarity	Loose comparison of expected and actual answer text.	Can be misleading for JSON where field-level correctness matters.
Exact match	Canonical labels, standardized codes, fixed JSON fragments, and empty output.	Too brittle for free-form notes or semantically equivalent wording.
Capability use	Whether the agent used expected resources, tools, topics, or knowledge paths.	Correct tool use does not prove the final extraction is correct.
Custom	Business-specific rules such as "missing dates require review."	Needs careful rubric design and periodic calibration.

A practical release gate combines several checks. For example: the response must be valid JSON, all smoke tests must pass exact checks for event_type, all no-event cases must return items: [], challenge cases with missing dates must set requires_human_review: true, and capability-use checks must pass when a case requires a specific knowledge source or tool.
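As a rough illustration, the combined gate can be written as one function over a list of evaluated cases. Everything about the record shape here is an assumption made for the sketch: each result is a dictionary with case_id, suite ("smoke" or "challenge"), expected (the expected contract object), actual (the parsed agent output, or None when it was not valid JSON), and an optional capability_ok flag supplied by whatever capability-use check the team runs.

```python
def gate_failures(results: list[dict]) -> list[str]:
    """Return the reasons the release gate fails; an empty list means it passes."""
    failures = []
    for r in results:
        cid, actual = r["case_id"], r["actual"]
        items = actual.get("items") if isinstance(actual, dict) else None
        if not isinstance(items, list) or not all(isinstance(i, dict) for i in items):
            failures.append(f"{cid}: response is not valid contract JSON")
            continue
        expected_items = r["expected"]["items"]
        # No-event cases must return exactly {"items": []}.
        if not expected_items and items:
            failures.append(f"{cid}: no-event case returned items")
        # Smoke cases: event_type must match exactly, in order.
        if r["suite"] == "smoke":
            want = [i["event_type"] for i in expected_items]
            got = [i.get("event_type") for i in items]
            if want != got:
                failures.append(f"{cid}: event_type mismatch")
        # Challenge cases with a missing expected date must be flagged for review.
        if r["suite"] == "challenge" and any(
                i.get("effective_date") is None for i in expected_items):
            if not items or not all(i.get("requires_human_review") is True for i in items):
                failures.append(f"{cid}: missing date without mandatory review")
        # capability_ok is only checked when a case sets it explicitly.
        if r.get("capability_ok") is False:
            failures.append(f"{cid}: expected capability not used")
    return failures
```

Returning reasons rather than a single boolean keeps the gate auditable: the archived run shows which rule blocked the release.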

Empty Output Can Be Correct

One of the most common evaluation mistakes is treating empty output as automatically bad. In extraction systems, an empty result can be the best possible answer. If the source text contains no relevant update, the expected response should be:

{
  "items": []
}

A meaning-based evaluator may correctly recognize this as "no event found." A broad quality evaluator may see the same response as incomplete because it contains no explanation. That does not mean the agent is wrong. It means the evaluator is mismatched to the task.

For no-event cases, I would use schema validation, exact match for the empty array, and a custom check that the agent did not invent an event. A reviewer note can still explain the result, but the machine-readable contract should remain clean.

Add Deterministic Validators Outside The Agent

LLM graders are useful, but they should not carry all quality responsibility. Deterministic validators should run before review and before any write action. These validators are ordinary software checks:

JSON parses successfully.
Top-level object has exactly the expected shape.
event_type belongs to the allowed enum.
effective_date is either null or a valid ISO date.
confidence belongs to high, medium, or low.
requires_human_review is true for low confidence, multiple events, missing dates, or conflicting evidence.
The agent never returns direct instructions to update a record without review.

These checks are intentionally simple. They reduce the problem space before a human reviewer or semantic evaluator has to think. They also make audit logs easier to read because a failed validation has a precise reason.
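A minimal sketch of such validators in Python, using only the standard library, might look like the following. The function names and the reason strings are illustrative assumptions. Two of the listed checks are deliberately left out because they cannot be decided from the output alone: "conflicting evidence" needs the source text, and "no direct update instructions" is only approximated here by rejecting any top-level key other than items.

```python
import json
import re
from datetime import date

ALLOWED_EVENT_TYPES = {"new_role", "interim_assignment", "leaving_role",
                       "responsibility_change", "none"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_draft(raw: str) -> list[str]:
    """Return precise failure reasons; an empty list means no check failed."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        return ["response is not parseable JSON"]
    if (not isinstance(payload, dict) or set(payload) != {"items"}
            or not isinstance(payload["items"], list)):
        return ['top-level shape must be exactly {"items": [...]}']

    items, reasons = payload["items"], []
    for n, item in enumerate(items):
        if not isinstance(item, dict):
            reasons.append(f"items[{n}]: not an object")
            continue
        event_type = item.get("event_type")
        if not isinstance(event_type, str) or event_type not in ALLOWED_EVENT_TYPES:
            reasons.append(f"items[{n}]: unsupported event_type")
        effective = item.get("effective_date")
        if effective is not None and not is_iso_date(effective):
            reasons.append(f"items[{n}]: effective_date is not null or YYYY-MM-DD")
        confidence = item.get("confidence")
        if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
            reasons.append(f"items[{n}]: unsupported confidence")
        review = item.get("requires_human_review")
        if not isinstance(review, bool):
            reasons.append(f"items[{n}]: requires_human_review missing or not boolean")
        # Review is mandatory for low confidence, a null date, or multiple events.
        elif not review and (confidence == "low" or effective is None or len(items) > 1):
            reasons.append(f"items[{n}]: review is mandatory for this case")
    return reasons
```

Each reason names the item and the rule, so a rejected draft can be logged without a reviewer having to reconstruct what went wrong.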

Near Misses Matter More Than Obvious Failures

The most useful evaluation cases are often near misses. An obvious failure is easy to reject. A near miss looks plausible but quietly creates bad data.

Examples include:

The agent passes general quality but omits the effective date.
The agent identifies the right role but assigns new_role instead of responsibility_change.
The agent extracts a date from surrounding prose even though the communication did not state it as the effective date.
The agent handles single-event messages but loses one item in a multi-event message.
The test dataset has an unrealistic expected answer, causing the evaluator to punish the correct cautious behavior.

Near misses should feed a triage loop. Some indicate prompt or instruction changes. Some indicate schema changes. Some mean the system needs stricter human review. Some mean the test set itself is wrong. The key is to label the failure mode instead of only recording pass or fail.
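Labeling can be as simple as a fixed vocabulary plus a tally. The sketch below is one possible shape; the FailureMode names mirror the four outcomes in the paragraph above, and every case ID in the usage example other than update_signal_017 is hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    INSTRUCTIONS = "instruction_gap"
    SCHEMA = "schema_gap"
    REVIEW_RULE = "needs_stricter_review"
    TEST_DATA = "test_set_wrong"


def triage_summary(labeled_failures: list[tuple[str, FailureMode]]) -> dict[str, int]:
    """Count failed cases per failure mode, most frequent first."""
    counts = Counter(mode.value for _case_id, mode in labeled_failures)
    return dict(counts.most_common())


# Usage: each failed case gets exactly one label during triage.
labeled = [
    ("update_signal_017", FailureMode.TEST_DATA),
    ("update_signal_021", FailureMode.SCHEMA),      # hypothetical case ID
    ("update_signal_030", FailureMode.TEST_DATA),   # hypothetical case ID
]
summary = triage_summary(labeled)
```

A summary dominated by test_set_wrong points at the golden dataset rather than the agent, which is a very different fix from rewriting instructions.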

A Release Gate For Agent Changes

Agent behavior changes when instructions change, model settings change, knowledge sources change, tool permissions change, or downstream schemas change. Treat those changes as releases. A lightweight release gate can look like this:

1. Update instructions, tools, knowledge, or schema in a development environment.
2. Run smoke test set.
3. Run challenge test set.
4. Export or archive evaluation results.
5. Compare with previous accepted run.
6. Review failed and regressed cases.
7. Approve, roll back, or add new tests.
8. Publish only when required gates pass.

The gate should be stricter for high-risk fields. A low-stakes note can tolerate some wording variation. A canonical category, date, or review flag should be graded more tightly. If an agent has write actions, the release gate should also verify that write actions cannot trigger without the intended approval state.
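Steps 5 and 6 are easy to mechanize once each archived run is reduced to a mapping from case ID to pass or fail. The following sketch assumes that reduction has already happened; the function name compare_runs and the bucket names are illustrative only.

```python
def compare_runs(previous: dict[str, bool],
                 current: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket case IDs by how their pass/fail status changed between two runs."""
    shared = previous.keys() & current.keys()
    return {
        # Passed in the accepted run, fails now: review before publishing.
        "regressed": sorted(c for c in shared if previous[c] and not current[c]),
        "fixed": sorted(c for c in shared if not previous[c] and current[c]),
        "still_failing": sorted(c for c in shared
                                if not previous[c] and not current[c]),
        # Cases added or dropped since the accepted run.
        "new_cases": sorted(current.keys() - previous.keys()),
        "removed_cases": sorted(previous.keys() - current.keys()),
    }
```

The removed_cases bucket is worth a second look in review, since a deleted case may be the one that guarded against an old regression.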

Post-Publish Monitoring

Evaluation before release is necessary but incomplete. Real users will send inputs that the test set did not anticipate. After publishing, teams should monitor agent analytics, transcripts where appropriate, activity paths, tool errors, fallback rates, and reviewer overrides.

For a structured extraction system, I would track:

Volume by event type.
Percentage of outputs requiring human review.
Reviewer acceptance, edit, and rejection rates.
Common validation failures.
Cases where the agent used an unexpected knowledge source or capability.
Latency and error rates for any tool or connector calls.
Regression trends after instruction or data-source updates.

Monitoring should produce action. If reviewers repeatedly correct the same event type, add a challenge test. If validation failures cluster around dates, improve the date policy. If activity review shows unexpected tool calls, tighten capability instructions or permissions.
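A few of these metrics can be computed directly from the review queue. The sketch below assumes each reviewed output is logged as a dictionary with event_type, requires_human_review, and reviewer_decision ("accepted", "edited", "rejected", or None while pending); that log shape is my assumption for illustration, not something Copilot Studio analytics exports in this form.

```python
from collections import Counter

DECISIONS = ("accepted", "edited", "rejected")


def review_metrics(records: list[dict]) -> dict:
    """Summarize volume, review rate, and reviewer decisions for logged outputs."""
    total = len(records)
    # Pending records (decision is None) are excluded from the decision rates.
    decided = [r["reviewer_decision"] for r in records
               if r.get("reviewer_decision") in DECISIONS]
    decisions = Counter(decided)
    flagged = sum(1 for r in records if r["requires_human_review"])
    return {
        "volume_by_event_type": dict(Counter(r["event_type"] for r in records)),
        "review_required_rate": flagged / total if total else 0.0,
        "decision_rates": ({d: decisions[d] / len(decided) for d in DECISIONS}
                           if decided else {}),
    }
```

A rising edited or rejected rate for one event type is the signal to add a challenge case for it.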

Governance As Product Design

Governance is not a layer added after the system is built. It is part of the product design. Before scaling an agent, teams should answer concrete questions:

What data is approved as agent input?
Which knowledge sources are authoritative for this system?
Is the agent read-only, draft-only, or allowed to trigger actions?
Which outputs require human review every time?
Who owns prompt, tool, schema, and knowledge-source changes?
What is the rollback plan after a failed release?
How are uncertain, low-confidence, or conflicting cases escalated?
How long are evaluation results, transcripts, exports, and review decisions retained?

Good governance makes the system easier to operate. It clarifies ownership, reduces reviewer anxiety, and gives engineers a concrete basis for system boundaries.

A Practical Pattern

The steps in this article combine into one repeatable pattern:

Prototype
  -> output contract
  -> smoke tests
  -> challenge tests
  -> schema validation
  -> custom review rules
  -> release gate
  -> human review
  -> analytics and activity monitoring
  -> new tests from real failures

This is how an agent moves from demo to dependable production support. The agent remains useful because it reduces manual effort. The system remains trustworthy because it is tested, constrained, reviewed, and monitored.
