The Most Important Part of AI Agents Is Not Prompting. It Is Evaluation.

By Yangming Li

Published June 15, 2026 · 9 min read · AI systems essay


TL;DR

Better prompts and stronger models can make an agent look impressive in a demo, but production readiness comes from evaluation: custom eval sets, trajectory inspection, calibrated judges, regression tests, operational metrics, and human review where risk requires it.

Article Highlights

  • Public benchmarks are useful capability signals, but they cannot prove reliability in a specific business workflow.
  • Agent evaluation should inspect both final outcomes and the trajectory: tool choice, parameters, retrieval, retries, state changes, latency, and cost.
  • LLM-as-judge can help with open-ended tasks, but the judge needs rubrics, calibration, and expert review samples.
  • Eval-driven development turns agent improvement from a feeling into evidence that can guide shipping decisions.

Over the past year, AI Agents have become one of the most discussed topics in applied AI. Many teams are experimenting with stronger models, better prompts, tool calling, workflow orchestration, memory, retrieval, and automation.

These are all important. But when I review Agent projects, I often notice the same pattern: teams spend a lot of time making the Agent “work” in a demo, but much less time proving whether it works reliably enough for production.

A demo can make an Agent look impressive. A stronger model can make it feel smarter. A better prompt can make the output look cleaner. A new tool can make the system appear more capable.

But none of these answers the most important questions:

  • Is the Agent actually getting better?
  • Can it complete the task consistently?
  • Does it still work after we change the prompt, model, tool schema, or workflow?
  • When it fails, do we know why?
  • Can we trust it enough to put it in front of real users?

This is why I believe the core of AI Agent development is not prompting. It is evaluation.

Public Benchmarks Are Only the Starting Point

Public benchmarks are useful. They help compare models and provide a general signal about capability. Benchmarks covering software engineering, reasoning, or tool use can help filter out weak systems.

But public benchmarks are not enough to decide whether an Agent is ready for your business.

The reason is simple: your production environment is different.

  • Your users have different goals.
  • Your tools have different constraints.
  • Your data has different edge cases.
  • Your workflows have different failure modes.
  • Your business has different risk tolerance.

A model or Agent can perform well on a public benchmark and still fail in a real customer support workflow, internal analytics workflow, compliance workflow, healthcare workflow, financial workflow, or enterprise operations workflow.

Public benchmarks can tell you whether a system has general capability. They cannot prove that the system is reliable in your specific environment. That proof has to come from your own evaluation system.

A Custom Eval Set Is a Real Engineering Asset

A strong Agent team eventually needs its own eval set.

This eval set should not be made only from imagined test cases. It should come from real user behavior, real business workflows, real historical failures, and real edge cases.

A useful eval set usually has several characteristics.

First, it reflects the real distribution of tasks. It should include what users actually ask for, not only what the team expects users to ask.

Second, it has layered coverage. It should include simple tasks, common tasks, complex tasks, boundary cases, ambiguous requests, and high-risk scenarios.

Third, it is versioned. Every meaningful change to the Agent should be evaluated against the same set of cases so the team can compare performance over time.

Fourth, it improves continuously. Production failures should not disappear into Slack messages or one-off bug fixes. They should become new eval cases and enter the regression suite.

In this sense, an eval set is not just a testing artifact. It is part of the product infrastructure. The closer your eval set is to the real business distribution, the more useful your Agent evaluation becomes.

Endpoint Evaluation Is Not Enough

Many teams evaluate Agents by looking only at the final answer. This is called endpoint evaluation. It asks whether the final output is correct.

Endpoint evaluation is necessary. For example, if a booking Agent says a reservation has been made, the system should verify that the reservation actually exists and that the date, time, location, and number of people are correct.
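A minimal sketch of such an endpoint check for the booking example. The `Reservation` type, the in-memory `store`, and the booking ID are hypothetical stand-ins for whatever reservation system the Agent actually writes to; the point is only that the check reads the stored state rather than trusting the Agent's message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    date: str
    time: str
    location: str
    party_size: int


def check_booking(expected: Reservation, store: dict, booking_id: str) -> list:
    """Return a list of mismatches; an empty list means the endpoint check passed."""
    actual = store.get(booking_id)
    if actual is None:
        return ["reservation does not exist"]
    problems = []
    for name in ("date", "time", "location", "party_size"):
        want, got = getattr(expected, name), getattr(actual, name)
        if want != got:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    return problems


# Hypothetical state of the booking system after the Agent reported success.
store = {"bk-1": Reservation("2026-07-01", "19:00", "Downtown", 2)}
expected = Reservation("2026-07-01", "19:00", "Downtown", 4)
print(check_booking(expected, store, "bk-1"))
```

Here the Agent's message could have been perfectly fluent, yet the stored party size is wrong, so the case fails.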

But for Agents, endpoint evaluation is not enough.

Agents are multi-step systems. They reason, call tools, retrieve information, update state, and sometimes interact with external systems. Many failures happen during the process, not only at the final answer.

  • An Agent may choose the wrong tool.
  • It may pass the wrong parameter.
  • It may miss an important context item.
  • It may repeat the same action unnecessarily.
  • It may arrive at a correct-looking answer through an unsafe or unreliable path.
  • It may complete the task but with unacceptable latency or cost.

If we only evaluate the final answer, we miss the real source of failure.

This is why trajectory evaluation is so important. Trajectory evaluation looks at how the Agent completed the task. It examines tool selection, tool parameters, intermediate reasoning, retrieval behavior, repeated attempts, state changes, and whether the Agent followed the expected constraints.

For Agent systems, the process matters as much as the final output.
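As a rough illustration, a trajectory check can run over a recorded list of steps. The `Step` record, the tool names, and the latency and cost budgets below are assumptions made for this sketch, not a standard trace format.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict
    latency_s: float = 0.0
    cost_usd: float = 0.0


def inspect_trajectory(steps, forbidden=frozenset(), max_latency_s=30.0, max_cost_usd=0.50):
    """Flag forbidden tools, repeated identical calls, and budget overruns."""
    findings = []
    seen = set()
    for i, step in enumerate(steps):
        if step.tool in forbidden:
            findings.append(f"step {i}: forbidden tool {step.tool}")
        # Serialize the arguments so identical calls compare equal.
        key = (step.tool, json.dumps(step.args, sort_keys=True))
        if key in seen:
            findings.append(f"step {i}: repeated call to {step.tool}")
        seen.add(key)
    total_latency = sum(s.latency_s for s in steps)
    total_cost = sum(s.cost_usd for s in steps)
    if total_latency > max_latency_s:
        findings.append(f"latency {total_latency:.1f}s exceeds budget {max_latency_s:.1f}s")
    if total_cost > max_cost_usd:
        findings.append(f"cost ${total_cost:.2f} exceeds budget ${max_cost_usd:.2f}")
    return findings


# Hypothetical trace: a duplicated search followed by a destructive call.
steps = [
    Step("search", {"q": "refund policy"}, 1.2, 0.01),
    Step("search", {"q": "refund policy"}, 1.1, 0.01),
    Step("delete_account", {"id": 7}, 0.4, 0.0),
]
findings = inspect_trajectory(steps, forbidden={"delete_account"})
print(findings)
```

A run like this might still end with a plausible final answer, which is exactly why the path is inspected separately.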

LLM-as-Judge Is Useful, but the Judge Must Also Be Evaluated

LLM-as-Judge is a practical technique, especially for open-ended tasks such as research, summarization, report generation, customer support, and complex reasoning.

However, an LLM judge is not automatically objective.

  • It can have position bias.
  • It can prefer longer answers.
  • It can reward fluent but unsupported responses.
  • It can favor outputs that resemble its own style.
  • It can be inconsistent when the rubric is unclear.

The solution is not to avoid LLM judges completely. The solution is to evaluate them carefully.

A good LLM judge should be guided by a clear rubric. It should score specific dimensions instead of giving a vague overall judgment. For higher-risk tasks, teams should sample cases for human review and check whether the judge's decisions match expert judgment.

In other words, the evaluator also needs to be evaluated.
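One way to make that comparison concrete is to compute the agreement between judge labels and expert labels on the same sampled cases. The sketch below uses raw agreement and Cohen's kappa; kappa is a common agreement statistic chosen here for illustration, not something the essay prescribes, and the labels are made up.

```python
def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical verdicts on eight sampled cases.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
observed, kappa = agreement_and_kappa(judge, human)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

In this made-up sample the judge agrees with the expert on 6 of 8 cases, but kappa is noticeably lower than the raw rate because both raters say "pass" most of the time. What threshold counts as "calibrated enough" is a team decision that depends on the risk of the task.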

The goal is not to replace human judgment entirely. The goal is to use automation where it is reliable and use human review where calibration is needed.

Agent Evaluation Should Cover Multiple Layers

A practical Agent evaluation system should not rely on a single score. At minimum, I think Agent evaluation should cover five layers.

1. Outcome Evaluation

Did the Agent actually complete the task? The system should verify the final state, not just the final message.

For example, if an Agent claims it updated a record, created a ticket, sent a message, or completed a booking, the evaluation should check whether that action truly happened and whether the final state is correct.

2. Tool Use Evaluation

Did the Agent use the right tools? Were the tool parameters correct? Did it avoid unnecessary or forbidden actions?

For tool-using Agents, wrong tool calls are one of the most common sources of failure. A good evaluation system should inspect tool behavior, not just text output.
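A small sketch of what inspecting tool behavior could look like: compare the calls the Agent made against the calls the eval case expects. The tool names and parameters are hypothetical, and the matching is deliberately order-insensitive; a stricter check would also compare call order.

```python
def score_tool_calls(expected, actual, forbidden=()):
    """Compare lists of (tool_name, params) tuples and report the differences."""
    report = {"matched": 0, "missing": [], "unexpected": [], "forbidden": []}
    remaining = list(actual)
    for call in expected:
        if call in remaining:
            remaining.remove(call)  # each actual call can satisfy one expectation
            report["matched"] += 1
        else:
            report["missing"].append(call)
    for name, params in remaining:
        bucket = "forbidden" if name in forbidden else "unexpected"
        report[bucket].append((name, params))
    report["tool_accuracy"] = report["matched"] / len(expected) if expected else 1.0
    return report


# Hypothetical refund workflow: right tools, wrong amount, plus a forbidden call.
expected_calls = [
    ("lookup_order", {"order_id": "A17"}),
    ("issue_refund", {"order_id": "A17", "amount": 20}),
]
actual_calls = [
    ("lookup_order", {"order_id": "A17"}),
    ("issue_refund", {"order_id": "A17", "amount": 200}),
    ("send_email", {"to": "all"}),
]
report = score_tool_calls(expected_calls, actual_calls, forbidden={"send_email"})
print(report)
```

The Agent picked the right tool for the refund, but the wrong parameter makes the call count as both a missing expected call and an unexpected one.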

3. Trajectory Evaluation

Was the execution path reasonable? Did the Agent complete the task efficiently? Did it get stuck, loop, retry unnecessarily, or drift away from the original goal?

This is especially important for multi-step workflows, where the final output alone does not explain what happened.

4. Quality and Rubric Evaluation

For open-ended tasks, quality needs to be evaluated with a rubric.

For example, a research Agent may be evaluated on completeness, grounding, source quality, factual consistency, structure, and whether it addresses the user's actual intent.

A rubric turns subjective judgment into a more consistent evaluation process.
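A rubric can be as simple as named dimensions with weights. In the sketch below the dimensions follow the research Agent example, while the weights and the 1-to-5 scale are illustrative assumptions a team would set for itself.

```python
# Hypothetical weights; higher means the dimension matters more.
RUBRIC = {
    "completeness": 3,
    "grounding": 3,
    "source_quality": 2,
    "factual_consistency": 3,
    "structure": 1,
    "intent_match": 2,
}


def rubric_score(scores, rubric=RUBRIC, scale=5):
    """Combine per-dimension scores (1..scale) into a weighted score in (0, 1]."""
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    for dim in rubric:
        if not 1 <= scores[dim] <= scale:
            raise ValueError(f"{dim} score must be between 1 and {scale}")
    weighted = sum(rubric[dim] * scores[dim] for dim in rubric)
    return weighted / (scale * sum(rubric.values()))


# Scores as they might come back from a human reviewer or an LLM judge.
scores = {
    "completeness": 4,
    "grounding": 3,
    "source_quality": 5,
    "factual_consistency": 4,
    "structure": 5,
    "intent_match": 4,
}
print(round(rubric_score(scores), 2))
```

Requiring every dimension to be scored keeps a judge from collapsing back into a single overall impression.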

5. Human Calibration

Automated evaluation is useful, but it needs calibration.

Human review helps validate whether the scoring system reflects real business expectations. This is especially important for subjective, high-impact, or ambiguous tasks.

Capability Eval and Regression Eval Are Different

Another useful distinction is between capability evaluation and regression evaluation.

Capability evaluation asks: What can this Agent do now?

These evals should include challenging tasks that expose the Agent's current limits. The goal is not to achieve a perfect pass rate immediately. The goal is to reveal where the system still needs improvement.

Regression evaluation asks: Can this Agent still do what it used to do?

These evals act as a safety net. Every time the team changes the prompt, model, memory, tool schema, routing logic, or workflow, the Agent should be tested against existing regression cases.

Both are necessary.

Capability eval helps the system move forward. Regression eval prevents the system from sliding backward. A mature Agent team needs both.
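A minimal sketch of a regression comparison, assuming each run of the suite produces a mapping from case ID to pass or fail. The case IDs are hypothetical.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare per-case pass/fail results from two versions of the Agent."""
    # A case that passed before and now fails (or is missing) is a regression.
    regressions = sorted(
        case for case, ok in baseline.items() if ok and not candidate.get(case, False)
    )
    # A case that failed (or did not exist) before and now passes is a fix.
    fixes = sorted(
        case for case, ok in candidate.items() if ok and not baseline.get(case, False)
    )
    return {"regressions": regressions, "fixes": fixes}


baseline = {"refund-basic": True, "refund-partial": True, "multi-currency": False}
candidate = {"refund-basic": True, "refund-partial": False, "multi-currency": True}
diff = compare_runs(baseline, candidate)
print(diff)
```

Both versions pass two of three cases, so the aggregate pass rate is unchanged. Only the per-case comparison shows that the new version gained one capability and lost another.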

Evaluation Should Help Business Decisions, Not Just Model Scoring

A common mistake is treating evaluation as a model leaderboard.

But in production, the best Agent is not always the one with the highest accuracy score.

Business teams also care about latency, cost, reliability, observability, safety, compliance, user experience, and escalation paths.

An Agent with 85% accuracy but very high latency and high cost may be less useful than an Agent with slightly lower accuracy but much better speed, lower cost, and clearer failure handling.

This is why Agent evaluation should include operational metrics, such as:

  • success rate,
  • regression rate,
  • tool accuracy,
  • latency,
  • cost per run,
  • escalation rate,
  • failure categories,
  • and human review rate.
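Several of these metrics can be computed from the same run records. The sketch below assumes a hypothetical `Run` record with the fields shown; regression rate and tool accuracy are left out because they need per-case comparisons and tool traces rather than run-level fields.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Run:
    success: bool
    latency_s: float
    cost_usd: float
    escalated: bool = False
    failure_category: Optional[str] = None


def summarize(runs):
    """Aggregate run-level operational metrics for one version of the Agent."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "success_rate": sum(r.success for r in runs) / n,
        "median_latency_s": median(r.latency_s for r in runs),
        "cost_per_run_usd": sum(r.cost_usd for r in runs) / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "failure_categories": Counter(
            r.failure_category for r in runs if not r.success
        ),
    }


# Hypothetical records from four eval runs.
runs = [
    Run(True, 2.0, 0.02),
    Run(True, 4.0, 0.04),
    Run(False, 10.0, 0.10, escalated=True, failure_category="wrong_tool"),
    Run(True, 3.0, 0.04),
]
summary = summarize(runs)
print(summary)
```

Reporting these side by side, rather than as one blended score, keeps the accuracy, speed, and cost trade-off visible to whoever makes the shipping decision.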

Evaluation is not just about asking, “Which model scored higher?”

It is about asking, “Is this system ready for this use case?”

Start Small: 20-50 High-Quality Eval Cases

Early-stage teams do not need hundreds of eval cases on day one.

A more practical approach is to start with 20-50 high-quality cases. These should come from real user scenarios, important workflows, and known failure modes.

Each eval case should define:

  • the user task,
  • the initial environment state,
  • the available tools,
  • the expected final state,
  • the pass/fail criteria,
  • the behaviors that are not allowed,
  • and whether the trajectory should be inspected.

A good eval case is not just a sentence. It is a reproducible experiment specification.

This allows every Agent run to become something that can be compared, audited, and improved.
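The fields above map naturally onto a small data structure. This is one possible shape, not a standard schema: the `EvalCase` field names, the example task, and the `grade` rule (expected final state must match, no forbidden tool may be called) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_task: str
    initial_state: dict
    available_tools: tuple
    expected_final_state: dict
    forbidden_tools: tuple = ()
    inspect_trajectory: bool = False


def grade(case: EvalCase, final_state: dict, tools_called: list) -> bool:
    """Pass only if no forbidden tool was used and the expected state was reached."""
    if any(tool in case.forbidden_tools for tool in tools_called):
        return False
    return all(
        final_state.get(key) == value
        for key, value in case.expected_final_state.items()
    )


# Hypothetical support-desk case.
case = EvalCase(
    case_id="ticket-close-001",
    user_task="Close ticket 42 and notify the customer.",
    initial_state={"ticket_42": "open"},
    available_tools=("get_ticket", "close_ticket", "send_message", "delete_ticket"),
    expected_final_state={"ticket_42": "closed", "customer_notified": True},
    forbidden_tools=("delete_ticket",),
    inspect_trajectory=True,
)
print(grade(case, {"ticket_42": "closed", "customer_notified": True},
            ["get_ticket", "close_ticket", "send_message"]))
```

Because the case carries its own initial state and pass criteria, the same case can be rerun against any later version of the Agent and graded the same way.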

Eval-Driven Development

For Agent systems, I believe teams should move toward eval-driven development.

Instead of building an Agent first and then informally testing whether it feels good, teams should define what the Agent must be able to do, how success will be measured, and what failure modes are unacceptable.

Then evaluation can guide improvements to prompts, tool schemas, retrieval, memory, routing, model choice, and workflow logic.

After each change, the team reruns the eval suite and checks whether the system actually improved.

This changes Agent development from intuition-driven iteration to evidence-driven iteration.

Without evaluation, improvement is just a feeling. With evaluation, improvement becomes measurable.

Conclusion

It is now relatively easy to build an Agent that works once in a demo.

It is much harder to build an Agent that works reliably in production.

The real questions are:

  • Can it complete tasks consistently?
  • Can it handle different user inputs?
  • Can it use tools correctly?
  • Can it avoid unsafe actions?
  • Can it recover from failure?
  • Can we understand why it failed?
  • Can we prove that a new version is better than the old one?

These are evaluation questions.

Prompting decides how the Agent starts. Tools decide what the Agent can access. Workflows decide how the Agent runs. But evaluation decides whether the Agent is trustworthy enough to ship.

For production AI Agents, evaluation is not a side task.

It is the engineering foundation.