
Why Agent Evaluation Is Hard

Traditional software is usually deterministic. The same input reliably produces the same output. This makes testing straightforward:

Input: add(2, 2)
Expected Output: 4

Developers can write clear unit tests and assert exact matches.
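For deterministic code, such a test is trivial to write. A minimal sketch in Python:

```python
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    # Deterministic code: the same input always produces the same output,
    # so an exact-match assertion is sufficient.
    assert add(2, 2) == 4
```

The assertion either holds or it does not; there is no ambiguity about what "correct" means.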

AI agents behave very differently. Most agents are built on top of large language models, which are inherently probabilistic. Even with the same prompt and context, they can produce different responses across runs.

This non-determinism is just the beginning of the evaluation challenge.


Non-Deterministic Outputs

Language models sample from probability distributions over tokens. Small differences in sampling or temperature can lead to meaningfully different outputs, even when all are “correct.”
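A toy sketch of temperature-scaled sampling shows where the variability comes from (the token logits here are made up for illustration):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float, seed: int) -> str:
    # Softmax over temperature-scaled logits: higher temperature flattens
    # the distribution, making less likely tokens more probable.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

logits = {"fast": 2.0, "parallel": 1.8, "efficient": 1.2}
# Different sampling draws can yield different tokens, even though the
# prompt (and therefore the logits) is identical on every run.
print({sample_token(logits, temperature=1.0, seed=s) for s in range(5)})
```

Every draw is "correct" in the sense that it came from the model's distribution, yet the outputs differ run to run.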

Example:

Prompt: “Explain how GPUs accelerate transformer training.”

Possible valid responses:

1. "GPUs execute the dense matrix multiplications at the heart of attention and feed-forward layers in parallel across thousands of cores."
2. "Transformer training is dominated by large matrix products, which map naturally onto a GPU's massively parallel architecture."
3. "By batching token computations and exploiting parallel matmul hardware, GPUs dramatically reduce training time compared to CPUs."

All three answers are reasonable, but they differ in style, emphasis, and structure. Traditional exact-match testing breaks down immediately.
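A toy check makes the failure concrete (the answer strings are illustrative):

```python
# Two valid answers to the same prompt, phrased differently.
answer_a = "GPUs execute attention-layer matrix multiplications in parallel."
answer_b = "Transformer training is dominated by matmuls, which GPUs parallelize."

expected = "GPUs execute attention-layer matrix multiplications in parallel."

# Exact-match testing accepts one phrasing and rejects the other,
# even though both answers are correct.
print(answer_a == expected)  # True
print(answer_b == expected)  # False
```

Any evaluation built on string equality will mislabel valid outputs as failures.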


Outcome vs Trajectory Evaluation

This leads to one of the most important distinctions in agent evaluation:

Evaluation Type       | Focus                            | What It Misses
Outcome Evaluation    | Final answer / result            | Quality of reasoning and process
Trajectory Evaluation | Full reasoning path + tool usage | Whether the agent arrived at the answer reliably

Example:

Agent A retrieves the relevant document with a search tool, checks the result, and answers correctly. Agent B skips retrieval entirely and happens to guess the same correct answer.

Both agents produce the same outcome, but only Agent A demonstrates reliable reasoning. Evaluating only outcomes hides serious flaws in the agent’s process.
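A minimal trajectory check can be sketched by comparing an agent's tool-call log against the tools the task requires (the tool names and logs below are hypothetical):

```python
# Hypothetical tool-call logs for two agents answering the same question.
agent_a_trajectory = ["search_docs", "read_result", "final_answer"]
agent_b_trajectory = ["final_answer"]  # answered without consulting any source

# Tools a reliable process must use for this task.
expected_tools = {"search_docs"}

def used_required_tools(trajectory: list[str], required: set[str]) -> bool:
    # Trajectory evaluation: judge the process, not just the final answer.
    return required.issubset(trajectory)

print(used_required_tools(agent_a_trajectory, expected_tools))  # True
print(used_required_tools(agent_b_trajectory, expected_tools))  # False
```

An outcome-only check would score both agents identically; the trajectory check separates them.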


Additional Sources of Complexity

Evaluating agents is hard for several other reasons:

- Multi-step tasks: small errors compound across long chains of reasoning and tool calls.
- External state: tools, APIs, and environments can change between runs, so failures are hard to reproduce.
- Subjective quality: helpfulness, tone, and safety resist binary pass/fail scoring.
- Cost and latency: an agent can be correct yet too slow or expensive to be useful.


Why Traditional Testing Falls Short

Unit tests, integration tests, and exact output matching work poorly for agents because:

- There is rarely a single correct output to assert against.
- Quality is graded rather than binary: answers can be partially correct.
- The same test can pass on one run and fail on the next due to sampling.
- An agent can reach the right answer through a flawed process that output-only checks never see.
This is why modern agent evaluation has evolved into a dedicated engineering discipline involving trajectory analysis, LLM-as-Judge systems, human evaluation, and custom benchmarks.


What Makes a Good Agent Evaluation

Effective evaluation usually considers multiple dimensions:

- Correctness: does the final answer actually solve the task?
- Trajectory quality: did the agent use tools and intermediate steps sensibly?
- Robustness: does performance hold up across rephrasings and repeated runs?
- Safety and tone: does the agent stay within policy and communicate clearly?
- Efficiency: are cost and latency acceptable for the intended use?

A complete evaluation pipeline combines automated metrics, trajectory analysis, and human judgment.


Looking Ahead

In this article we explored why evaluating AI agents is fundamentally harder than evaluating traditional software, focusing on non-determinism, the importance of trajectories versus outcomes, and the many sources of complexity in agent behavior.

In the next article we will examine LLM-as-a-Judge, a widely used technique where language models themselves are used to evaluate agent performance.

→ Continue to 9.2 — LLM-as-a-Judge