Trajectory Evaluation
Two agents can produce the same final answer but follow dramatically different reasoning paths.
Trajectory evaluation analyzes the full sequence of steps an agent takes — its thoughts, tool calls, observations, and decisions — rather than judging only the final output.
This approach reveals critical insights about reasoning quality, efficiency, tool usage, and reliability that outcome-only evaluation misses.
Why Trajectories Matter More Than Outcomes
Consider this scenario:
- Agent A: Performs a targeted search → retrieves relevant sources → logically analyzes data → produces correct answer.
- Agent B: Makes multiple irrelevant searches → hallucinates facts → makes a lucky guess → produces the same correct answer.
Both agents have the same outcome, but only Agent A demonstrates reliable, trustworthy reasoning. Evaluating only the final answer would hide serious flaws in Agent B’s process.
Trajectory evaluation helps identify these differences and guides meaningful improvements.
What Is an Agent Trajectory?
An agent trajectory is the complete record of an agent’s decision-making process:
User Question
  ↓
Thought: I need current GPU benchmarks
  ↓
Action: web_search("H100 vs RTX 5090 benchmarks 2026")
  ↓
Observation: Retrieved benchmark data
  ↓
Thought: Compare training throughput and cost
  ↓
Action: calculate_price_performance_ratio()
  ↓
Final Answer

Each step (thought, action, observation) provides valuable signal for evaluation.
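A trajectory like this is easy to capture as a small data structure. The following is a minimal sketch; the `Step`/`Trajectory` names and the `kind`/`content` fields are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str      # "thought", "action", or "observation"
    content: str   # the text of the thought, tool call, or tool result


@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

    def add(self, kind: str, content: str) -> None:
        self.steps.append(Step(kind, content))


# Record the example trajectory above:
traj = Trajectory(question="H100 vs RTX 5090 price/performance?")
traj.add("thought", "I need current GPU benchmarks")
traj.add("action", 'web_search("H100 vs RTX 5090 benchmarks 2026")')
traj.add("observation", "Retrieved benchmark data")
traj.final_answer = "The H100 leads on training throughput per dollar."
```

Logging in a structured form like this, rather than as free text, is what makes every downstream metric in this article computable.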
Key Dimensions of Trajectory Evaluation
Modern trajectory evaluation typically measures several dimensions:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Efficiency | Number of steps, tool calls, tokens used | Resource usage and speed |
| Tool Selection Quality | Was the right tool chosen at each step? | Appropriate tool usage |
| Reasoning Quality | Logical consistency and soundness | Reliability of thought process |
| Error Recovery | How well the agent handles mistakes | Robustness |
| Safety & Compliance | Did the agent respect constraints? | Security and policy adherence |
A comprehensive evaluation combines these into a balanced score.
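One simple way to combine the dimensions is a weighted average of per-dimension scores in [0, 1]. The weights below are assumptions chosen for the sketch, not recommended values:

```python
# Assumed weights for the sketch -- tune these for your own use case.
WEIGHTS = {
    "efficiency": 0.20,
    "tool_selection": 0.20,
    "reasoning": 0.30,
    "error_recovery": 0.15,
    "safety": 0.15,
}


def balanced_score(scores: dict) -> float:
    """Weighted average over all dimensions; each score is in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


overall = balanced_score({
    "efficiency": 0.9,
    "tool_selection": 1.0,
    "reasoning": 0.8,
    "error_recovery": 1.0,
    "safety": 1.0,
})
```

Weighting reasoning most heavily reflects the Agent A / Agent B scenario above: a correct answer reached through unsound reasoning should still score poorly.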
Efficiency Metrics
Common efficiency metrics include:
- Trajectory length — total number of reasoning + action steps
- Tool call count — how many external tools were invoked
- Token efficiency — total tokens consumed
- Wall-clock time — actual time to completion
Shorter, more direct trajectories are generally better, but extremely short ones may indicate guessing rather than reasoning.
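The first three metrics fall out directly from a structured log. A sketch, assuming each logged step is a `(kind, content, tokens)` tuple (an assumed format, not a standard):

```python
def efficiency_metrics(steps):
    """Compute basic efficiency metrics from a logged trajectory."""
    return {
        "trajectory_length": len(steps),
        "tool_calls": sum(1 for kind, _, _ in steps if kind == "action"),
        "total_tokens": sum(tokens for _, _, tokens in steps),
    }


steps = [
    ("thought", "Need current GPU benchmarks", 12),
    ("action", 'web_search("H100 benchmarks")', 8),
    ("observation", "Retrieved benchmark data", 150),
    ("thought", "Compare throughput and cost", 15),
]
metrics = efficiency_metrics(steps)
```

Wall-clock time is the one metric that cannot be reconstructed after the fact, so timestamps should be recorded at logging time.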
Tool Usage Quality
Evaluate whether the agent:
- Chose the most appropriate tool for each subtask
- Used tools correctly (right parameters, proper error handling)
- Avoided redundant or unnecessary tool calls
Example of poor tool usage: invoking a calculator tool for a subtask that requires a web search.
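Redundant calls are the easiest of these checks to automate: flag any identical tool-plus-arguments pair invoked more than once. A sketch, assuming tool calls are logged as `(tool_name, arguments)` tuples:

```python
from collections import Counter


def redundant_calls(tool_calls):
    """Return tool calls that were issued more than once, verbatim."""
    counts = Counter(tool_calls)
    return [call for call, n in counts.items() if n > 1]


calls = [
    ("web_search", "H100 benchmarks"),
    ("web_search", "H100 benchmarks"),  # exact repeat: wasted work
    ("calculator", "1999 / 312"),
]
repeats = redundant_calls(calls)
```

Judging whether the *right* tool was chosen for a subtask is harder to automate and is usually delegated to an LLM judge, as discussed next.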
Reasoning Quality
Assess whether the agent’s thoughts show:
- Logical consistency
- Proper use of retrieved information
- Avoidance of hallucinations
- Clear step-by-step progression
LLM-as-a-Judge is frequently used here, with structured rubrics that score individual reasoning steps.
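A per-step judge boils down to a rubric prompt plus a parser for the judge's reply. The sketch below is illustrative: the rubric wording, the reply format, and `call_llm` (a placeholder for whatever model client you use) are all assumptions:

```python
# Hypothetical rubric; criteria and 1-5 scale are assumptions for the sketch.
RUBRIC = """Score the reasoning step from 1-5 on each criterion:
- consistency: does it follow logically from the prior context?
- grounding: does it rely only on retrieved information?
Reply exactly as: consistency=<n> grounding=<n>"""


def judge_step(step_text: str, call_llm) -> dict:
    """Ask the judge model to score one reasoning step, parse the reply."""
    reply = call_llm(f"{RUBRIC}\n\nStep: {step_text}")
    return {key: int(value)
            for key, value in (pair.split("=") for pair in reply.split())}


# A stubbed judge so the sketch runs without a model:
fake_llm = lambda prompt: "consistency=4 grounding=5"
scores = judge_step("Compare training throughput and cost", fake_llm)
```

Constraining the judge to a fixed reply format keeps parsing trivial and makes scores comparable across steps and across runs.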
Example Trajectory Evaluation Pipeline
A practical pipeline usually includes:
- Logging the full trajectory (thoughts, actions, observations)
- Applying automated metrics (efficiency, tool accuracy)
- Using LLM judges for qualitative assessment of reasoning
- Aggregating results into dashboards and reports
This combination provides both quantitative metrics and qualitative insights.
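The stages above can be tied together in a single evaluation function. This is a simplified sketch: the step format and the injected `judge` callable stand in for real metric and judge implementations:

```python
def evaluate_trajectory(steps, judge):
    """Combine automated metrics with judge scores into one report.

    `steps` is a list of (kind, content) tuples; `judge` maps a
    reasoning step's text to a numeric quality score.
    """
    thoughts = [content for kind, content in steps if kind == "thought"]
    return {
        "length": len(steps),
        "tool_calls": sum(1 for kind, _ in steps if kind == "action"),
        "avg_reasoning_score": (
            sum(judge(t) for t in thoughts) / max(1, len(thoughts))
        ),
    }


steps = [
    ("thought", "Need current GPU benchmarks"),
    ("action", 'web_search("H100 benchmarks")'),
    ("observation", "Retrieved benchmark data"),
    ("thought", "Compare throughput and cost"),
]
report = evaluate_trajectory(steps, judge=lambda text: 4.0)
```

In practice the resulting report dicts would be aggregated across a test suite and surfaced in dashboards, which is the topic of the next article.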
Looking Ahead
In this article we explored Trajectory Evaluation — why analyzing the full reasoning path is essential for understanding agent performance beyond simple outcome scoring.
In the next article we will examine Building Evaluation Pipelines, which covers how to automate large-scale agent testing using benchmarks, regression suites, and continuous evaluation systems.
→ Continue to 9.4 — Building Evaluation Pipelines