Trajectory Evaluation
Two agents can produce the same final answer but follow dramatically different reasoning paths.
Trajectory evaluation analyzes the full sequence of steps an agent takes — its thoughts, tool calls, observations, and decisions — rather than judging only the final output.
This approach reveals critical insights about reasoning quality, efficiency, tool usage, and reliability that outcome-only evaluation misses.
Why Trajectories Matter More Than Outcomes
Consider this scenario:
- Agent A: Performs a targeted search → retrieves relevant sources → logically analyzes data → produces correct answer.
- Agent B: Makes multiple irrelevant searches → hallucinates facts → makes a lucky guess → produces the same correct answer.
Both agents have the same outcome, but only Agent A demonstrates reliable, trustworthy reasoning. Evaluating only the final answer would hide serious flaws in Agent B’s process.
Trajectory evaluation helps identify these differences and guides meaningful improvements.
What Is an Agent Trajectory?
An agent trajectory is the complete record of an agent’s decision-making process:
User Question
  ↓
Thought: I need current GPU benchmarks
  ↓
Action: web_search("H100 vs RTX 5090 benchmarks 2026")
  ↓
Observation: Retrieved benchmark data
  ↓
Thought: Compare training throughput and cost
  ↓
Action: calculate_price_performance_ratio()
  ↓
Final Answer

Each step (thought, action, observation) provides valuable signal for evaluation.
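A trajectory like this is easy to capture as a small data structure. The following is a minimal sketch; the `Step`/`Trajectory` names and the `kind`/`content` fields are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str      # "thought", "action", or "observation"
    content: str   # the text of the thought, tool call, or tool result


@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

    def add(self, kind: str, content: str) -> None:
        self.steps.append(Step(kind, content))


# Record the example trajectory above:
traj = Trajectory(question="H100 vs RTX 5090 price/performance?")
traj.add("thought", "I need current GPU benchmarks")
traj.add("action", 'web_search("H100 vs RTX 5090 benchmarks 2026")')
traj.add("observation", "Retrieved benchmark data")
traj.final_answer = "The H100 leads on training throughput per dollar."
```

Logging in a structured form like this, rather than as free text, is what makes every downstream metric in this article computable.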
Key Dimensions of Trajectory Evaluation
Modern trajectory evaluation typically measures several dimensions:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Efficiency | Number of steps, tool calls, tokens used | Resource usage and speed |
| Tool Selection Quality | Was the right tool chosen at each step? | Appropriate tool usage |
| Reasoning Quality | Logical consistency and soundness | Reliability of thought process |
| Error Recovery | How well the agent handles mistakes | Robustness |
| Safety & Compliance | Did the agent respect constraints? | Security and policy adherence |
A comprehensive evaluation combines these into a balanced score.
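One simple way to combine the dimensions is a weighted average of per-dimension scores in [0, 1]. The weights below are assumptions chosen for the sketch, not recommended values:

```python
# Assumed weights for the sketch -- tune these for your own use case.
WEIGHTS = {
    "efficiency": 0.20,
    "tool_selection": 0.20,
    "reasoning": 0.30,
    "error_recovery": 0.15,
    "safety": 0.15,
}


def balanced_score(scores: dict) -> float:
    """Weighted average over all dimensions; each score is in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


overall = balanced_score({
    "efficiency": 0.9,
    "tool_selection": 1.0,
    "reasoning": 0.8,
    "error_recovery": 1.0,
    "safety": 1.0,
})
```

Weighting reasoning most heavily reflects the Agent A / Agent B scenario above: a correct answer reached through unsound reasoning should still score poorly.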
Efficiency Metrics
Common efficiency metrics include:
- Trajectory length — total number of reasoning + action steps
- Tool call count — how many external tools were invoked
- Token efficiency — total tokens consumed
- Wall-clock time — actual time to completion
Shorter, more direct trajectories are generally better, but extremely short ones may indicate guessing rather than reasoning.
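The first three metrics fall out directly from a structured log. A sketch, assuming each logged step is a `(kind, content, tokens)` tuple (an assumed format, not a standard):

```python
def efficiency_metrics(steps):
    """Compute basic efficiency metrics from a logged trajectory."""
    return {
        "trajectory_length": len(steps),
        "tool_calls": sum(1 for kind, _, _ in steps if kind == "action"),
        "total_tokens": sum(tokens for _, _, tokens in steps),
    }


steps = [
    ("thought", "Need current GPU benchmarks", 12),
    ("action", 'web_search("H100 benchmarks")', 8),
    ("observation", "Retrieved benchmark data", 150),
    ("thought", "Compare throughput and cost", 15),
]
metrics = efficiency_metrics(steps)
```

Wall-clock time is the one metric that cannot be reconstructed after the fact, so timestamps should be recorded at logging time.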
Tool Usage Quality
Evaluate whether the agent:
- Chose the most appropriate tool for each subtask
- Used tools correctly (right parameters, proper error handling)
- Avoided redundant or unnecessary tool calls
Example of poor tool usage: invoking a calculator tool for a subtask that requires a web search.
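Redundant calls are the easiest of these checks to automate: flag any identical tool-plus-arguments pair invoked more than once. A sketch, assuming tool calls are logged as `(tool_name, arguments)` tuples:

```python
from collections import Counter


def redundant_calls(tool_calls):
    """Return tool calls that were issued more than once, verbatim."""
    counts = Counter(tool_calls)
    return [call for call, n in counts.items() if n > 1]


calls = [
    ("web_search", "H100 benchmarks"),
    ("web_search", "H100 benchmarks"),  # exact repeat: wasted work
    ("calculator", "1999 / 312"),
]
repeats = redundant_calls(calls)
```

Judging whether the *right* tool was chosen for a subtask is harder to automate and is usually delegated to an LLM judge, as discussed next.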
Reasoning Quality
Assess whether the agent’s thoughts show:
- Logical consistency
- Proper use of retrieved information
- Avoidance of hallucinations
- Clear step-by-step progression
LLM-as-a-Judge is frequently used here, with structured rubrics that score individual reasoning steps.
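A per-step judge boils down to a rubric prompt plus a parser for the judge's reply. The sketch below is illustrative: the rubric wording, the reply format, and `call_llm` (a placeholder for whatever model client you use) are all assumptions:

```python
# Hypothetical rubric; criteria and 1-5 scale are assumptions for the sketch.
RUBRIC = """Score the reasoning step from 1-5 on each criterion:
- consistency: does it follow logically from the prior context?
- grounding: does it rely only on retrieved information?
Reply exactly as: consistency=<n> grounding=<n>"""


def judge_step(step_text: str, call_llm) -> dict:
    """Ask the judge model to score one reasoning step, parse the reply."""
    reply = call_llm(f"{RUBRIC}\n\nStep: {step_text}")
    return {key: int(value)
            for key, value in (pair.split("=") for pair in reply.split())}


# A stubbed judge so the sketch runs without a model:
fake_llm = lambda prompt: "consistency=4 grounding=5"
scores = judge_step("Compare training throughput and cost", fake_llm)
```

Constraining the judge to a fixed reply format keeps parsing trivial and makes scores comparable across steps and across runs.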
Example Trajectory Evaluation Pipeline
A practical pipeline usually includes:
- Logging the full trajectory (thoughts, actions, observations)
- Applying automated metrics (efficiency, tool accuracy)
- Using LLM judges for qualitative assessment of reasoning
- Aggregating results into dashboards and reports
This combination provides both quantitative metrics and qualitative insights.
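The stages above can be tied together in a single evaluation function. This is a simplified sketch: the step format and the injected `judge` callable stand in for real metric and judge implementations:

```python
def evaluate_trajectory(steps, judge):
    """Combine automated metrics with judge scores into one report.

    `steps` is a list of (kind, content) tuples; `judge` maps a
    reasoning step's text to a numeric quality score.
    """
    thoughts = [content for kind, content in steps if kind == "thought"]
    return {
        "length": len(steps),
        "tool_calls": sum(1 for kind, _ in steps if kind == "action"),
        "avg_reasoning_score": (
            sum(judge(t) for t in thoughts) / max(1, len(thoughts))
        ),
    }


steps = [
    ("thought", "Need current GPU benchmarks"),
    ("action", 'web_search("H100 benchmarks")'),
    ("observation", "Retrieved benchmark data"),
    ("thought", "Compare throughput and cost"),
]
report = evaluate_trajectory(steps, judge=lambda text: 4.0)
```

In practice the resulting report dicts would be aggregated across a test suite and surfaced in dashboards, which is the topic of the next article.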
Looking Ahead
In this article we explored Trajectory Evaluation — why analyzing the full reasoning path is essential for understanding agent performance beyond simple outcome scoring.
In the next article we will examine Building Evaluation Pipelines, which covers how to automate large-scale agent testing using benchmarks, regression suites, and continuous evaluation systems.
→ Continue to 9.4 — Building Evaluation Pipelines