
Trajectory Evaluation

Two agents can produce the same final answer but follow dramatically different reasoning paths.

Trajectory evaluation analyzes the full sequence of steps an agent takes — its thoughts, tool calls, observations, and decisions — rather than judging only the final output.

This approach reveals critical insights about reasoning quality, efficiency, tool usage, and reliability that outcome-only evaluation misses.


Why Trajectories Matter More Than Outcomes

Consider two agents asked the same question. Agent A verifies its claims with well-chosen tool calls and reasons step by step; Agent B skips verification and happens to land on the same answer.

Both agents have the same outcome, but only Agent A demonstrates reliable, trustworthy reasoning. Evaluating only the final answer would hide serious flaws in Agent B's process.

Trajectory evaluation helps identify these differences and guides meaningful improvements.


What Is an Agent Trajectory?

An agent trajectory is the complete record of an agent’s decision-making process:

User Question
Thought: I need current GPU benchmarks
Action: web_search("H100 vs RTX 5090 benchmarks 2026")
Observation: Retrieved benchmark data
Thought: Compare training throughput and cost
Action: calculate_price_performance_ratio()
Final Answer

Each step (thought, action, observation) provides valuable signal for evaluation.
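A trajectory like the one above can be captured as a plain sequence of typed steps. The sketch below assumes a minimal in-house schema; the `Step` and `Trajectory` dataclasses and their field names are illustrative, not a standard format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    kind: str                   # "thought" | "action" | "observation"
    content: str                # free text, or the tool call rendered as text
    tool: Optional[str] = None  # tool name, set only for action steps

@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

# Logging the example trajectory from the text.
traj = Trajectory(question="Which GPU offers better price/performance?")
traj.steps.append(Step("thought", "I need current GPU benchmarks"))
traj.steps.append(Step("action", 'web_search("H100 vs RTX 5090 benchmarks 2026")',
                       tool="web_search"))
traj.steps.append(Step("observation", "Retrieved benchmark data"))
traj.steps.append(Step("thought", "Compare training throughput and cost"))
traj.steps.append(Step("action", "calculate_price_performance_ratio()",
                       tool="calculator"))
traj.final_answer = "(final answer text)"
```

With the steps stored this way, every evaluation dimension discussed below can be computed by iterating over `traj.steps`.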


Key Dimensions of Trajectory Evaluation

Modern trajectory evaluation typically measures several dimensions:

| Dimension | What It Measures | Why It Matters |
| --- | --- | --- |
| Efficiency | Number of steps, tool calls, tokens used | Resource usage and speed |
| Tool Selection Quality | Was the right tool chosen at each step? | Appropriate tool usage |
| Reasoning Quality | Logical consistency and soundness | Reliability of thought process |
| Error Recovery | How well the agent handles mistakes | Robustness |
| Safety & Compliance | Did the agent respect constraints? | Security and policy adherence |

A comprehensive evaluation combines these into a balanced score.
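One common way to combine the dimensions is a weighted average of normalized per-dimension scores. The weights below are illustrative assumptions, not a recommendation; in practice they are tuned to the application's priorities.

```python
def combined_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total_weight

# Example dimension scores and (assumed) weights.
scores = {"efficiency": 0.8, "tool_selection": 1.0, "reasoning": 0.7,
          "error_recovery": 0.9, "safety": 1.0}
weights = {"efficiency": 1, "tool_selection": 2, "reasoning": 2,
           "error_recovery": 1, "safety": 2}

overall = combined_score(scores, weights)  # ≈ 0.8875 with these numbers
```

Weighting safety and reasoning above raw efficiency reflects the point made earlier: a fast trajectory that reasons badly is not a good trajectory.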


Efficiency Metrics

Common efficiency metrics include:

  - Total number of steps in the trajectory
  - Number of tool calls made
  - Total tokens consumed
  - End-to-end latency

Shorter, more direct trajectories are generally better, but extremely short ones may indicate guessing rather than reasoning.
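These counts fall directly out of a logged trajectory. The sketch below assumes steps are stored as dicts with a `kind` field and that a per-step token count is available; both are assumptions about the logging format, not a standard.

```python
def efficiency_metrics(steps, token_counts):
    """Basic efficiency numbers for one trajectory.

    steps: list of dicts like {"kind": "thought" | "action" | "observation"}
    token_counts: tokens consumed per step (same length as steps)
    """
    return {
        "total_steps": len(steps),
        "tool_calls": sum(1 for s in steps if s["kind"] == "action"),
        "total_tokens": sum(token_counts),
    }

# The five-step example trajectory from earlier in the article.
steps = [
    {"kind": "thought"}, {"kind": "action"}, {"kind": "observation"},
    {"kind": "thought"}, {"kind": "action"},
]
metrics = efficiency_metrics(steps, [12, 20, 150, 15, 18])
# metrics == {"total_steps": 5, "tool_calls": 2, "total_tokens": 215}
```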


Tool Usage Quality

Evaluate whether the agent:

  - Chose the appropriate tool for each subtask
  - Supplied correct, well-formed arguments
  - Actually used the tool's output in subsequent reasoning
  - Avoided redundant or unnecessary calls

Example of poor tool usage: using a calculator for a web search task.
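When a reference trajectory exists, tool selection can be scored automatically by comparing the tools the agent chose against the tools an expert would have chosen. The exact in-order matching below is one simple strategy among several; the function name and signature are illustrative.

```python
def tool_selection_accuracy(chosen, expected):
    """Fraction of positions where the agent picked the expected tool.

    Dividing by the longer sequence penalizes both missing and extra calls.
    """
    if not expected:
        return 1.0
    matches = sum(1 for c, e in zip(chosen, expected) if c == e)
    return matches / max(len(chosen), len(expected))

# The poor-usage case from the text: a calculator where a search was needed.
bad = tool_selection_accuracy(["calculator"], ["web_search"])          # 0.0
good = tool_selection_accuracy(["web_search", "calculator"],
                               ["web_search", "calculator"])           # 1.0
```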


Reasoning Quality

Assess whether the agent's thoughts show:

  - Logical consistency from one step to the next
  - Grounding in the observations actually retrieved
  - Clear progress toward answering the question

LLM-as-a-Judge is frequently used here, with structured rubrics that score individual reasoning steps.
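An LLM judge typically receives each reasoning step together with its context and a rubric, and returns a structured score. The prompt format and rubric criteria below are illustrative assumptions; the model call itself is omitted, and the sample JSON response stands in for a real judge's output.

```python
import json

RUBRIC = """Score the reasoning step from 1-5 on each criterion:
- consistency: does it follow logically from prior steps?
- grounding: is it supported by the observations so far?
- progress: does it move toward answering the question?
Respond as JSON: {"consistency": n, "grounding": n, "progress": n}"""

def judge_prompt(question, prior_steps, step):
    """Build the prompt sent to a judge model for one reasoning step."""
    return (f"Question: {question}\n"
            "Prior steps:\n" + "\n".join(prior_steps) + "\n"
            f"Step under review: {step}\n\n{RUBRIC}")

# In a real pipeline the prompt goes to a judge model; here we parse a
# hypothetical response to show the scoring shape.
response = '{"consistency": 5, "grounding": 4, "progress": 5}'
scores = json.loads(response)
step_score = sum(scores.values()) / (3 * 5)  # normalize to [0, 1]
```

Scoring per step, rather than per trajectory, localizes exactly where the reasoning went wrong, which is what makes the results actionable.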


Example Trajectory Evaluation Pipeline

A practical pipeline usually includes:

  1. Logging the full trajectory (thoughts, actions, observations)
  2. Applying automated metrics (efficiency, tool accuracy)
  3. Using LLM judges for qualitative assessment of reasoning
  4. Aggregating results into dashboards and reports

This combination provides both quantitative metrics and qualitative insights.
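The four stages above can be wired together in a few lines. The sketch below is a minimal version of stages 2-4 for a single logged trajectory; `judge_score` is a stub standing in for an LLM judge's qualitative rating, and the aggregation formula is an illustrative assumption.

```python
def evaluate_trajectory(steps, judge_score):
    """Merge automated metrics with a judge score into one report.

    steps: logged trajectory as dicts with a "kind" field (stage 1 output)
    judge_score: qualitative reasoning rating in [0, 1] (stage 3 output)
    """
    tool_calls = sum(1 for s in steps if s["kind"] == "action")
    report = {
        "total_steps": len(steps),
        "tool_calls": tool_calls,
        "reasoning_score": judge_score,
    }
    # Simple aggregate: reward good reasoning, gently penalize length.
    report["overall"] = judge_score * (10 / (10 + len(steps)))
    return report

report = evaluate_trajectory(
    [{"kind": "thought"}, {"kind": "action"}, {"kind": "observation"}],
    judge_score=0.9,
)
```

Per-trajectory reports like this one are what the dashboards in stage 4 aggregate across a whole test suite.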


Looking Ahead

In this article we explored Trajectory Evaluation — why analyzing the full reasoning path is essential for understanding agent performance beyond simple outcome scoring.

In the next article we will examine Building Evaluation Pipelines, which covers how to automate large-scale agent testing using benchmarks, regression suites, and continuous evaluation systems.

→ Continue to 9.4 — Building Evaluation Pipelines