Testing autonomous AI agents is the part of the stack that most teams underestimate. We have spent the last two years getting comfortable with prompt evaluation, golden datasets, and the occasional LLM judge. None of that is enough once the system starts planning, calling tools, and looping over its own decisions. Below is my synthesis of the best material I have read on agent evaluation recently, plus the approach that is actually working in production for Capitol Trades Tracker, the agentic AI app I run for tracking congressional stock trades.

Why classic input output testing breaks

The clearest framing comes from Comet’s piece on agent evaluation. Traditional testing assumes a single input maps to a single output. Agents do not behave that way. They branch, retry, recover, and sometimes solve the right problem the wrong way. A pass or fail on the final answer hides everything interesting about how the agent got there.

The agent evaluation deep dive makes the same point when it separates outcome evaluation from trajectory evaluation. Outcome tells you the agent finished. Trajectory tells you whether it should be trusted to finish again next time. Both matter, and they fail in different ways.

The four layers I care about

After reading through the agent evals guide, the Evaluating AI Agents manual, and the AI Evals Roadmap by Hamel Husain and friends, I keep coming back to four layers that need their own tests:

Layer What it measures Why it matters
Final outcome Did the agent solve the task Easy to score, easy to game
Trajectory Which tools were called, in what order, with what arguments Where most real bugs live
Planning quality Is the plan coherent and well decomposed Catches reasoning failures before they ship
Runtime behavior Latency, cost, retries, hallucinated tool calls, silent failures Determines whether the agent is viable in production

If you only test the first one, you ship an agent that passes your evals and quietly burns money in production.

Trajectory evaluation is the unlock

The single biggest shift for me was treating trajectories as the primary unit of evaluation. The agent evals guide describes this well, and the O11yBench benchmark takes it further by measuring agents on real observability workflows like log triage and incident response. The benchmark scores the path, not just the conclusion. That matches what I see when reviewing agent traces. A correct answer reached through six redundant tool calls is a failure waiting to happen at scale.

Practical version of this in code looks like trace based assertions. Capture the full execution, then write checks like:

assert trace.contains_tool_call("search_logs")
assert trace.tool_call_count("retry_query") <= 1
assert trace.total_tokens < 8000

This is closer to integration testing than unit testing. That is the point.

LLM judges are useful but not load bearing

The agent evals guide and the roadmap article both spend time on LLM as judge patterns. They work for grading open ended responses where a rubric is hard to encode. They are unreliable as the only signal. The pattern I trust is rubric scoring with a small, fixed rubric per task type, calibrated against human labels on a sample. Anything beyond that drifts.

The AlphaEval approach pushes this further by grounding evaluation in real business workflows across software engineering, finance, and operations. The lesson is that synthetic benchmarks tell you the agent can do tasks. Real workflow benchmarks tell you the agent can do your tasks.

Production is its own test environment

The Reinventing.ai piece on production testing argues that synthetic benchmarks systematically miss the failure modes that matter. I agree. Evaluation drift is real. The distribution of user requests in week six rarely matches the distribution you designed evals for in week one.

What I do in production:

  • Sample a fixed percentage of live traces every day and score them with the same rubric used in CI.
  • Alert on trajectory anomalies, not just error rates. A 30 percent jump in average tool calls per task is a bug, even if nothing crashes.
  • Keep a small human-in-the-loop review queue for the lowest-confidence runs. The cost is low and the signal is the best you can get.

This is the same loop the Evaluating AI Agents manual recommends, and it matches what Husain calls operational evaluation in the roadmap article.

My assertive take

Most teams testing agents today are still doing prompt evals dressed up as agent evals. That is not enough. Here is what I think actually works, in order of impact:

  1. Trace everything from day one. If you cannot replay an agent run end to end, you cannot evaluate it. OpenTelemetry style instrumentation is non-negotiable.
  2. Score trajectories, not just outputs. Tool correctness, call order, retry behavior, and token budget belong in your test suite alongside final answers.
  3. Build a real-workflow eval set. Twenty hand-curated tasks from your actual product beat two thousand synthetic ones. AlphaEval and O11yBench are right about this.
  4. Run evals in CI and in production. The same rubric, the same scoring code, sampled live. Drift is the default state of any LLM system.
  5. Use LLM judges sparingly. Calibrate against humans, keep rubrics short, never let a judge be the only gate.
  6. Treat reliability as a product feature. Latency, cost, and consistency are part of correctness for an agent. A 90 percent accurate agent that costs four dollars per run is broken.

The teams shipping reliable agents are not the ones with the cleverest prompts. They are the ones who treat evaluation as engineering infrastructure, with the same seriousness they would give a database migration or a payments pipeline. That is the bar.

If you want a single starting point, read the agent evals guide for the conceptual frame, then go straight to the Evaluating AI Agents manual for the operational playbook. Everything else is variations on those two themes.