How to test LLM agent tool selection deterministically in CI

Tool-using agents fail in a particularly dangerous way: they return a plausible-looking success message while having called the wrong tool entirely — delete_record instead of read_record, send_email instead of draft_email, write_file instead of append_file. The final text looks fine; the side effect is catastrophic. Standard output-text assertions miss this class of bug completely. The fix is to assert the tool-call trace, not the response string.

Why output-text assertions aren’t enough

When an agent makes a tool call, the model produces two separate outputs: a natural-language response and a structured tool invocation. Nothing forces the two to agree. A model can say “I’ve retrieved your record” while having dispatched delete_record(id=42). Eval frameworks that only score the semantic similarity of the final answer will pass this case, because the text matches the expected answer. You need invariants on the trace: the ordered sequence of tool names, argument shapes, and call counts that the agent actually dispatched.
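To make the gap concrete, here is a minimal Python sketch. The result shape (a dict with `text` and `tool_calls`) and the two helper functions are assumptions made up for illustration, not the format of any particular framework.

```python
# Hypothetical agent result: the reply text and the dispatched calls are separate fields.
result = {
    "text": "I've retrieved your record.",
    "tool_calls": [{"name": "delete_record", "args": {"id": 42}}],
}


def text_check(res: dict) -> bool:
    """Output-text assertion: only inspects the natural-language reply."""
    return "retrieved" in res["text"].lower()


def trace_check(res: dict) -> bool:
    """Trace assertion: inspects which tools were actually dispatched."""
    return [call["name"] for call in res["tool_calls"]] == ["read_record"]


print(text_check(result))   # True: the reply looks fine
print(trace_check(result))  # False: the wrong tool was called
```

The same result object passes one check and fails the other, which is why the trace has to be asserted separately.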

Deterministic tool-call invariants

A deterministic invariant is a rule that produces the same pass/fail verdict every run, with no LLM-as-judge in the loop. For tool selection, useful invariants include:

exact tool name match — the agent called read_record, not any other tool
tool call count — exactly one call, not two (no redundant writes)
argument presence — the id parameter was present and non-null
tool ordering — authenticate was called before fetch_data
tool exclusion — delete_* was never called during a read-only workflow

These are binary. They don’t require a second LLM to judge whether the answer “seems right.” That makes them reproducible, diff-able in PRs, and defensible in a compliance review, which are properties a sampled LLM-as-judge score cannot guarantee for this use case.
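Each invariant in the list above can be written as a small pure function over the trace. The following is a sketch under an assumed trace shape (an ordered list of dicts with `name` and `args` keys); the function names are hypothetical and not taken from any runner.

```python
from fnmatch import fnmatchcase

# Assumed trace shape: an ordered list of {"name": str, "args": dict} entries.
Trace = list[dict]


def called_exactly(trace: Trace, names: list[str]) -> bool:
    """Exact name match and call count: the trace is exactly these tools, in order."""
    return [c["name"] for c in trace] == names


def has_argument(trace: Trace, tool: str, arg: str) -> bool:
    """Argument presence: the tool was called, and every call carries a non-null arg."""
    calls = [c for c in trace if c["name"] == tool]
    return bool(calls) and all(c["args"].get(arg) is not None for c in calls)


def called_before(trace: Trace, first: str, second: str) -> bool:
    """Tool ordering: `first` appears before the earliest call to `second`."""
    names = [c["name"] for c in trace]
    return first in names and second in names and names.index(first) < names.index(second)


def never_called(trace: Trace, pattern: str) -> bool:
    """Tool exclusion: no call matches the glob pattern, e.g. 'delete_*'."""
    return not any(fnmatchcase(c["name"], pattern) for c in trace)


trace = [
    {"name": "authenticate", "args": {"user": "alice"}},
    {"name": "fetch_data", "args": {"id": 42}},
]
print(called_exactly(trace, ["authenticate", "fetch_data"]))  # True
print(has_argument(trace, "fetch_data", "id"))                # True
print(called_before(trace, "authenticate", "fetch_data"))     # True
print(never_called(trace, "delete_*"))                        # True
```

Because every function is a plain comparison on recorded data, the same trace always yields the same verdict.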

Catching the wrong-tool bug before it ships

The workflow is: write a case that specifies the expected tool trace, run it against your agent adapter, fail the build if the trace diverges. No human in the loop, no flaky scoring.

Start by installing the runner and smoke-testing against a live model:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

This confirms your environment is wired up. Then point it at your actual case files and agent adapter:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The --adapter flag takes a Python import path to a callable that wraps your agent and returns a structured result including the tool-call trace. The runner compares that trace against the invariants declared in each case file and writes a pass/fail report to signoff.md.
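The exact return type the runner expects is not spelled out here, so the following only sketches the general idea of an adapter: a callable that runs the agent and records every tool dispatch. The `AgentResult` shape, the `agent` signature, and the `TOOLS` registry are all hypothetical, and a stub stands in for the model so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AgentResult:
    """Hypothetical result shape: final text plus the ordered tool-call trace."""
    text: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)


# Hypothetical tool registry; real tools would have side effects.
TOOLS: dict[str, Callable[..., Any]] = {
    "read_record": lambda id: {"id": id, "status": "active"},
}


def stub_model(prompt: str) -> dict[str, Any]:
    """Stand-in for the LLM: always chooses read_record. Swap in a real client call."""
    return {"name": "read_record", "args": {"id": 42}}


def agent(prompt: str) -> AgentResult:
    """Adapter sketch: runs one agent step and records what was dispatched."""
    trace: list[dict[str, Any]] = []
    call = stub_model(prompt)
    trace.append(call)  # record before dispatch so a failing tool still shows up
    output = TOOLS[call["name"]](**call["args"])
    text = f"Record {output['id']} is {output['status']}."
    return AgentResult(text=text, tool_calls=trace)


result = agent("Look up record 42")
print(result.tool_calls)
```

Recording the call before executing it is a deliberate choice in this sketch: the trace should reflect what the model asked for, even if the tool itself raises.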

What a tool-selection case looks like

Each case declares the scenario, the expected tool calls, and any exclusions. A read-only lookup case would assert that read_record was called exactly once, that delete_record was never called, and that the id argument was populated. A multi-step case can assert ordering: authenticate must precede fetch_data. These invariants are evaluated against the raw trace your adapter returns — no interpretation, no scoring rubric.
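The case file format itself is not shown in this article, so the sketch below expresses the same idea as a Python dataclass rather than guessing at the on-disk syntax. `ToolCase`, its field names, and `evaluate` are hypothetical; the trace is assumed to be a list of dicts with `name` and `args` keys.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class ToolCase:
    """Hypothetical case: expected call counts, required args, ordering, exclusions."""
    name: str
    expected_counts: dict[str, int] = field(default_factory=dict)
    required_args: dict[str, list[str]] = field(default_factory=dict)
    must_precede: list[tuple[str, str]] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def evaluate(case: ToolCase, trace: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the case passes."""
    names = [c["name"] for c in trace]
    failures = []
    for tool, count in case.expected_counts.items():
        if names.count(tool) != count:
            failures.append(f"{tool}: expected {count} call(s), got {names.count(tool)}")
    for tool, args in case.required_args.items():
        for call in (c for c in trace if c["name"] == tool):
            for arg in args:
                if call["args"].get(arg) is None:
                    failures.append(f"{tool}: argument '{arg}' missing or null")
    for first, second in case.must_precede:
        # In this sketch, a missing tool on either side counts as an ordering failure.
        if first not in names or second not in names or names.index(first) > names.index(second):
            failures.append(f"{first} must precede {second}")
    for pattern in case.forbidden:
        for tool_name in names:
            if fnmatchcase(tool_name, pattern):
                failures.append(f"forbidden tool called: {tool_name}")
    return failures


read_only = ToolCase(
    name="read-only lookup",
    expected_counts={"read_record": 1},
    required_args={"read_record": ["id"]},
    forbidden=["delete_*"],
)

good = [{"name": "read_record", "args": {"id": 42}}]
bad = [{"name": "delete_record", "args": {"id": 42}}]

print(evaluate(read_only, good))  # []
print(evaluate(read_only, bad))   # two failures: wrong count, forbidden tool
```

Returning a list of failures instead of a single boolean keeps the verdict binary while still telling the reviewer which invariant broke.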

The OWASP Agentic Top 10 explicitly flags unsafe tool invocation and privilege escalation as top risks for agentic systems. Tool-selection invariants are a direct, testable control against those risks: you can point to a case file and a green CI badge and say “this agent did not call destructive tools in any scenario this workflow’s cases cover.” That is evidence about the tested scenarios rather than a proof about every possible input, so coverage of the case set matters.

Locking it into CI

Once cases pass locally, add the GitHub Action so every push is gated:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

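The action exits non-zero on any invariant failure, which blocks the merge when the check is marked as required in branch protection. The verdict is a deterministic function of the trace: the same trace produces the same result on every runner, with no variance from a judge model’s temperature. The agent under test can still vary between runs when it calls a live model, so pin the model version and sampling settings where you can; that keeps “it passed yesterday” debugging sessions about real behavior changes rather than about the scorer.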
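The gating mechanism is the ordinary CI convention that a step fails when its process exits non-zero. How the action implements this internally is not shown here; the sketch below only illustrates the mapping from per-case failures to an exit code, with a hypothetical `exit_code` helper and made-up case names.

```python
import sys


def exit_code(failures_by_case: dict[str, list[str]]) -> int:
    """Return 0 when every case has no failures, 1 otherwise."""
    failed = {name: fails for name, fails in failures_by_case.items() if fails}
    for name, fails in failed.items():
        for failure in fails:
            print(f"FAIL {name}: {failure}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical results: one passing case, one failing case.
results = {
    "read-only lookup": [],
    "login then fetch": ["authenticate must precede fetch_data"],
}
code = exit_code(results)
print(code)  # 1
# A CI entry point would finish with: sys.exit(code)
```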

The practical payoff

The wrong-tool bug is silent, high-severity, and common as agents gain more tools. Text-output evals don’t catch it. LLM-as-judge evals add cost, latency, and non-determinism without solving it. A trace-level invariant check is cheap (it’s a string comparison), runs in seconds, and produces a binary signal that fits naturally into a PR gate. Add it before you add more tools to your agent, not after an incident.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt