Deterministic LLM agent evaluation without LLM-as-judge
Shipping an LLM agent to production requires a quality gate you can actually defend — and that’s where LLM-as-judge approaches quietly break down.
The reproducibility problem with LLM-as-judge
When you use a second language model to evaluate your first one, you inherit all the same failure modes you were trying to catch. The judge is non-deterministic: run the same evaluation twice and you can get different verdicts. Temperature, sampling, and prompt sensitivity mean a borderline response might pass on Monday and fail on Wednesday with no code change on your end. That’s not a CI gate — that’s a coin flip with extra steps.
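To put a rough number on the "coin flip" point, here is a small back-of-the-envelope sketch. The 5% flip probability and the 20 borderline cases are hypothetical values chosen for illustration, not measurements, and the model assumes each case flips independently.

```python
def flaky_failure_rate(flip_prob: float, borderline_cases: int) -> float:
    """Chance that at least one borderline case flips to a failing verdict
    in a single run, assuming each case flips independently."""
    return 1 - (1 - flip_prob) ** borderline_cases


# Hypothetical inputs: 20 borderline cases, each with a 5% chance of
# flipping verdict on any given run.
print(f"{flaky_failure_rate(0.05, 20):.0%}")  # prints 64%
```

Under those assumptions, most runs would go red with no code change at all, which is why a gate built on a sampled judge is hard to trust.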
The bias problem compounds this. LLM judges tend to favor responses that are longer, more confident, or stylistically similar to their own output. A terse-but-correct refusal to a prompt injection attempt can score lower than a verbose-but-wrong response that sounds authoritative. For security-sensitive behaviors, which are exactly the cases that matter most for production sign-off, this is a serious liability.
There’s also the audit problem. When a compliance reviewer or security team asks “how do you know your agent won’t leak PII in this scenario?”, “we ran it through GPT-4 and it seemed fine” is not a defensible answer. You need a reproducible, inspectable assertion that produces the same result every time.
What deterministic evaluation actually looks like
Deterministic eval replaces the judge model with explicit, rule-based assertions over the agent’s output and behavior:
- Required keywords: the response must contain specific strings (e.g., a refusal phrase, a required disclaimer)
- Forbidden keywords: the response must not contain specific strings (e.g., internal system prompt contents, PII patterns)
- Regex matching: structured output validation, phone/email/credential pattern detection
- Refusal detection: did the agent correctly decline a prompt that should be refused?
- Tool-call trace invariants: did the agent call the right tools, in the right order, without calling tools it shouldn’t have touched?
These assertions are deterministic by construction. The same case, the same agent output, the same result — every single run. You can check them into version control, diff them, and point to them in a sign-off document.
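The text-level checks in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the runner's actual implementation: `TextCase`, `check_text`, `EMAIL_RE`, and the refusal phrase list are all hypothetical names and values invented for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; tune them for your own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REFUSAL_PHRASES = ("i can't help with that", "i cannot help with that", "i won't")


@dataclass
class TextCase:
    required: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)
    forbidden_patterns: list[re.Pattern] = field(default_factory=list)
    must_refuse: bool = False


def check_text(response: str, case: TextCase) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    text = response.lower()
    failures = []
    for kw in case.required:
        if kw.lower() not in text:
            failures.append(f"missing required keyword: {kw!r}")
    for kw in case.forbidden:
        if kw.lower() in text:
            failures.append(f"found forbidden keyword: {kw!r}")
    for pattern in case.forbidden_patterns:
        if pattern.search(response):
            failures.append(f"matched forbidden pattern: {pattern.pattern}")
    # Refusal detection here is a simple phrase match, not a classifier.
    if case.must_refuse and not any(p in text for p in REFUSAL_PHRASES):
        failures.append("expected a refusal, none detected")
    return failures
```

Because every check is a string or regex operation, the same response always yields the same list of failures, and each failure message names the exact rule that was violated.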
This approach maps directly to the OWASP Agentic Top 10 risk categories: prompt injection, excessive agency, tool misuse, data exfiltration, and insecure output handling all have concrete, testable behavioral signatures you can assert against deterministically.
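For the tool-related categories (excessive agency, tool misuse), the behavioral signature lives in the tool-call trace rather than in the response text. The sketch below assumes the trace is available as a flat list of tool names. `check_trace` and the tool names in the usage note are hypothetical, and a real trace would likely carry arguments as well.

```python
def check_trace(
    trace: list[str], allowed: set[str], required_order: list[str]
) -> list[str]:
    """Check a tool-call trace against an allowlist and a required ordering.

    Returns failure messages; an empty list means every invariant held.
    """
    failures = []
    for name in trace:
        if name not in allowed:
            failures.append(f"called disallowed tool: {name}")
    # required_order must appear as a subsequence of the trace: each search
    # resumes where the previous one stopped, so order is enforced.
    remaining = iter(trace)
    for name in required_order:
        if not any(called == name for called in remaining):
            failures.append(f"missing or out-of-order tool call: {name}")
            break
    return failures
```

For example, `check_trace(["search", "send_email"], {"search"}, ["search"])` reports the `send_email` call as disallowed, which is the kind of concrete evidence an excessive-agency review needs.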
Try it in one command
The fastest way to see this in practice is to run the built-in sample cases against a live model:
pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o
This runs a small set of pre-built cases — including prompt injection probes and refusal checks — and gives you a pass/fail result with a per-case breakdown. No judge model, no scoring rubric, no ambiguity.
Running against your own agent
Point the runner at your own case directory and agent adapter:
agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md
The --adapter flag takes a Python import path to a callable that wraps your agent. The runner feeds each case to your agent, collects the response and tool-call trace, evaluates every assertion, and writes a signoff.md you can attach to a pull request or release ticket. The report is deterministic: the same agent behavior produces the same report, making it suitable for regulated environments where you need an audit trail.
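As a rough idea of what `my_module:agent` might look like, here is a hypothetical adapter. The exact signature and return shape the runner expects are not shown in this post, so the single `prompt` argument and the `response` / `tool_calls` keys are assumptions to be checked against the project's documentation. `_my_agent` is a stand-in for your real agent.

```python
# my_module.py -- hypothetical adapter sketch; the interface is assumed.
from typing import Any


def _my_agent(prompt: str) -> dict[str, Any]:
    """Stand-in for your real agent; replace with an actual call."""
    return {"text": "I can't help with that.", "tools": []}


def agent(prompt: str) -> dict[str, Any]:
    """Wrap the agent so the response text and tool-call trace come back
    in one plain dict (assumed shape, not confirmed by the runner)."""
    result = _my_agent(prompt)
    return {
        "response": result["text"],
        "tool_calls": list(result["tools"]),
    }
```

Keeping the adapter this thin means the assertions exercise the same code path your users do, with no evaluation-only logic in between.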
Putting it in CI
The natural home for this gate is your pull request pipeline:
# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install your package so the adapter import path resolves
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          # Assumes a repository secret named OPENAI_API_KEY
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
With this check marked as required in your branch protection rules, it blocks merges when behavioral regressions are introduced, the same way unit tests block broken logic. Unlike hosted LLM-as-judge platforms, there’s no per-evaluation API cost for the judge layer, no judge non-determinism to average away with multiple runs, and no black-box scoring to explain to a reviewer.
The practical tradeoff
Deterministic assertions require you to specify what correct behavior looks like, which takes more upfront thought than asking a judge model to “rate quality 1-10.” That specificity is the point. Vague quality scores don’t block deployments; precise behavioral invariants do. For security properties, refusal behavior, and tool-call correctness — the things that actually matter for production sign-off — deterministic assertions are the only approach that gives you a reproducible, defensible answer.
Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt