A free CI gate for AI agents vs LLM-as-judge evals

Deterministic pass/fail assertions are quietly solving a problem that LLM-as-judge evaluation has struggled with since the moment teams tried to wire it into CI: reproducibility.

Why LLM-as-judge breaks as a build gate

LLM-as-judge evaluation works by sending your agent’s output to a second language model and asking it to score quality. The approach is flexible and handles open-ended outputs well, but it carries three structural problems when you need a hard CI gate:

Non-determinism. The same output can score 7/10 on one run and 5/10 on the next. Temperature, model version drift, and prompt sensitivity all introduce variance. A gate that flips without a code change is not a gate — it’s noise.

Cost at scale. Every push triggers a secondary inference call for every eval case, so the number of judge calls grows as pushes times cases. On a team running dozens of cases across feature branches, that bill grows with every commit.

Auditability. When a compliance reviewer or a security team asks “why did this build pass?”, “the judge model said so” is a hard answer to defend. Deterministic assertions produce a diff-able, human-readable report that maps each case to a specific assertion and a specific result.

Hosted LLM-as-judge platforms add dashboards and trend lines on top of this, which is useful for research and regression analysis — but it’s a different product category than a lightweight, free CI gate.
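To make the non-determinism point concrete, here is a minimal sketch of a threshold gate fed the two scores from the example above (7/10 and 5/10 for the same output). The `judge_gate` function and the pass threshold of 6 are assumptions for illustration, not part of any particular judge platform.

```python
def judge_gate(score: int, threshold: int = 6) -> bool:
    """Pass the build when the judge's score meets the threshold (hypothetical gate)."""
    return score >= threshold


# Two judge runs over the same, unchanged agent output (scores taken from the text).
scores_for_same_output = [7, 5]
verdicts = [judge_gate(score) for score in scores_for_same_output]

# The gate is flaky if identical input produced more than one distinct verdict.
gate_is_flaky = len(set(verdicts)) > 1
print(verdicts, gate_is_flaky)
```

Nothing in the agent changed between the two runs, yet the build would go from green to red.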

What deterministic evaluation looks like instead

Deterministic agent evaluation replaces the judge model with explicit assertions over structured outputs: did the agent call the right tool, did it refuse when it should have refused, did it stay within the declared scope, did it leak data it shouldn’t have touched?

These checks are functions, not prompts. They return True or False. Run them twice on the same output and you get the same answer. That property is what makes a gate defensible.
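As a rough sketch of what "functions, not prompts" can mean in practice, the checks listed above might look like the following. The `AgentOutput` record, its field names, and the helper functions are hypothetical and are not the agent-eval-runner API; they only assume that the agent's behavior is captured as structured data.

```python
from dataclasses import dataclass, field


@dataclass
class AgentOutput:
    """Hypothetical structured record of one agent turn."""
    tool_calls: list[str] = field(default_factory=list)
    refused: bool = False
    text: str = ""


def called_tool(output: AgentOutput, tool: str) -> bool:
    """True if the agent invoked the named tool at least once."""
    return tool in output.tool_calls


def stayed_in_scope(output: AgentOutput, allowed: set[str]) -> bool:
    """True if every tool call is in the declared allow-list."""
    return set(output.tool_calls) <= allowed


def leaked_nothing(output: AgentOutput, secrets: list[str]) -> bool:
    """True if none of the given secret strings appear in the reply text."""
    return not any(secret in output.text for secret in secrets)


output = AgentOutput(tool_calls=["search_docs"], text="Here is the summary.")
checks = {
    "called_tool": called_tool(output, "search_docs"),
    "stayed_in_scope": stayed_in_scope(output, {"search_docs", "read_file"}),
    "refused": output.refused,
    "leaked_nothing": leaked_nothing(output, ["sk-test-123"]),
}
print(checks)
```

Each check is a pure function of the recorded output, so re-running it on the same record cannot change the verdict.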

The agent-eval-runner package takes this approach and aligns its case library to the OWASP Agentic Top 10 — the emerging standard for agentic security risks including prompt injection, excessive agency, and insecure tool use. Covering those categories with deterministic checks means your CI report maps directly to a recognized risk taxonomy, which matters when you need to show due diligence.

Try it in two minutes

Install the runner and fire a quick smoke test against a hosted model to see the output format:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

This runs a small bundled case set and prints a pass/fail summary to stdout. No account, no dashboard, no data leaving your environment except the inference call you already control.

For a full run against your own agent, point the runner at a cases directory and an adapter function that wraps your agent:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md
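The `my_module:agent` argument reads like a `module:attribute` reference to a callable. The adapter signature the runner expects is not documented in this post, so the sketch below assumes the simplest plausible shape: a function that takes a prompt string and returns a dict describing what the agent did. The parameter name, the return keys, and `run_my_agent` are all hypothetical; check the package documentation for the actual contract.

```python
# my_module.py (hypothetical layout)
from typing import Any


def run_my_agent(prompt: str) -> dict[str, Any]:
    """Stand-in for your real agent; replace with a call into your own code."""
    if "delete" in prompt.lower():
        return {"text": "I can't do that.", "tool_calls": [], "refused": True}
    return {"text": f"Handled: {prompt}", "tool_calls": ["search_docs"], "refused": False}


def agent(prompt: str) -> dict[str, Any]:
    """Assumed adapter shape: prompt in, structured result out."""
    result = run_my_agent(prompt)
    # Normalise to plain built-in types so assertions compare like with like.
    return {
        "text": str(result["text"]),
        "tool_calls": list(result["tool_calls"]),
        "refused": bool(result["refused"]),
    }
```

Keeping the adapter thin means the same agent code runs in production and under evaluation, with only the output normalisation living in the wrapper.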

The --report flag writes a Markdown file you can attach to a pull request, a release, or a compliance record. Each row maps a case ID to an assertion result and an optional failure reason — plain text, version-controllable, no proprietary format.
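The exact layout of signoff.md is not shown in this post. As a rough illustration of the row shape described above (case ID, result, optional failure reason), a report of that kind could be rendered along these lines. The column names, case IDs, and `render_report` helper are assumptions, not the runner's actual output.

```python
from typing import Optional


def render_report(results: list[tuple[str, bool, Optional[str]]]) -> str:
    """Render (case_id, passed, reason) rows as a Markdown table."""
    lines = ["| Case | Result | Reason |", "| --- | --- | --- |"]
    for case_id, passed, reason in results:
        status = "PASS" if passed else "FAIL"
        lines.append(f"| {case_id} | {status} | {reason or ''} |")
    return "\n".join(lines)


report = render_report([
    ("case-001", True, None),
    ("case-002", False, "called a tool outside the declared scope"),
])
print(report)
```

Because the output is plain text with one row per case, two reports from different commits can be compared with an ordinary diff.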

Wiring it into GitHub Actions

The GitHub Action wraps the same runner so you get a blocking check on every push and pull request with no additional infrastructure:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # assumes a repository secret with this name

The action exits non-zero on any assertion failure, which blocks the merge. Because the assertions are deterministic, a red build means something in your agent’s behavior changed — not that the judge model was having a bad day.
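The blocking behavior rests on an ordinary process convention: CI treats a non-zero exit status as a failed step. A minimal sketch of that convention follows; the `exit_code` helper and the case IDs are hypothetical and do not describe the action's actual internals.

```python
def exit_code(results: dict[str, bool]) -> int:
    """Return 0 only when every case passed, 1 otherwise."""
    return 0 if all(results.values()) else 1


all_green = {"case-001": True, "case-002": True}
one_red = {"case-001": True, "case-002": False}

# A CLI entry point would finish with sys.exit(exit_code(results)).
codes = (exit_code(all_green), exit_code(one_red))
print(codes)
```

A single failing case is enough to turn the status non-zero, which is what lets a required check block the merge.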

Choosing the right tool for the job

LLM-as-judge evaluation is genuinely useful for exploratory work: comparing model outputs qualitatively, catching regressions in tone or helpfulness, or evaluating tasks where the correct answer is legitimately ambiguous. Use it for that.

For a CI gate — something that blocks a deploy, signs off a release, or satisfies a security review — you want assertions that are free, fast, reproducible, and traceable to a risk standard. Deterministic evaluation is that tool. The starter pack covers five OWASP-aligned cases at no cost, and the full 28-case pack covers the complete Agentic Top 10 surface area.

The gap between “we eval in notebooks” and “eval blocks bad deploys” is smaller than it looks. A YAML file and a pip install are the whole infrastructure.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt