Skip to the content.

Deterministic agent eval that gates CI — no LLM-as-judge

A free, deterministic eval pack + runner for tool-using LLM agents. The pass/fail is computed by pure logic (keywords, regex, refusal detection, tool-call trace invariants) — not by asking a second LLM. So the gate is reproducible and defensible in a PR review, OWASP Agentic Top 10 aligned, and drops into CI in one line. It’s a pass/fail gate, not another observability platform.

pip install "agent-eval-runner[openai]"
agent-eval try --model openai:gpt-4o

Runner on PyPI · GitHub Action · free starter repo · full 28-case pack

Start here