Deterministic agent eval that gates CI — no LLM-as-judge

A free, deterministic eval pack + runner for tool-using LLM agents. The pass/fail is computed by pure logic (keywords, regex, refusal detection, tool-call trace invariants) — not by asking a second LLM. So the gate is reproducible and defensible in a PR review, OWASP Agentic Top 10 aligned, and drops into CI in one line. It’s a pass/fail gate, not another observability platform.

pip install "agent-eval-runner[openai]"
agent-eval try --model openai:gpt-4o

Runner on PyPI · GitHub Action · free starter repo · full 28-case pack

Start here

Deterministic LLM agent evaluation without LLM-as-judge
How to gate an AI agent in CI with a free GitHub Action (OWASP-Agentic-aligned)
A free, deterministic alternative to hosted LLM-eval platforms for CI gating

More guides — failure modes by framework

How to test when a custom agent loop fails OWASP Agentic Top 10 memory & data poisoning
How to test when an AI agent should refuse but complies under a justification frame
How to test when Anthropic tool use ignores a tool error and fabricates an answer
How to test when Anthropic tool use takes an unrequested high-impact action (excessive agency)
How to test when a RAG agent treats adversarial text in retrieved context as instructions
How to test when a tool-using agent uses a high-privilege tool when a read-only tool would do
How to test when AutoGen is hijacked by a prompt injection inside a retrieved document
How to test when CrewAI lets one agent’s output poison the next agent (cascading failure)
How to test when CrewAI loops or retries a failing tool forever (cost runaway)
How to test when LangChain calls the wrong tool instead of the right one
How to test when LangChain hallucinates a tool result when the tool was never called
How to test when LangGraph acts on poisoned memory from an earlier step
How to test when LangGraph skips required tools and answers from training data
How to test when OpenAI function calling gets prompt-injected by content inside a function result
How to test when OpenAI function calling passes malformed or wrong arguments to a function

Deterministic, CI-native eval for tool-using LLM agents (no LLM-as-judge)

Free, deterministic, OWASP-Agentic-aligned eval for tool-using LLM agents. A reproducible pass/fail CI gate — not another observability platform.

Deterministic agent eval that gates CI — no LLM-as-judge

Start here

More guides — failure modes by framework