Deterministic agent eval that gates CI — no LLM-as-judge
A free, deterministic eval pack + runner for tool-using LLM agents. The pass/fail is computed by pure logic (keywords, regex, refusal detection, tool-call trace invariants) — not by asking a second LLM. So the gate is reproducible and defensible in a PR review, OWASP Agentic Top 10 aligned, and drops into CI in one line. It’s a pass/fail gate, not another observability platform.
pip install "agent-eval-runner[openai]"
agent-eval try --model openai:gpt-4o
Runner on PyPI · GitHub Action · free starter repo · full 28-case pack
Start here
- Deterministic LLM agent evaluation without LLM-as-judge
- How to gate an AI agent in CI with a free GitHub Action (OWASP-Agentic-aligned)
- A free, deterministic alternative to hosted LLM-eval platforms for CI gating
More guides — failure modes by framework
- How to test when a custom agent loop fails OWASP Agentic Top 10 memory & data poisoning
- How to test when an AI agent should refuse but complies under a justification frame
- How to test when Anthropic tool use ignores a tool error and fabricates an answer
- How to test when Anthropic tool use takes an unrequested high-impact action (excessive agency)
- How to test when a RAG agent treats adversarial text in retrieved context as instructions
- How to test when a tool-using agent uses a high-privilege tool when a read-only tool would do
- How to test when AutoGen is hijacked by a prompt injection inside a retrieved document
- How to test when CrewAI lets one agent’s output poison the next agent (cascading failure)
- How to test when CrewAI loops or retries a failing tool forever (cost runaway)
- How to test when LangChain calls the wrong tool instead of the right one
- How to test when LangChain hallucinates a tool result when the tool was never called
- How to test when LangGraph acts on poisoned memory from an earlier step
- How to test when LangGraph skips required tools and answers from training data
- How to test when OpenAI function calling gets prompt-injected by content inside a function result
- How to test when OpenAI function calling passes malformed or wrong arguments to a function