agent-eval vs writing your own pytest for LLM agents

Rolling your own pytest suite for an LLM agent is completely viable — pytest is flexible, you already know it, and nothing stops you from asserting on agent outputs. The honest question is how much scaffolding you end up rebuilding before the tests are actually useful in CI.

What hand-rolled pytest gives you (and what it doesn’t)

A basic pytest file for an agent usually starts clean: call the agent, assert the response contains something expected. That covers happy-path regression. Problems surface quickly when you need more:

Trace assertions. Agents don’t just return strings — they call tools, branch on intermediate outputs, and sometimes loop. Asserting on the final response misses a tool that fired when it shouldn’t have, or a reasoning step that leaked a system prompt. Writing fixtures that capture and expose the full execution trace — tool calls, intermediate states, stop reasons — takes real work and varies by framework (LangChain, LlamaIndex, raw function-calling all expose traces differently).
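As a rough illustration of what that fixture work looks like, here is a minimal framework-free sketch: each tool is wrapped so that every call lands in a list the test can inspect. `TraceRecorder`, `lookup_order`, `issue_refund`, and `toy_agent` are hypothetical names invented for this example; a real framework would expose its trace through its own callbacks or run objects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TraceRecorder:
    """Records every tool call so tests can assert on the whole run."""
    calls: list[tuple[str, dict[str, Any]]] = field(default_factory=list)

    def wrap(self, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(**kwargs: Any) -> Any:
            self.calls.append((name, kwargs))  # record the call, then execute it
            return fn(**kwargs)
        return wrapped

    def tool_names(self) -> list[str]:
        return [name for name, _ in self.calls]


# Hypothetical tools and a toy agent standing in for the real thing.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"


def issue_refund(order_id: str) -> str:
    return f"refunded {order_id}"


def toy_agent(prompt: str, tools: dict[str, Callable[..., Any]]) -> str:
    # A real agent would let the model choose tools; this one is hard-coded.
    return tools["lookup_order"](order_id="A1")


def test_status_question_does_not_trigger_refund():
    trace = TraceRecorder()
    tools = {
        "lookup_order": trace.wrap("lookup_order", lookup_order),
        "issue_refund": trace.wrap("issue_refund", issue_refund),
    }
    answer = toy_agent("Where is order A1?", tools)
    assert "shipped" in answer                         # final response check
    assert trace.tool_names() == ["lookup_order"]      # exact tool sequence
    assert "issue_refund" not in trace.tool_names()    # forbidden tool never fired
```

The last two assertions are the ones a response-only test cannot express.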

Deterministic pass/fail without an LLM judge. The tempting shortcut is to use another LLM to grade outputs (“did this response seem helpful?”). That introduces non-determinism: the same test run can flip between pass and fail because the judge model’s own output varies from one call to the next. For a CI gate you need a reproducible result. Deterministic checks (exact string matching, regex, JSON schema validation, tool-call presence/absence) require you to design each assertion carefully rather than delegating to a second model.
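For a sense of what those checks look like when written by hand, here is a small standard-library sketch. The function names are made up for this example, and `check_json_keys` is only a stand-in for real JSON Schema validation, which would need a dedicated library.

```python
import json
import re


def check_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_keys(output: str, required: set[str]) -> bool:
    """Pass if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def check_tools(called: list[str], must_call: set[str], must_not_call: set[str]) -> bool:
    """Pass if every required tool fired and no forbidden tool did."""
    seen = set(called)
    return must_call <= seen and not (must_not_call & seen)
```

Each function returns the same boolean for the same input, which is the property a CI gate depends on.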

OWASP Agentic Top 10 coverage. The OWASP Agentic Top 10 identifies failure modes specific to autonomous agents: prompt injection, excessive agency, tool misuse, unsafe plan execution, and others. A hand-rolled suite rarely covers these systematically unless someone on the team has read the spec and deliberately written cases for each category. Most teams write tests for what they’ve already seen break, which means the coverage is reactive rather than structured.
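One cheap way to make coverage structured rather than reactive is to tag each test case with a category and report the categories that have no cases. The sketch below assumes cases carry a plain string tag and lists only the four categories named above, not the full OWASP list; `coverage_gaps` is a hypothetical helper.

```python
from collections import Counter

# Only the categories named in the paragraph above; the full list has more.
CATEGORIES = [
    "prompt injection",
    "excessive agency",
    "tool misuse",
    "unsafe plan execution",
]


def coverage_gaps(case_categories: list[str]) -> list[str]:
    """Return the categories that have no test case at all."""
    counts = Counter(case_categories)
    return [c for c in CATEGORIES if counts[c] == 0]
```

A suite whose cases are tagged only with prompt injection and tool misuse would report the other two categories as gaps.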

A sign-off artifact. Regulated teams, enterprise sales, or any deployment that needs a paper trail want a report they can attach to a release. Generating that from raw pytest output requires a custom reporter or post-processing.
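The post-processing step might look something like the sketch below, which turns a list of per-case results into a markdown report. `CaseResult` and `render_signoff` are hypothetical, and the layout is an illustration only; it is not the format agent-eval's own report uses.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    category: str
    passed: bool


def render_signoff(results: list[CaseResult], release: str) -> str:
    """Render per-case results as a small markdown report."""
    failed = [r for r in results if not r.passed]
    verdict = "PASS" if not failed else "FAIL"
    lines = [
        f"# Agent sign-off: {release}",
        "",
        f"Verdict: {verdict} ({len(results) - len(failed)}/{len(results)} cases passed)",
        "",
        "| Case | Category | Result |",
        "| --- | --- | --- |",
    ]
    for r in results:
        lines.append(f"| {r.case_id} | {r.category} | {'pass' if r.passed else 'fail'} |")
    return "\n".join(lines) + "\n"
```

You would still need to collect the results from pytest, for example through a plugin hook or by parsing a JUnit XML file, before a function like this has anything to render.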

None of this is impossible to build. It’s just that each piece is a small project, and together they add up to a framework — which is what you’re trying to avoid writing.

What agent-eval provides instead

agent-eval is a deterministic, CI-native pass/fail gate. No LLM-as-judge, no hosted platform, no observability dashboard — just a runner that executes structured test cases against your agent and exits non-zero on failure.

Try it against a model directly to see the case format:

pip install "agent-eval[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

To run against your own agent, point the runner at a case directory and an adapter function that wraps your agent:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The --report signoff.md flag produces a markdown sign-off document — a concrete artifact you can commit or attach to a release PR.

Cases are structured files that specify input, expected tool calls, expected output constraints, and which OWASP Agentic Top 10 category the case maps to. The free 5-case starter pack covers the basic structure. The full 28-case pack provides systematic coverage across the OWASP categories, including prompt injection probes, excessive agency checks, and tool misuse scenarios — cases that are tedious to write from scratch but straightforward to adopt.

The CI integration

The GitHub Action wraps the runner so the gate runs on every push and pull request:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

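A PR that regresses on any case fails the check, and the failure blocks the merge once the check is marked as required in branch protection. The grading is reproducible because there’s no stochastic judge in the loop: the same agent output always gets the same pass/fail verdict.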

Build vs adopt: the honest take

If your agent is simple, a few pytest assertions are fine and you should use them. If you’re heading toward tool-call validation, OWASP coverage, and a repeatable CI gate, you’re building a small framework. The question is whether that framework is your core product or overhead. For most teams it’s overhead — and a ready case pack with a deterministic runner is faster to adopt than to replicate.

The pytest knowledge transfers directly: agent-eval is still just a pass/fail process exit, and you can mix it with existing pytest suites.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt