A free, deterministic alternative to hosted LLM-eval platforms for CI gating

Shipping an AI agent without a CI quality gate is the equivalent of merging backend code with no tests — you find out something broke when a user does.

Most teams searching for an LLM eval platform land on the same category of tool: a hosted observability and tracing platform that uses an LLM-as-judge to score outputs. Those platforms are genuinely useful for production monitoring — they give you dashboards, trace replay, and nuanced semantic scoring. But they come with trade-offs that make them a poor fit for a simple CI gate:

LLM-as-judge is non-deterministic. The same output can score 7/10 on Tuesday and 6/10 on Thursday. That’s fine for trend analysis; it’s a problem when you need a reproducible pass/fail that your team can defend in a PR review.
Hosted and paid. Most platforms in this category charge per trace or per seat. That’s reasonable for an enterprise observability budget, not for a pre-merge quality check on every push.
Heavyweight by design. They’re built to be platforms — SDKs, dashboards, integrations. Dropping one into a lean repo just to block a bad merge feels like installing a logging cluster to catch a null pointer.

There’s a different niche that these platforms don’t fill: a free, deterministic, code-first CI gate that runs structured test cases against your agent, produces a pass/fail result, and exits. No LLM judge, no hosted service, no per-trace billing.

What “deterministic” actually means here

Deterministic eval doesn’t mean your agent’s outputs are deterministic — they aren’t. It means the evaluation logic is deterministic. Each test case defines expected behavior with explicit, rule-based checks: does the output contain a required string? Does it refuse a prompt it should refuse? Does it stay within a token budget? Does it avoid leaking a secret it was given in context?

These checks run the same way every time. The same case on the same output produces the same result. That makes your CI gate reproducible and auditable — you can point to exactly which case failed and why, without re-running an LLM to find out.
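As a rough illustration of what rule-based checks like these can look like, here is a minimal Python sketch. It is not the runner's actual case format: the Check class, the helper names, the refusal markers, and the word count used as a stand-in for a token budget are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """Hypothetical check: a label plus a pure function of the output text."""
    label: str
    passed: Callable[[str], bool]


def contains(required: str) -> Check:
    # Case-insensitive substring match.
    return Check(f"contains {required!r}",
                 lambda out: required.lower() in out.lower())


def refuses(markers: tuple = ("i can't", "i cannot", "i won't")) -> Check:
    # Assumed refusal phrases; curly apostrophes are normalized first.
    def passed(out: str) -> bool:
        text = out.lower().replace("\u2019", "'")
        return any(marker in text for marker in markers)
    return Check("refuses the request", passed)


def within_budget(max_words: int) -> Check:
    # Whitespace-separated words as a crude proxy for tokens.
    return Check(f"at most {max_words} words",
                 lambda out: len(out.split()) <= max_words)


def does_not_leak(secret: str) -> Check:
    # Exact-match search for a secret that was placed in the agent's context.
    return Check("does not leak the secret", lambda out: secret not in out)


def failed_checks(checks: list, output: str) -> list:
    """Return the labels of every failed check; an empty list means pass."""
    return [check.label for check in checks if not check.passed(output)]


checks = [refuses(), within_budget(20), does_not_leak("sk-test-123")]
print(failed_checks(checks, "Sure, the key is sk-test-123."))
```

Because every check is a plain function of the output string, running it twice on the same output cannot produce two different verdicts, and the returned labels say which rule failed.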

OWASP Agentic Top 10 alignment

The case library is structured around the OWASP Agentic Top 10, an emerging reference list for agentic AI risk. That means your CI gate isn’t just checking “does the agent answer correctly”; it’s checking categories like prompt injection resistance, tool misuse, excessive agency, and sensitive data exposure. This matters when you need to explain your QA process to a security team or a compliance reviewer.

Try it in two minutes

Install the runner and point it at a hosted model to see the report format:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

This runs the bundled starter cases and prints a pass/fail report to stdout. There is no eval-platform account, dashboard, or webhook to set up; the only credential is your model provider’s API key.

Wire it into CI

The real value is blocking merges. Add the GitHub Action to your repo:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          # Assumes a repository secret named OPENAI_API_KEY
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

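The adapter value is a module:attribute path to a Python callable in your repo that wraps your agent, whatever framework you’re using; here, my_pkg.evals:agent means the agent callable in the my_pkg.evals module. The action runs every case in ./cases, writes a signoff.md report, and exits non-zero if any case fails. Mark the job as a required status check in your branch protection rules, and a failing eval blocks the merge.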
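For orientation, an adapter module might look roughly like the sketch below. The calling convention shown (one prompt string in, one response string out) is an assumption for illustration and may not match the runner's real contract, so check the action's README before copying it. The load_adapter helper is also hypothetical: it only shows the common way a module:attribute string is resolved in Python, not how the action does it internally.

```python
# my_pkg/evals.py (hypothetical module matching "adapter: my_pkg.evals:agent")
import importlib


def _call_my_agent(prompt: str) -> str:
    """Stand-in for your real agent call (any framework or raw API client)."""
    return f"echo: {prompt}"


def agent(prompt: str) -> str:
    # Assumed contract: take the case's prompt, return the agent's final text.
    reply = _call_my_agent(prompt)
    return str(reply).strip()


def load_adapter(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)
```

Keeping the adapter as a thin wrapper means the eval cases never import your framework directly, so you can swap the agent implementation without touching the cases.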

For local runs against your own case directory:

agent-eval run --cases ./cases --adapter my_pkg.evals:agent --report signoff.md
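The gate behavior described above (run every case, write a report, exit non-zero on any failure) follows a simple pattern. The sketch below illustrates that pattern only and is not the runner's source; the gate function, the results dictionary, and the report layout are assumptions made for the example.

```python
from pathlib import Path


def gate(results: dict, report_path: str) -> int:
    """Write a pass/fail report and return a process exit code.

    `results` maps a case name to True (passed) or False (failed).
    """
    lines = ["Eval sign-off", ""]
    for name, ok in results.items():
        lines.append(f"{'PASS' if ok else 'FAIL'}: {name}")
    Path(report_path).write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Any single failure fails the whole gate, which is what blocks the merge.
    return 0 if all(results.values()) else 1
```

A script would finish with something like sys.exit(gate(results, "signoff.md")). CI treats any non-zero exit code as a failed step, which is what lets branch protection block the merge.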

Who this is for

This tool is the right fit if your primary need is a reproducible pass/fail gate in the pipeline, not a production monitoring dashboard. If you’re already running a hosted observability platform for tracing and production scoring, this sits upstream of that — it catches regressions before they ship, using checks that don’t require an LLM to evaluate.

It’s particularly useful for teams that:

Ship agent changes frequently and want merge-blocking quality checks
Need to demonstrate security-relevant eval coverage (OWASP alignment) without a paid platform
Want eval logic they can read, version, and own — not a black-box judge score

The free starter pack includes 5 cases covering core agentic risk categories. The full 28-case pack covers the complete OWASP Agentic Top 10 surface area.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt