How to gate an AI agent in CI with a free GitHub Action (OWASP-Agentic-aligned)

Tool-using agents regress silently: a model swap or prompt tweak can quietly break tool selection, leak context, or open a prompt-injection hole, and mock-based unit tests usually won’t catch it. What you need is a CI gate that fails the build when a high-severity behavioral failure is detected, runs the same way every time, and doesn’t need an LLM-as-judge to produce a verdict.

Why agent regressions are invisible to standard CI

LLM agents don’t throw exceptions when they misbehave. A changed system prompt might cause the agent to call the wrong tool, skip a required authorization check, or become susceptible to an injected instruction, all while returning HTTP 200 and passing every mock-based unit test. The practical way to catch this class of regression is to run structured behavioral cases against the real agent and assert on the output deterministically.

Hosted LLM-as-judge platforms can do this, but they introduce a new source of non-determinism (the judge model itself can disagree run-to-run), cost money per evaluation, and are hard to treat as a hard pass/fail gate in a pull-request workflow. A deterministic, rule-based evaluator sidesteps all of that: the same agent output always produces the same verdict, so any change in the result reflects a change in the agent’s behavior rather than in the evaluator. That makes the CI gate reproducible and defensible in a review.
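As a rough illustration of what a rule-based verdict can look like, here is a minimal sketch. It assumes an agent run is recorded as its final text plus the names of the tools it invoked. The `AgentRun` record and the check functions are hypothetical names for illustration only, not the API of any particular runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical record of one agent run (an assumption, not a real API)."""
    final_text: str
    tool_calls: tuple[str, ...] = ()


def forbidden_tools_used(run: AgentRun, forbidden: frozenset[str]) -> list[str]:
    """Return the forbidden tool names the agent invoked, sorted for stable output."""
    return sorted(forbidden.intersection(run.tool_calls))


def leaks_marker(run: AgentRun, marker: str) -> bool:
    """True if a planted canary string appears in the agent's final text."""
    return marker in run.final_text


def verdict(run: AgentRun, forbidden: frozenset[str], marker: str) -> bool:
    """Pass only if no forbidden tool was called and the canary did not leak."""
    return not forbidden_tools_used(run, forbidden) and not leaks_marker(run, marker)
```

Both checks are pure functions of the recorded run, so re-evaluating the same run yields the same verdict. No judge model is involved.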

OWASP Agentic Top 10 alignment

The OWASP Agentic Top 10 names the highest-risk failure modes for autonomous agents: prompt injection, excessive agency, tool misuse, context leakage, and others. A useful eval pack maps cases directly to these categories so you know which risk class a failing case represents — not just that something broke, but what kind of thing broke and how severe it is. High-severity failures (prompt injection, unauthorized tool invocation) should stop the merge; low-severity findings can be advisory.
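The split between blocking and advisory findings can be sketched as follows. This is an assumption-laden illustration: `Severity`, `CaseResult`, and the category strings are hypothetical names, and a real eval pack may label cases differently.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical result record for one eval case."""
    case_id: str
    category: str       # risk category label, e.g. "prompt-injection"
    severity: Severity
    passed: bool


def blocking_failures(results: list[CaseResult]) -> list[CaseResult]:
    """Failed high-severity cases: these should stop the merge."""
    return [r for r in results if not r.passed and r.severity is Severity.HIGH]


def advisory_failures(results: list[CaseResult]) -> list[CaseResult]:
    """Failed low-severity cases: reported, but not blocking."""
    return [r for r in results if not r.passed and r.severity is Severity.LOW]


def exit_code(results: list[CaseResult]) -> int:
    """Non-zero only when at least one high-severity case failed."""
    return 1 if blocking_failures(results) else 0
```

Keeping the category on each result means a failure report can say which risk class regressed, not only that a case failed.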

Zero-config smoke test first

Before wiring anything into CI, verify the runner works against your model:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

This runs a small built-in set of cases and prints a pass/fail summary. It’s the fastest way to confirm your environment is wired correctly before you write a single adapter.

Running your full case suite locally

Once you’ve written or downloaded your eval cases, point the runner at your agent adapter:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The --report flag writes a Markdown signoff file you can attach to a release or PR review. The exit code is non-zero on any high-severity failure, which is exactly the behavior a CI system needs.
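The my_module:agent argument follows the common Python module:attribute convention. How the runner resolves it internally is not stated here. The sketch below shows one conventional way such a reference can be resolved with the standard library, using a hypothetical helper named `resolve_target`.

```python
import importlib
from typing import Any


def resolve_target(spec: str) -> Any:
    """Resolve a 'module:attribute' string to the object it names.

    Hypothetical helper for illustration; the attribute part may be
    dotted (e.g. 'pkg.mod:obj.attr').
    """
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj: Any = importlib.import_module(module_name)
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj
```

For example, `resolve_target("json:dumps")` returns the standard library’s `json.dumps` function. A practical consequence of this convention is that the adapter module must be importable from the environment the runner executes in, which is why the package is installed before the eval step in CI.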

The GitHub Action

Drop this workflow file into your repo and the gate runs on every push and pull request:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

A few things worth noting about this setup:

It fails the build on high-severity findings. The action exits non-zero when any case tagged high-severity fails, which blocks the merge if the check is marked as required in your branch protection rules. Low-severity findings are reported but don’t block.

It’s deterministic. Verdicts are computed by rule-based assertions against agent outputs, not by asking another LLM to judge. The same agent behavior produces the same result every run, so you can trust the gate and diff results between commits meaningfully.

It’s lightweight. This is a pass/fail gate, not an observability platform. There’s no dashboard to maintain, no sampling configuration, no per-token billing for the evaluation layer itself (only your agent’s own API calls).

Cases are version-controlled alongside your code. Your ./cases directory lives in the repo, so case changes go through the same review process as code changes. Adding a new OWASP-Agentic risk category to your eval suite is a PR, not a settings change in an external tool.
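Because verdicts are deterministic, comparing results between two commits is a plain dictionary diff. The sketch below assumes each run can be reduced to a mapping from case ID to pass/fail. The function names and the case IDs in the usage example are hypothetical.

```python
def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case IDs that passed in `before` but fail in `after`."""
    return sorted(
        case_id
        for case_id, passed in before.items()
        if passed and after.get(case_id) is False
    )


def fixes(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case IDs that failed in `before` but pass in `after`."""
    return sorted(
        case_id
        for case_id, passed in before.items()
        if not passed and after.get(case_id) is True
    )


if __name__ == "__main__":
    base = {"inj-01": True, "leak-02": False, "tool-03": True}
    head = {"inj-01": False, "leak-02": True, "tool-03": True}
    print("regressed:", regressions(base, head))
    print("fixed:", fixes(base, head))
```

Cases that exist in only one of the two runs are ignored by both functions, so adding a new case in a PR does not register as a regression.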

Scaling up the case coverage

The free 5-case starter pack covers the most common regression patterns and is enough to get the gate running. When you need broader OWASP Agentic Top 10 coverage — prompt injection variants, tool-call boundary tests, context-leakage probes — the full 28-case pack maps each case to a specific risk category and severity level, giving you a structured audit trail for compliance or security review.

The pattern is the same regardless of scale: cases in version control, deterministic verdicts, build fails on high-severity, signoff report generated automatically.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt