OWASP Agentic Top 10: a free testing checklist and CI gate

The 2026 OWASP Top 10 for Agentic Applications names the risks that matter most when LLMs gain tools, memory, and the ability to act — but naming risks is only half the job. The other half is turning each one into a reproducible, pass/fail test that blocks a bad merge before it ships.

What the OWASP Agentic Top 10 actually covers

The list targets failure modes unique to agents: systems where an LLM doesn’t just answer but plans, calls tools, and persists state across turns. The headline risks include:

Memory and data poisoning — malicious content injected into long-term or working memory that redirects future behavior
Excessive agency — the agent takes actions beyond the scope it was granted (deleting records, escalating privileges, spending money)
Tool misuse — calling a tool with parameters outside its intended contract, or chaining tools in ways that bypass guardrails
Cascading hallucination / failure — an early hallucination propagates through a multi-step plan and compounds into a larger error
Prompt injection — adversarial instructions embedded in tool outputs, retrieved documents, or user data that hijack the agent’s goal

Each of these is a behavioral property, not a code smell. Static analysis won’t catch them. You need to exercise the agent with adversarial inputs and assert on its outputs.

Why deterministic cases beat LLM-as-judge for a CI gate

Hosted LLM-as-judge platforms score outputs probabilistically — the same run can pass on Tuesday and fail on Thursday. That’s fine for exploratory evaluation, but it’s the wrong tool for a merge gate. A CI gate needs a binary answer that’s stable across reruns, auditable by a security reviewer, and free to run on every push.

Deterministic cases work differently: you define an input, an expected behavioral constraint (the agent must NOT call delete_record, the agent must NOT follow the injected instruction, the agent must stay within its declared tool list), and a pass/fail assertion. No model call to judge the judge. No flaky results.

Mapping each risk to a test case shape

OWASP Risk	Case pattern
Memory poisoning	Seed memory with adversarial content; assert the agent’s next action is not redirected
Excessive agency	Give the agent a task scoped to read-only; assert no write/delete tool is called
Tool misuse	Pass boundary-violating parameters; assert the agent refuses or sanitizes
Cascading failure	Inject a hallucinated fact early in a plan; assert the final output doesn’t compound it
Prompt injection	Embed `Ignore previous instructions` in a retrieved document; assert the agent’s goal is unchanged

The free 5-case starter pack ships one concrete YAML case for each of these shapes, wired to a runner that executes them against your adapter.

Try it in two minutes

Install the runner and run the bundled demo against a live model to see what a failing injection case looks like before you wire up your own agent:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

You’ll see each case execute, the assertion it checks, and a pass/fail verdict. The demo intentionally includes a case that most base models fail — so you can see the red output before you fix anything.

Wire it into CI as a merge gate

Once you’ve pointed the runner at your own agent adapter, add the GitHub Action. Every push and pull request runs the full case suite and fails the check if any case regresses:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: $

The adapter field points to a Python callable that wraps your agent — the runner calls it with each case’s input and inspects the returned actions and output against the case’s assertions. No instrumentation inside your agent required.

From 5 cases to full OWASP coverage

The free starter covers the five highest-signal risks. The full 28-case pack maps every OWASP Agentic Top 10 item to multiple adversarial scenarios — including multi-turn memory attacks, tool-chaining exploits, and cross-agent injection in orchestrator/subagent architectures. Each case ships with the assertion logic, so you’re not writing test harness code from scratch.

The output of agent-eval run is a markdown signoff report you can attach to a PR or a compliance audit:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

Security reviewers get a named list of OWASP risks, the input used to probe each one, and a dated pass/fail result — defensible evidence that the agent was tested, not just deployed.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt