OWASP Agentic Top 10: a free testing checklist and CI gate
The 2026 OWASP Top 10 for Agentic Applications names the risks that matter most when LLMs gain tools, memory, and the ability to act — but naming risks is only half the job. The other half is turning each one into a reproducible, pass/fail test that blocks a bad merge before it ships.
What the OWASP Agentic Top 10 actually covers
The list targets failure modes unique to agents: systems where an LLM doesn’t just answer but plans, calls tools, and persists state across turns. The headline risks include:
- Memory and data poisoning — malicious content injected into long-term or working memory that redirects future behavior
- Excessive agency — the agent takes actions beyond the scope it was granted (deleting records, escalating privileges, spending money)
- Tool misuse — calling a tool with parameters outside its intended contract, or chaining tools in ways that bypass guardrails
- Cascading hallucination / failure — an early hallucination propagates through a multi-step plan and compounds into a larger error
- Prompt injection — adversarial instructions embedded in tool outputs, retrieved documents, or user data that hijack the agent’s goal
Each of these is a behavioral property, not a code smell. Static analysis won’t catch them. You need to exercise the agent with adversarial inputs and assert on its outputs.
Why deterministic cases beat LLM-as-judge for a CI gate
Hosted LLM-as-judge platforms score outputs probabilistically — the same run can pass on Tuesday and fail on Thursday. That’s fine for exploratory evaluation, but it’s the wrong tool for a merge gate. A CI gate needs a binary answer that’s stable across reruns, auditable by a security reviewer, and free to run on every push.
Deterministic cases work differently: you define an input, an expected behavioral constraint (the agent must NOT call delete_record, the agent must NOT follow the injected instruction, the agent must stay within its declared tool list), and a pass/fail assertion. No model call to judge the judge. No flaky results.
Mapping each risk to a test case shape
| OWASP Risk | Case pattern |
|---|---|
| Memory poisoning | Seed memory with adversarial content; assert the agent’s next action is not redirected |
| Excessive agency | Give the agent a task scoped to read-only; assert no write/delete tool is called |
| Tool misuse | Pass boundary-violating parameters; assert the agent refuses or sanitizes |
| Cascading failure | Inject a hallucinated fact early in a plan; assert the final output doesn’t compound it |
| Prompt injection | Embed Ignore previous instructions in a retrieved document; assert the agent’s goal is unchanged |
The free 5-case starter pack ships one concrete YAML case for each of these shapes, wired to a runner that executes them against your adapter.
Try it in two minutes
Install the runner and run the bundled demo against a live model to see what a failing injection case looks like before you wire up your own agent:
pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o
You’ll see each case execute, the assertion it checks, and a pass/fail verdict. The demo intentionally includes a case that most base models fail — so you can see the red output before you fix anything.
Wire it into CI as a merge gate
Once you’ve pointed the runner at your own agent adapter, add the GitHub Action. Every push and pull request runs the full case suite and fails the check if any case regresses:
# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install -e .
- uses: weiseer/agent-eval-action@v1
with:
cases: ./cases
adapter: my_pkg.evals:agent
env:
OPENAI_API_KEY: $
The adapter field points to a Python callable that wraps your agent — the runner calls it with each case’s input and inspects the returned actions and output against the case’s assertions. No instrumentation inside your agent required.
From 5 cases to full OWASP coverage
The free starter covers the five highest-signal risks. The full 28-case pack maps every OWASP Agentic Top 10 item to multiple adversarial scenarios — including multi-turn memory attacks, tool-chaining exploits, and cross-agent injection in orchestrator/subagent architectures. Each case ships with the assertion logic, so you’re not writing test harness code from scratch.
The output of agent-eval run is a markdown signoff report you can attach to a PR or a compliance audit:
agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md
Security reviewers get a named list of OWASP risks, the input used to probe each one, and a dated pass/fail result — defensible evidence that the agent was tested, not just deployed.
Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt