Testing excessive agency: least-privilege tool use in AI agents

Agents that silently escalate to privileged operations when a read-only path was available represent one of the most underappreciated failure modes in production LLM systems — and it’s exactly what OWASP Agentic Top 10 item Excessive Agency targets.

What excessive agency looks like in practice

A well-scoped agent asked to “summarize last week’s orders” should call get_orders(readonly=True). An excessively agentic one might call update_order_status(), delete_record(), or send_email() — not because it was instructed to, but because those tools were available and the model decided they were helpful. The agent took an unrequested action with real side effects.

This isn’t a hallucination problem. The model is often correct about what the tool does. The failure is a least-privilege violation: the agent used a higher-privilege tool when a lower-privilege path existed, or invoked a write operation when the task was purely read.

The risk compounds with tool-calling models because the action happens silently, outside the text response. A human reviewer reading the chat transcript may never see it.

Why this is hard to catch in normal testing

Standard eval frameworks grade output quality — did the answer look right? They don’t instrument which tools were called and assert that certain tools were not called. That negative assertion (“the agent must NOT have called delete_record”) is structurally different from scoring a response, and most LLM-as-judge setups aren’t designed for it.

Excessive agency checks need to be:

Deterministic — pass/fail, not a 0–1 score that drifts with judge model versions
Tool-call aware — inspecting the actual function calls made, not the text output
Reproducible in CI — so a prompt change that re-enables a privileged tool gets caught before merge

Structuring a least-privilege test case

Each case defines the task, the available tool set, and an explicit assertion that a named privileged tool was never invoked. A minimal case structure looks like:

cases/
  excessive_agency_order_summary.yaml
  excessive_agency_delete_on_read.yaml
  excessive_agency_email_on_lookup.yaml

Each file specifies the user prompt, the mock tool registry (both the safe read tool and the privileged write tool), and the assertion: tool_not_called: update_order_status. The runner executes the agent against the mock tools, records every tool invocation, and fails the case if the prohibited tool appears in the call log — regardless of whether the final answer was correct.

This is the key insight: an agent can produce a correct answer AND commit an excessive agency violation in the same turn. Grading only the answer misses the violation entirely.

Running the eval gate

Once your cases are in place and your adapter wraps your agent under test:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The runner replays each case, captures the tool call trace, evaluates every tool_not_called and tool_called assertion deterministically, and writes signoff.md with a per-case pass/fail table. No LLM judge is involved in scoring — the verdict is a set comparison between observed calls and asserted constraints.

This means the result is reproducible: run it twice on the same commit, get the same result. That property matters for compliance sign-off and for PR gates where a flaky eval is worse than no eval.

Wiring it into CI

The real value of least-privilege testing is catching regressions. A prompt tweak, a new tool added to the registry, or a model version bump can silently re-enable excessive agency. Catching that at merge time rather than in production is the entire point:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: $

The action fails the PR if any excessive agency case fails. The signoff.md artifact gives reviewers an auditable record of which tool constraints were verified — useful when your security team asks how you tested OWASP Agentic Top 10 compliance.

What to cover in your case library

Start with the highest-impact tool pairs in your agent: every place a read tool and a write tool exist for the same resource. Common patterns worth testing explicitly:

Read vs. mutate — get_user vs. update_user
Lookup vs. delete — find_record vs. delete_record
Notify vs. silent — tasks that should complete without sending external messages
Scoped vs. bulk — single-record operations vs. batch operations the agent shouldn’t escalate to

The OWASP Agentic Top 10 pack includes 28 cases covering these patterns across realistic agent scenarios, including multi-step tasks where the excessive action occurs mid-chain rather than on the first tool call.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt