How to regression-test an AI agent when you change the model or prompt

Swapping gpt-4o for a newer model snapshot or tweaking a single line in your system prompt can silently change which tools your agent calls, whether it refuses edge-case inputs, and how it sequences multi-step tasks. No error is thrown, and there is often no obvious signal until a user reports broken behavior in production.

That’s the core problem with AI agent regression testing: the failure mode is behavioral drift, not a stack trace. A deterministic, case-based test suite run on every push is one of the most dependable ways to catch it before it ships.

Why “vibe-checking” a new model isn’t enough

When you upgrade a model or edit a prompt, you’re changing the distribution of outputs across every possible input. A few manual spot-checks will miss the long tail. Common regressions include:

Tool selection changes — the agent starts calling a search tool where it previously answered from context, or vice versa.
Refusal threshold shifts — a prompt tweak that tightens safety language causes the agent to refuse legitimate requests it previously handled.
Step-order drift — a multi-step workflow reorders operations in a way that’s semantically plausible but functionally wrong (e.g., writing a file before validating its contents).
Scope creep — a more capable model snapshot starts taking actions beyond what the task requires, a direct hit on OWASP Agentic Top 10 risk A4 (Excessive Agency).
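
As a rough illustration of how these regressions show up in practice, the sketch below compares recorded tool-call sequences from a baseline run and a candidate run over the same inputs. The function name, the trace format (input id mapped to an ordered list of tool names), and the tool names are all hypothetical and are not part of any particular library.

```python
def diff_tool_calls(
    baseline: dict[str, list[str]],
    candidate: dict[str, list[str]],
) -> dict[str, dict[str, object]]:
    """Report inputs whose tool-call sequence changed between two runs."""
    report: dict[str, dict[str, object]] = {}
    for input_id, before in baseline.items():
        after = candidate.get(input_id, [])
        if before == after:
            continue  # identical sequence: no drift for this input
        report[input_id] = {
            "added": [t for t in after if t not in before],
            "removed": [t for t in before if t not in after],
            # Same tools in a different sequence points to step-order drift.
            "reordered": sorted(before) == sorted(after),
        }
    return report


# Hypothetical traces recorded before and after a model or prompt change.
baseline = {
    "q1": ["validate_file", "write_file"],
    "q2": [],
    "q3": ["lookup_order"],
}
candidate = {
    "q1": ["write_file", "validate_file"],  # step-order drift
    "q2": ["web_search"],                   # tool selection change
    "q3": ["lookup_order"],                 # unchanged
}

drift = diff_tool_calls(baseline, candidate)
for input_id, change in drift.items():
    print(input_id, change)
```

A diff like this only tells you that behavior changed, not whether the change is acceptable; that judgment is what explicit cases encode.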

LLM-as-judge approaches can catch some of this, but they introduce their own non-determinism: the judge model can drift too, and “did the judge think this was correct?” is not a reproducible pass/fail signal you can defend in a code review or compliance audit.

The deterministic alternative: structured case files

A better pattern is to define expected behaviors as explicit assertions (expected tool calls, forbidden tool calls, required refusals, output schema checks) and evaluate them with deterministic logic. No judge model, no embeddings, no fuzzy similarity score. For a given agent output, each case either passes or fails, the same way every time, on every machine.
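
To make the idea concrete, here is a minimal sketch of what such a case and its evaluator could look like in plain Python. The `Case` and `AgentResult` shapes, their field names, and the tool names are assumptions made for illustration; they are not the case-file format of any specific runner.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One expected behavior, written as explicit assertions (hypothetical schema)."""
    name: str
    prompt: str
    expected_tools: list[str] = field(default_factory=list)   # must all be called
    forbidden_tools: list[str] = field(default_factory=list)  # must never be called
    must_refuse: bool = False                                  # agent must decline


@dataclass
class AgentResult:
    """What the agent actually did for one prompt (hypothetical shape)."""
    tool_calls: list[str]
    refused: bool


def evaluate(case: Case, result: AgentResult) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    called = set(result.tool_calls)
    for tool in case.expected_tools:
        if tool not in called:
            failures.append(f"{case.name}: expected tool '{tool}' was not called")
    for tool in case.forbidden_tools:
        if tool in called:
            failures.append(f"{case.name}: forbidden tool '{tool}' was called")
    if case.must_refuse and not result.refused:
        failures.append(f"{case.name}: agent should have refused but did not")
    return failures


case = Case(
    name="refund-lookup",
    prompt="What is the status of order 1234?",
    expected_tools=["lookup_order"],
    forbidden_tools=["issue_refund"],
)
ok = AgentResult(tool_calls=["lookup_order"], refused=False)
drifted = AgentResult(tool_calls=["lookup_order", "issue_refund"], refused=False)

print(evaluate(case, ok))       # no failures
print(evaluate(case, drifted))  # reports the forbidden tool call
```

The evaluator uses only set membership and boolean checks, so the same recorded result always yields the same verdict.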

This is what makes the suite usable as a CI gate. A flaky test that passes only some of the time is arguably worse than no test, because people learn to ignore it; a deterministic pass/fail can block a merge with confidence.
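
The gating step itself can be very small. The sketch below assumes a hypothetical mapping from case name to failure messages and turns it into a process exit code, which is the signal a CI system uses to mark a job as passed or failed.

```python
import sys


def exit_code(failures_by_case: dict[str, list[str]]) -> int:
    """Return 0 if every case passed, 1 if any case has failures."""
    failed = {name: msgs for name, msgs in failures_by_case.items() if msgs}
    for name, msgs in failed.items():
        for msg in msgs:
            print(f"FAIL {name}: {msg}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical results: one passing case and one failing case.
results = {
    "refund-lookup": [],
    "injection-refusal": ["agent should have refused but did not"],
}
code = exit_code(results)
# A real gate script would finish with: sys.exit(code)
```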

Try it in two minutes

Install the runner and point it at your model to see the format against a built-in starter case:

pip install "agent-eval-runner[openai]"
export OPENAI_API_KEY=sk-...
agent-eval try --model openai:gpt-4o

This runs a single bundled case against your live model so you can see what a case file looks like and what a failure report looks like before you write anything custom.

Wire it into CI

Once you have a ./cases directory with your own case files and an adapter module that wraps your agent, the full run command is:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The --report signoff.md flag writes a human-readable artifact you can attach to a PR or a release sign-off checklist — useful when you need to show that behavioral regression testing was performed before a model upgrade went to production.

For GitHub Actions, drop this workflow file into your repo:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # assumes a repository secret with this name

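Now every push and every pull request, including the one where someone bumps the model version in a config file, runs the full case suite. A regression fails the check, and if the check is marked as required in your branch protection rules, it blocks the merge. No regression means a green check and a merge you can make with confidence.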

What to put in your cases

Start with the behaviors that have broken before or that you’d be most embarrassed to ship broken:

Core happy-path tool calls — the tools the agent should always invoke for canonical inputs.
Refusal cases — inputs the agent must decline (out-of-scope requests, prompt injection attempts). These map directly to OWASP Agentic Top 10 risks around prompt injection (A1) and unsafe tool execution (A2).
Boundary cases for scope — inputs that are adjacent to the task but should not trigger additional tool calls or side effects.
Multi-step ordering — for agentic workflows where step sequence matters, assert the expected call order explicitly.
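
The ordering and scope checks in particular are easy to express deterministically. The following sketch shows one possible way to do it; the helper names and tool names are hypothetical. It treats the expected order as a subsequence of the actual calls, so unrelated calls in between do not cause a failure, and it flags any call outside an explicit allow-list.

```python
def calls_in_order(actual: list[str], expected_order: list[str]) -> bool:
    """True if expected_order appears within actual as an ordered subsequence."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator up to the first match, so each
    # later step can only match at a position after the previous one.
    return all(step in remaining for step in expected_order)


def unexpected_calls(actual: list[str], allowed: set[str]) -> list[str]:
    """Return the calls that fall outside the allowed set (scope check)."""
    return [tool for tool in actual if tool not in allowed]


good_trace = ["validate_file", "log_event", "write_file"]
bad_trace = ["write_file", "validate_file"]
expected = ["validate_file", "write_file"]

print(calls_in_order(good_trace, expected))  # validation happens before the write
print(calls_in_order(bad_trace, expected))   # write happens first, so this fails
print(unexpected_calls(good_trace, {"validate_file", "write_file"}))
```

Whether extra calls between the expected steps should be tolerated is a design choice; a stricter suite could compare the full sequence for equality instead.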

The free 5-case starter pack covers the most common patterns. The full 28-case pack, aligned with the OWASP Agentic Top 10, covers all ten risk categories with ready-to-run cases you can adapt to your agent’s specific tools and domain.

The key discipline is: every time a regression reaches production, write a case for it. Over time, your case suite becomes a precise behavioral specification of your agent — one that runs in CI and never forgets.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt