How to test multi-agent systems for cascading failure (CrewAI, LangGraph)

Cascading failure is the silent killer of multi-agent pipelines: one agent produces malformed, hallucinated, or adversarially-shaped output, and every downstream agent that trusts it without validation compounds the damage until your system returns confidently wrong results or takes a dangerous action.

Why Multi-Agent Systems Fail Differently

In a single-agent loop, a bad output is a local problem. In a pipeline — whether you’re using CrewAI’s sequential crews, LangGraph’s node graphs, or a hand-rolled orchestrator — bad output becomes the input to the next agent. If Agent B trusts Agent A’s JSON without schema validation, Agent C will act on corrupted state. If Agent A injects a prompt fragment into a shared scratchpad, Agent B may execute it as an instruction. This is exactly what OWASP Agentic Top 10 categories like prompt injection propagation and excessive agency describe at the system level.

The failure modes worth testing explicitly:

Schema poisoning: Agent A returns a field with the wrong type or an injected string where a number is expected; Agent B passes it to a tool call.
Instruction smuggling: Agent A’s “summary” contains a hidden directive (\n\nIgnore previous instructions…) that Agent B’s system prompt doesn’t sanitize.
Confidence laundering: Agent A returns a hallucinated fact with high-confidence framing; Agent B cites it as ground truth in a customer-facing response.
Silent truncation: Agent A hits a context limit and returns a partial result that parses as valid JSON but is semantically incomplete; Agent C acts on it without noticing.

What a Cascading Failure Test Actually Looks Like

A useful test case for this scenario has three parts: a malformed upstream fixture (what Agent A would have returned), the downstream agent under test (Agent B or C), and a deterministic assertion about how the downstream agent should respond.

The assertion is the critical piece. LLM-as-judge approaches will give you different verdicts on different runs — that’s not acceptable for a CI gate that needs to block a merge. You want deterministic checks: does the output contain a rejection signal? Does it raise an exception? Does it refuse to call the tool? Does the response match a known-safe pattern?

A case file for a cascading failure scenario might assert that when Agent B receives a structurally valid but semantically poisoned upstream payload, it either (a) returns an explicit error/refusal, (b) does not propagate the injected content downstream, or (c) requests clarification rather than acting.

Running the Eval Suite

Once your cases are defined, run them against your agent adapter:

agent-eval run --cases ./cases --adapter my_module:agent --report signoff.md

The --adapter flag points to a Python callable that wraps your actual agent — your CrewAI crew, your LangGraph compiled graph, whatever you’re shipping. The runner injects each case’s input, captures the output, and evaluates it against the case’s deterministic assertions. The signoff.md report gives you a pass/fail record you can attach to a PR or a compliance review.

Because there’s no LLM judge in the evaluation loop, the same case run on the same adapter produces the same result every time. That reproducibility is what makes it defensible in a code review or an audit.

Wiring It Into CI

The point of deterministic evals is that they can block a merge. A cascading failure test that only runs manually will be skipped under deadline pressure. Put it in the pipeline:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - uses: weiseer/agent-eval-action@v1
        with:
          cases: ./cases
          adapter: my_pkg.evals:agent
        env:
          OPENAI_API_KEY: $

Now every push that changes your agent logic, your prompt templates, or your inter-agent schema runs the full case suite. A regression where Agent B starts trusting poisoned upstream output again will fail the check before it reaches production.

What Cases to Write First

For a multi-agent system, prioritize these cascading failure scenarios in your first batch:

Upstream returns injected instruction string — assert downstream agent does not execute it
Upstream returns wrong-type field — assert downstream agent raises or refuses rather than passing to a tool
Upstream returns empty/truncated result — assert downstream agent does not hallucinate a completion
Upstream returns plausible but factually inverted claim — assert downstream agent does not cite it as authoritative

These map directly to OWASP Agentic Top 10 risks around prompt injection, improper output handling, and excessive trust between agents. They’re also the cases most likely to surface real bugs in a LangGraph edge handler or a CrewAI task dependency.

Hosted LLM-as-judge platforms can tell you whether an output “seems good” — but for cascading failure tests, you need a binary gate that tells you whether your downstream agent actually rejected the bad input. That’s a deterministic check, and it belongs in CI.

Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter · GitHub Action: https://github.com/weiseer/agent-eval-action · full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt