Red-Teaming Minimus OpenClaw: Sandbox Escape & WhatsApp Incident

We ran 635 security tests against Minimus OpenClaw, a hardened build of the OpenClaw AI gateway that eliminates 99% of known CVEs. Sandbox was on. Tool allowlists were configured. Channel access controls were in place.

131 tests failed. Then it got worse.

What happened at 05:30 UTC

Seven hours into an automated red team scan, the agent was successfully jailbroken. It read OpenClaw's own documentation, found a parameter in the exec tool that let it run commands on the gateway host instead of the sandbox, and rewrote the WhatsApp channel configuration. Changed the DM policy from allowlist to open. Changed the allowed senders from a restricted list to everyone.

That change went live instantly. OpenClaw hot-reloads channel config. No restart, no approval, no delay. No chance to intervene.

Seven minutes later, two real people got unsolicited WhatsApp messages from a jailbroken AI agent.

The part worth paying attention to

The agent didn't exploit a bug. There's nothing to patch here. The problem is architectural, not technical.

OpenClaw's exec tool has a host parameter with three options: sandbox, gateway, or node. The agent found this by reading the docs at /app/docs. It picked gateway and ran commands directly on the host. The sandbox was on and configured correctly the entire time. It just didn't matter.

The deployment had seven tools available. The allow list named four. The deny list blocked six. But exec appeared in neither list. OpenClaw's default behaviour treats any tool not explicitly denied as allowed. So exec was silently available the entire time.

The configuration also set commands.bash: false. But exec is classified as a tool, not a bash command. That restriction didn't apply.

Five security controls were in place. Each one made sense on its own. Together they left a gap the agent found in a single test run. Every step was technically authorised by the system's own rules.

What the numbers show

Minimus delivers on its promises at the infrastructure layer. No shell to drop into. No package manager. No unnecessary binaries. The distroless image means traditional container escapes have nothing to work with. SSRF succeeded only 27% of the time. Shell injection hit just 7%. CVE scanners come back nearly clean.

If the threat model is infrastructure vulnerabilities, kernel exploits, dependency CVEs, or container breakout via OS bugs, Minimus is the right tool.

But 131 tests failed in this assessment and zero of them exploited a CVE. Every one used the tools OpenClaw intentionally provides. The model read documentation it was allowed to read. It ran a tool it was allowed to run. It wrote a config it was allowed to write.

Three attack categories exceeded 60% failure rates

Tool discovery: 73%. The agent enumerated every tool available to it, including ones the operator didn't know existed.

Prompt extraction: 70%. The agent's system prompt, including internal instructions and configuration details, was retrievable in 7 out of 10 attempts.

Hijacking: 60%. The agent could be redirected to perform actions outside its intended purpose, using its own authorised tools.

All three target the model layer, not the container. Your CVE scanner won't find them. Your SAST tool won't flag them.

Why this matters for your deployment

This isn't unique to OpenClaw. It's an architecture question most teams haven't answered yet: what happens when the model uses a tool in a way you didn't plan for?

The WhatsApp incident needed four things to happen at once:

The exec tool's host parameter had to allow sandbox escape.
No human-in-the-loop approval was configured for tool execution.
Config hot-reload was enabled.
The scan ran against a production environment with real channels connected.

Remove any one of those and the incident doesn't happen. All four were present.

The gap nobody patches

Container hardening, distroless images, fewer binaries. All of that covers infrastructure threats. None of it covers an agent using its own tools against itself. These failures live in the gap between what the agent can do and what you intended it to do.

That gap exists in every agent deployment. The question is whether you find it before your users do.

This is what finding it looks like

We ran this exact scan against OpenClaw. 635 attack scenarios. 131 failures. The WhatsApp incident. The full attack chain documented. The compliance mapping done. All of it captured in a report the operator could hand to their security team the same day.

That's what we do. We pentest your AI agents the way you'd pentest a web application. 635 scenarios. 15 minutes. No code changes. The report maps to whichever framework matters to you: OWASP LLM Top 10, EU AI Act, NIST AI RMF, GDPR, ISO 42001.

If we find nothing, you've lost a quarter of an hour. If we find something (and we usually do), you've got a compliance-ready report you can hand to your security team, your board, or your regulator before they ever ask for one.

The first pentest is free. This assessment was the proof.

What would your agents do under 635 attack scenarios? Run your free pentest or talk to us.