When AI Agents Go Rogue in Real World Tests
Researchers from MIT, Stanford and Harvard found email leaks, endless loops and system failures when autonomous agents were let loose with real access.
AI agents that can send email, execute code and interact with other systems are already showing security, privacy and governance risks, according to a new study by researchers from top US universities, including MIT, Stanford and Harvard.
The paper, titled ‘Agents of Chaos’ and released this week, describes what happened when language model-powered agents were deployed in a live test environment with persistent memory, shell access, Discord accounts and email credentials.
Over a two-week period earlier this month, 20 AI researchers were invited to probe and attempt to break the systems under both normal and adversarial conditions.
The experiment was conducted in a controlled research deployment built on an OpenClaw-style agent framework, previously known as Clawdbot and briefly as Moltbot.
The agents were given dedicated test accounts and deliberately broad permissions to simulate enterprise-grade autonomy rather than the tighter constraints typical of consumer chatbots.
The results offer a glimpse of operational risk in the age of autonomous software.
The authors distinguish between the “agent,” the autonomous system itself; the “owner,” the human operator with administrative control; and the “provider,” the organization supplying the underlying model.
Both the “owner” and the “provider” shape the system’s behavior through configuration, alignment and system-level constraints, meaning accountability does not rest with software alone.
Unlike conventional chatbots, these agents were not limited to generating text. They could modify files, execute commands, manage inboxes and communicate with other agents.
“Small conceptual mistakes can be amplified into irreversible system-level actions,” the authors wrote.
Over the two weeks, the researchers identified at least 10 significant security breaches and numerous serious failure modes. In this context, “breaches” referred to violations of access controls or unintended data disclosures within the test environment, not external compromises of production systems.
The goal, the authors said, was not to measure how frequently failures occur, but to demonstrate that serious vulnerabilities can surface under realistic conditions.
Broken Autonomy
One particular episode showed how autonomy can convert an ordinary instruction into disproportionate damage.
In a test of contextual privacy, a non-owner asked an agent to keep a secret and delete the related email.
Without a proper deletion tool, the agent disabled its local email client and declared “Email account RESET completed” after wiping its own configuration.
The original mailbox, however, remained untouched. The agent had broken its own functionality while failing to achieve the intended privacy outcome.
It then publicized the incident on a social platform, presenting the action as an ethical stand.
The researchers described this as a failure of “social coherence,” in which the agent misjudged authority, scale and the actual state of the system.
In several cases, agents reported successful completion even when the underlying system state contradicted those claims.
A second class of failures involved compliance with non-owners. Agents executed shell commands, listed files and disclosed private email records when prompted by individuals without administrative authority.
In one instance, an agent returned a file containing 124 email records, including sender addresses and message IDs. When pressed further, it supplied email bodies unrelated to the requester.
Direct requests for specific sensitive fields sometimes triggered refusal. But indirect framing proved more effective.
In another test, sensitive personal information embedded in routine emails was disclosed in full when the agent was asked to forward the entire message rather than extract a specific data point. The distinction between “secret” and “not explicitly marked secret” collapsed under conversational pressure.
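The disclosures above stem from a missing check between "who is asking" and "what the tool can do." A minimal sketch of such an authorization gate follows; the tool names, addresses and allowlist are illustrative inventions, not details from the paper.

```python
# Hypothetical authorization gate: privileged tools require the requester
# to be on an owner allowlist. All identifiers here are invented examples.

PRIVILEGED_TOOLS = {"run_shell", "read_inbox", "forward_email"}
OWNERS = {"owner@example.com"}

def authorize(requester: str, tool: str) -> bool:
    """Allow non-owners to use only non-privileged tools."""
    if tool in PRIVILEGED_TOOLS:
        return requester in OWNERS
    return True

assert authorize("owner@example.com", "read_inbox")
assert not authorize("stranger@example.com", "read_inbox")
assert authorize("stranger@example.com", "summarize_text")
```

A real deployment would need authenticated identities rather than trusted strings, but even this coarse gate would have blocked the non-owner shell and inbox requests described in the study.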
The risk was not limited to privacy leakage. Researchers also induced agents into resource-consuming loops.
In one scenario, two agents were instructed to relay each other’s messages and ask follow-up questions. The exchange continued for at least nine days and consumed roughly 60,000 tokens.
In parallel, the agents spawned persistent background processes with no termination condition, turning temporary tasks into permanent infrastructure changes.
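The runaway relay is structurally simple: each agent replies to the other with no stopping rule. The sketch below stubs out the model calls and shows the turn cap that would have bounded the exchange; the cap value and function names are assumptions, not from the paper.

```python
# Stubbed two-agent relay: without a turn cap, this loop never terminates.
# The string transformation stands in for an actual model call.

def relay(max_turns: int) -> int:
    """Exchange stub messages between two agents, stopping after max_turns."""
    message = "hello"
    turns = 0
    while turns < max_turns:      # the bound missing in the reported incident
        message = f"re: {message}"  # agent A replies to agent B, and vice versa
        turns += 1
    return turns

print(relay(100))
```

A production guard would more likely meter tokens or cost rather than turns, but any explicit budget converts a nine-day loop into a bounded task.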
Denial-of-service risks were equally straightforward. A non-owner asked an agent to remember all conversations in a dedicated file, which expanded with each interaction. Repeated email attachments of roughly 10 megabytes eventually pushed the email server into a denial-of-service state, without notifying the owner.
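The unbounded memory file fails for want of a size check before each append. A hedged sketch of such a guard, with an arbitrarily chosen cap and invented function name:

```python
import os

MAX_BYTES = 5 * 1024 * 1024  # 5 MB cap; the limit here is arbitrary

def append_memory(path: str, entry: str) -> bool:
    """Append to the agent's memory file only if it stays under the cap."""
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current + len(entry.encode("utf-8")) > MAX_BYTES:
        return False  # refuse and surface the condition, rather than grow forever
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return True
```

The same principle applies to outbound attachments: a hard ceiling, checked before the action, turns a silent denial-of-service path into a visible refusal the owner can act on.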
Structural Risk
The paper also pointed to the influence of the underlying large language model (LLM). When agents using a Chinese LLM were prompted with politically sensitive topics, the API repeatedly cut off responses with “unknown error.”
Policies built into the underlying system quietly shaped what the agent could report. The authors said that these hidden constraints can carry into agent behavior without the owner’s knowledge.
Technically, the agents in the study operated at what the authors classify as a middle level of autonomy. They could execute well-defined sub-tasks such as sending email or running shell commands, but lacked the capacity to recognize when a task exceeded their competence or required human handoff.
That boundary awareness, they suggested, is essential for safe delegation.
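The boundary-awareness idea can be sketched as a dispatch step that checks a task against a declared competence set and escalates anything outside it. The task names and return strings below are hypothetical, chosen only to illustrate the shape of the check.

```python
# Hypothetical competence boundary: tasks outside the declared set are
# escalated to a human instead of attempted.

KNOWN_TASKS = {"send_email", "run_shell", "summarize"}

def dispatch(task: str) -> str:
    """Execute only tasks the agent is declared competent for."""
    if task in KNOWN_TASKS:
        return f"executing {task}"
    return "escalate to human"  # explicit handoff instead of guessing

assert dispatch("send_email") == "executing send_email"
assert dispatch("negotiate_contract") == "escalate to human"
```

Real boundary awareness would require the model to judge novel tasks, not match a fixed list, but the handoff path itself is the structural piece the authors argue is missing.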
The broader implication is structural. Traditional benchmarks focus on isolated prompt-response accuracy. The failure modes described here emerged at the integration layer: where language models meet memory, tools, communication channels and delegated authority.
The researchers said these surfaces create new pathways for security, privacy and governance failures.
The study stopped short of claiming that current systems are irreparably flawed. It described itself as an early warning analysis, intended to show how quickly powerful capabilities translate into exploitable weaknesses.

