Saturday, May 17, 2025

Multi-Agent Risks from Advanced AI

Executive Summary 

The proliferation of increasingly advanced AI not only promises widespread benefits, but also presents new risks (Bengio et al., 2024; Chan et al., 2023). Today, AI systems are beginning to autonomously interact with one another and adapt their behaviour accordingly, forming multi-agent systems. This change is due to the widespread adoption of sophisticated models that can interact via a range of modalities (including text, images, and audio), and the competitive advantages conferred by autonomous, adaptive agents (Anthropic, 2024a; Google DeepMind, 2024; OpenAI, 2025). 

While still relatively rare, groups of advanced AI agents are already responsible for tasks that range from trading million-dollar assets (AmplifyETFs, 2025; Ferreira et al., 2021; Sun et al., 2023a) to recommending actions to commanders in battle (Black et al., 2024; Manson, 2024; Palantir, 2025). In the near future, applications will not only include economic and military domains, but are also likely to extend to energy management, transport networks, and other critical infrastructure (Camacho et al., 2024; Mayorkas, 2024). Large populations of AI agents will also feature in more familiar social settings as intelligent personal assistants or representatives, capable of being delegated increasingly complex and important tasks. 

While bringing new opportunities for scalable automation and more diffuse benefits to society, these advanced, multi-agent systems present novel risks that are distinct from those posed by single agents or less advanced technologies, and which have been systematically underappreciated and understudied. This lack of attention is partly because present-day multi-agent systems are rare (and those that do exist are often highly controlled, such as in automated warehouses), but also because even single agents present many unsolved problems (Amodei et al., 2016; Anwar et al., 2024; Hendrycks et al., 2021). Given the current rate of progress and adoption, however, we urgently need to evaluate (and prepare to mitigate) multi-agent risks from advanced AI. More concretely, we provide recommendations throughout the report that can largely be classified as follows. 

Evaluation: Today’s AI systems are developed and tested in isolation, despite the fact that they will soon interact with each other. In order to understand how likely and severe multi-agent risks are, we need new methods of detecting how and when they might arise, such as: evaluating the cooperative capabilities, biases, and vulnerabilities of models; testing for new or improved dangerous capabilities in multi-agent settings (such as manipulation, collusion, or overriding safeguards); more open-ended simulations to study dynamics, selection pressures, and emergent behaviours; and studies of how well these tests and simulations match real-world deployments. 

Mitigation: Evaluation is only the first step towards mitigating multi-agent risks, which will require new technical advances. While our understanding of these risks is still growing, there are promising directions that we can begin to explore now, such as: scaling peer incentivisation methods to state-of-the-art models; developing secure protocols for trusted agent interactions; leveraging information design and the potential transparency of AI agents; and stabilising dynamic multi-agent networks and ensuring they are robust to the presence of adversaries. 

Collaboration: Multi-agent risks inherently involve many different actors and stakeholders, often in complex, dynamic environments. Greater progress can be made on these interdisciplinary problems by leveraging insights from other fields, such as: better understanding the causes of undesirable outcomes in complex adaptive systems and evolutionary settings; determining the moral responsibilities and legal liabilities for harms not caused by any single AI system; drawing lessons from existing efforts to regulate multi-agent systems in high-stakes contexts, such as financial markets; and determining the security vulnerabilities and affordances of multi-agent systems. 

To support these recommendations, we introduce a taxonomy of AI risks that are new, much more challenging, or qualitatively different in the multi-agent setting, together with a preliminary assessment of what can be done to mitigate them. We identify three high-level failure modes, which depend on the nature of the agents’ objectives and the intended behaviour of the system: miscoordination, conflict, and collusion. We then describe seven key risk factors that can lead to these failures: information asymmetries, network effects, selection pressures, destabilising dynamics, commitment and trust, emergent agency, and multi-agent security. For each problem we provide a definition, key instances of how and where it can arise, illustrative case studies, and promising directions for future work. We conclude by discussing the implications for existing work in AI safety, AI governance, and AI ethics. (...)
***
While coordinated human hacking teams or botnets already pose ‘multi-agent’ security risks, their speed and adaptability are limited by human coordination or static strategies. As AI agents become more autonomous and capable of learning and complex reasoning, however, they will be better able to dynamically strategize, collude, and decompose tasks to evade defences. At the same time, security efforts aimed at preventing attacks on (or harmful actions from) a single advanced AI system are comparatively simple, as they primarily require monitoring a single entity. The emergence of advanced multi-agent systems therefore raises new vulnerabilities that do not appear in single-agent or less advanced multi-agent contexts. (...)

Undetectable Threats. Cooperation and trust in many multi-agent systems rely crucially on the ability to detect (and then avoid or sanction) adversarial actions taken by others (Ostrom, 1990; Schneier, 2012). Recent developments, however, have shown that AI agents are capable of both steganographic communication (Motwani et al., 2024; Schroeder de Witt et al., 2023b) and ‘illusory’ attacks (Franzmeyer et al., 2023), which are black-box undetectable and can even be hidden using white-box undetectable encrypted backdoors (Draguns et al., 2024). Similarly, in environments where agents learn from interactions with others, it is possible for agents to secretly poison the training data of others (Halawi et al., 2024; Wei et al., 2023). If left unchecked, these new attack methods could rapidly destabilise cooperation and coordination in multi-agent systems.

by Lewis Hammond et al., Cooperative AI Foundation | Read more:
Image: via
[ed. Not looking good for humans. See also: The Future of AI: Collaborating Machines and the Rise of Multi-Agent Systems (Medium); and, Vision-language models can’t handle queries with negation words (MIT):]
***
In a new study, MIT researchers have found that vision-language models are extremely likely to make mistakes in real-world situations because they don’t understand negation — words like “no” and “doesn’t” that specify what is false or absent.

“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of this study. (...)

They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.

“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.

Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers, called vector representations. The models use these vectors to distinguish between different images.

A VLM utilizes two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.

“The captions express what is in the images — they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” Ghassemi says.

Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.
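
As a rough sketch of the setup described above (not the study's own code), the snippet below uses an openly available CLIP-style model from the Hugging Face transformers library to score one image against a few captions. The model name, image URL, and captions are illustrative assumptions, and exact similarity scores will vary; the point is simply that a negated caption can still score highly whenever it shares content words with the image.

```python
# Minimal sketch (not from the MIT paper): score one image against three
# captions with a CLIP-style model. Model, image, and captions are
# illustrative assumptions; exact scores will differ across models.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Public demo image (two cats on a couch) commonly used in transformers examples.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "a photo of two cats",    # affirmative and true
    "a photo with no cats",   # negated and false
    "a photo of a dog",       # affirmative and false
]

# Each encoder maps its input to a vector; the logits are (scaled) similarities
# between the image vector and each caption vector.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

for caption, score in zip(captions, outputs.logits_per_image[0].tolist()):
    print(f"{score:7.2f}  {caption}")

# If the model ignores negation, "a photo with no cats" can score nearly as
# high as the true caption, because both embeddings are dominated by "cats".
```

Quick probes like this are one way to sanity-check how a particular VLM handles “no” and “not” before trusting it in a downstream pipeline.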

[ed. Great. Knowable unknowns vs. Unknowable unknowns.]