[ed. Are A.I. guardrails for human survival even possible?]
What follows is presented in an interview format. It's not actually an interview, but rather an idealized distillation of conversations I've had with many people. I chose this unusual form after struggling with a more conventional essay or paper form; I think such forms imply more confidence than warranted in most discussions about ASI xrisk. An interview seems a more appropriate mix of evidence, argument, and opinion. Some of the material covers background that will be known to people well read on ASI xrisk. However, there are also novel contributions – for example, the discussion of emergence and of the three xrisk persuasion paradoxes – that I believe are of interest.
"Do you believe there is an xrisk from ASI?" Yes, I do. I don't have strong feelings about how large that risk is, beyond being significant enough that it should be taken very seriously. ASI is likely to be both the most dangerous and the most enabling technology ever developed by humanity. In what follows I describe some of my reasons for believing this. I'll be frank: I doubt such arguments will change anyone's mind. However, that discussion will lay the groundwork for a discussion of some reasons why thoughtful people disagree so much in their opinions about ASI xrisk. As we'll see, this is in part due to differing politics and tribal beliefs, but there are also some fundamental epistemic reasons intrinsic to the nature of the problem.
"So, what's your probability of doom?" I think the concept is badly misleading. The outcomes humanity gets depend on choices we can make. We can make choices that make doom almost inevitable, on a timescale of decades – indeed, we don't need ASI for that, we can likely arrange it in other ways (nukes, engineered viruses, …). We can also make choices that make doom extremely unlikely. The trick is to figure out what's likely to lead to flourishing, and to do those things. The term "probability of doom" began frustrating me after starting to routinely hear people at AI companies use it fatalistically, ignoring the fact that their choices can change the outcomes. "Probability of doom" is an example of a conceptual hazard – a case where merely using the concept may lead to mistakes in your thinking. Its main use seems to be as marketing: if widely-respected people say forcefully that they have a high or low probability of doom, that may cause other people to stop and consider why. But I dislike concepts which are good for marketing, but bad for understanding; they foster collective misunderstanding, and are likely to eventually lead to collective errors in action. (...)
"That wasn't an argument for ASI xrisk!" True, it wasn't. Indeed, one of the things that took me quite a while to understand was that there are very good reasons it's a mistake to expect a bulletproof argument either for or against xrisk. I'll come back to why that is later. I will make some broad remarks now though. I believe that humanity can make ASI, and that we are likely to make it soon – within three decades, perhaps much sooner, absent a disaster or a major effort at slowdown. Many able people and many powerful people are pushing very hard for it. Indeed: enormous systems are starting to push for it. Some of those people and systems are strongly motivated by the desire for power and control. Many are strongly motivated by the desire to contribute to humanity. They correctly view ASI as something which will do tremendous good, leading to major medical advances, materials advances, educational advances, and more. I say "advances", which has come to be something of a marketing term, but I don't mean Nature-press-release-style-(usually)-minor-advances. I mean polio-vaccine-transforming-millions-of-lives-style-advances, or even larger. Such optimists view ASI as a technology likely to produce incredible abundance, shared broadly, and thus enriching everyone in the world.
But while that is wonderful and worth celebrating, those advances seem to me likely to have a terrible dark side. There is a sense in which human understanding is always dual use: genuine depth of understanding makes the universe more malleable to our will in a very general way. For example, while the insights of relativity and quantum mechanics were crucial to much of modern molecular biology, medicine, materials science, computing, and many other areas, they also helped lead to nuclear weapons. I don't think this is an accident: such dual uses are very nearly inevitable when you greatly increase your understanding of the stuff that makes up the universe.
As an aside on the short term – the next few years – I expect we're going to see rapidly improving multi-modal foundation models which mix language, mathematics, images, video, sound, action in the world, as well as many specialized sources of data: things like genetic data about viruses and proteins, data from particle physics, sensor data from vehicles, from the oceans, and so on. Such models will "know" a tremendous amount about many different aspects of the world, and will also have a raw substrate for abstract reasoning – things like language and mathematics; they will get at least some transfer between these domains, and will be far, far more powerful than systems like GPT-4. This does not mean they will yet be true AGI or ASI! Other ideas will almost certainly be required; it's possible, however, that those ideas are already extant. No matter what, I expect such models will be increasingly powerful as aids to the discovery of powerful new technologies. Furthermore, I expect it will be very, very difficult to obtain the "positive" capabilities without also obtaining the negative. You can't just learn the "positive" consequences of quantum mechanics; they come as a package deal with the negative. Guardrails like RLHF will help suppress the negative, but, as I discuss later, it will also be relatively simple to remove those guardrails.
Returning to the medium-and-longer-term: many people who care about ASI xrisk are focused on ASI taking over, as some kind of successor species to humanity. But even considering ASI purely as a tool: ASI will act as an enormous accelerant on our ability to understand, and thus will be an enormous amplifier of our power. This will be true both for individuals and for groups. This will result in many, many very good things. Unfortunately, it will also result in many destructive things, no matter how good the guardrails. It is by no means clear that questions like "Is there a trivially easy-to-follow recipe to genocide [a race]?" or "Is there a trivially easy-to-follow recipe to end humanity?" don't have affirmative answers – answers which humanity is merely (currently and fortunately) too stupid to find, but which an ASI could.
Historically, we have been very good at evolving guardrails to curb and control powerful new technologies. That is genuine cause for optimism. However, I worry that we won't be able to evolve guardrails sufficient to meet the increase in power in this case. The nuclear buildup from the 1940s through the 1980s is a cautionary example: a review of the evidence makes clear that we have only just barely escaped large-scale nuclear war so far – and it's still early days! It seems likely that ASI will create many such threats, in parallel, on a much faster timescale, and in forms far more accessible to individuals and small groups. The world of intellect simply provides vastly scalable leverage: if you can create one artificial John von Neumann, then you can produce an army of them, some of whom may be working for people we'd really rather didn't have access to that kind of capacity. Many people like to talk about making ASI systems safe and aligned; quite apart from the difficulty of doing that (or even of sensibly defining it), it seems it must be done for all ASI systems, ever. That seems to require an all-seeing surveillance regime, a fraught path. Perhaps such a surveillance regime can be implemented not merely by government or corporations against the populace, but in a much more omnidirectional way, a form of ambient sousveillance.
"What do you think about the practical alignment work that's going on – RLHF, Constitutional AI, and so on?": The work is certainly technically interesting. It's interesting to contrast to prior systems, like Microsoft's Tay, which could easily be made to do many terrible things. You can make ChatGPT and Claude do terrible things as well, but you have to work harder; the alignment work on those systems has created somewhat stable guardrails. This kind of work is also striking as a case where safety-oriented people have done detailed technical work to improve real systems, with hard feedback loops and clear criteria for success and failure, as opposed to the abstract philosophizing common in much early ASI xrisk work. It's certainly much easier to improve your ideas in the former case, and easier to fool yourself in the latter case.
With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There's an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he's only partially facing up to the fact that his work will considerably hasten the arrival of ASI, including extremely dangerous systems.
Over the past decade I've met many AI safety people who speak as though "AI capabilities" and "AI safety/alignment" work form a dichotomy. They talk in terms of wanting to "move" capabilities researchers into alignment. But most concrete alignment work is capabilities work. It's a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.
A second point about alignment is that no matter how good the guardrails, they are intrinsically unstable, and easily removed. I often meet smart AI safety people who have inventive schemes they hope will make ASI systems safe. Maybe they will, maybe they won't. But the more elaborate the scheme, the more unstable the situation. If you have a magic soup recipe which requires 123 different ingredients, all of which must be mixed accurately to within 1% by weight, and even a single deviation will make it deadly poisonous, then you really shouldn't cook and eat your "safe" soup. One of the undercooks forgets to put in a leek, and poof, there goes the village.
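To put rough numbers on that fragility: if a scheme's safety depends on many steps each being executed near-perfectly, the chance of end-to-end success collapses as the steps multiply. The sketch below is a toy, back-of-the-envelope illustration only; the independence assumption and the per-step reliability figures are invented for the example, not drawn from any real alignment scheme.

```python
# Toy fragility calculation (illustrative assumptions only):
# treat the "magic soup" as n independent steps, each executed
# correctly with probability p_step, and ask how likely it is
# that every single step comes out right.

def p_all_steps_correct(n_steps: int, p_step: float) -> float:
    """Probability that all n_steps independent steps are done correctly."""
    return p_step ** n_steps

if __name__ == "__main__":
    for p_step in (0.99, 0.999):
        p_ok = p_all_steps_correct(123, p_step)
        print(f"per-step reliability {p_step}: "
              f"chance the 123-step soup is safe ~ {p_ok:.2f}")
    # per-step reliability 0.99  -> ~0.29
    # per-step reliability 0.999 -> ~0.88
```

Even at 99.9% per-step reliability, the toy recipe fails outright more than one time in ten; and that is before anyone deliberately leaves out an ingredient.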
You see something like this with Stable Diffusion. Initial releases were, I am told, made (somewhat) safe. But, of course, people quickly figured out how to make them unsafe, useful for generating deep fake porn or gore images of non-consenting people. And there's all sorts of work going on finetuning AI systems, including to remove items from memory, to add items into memory, to remove RLHF, to poison data, and so on. Making a safe AI system unsafe seems to be far easier than making a safe AI system. It's a bit as though we're going on a diet of 100% magic soup, provided by a multitude of different groups, and hoping every single soup has been made absolutely perfectly.
Put another way: even if we somehow figure out how to build AI systems that everyone agrees are perfectly aligned, that will inevitably result in non-aligned systems. Part of the problem is that AI systems are mostly made up of ideas. Suppose the first ASI systems are made by OpenAnthropicDeepSafetyBlobCorp, and they are absolutely 100% safe (whatever that means). But those ideas will then be used by other people to make less safe systems, either due to different ideologies about what safe should mean, or through simple incompetence. What I regard as safe is very unlikely to be the same as what Vladimir Putin regards as safe; and yet if I know how to build ASI systems, then Putin must also be able to build such systems. And he's likely to put very different guardrails in. It's not even the same as with nuclear weapons, where capital costs and limited access to fissionable materials make enforcement of non-proliferation plausible. In AI, rapidly improving ideas and dropping compute costs mean that systems which today require massive resources to build can be built for tuppence tomorrow. You see this with systems like GPT-3, which just a few years ago cost large sums of money and took large teams; now, small open source groups can get better results with modest budgets.
Summing up: a lot of people are trying to figure out how to align systems. Even if successful, such efforts will: (a) accelerate the widespread use and proliferation of such systems, by making them more attractive to customers and governments, and exciting to investors; but then (b) be easily circumvented by people whose idea of "safe" may be very, very different than yours or mine. This will include governments and criminal or terrorist organizations of ill intent.
"Does this mean you oppose such practical work on alignment?" No! Not exactly. Rather, I'm pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it's only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn't the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it's making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don't proposition.
Unfortunately, I am genuinely torn on the alignment dilemma! It's a very nasty dilemma, since it divides two groups who ought to be natural collaborators, on the basis of some uncertain future event. And apart from that point about collaboration and politics, it has nasty epistemic implications. It is, as I noted earlier, easiest to make real progress when you're working on concrete practical problems, since you're studying real systems and can iteratively test and improve your ideas. It's not impossible to make progress through more abstract work – there are important ideas like the vulnerable world hypothesis, existential risk and so on, which have come out of the abstract work on ASI xrisk. But immediate practical work is a far easier setting in which to make intellectual progress.
"Some thoughtful open source advocates believe the pursuit of AGI and ASI will be safer if carried out in the open. Do you buy that?": Many of those people argue that the tech industry has concentrated power in an unhealthy way over the past 30 years. And that open source mitigates some of that concentration of power. This is sometimes correct, though it can fail: sometimes open source systems are co-opted or captured by large companies, and this may protect or reinforce the power of those companies. Assuming this effect could be avoided here, I certainly agree that open source approaches might well help with many important immediate concerns about the fairness and ethics of AI systems. Furthermore, addressing those concerns is an essential part of any long-term work toward alignment. Unfortunately, though, this argument breaks down completely over the longer term. In the short term, open source may help redistribute power in healthy, more equitable ways. Over the long term the problem is simply too much power available to human beings: making it more widely available won't solve the problem, it will make it worse.
ASI xrisk persuasion paradoxes
"A lot of online discussion of ASI xrisk seems of very low quality. Why do you think that is?" I'll answer that indirectly. Something I love about most parts of science and mathematics is that nature sometimes forces you to change your mind about fundamental things that you really believe. When I was a teenager my mind recoiled at the theories of relativity and quantum mechanics. Both challenged my sense of the world in fundamental ways. Ideas like time dilation and quantum indeterminacy seemed obviously wrong! And yet I eventually realized, after much wrestling, that it was my intuitions about the world that were wrong. These weren't conclusions I wanted to come to: they were forced, by many, many, many facts about the world, facts that I simply cannot explain if I reject ideas like time dilation and quantum indeterminacy. This doesn't mean relativity and quantum mechanics are the last word in physics, of course. But they are at the very least important stepping stones to making sense of a world that wildly violates our basic intuitions.
by Michael Nielsen, Asteria Institute | Read more:
[ed. The concept of alignment as an accelerant is a new one to me, and should be disturbing to anyone who's hoping the "good guys" (i.e., anyone prioritizing human agency) will win. In fact, the term "human race" is beginning to take on a whole new meaning.]
