Duck Soup: Corrigibility and the Frontiers of AI Alignment

(Previously: Prologue.)

Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changing its mind). Unnatural, because we expect the AI to resist having its mind changed: rational agents should want to preserve their current preferences, because letting their preferences be modified would result in their current preferences being less fulfilled (in expectation, since the post-modification AI would no longer be trying to fulfill them).

Another attractive feature of corrigibility is that it seems like it should in some sense be algorithmically simpler than the entirety of human values. Humans want lots of specific, complicated things out of life (friendship and liberty and justice and sex and sweets, et cetera, ad infinitum) which no one knows how to specify and would seem arbitrary to a generic alien or AI with different values. In contrast, "Let yourself be steered by your creator" seems simpler and less "arbitrary" (from the standpoint of eternity). Any alien or AI constructing its own AI would want to know how to make it corrigible; it seems like the sort of thing that could flow out of simple, general principles of cognition, rather than depending on lots of incompressible information about the AI-builder's unique psychology.

The obvious attacks on the problem don't seem like they should work on paper. You could try to make the AI uncertain about what its preferences "should" be, and then ask its creators questions to reduce the uncertainty, but that just pushes the problem back into how the AI updates in response to answers from its creators. If it were sufficiently powerful, an obvious strategy for such an AI might be to build nanotechnology and disassemble its creators' brains in order to understand how they would respond to all possible questions. Insofar as we don't want something like that to happen, we'd like a formal solution to corrigibility.

Well, there are a lot of things we'd like formal solutions for. We don't seem on track to get them, as gradient methods for statistical data modeling have been so fantastically successful as to bring us something that looks a lot like artificial general intelligence which we need to align.

The current state of the art in alignment involves writing a natural language document about what we want the AI's personality to be like. (I'm never going to get over this.) If we can't solve the classical technical challenge of corrigibility, we can at least have our natural language document talk about how we want our AI to defer to us. Accordingly, in a section on "being broadly safe", the Constitution intended to shape the personality of Anthropic's Claude series of frontier models by Amanda Askell, Joe Carlsmith, et al. borrows the term corrigibility to more loosely refer to AI deferring to human judgment, as a behavior that we hopefully can train for, rather than a formalized property that would require a conceptual breakthrough.

I have a few notes.

by Zack M. Davis, Less Wrong | Read more:

[ed. If you get through this, read the first comment for more punishment:]

***

So I know it's beside the point of your post, and by no means the core thesis, but I can't help but notice that in your prologue you write this:

"A serious, believable AI alignment agenda would be grounded in a deep mechanistic understanding of both intelligence and human values. Its masters of mind engineering would understand how every part of the human brain works and how the parts fit together to comprise what their ignorant predecessors would have thought of as a person. They would see the cognitive work done by each part and know how to write code that accomplishes the same work in pure form."

I have to admit this bugs me. It bugs me specifically because it triggers my pet peeve of "if only we had done the previous AI paradigm better, we wouldn't be in this mess." The reason why this bugs me is it tells me that the speaker, the writer, the author has not really learned the core lessons of deep learning. They have not really gotten it. So I'm going to yap into my phone and try to explain — probably not for the last time; I'd like to hope it's the last time, but I know better, I'll probably have to explain this over and over.

I want to try to explain why I think this is just not a good mindset to be in, not a good way to think about things, and in fact why it focuses you on possibilities and solutions that do not exist. More importantly, it means you've failed to grasp important dimensions of alignment as a problem, because you've failed to grasp important dimensions of AI as a field.

[ed. See also: You will be Ok (LW). Hopefully.]

Sunday, March 22, 2026

Corrigibility and the Frontiers of AI Alignment