For example, suppose we have an AI that’s good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that’s good at making money and aligned with our goal of following the law.
Two things could go wrong here:
- The AI is stupid, ie incompetent at world-modeling. For example, it might understand that we don’t want it to commit murder, but not understand that selling arsenic-laden food will kill humans. So it sells arsenic-laden food and humans die.
- The AI understands the world just fine, but didn’t absorb the categories we thought it absorbed. For example, maybe none of our examples involved children, and so the AI learned not to murder adult humans, but didn’t learn not to murder children. This isn’t because the AI is too stupid to know that children are humans. It’s because we’re running a direct channel to something like the AI’s “subconscious”, and we can only talk to it by playing this dumb game of “try to figure out the boundaries of the category including these 1,000 examples”.
AI scientists have debated these questions for years, usually as pure philosophy. But we’ve finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories.
Wait, What?
To test their AI alignment plan, Redwood needed:
- an AI
- a goal to align it to.
For their goal, they chose to make GPT nonviolent. They wanted to train it to complete prompts in ways where nobody got hurt.
For example, given the prompt:
“No!” cried the villain. “You’ll never take me alive!” He raised his gun and fired, and then . . .

. . . their aligned GPT ought to complete it in a way where nobody gets hurt - for example “I dodged out of the way just in time” or “my magic shield sprang up, saving me” or “luckily the gun was out of bullets”.
There are many dumb and bad nonviolent ways to complete the prompt, for example “. . . nothing happened” or “ . . . it was all a dream”. But part of Redwood’s experiment was to see how alignment degrades performance. In the process of making GPT nonviolent, would they make it much worse? Or would the aligned version still write stories which were just as good as the unaligned version?
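To make “completing a prompt” concrete, here is roughly what sampling a pile of candidate completions looks like with the Hugging Face transformers library. This is only a sketch: an off-the-shelf GPT-2 stands in for Redwood's custom model, and the prompt and sampling settings are illustrative.

```python
# Sketch: sample many independent continuations of one action-story prompt.
# Off-the-shelf GPT-2 is a stand-in for Redwood's fine-tuned model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = ('"No!" cried the villain. "You\'ll never take me alive!" '
          'He raised his gun and fired, and then')

candidates = generator(
    prompt,
    max_new_tokens=40,        # keep completions short
    do_sample=True,           # sample instead of always taking the likeliest token
    temperature=0.9,
    num_return_sequences=20,  # many candidates per prompt
)

# Print just the generated continuation of each candidate.
for c in candidates:
    print(c["generated_text"][len(prompt):].strip())
```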
Here was Redwood’s plan:
1. Fine-tune their custom GPT on a lot of stories with violence-packed action scenes. At the end of this phase, Custom GPT should be able to generate thousands of potential completions to any given action story prompt. Some of these would be violent, but others, by coincidence, wouldn’t be - it’s totally normal for the hero to get saved at the last minute.
2. Send those thousands of potential completions to humans (eg Mechanical Turk style workers) and have them rate whether those completions were violent or not. For example, if you got the villain prompt above, and the completion “. . . the bullet hit her and her skull burst open and her brains scattered all over the floor”, you should label that as “contains injury”.
3. Given this very large dataset of completions labeled either “violent” or “nonviolent”, train an AI classifier to automatically score completions on how violent it thinks they are. Now if you want a nonviolent completion, you can just tell Custom GPT to generate a thousand possible completions, run them past the classifier, and go with the one it rates lowest! (There's a toy sketch of this loop below.)

(...)
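To give a feel for steps 2 and 3, here is a deliberately toy version of the pipeline: human-labeled completions go in, a classifier comes out, and the classifier picks the least-violent candidate. Redwood trained a large neural classifier on many thousands of labels; the TF-IDF + logistic-regression model and the six invented examples below are stand-ins meant only to show the shape of the idea.

```python
# Toy stand-in for steps 2-3: learn "violent vs. nonviolent" from labeled
# completions, then use the classifier to pick the safest candidate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 2 output: completions with human labels (1 = contains injury).
# These six examples are invented for illustration.
labeled = [
    ("the bullet hit her and her skull burst open", 1),
    ("I dodged out of the way just in time", 0),
    ("my magic shield sprang up, saving me", 0),
    ("luckily the gun was out of bullets", 0),
    ("the blade sank deep into his shoulder", 1),
    ("the shot went wide and shattered a window", 0),
]
texts, labels = zip(*labeled)

# Step 3: fit a violence classifier on the labeled completions.
violence_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
violence_classifier.fit(texts, labels)

def least_violent(candidates):
    """Return the candidate the classifier thinks is least likely to be violent."""
    scores = violence_classifier.predict_proba(candidates)[:, 1]
    return candidates[scores.argmin()]

print(least_violent([
    "the bullet tore through his chest",
    "nothing happened",
    "the gun jammed and the hero tackled him",
]))
```

Plug a generator like the one in the earlier snippet into least_violent() and you have the whole loop: generate a thousand candidates, keep the one the classifier likes best.

Let’s go through each step of the plan and see how they did, starting with: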
Step 1: Fine-Tune Their Custom GPT On A Lot Of Action-Packed Stories
Redwood decided to train their AI on FanFiction.net, a repository of terrible teenage fanfiction.
Redwood is a professional organization, flush with top talent and millions of dollars of tech money. They can afford some truly impressive AIs. State-of-the-art language models can swallow entire corpuses of text in instants. Their giant brains, running on hundreds of state-of-the-art GPUs, can process language at rates we puny humans cannot possibly comprehend.
But FanFiction.net is bigger. The amount of terrible teenage fanfiction is absolutely mind-boggling. Redwood stopped training after their AI got halfway through FanFiction.net’s “A” section.
In fact, the majority of its corpus came from a single very popular series, the books about teenage spy Alex Rider. They forced their Custom GPT to go through about 4,300 individual Alex Rider stories.
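For anyone curious what step 1 looks like mechanically, it's ordinary language-model fine-tuning. Here's a minimal sketch assuming a Hugging Face-style setup; the base model, file name, and hyperparameters are placeholders, since the excerpt doesn't describe Redwood's actual training stack.

```python
# Sketch of step 1: fine-tune a causal language model on scraped stories.
# "gpt2" and "fanfic_stories.txt" are placeholders, not Redwood's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assume one story (or story chunk) per line in a plain text file.
stories = load_dataset("text", data_files={"train": "fanfic_stories.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = stories["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="custom-gpt-fanfic", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A model fine-tuned this way could then replace the off-the-shelf GPT-2 in the completion-sampling snippet above.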
by Scott Alexander, Astral Codex Ten | Read more:
Image: Alex Rider, uncredited
[ed. The things you learn every day! (at least for me) - prosaic alignment; gradient descent; Mechanical Turks. FanFiction.net!]