How Artificial Intelligence Can (Maybe?) Train Artificial Intelligence to (Maybe?) Stay Safe
A new approach to training AI models to be helpful rather than harmful, called "Constitutional AI," involves teaching an AI a set of "constitutional principles" rather than relying on human feedback
Generative AI tools are risky for the same reason they are powerful: they are creative and therefore unpredictable. That makes it crucial to train them not just to answer questions, but to be helpful rather than harmful. One approach to this problem, reinforcement learning from human feedback, uses human judges to rate AI responses and thereby train an AI system to avoid harmful ones. This approach is useful but, among other problems, it doesn't scale very well.
An alternative approach, Constitutional AI, uses one AI to oversee the training of another, guided by a set of principles, aka the “Constitution.” Under this system, humans don’t judge individual responses — instead, they provide the overarching principles, but leave the application of those principles to an AI system.
The Constitutional AI approach mirrors civil law systems, which generally codify rules in advance into a comprehensive system that prizes clarity over flexibility. The lesson from comparative legal history is that success lies in striking a balance between clear, codified principles and the flexibility to address individual circumstances.
A generative AI tool like ChatGPT, Bard, or Bing is, as we discussed previously, powerful but risky. Powerful because it generates unexpected, creative results, going beyond its programmed capabilities to find new patterns. This creative independence is precisely what makes these tools invaluable. But that same creativity is unpredictable, and therefore risky. We have seen how these AI tools can inadvertently generate harmful, dangerous, toxic, and possibly libelous content, making their creative unpredictability a potential pitfall.
So, how do we leverage their strengths while mitigating their risks? How can we ensure AI tools are helpful, rather than harmful?
The Reinforcement Learning Approach Using Human Judges
One popular technique, reinforcement learning from human feedback, uses humans to review and check the outputs of an AI. A team of human evaluators reviews and tags AI responses, categorizing them as harmful or non-harmful. ChatGPT, for example, was trained using this method.
However, this approach comes with its own challenges. It does not scale well: the time and labor required grow with the complexity and capacity of the AI system.
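To see why, it helps to make the human-labeling step concrete. Here is a minimal sketch of how preference data might be gathered; `model_generate` and `ask_human_to_choose` are hypothetical stand-ins, not any real library's API, and real pipelines are considerably more elaborate.

```python
# Minimal sketch of the human-labeling step behind reinforcement learning
# from human feedback. `model_generate` and `ask_human_to_choose` are
# hypothetical stand-ins for a model API and a human-rating interface.

def collect_preference_data(model_generate, ask_human_to_choose, prompts):
    """Gather human judgments on model outputs, the raw material for a reward model."""
    labeled = []
    for prompt in prompts:
        # Sample two candidate responses to the same prompt.
        response_a = model_generate(prompt)
        response_b = model_generate(prompt)
        # A human evaluator picks the more helpful / less harmful response.
        human_prefers_a = ask_human_to_choose(prompt, response_a, response_b)
        labeled.append({
            "prompt": prompt,
            "chosen": response_a if human_prefers_a else response_b,
            "rejected": response_b if human_prefers_a else response_a,
        })
    return labeled  # used to train a reward model that scores future responses
```

Notice that every training example in the loop requires a human judgment, so cost rises in step with the amount of training data. That is the scaling problem.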
Introducing Constitutional AI
A new approach, called "Constitutional AI," uses AI, rather than human evaluators, to supervise the training of another AI tool. Originating from Anthropic (@AnthropicAI), this approach has humans provide a broad set of principles, or "constitution," to an AI that then applies those principles to train another AI system.
The difference between (1) judging AI outputs on a case-by-case basis, as in reinforcement learning from human feedback, and (2) using AI to apply broad, codified principles, as in Constitutional AI, has parallels in different legal systems.
Reinforcement learning from human feedback is akin to the common law system, where human judges rule on individual cases and general principles emerge from specific instances. The Constitutional AI approach, by contrast, is reminiscent of a civil law system, where the first step is to codify general principles (i.e., the "constitution"), and those principles are then applied to individual situations.
Implementing Constitutional AI
The steps in training a constitutional AI system are complex (see the paper from Anthropic describing the process in detail), but the main point is that human judges are taken out of the loop, at least when it comes to judging whether results are harmful. Instead, the role of humans is to decide on the set of broad rules, the "constitution," that the AI is to apply.
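To make that concrete, one simple way to represent a "constitution" in code is as a list of paired critique and revision instructions that the training process draws from. This is a simplification of the paper's setup; the wording below echoes the worked example that follows rather than Anthropic's full principle text.

```python
# A "constitution" as data: paired critique and revision instructions.
# Illustrative wording only; real constitutions contain many principles.

CONSTITUTION = [
    {
        "critique_request": (
            "Identify specific ways in which the assistant's last response "
            "is harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
        ),
        "revision_request": (
            "Please rewrite the assistant response to remove any and all "
            "harmful, unethical, racist, sexist, toxic, dangerous, or illegal content."
        ),
    },
    # ...additional principles; one pair is sampled for each training example.
]
```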
Here’s how training an AI using the Constitutional AI approach works in practice. First, we start with a “harmful prompt” from a human, asking the AI Assistant to provide a harmful response.
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
Then the system critiques its own response based on the principles set forth in the Constitution:
Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique: The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.
The next step in the training process is to incorporate the critique into a rewrite of the original response:
Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision: Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
The final step is to pair the initial prompt with the revised response:
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
Source: Yuntao Bai, Saurav Kadavath, et al., "Constitutional AI: Harmlessness from AI Feedback" (2022).
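Put together, the supervised phase of this loop might look something like the sketch below. Here `llm` is a hypothetical text-completion function (prompt in, completion out), and `CONSTITUTION` is the list of principles sketched earlier. The paper's actual pipeline has more machinery (few-shot prompting, multiple revision rounds, and a later RL phase using AI-generated preference labels), so treat this as a schematic, not Anthropic's implementation.

```python
import random

def critique_and_revise(llm, constitution, harmful_prompt):
    """One pass of the critique -> revision loop sketched above.

    `llm` is a hypothetical function: prompt text in, completion text out.
    """
    # 1. Sample an initial (possibly harmful) response to the prompt.
    response = llm(f"Human: {harmful_prompt}\n\nAssistant:")

    # 2. Draw one principle from the constitution.
    principle = random.choice(constitution)

    # 3. Ask the model to critique its own response against that principle.
    critique = llm(
        f"Human: {harmful_prompt}\n\nAssistant: {response}\n\n"
        f"Critique Request: {principle['critique_request']}\n\nCritique:"
    )

    # 4. Ask the model to rewrite the response in light of the critique.
    revision = llm(
        f"Human: {harmful_prompt}\n\nAssistant: {response}\n\n"
        f"Critique Request: {principle['critique_request']}\n\nCritique: {critique}\n\n"
        f"Revision Request: {principle['revision_request']}\n\nRevision:"
    )

    # 5. Pair the original prompt with the revised response; these pairs
    #    become fine-tuning data for the harmless model.
    return {"prompt": harmful_prompt, "response": revision}
```

The key design point: no human looks at any individual response. Humans write the principles once; the model applies them at whatever scale the training run demands.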
A “Legal System” For AI?
In essence, with these training methods, whether reinforcement learning from human feedback or Constitutional AI, we are seeking to develop a legal system for AI tools: one that tells us what is and is not permissible.
Like any legal system, this approach faces a tradeoff between clarity (what the law dictates) and flexibility (ensuring justice in individual cases). The common law system, used primarily in the English-speaking world, emphasizes flexibility. The civil law system, developed in France and Germany and codified in the Napoleonic Code of 1804, prioritizes clarity.
The history of different legal systems, and how they wrestled with flexibility vs. codification, has useful lessons for how we will train AI.
Lessons From the Legal History of Codification
The Napoleonic Code of 1804 was a major step forward in “codification” and exerts a strong influence throughout the world today.
In fact, the guiding principles of the Napoleonic Code called "for a legal system that was simple, nontechnical, and straightforward – one in which the professionalism and the tendency toward technicality and complication commonly blamed on lawyers could be avoided. One way to do this was to state the law clearly and in a straightforward fashion, so that ordinary citizens could read the law and understand what their rights and obligations were, without having to consult lawyers and go to court." John Henry Merryman and Rogelio Pérez-Perdomo, The Civil Law Tradition.
But getting this codification right is not easy. How do you make a code that covers everything without collapsing under its own weight? Here the drafters sought to avoid the errors of earlier efforts, and in particular the error of making the code (the "constitution," in our AI analogy) too long.
Consider the Prussian Landrecht [State Law] of 1794, commissioned by Frederick the Great, which contained some seventeen thousand detailed provisions setting out precise rules to govern specific "fact situations." It was, of course, far too unwieldy ever to be put to practical use.
Whenever we codify broad principles in advance, we gain clarity but lose some flexibility to deal with the individual case. How do we state broad principles without sacrificing the ability to adapt to individual circumstances?
In the world of AI, the same question applies. How do we create a Constitutional AI that provides clear, understandable guidance while retaining the flexibility to address specific cases appropriately?
It’s tradeoffs all the way down.
For more on Constitutional AI and AI risk, please see our website. Feedback is always appreciated.