Can Superalignment help AI safety?

Recently, Jensen Huang, who co-founded U.S. multinational tech giant Nvidia Corp. and serves as its president and CEO, said AI will surpass human beings in the next five years. In keeping with this seeming fait accompli, OpenAI CEO Sam Altman said AI development cannot be stopped.

These fears are shared by Ilya Sutskever — co-founder and chief scientist at OpenAI. But his farsighted superalignment project may offer a viable way of avoiding what can only be seen as a major societal-scale risk.

For me, superalignment means automated alignment — ensuring AI continues to fit human expectations. But we need to first understand the formulation of AI as a two-step process, namely pre-training and fine-tuning, before going deeper.

Pre-training is akin to taking a mass of data and compressing it into a model, which can then produce answers based on what that data has in common. For example, the model may recognize features common to faces, such as outlines and colors. After receiving instructions, it can draw a new face. However, when training data is insufficient for the task at hand, the results may appear plausible but actually be wrong. This is more commonly known as an AI “hallucination.”

Fine-tuning teaches the model to output answers acceptable to humans by rewarding good ones and discouraging bad ones. Confabulated and harmful responses are counterproductive, and transparency about confidence levels is important. If the answers are not based on the training materials, the model must say so.

Superalignment is essentially about using existing AI systems to assist humans with most of the fine-tuning work.

For example, a judge can hear a complex patent infringement case without being an expert in the field. As long as the proceedings involve lawyers specialized in the particular area, the judge need only determine whether the issues are in accordance with the law. Similarly, superalignment puts humans in the judge’s place.

Going forward, people can co-create a Constitutional AI document, and AI models can be aligned to the prescribed norms, as The Collective Intelligence Project and Anthropic have ably demonstrated. As a case in point, a small AI trained on the United Nations’ Universal Declaration of Human Rights can then calibrate a larger AI model’s alignment.

As policymakers, our job is to send clear signals to the market through tailored initiatives and state investment. The recently opened AI Evaluation Center is playing its part by issuing a 10-point plan governing AI models. This comprises safety, resilience, accuracy, accountability, privacy, explainability, fairness, transparency, reliability and security. If a model proves problematic, it will not be banned immediately, but the government cannot encourage its deployment.

When I assisted Apple Inc.’s Siri team from 2010 to 2016, the most important things at that time were accuracy and harmlessness. As alignment techniques mature, I believe AI systems with frequent hallucinations will come to be avoided by investors.

This is like the revelation that some gases used in refrigerators can destroy the ozone layer; signatories to the Montreal Protocol agreed they should be banned. It was a clear signal to the manufacturers that such processes must be adjusted quickly or the product would cease to exist. Likewise, when more stable and safe AI systems emerge in the market, subsequent consumer response and investment will ensure the industry does not go astray.
