Why OpenAI’s latest breakthrough to make ChatGPT safer is actually a step away from the A.I. we want

OpenAI CEO Sam Altman speaking last week in Israel. The company's researchers last week unveiled a new method for training A.I. chatbots to provide better responses to math and logic problems. But the same method might actually be a step backwards from the A.I. systems we truly want.
Jack Guez—AFP via Getty Images

Instruction-following large language models, such as OpenAI’s ChatGPT, and rival systems such as Google’s Bard and Anthropic’s Claude, have the potential to revolutionize business. But a lot of companies are struggling to figure out how to use them. That’s primarily because these models are unreliable and prone to providing authoritative-sounding but inaccurate information. It’s also because the content they generate can pose risks. They can output toxic language or encourage users to engage in unsafe or illegal behavior. They can reveal data that companies wish to safeguard. Dozens of companies are racing to figure out how to solve this problem—and there’s a pot of gold for whoever gets there first.

Last week, OpenAI published a research paper and an accompanying blog post championing what it said was a potentially major step forward towards that goal, as well as towards solving the larger “alignment problem.” The “alignment problem” refers to how to imbue powerful A.I. systems with an understanding of human concepts and values. Researchers who work in the field known as “A.I. Safety” see it as critical to ensuring that future A.I. software won’t pose an extinction-level threat to humanity. But, as I’ll explain, I think the solution OpenAI proposes actually demonstrates how limited today’s large language models are. Unless we come up with a fundamentally different architecture for generative A.I., it is likely that the tension between “alignment” and “performance” will mean the technology never lives up to its full potential. In fact, one could argue that training LLMs in the way OpenAI suggests in its latest research is a step backwards for the field.

To explain why, let’s walk through what OpenAI’s latest research showed. First, you need to understand that one way researchers have tried to tame the wild outputs of large language models is through a process called reinforcement learning from human feedback (or RLHF for short). This means that humans rate the answers an LLM produces, usually with just a simple thumbs up or thumbs down (although some people have experimented with less binary feedback systems), and the LLM is then fine-tuned to produce answers that are more likely to be rated thumbs up. Another way to get LLMs to produce better quality answers, especially for tasks such as logic questions or mathematics, is to ask the LLM to “reason step by step” or “think step by step” instead of just producing a final answer. Exactly why this so-called “chain of thought” prompting works isn’t fully understood, but it does seem to consistently produce better results.
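To make the distinction concrete, here is a minimal sketch of the two prompting styles. The query_llm helper is a hypothetical stand-in for whatever LLM API a developer happens to use, not a real library call:

```python
# A minimal sketch of direct prompting vs. "chain of thought" prompting.
# query_llm() is a hypothetical placeholder, not a real library call; swap in
# whichever LLM provider's API you actually use.

def query_llm(prompt: str) -> str:
    """Stand-in for a call to an instruction-following LLM."""
    return "<model response goes here>"

question = "A train travels 60 miles in 90 minutes. What is its average speed in mph?"

# Direct prompting: ask only for the final answer.
direct = query_llm(question + "\nGive only the final answer.")

# Chain-of-thought prompting: ask the model to reason step by step first,
# which tends to yield better results on math and logic questions.
chain_of_thought = query_llm(question + "\nLet's think step by step, then give the final answer.")

print(direct)
print(chain_of_thought)
```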

What OpenAI did in its latest research was to see what happened when an LLM was told to use chain of thought reasoning and was also trained using RLHF on each of the logical steps in the chain (instead of on the final answer). OpenAI called this “process supervision” as opposed to the “outcome supervision” it has used before. Well, it turns out, perhaps not surprisingly, that giving feedback on each step produces much better results. You can think of this as similar to how your junior high math teacher always admonished you to “show your work” on exams. That way she could see if you understood the reasoning needed to solve the question, and could give you partial credit even if you made a simple arithmetic error somewhere in the process.
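As a rough illustration of the difference, consider the following toy sketch. It is not OpenAI’s actual training code; the Solution class and both reward functions are invented here purely for clarity. An outcome-supervised reward looks only at the final answer, while a process-supervised reward scores every intermediate step:

```python
# An illustrative sketch (not OpenAI's actual training code) of outcome
# supervision versus process supervision. Assume a model's solution is a list
# of reasoning steps plus a final answer, and that human raters give each
# step a thumbs up (1) or thumbs down (0).

from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]      # the model's chain-of-thought reasoning steps
    final_answer: str     # the answer it ultimately arrives at

def outcome_reward(solution: Solution, correct_answer: str) -> float:
    """Outcome supervision: a single reward based only on the end result."""
    return 1.0 if solution.final_answer == correct_answer else 0.0

def process_reward(step_ratings: list[int]) -> float:
    """Process supervision: the reward is tied to every intermediate step;
    here it is simply the fraction of steps raters marked as valid."""
    return sum(step_ratings) / len(step_ratings) if step_ratings else 0.0

# Example: four reasoning steps, one of which a rater flagged as flawed.
print(process_reward([1, 1, 0, 1]))   # 0.75, partial credit for showing work
```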

There are just a couple of problems. One, as some other researchers have pointed out, it isn’t clear if this “process supervision” will help with the whole range of hallucinations LLMs exhibit, especially those involving nonexistent quotations and inaccurate citations, or if it only addresses a subset of inaccuracies that involve logic. It’s becoming increasingly clear that aligning LLMs to avoid many of the undesirable outcomes businesses fear may require a much more fundamental rethink of how these models are built and trained.

In fact, a group of Israeli computer scientists from Hebrew University and AI21 Labs recently explored whether RLHF was a robust alignment method and found serious problems. In a paper published this month, the researchers said they had proved that for any behavior an A.I. model could exhibit, no matter how unlikely, there exists a prompt that will elicit that behavior, with less likely behaviors simply requiring longer prompts. “This implies that any alignment process that attenuates undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks,” the researchers wrote. What’s worse, they found that techniques such as RLHF actually made it easier, not harder, to nudge a model into exhibiting undesired behavior.

There’s also a much bigger problem. Even if this technique is successful, it ultimately limits, not enhances, what A.I. can do: In fact, it risks throwing away the genius of Move 37. What do I mean? In 2016, AlphaGo, an A.I. system created by what is now Google DeepMind, achieved a major milestone in computer science when it beat the world’s top human player at the ancient strategy board game Go in a best-of-five demonstration match. In the second game of that contest, on the game’s 37th move, AlphaGo placed a stone in a position so unusual and, to human Go experts, so counterintuitive, that almost everyone assumed it was an error. AlphaGo itself estimated there was a less than one in ten thousand chance that a human would ever play that move. But AlphaGo also predicted that the move would put it in an excellent position to win the game, which it did. Move 37 wasn’t an error. It was a stroke of genius.

Later, when experts analyzed AlphaGo’s play over hundreds of games, they came to see that it had discovered a way of playing that upended 1,000 years of human expertise and intuition about the best Go strategies. Similarly, another system created by DeepMind, AlphaZero, which could master a variety of different strategy games, played chess in a style that seemed to human grandmasters so bizarre yet so effective that some branded it “alien chess.” In general, it was willing to sacrifice supposedly high-value pieces in order to gain board position in a way that made human players queasy. Like AlphaGo, AlphaZero was trained using reinforcement learning, playing millions of games against itself, where the only reward it received was whether it won or lost.

In other words, AlphaGo and AlphaZero received no feedback from human experts on whether any interim step they took was positive or negative. As a result, the A.I. software was able to explore all kinds of strategies, unconstrained by the limits of existing human understanding of the game. If AlphaGo had received process supervision from human feedback, as OpenAI is positing for LLMs, a human expert almost certainly would have given Move 37 a thumbs down. After all, human Go masters judged Move 37 illogical. It turned out to be brilliant. And that is the problem with OpenAI’s suggested approach. It is ultimately a kluge—a crude workaround designed to paper over a problem that is fundamental to the design of LLMs.

Today’s generative A.I. systems are very good at pastiche. They regurgitate and remix human knowledge. But if what we really want are A.I. systems that can help us solve the toughest problems we face—from climate change to disease—then what we need is not simply a masala of old ideas, but fundamentally new ones. We want A.I. that can ultimately advance novel hypotheses, make scientific breakthroughs, and invent new tactics and methods. Process supervision with human feedback is likely to be detrimental to achieving that goal. We will wind up with A.I. systems that are well-aligned, but incapable of genius.

With that, here’s the rest of this week’s news in A.I.


But, before you read on: Do you want to hear from some of the most important players shaping the generative A.I. revolution and learn how companies are using the technology to reinvent their businesses? Of course you do! So come to Fortune’s Brainstorm Tech 2023 conference, July 10-12 in Park City, Utah. I’ll be interviewing Anthropic CEO Dario Amodei on building A.I. we can trust and Microsoft corporate vice president Jordi Ribas on how A.I. is transforming Bing and search. We’ll also hear from Antonio Neri, CEO of Hewlett Packard Enterprise, on how the company is unlocking A.I.’s promise; Arati Prabhakar, director of the White House’s Office of Science and Technology Policy, on the Biden Administration’s latest thoughts about how the U.S. can realize A.I.’s potential while enacting the regulation needed to ensure we guard against its significant risks; Meredith Whittaker, president of the Signal Foundation, on safeguarding privacy in the age of A.I.; and many, many more, including some of the top venture capital investors backing the generative A.I. boom. All that, plus fly fishing, mountain biking, and hiking. I’d love to have Eye on A.I. readers join us! You can apply to attend here.


Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

Australia plans A.I. regulation. That’s according to Reuters, which said Australia is planning legislation to ban deep fakes and the production of misleading A.I.-generated content. A report by Australia’s National Science and Technology Council recently highlighted the possibility of A.I.-generated content being used to sway public opinion during parliamentary elections. Australia also plans to update its laws and regulations to address gaps in areas such as copyright, privacy, and consumer protection. Australia was among the first countries to introduce a voluntary ethics framework for A.I., in 2018. European lawmakers, meanwhile, are currently finalizing a landmark A.I. Act, built around a risk-based approach to regulation, that could serve as a model for other advanced economies and that Australia may consider following.

Eating disorder helpline pulls chatbot that gave harmful advice. The National Eating Disorder Association (NEDA) had to suspend its use of a chatbot called Tessa, which was designed to advise people with eating disorders, after a viral social media post exposed its promotion of dangerous eating habits, Vice News reported. The post by activist Sharon Maxwell described how Tessa encouraged intentional weight loss, calorie counting, and strict dieting, all activities that Maxwell said had contributed to her developing an eating disorder in the first place. NEDA initially denied Maxwell’s account, but later acknowledged the issue, stating that Tessa’s responses were against its policies and core beliefs. NEDA had earlier drawn criticism for its decision to end its human-staffed helpline after 20 years in response to the helpline workers’ attempts to unionize, replacing the entire service with a chatbot.

Putin deep fake used as part of ‘Russia under attack’ misinformation campaign. Hackers staged a cyberattack in which a fake A.I.-generated televised emergency appeal, designed to look like it was being made by Russian President Vladimir Putin, was broadcast, Politico reported. In the video, Putin claimed to be declaring martial law after Ukrainian troops supposedly crossed into Russian territory. The realistic-looking deepfake speech urged citizens to evacuate and prepare for an all-out war with Ukraine. But Putin's press secretary confirmed that the speech never happened. The incident highlights the increasing risk of deep fakes and disinformation.

Getty asks U.K. court for injunction to stop Stability AI marketing Stable Diffusion. The photo agency has asked a U.K. court to stop sales of Stability AI's image generation software in the country, Reuters reported. Getty has already sued Stability, which helped to create the popular open-source text-to-image A.I. software Stable Diffusion, in both the U.K. and the U.S. for copyright violations, alleging that Stable Diffusion was trained on millions of Getty-owned images scraped from the internet without proper licensing. The case is being closely watched for the precedent it may set about whether the use of copyrighted material without consent for A.I. training will be granted any kind of “fair use” exemption.

Stability AI founder and CEO exaggerated his credentials and the company’s relationship with partners, including Amazon, report says. An investigative story in Forbes says that Emad Mostaque, the founder and CEO of Stability AI, made misleading claims about his background and the company’s partnerships. According to the story, Mostaque falsely claimed to have a master’s degree from Oxford, misrepresented his role and Stability’s role in major A.I. projects, including the creation of the company’s signature A.I. system to date, Stable Diffusion, and made dubious claims about partnerships and strategic alliances, including with Amazon’s cloud computing arm, AWS. Former employees also reported that the company was slow to pay wages and was investigated for not paying payroll taxes on time, and that funds were transferred from the company’s bank account to Mostaque’s wife’s personal account. According to Forbes, the company, which secured $101 million in funding at a valuation greater than $1 billion in October, is now struggling to secure additional venture capital money.

EYE ON A.I. RESEARCH

Why do large language models seem so brilliantly smart and so stupid at the same time? Research from computer scientists at the Allen Institute for Artificial Intelligence, the University of Washington, the University of Chicago, and the University of Southern California examined why LLMs can handle so many seemingly complex tasks fluently and yet struggle to produce accurate results on tasks that humans find trivial. “Are these errors incidental, or do they signal more substantial limitations?” the researchers asked. They investigated LLM performance on three “compositional tasks” that involve breaking a problem down into sub-steps and then synthesizing the results of those sub-steps to produce an answer: multi-digit multiplication, logic grid puzzles, and classic dynamic programming. Their findings? LLMs, which are built on a kind of deep learning architecture called a Transformer, operate by reducing multi-step compositional reasoning into a series of efforts to determine the best set of words likely to answer each component of the question, but without actually learning systematic problem-solving skills. (This may also explain why insisting that an LLM “think step by step” in a prompt produces better results than simply asking for the answer. At least it forces the A.I. to conduct a series of chained searches for the most likely responses, as opposed to just trying to find the single most likely response to the initial prompt.) As a result, the researchers argue, LLMs’ performance will necessarily get worse—“rapidly decay” are the words they use—as tasks become increasingly complex. In other words, maybe LLMs are not the right path to increasingly human-like intelligence after all. And maybe this current hype cycle is headed for a fall.
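To see what “compositional” means here, consider multi-digit multiplication, one of the three tasks the researchers tested. Solving it reliably requires computing each sub-step and then combining the results, as in this small illustrative sketch (ordinary Python, included only to show the structure of the task, not anything the paper’s authors wrote):

```python
# A worked example of one "compositional" task: multi-digit multiplication
# decomposed into explicit sub-steps (partial products) that are then
# combined. Ordinary Python, shown only to illustrate the systematic,
# step-by-step procedure the researchers argue LLMs fail to learn.

def multiply_step_by_step(a: int, b: int) -> int:
    total = 0
    for position, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * (10 ** position)  # sub-step: one partial product per digit of b
        total += partial                        # synthesis: combine the sub-results
    return total

assert multiply_step_by_step(123, 456) == 123 * 456  # 56,088
```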

FORTUNE ON A.I.

A ‘ducking’ annoying iPhone autocorrect issue is finally getting fixed—and it’s all thanks to A.I.—by Prarthana Prakash

The IMF’s No. 2 official says experts were wrong to ignore lost jobs from automation—and warns we ‘don’t have the luxury of time’ to regulate A.I.—by Prarthana Prakash

Hollywood writers are having a steampunk moment, Barclays says. A.I. really will change everything—a few decades from now—by Rachel Shin

A.I. will change career trajectories, says HP CEO: ‘From doing things to interpreting things’—by Steve Mollman

BRAINFOOD

Will a lack of GPUs kill the generative A.I. revolution in its cradle?
I wrote earlier this week about a blog post that allegedly spilled the beans on what OpenAI CEO Sam Altman told a closed-door meeting of A.I. startup CEOs and developers when he was in London a few weeks ago. (The blog post was quickly taken down at OpenAI’s request, but not before it was captured by the Internet Archive and linked to on social media and developer discussion boards.) In the meeting, one of the things Altman revealed was the extent to which OpenAI’s growth is being constrained by its inability to secure enough graphics processing units (GPUs) to meet the surging demand for its products. The lack of GPUs has prevented OpenAI from rolling out features, such as a much longer context window (allowing longer prompts and responses), to match those offered by competitors such as Anthropic, which is offering a huge 100,000-token context window to users. But, of course, one wonders if Anthropic has only been able to do this because it doesn’t yet have the brand recognition that OpenAI and ChatGPT do. If it suddenly starts attracting more users, as it may, perhaps Anthropic will also find itself struggling to secure enough GPUs to match that demand.

Even massive Big Tech titans are facing this problem. OpenAI’s partner, Microsoft, has, according to a story from CNBC, signed a deal worth potentially billions of dollars over multiple years to purchase additional GPU capacity from Nvidia partner CoreWeave. Meanwhile, journalists testing Google’s generative A.I.-powered “search experience” have all noted how slow the system is at generating responses—presumably also a result of constrained GPU capacity at the internet giant.

And remember, generative A.I. applications are still in their infancy. It’s not clear whether Nvidia, which is the main producer of GPUs right now, or its nascent A.I.-specific chip rivals, will be able to ramp up production fast enough to meet the demand. And that may mean that the generative A.I. revolution will be, well, if not canceled, then at least attenuated.

This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays. Sign up here.