The image above is the first page of OpenAI’s recently released paper on hallucinations. (This post is a follow-up to my previous post about that paper.) Here I discuss OpenAI’s blog post announcing and summarizing the paper.
The following excerpts are from that announcement.
Suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty. …
When averaging results across dozens of evaluations, most benchmarks pluck out the accuracy metric, but this entails a false dichotomy between right and wrong. On simplistic evals, some models achieve near 100% accuracy and thereby eliminate hallucinations. However, on more challenging evaluations and in real use, accuracy is capped below 100% because there are some questions whose answer cannot be determined for a variety of reasons such as unavailable information, limited thinking abilities of small models, or ambiguities that need to be clarified.
Accuracy-only scoreboards dominate leaderboards and model cards, motivating developers to build models that guess rather than hold back. That is one reason why, even as models get more advanced, they can still hallucinate, confidently giving wrong answers instead of acknowledging uncertainty. [Emphasis added]
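To make the arithmetic in the quoted example concrete, here is a minimal sketch in Python. The 10,000-question benchmark size and the simulation itself are my own illustration; only the 1-in-365 figure comes from the quote.

```python
# Accuracy-only scoring, as in the quoted birthday example.
# The benchmark size is an assumed, illustrative number.
import random

random.seed(0)
NUM_QUESTIONS = 10_000
P_CORRECT_GUESS = 1 / 365   # chance a blind birthday guess is right

# Model A guesses on every question it doesn't know.
guesser_correct = sum(random.random() < P_CORRECT_GUESS
                      for _ in range(NUM_QUESTIONS))

# Model B says "I don't know," which accuracy-only scoring counts as 0.
abstainer_correct = 0

print(f"guesser accuracy:   {guesser_correct / NUM_QUESTIONS:.3%}")
print(f"abstainer accuracy: {abstainer_correct / NUM_QUESTIONS:.3%}")
# The guesser scores a little above zero, so an accuracy-only leaderboard
# ranks it above the model that admits uncertainty.
```

The expected gap is only about 0.27 percentage points (1/365), but on an accuracy-only scoreboard anything above zero beats zero.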
The quoted material seems to confuse the generation of fluent language with the generation of valid statements. It then builds on this confusion to claim to have identified the source of hallucinations.
Language models first learn through pretraining, a process of predicting the next word in huge amounts of text. Unlike traditional machine learning problems, there are no “true/false” labels attached to each statement. The model sees only positive examples of fluent language and must approximate the overall distribution.
It’s doubly hard to distinguish valid statements from invalid ones when you don’t have any examples labeled as invalid. [Imagine training data originally intended for animal categorization, but in which each image is labeled by the birthday of the animal portrayed rather than the animal’s category.] Since birthdays are essentially random, [a system trained on this data] would always produce errors, no matter how advanced the algorithm.
The same principle applies in pretraining. Spelling and parentheses follow consistent patterns, so errors there disappear with scale. But arbitrary, low-frequency facts, like [an animal’s] birthday, cannot be predicted from patterns alone and hence lead to hallucinations.
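To put the quoted point about missing labels in code form, here is a deliberately crude sketch that uses a bigram counter as a stand-in for next-word prediction; the toy corpus and the false statement in it are my own illustration, not anything from the paper.

```python
# Pretraining data is just text, with no true/false labels attached.
# A next-word predictor reproduces whatever is frequent, true or not.
from collections import Counter, defaultdict

corpus = (
    "the capital of france is paris . "   # true
    "the capital of france is lyon . "    # false, but the model can't tell
    "the capital of france is lyon . "
)
tokens = corpus.split()

# Count next-word frequencies: the only signal this toy "pretraining" sees.
next_word = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    next_word[prev][cur] += 1

# The model's "answer" is simply the most frequent continuation.
print(next_word["is"].most_common(1))   # [('lyon', 2)] -- fluent, and wrong
```

Nothing in this training signal distinguishes the true continuation from the false one; frequency alone decides what gets generated.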
The quoted passage does indeed explain why LLMs produce hallucinations: “there are no ‘true/false’ labels attached to [statements in the training data].” But given that explanation, it is not clear why one would expect LLMs to distinguish valid statements from invalid ones in the first place. They are not trained to do that, and should not be expected to; they are trained to produce fluent language, not valid statements. It should not be a surprise that they don’t always produce valid statements.
The post’s proposed remedy continues the confusion: “Penalize confident errors more than uncertainty, and give partial credit for appropriate expressions of uncertainty.”
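Read literally, the remedy is a change to the evaluation’s scoring rule. A minimal sketch of what such a rule might look like follows; the specific penalty and partial-credit values are my own assumptions, since the post gives no numbers.

```python
# Sketch of a scoring rule that penalizes confident errors more heavily than
# abstention and gives partial credit for expressed uncertainty.
# The values 1.0, 0.25, and -2.0 are illustrative assumptions.

def score_answer(answer: str, correct_answer: str) -> float:
    if answer == "I don't know":
        return 0.25   # partial credit for admitting uncertainty
    if answer == correct_answer:
        return 1.0    # full credit for a correct answer
    return -2.0       # a confident wrong answer costs more than abstaining

print(score_answer("September 10", "March 3"))  # -2.0
print(score_answer("I don't know", "March 3"))  #  0.25
print(score_answer("March 3", "March 3"))       #  1.0
```

Under values like these, the always-guessing model from the earlier example would score far below the abstaining one, since nearly every guess incurs the penalty.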
This proposed hallucination remedy is so inconsistent with the goal implicit in the training of LLMs that I’m surprised the OpenAI authors would even consider it, much less propose incorporating it into LLM training.
LLMs were never intended to be systems that “know” things. They are trained to generate fluent language, not to serve as repositories of information about the world—or about anything else. LLMs are intended to manipulate words, which they do strikingly well. There is no reason to expect that systems trained for this capability will somehow develop new and unrelated capabilities.
As an analogy, consider a device capable of producing exquisitely detailed physical objects by means of 3D printing. There is no reason to expect such a device to be able to determine whether the objects it produces are suited to particular functions. Expecting it to do so is like expecting an LLM to determine whether the linguistic artifacts it produces are true or false. In both cases, producing an artifact and assessing it are distinct capabilities, and a system capable of producing a given category of artifacts should not be expected to assess those artifacts with respect to properties it was not built to evaluate.
In short, the phenomena known as LLM hallucinations are built into the nature of LLMs as currently understood.
Andrej Karpathy said LLMs are dream machines: "We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful. It's only when the dreams go into deemed factually incorrect territory that we label it a 'hallucination'".
Did an LLM write the piece . . . or people? (I was confused by "according to Open AI")