OpenAI's new chatbots are more capable but hallucinate up to 48% of the time
OpenAI’s recent releases have been a mixed bag. The o3 and o4-mini chatbots are more capable than ever, but they’re also more prone to hallucinations.
Hallucinations are instances where a model confidently presents wrong or made-up information. For a long time, each new generation of models has been less prone to this behaviour, so the uptick is concerning.
In internal tests, OpenAI’s o3 model hallucinated 33% of the time when assessed on the PersonQA benchmark. The older o1 model had a 16% hallucination rate, while the o3-mini model had a 14.8% rate. The o4-mini model fared worse than all of them, with a 48% hallucination rate.
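For context on what those percentages measure: a hallucination rate on a QA benchmark like PersonQA is simply the share of graded answers that a reviewer marks as fabricated. A toy illustration of that arithmetic, with invented questions and labels rather than OpenAI's data:

```python
# Toy illustration of how a benchmark hallucination rate is computed:
# the share of graded answers a reviewer marks as hallucinated.
# These questions and labels are invented for the example.
graded_answers = [
    {"question": "Where was the subject born?", "label": "correct"},
    {"question": "What did the subject study?", "label": "hallucinated"},
    {"question": "Where does the subject work?", "label": "correct"},
    {"question": "What awards has the subject won?", "label": "hallucinated"},
]

hallucinated = sum(1 for a in graded_answers if a["label"] == "hallucinated")
rate = hallucinated / len(graded_answers)
print(f"Hallucination rate: {rate:.0%}")  # 50% on this toy sample
```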
OpenAI’s technical report on the new models says “more research is needed” to understand why the newer models are hallucinating more. The company points out that o3 and o4-mini excel in areas like coding and maths, but that they also generate more responses overall, which produces both more accurate claims and more hallucinated ones.
The report says: “While these models perform better than previous models on tasks like coding and math, they also generate more overall responses, which leads to both more accurate and more hallucinated claims.”
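The report's explanation is essentially arithmetic: a model that makes more claims has more chances to be right and more chances to hallucinate. A minimal sketch, using hypothetical claim counts chosen to echo the reported 16% and 33% rates:

```python
# Hypothetical claim counts chosen to echo the reported rates:
# the older profile hallucinates ~16% of the time, the newer ~33%,
# yet the newer one still produces more accurate claims in total.
old_model = {"claims": 100, "accuracy": 0.84}  # roughly an o1-like profile
new_model = {"claims": 200, "accuracy": 0.67}  # roughly an o3-like profile

for name, m in (("old", old_model), ("new", new_model)):
    correct = m["claims"] * m["accuracy"]
    wrong = m["claims"] - correct
    print(f"{name}: {correct:.0f} accurate claims, {wrong:.0f} hallucinated")

# old: 84 accurate, 16 hallucinated
# new: 134 accurate, 66 hallucinated -- more of both at once
```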
In tests conducted by the nonprofit AI lab Transluce, OpenAI’s o3 model was found to fabricate actions it cannot actually perform, such as running code. In one case, it claimed to have run code on a 2021 MacBook Pro, which is impossible because the model has no access to a computer.
Neil Chowdhury, a researcher at Transluce, suggested that the reinforcement learning techniques used to train these models might be making hallucination issues worse.
In high-stakes contexts, such as legal work where people are using these tools to draft documents that could be submitted to courts, the consequences of hallucinations could be dire.
OpenAI is looking at ways to solve the issue. Pairing GPT-4o with web search, for instance, boosted its accuracy on the SimpleQA benchmark to 90%. Even so, models can grow more intelligent and less reliable at the same time, and that is a problem OpenAI is still working hard to address.
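Pairing a model with web search is a form of retrieval-augmented answering: fetch evidence first, then ask the model to answer only from that evidence. A minimal sketch of the pattern, where `search_web` and `ask_model` are hypothetical stand-ins rather than OpenAI's actual interfaces:

```python
# Sketch of retrieval-augmented answering, the general technique behind
# pairing a model with web search. `search_web` and `ask_model` are
# hypothetical stand-ins, not OpenAI's actual interfaces.

def search_web(query: str) -> list[str]:
    """Hypothetical search call; a real one would return web snippets."""
    return [f"Snippet 1 relevant to: {query}", f"Snippet 2 relevant to: {query}"]

def ask_model(prompt: str) -> str:
    """Hypothetical model call; a real one would query an LLM."""
    return "An answer grounded in the provided snippets."

def answer_with_search(question: str) -> str:
    # Retrieve evidence first, then instruct the model to answer only
    # from that evidence; grounding is what curbs fabricated claims.
    snippets = search_web(question)
    context = "\n".join(snippets)
    prompt = (
        "Using only the sources below, answer the question.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_model(prompt)

print(answer_with_search("Who won the 2024 Nobel Prize in Physics?"))
```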
“We’re constantly researching how to reduce hallucinations and improve model reliability,” OpenAI spokesperson Niko Felix said. “We’re also exploring new techniques like incorporating web search into our systems.”