But Whisper has a major flaw: It tends to make up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers.
These experts said some of the fictional text – known in the industry as hallucinations – could include racial commentary, violent rhetoric and even imagined medical treatment.
Experts said such fabrications are problematic because Whisper is used in multiple industries around the world to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.
More troubling, they said, is the rush by medical centers to use Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk domains.”
The full extent of the problem is difficult to understand, but researchers and engineers said they often encountered Whisper’s hallucinations in their work.
A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in eight out of 10 audio transcriptions he inspected before he began trying to improve the model.
A machine learning engineer said he initially found hallucinations in about half of the more than 100 hours of Whisper transcriptions he analyzed.
A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.
Problems persist even with well-recorded, short audio samples. A recent study by computer scientists revealed 187 hallucinations in over 13,000 clear audio fragments they examined.
That trend would lead to tens of thousands of faulty transcriptions across millions of recordings, the researchers said.
Such mistakes could have “really dire consequences,” especially in hospital settings, said Alondra Nelson, who ran the White House Office of Science and Technology Policy for the Biden administration until last year.
The prevalence of such hallucinations has prompted experts, advocates and former OpenAI employees to call on the federal government to consider AI regulation. At the very least, they said, OpenAI should address the flaw.
An OpenAI spokesperson said the company is constantly researching how to reduce hallucinations and evaluates researchers’ findings, adding that OpenAI incorporates feedback into model updates.
While most developers assume transcription tools misspell words or make other mistakes, engineers and researchers said they’ve never seen another AI-powered transcription tool hallucinate as much as Whisper.
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short excerpts obtained from TalkBank, a research repository hosted at Carnegie Mellon University.
They found that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.
In one example they uncovered, a speaker said: “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
But the transcription software added: “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
A speaker in another recording described “two other girls and one lady.”
Whisper invented additional commentary on race, adding “two other girls and one lady, um, which were Black.”