The AI Hallucination Problem
While AI hallucination rates have improved, they remain a significant risk for professionals in regulated sectors where verification is often undermined by human cognitive bias. Relying on "check your work" instructions is insufficient, as passive review fails to catch sophisticated errors. Robust solutions require an adversarial, automated verification process combined with expert human judgement to ensure accuracy.
AI hallucination hasn't gone away: it's just morphed in nature
Hallucination isn't as bad as it used to be. The headline numbers look good: several vendors are reporting sub-5% hallucination rates, with several landing between 1.8% and 4% on summarisation tasks (Vectara leaderboard). IIRC, the early days of ChatGPT were 30-40%, so that's roughly a 10x improvement.
That said, a 1-in-25 error rate is not a rounding error when the output is a document that has to survive regulatory scrutiny. Plus, the most widely used consumer models are still high-error (typically around 10-15%).
Even if the rate continues to drop, there are still some thorny issues.
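To make the 1-in-25 point concrete, here is a rough back-of-envelope sketch. It rests on a simplifying assumption of mine, not something the leaderboard measures: that each claim in a document has an independent 4% chance of being wrong. Real errors are unlikely to be independent, so treat the numbers as illustrative only.

```python
def chance_of_at_least_one_error(error_rate: float, n_claims: int) -> float:
    """Probability that at least one of n_claims is wrong.

    Assumes (hypothetically) that every claim fails independently
    with the same probability `error_rate`.
    """
    return 1 - (1 - error_rate) ** n_claims


if __name__ == "__main__":
    # 0.04 is the 1-in-25 rate discussed above; the claim counts are arbitrary.
    for n in (1, 10, 25, 50):
        risk = chance_of_at_least_one_error(0.04, n)
        print(f"{n:>3} claims: {risk:.0%} chance of at least one error")
```

Under that assumption, a document with 10 checkable claims has roughly a one-in-three chance of containing an error, and one with 25 claims has closer to two-in-three.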
1: Regulated professions are carrying disproportionate risk
For most users, a hallucination means wasted time or mild embarrassment. For, say, a lawyer or sustainability officer, it can lead to lost business, financial penalties or even a ban from the profession.
There have been multiple cases now of professionals receiving serious penalties relating to improper AI use. Some of them even involve well-intentioned employees at respected firms, using top-tier tools which warn the user to check their work / sources. So this isn't outright negligence.
2: "Check your work" doesn't work
I see this everywhere: tell employees to verify AI outputs. Most governance frameworks say this. The regulations say this. Every vendor says this.
The problem is that humans are terrible at scrutinising things from trusted sources.
The Law Society put a name to it at LegalGeek Growth recently: "sedation risk." Reviewing AI output you didn't write, in a domain you know well, involves a cognitive dynamic closer to passive reading than active analysis. The output looks plausible. The formatting is good. The citations look real. Your brain stops interrogating it.
AI verification is mentally taxing for most people, which is why many end up just relying on the AI. The tax is particularly heavy because the LLM output is slightly different each time.
3: RAG helps but has limitations
RAG is the dominant solution for specialist tools. Retrieval Augmented Generation basically forces the generative AI to ground its output in a specific, curated information set rather than relying on general training data. This is naturally better - it constrains output and reduces errors on in-scope questions considerably. One founder I spoke to is so confident in their RAG processes that he guarantees data quality to clients on that basis. That's bold.
However, RAG systems can still hallucinate, especially when the query falls outside the typical scope. Plus, there's an inherent limitation to a system checking its own work - similar to humans being somewhat blind to their own errors.
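One way to picture the scope problem is a retrieval step that refuses to answer when nothing in the curated set matches. The sketch below is a toy under stated assumptions: the `CURATED_SOURCES` contents are placeholder text, the word-overlap scoring is a crude stand-in for real retrieval, and `min_overlap` is an arbitrary threshold. Production RAG systems typically use embeddings and far larger corpora.

```python
import re
from typing import Optional

# Hypothetical curated knowledge base. The passages are placeholders,
# not real guidance.
CURATED_SOURCES = {
    "notice-periods": "notice periods for employees with continuous employment",
    "non-compete": "non-compete clauses and restrictive covenants after an exit",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, min_overlap: int = 2) -> Optional[tuple[str, str]]:
    """Return (source_id, passage) for the best match, or None if out of scope."""
    query_words = tokenize(query)
    best_id, best_score = None, 0
    for source_id, passage in CURATED_SOURCES.items():
        score = len(query_words & tokenize(passage))
        if score > best_score:
            best_id, best_score = source_id, score
    if best_id is None or best_score < min_overlap:
        return None  # nothing in the curated set covers this query
    return best_id, CURATED_SOURCES[best_id]


def build_prompt(query: str) -> Optional[str]:
    """Build a grounded prompt, or return None so the caller can escalate."""
    hit = retrieve(query)
    if hit is None:
        return None  # out of scope: escalate rather than let the model improvise
    source_id, passage = hit
    return f"Answer using ONLY this source [{source_id}]: {passage}\nQuestion: {query}"
```

The design choice worth noting is the explicit `None` path: an out-of-scope question gets escalated instead of quietly answered from general training data.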
4: A separate, adversarial checker is more robust
A more robust approach treats verification as a separate, adversarial process: a second system designed to be sceptical, not helpful, checking output against authoritative sources directly.
The more deterministic and strict the checker, the more useful it is. Does this case exist? Does it say what the AI claims it says? Binary lookups against a trusted source like gov.uk or PubMed give you a result that doesn't require interpretation.
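The two questions above ("does it exist?" and "does it say that?") can be sketched as a strict lookup. This is a minimal illustration, not how any particular product works: `TRUSTED_INDEX` is a hypothetical in-memory stand-in for an authoritative source, the case name in it is fictional, and the exact-substring match is a deliberately harsh assumption.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a trusted source. The entry is fictional.
TRUSTED_INDEX = {
    "Example v Sample [2020]": "Placeholder text: the clause was held to be unenforceable.",
}


@dataclass
class Claim:
    citation: str      # what the AI output cites
    quoted_text: str   # what the AI output says the source states


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def check_claim(claim: Claim, index: dict = TRUSTED_INDEX) -> str:
    """Return NOT_FOUND, MISQUOTED or VERIFIED. No 'probably fine' verdict exists."""
    source_text = index.get(claim.citation)
    if source_text is None:
        return "NOT_FOUND"   # the cited source is not in the trusted index
    if normalise(claim.quoted_text) not in normalise(source_text):
        return "MISQUOTED"   # the source exists but does not contain this text
    return "VERIFIED"
```

A paraphrase would come back as `MISQUOTED` here. That is intentional in this sketch: a sceptical checker flags anything it cannot confirm and leaves the interpretation to a human.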
This is a critical difference in how tools like Verbatim Lite work: the tool takes a first pass of the document and, instead of trying to discover "is this correct?", it asks "what's WRONG with this?"
5: AI hallucinations often amplify human bias
Even if you get your own systems under control, the outsider problem is very real.
I've spoken to lawyers, doctors, grant assessors and a vet who all told me the same thing: there's been a massive uptick in AI slop from their clients.
Examples include:
- Lawyers receiving AI-drafted documents that misapply the law and have to be challenged
- Grant assessors for credits getting applications with bad data embedded in them, having to reject and re-do.
- A vet receiving an AI-produced cancer diagnosis for the owner's dog, with the owner so convinced he demanded for days that the oncologist review and confirm / reject the whole thing
- My favourite: an employment lawyer advising on an acrimonious exit. ChatGPT fuelled the client's anger and pushed them toward grievance claims. The lawyer asked what they really valued long term and it was far simpler: a tighter, limited non-compete. Following ChatGPT's advice would have meant an expensive, relationship-damaging fight. Luckily, the client took the human advice.
6: Expert judgement is a valuable second layer
Automated checkers catch specific, verifiable errors: citations that don't exist, figures that don't match a source. They are less good at errors of reasoning or commercial judgement - the stuff a senior professional feels is wrong before they can articulate why. An experienced partner may sense that a legal argument doesn't hold before they've traced the flaw. A grizzled illustrator can tell you that Nano Banana is rendering subtle inconsistencies in major characters across a 32-page kids' book.
So how do we mitigate hallucinations?
No single intervention solves this, but combined, they can be highly effective:
- Retrieval Augmented Generation - when compiling answers in critical contexts, force the model to check against trusted sources. This can materially reduce the risk of the hallucination happening in the first place
- A first-pass check that is adversarial by design - sceptical, binary, surfacing obvious errors to be corrected
- Pick the right model and limit the scope - counterintuitively, some of the "lite" models are more accurate on certain tasks. Gemini Flash is very cheap and fast - but it's also more accurate. I'd trust something like Flash to search a document over, say, Opus 4.8.
- Cognitive Forcing Functions: This is particularly interesting for junior employees as well as scenarios where accuracy is critical. In short, academic research shows that if you design the workflow in a very particular way, you can "force" people to think about something, tripling the chance that errors are caught
- Expert human review as a final layer - by someone with enough domain knowledge to sense when something doesn't smell right. This is expensive, naturally, but highly worthwhile for those edge or unusual cases
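As an illustration of the cognitive forcing idea in the list above, here is one possible workflow design. The post doesn't specify which design the research used, so this is my assumption: the reviewer must commit their own answer before the AI output is revealed. The `ForcedReview` class and its method names are hypothetical.

```python
from typing import Optional


class ForcedReview:
    """Hides the AI answer until the reviewer has committed an answer of their own."""

    def __init__(self, question: str, ai_answer: str):
        self.question = question
        self._ai_answer = ai_answer
        self.reviewer_answer: Optional[str] = None

    def commit(self, reviewer_answer: str) -> None:
        """Record the reviewer's independent answer. Blank answers are rejected."""
        if not reviewer_answer.strip():
            raise ValueError("Write your own answer before seeing the AI output")
        self.reviewer_answer = reviewer_answer.strip()

    def reveal(self) -> dict:
        """Show both answers side by side, but only after commit() has been called."""
        if self.reviewer_answer is None:
            raise PermissionError("Commit your own answer first")
        return {
            "reviewer": self.reviewer_answer,
            "ai": self._ai_answer,
            # Simple case-insensitive comparison. A mismatch means the
            # reviewer has to reconcile the two answers.
            "needs_reconciling": self.reviewer_answer.lower() != self._ai_answer.strip().lower(),
        }
```

The point of the gate is that the reviewer has to think first, rather than read passively, before the plausible-looking AI answer can anchor them.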
Fixing human-in-the-loop workflows - including verification - is something we are focussed on at Monboard. Drop us a ping if you want to know more.