
The AI Hallucination Problem

While AI hallucination rates have improved, they remain a significant risk for professionals in regulated sectors where verification is often undermined by human cognitive bias. Relying on "check your work" instructions is insufficient, as passive review fails to catch sophisticated errors. Robust solutions require an adversarial, automated verification process combined with expert human judgement to ensure accuracy.


AI hallucination hasn’t gone away: it’s just morphed in nature

Hallucination isn’t as bad as it used to be. The headline numbers look good: several vendors now report sub-5% hallucination rates, with some landing between 1.8% and 4% on summarisation tasks (Vectara leaderboard). IIRC, the early days of ChatGPT were 30-40%, so roughly a 10x improvement.

That said, a 1-in-25 error rate is not a rounding error when the output is a document that has to survive regulatory scrutiny. Plus, the most widely used consumer models are still high-error (typically around 10-15%).
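To put the 1-in-25 figure in context, here is a back-of-envelope sketch. It assumes the 4% rate applies to each output and that errors are independent. Both are simplifications on my part, since the leaderboard figure is measured on summarisation tasks specifically.

```python
def chance_of_at_least_one_error(per_output_rate: float, outputs: int) -> float:
    """Probability that at least one of `outputs` independent outputs contains an error."""
    return 1 - (1 - per_output_rate) ** outputs


# Assumption: a flat 4% (1-in-25) error rate per output, errors independent.
for n in (1, 10, 25, 50):
    risk = chance_of_at_least_one_error(0.04, n)
    print(f"{n:>3} outputs: {risk:.0%} chance of at least one error")
```

Under those assumptions, a team producing 25 AI-assisted outputs has roughly a 64% chance that at least one contains a hallucination.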

Even if the rate continues to drop, there are still some thorny issues.


1: Regulated professions are carrying disproportionate risk

For most users, a hallucination means wasted time or mild embarrassment. For, say, a lawyer or sustainability officer, it can lead to lost business, financial penalties or even a ban from the profession.

There have been multiple cases now of professionals receiving serious penalties relating to improper AI use. Some of them even involve well-intentioned employees at respected firms, using top-tier tools which warn the user to check their work / sources. So this isn't outright negligence.


2: "Check your work" doesn't work

I see this everywhere: tell employees to verify AI outputs. Most governance frameworks say this. The regulations say this. Every vendor says this.

The problem is that humans are terrible at scrutinising things from trusted sources.

The Law Society put a name to it at LegalGeek Growth recently: "sedation risk." Reviewing AI output you didn't write, in a domain you know well, involves a cognitive dynamic closer to passive reading than active analysis. The output looks plausible. The formatting is good. The citations look real. Your brain stops interrogating it.

AI verification is mentally taxing for most people, which is why many end up just relying on the AI. The tax is particularly heavy because LLM output is slightly different each time.


3: RAG helps but has limitations

RAG is the dominant solution for specialist tools. Retrieval Augmented Generation basically forces the generative AI to check its output against a specific, curated information set rather than general training data. This is naturally better - it constrains output and reduces errors on in-scope questions considerably. One founder I spoke to is so confident in his RAG processes that he guarantees data quality to clients on that basis. That’s bold.

However, RAG systems can still hallucinate, especially when the query falls outside the typical scope. Plus, there’s an inherent limitation to a system checking its own work - similar to humans being somewhat blind to their own errors.
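A toy sketch of the retrieval step shows where the scope problem comes from. The corpus, the word-overlap scoring and the threshold are all made up for illustration; real systems typically use embeddings and a vetted document store. The point is the `None` branch: if nothing relevant is retrieved and the system answers anyway, the model is back to relying on general training data.

```python
import re
from typing import Optional

# Hypothetical curated corpus (illustrative only).
CORPUS = {
    "policy-001": "Annual leave entitlement is 25 days per year for full-time staff.",
    "policy-002": "Expense claims must be submitted within 30 days of purchase.",
}


def tokens(text: str) -> set:
    """Lower-case word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, min_overlap: int = 3) -> list:
    """Return (doc_id, text) pairs sharing at least `min_overlap` words with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), doc_id, text) for doc_id, text in CORPUS.items()]
    return [(doc_id, text) for score, doc_id, text in sorted(scored, reverse=True)
            if score >= min_overlap]


def build_prompt(query: str) -> Optional[str]:
    """Build a grounded prompt, or return None when the query is out of scope."""
    hits = retrieve(query)
    if not hits:
        return None  # refuse rather than let the model improvise
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Answer using only the sources below and cite their IDs.\n{context}\n\nQuestion: {query}"
```

An in-scope question such as "How many days of annual leave do staff get?" produces a prompt grounded in `policy-001`, while "What is the capital gains tax rate?" returns `None`.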


4: A separate, adversarial checker is more robust

A more robust approach treats verification as a separate, adversarial process: a second system designed to be sceptical, not helpful, checking output against authoritative sources directly.

The more deterministic and strict the checker, the more useful it is. Does this case exist? Does it say what the AI claims it says? Binary lookups against a trusted source like gov.uk or PubMed give you a result that doesn't require interpretation.

This is a critical difference in how tools like Verbatim Lite work: they take a first pass of the document and, instead of trying to discover “is this correct?”, they ask “what’s WRONG with this?”
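As a rough illustration of the binary-lookup idea (not a description of how any particular product works), the sketch below checks each citation in a draft against a trusted source and only reports problems. The citation format, the `TRUSTED_SOURCE` dictionary and the function names are hypothetical; a real checker would query an authoritative source such as gov.uk or PubMed directly.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for an authoritative source.
TRUSTED_SOURCE = {
    "REG-2021-014": "Firms must retain client records for six years.",
    "REG-2019-203": "Annual reports must be filed within nine months of year end.",
}

CITATION = re.compile(r"\[(REG-\d{4}-\d{3})\]")  # made-up citation format


@dataclass
class Finding:
    citation: str
    problem: str


def find_problems(draft: str, quotes: dict) -> list:
    """List what is wrong with the draft; an empty list means nothing was flagged."""
    findings = []
    for citation in CITATION.findall(draft):
        source_text = TRUSTED_SOURCE.get(citation)
        if source_text is None:
            # Does this source exist? Binary answer: no.
            findings.append(Finding(citation, "citation not found in trusted source"))
        elif citation in quotes and quotes[citation] not in source_text:
            # Does it say what the draft claims it says? Binary answer: no.
            findings.append(Finding(citation, "quoted text does not appear in the source"))
    return findings
```

Note that an empty result only means "nothing flagged", not "correct": exact-match lookups will not catch errors of reasoning.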


5: AI hallucinations often amplify human bias

Even if you get your own systems under control, the outsider problem is very real.

I've spoken to lawyers, doctors, grant assessors and a vet who all told me the same thing: there's been a massive uptick in AI slop from their clients.

Examples include:

  • Lawyers receiving AI-drafted documents that misapply the law and have to be challenged
  • Grant assessors for credits getting applications with bad data embedded in them, having to reject and re-do.
  • A vet receiving an AI-produced cancer diagnosis for the owner’s dog, with the owner so convinced he demanded for days that the oncologist review and confirm / reject the whole thing
  • My favourite: an employment lawyer advising on an acrimonious exit. ChatGPT fuelled the client’s anger and pushed them toward grievance claims. The lawyer asked what they really valued long term and it was far simpler: a tighter, limited non‑compete. Following ChatGPT’s advice would have meant an expensive, relationship‑damaging fight. Luckily, the client took the human advice.

You can safeguard your tools and team. You cannot do this for clients or partners.

6: Expert judgement is a valuable second layer

Automated checkers catch specific, verifiable errors: citations that don't exist, figures that don't match a source. They are less good at errors of reasoning or commercial judgement - the stuff a senior professional feels is wrong before they can articulate why. An experienced partner may sense that a legal argument doesn't hold before they've traced the flaw. A grizzled illustrator can tell you that Nano Banana is rendering subtle inconsistencies in major characters across a 32-page kids’ book.


So how do we mitigate hallucinations?

No single intervention solves this, but combined, they can be highly effective:

  • Retrieval Augmented Generation - when compiling answers in critical contexts, force the model to check against trusted sources. This can materially reduce the risk of the hallucination happening in the first place
  • A first-pass check that is adversarial by design - sceptical, binary, surfacing obvious errors to be corrected
  • Pick the right model and limit the scope - counterintuitively, some of the “lite” models are more accurate on certain tasks. Gemini Flash is very cheap and fast - but it’s also more accurate. I’d trust something like Flash to search a document over, say, Opus 4.8.
  • Cognitive Forcing Functions: This is particularly interesting for junior employees as well as scenarios where accuracy is critical. In short, academic research shows that if you design the workflow in a very particular way, you can “force” people to think about something, tripling the chance that errors are caught
  • Expert human review as a final layer - by someone with enough domain knowledge to sense when something doesn't smell right. This is expensive, naturally, but highly worthwhile for those edge or unusual cases

Doing these in the above order can also substantially reduce cost. Don’t make your senior specialist doctor, for example, run first-pass checks on AI-generated letters from patients - put it through a filter of some sort first.
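The ordering point can be sketched as a simple triage step: cheap automated checks run first, and only documents that pass them reach the expensive expert layer. The two checks and the markers they look for are placeholders I've invented for illustration; in practice they would be whatever first-pass filters suit your domain.

```python
from typing import Callable, Dict, List, Tuple

# Each check returns a list of problems found (empty if none). Illustrative only.
Check = Callable[[str], List[str]]


def has_unverified_marker(doc: str) -> List[str]:
    return ["contains an unverified claim marker"] if "[UNVERIFIED]" in doc else []


def missing_citation(doc: str) -> List[str]:
    return [] if "[source:" in doc else ["no source cited"]


def triage(docs: Dict[str, str], checks: List[Check]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Run cheap checks first; return (rejected_with_reasons, needs_expert_review)."""
    rejected: Dict[str, List[str]] = {}
    for_expert: List[str] = []
    for name, text in docs.items():
        problems = [p for check in checks for p in check(text)]
        if problems:
            rejected[name] = problems   # sent back for correction, no expert time spent
        else:
            for_expert.append(name)     # only filtered drafts reach the expensive layer
    return rejected, for_expert
```

The expert still reviews everything that gets through; the filter just stops them spending time on drafts with errors a machine could have caught.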

Fixing human-in-the-loop workflows - including verification - is something we are focussed on at Monboard. Drop us a ping if you want to know more.

