Few innovations in recent years have reshaped our world as rapidly as large language models (LLMs). Their ability to converse in natural language and deliver seemingly coherent answers to any question thrown their way gives the impression of true intelligence. But as we race to integrate them into our workflows, it’s worth pausing to ask: are we giving them more credit than they truly deserve?
This post explores one of the biggest limitations of LLMs: their tendency to hallucinate. It covers what hallucinations are, why they happen, and how to work around them. Because, as AI becomes more deeply embedded in how we live and work, understanding its weaknesses as well as its strengths will be key to realising its full potential.
Confidently Wrong
Hallucinations occur when an LLM produces plausible-sounding but incorrect information. Sometimes it’s a minor factual error resulting from inaccuracies in its training data. Other times it’s a complete fabrication – an entirely false statement with no basis in its training data at all. Models have been caught inventing case law for legal briefs, generating fictional citations for academic papers, and convincingly summarising documents they’ve never seen.
More troubling, perhaps, is the confidence with which these models present incorrect information. Responses come across as smooth, authoritative, and impeccably structured – it’s easy to take them at face value. This is not by accident but a function of how they are trained. LLMs are built to predict the most likely continuation of text, not necessarily the most accurate one. They are also fine-tuned to be agreeable, often conditioned to produce the response that feels most satisfying to the user. When faced with a question outside their knowledge, they fill in the gaps with content that sounds convincing.
An Unsolvable Problem?
Of course, not all models hallucinate at the same rate, and performance varies depending on the task. Benchmarks such as Vectara’s evaluation of models summarising news articles show hallucination rates ranging from under 1% to nearly 30%.
Surprisingly, it isn’t simply a matter of picking the newest or most powerful model. In fact, so-called “reasoning models,” designed to break problems into steps, often perform worse. OpenAI itself has reported that its o3 and o4-mini models are substantially more prone to hallucination than some earlier versions.
Many experts now believe that while hallucinations can be minimised, they will never be fully eliminated. This is because errors are possible at every stage: training data is always incomplete, retrieval and classification are imperfect, and the generative process itself is inherently probabilistic.
Working Smarter With LLMs
Although hallucinations may be here to stay, there are practical steps you can take to reduce their impact.
Use Retrieval-Augmented Generation (RAG)
RAG systems improve reliability by letting an LLM pull in fresh, relevant information before generating a response, rather than relying solely on its training data. They typically work by searching a database or API for documents related to your query, then passing that context to the model to use in its answer.
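To make the pattern concrete, here is a minimal sketch of a RAG loop in Python. The in-memory document list and keyword scorer are stand-ins for a real vector database, and the model name and example documents are placeholders; the point is the shape of the loop: retrieve first, then ground the prompt in what you retrieved.

```python
# A minimal sketch of the RAG pattern: retrieve relevant snippets first, then
# ground the model's answer in them. The in-memory document list and keyword
# scorer below are deliberate stand-ins for a real vector database.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

documents = [
    "Acme Corp reported Q2 revenue of $12.4m, up 8% year on year.",
    "Acme Corp opened a new distribution centre in Leeds in March.",
    "The Acme board approved a share buyback programme in May.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query, documents))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How did Acme perform in Q2?"))
```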
The field is moving fast, too. The French startup Linkup is developing an API that gives developers access to up-to-date web content from premium, trusted outlets, which can then be fed into an LLM to enrich its outputs. Knowledge graph technology such as Neo4j’s GraphRAG takes this approach even further. Instead of feeding the model disconnected text snippets, a knowledge graph organises information into entities and relationships, mapping how facts connect to one another. This gives the model a much clearer picture of context to draw on and has been shown to improve the accuracy of generalist LLMs like ChatGPT by over 25%.
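The underlying idea is easy to sketch: instead of retrieving loose text snippets, you query the graph for an entity and its relationships, then serialise those connected facts into the prompt. The snippet below is a hedged illustration using the official Neo4j Python driver with a made-up Company schema; it is not the GraphRAG library itself, just the principle behind it.

```python
# A sketch of graph-based retrieval: pull a node and its relationships from a
# knowledge graph and serialise the connected facts into prompt context.
# The (:Company)-[r]->(other) schema is hypothetical, and this is not Neo4j's
# GraphRAG library itself, just the underlying idea.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_context(company: str) -> str:
    query = (
        "MATCH (c:Company {name: $name})-[r]->(other) "
        "RETURN type(r) AS rel, other.name AS target LIMIT 10"
    )
    with driver.session() as session:
        rows = session.run(query, name=company)
        facts = [f"{company} {row['rel']} {row['target']}" for row in rows]
    return "\n".join(facts)

# The resulting lines ("Acme SUPPLIES Globex", ...) are prepended to the prompt
# exactly as the text snippets were in the RAG sketch above, giving the model
# explicit relationships rather than disconnected passages.
```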
Leverage vertical or domain-specific models
General-purpose LLMs are designed to handle a wide range of topics, which makes them more prone to hallucinations when venturing into niche areas. Domain-specific models, by contrast, are fine-tuned on data tailored to a particular industry or field. Because they’re trained on specialised knowledge, they can deliver outputs that are more accurate, consistent, and context-aware.
Early adopters include fields where reliability is most critical, such as law and medicine, but other industries are catching on too. BloombergGPT was custom-trained for finance, while Walmart built Wallaby for supply chain and customer service tasks using decades of internal data. One startup in the legal space, Luminance, trained its proprietary “Legal Pre-Trained Transformer” on over 150 million legal documents to achieve greater accuracy in analysing contracts and other legal content.
Build guardrails – both technological and human
As well as reducing the frequency of hallucinations, it is important to build guardrails – both technological and human – that catch inaccuracies in outputs as early as possible.
On the technology side, startups like Qualifire provide real-time guardrails to detect and mitigate hallucinations, while also addressing unsafe content, jailbreak attempts, and policy breaches. Tools like this act as automated checks, catching problems before they reach end users and giving teams greater visibility into model behaviour.
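Commercial guardrail products differ in their details, but the underlying pattern is simple: intercept the model’s answer before it reaches the user and check it against the evidence it was supposed to rely on. The sketch below uses a second model call as a judge; it illustrates the general idea rather than how Qualifire or any specific product actually works.

```python
# A minimal, hedged illustration of an automated guardrail: before an answer
# reaches the user, ask a second model whether every claim in it is supported
# by the retrieved context. This shows the general idea, not any vendor's API.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str) -> bool:
    """Return True if a judge model finds no unsupported claims in the answer."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                "Does the ANSWER contain any claim that is not supported by the "
                "CONTEXT? Reply with exactly 'grounded' or 'ungrounded'.\n\n"
                f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower() == "grounded"

# In a pipeline, the answer is only shown when is_grounded(...) returns True;
# otherwise it is routed to a fallback message or a human reviewer.
```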
Not every safeguard needs to be advanced, though. Simple practices, such as asking the model to provide sources or explain its reasoning, are valuable when using LLMs for research. For important tasks, it’s also essential to independently verify that those sources actually exist – and that they say what the model claims. These habits not only catch errors, but also help you recognise the model’s blind spots, making it easier to apply the technology responsibly in the future.
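As a small illustration, a research prompt along these lines already makes hallucinated citations easier to spot; the wording and topic are just examples.

```python
# An example research prompt that bakes in the habits above: ask for sources,
# invite the model to admit uncertainty, and then verify citations yourself.
prompt = (
    "Summarise the current evidence on four-day work weeks. "
    "Cite a specific, verifiable source (title, author, year) for every claim, "
    "and say clearly when you are unsure or cannot find a source."
)
# Before relying on the answer, look up each cited source independently and
# confirm that it exists and actually supports the claim attributed to it.
```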
Final Thoughts
Large language models may be inherently flawed, but that doesn’t make them useless. They will sometimes make things up, misrepresent the truth, or present information with misplaced confidence – but then again, so do humans. The key is not to expect perfection, but to understand how they work, recognise their limitations, and put strategies in place to mitigate their weaknesses.