
AWS and Artificial Genius demonstrated a way to reduce LLM hallucinations in finance and medicine

AWS and Artificial Genius proposed a scheme for banking, medicine, and other regulated industries in which an LLM doesn't generate an answer but extracts or verifies…

AI-processed from AWS Machine Learning Blog; edited by Hamidun News
Source: AWS Machine Learning Blog. Collage: Hamidun News.

AWS, together with its partner Artificial Genius, has demonstrated how to adapt large language models for tasks where errors are unacceptable. The approach is built on Amazon Nova and SageMaker, but the key idea is not a larger model. It is using the model's language understanding without free-form answer generation.

Why This Is a Problem

For financial services, medicine, insurance, and legal processes, ordinary LLMs still look risky. They write, summarize, and explain well, but by their nature they remain probabilistic systems: the model predicts the next token rather than extracting a guaranteed correct fact. This leads to hallucinations, that is, answers that sound convincing but are not supported by the source data. In an environment where audit, reproducibility, and accountability matter, this mode of operation is a poor fit for production.

The authors of the paper suggest viewing the evolution of AI in three steps. The first wave was built on symbolic logic and rigid rules: such systems were deterministic but too inflexible. The second wave, which includes modern transformers, brought a huge leap in fluency and language understanding, but unpredictability came with it. Artificial Genius calls its approach the third generation: the model still understands natural language like a modern LLM, but the final answer passes through deterministic logic and must not go beyond what is actually present in the input context.

How the Scheme Works

The main thesis of AWS and Artificial Genius is that a generative model can be used strictly non-generatively. That is, it does not "guess" the answer based on the probability of the next token. It checks whether the answer can be extracted from the document and, if not, refuses to answer. This mode is especially useful for questions about dates, amounts, names, excerpts from reports, or confirmation of a specific fact.

In the paper, this is formulated very directly:

"If the question cannot be answered from the document, the model should respond: 'Unknown'."
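
One way to picture this extract-or-refuse contract is a post-check that accepts a candidate answer only when it appears verbatim in the document. The following is a minimal sketch, not the authors' implementation: the function name `grounded_answer` and the case and whitespace normalization are assumptions made purely for illustration.

```python
import re

UNKNOWN = "Unknown"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def grounded_answer(candidate: str, document: str) -> str:
    """Return the candidate only if it occurs verbatim in the document.

    Hypothetical guard: anything that cannot be found in the source text
    is replaced by the refusal string.
    """
    cand = _normalize(candidate)
    if cand and cand in _normalize(document):
        return candidate.strip()
    return UNKNOWN


doc = "The loan agreement was signed on 12 March 2021 for a total of $4,500,000."
print(grounded_answer("12 March 2021", doc))  # present in the document
print(grounded_answer("15 April 2021", doc))  # absent, so the guard refuses
```

A substring check like this only proves that the text exists in the document, not that it answers the question, so it illustrates the refusal rule rather than a complete verifier.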

The implementation rests on a handful of choices:

  • as the base model, they chose Amazon Nova Lite, because it is better suited to short, clear answers without unnecessary verbosity;
  • fine-tuning is done in SageMaker through supervised fine-tuning, so that the model follows one system rule: don't make things up;
  • for training, they use a synthetic set of questions and answers that includes both answerable and intentionally unanswerable queries;
  • instead of classical RAG, which remains generative anyway, the emphasis is on a tighter connection between the document text and a specific question;
  • on top of this, the model is packaged in an agent platform where a free-form query can be translated into a stricter specification, and the only remaining manual check is at that translation stage.
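
The synthetic training set described above can be sketched as follows. This is an illustration under stated assumptions: the record layout (`system`, `user`, `assistant`), the wording of `SYSTEM_RULE`, and the helper names are hypothetical and do not reproduce the format the authors or SageMaker actually use.

```python
import json
import random

# Assumed paraphrase of the single system rule; not the authors' exact prompt.
SYSTEM_RULE = (
    "Answer only from the document. If the question cannot be answered "
    "from the document, respond: 'Unknown'."
)


def make_record(document: str, question: str, answer: str) -> dict:
    """One supervised fine-tuning example in a hypothetical chat-style layout."""
    return {
        "system": SYSTEM_RULE,
        "user": f"Document:\n{document}\n\nQuestion: {question}",
        "assistant": answer,
    }


def build_dataset(facts, distractor_questions, seed=0):
    """Mix answerable pairs with intentionally unanswerable ones.

    The caller must supply distractor questions that none of the
    documents can answer; this sketch does not check that.
    """
    rng = random.Random(seed)
    records = []
    for document, question, answer in facts:
        records.append(make_record(document, question, answer))
        # Pair the same document with a question it cannot answer.
        unanswerable = rng.choice(distractor_questions)
        records.append(make_record(document, unanswerable, "Unknown"))
    rng.shuffle(records)
    return records


facts = [
    ("Invoice 1001 was issued on 3 May 2023.",
     "When was invoice 1001 issued?", "3 May 2023"),
    ("The policy covers flood damage up to $20,000.",
     "What is the flood coverage limit?", "$20,000"),
]
distractors = ["Who approved the document?", "What is the customer's phone number?"]

dataset = build_dataset(facts, distractors)
for record in dataset:
    print(json.dumps(record))  # one JSON object per line, JSONL style
```

Keeping answerable and unanswerable examples side by side for the same document is one plausible way to teach the refusal behavior; the actual ratio used by the authors is not stated in the source.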

An important detail: the authors explicitly contrast their method with the familiar advice to "set temperature to zero." According to them, this doesn't solve the root problem, because the model still generates. In their version, what changes is not just the degree of randomness but the logic of using the model itself: probabilistic understanding is preserved on input, while on output the system aims for a binary mode, either responding only with what is confirmed by the text or honestly saying there is no answer.
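
The point about temperature can be illustrated with a toy softmax. The sketch below uses made-up logits for a hypothetical next-token choice; it is not taken from the paper and only shows that lowering the temperature makes the pick repeatable without making it grounded.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates for a date that is NOT in the document.
tokens = ["2019", "2020", "2021", "Unknown"]
logits = [1.2, 1.0, 1.1, 0.9]  # invented numbers: the model is nearly undecided

for t in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    best = tokens[probs.index(max(probs))]
    print(f"T={t}: picks {best!r}, probabilities {[round(p, 3) for p in probs]}")
```

In every case the top-scoring token is the same invented year. Near-zero temperature only removes the randomness around that choice; it does nothing to check the choice against the document.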

What the Tests Showed

Technically, the scheme looks rather down-to-earth and therefore interesting. Training data is stored in Amazon S3, fine-tuning of the base Nova model is done in SageMaker Training Jobs, and then the custom version is imported into Amazon Bedrock and delivered to the application through a standard inference pipeline. For corporate teams, this matters not only for convenience but also for data lineage transparency: it's easier to understand what data the model was trained on, where it was modified, and how it was then deployed in production.
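
The lineage idea can be made concrete with a small manifest that ties a deployed model back to its training data. This is a sketch under assumptions: the `LineageRecord` class, its fields, and the bucket URI are hypothetical placeholders, and no AWS API is called here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LineageRecord:
    """Hypothetical manifest linking a deployed model to its training data."""
    dataset_uri: str
    dataset_sha256: str
    base_model: str
    stages: list = field(default_factory=list)

    def add_stage(self, name: str, detail: str) -> None:
        """Append one pipeline step in the order it happened."""
        self.stages.append({"stage": name, "detail": detail})


def fingerprint(data: bytes) -> str:
    """Content hash, so a later audit can confirm which data was used."""
    return hashlib.sha256(data).hexdigest()


# Stand-in bytes for the real training file.
training_bytes = b'{"user": "...", "assistant": "Unknown"}\n'

record = LineageRecord(
    dataset_uri="s3://example-bucket/train.jsonl",  # placeholder, not a real bucket
    dataset_sha256=fingerprint(training_bytes),
    base_model="Amazon Nova Lite",
)
record.add_stage("storage", "training data stored in Amazon S3")
record.add_stage("fine-tuning", "SageMaker Training Job")
record.add_stage("deployment", "custom model imported into Amazon Bedrock")

print(json.dumps(asdict(record), indent=2))
```

A record like this answers the three audit questions from the paragraph above in one place: which data, which modification step, and which deployment path.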

The team also shared several engineering insights. For fine-tuning, they used LoRA to avoid damaging the model's base language understanding. In earlier experiments on another model, they even had to forcibly suppress chain-of-thought with the service token `</think>`, because detailed reasoning interfered with concise deterministic answers. For the Nova Lite version, the authors combined a LoRA dropout rate of 50%, manual early stopping, and an expansion of the synthetic dataset to 30 thousand examples. According to their data, this reduced the hallucination frequency from fractions of a percent in early configurations to 0.03% in the best variant.
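
To put the 0.03% figure in perspective, it corresponds to 3 hallucinated answers per 10,000. The sketch below computes such a rate under one assumed definition: an answer counts as a hallucination when it is neither the refusal string nor found verbatim in the document. The source does not say how the authors measured it, and the sample data here is invented.

```python
def hallucination_rate(results):
    """Share of answers that are neither a refusal nor found in the document.

    `results` is a list of (document, answer) pairs. This definition is an
    assumption for illustration; it ignores wrong-but-present spans and
    refusals on questions that could have been answered.
    """
    if not results:
        raise ValueError("no results to score")
    bad = 0
    for document, answer in results:
        if answer != "Unknown" and answer not in document:
            bad += 1
    return bad / len(results)


doc = "The contract was signed on 12 March 2021."

# Invented evaluation run: 10,000 answers, 3 of them unsupported by the text.
results = (
    [(doc, "12 March 2021")] * 6000
    + [(doc, "Unknown")] * 3997
    + [(doc, "15 April 2021")] * 3
)

rate = hallucination_rate(results)
print(f"{rate:.2%}")  # 3 / 10,000
```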

What This Means

The story matters not just to AWS users. It shows a broader shift: the market is starting to look not just for the "smartest" LLMs but for models with engineered behavioral boundaries. For banks, insurance companies, clinics, and legal-tech, this is a signal that AI adoption will increasingly be built around verifiability, answer refusal, and controlled workflows rather than around polished generation at any cost.
