
MIT Proposed a Metric That Detects Confident Errors and LLM Hallucinations


AI-processed from MIT News; edited by Hamidun News
Source: MIT News. Collage: Hamidun News.

Researchers at MIT have proposed a new way to measure uncertainty in large language models and more accurately detect situations when an AI responds confidently but makes a mistake. The idea is simple: users need to see not only a polished answer, but also a signal for how much this confidence can actually be trusted, especially when dealing with tasks where errors carry real consequences.

Why Old Metrics Fail

Today, one of the popular ways to check LLM reliability is to ask the same question multiple times and see if the model answers consistently. If the answers match, this is often interpreted as high confidence. The problem is that this check only measures the model's internal consistency.

It shows how confident the model is in itself, but doesn't tell us whether it's actually right. For an interface, this is a convenient signal, but not always a useful one. This is where a dangerous scenario emerges: the model can repeatedly produce the same incorrect answer while maintaining the appearance of reliability.
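
To make the failure mode concrete, here is a minimal sketch of a self-consistency check. It assumes the simplest possible variant (exact match after lowercasing and trimming), and the function name and sample answers are hypothetical, not taken from the MIT work.

```python
from collections import Counter


def self_consistency(answers):
    """Share of sampled answers that match the most common answer (1.0 = fully consistent)."""
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Hypothetical samples: asked for the capital of Australia,
# the model repeats the same wrong answer every time.
samples = ["Sydney", "sydney", "Sydney ", "Sydney"]
score = self_consistency(samples)
is_correct = samples[0].strip().lower() == "canberra"
print(score, is_correct)  # prints: 1.0 False
```

The score is perfect even though the answer is wrong, which is exactly the gap a consistency-only check cannot see.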

For users, this is especially risky in tasks where errors are costly, for example in medicine, finance, or applied analytics. In such cases, a confident hallucination looks more convincing than a cautious but honest answer with caveats. That's why the researchers decided to measure not only the model's self-confidence, but also whether that model is even the right choice for the given question.

How the New Method Works

The MIT team augmented the familiar self-consistency metric with another signal: disagreement between models. Instead of only asking the same LLM multiple times, the researchers also compare the target model's answer with answers from a small group of similar models of comparable scale and architectural class. If those models diverge substantially in meaning, this becomes an important indicator that the original answer may be unreliable, even if the target model itself sounds very confident. Importantly, the comparison measures not literal wording match but the semantic closeness of the answers, which better reflects genuine agreement or disagreement between models than simple word-by-word matching.

According to the researchers, in practice, an unexpectedly simple variant worked best: using models created by different companies. More complex ensemble selection schemes were tested but offered no advantage over this straightforward and transparent strategy.

  • First, the target model whose answer needs to be evaluated is selected.
  • Then, the same query is addressed to several similar LLMs.
  • After that, the system measures how much the answers align semantically.
  • This metric is combined with the standard self-consistency metric.
  • The output is a total uncertainty score.

The authors call the second component, cross-model disagreement, epistemic uncertainty: it shows how well suited the chosen model is to the specific task. Combined with aleatoric uncertainty, which reflects the instability of the model's own answers, it gives a more complete picture of risk. In simple terms, the system checks both whether the model contradicts itself and whether it diverges from other plausible models. The method works in a black-box format: it requires only text answers, without access to logits or the model's internal states.
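
The two signals can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' implementation: the word-overlap `similarity` function is only a placeholder for a real semantic similarity measure, the plain sum used to combine the components is an assumption (the article does not say how they are combined), and all function names and sample answers are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Placeholder for semantic similarity: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def aleatoric_uncertainty(target_samples):
    """Self-inconsistency: mean pairwise dissimilarity among the target model's own samples."""
    pairs = list(combinations(target_samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)


def epistemic_uncertainty(target_answer, peer_answers):
    """Cross-model disagreement: mean dissimilarity between the target answer and each peer answer."""
    if not peer_answers:
        return 0.0
    return sum(1.0 - similarity(target_answer, p) for p in peer_answers) / len(peer_answers)


def total_uncertainty(target_samples, peer_answers):
    """Assumed combination: a plain sum of the two components."""
    return (aleatoric_uncertainty(target_samples)
            + epistemic_uncertainty(target_samples[0], peer_answers))


# Hypothetical text-only answers: the target model is perfectly self-consistent,
# but two of three peer models disagree with it.
target_samples = ["the capital is sydney"] * 3
peer_answers = [
    "the capital is canberra",
    "the capital is canberra",
    "the capital is sydney",
]
print(aleatoric_uncertainty(target_samples))                    # 0.0: no self-contradiction
print(epistemic_uncertainty(target_samples[0], peer_answers))   # above zero: peers diverge
print(total_uncertainty(target_samples, peer_answers))
```

Only answer strings are passed in, which mirrors the black-box setting: nothing here needs logits or internal states. In practice the placeholder would be swapped for a proper semantic comparison, since word overlap misses paraphrases.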

Where the Method Is Most Useful

Researchers tested the combined metric on ten realistic tasks, including question-answering scenarios, summarization, translation, and mathematical reasoning. In the main series of experiments, they compared several instruction-tuned models, with separate tests on API models as well. In these tests, the total uncertainty score detected unreliable answers more reliably than either component alone. The new approach performed particularly well in tasks where there is one correct answer, such as factual Q&A or translation.

If a model repeats the same answer many times, that doesn't necessarily mean the answer is correct.

There is also an important limitation. For more open-ended tasks where multiple good answers are acceptable, the inter-model disagreement signal may be less useful. In other words, when the task is free-form generation rather than factual recall, disagreement between models does not by itself indicate an error. The authors explicitly note that in the future they want to adapt the technique for such scenarios and separately investigate other ways of assessing the model's internal uncertainty.

Another practical advantage is computational savings. In some experiments, calculating total uncertainty required fewer queries than the traditional assessment through self-consistency alone. This means not only lower inference costs, but also potentially lower energy consumption when such checks are used at scale. For production, this is an important argument: if a metric is both more accurate and cheaper, it has a much better chance of making it into real AI products rather than remaining a purely academic idea.

What It Means

For the industry, this is a step from assessing "how confident does the model sound" to assessing "how much can this confidence be trusted." If the approach takes hold in production, AI services will be able to more accurately warn about hallucinations, and users will be less likely to accept a convincing-sounding error as a credible answer. This is especially important for all scenarios where LLMs already function not as a toy, but as a working tool that influences decisions, money, and daily processes in a company.
