AWS Machine Learning Blog→ original

Multimodal judges: how AWS evaluates the quality of image descriptions

AWS has added multimodal evaluators to Strands Evals, a tool for evaluating AI models. They check whether descriptions of images, invoices, and screenshots match the original image.

Source: AWS Machine Learning Blog. Collage: Hamidun News.

If you're building product image search, document recognition, or diagram analysis, you need a reliable way to verify model quality. AWS has introduced multimodal evaluators in Strands Evals that check how well a model's response matches the original image.

Why Text Evaluators Don't Work for Image-to-Text

Traditional evaluators work only with text. They compare the model's response to a reference answer, but don't see the image itself. This creates a blind spot: the evaluator cannot verify whether a product description contains precise details from the photo, whether the amount was correctly extracted from an invoice, or whether a screenshot was summarized correctly. A model can provide an answer that looks perfect on paper, but contradicts what's visible in the image. For example, an invoice recognition system might correctly identify a number format but get the actual value wrong if the digit on the document is blurry. A text evaluator won't catch this mistake.
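The blind spot can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the regular expression, and the invoice values are hypothetical and do not come from Strands Evals. The first check sees only the text, so a well-formatted but wrong amount passes; the second stands in for a check that has access to what the document actually shows.

```python
import re

# Hypothetical text-only check: it validates the format of an extracted
# amount but knows nothing about the document the amount came from.
AMOUNT_PATTERN = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")


def text_only_check(extracted: str) -> bool:
    """Return True when the string looks like a well-formed dollar amount."""
    return bool(AMOUNT_PATTERN.match(extracted))


def grounded_check(extracted: str, value_on_document: str) -> bool:
    """Return True only when the extracted amount equals the value on the document."""
    return extracted == value_on_document


# Assumed example values: the invoice shows $1,234.50, but the model
# misread a blurry digit and returned $1,284.50.
value_on_document = "$1,234.50"
model_output = "$1,284.50"

format_ok = text_only_check(model_output)
value_ok = grounded_check(model_output, value_on_document)
print(format_ok, value_ok)  # True False
```

In practice the value on the document is not available as a string, which is why a judge that can read the image itself is needed.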

How Multimodal Judges See the Full Context

AWS's new evaluators use multimodal large language models (MLLMs) that view the original image and the model's text response together. This lets the judge check not just grammar or style but whether the response actually corresponds to the image. Such a judge can:

  • Verify that a product description matches its appearance and color
  • Ensure that numbers and text extracted from a document are accurate
  • Assess whether information from a screenshot, diagram, or drawing was conveyed correctly
  • Detect hallucinations, where the model outputs information that doesn't appear in the image at all
  • Check the quality of translating text visible in the image
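The checks above could be wired together roughly as follows. This is a generic sketch, not the Strands Evals API: `JudgeRequest`, `build_request`, `parse_verdict`, `evaluate`, the rubric text, and the JSON verdict format are all hypothetical, and `stub_judge` merely stands in for a real call to a multimodal model.

```python
import base64
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeRequest:
    image_b64: str      # image bytes, base64-encoded for transport
    response_text: str  # the model output being judged
    instructions: str   # rubric given to the judge


# Assumed rubric and verdict format, chosen only for this sketch.
RUBRIC = (
    "Compare the response with the image. Reply with JSON: "
    '{"score": <0.0-1.0>, "hallucinations": [<claims not visible in the image>]}'
)


def build_request(image_bytes: bytes, response_text: str) -> JudgeRequest:
    """Bundle the image and the response so the judge sees both at once."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return JudgeRequest(encoded, response_text, RUBRIC)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and clamp the score to [0, 1]."""
    data = json.loads(raw)
    score = min(1.0, max(0.0, float(data["score"])))
    return {"score": score, "hallucinations": list(data.get("hallucinations", []))}


def evaluate(image_bytes: bytes, response_text: str,
             judge: Callable[[JudgeRequest], str]) -> dict:
    """Send image plus response to the judge and return a structured verdict."""
    return parse_verdict(judge(build_request(image_bytes, response_text)))


def stub_judge(request: JudgeRequest) -> str:
    # Stand-in for a real MLLM call: it pretends the image shows no leather strap.
    flagged = ["leather strap"] if "leather strap" in request.response_text else []
    return json.dumps({"score": 0.4 if flagged else 1.0, "hallucinations": flagged})


verdict = evaluate(b"fake-image-bytes", "A red watch with a leather strap", stub_judge)
print(verdict)
```

Passing the judge as a callable keeps the sketch independent of any particular model provider; a real judge would replace `stub_judge` with a function that sends the request to a multimodal model.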

Practical Application Across Industries

Multimodal evaluators are especially useful where recognition errors lead to losses. In e-commerce, companies train models to describe products from photos, and incorrect descriptions reduce conversion and increase returns. In financial analysis, an error in extracting an amount from an invoice can lead to audit mistakes. In information systems, incorrect document processing can block an entire business process. AWS integrated multimodal evaluators into Strands Evals so that developers can automatically verify, during development or testing, that their system reads visual data the way a human would.

What This Means for Developers

For ML engineers, this means they no longer need to manually verify samples of results. Quality evaluation can be automated and made more objective. Multimodal judges are becoming a standard tool for validating computer vision models, just as text metrics have long been used in NLP.
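One way to automate this is to aggregate judge scores over a test set and fail the run when too few cases pass. The sketch below assumes each case has already received a score between 0 and 1 from a judge; the function names, thresholds, and sample scores are hypothetical and are not taken from Strands Evals.

```python
from statistics import mean


def summarize(scores: list[float], pass_threshold: float = 0.8) -> dict:
    """Compute the mean score and the share of cases at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = [s >= pass_threshold for s in scores]
    return {"mean_score": mean(scores), "pass_rate": sum(passed) / len(scores)}


def gate(summary: dict, min_pass_rate: float = 0.9) -> bool:
    """Return True when enough cases passed to accept the model."""
    return summary["pass_rate"] >= min_pass_rate


# Assumed judge scores for four test images.
scores = [1.0, 0.9, 0.4, 0.85]
summary = summarize(scores)
release_ok = gate(summary)
print(summary["pass_rate"], release_ok)  # 0.75 False
```

A gate like this can run in a test suite or CI job, so a regression in image understanding blocks a release instead of waiting for a manual review.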

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.