Hugging Face Blog→ original

TII introduced QIMMA — an Arabic LLM leaderboard with benchmark quality checks


AI-processed from Hugging Face Blog; edited by Hamidun News

TII has launched QIMMA, a new leaderboard for Arabic LLMs that changes how models are evaluated: the team first verifies the quality of the benchmarks and only then publishes results. The project's authors showed that even well-known Arabic datasets contain systematic errors that distort final scores.

What is QIMMA

QIMMA combines 109 subsets from 14 original benchmarks into a unified evaluation system with over 52,000 examples. Coverage is broad: culture, STEM, law, medicine, security, poetry and literature, as well as programming. According to the authors, 99% of the content in the dataset is originally in Arabic, not translated from English.

This matters because translated tests often break natural context, make phrasing awkward, and give models tasks that poorly reflect real Arabic language use. QIMMA therefore positions itself not as just another leaderboard but as an attempt to address several longstanding problems in Arabic NLP: fragmented leaderboards, weak reproducibility, no published per-example results, and unverified gold answers. The authors emphasize one further distinction: this is the first Arabic leaderboard with built-in code evaluation.

To achieve this, the team added adapted Arabic versions of HumanEval+ and MBPP+, which test not only language knowledge but also a model's ability to understand programming tasks formulated in Arabic.

How validation works

The key part of the project is a two-stage validation pipeline. Before any model is run, each example is checked independently by two large models: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. Each judge scores the example against ten binary criteria, giving a score out of 10. If at least one judge scores an example below 7 out of 10, it is flagged as problematic. When both judges agree that it is problematic, the example is excluded immediately; when they disagree, it goes to manual review by native speakers familiar with regional and dialectal nuances.
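The triage rule described above can be sketched in a few lines. This is only an illustration of one reading of the article, not the QIMMA code: the names `Verdict`, `score_from_criteria`, and `triage` are hypothetical, and the sketch assumes each judge's score is simply the count of binary criteria it marked as satisfied.

```python
from enum import Enum

PASS_THRESHOLD = 7  # "less than 7 out of 10" flags an example, per the article


class Verdict(Enum):
    KEEP = "keep"
    DROP = "drop"
    HUMAN_REVIEW = "human_review"


def score_from_criteria(criteria: list[bool]) -> int:
    """Assumption: a judge's score is the number of binary criteria it marked as met."""
    return sum(criteria)


def triage(score_a: int, score_b: int, threshold: int = PASS_THRESHOLD) -> Verdict:
    """Combine two independent judge scores (0-10) into one decision."""
    a_flags = score_a < threshold
    b_flags = score_b < threshold
    if a_flags and b_flags:
        return Verdict.DROP          # both judges agree the example is problematic
    if a_flags or b_flags:
        return Verdict.HUMAN_REVIEW  # judges disagree, so a native speaker decides
    return Verdict.KEEP              # neither judge flagged the example


if __name__ == "__main__":
    judge_a = score_from_criteria([True] * 6 + [False] * 4)  # 6 of 10 criteria met
    judge_b = score_from_criteria([True] * 9 + [False] * 1)  # 9 of 10 criteria met
    print(triage(judge_a, judge_b).value)
```

In this sketch only the disagreement case reaches human reviewers, which keeps the manual workload limited to examples where the automated judges give conflicting signals.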

QIMMA checks benchmarks before evaluating models, so final scores reflect the true quality of Arabic LLMs.

For code benchmarks, the team took a different approach. Instead of removing tasks, the researchers rewrote the Arabic prompts without changing identifiers, reference solutions, or test sets. In HumanEval+ they corrected 145 of 164 prompts (88%), and in MBPP+ 308 of 378 (81%). The fixes addressed several aspects:

  • normalization of language to natural contemporary literary Arabic
  • removal of ambiguities and clarification of constraints
  • alignment of terminology, punctuation, and example format
  • correction of structural errors like broken lines and corrupted text fragments
  • clarification of meaning where ranges or conditions were ambiguous
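One way to guard the "identifiers unchanged" constraint when rewriting prompts is an automated check that the function signatures survive the edit. The sketch below is an assumption about how such a check could look, not the authors' tooling: the helper names and the sample prompts are invented for illustration, and it assumes each prompt is a Python function stub with a docstring, as in HumanEval-style tasks.

```python
import ast


def signatures(source: str) -> list[tuple[str, list[str]]]:
    """Return (function name, positional argument names) for each function in the source."""
    tree = ast.parse(source)
    return [
        (node.name, [arg.arg for arg in node.args.args])
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def identifiers_preserved(original: str, rewritten: str) -> bool:
    """True if the rewrite kept every function name and argument name intact."""
    return signatures(original) == signatures(rewritten)


# Hypothetical prompts: only the Arabic docstring wording differs.
ORIGINAL = '''
def add(a, b):
    """اجمع العددين."""
'''

REWRITTEN = '''
def add(a, b):
    """أعد مجموع العددين a و b."""
'''

# A faulty rewrite that renames the function and would break the test set.
RENAMED = '''
def sum_two(a, b):
    """أعد مجموع العددين a و b."""
'''

if __name__ == "__main__":
    print(identifiers_preserved(ORIGINAL, REWRITTEN))
    print(identifiers_preserved(ORIGINAL, RENAMED))
```

A check like this only covers names and arguments; the reference solutions and test sets would still need to be run unchanged against the rewritten prompts.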

What problems were found

The review showed that these were not isolated mistakes but recurring defects in the datasets themselves. For example, the team discarded 436 examples from ArabicMMLU (3.1% of the dataset) and 41 examples from MizanQA (2.3%). Some other datasets had lower defect rates, but the same pattern repeated: errors in the correct answers, unreadable text, duplicates, culturally disputed labels, and mismatches between the gold answer and the evaluation method. In other words, some popular Arabic benchmarks were being used as if they were error-free when they were not.

On the cleaned dataset, the leader was Qwen3.5-397B-A17B-FP8 with an average score of 68.06.

Karnak placed second with 66.20 and Jais-2-70B-Chat third with 65.81.

The authors point out that model size does not guarantee better results. Arabic-specialized models often do better on cultural and language tasks, while multilingual systems do better at coding: Qwen3.5-397B achieves the best results on both HumanEval+ and MBPP+.

So QIMMA is useful not just as a ranking but as a map of where different kinds of models are strong.

What this means

QIMMA makes a simple but important shift: comparing LLMs without verifying the tests themselves is no longer sufficient. For the Arabic market, this could become a new evaluation standard. For developers, it is a reminder that benchmark quality affects a model's reputation as much as the model itself does.
