Safety

Benchmark

A benchmark is a standardized test set used to measure and compare AI model performance on defined tasks. Benchmark scores provide a common scale for tracking progress across models and communicating capability claims to researchers, customers, and regulators.

In AI, a benchmark is a curated dataset of inputs paired with expected outputs or evaluation criteria, administered under consistent conditions to different models. Benchmarks span a wide range of tasks: reading comprehension (SQuAD), commonsense reasoning (HellaSwag), coding (HumanEval), mathematics (MATH), broad knowledge (MMLU, short for Massive Multitask Language Understanding), and safety-specific probes. A benchmark result is typically a single number, such as percentage accuracy or pass rate, which enables direct comparison across systems and over time.

Benchmark administration involves presenting the same prompts to each model under equivalent conditions (same token budget, same sampling temperature, same few-shot examples if any) and scoring responses against ground truth. Scoring methods vary by task: automatic exact-match for factual tasks, unit-test pass rates for code generation, or human or model-based judgment for open-ended tasks. Meta-benchmarks like HELM (Holistic Evaluation of Language Models, Stanford) and BIG-bench aggregate many individual benchmarks into composite profiles, reducing the risk of gaming any single metric.
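The administration procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness of any particular benchmark: the EvalConfig fields, the toy items, and the two stand-in "models" (plain functions) are all hypothetical, and exact-match scoring is only one of the scoring methods mentioned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalConfig:
    """Settings held constant for every model (values are illustrative)."""
    max_tokens: int = 32
    temperature: float = 0.0
    few_shot_prefix: str = ""


# A "model" here is any callable that maps (prompt, config) to a text answer.
Model = Callable[[str, EvalConfig], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not scored as an error."""
    return " ".join(text.lower().split())


def exact_match_accuracy(model: Model, items: List[Dict[str, str]], config: EvalConfig) -> float:
    """Present every item to the model under the same config and score by exact match."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = 0
    for item in items:
        prompt = config.few_shot_prefix + item["question"]
        prediction = model(prompt, config)
        if normalize(prediction) == normalize(item["answer"]):
            correct += 1
    return correct / len(items)


# Hypothetical four-item benchmark with ground-truth answers.
ITEMS = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Chemical symbol for gold?", "answer": "Au"},
    {"question": "Largest planet in the Solar System?", "answer": "Jupiter"},
]


def model_a(prompt: str, config: EvalConfig) -> str:
    """Stand-in model: a lookup table with one wrong answer."""
    answers = {
        "Capital of France?": "paris ",
        "2 + 2 = ?": "4",
        "Chemical symbol for gold?": "Ag",
        "Largest planet in the Solar System?": "Jupiter",
    }
    return answers.get(prompt, "")


def model_b(prompt: str, config: EvalConfig) -> str:
    """Stand-in model: always gives the same answer."""
    return "4"


shared_config = EvalConfig()  # the same settings are reused for both models
print("model_a:", exact_match_accuracy(model_a, ITEMS, shared_config))
print("model_b:", exact_match_accuracy(model_b, ITEMS, shared_config))
```

Passing one frozen configuration object to every model is the design point: any difference in the resulting scores then reflects the models rather than the test conditions.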

Benchmarks have been the primary vehicle for communicating AI progress because they enable objective, reproducible comparison of systems built by different organizations. GPT-4's MMLU score of approximately 86% in 2023 was a gain of about 16 percentage points over GPT-3.5 at roughly 70%, achieved within a single year, which illustrates the pace of capability advance during this period. Regulatory discussions increasingly reference benchmark performance as evidence of system capability and risk level, making benchmark methodology a matter of policy as well as research.

As of 2026, a well-recognized problem is benchmark saturation: leading models score near ceiling on established tests such as MMLU (90%+) and HumanEval (95%+), which reduces the tests' power to discriminate between models. The field has responded with harder successors, including MMLU-Pro, GPQA (Graduate-Level Google-Proof Q&A), LiveCodeBench for continuously refreshed coding problems, and ARC-AGI for abstract reasoning. Contamination, in which training data inadvertently contains benchmark questions, remains a persistent concern and has prompted interest in dynamic and held-out benchmarks maintained by organizations such as the U.S. AI Safety Institute (renamed the Center for AI Standards and Innovation in 2025).
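One simple way to screen for contamination is to measure how many word n-grams of a benchmark item also appear verbatim in a sample of training text. The sketch below assumes that heuristic; the function names, the default n of 8, and the example strings are illustrative choices, not the procedure of any specific lab or benchmark maintainer.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the training text.

    Returns 0.0 when the item is shorter than n words and has no n-grams.
    """
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0
    corpus_ngrams = word_ngrams(training_text, n)
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


# Hypothetical benchmark question and two hypothetical training snippets.
QUESTION = "What is the boiling point of water at sea level, in degrees Celsius?"
LEAKED = "Quiz key: What is the boiling point of water at sea level, in degrees Celsius? 100."
CLEAN = "Water boils at a lower temperature at high altitude because air pressure is lower."

print("leaked snippet overlap:", overlap_fraction(QUESTION, LEAKED))
print("clean snippet overlap:", overlap_fraction(QUESTION, CLEAN))
```

A high overlap flags an item for manual review rather than proving contamination, and paraphrased or translated leaks pass through a verbatim check like this one undetected, which is part of the appeal of held-out and regularly refreshed test sets.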

Example

A research team compares their new 7-billion-parameter language model against GPT-4o and Llama 3 on the MMLU benchmark, reporting accuracy scores across 57 academic subjects to demonstrate competitive broad-knowledge performance at a fraction of the parameter count.
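When accuracy is reported across many subjects, the headline number depends on how the per-subject results are combined. The sketch below contrasts two common conventions, micro-averaging (pool all questions) and macro-averaging (average the per-subject accuracies). The subject labels and counts are invented for illustration and are not real results for any model; which convention a given report uses should be checked against its methodology.

```python
from typing import Dict, Tuple

# Maps subject name to (number correct, number of questions).
SubjectResults = Dict[str, Tuple[int, int]]


def micro_accuracy(results: SubjectResults) -> float:
    """Pool every question together: total correct divided by total questions."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


def macro_accuracy(results: SubjectResults) -> float:
    """Average the per-subject accuracies so each subject counts equally."""
    return sum(c / t for c, t in results.values()) / len(results)


# Invented counts for three subjects (illustrative only).
TOY_RESULTS: SubjectResults = {
    "abstract_algebra": (30, 100),
    "us_history": (160, 200),
    "virology": (90, 100),
}

print("micro:", micro_accuracy(TOY_RESULTS))
print("macro:", round(macro_accuracy(TOY_RESULTS), 4))
```

In this toy data the larger subject pulls the micro average up to 0.70, while the macro average is about 0.67, so a comparison between models is only meaningful when both scores use the same convention.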
