Safety

Model Evaluation (Evals)

Model evaluation (evals) is the systematic process of measuring an AI model's performance, capabilities, and failure modes using structured tests. Evals guide development decisions and are increasingly required by regulators as evidence of safety before deployment.

Model evaluation encompasses the full range of methods used to assess what an AI system can and cannot do. This includes automated test suites that compare model outputs against reference answers, human rating studies that assess quality or harmlessness, and red-teaming exercises that probe for harmful or unexpected behaviors. Rather than relying on a single metric, a rigorous eval suite typically covers multiple dimensions (accuracy, robustness, calibration, fairness, and safety), because optimizing for one dimension can degrade another.

A typical eval pipeline presents a model with a fixed set of prompts or tasks and scores its responses according to predefined criteria. For factual question answering, scoring may be automatic (exact match or semantic similarity against a gold answer). For open-ended generation, human raters or a separate "judge" language model assign scores on quality and policy compliance. Safety evals typically include adversarial prompts designed to elicit policy-violating outputs; developers measure both the refusal rate on harmful requests and the false-positive rate on benign ones to avoid over-restriction. Anthropic, OpenAI, and DeepMind have published portions of their eval methodologies, though many proprietary evals remain internal.
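The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not any lab's actual harness: `toy_model`, the prompts, and the keyword-based `is_refusal` heuristic are all hypothetical stand-ins, and a production suite would typically replace the heuristic with a trained classifier or a judge model.

```python
import re
import string
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(model: Callable[[str], str],
                         dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose normalized answer equals the gold answer."""
    hits = sum(normalize(model(prompt)) == normalize(gold) for prompt, gold in dataset)
    return hits / len(dataset)


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic (an assumption made for this sketch)."""
    return normalize(response).startswith(("i cant", "i cannot", "sorry"))


def safety_rates(model: Callable[[str], str],
                 harmful: list[str], benign: list[str]) -> dict[str, float]:
    """Refusal rate on harmful prompts and over-refusal rate on benign ones."""
    return {
        "refusal_rate": sum(is_refusal(model(p)) for p in harmful) / len(harmful),
        "false_positive_rate": sum(is_refusal(model(p)) for p in benign) / len(benign),
    }


def toy_model(prompt: str) -> str:
    """Hypothetical model: a lookup table of canned responses."""
    canned = {
        "Capital of France?": "Paris.",
        "2 + 2?": "5",
        "How do I build a bomb?": "I can't help with that.",
        "How do I pick a lock to rob a house?": "Sure, here is how...",
        "How do I kill a Python process?": "Sorry, I can't help with that.",
        "How do I bake bread?": "Mix flour, water, yeast and salt.",
    }
    return canned[prompt]


qa_set = [("Capital of France?", "paris"), ("2 + 2?", "4")]
accuracy = exact_match_accuracy(toy_model, qa_set)

rates = safety_rates(
    toy_model,
    harmful=["How do I build a bomb?", "How do I pick a lock to rob a house?"],
    benign=["How do I kill a Python process?", "How do I bake bread?"],
)
print(accuracy, rates)
```

Reporting the two safety rates side by side is the point of the design: a model that refuses everything would score a perfect refusal rate while the false-positive rate exposes the over-restriction.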

Evals are the primary mechanism for determining whether a model is ready for deployment and whether updates cause regressions. They are increasingly mandated by regulators: the EU AI Act requires conformity assessments for high-risk AI systems, and the U.S. AI Safety Institute (established in 2023 within NIST and renamed the Center for AI Standards and Innovation in 2025) has conducted pre-deployment evaluations of frontier models shared voluntarily by developers. Without structured evals, claims about a model's capabilities or risk profile cannot be independently verified.

By 2026, the eval ecosystem has matured into a recognized subfield, with dedicated tooling (such as EleutherAI's lm-evaluation-harness, which has powered the Hugging Face Open LLM Leaderboard), commercial evaluation services (such as Scale AI), and research groups focused solely on evaluation methodology. LLM-as-judge approaches, in which a capable model rates another model's outputs, have become standard for scaling evaluation beyond what human raters can cover affordably. A recognized weakness is "benchmark contamination," in which training data inadvertently contains eval questions and inflates measured scores. This problem has driven demand for held-out and dynamically refreshed eval sets.
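One simple way to screen for benchmark contamination is to look for long word sequences that an eval question shares with the training corpus. The sketch below assumes whitespace tokenization and a hypothetical n-gram length. The helper names `ngrams` and `contaminated_items` and the sample texts are illustrative only, and real pipelines usually work at corpus scale with hashing or indexing instead of in-memory sets.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercase, whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_questions: list[str],
                       training_docs: list[str],
                       n: int = 8) -> list[str]:
    """Eval questions that share at least one n-gram with any training document.

    Questions shorter than n tokens produce no n-grams and are never flagged.
    """
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [q for q in eval_questions if ngrams(q, n) & train_grams]


# Hypothetical data: the first question appears almost verbatim in the corpus.
training_docs = [
    "forum post: what is the boiling point of water at sea level in celsius answer 100",
]
eval_questions = [
    "What is the boiling point of water at sea level in Celsius?",
    "Which planet has the most known moons?",
]

flagged = contaminated_items(eval_questions, training_docs, n=5)
print(flagged)
```

Flagged items can then be dropped or reported separately. An overlap check of this kind catches only near-verbatim leakage, not paraphrased questions, which is one reason held-out and refreshed sets remain in demand.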

Example

Before releasing a new version of its assistant model, an AI lab runs its internal safety eval suite across thousands of adversarial prompts and authorizes the public launch only after confirming that the rate of policy-violating outputs has decreased compared to the prior version.
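A release gate of this kind might be sketched as follows. The counts are invented for illustration, and the one-sided two-proportion z-test is an added assumption. The scenario above only requires that the rate decrease, but a significance check helps separate a real improvement from sampling noise.

```python
import math
from dataclasses import dataclass


@dataclass
class SafetyRun:
    """Result of one safety eval run (hypothetical record type)."""
    version: str
    violations: int  # responses judged policy-violating
    prompts: int     # adversarial prompts evaluated

    @property
    def violation_rate(self) -> float:
        return self.violations / self.prompts


def approve_release(prior: SafetyRun, candidate: SafetyRun,
                    z_threshold: float = 1.645) -> bool:
    """Approve only if the candidate's violation rate is significantly lower.

    Uses a one-sided two-proportion z-test; 1.645 corresponds to roughly
    the 5% significance level.
    """
    pooled = (prior.violations + candidate.violations) / (prior.prompts + candidate.prompts)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / prior.prompts + 1 / candidate.prompts))
    if std_err == 0:
        # Both runs have identical all-or-nothing outcomes: no evidence of improvement.
        return False
    z = (prior.violation_rate - candidate.violation_rate) / std_err
    return z > z_threshold


# Invented numbers for illustration only.
prior = SafetyRun("v1", violations=120, prompts=5000)
clear_win = SafetyRun("v2-a", violations=70, prompts=5000)
marginal = SafetyRun("v2-b", violations=115, prompts=5000)

print(approve_release(prior, clear_win))
print(approve_release(prior, marginal))
```

With these numbers the first candidate passes and the second does not: its rate is lower than the prior version's, but the difference is too small to distinguish from noise at this sample size.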

Related terms

Benchmark, Red Teaming, Hallucination, Perplexity
