HumanEval
HumanEval is a code-generation benchmark of 164 hand-written Python programming problems created by OpenAI in 2021, used to measure a language model's ability to produce functionally correct code, scored via the pass@k metric.
HumanEval is a benchmark dataset for evaluating the coding capabilities of large language models, introduced by Mark Chen and colleagues at OpenAI in the 2021 paper "Evaluating Large Language Models Trained on Code." It contains 164 hand-crafted Python programming challenges, each consisting of a function signature, a natural-language docstring describing the task, and a set of unit tests that is withheld from the model's prompt. A model passes a problem if its generated code satisfies all unit tests without modification.
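To make the problem format concrete, the sketch below builds one problem in the style of HumanEval and checks two candidate completions against its tests. The problem (`running_max`), both completions, and the `passes` helper are invented for illustration and are not taken from the dataset or from OpenAI's evaluation harness. The layout loosely follows the released data, which stores a `prompt`, a `test` string defining `check(candidate)`, and an `entry_point` name. A real harness would also run untrusted code in a sandbox with a timeout, which this sketch omits.

```python
# Hypothetical problem in the style of HumanEval (not an actual dataset entry).
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the largest value in numbers[:i + 1].
    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# Unit tests that the model never sees; they call the candidate function.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''

# Two illustrative model outputs: each is only the function body.
GOOD_COMPLETION = (
    "    result, best = [], None\n"
    "    for x in numbers:\n"
    "        best = x if best is None else max(best, x)\n"
    "        result.append(best)\n"
    "    return result\n"
)
BAD_COMPLETION = "    return sorted(numbers)\n"


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in test."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a fail.
        return False


print(passes(PROMPT, GOOD_COMPLETION, TEST, "running_max"))
print(passes(PROMPT, BAD_COMPLETION, TEST, "running_max"))
```

The second completion returns a plausible-looking list but fails the first assertion, which is the kind of error that text-similarity metrics can miss and functional testing catches.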
The primary evaluation metric is pass@k: the estimated probability that at least one of k generated samples for a problem passes all of its tests. Pass@1 (a single generation attempt) is the most commonly reported value because it is the closest proxy for real-world use, where a developer typically sees one suggestion. Problems range from simple string manipulation to recursive algorithms and basic data-structure tasks. They resemble short, interview-style exercises rather than hard competitive-programming puzzles, so scores are meant to reflect everyday coding ability.
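In practice pass@k is not measured by drawing exactly k samples. The Chen et al. paper generates n ≥ k samples per problem, counts the number c that pass, computes 1 - C(n-c, k) / C(n, k), and averages this over all problems. A minimal sketch of that estimator follows; the function names `pass_at_k` and `benchmark_pass_at_k` are chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: 10 samples per problem; the first problem had 3 passes, the second 10.
print(benchmark_pass_at_k([(10, 3), (10, 10)], k=1))
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass. Sampling more than k completions and using the combinatorial form gives a lower-variance estimate than drawing exactly k samples once.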
HumanEval became a de facto standard for comparing code-generation systems because it measures functional correctness rather than surface-level text similarity. OpenAI's Codex model scored 28.8% pass@1 at the benchmark's 2021 release. GPT-4 reached roughly 67% pass@1 in 2023, and frontier models released in 2024–2025, such as Claude 3.5 Sonnet and OpenAI o3, routinely exceed 85% and in many cases 90%, indicating that the benchmark is approaching saturation.
Despite its limitations (a fixed, public problem set vulnerable to data contamination, Python-only scope, and relatively short problem contexts), HumanEval remains a baseline citation in model release papers and commercial leaderboards. Its saturation has driven adoption of harder successors: SWE-bench, which asks models to resolve real GitHub issues and often requires multi-file edits; LiveCodeBench, which continuously collects newly published problems to limit contamination; and HumanEval+, which adds many more test cases per problem to check edge-case robustness.