SWE-bench

SWE-bench is a benchmark that evaluates AI coding systems on real software engineering tasks by requiring them to resolve genuine GitHub issues in open-source Python repositories, with success defined as producing a code patch that passes the project's automated test suite.

SWE-bench was introduced in 2023 by Carlos Jimenez and colleagues at Princeton University. It consists of 2,294 task instances drawn from real issues in 12 popular open-source Python repositories, including Django, scikit-learn, Flask, astropy, and sympy. Each instance pairs an issue with the ground-truth patch that resolved it and with tests that verify the fix. A system resolves a task if its generated patch, applied to the codebase, makes the previously failing tests pass without breaking tests that already passed. The headline metric, the percentage of tasks resolved in a single attempt, therefore demands functionally correct code, not a plausible-sounding response.
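The pass criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the function names, the "PASSED" status string, and the test names are hypothetical. The two test lists loosely mirror the FAIL_TO_PASS and PASS_TO_PASS fields of the SWE-bench dataset, as far as this sketch assumes.

```python
from typing import Dict, Iterable, List


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    test_status: Dict[str, str],
) -> bool:
    """Return True if a patched repository counts as resolved.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must still pass.
    test_status:  hypothetical mapping of test name -> "PASSED" / "FAILED",
                  collected after applying the generated patch.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as not passing.
    return all(test_status.get(name) == "PASSED" for name in required)


def resolution_rate(outcomes: List[bool]) -> float:
    """Percentage of task instances resolved (one attempt per instance)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical results for one task instance after applying a patch.
status = {
    "test_reverse_url": "PASSED",     # the bug the issue describes
    "test_existing_route": "PASSED",  # behavior that must not regress
    "test_unrelated": "FAILED",       # not in either list, so ignored
}
print(is_resolved(["test_reverse_url"], ["test_existing_route"], status))  # True
print(resolution_rate([True, False, False, True]))  # 50.0
```

Note that a patch fixing the reported bug still counts as unresolved if it breaks any test in the second list, which is why the metric is stricter than checking the target test alone.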

SWE-bench tests capabilities well beyond isolated code generation: understanding large multi-file codebases, reproducing bugs from natural-language descriptions, navigating repository structure, and writing patches that integrate cleanly with existing code style and logic. The most commonly reported subsets are SWE-bench Lite (300 curated instances, cheaper to evaluate) and SWE-bench Verified (500 human-validated tasks, screened to reduce noise from ambiguous issues). Initial performance was very low: GPT-4 baselines resolved under 5% of the full benchmark. Cognition AI's Devin agent attracted wide attention in early 2024 by reportedly resolving approximately 13.8% of the tasks it was evaluated on (a random subset of the benchmark), a state-of-the-art result at the time.

SWE-bench matters because it measures practical engineering utility in a grounded, verifiable way, requiring tool use and multi-file reasoning rather than language fluency alone. It drove the development of specialized AI coding agents — systems that combine language models with shell access, code execution, and file-editing tools — and became the primary competitive benchmark for that ecosystem.

As of 2026, resolution rates on SWE-bench Verified have risen substantially. Leading agentic systems from Anthropic, OpenAI, and several startups have reported rates exceeding 50%, with top systems claiming over 60%. This progress has reduced the benchmark's discriminative power at the frontier and spurred interest in harder successors covering larger codebases, multi-repository tasks, and non-Python languages.

Example

An AI coding agent receives the description of a Django routing bug from an actual GitHub issue, autonomously reproduces the failing test, edits the relevant source file, and submits a patch that passes all tests — the exact task SWE-bench measures and scores.
