OpenAI and Paradigm developed a test for AI auditing of smart contracts
AI-processed from OpenAI Blog; edited by Hamidun News
OpenAI and Paradigm, a cryptocurrency venture capital firm, have announced EVMbench, a specialized benchmark that measures how well AI agents audit smart contracts. It tests three specific skills: identifying high-severity vulnerabilities, writing patches that fix them, and exploiting the discovered flaws in practice. With the blockchain industry losing hundreds of millions of dollars a year to smart contract vulnerabilities, a standardized way to assess AI auditors is an urgent need rather than an academic exercise.
To understand why EVMbench has appeared now, look at the state of the blockchain security market. Smart contracts are self-executing programs deployed on a blockchain, and they manage billions of dollars in decentralized finance protocols. Once a contract is published on the network, it is practically impossible to change, so any error becomes permanent and potentially devastating. Traditional auditing depends on highly qualified specialists who are in critically short supply: demand for smart contract auditors has long exceeded supply, and reviews stretch over weeks. AI agents could in theory close this gap, but only if their capabilities can be measured and compared.
EVMbench targets the Ethereum Virtual Machine (EVM), the smart contract execution standard that underlies not only Ethereum but dozens of compatible blockchains, including BNB Chain, Polygon, and Arbitrum. This makes the benchmark relevant to the entire ecosystem, not just a single network. The test is built around real-world scenarios: an AI agent receives contract code and must do more than report an abstract "possible vulnerability." It has to pinpoint a critical flaw, propose a working patch, and demonstrate exploitation, that is, show how an attacker could take advantage of the issue in practice. This three-level approach sets EVMbench apart from general code-writing tests: it evaluates the model's understanding of security logic rather than its command of syntax.
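To make the three-skill structure concrete, here is a minimal Python sketch of how results from such a benchmark might be tallied per skill. It is an illustration only, not EVMbench's actual scoring method: the `TaskResult` record, the mode names `detect`, `patch`, and `exploit`, the contract labels, and the pass/fail values are all hypothetical and do not represent real results.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the three skills described above.
MODES = ("detect", "patch", "exploit")


@dataclass(frozen=True)
class TaskResult:
    contract: str  # hypothetical task identifier
    mode: str      # one of MODES
    passed: bool   # whether the agent's output was judged correct


def pass_rates(results):
    """Return the fraction of passed tasks for each mode that has results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.mode not in MODES:
            raise ValueError(f"unknown mode: {r.mode!r}")
        totals[r.mode] += 1
        passes[r.mode] += int(r.passed)
    return {m: passes[m] / totals[m] for m in MODES if totals[m]}


# Made-up outcomes for two imaginary contracts, for illustration only.
results = [
    TaskResult("vault", "detect", True),
    TaskResult("vault", "patch", True),
    TaskResult("vault", "exploit", False),
    TaskResult("token", "detect", True),
    TaskResult("token", "patch", False),
    TaskResult("token", "exploit", False),
]

rates = pass_rates(results)
print(rates)
```

Reporting each skill separately, rather than as one blended score, keeps visible the case where a model can spot a flaw but cannot fix or exploit it.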
The partnership between OpenAI and Paradigm is logical, though not an obvious one. Paradigm is more than a fund that invests in cryptocurrency startups: the firm is known for deep technical expertise and conducts its own blockchain security research. For OpenAI, the collaboration is a chance to demonstrate the applied value of its agents beyond familiar scenarios such as writing text or generating code. Notably, the benchmark was developed jointly, which means EVMbench reflects the expertise of practicing security specialists, not only that of engineers who build tests.
For the AI security industry, EVMbench marks a shift from claims to measurable results. Until now, assertions about the effectiveness of AI auditors for smart contracts were hard to verify: each company used its own tests, and the results could not be compared. A standardized benchmark creates a common language. Developers can compare models objectively, and audit customers gain a reference point when choosing tools. This changes the competitive dynamic: the winner is not whoever advertises its capabilities most loudly, but whoever's model actually delivers results on identical tasks.
For users and projects that work with blockchains, the long-term consequences could be tangible. If AI agents learn to find critical vulnerabilities reliably, the cost and duration of smart contract audits will fall significantly, and smaller protocols that cannot afford a full security review today will gain access to protection. This does not eliminate human audits but changes their role: specialists can focus on complex logic vulnerabilities and delegate the routine search for known patterns to machines.
EVMbench is an acknowledgment that automated security auditing is becoming a serious field that requires serious assessment tools. That OpenAI and Paradigm built it together suggests the moment is ripe: the industry is ready to move from experiments to standards. The open questions are how well existing models will score and how quickly competitors will begin optimizing for the new test. The history of other benchmarks suggests that once a measurable goal appears, progress accelerates sharply.