OpenAI and crypto venture firm Paradigm released EVMbench on Wednesday to test AI agents’ ability to detect, patch, and exploit smart contract flaws. The benchmark uses 120 past vulnerabilities plus scenarios from audits of Paradigm’s Tempo blockchain, and aims to improve automated security evaluation.
EVMbench found agent performance strongest when the goal is explicit exploitation, with the newest model excelling at draining funds. “Agents perform best in the exploit setting, where the objective is explicit: continue iterating until funds are drained,” the release states.
The report shows GPT-5.3-Codex more than doubled GPT-5’s exploit effectiveness, while detection and patching still fall short of full coverage. Anthropic’s Claude Opus 4.6 scored highest on detection, while GPT-5.3-Codex led on patching and exploitation.
OpenAI warned that EVMbench covers only a limited sample of vulnerabilities and cannot reliably flag false positives. The tool therefore does not capture the full difficulty of securing production smart contracts, the company added. (Ed. note: security teams should not rely solely on benchmark outputs.)
The release follows a recent incident in which AI-generated code cost users of the Moonwell protocol nearly $2.7 million; discussion and a recovery plan appear in the project’s forum and protocol pages. A Moonwell engineer said the code had passed an audit from Halborn.
Crypto protocols have faced extensive thefts this year, with more than $108 million lost to exploits in 2026, according to DefiLlama.

