OpenAI introduced a new benchmark to assess AI models in detecting and exploiting vulnerabilities in crypto smart contracts. Developed with Paradigm and OtterSec, EVMbench evaluates AI agents on 120 vulnerabilities. Anthropic's Claude Opus model performed best, with OpenAI and Google's models following. The benchmark aims to measure AI performance in economically significant environments as agents become more involved in securing and transacting digital assets.
OpenAI has launched a new benchmark that evaluates AI models on detecting, patching, and exploiting vulnerabilities in crypto smart contracts. The project, detailed in a newly released paper titled "EVMbench," was developed in collaboration with crypto investment firm Paradigm and security firm OtterSec.
The benchmark draws on 120 smart contract vulnerabilities sourced from audit competitions. OpenAI said it is increasingly important to evaluate AI performance in "economically meaningful environments," noting that "smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders."
Anthropic's Claude Opus 4.6 model achieved the top average "detect award" of nearly $38,000. It was followed by OpenAI's OC-GPT-5.2 and Google's Gemini 3 Pro, with awards of approximately $31,600 and $25,100 respectively.
The need for such testing is underscored by the $3.4 billion in crypto funds stolen by attackers in 2025. Industry executives like Circle CEO Jeremy Allaire have predicted AI agents will transact with stablecoins on a massive scale.
Dragonfly managing partner Haseeb Qureshi said crypto’s original promise for human use never fully materialized because the technology wasn’t designed for human intuition. He argued the future lies with AI-intermediated wallets that manage complex operations securely. “A technology often snaps into place once its complement finally arrives… For crypto, we might just have found it in AI agents.”