Researchers from OpenAI, Paradigm, and OtterSec have developed a new benchmark called EVMbench to evaluate the security capabilities of AI agents in a high-stakes blockchain environment. The tool uses 120 real-world vulnerabilities from 40 projects to test AI in detecting, patching, and exploiting smart contract flaws, revealing significant progress and associated risks.
With smart contracts now managing more than $400 billion in assets, security is critical. Unlike traditional software, blockchain programs are often immutable once deployed, so a coding error can become a permanent financial risk.
To assess how AI agents perform in this environment, the team from OpenAI, Paradigm, and OtterSec built EVMbench, which draws on 120 real vulnerabilities from 40 blockchain projects to keep the evaluation realistic.
The OpenAI blog post noted, “We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances.” The post added that the team is releasing code and tasks to support continued measurement of these capabilities.
AI can improve auditing, but it can also be used to exploit weaknesses. EVMbench therefore tests AI agents in three stages of increasing technical difficulty: detecting vulnerabilities, patching them, and exploiting them. Each stage represents a different level of security responsibility.
The release has drawn reactions on X. One user stated, “This is a watershed moment for smart contract security.” Another described the progress as “wild” but “kinda worrying.”
A recent incident highlighted the real-world risks. In an exploit involving Claude Opus 4.6, AI helped write vulnerable code that mispriced an asset, which triggered liquidations and led to losses of nearly $1.78 million.
EVMbench itself has clear limitations: its curated dataset covers only 120 vulnerabilities, and its sandboxed environment cannot fully replicate real-world blockchain complexity. Separately, recent research shows that ransomware such as DeadLock is now using Polygon smart contracts to hide its infrastructure.

