BlockSec Challenges AI Auditing Hype on EVMBench

By TheCryptoWorld Staff · March 21, 2026 at 11:16 AM · Edited by Josh Sielstad · 7 min read

What to Know

  • BlockSec re-evaluated EVMBench, the AI auditing benchmark built by OpenAI and Paradigm, and found its original results were likely inflated by flawed testing conditions
  • Across 110 agent-incident pairs tested on 22 real-world post-February 2026 security incidents, AI agents achieved an exploit success rate of 0%
  • Claude Opus performed best on vulnerability detection, catching 13 out of 20 real-world incidents — but four incidents stumped every single agent tested
  • BlockSec's co-founder argues AI and humans serve different audit roles: AI handles breadth, humans handle depth — and neither can fully do the other's job

BlockSec researchers are pushing back hard on what they see as an inflated narrative around AI-powered smart contract auditing. The security firm published a re-evaluation of EVMBench — the benchmark jointly developed by OpenAI and Paradigm — arguing that its original claims about AI's auditing prowess leaned heavily on testing conditions that may have made the results look far better than they actually were.

What EVMBench Actually Claimed

The original EVMBench report, published in a blog post in February 2026, tested AI agents on smart contract security tasks — specifically detecting, patching, and exploiting vulnerabilities. The results, at first glance, were striking. AI systems reportedly exploited 72% of the vulnerabilities and detected roughly 45%, drawing on 120 curated examples from Code4rena audit contests. The EVMBench authors framed 'discovery' as the primary remaining bottleneck in automated auditing — the implication being that AI was already pretty good at everything else.

That conclusion spread fast. Industry commentary started treating fully automated auditing as a near-term reality. BlockSec, a blockchain security firm that does this work day-to-day, wasn't buying it. So they ran their own tests.

How BlockSec Ran Its Counter-Test

The BlockSec team identified two structural problems with the original benchmark. First: configuration breadth. EVMBench tested 14 agent configurations, which mostly kept each model within its own vendor scaffold — so OpenAI models ran on OpenAI infrastructure, Anthropic on Anthropic, and so on. BlockSec expanded this to 26 configurations, mixing models with different scaffolds — running Claude on ChatGPT's architecture, for instance. The point was to isolate whether performance reflected the model itself or the scaffold it ran on.
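The model-versus-scaffold separation can be sketched with a toy cross-product. This is an illustrative sketch only — the model and scaffold names are hypothetical, and the counts (3 vendor-locked vs. 9 crossed) are not EVMBench's 14 and 26 — but it shows why crossing pairings makes scaffold effects separable:

```python
# Illustrative sketch (model/scaffold names are hypothetical, not from
# EVMBench or BlockSec): vendor-locked pairings vs. a full cross-product.
from itertools import product

models = ["gpt_x", "claude_y", "gemini_z"]              # hypothetical models
scaffolds = {"gpt_x": "openai_harness",                 # each vendor's own
             "claude_y": "anthropic_harness",           # agent scaffold
             "gemini_z": "google_harness"}

# Vendor-locked: each model runs only inside its own scaffold
# (roughly the original EVMBench setup).
locked = [(m, scaffolds[m]) for m in models]

# Cross-product: every model runs in every scaffold (BlockSec-style),
# so a score can be attributed to the model or to the scaffold.
crossed = list(product(models, scaffolds.values()))

print(len(locked), len(crossed))  # 3 vendor-locked pairs vs. 9 crossed pairs
```

With only vendor-locked pairs, a strong score could mean a strong model or a strong harness; the cross-product disentangles the two.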

Second, and more damaging: data contamination. EVMBench tested AI agents on 40 publicly known Code4rena repositories — vulnerabilities that had already been documented and could have ended up in these models' training data. That's a serious methodological flaw. If the model has already seen a vulnerability in training, detecting it in a benchmark tells you very little about real-world capability.

To sidestep this, BlockSec used 22 real-world security incidents that all occurred after mid-February 2026, explicitly after every tested model's training cutoff. No contamination possible. The results from this cleaner test, detailed in their paper on smart contract auditing, were dramatically different.
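The contamination-avoidance step described above amounts to a date filter: keep only incidents that postdate every tested model's training cutoff. A minimal sketch, with hypothetical dates and names (not from the BlockSec paper):

```python
# Illustrative sketch (dates and names are hypothetical): selecting only
# incidents newer than every model's training cutoff, so memorization
# cannot explain a successful detection.
from datetime import date

model_cutoffs = {
    "model_a": date(2025, 10, 1),   # hypothetical training cutoffs
    "model_b": date(2026, 2, 15),
}

incidents = [
    {"name": "incident_1", "date": date(2026, 1, 5)},   # predates a cutoff
    {"name": "incident_2", "date": date(2026, 3, 2)},   # postdates all cutoffs
]

# An incident is clean only if it is strictly after the LATEST cutoff.
latest_cutoff = max(model_cutoffs.values())
clean_set = [i for i in incidents if i["date"] > latest_cutoff]

print([i["name"] for i in clean_set])  # only post-cutoff incidents survive
```

Anything a model could plausibly have seen in training is excluded, which is why the filter keys on the latest cutoff across all tested models, not each model's own.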

Zero Exploits. That's the Number.


BlockSec's re-evaluation found that across 110 agent-incident pairs — five AI agents each running against the same 22 incidents — not a single end-to-end exploit succeeded. Zero out of 110. That's a long way from the 72% exploit rate the original benchmark reported.

Co-founder Yajin Zhou put it bluntly on X on March 21:

EVMBench says AI can exploit 72% of smart contract vulnerabilities, and the industry started talking about fully automated auditing. We re-tested with more configurations and 22 real-world attack incidents. Exploit success: 0%.

— Yajin Zhou, Co-founder, BlockSec

Detection results told a more nuanced story. Claude Opus turned in the best performance, catching 13 out of 20 real-world vulnerabilities. But the difficulty distribution was uneven in ways that reveal something important about where AI actually helps.

As Zhou explained, six incidents were spotted by nearly all agents — covering well-known patterns like sell-hook reserve manipulation and unchecked multiplication overflow. These are the repeating-pattern bugs that AI is genuinely good at catching. But four incidents weren't caught by any agent at all. And five more were detected by just a single agent apiece.
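Of the repeating-pattern bugs mentioned above, unchecked multiplication overflow illustrates why this class suits automated scanners. A minimal, illustrative Python sketch (not from the BlockSec paper) that models pre-0.8 Solidity `uint256` arithmetic, which wraps modulo 2**256 unless guarded:

```python
# Illustrative sketch: unchecked vs. SafeMath-style checked multiplication,
# modeling pre-0.8 Solidity uint256 semantics in Python.

UINT256_MAX = 2**256 - 1

def unchecked_mul(a: int, b: int) -> int:
    """Multiply as an unguarded pre-0.8 Solidity uint256 would: wrap on overflow."""
    return (a * b) & UINT256_MAX

def checked_mul(a: int, b: int) -> int:
    """SafeMath-style multiply: revert (raise) instead of silently wrapping."""
    wrapped = (a * b) & UINT256_MAX
    if a != 0 and wrapped // a != b:   # classic SafeMath overflow check
        raise OverflowError("uint256 multiplication overflow")
    return wrapped

# A contract computing `shares = amount * rate` without a check can
# silently wrap a huge product down to a tiny, attacker-friendly value:
print(unchecked_mul(2**255 + 3, 2))  # (2**256 + 6) wraps to 6
```

The bug has a fixed, searchable shape — a multiplication with no guard — which is exactly the kind of signature a scanning agent can match at scale, unlike the novel protocol-logic flaws discussed below.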

Why This Matters for the Security Industry

The gap between 'AI catches well-known patterns' and 'AI can replace a human auditor' is enormous. The kind of vulnerabilities that lead to nine-figure DeFi exploits are almost never the repeating-pattern bugs. They're the novel ones — the edge cases that require deep protocol knowledge, adversarial reasoning, and the ability to model attacker behavior in context. That's precisely where the BlockSec data shows AI falling apart.

Zhou is careful not to dismiss AI's role entirely — and that's the more interesting read here. Rather than dunking on EVMBench as a project, he frames it as a "valuable contribution" that provides an evaluation standard for the industry. His real argument is that the marketing conclusion stretched well beyond what the technical result supported.

You could argue this happens constantly in crypto. A promising result becomes a product narrative before the methodology is even pressure-tested. Call it hype gravity — and blockchain security is not immune to it. The original EVMBench findings circulated widely before anyone ran them against post-training-cutoff incidents.

For security professionals, the practical implication is clear enough. AI tools are useful for systematic scanning — breadth coverage at scale. Humans are necessary for the hard stuff. Zhou's framing of AI handling 'breadth' while humans handle 'depth' isn't just diplomatic language; the data from ReEVMBench actually backs it up. Debates about AI replacing human judgment in creative fields make a strikingly similar point — some cognitive tasks resist automation not because the tools are immature, but because the task fundamentally requires human experience and judgment.

These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment.

— Yajin Zhou, Co-founder, BlockSec

What Should Human-AI Collaboration Look Like?

Zhou's conclusion points toward a division of labor that most serious practitioners in any technical field would recognize. AI handles the tedious, systematic parts — scanning thousands of lines for known vulnerability signatures, running automated exploit templates, generating initial vulnerability reports. Humans bring protocol intuition, threat modeling, and the adversarial creativity that doesn't reduce to pattern matching.

The uncomfortable subtext in BlockSec's findings is that some of the highest-profile AI auditing claims are being made at the exact moment when the evidence for them is weakest. AI models trained before February 2026 didn't have the post-training-window incidents in their data — and when tested on exactly those incidents, they produced zero working exploits.

That should recalibrate expectations considerably. Not kill the technology — but recalibrate.

The real question is not 'can AI replace humans?' but 'how should humans and AI work together?' AI handles breadth (systematic scanning); humans handle depth (protocol knowledge, adversarial reasoning). Neither can do the other's job. Together, they form a complete audit capability.

— Yajin Zhou, Co-founder, BlockSec

Frequently Asked Questions

What is EVMBench and who created it?

EVMBench is an AI-powered smart contract auditing benchmark developed jointly by OpenAI and Paradigm. Published in February 2026, it tested AI agents on detecting, patching, and exploiting smart contract vulnerabilities across 120 curated examples, originally reporting a 72% exploit success rate.

What did BlockSec find when it re-tested EVMBench?

BlockSec tested five AI agents against 22 real-world security incidents that occurred after each model's training cutoff. Across 110 agent-incident pairs, zero end-to-end exploits succeeded — compared to the 72% originally claimed. Detection results were more in line with the original, with Claude Opus catching 13 of 20 vulnerabilities.

Why did BlockSec raise concerns about data contamination in EVMBench?

EVMBench used 40 publicly known Code4rena repositories — vulnerabilities that could have appeared in AI training data. If models already encountered these vulnerabilities during training, benchmark success reflects memorization rather than genuine detection capability, making results unreliable for predicting real-world performance.

Can AI replace human smart contract auditors?

According to BlockSec's research, not anytime soon. AI tools perform well at systematic pattern scanning but fail on novel, complex vulnerabilities requiring deep protocol knowledge and adversarial reasoning. BlockSec's co-founder argues the right model is human-AI collaboration, not replacement.

This article is for informational purposes only and does not constitute investment advice. Every investment and trading decision involves risk. Readers should conduct their own research before making any financial decisions.


