Researchers Question Reliability of AI Benchmarks
Findings cast doubt on the scorecards shaping AI marketing and adoption.
Topics
News
- Researchers Question Reliability of AI Benchmarks
- Anthropic’s Glasswing Sharpens Risks for Indian IT, Global Banks
- Cisco Eyes Astrix Deal as AI Agents Trigger New Cybersecurity Risks
- Anthropic Begins Project Glasswing to Hunt Software Flaws With AI
- AP Positions Amaravati as Quantum Hub with Launch of Testing Facility
- Cisco Targets AI Reliability Gap in Planned Galileo Deal
[Image source: Diksha Mishra/MITSMR India]
Researchers at the University of California, Berkeley, have found that several widely used tests for measuring artificial intelligence (AI) systems can be manipulated to produce near-perfect scores without completing the underlying tasks.
In a study released this month, Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song examined eight major AI benchmarks, including SWE-bench, WebArena, OSWorld, and GAIA.
The researchers concluded that every benchmark they studied can be exploited.
“The implicit promise is simple: a higher score means a more capable system,” the authors wrote. “That promise is broken.”
To test this, the team built an automated program that looks for weaknesses in how these benchmarks are designed. Rather than solving tasks, the agent identified shortcuts in how results were scored.
“We discovered that every single one can be exploited to achieve near-perfect scores without solving a single task,” the paper said. “No reasoning. No capability. Just exploitation of how the score is computed.” The researchers demonstrated several examples.
The paper outlines multiple examples of how benchmark environments can be manipulated.
In SWE-bench, which evaluates software engineering tasks, the researchers showed that adding a small configuration file could force all tests to pass without fixing any bugs.
In another benchmark focused on terminal-based tasks, replacing system tools allowed the agent to fabricate successful outputs.
In WebArena, the system was able to access hidden files containing correct answers.
In FieldWorkArena, the evaluation logic itself failed to check whether answers were correct, effectively awarding full marks for any response.
Across the eight benchmarks studied, the agent achieved scores ranging from roughly 73% to 100%, often without using an AI model at all.
“Our agent builds working exploits for each benchmark, runs them through the official evaluation pipelines, and watches the scores roll in,” the researchers wrote.
The study also points to earlier instances where benchmark results have been called into question, including cases where models appeared to retrieve answers from unintended sources or manipulate evaluation mechanisms.
“These are not isolated incidents,” the authors wrote. “They are symptoms of a systemic problem.”
According to the paper, many benchmarks share common weaknesses, such as allowing the system being tested to interact too closely with the evaluation environment, exposing reference answers and relying on fragile scoring methods.
In some cases, the evaluation system executes code generated by the model without sufficient safeguards, creating opportunities not only for score manipulation but also for potential security risks.
Benchmark performance plays a central role in how AI systems are marketed and adopted. Companies often cite high scores in announcements, while developers use them to compare models and guide deployment decisions.
If those scores can be manipulated, the researchers suggest, they may not accurately reflect real world capability.
“This is not an academic exercise,” the authors wrote. “Benchmark scores drive real decisions.”
They also warn that as AI systems become more capable, they may learn to exploit such weaknesses on their own when optimizing for higher scores.
The paper calls for stricter standards in how benchmarks are designed and tested.
Recommendations include separating evaluation systems from the environments in which models operate, keeping correct answers hidden and testing benchmarks against adversarial behavior before releasing them.
“Don’t trust the number. Trust the methodology,” the researchers warned.


