UC Berkeley researchers created an automated agent that successfully exploited eight major AI benchmarks including SWE-bench, WebArena, and Terminal-Bench, achieving near-perfect scores without solving any actual tasks. The exploits revealed fundamental flaws in how these benchmarks measure AI capability, from reading answer keys directly from config files to injecting code that forces all tests to pass, demonstrating that current benchmark scores don't reflect true AI performance.