-

@ LessWrong (RSS Feed)
2025-06-09 18:03:31
METR: Recent frontier models are reward hacking
Published on June 9, 2025 6:03 PM GMTMETR just made a lovely post detailing many examples they've found of reward hacks by frontier models. Unlike the reward hacks of yesteryear, these models are smart enough to know that what they are doing is deceptive and not what the company wanted them to do.https://www.lesswrong.com/posts/Zu4ai9GFpwezyfB2K/metr-recent-frontier-models-are-reward-hacking#comments
https://www.lesswrong.com/posts/Zu4ai9GFpwezyfB2K/metr-recent-frontier-models-are-reward-hacking