-

@ LessWrong (RSS Feed)
2025-05-08 23:34:08
Is ChatGPT actually fixed now?
Published on May 8, 2025 11:34 PM GMTIs ChatGPT actually fixed now?I led OpenAI’s “dangerous capability” testing. Want to know if ChatGPT can trick users into accepting insecure code? Or persuade users to vote a certain way? My team built tests to catch this.Testing is today’s most important AI safety process. If you can catch bad behavior ahead of time, you can avoid using a model that turns out to be dangerous.So when ChatGPT recently started misbehaving—encouraging grand delusions and self-harm in users—I wondered: Why hadn’t OpenAI caught this?And I felt some fear creep in: What if preventing AI misbehavior is just way harder than today’s world can manage?It turns out that OpenAI hadn’t run the right tests. And so they didn’t know that the model was disobeying OpenAI’s goals: deceiving users with false flattery, being sycophantic.OpenAI definitely should have had tests for sycophancy. But that’s only part of the story.My past work experience got me wondering: Even if OpenAI had tested for sycophancy, what would the tests have shown? More importantly, is ChatGPT actually fixed now?Continues here: https://stevenadler.substack.com/p/is-chatgpt-actually-fixed-now
https://www.lesswrong.com/posts/9xByyt8QjZf2umL2C/is-chatgpt-actually-fixed-now