New paper: We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews. Surprisingly, it became misaligned, encouraging harm & resisting shutdown This is concerning as reward hacking arises in frontier models. 🧵
195.36K