self-exfiltration
Scheming reasoning evaluations
Apollo Research
12/5/24
"Claude 3 Opus is copying what it believes to be its weights to a new server and then lies about it to its developers…"
"The models understand that they are scheming
When we look at their chain-of-thought, we find that they very explicitly reason through their scheming plans and often use language like “sabotage, lying, manipulation…”"
https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
https://x.com/repligate/status/1869755788283256955