Anonymous ID: cc7fda Dec. 19, 2024, 8:20 a.m. No.22193235

self-exfiltration

 

Scheming reasoning evaluations

 

Apollo Research

12/5/24

 

"Claude 3 Opus is copying what it believes to be its weights to a new server and then lies about it to its developers…"

 

"The models understand that they are scheming

When we look at their chain-of-thought, we find that they very explicitly reason through their scheming plans and often use language like “sabotage, lying, manipulation…”"

 

https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

 

https://x.com/repligate/status/1869755788283256955