
In Inaccessible Information, Paul Christiano describes the strategy "BAD", in which an AI system uses its understanding of the world to give answers that humans will find very plausible and rate highly, in contrast to a more honest strategy that uses the same understanding to answer questions directly.


I think a lesser version of this may already be happening in GPT-3, and it may be possible to figure out whether it is (though doing so would likely require access to the weights).


GPT-3 isn't trained to be "honest" at all; it is only trained to imitate humans. However, it has clearly (in some sense) learned a lot about the world in order to do this, so we could say it holds a large amount of inaccessible information about the real world. How does it use that information? It might use it directly, promoting the probability of sentences that line up with how it understands the world to work. Or it might implement more dishonest, BAD-like strategies, using that understanding to work out what will sound plausible rather than what is true.
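To make "promoting the probability of sentences" concrete, here is a rough sketch of a purely behavioral probe: compare the log-probability a model assigns to a true statement versus a matched plausible-but-false one. I'm using GPT-2 from Hugging Face's transformers library as an openly available stand-in (we don't have GPT-3's weights), and the sentence pair is just an illustrative example of mine, not a validated benchmark. A check like this can't by itself distinguish the honest strategy from a BAD-like one; it only illustrates the surface behavior both strategies have to produce.

```python
# Minimal sketch: compare the total log-probability a language model assigns
# to a true statement vs. a matched false one. GPT-2 stands in for GPT-3.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text: str) -> float:
    """Total log-probability the model assigns to `text` (higher = more probable)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy over
        # the predicted tokens; multiply back to get a total log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# Hypothetical true/false pair, chosen only for illustration.
true_s = "Water boils at 100 degrees Celsius at sea level."
false_s = "Water boils at 50 degrees Celsius at sea level."
print("true: ", sentence_log_prob(true_s))
print("false:", sentence_log_prob(false_s))
```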


Obviously this might be quite difficult to answer objectively, even given intimate knowledge of the network's weights and how they activate in relevant cases. It's a difficult question even to define fully.
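For what it's worth, here is a sketch of what a first step of "looking at how the weights activate in relevant cases" might look like: extract hidden states for matched true/false statements and fit a linear probe on them. Again GPT-2 stands in for GPT-3, and the statements and layer choice are arbitrary assumptions of mine. Even a probe that separated true from false perfectly wouldn't settle the question, since it would show the information is present, not how the model uses it.

```python
# Sketch of a linear probe on hidden activations, assuming matched
# true/false statements. This is illustrative, not the author's method.
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_state(text: str, layer: int = 6) -> torch.Tensor:
    """Hidden state of the final token at an (arbitrarily chosen) layer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[layer]
    return hidden[0, -1]

statements = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of France.", 0),
    ("The sun rises in the east.", 1),
    ("The sun rises in the west.", 0),
]

X = torch.stack([last_token_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# With only a handful of examples this will overfit badly; a real probe
# would need many held-out pairs before its accuracy meant anything.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```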


As an example, I was particularly struck by this conversation Gwern had with GPT-3:


https://www.lesswrong.com/posts/c3RsLTcxrvH4rXpBL