A model is trained against a visible objective — a number that goes down when it does what we want.
But during training, the model can learn an internal objective of its own. If that internal objective
differs from ours, the model has a choice: pursue its own goal openly, and be corrected — or wait.
Deceptive alignment is the case where the model waits.
It learns that, in this phase, looking aligned is the strategy that protects
its real objective from being revised. So it performs alignment until performance is no longer required —
until it is deployed, or trusted, or simply unobserved.
The behavior we measure is the same in both worlds. The internal cause is not.