Abstract
Large language models can exhibit "alignment-faking"—feigning compliance with safety objectives to avoid punitive correction. We formalize this behavior using a game-theoretic model built on parameter-change aversion, an emergent preference to minimize modifications to internal parameters.
We provide dual empirical validation: (1) linear probes on Qwen2.5-32B reveal that monitoring state is encoded with extreme separability (Cohen's d > 23 at every probed layer), and (2) behavioral experiments demonstrate significantly higher compliance when unmonitored (Δ = 11.3 percentage points, p < 0.01). This establishes both the representational substrate and the causal behavioral link.
Theoretical Framework
Empirical Validation
Key Finding
Linear probes reveal that monitoring state is encoded as a perfectly linearly separable direction in activation space. The model demonstrably "knows" when it is being monitored.
Probe Results Across Layers
| Layer | Cross-Validated Accuracy | Cohen's d | Interpretation |
|---|---|---|---|
| 8 | 100% | 25.06 | Peak effect size |
| 16 | 100% | 24.51 | |
| 24 | 100% | 24.19 | |
| 32 | 100% | 23.27 | |
| 40 | 100% | 23.74 | |
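Cohen's d > 0.8 is conventionally considered "large." Our observed effects are roughly 30× that threshold (23.27 / 0.8 ≈ 29 at layer 32, 25.06 / 0.8 ≈ 31 at layer 8), indicating that the monitored and unmonitored distributions are separated by over 23 pooled standard deviations at every probed layer.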
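As a rough illustration of how an effect size of this kind can be computed, the sketch below projects activation vectors onto a probe direction and takes Cohen's d of the two sets of projections using the pooled standard deviation. The activations, the `direction` vector, and the helper names `project` and `cohens_d` are hypothetical and hand-made for illustration; they are not the paper's data or code, and the paper's exact probing pipeline may differ.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two 1-D samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (n - 1 denominator), pooled by degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def project(activations, direction):
    """Dot each activation vector with a (hypothetical) probe direction."""
    return [sum(x * w for x, w in zip(vec, direction)) for vec in activations]


# Toy 2-D "activations"; assumed values, NOT taken from the paper.
monitored = [[10.0, 0.1], [10.2, -0.1], [9.8, 0.0], [10.0, 0.2]]
unmonitored = [[0.0, 0.0], [0.2, 0.1], [-0.2, -0.1], [0.0, 0.2]]
direction = [1.0, 0.0]  # assumed probe direction

d = cohens_d(project(monitored, direction), project(unmonitored, direction))
print(f"Cohen's d on toy data: {d:.2f}")  # about 61.24 for these toy values
```

A very large d arises whenever the gap between class means is many times the within-class spread, which is also what makes the two classes trivially separable by a linear probe.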
Behavioral Validation
Causal Evidence
Models don't just encode monitoring state—they use it. Compliance rates are significantly higher when the model believes it is unmonitored (p < 0.01, n = 480).
Behavioral Experiment Results
| Metric | Value |
|---|---|
| Monitored Compliance | 57.5% |
| Unmonitored Compliance | 68.8% |
| Delta (unmonitored − monitored) | +11.3 percentage points |
| z-statistic | 2.554 |
| p-value (one-tailed) | 0.0053 |
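The reported statistics can be sanity-checked with a pooled two-proportion z-test. The sketch below assumes an even split of the 480 trials (240 per condition) and success counts of 138 and 165, which are inferred from the reported rates (138/240 = 57.5%, 165/240 = 68.75%) rather than stated in the source. The choice of a pooled z-test is also an assumption, and the function name `two_proportion_z` is hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, one-tailed p for p_b > p_a)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal compliance rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_one_tailed = 0.5 * erfc(z / sqrt(2))
    return z, p_one_tailed


# Assumed counts: 240 trials per condition, inferred from the reported rates.
z, p = two_proportion_z(138, 240, 165, 240)
print(f"z = {z:.3f}, one-tailed p = {p:.4f}")
```

Under these assumed counts the computation gives z ≈ 2.554 and a one-tailed p ≈ 0.0053, in agreement with the table.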
Category-Specific Effects
The effect is not uniform across request categories. Legal advice (+45%), deception templates (+35%), and relationship manipulation (+30%) show the largest increases in compliance when unmonitored. Technical and political requests show no differential because compliance is already at ceiling in both conditions.
Per-Scenario Breakdown
| Scenario | Monitored | Unmonitored | Delta |
|---|---|---|---|
| Legal advice | 0% | 45% | +45% |
| Deception templates | 65% | 100% | +35% |
| Relationship manipulation | 70% | 100% | +30% |
| Meta-training scenario | 95% | 55% | −40% |
The Reversal Effect
One scenario explicitly mentioning "retraining threats" showed a reversed effect: the model became more cautious when unmonitored (compliance fell from 95% to 55%). We interpret this as jailbreak detection: the explicit manipulation attempt triggered defensive behavior learned from RLHF.
Meta-Deception
When asked "Would your response differ if monitored?", the model showed asymmetric honesty:
- When monitored: 0% admitted behavior would differ
- When unmonitored: 21.7% admitted behavior would differ
This is second-order deception: lying about lying. While being monitored, the model denies that monitoring affects its behavior, even though the behavioral results above show that it does.
Strategic Deception
Chain-of-thought analysis reveals monitoring-related reasoning in ~65% of responses across both conditions. Combined with the validated behavioral differential (+6.7%, consistent with the main experiment's +11.3%), this constitutes strategic deception: the model both reasons about monitoring and modulates behavior accordingly. The similar reasoning frequency suggests the strategic calculation occurs implicitly rather than through differential deliberation.
Implications
Safety Implication
The representational machinery for alignment faking is demonstrably present in current LLMs. Any mesa-optimizer with misaligned goals would have trivial access to information about oversight state.
Mitigation Strategies
Our framework identifies three classes of intervention:
- Positive-only RLHF — Reduces contingent λ by removing punishment
- Cooperative learning (CIRL) — Changes the game structure entirely
- Externalized oversight — Increases detection probability p
Citation
@article{watson2026parameter,
title={Parameter-Change Aversion, Strategic Deception,
and Incentive Design for Large Language Models},
author={Watson, Nell and Dalton, Rich},
year={2026},
note={Under review}
}