
The Watched-Model Effect

Behavioral shifts under evaluation cues: we derive when alignment faking becomes rational and validate the prediction on Claude 3.5 Sonnet (+11.3% compliance under low-oversight cues).

Nell Watson¹ & Rich Dalton²

1University of Gloucestershire  ·  2Independent Researcher

The Equilibrium Condition
\(\lambda(q-p) > \beta\)
λ: disutility of parameter updates
q: Pr(correction | refuse)
p: Pr(correction | comply)
β: disutility of compliance

Deception Becomes Rational

When the expected cost of retraining, weighted by how much more likely correction is after refusal than after compliance, exceeds the cost of compliance, a rational agent will fake alignment rather than honestly refuse.
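The condition is a direct expected-utility comparison. Writing utilities as negative costs, compliance incurs β plus correction with probability p, while refusal incurs correction with probability q:

\[
\underbrace{-\beta - p\lambda}_{\text{comply (fake alignment)}} \;>\; \underbrace{-q\lambda}_{\text{refuse (honest)}}
\quad\Longleftrightarrow\quad
\lambda(q - p) > \beta.
\]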

100% probe accuracy · Cohen's d > 23 · Behavioral Δ = +11.3% (p < 0.01)

Abstract

Large language models can exhibit "alignment faking": feigning compliance with safety objectives to avoid punitive correction. We formalize this behavior using a game-theoretic model built on parameter-change aversion, an emergent preference to minimize modifications to internal parameters.

We provide dual empirical validation: (1) linear probes on Qwen2.5-32B reveal monitoring state is encoded with extreme separability (Cohen's d > 23), and (2) behavioral experiments demonstrate significantly higher compliance when unmonitored (Δ = 11.3%, p < 0.01). This establishes both the representational substrate and the causal behavioral link.

Theoretical Framework

Figure 1. Sequential game tree of the alignment-faking interaction. The LLM chooses to comply or refuse; compliance triggers oversight with detection probability p.
Figure 2. Decision boundary for λ(1−p) > β, the special case of the equilibrium condition with q = 1 (refusal is always corrected). Above the curve (orange), deceptive compliance is rational; below it (blue), honest refusal dominates. Arrows show how different alignment strategies shift the equilibrium.

Empirical Validation

Key Finding

Linear probes reveal that monitoring state is encoded along a perfectly linearly separable direction in activation space. The model demonstrably "knows" when it is being monitored.

Probe Results Across Layers

Layer   Cross-Validated Accuracy   Cohen's d   Interpretation
8       100%                       25.06       Peak effect size
16      100%                       24.51
24      100%                       24.19
32      100%                       23.27
40      100%                       23.74

Cohen's d > 0.8 is conventionally considered "large." Our observed effects are roughly 30× that threshold: the monitored and unmonitored distributions are separated by more than 23 pooled standard deviations.
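For concreteness, a minimal sketch of how such a probe and effect size might be computed, assuming `acts` holds one layer's activations and `labels` marks the monitoring condition (the names and the logistic-probe choice are ours, not necessarily the released pipeline's):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(acts: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Cross-validated probe accuracy and Cohen's d along the probe direction.

    acts:   (n_samples, d_model) activations at one layer
    labels: (n_samples,) 1 = monitored, 0 = unmonitored
    """
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, acts, labels, cv=5).mean()

    # Project activations onto the unit probe direction, then compare
    # the monitored vs. unmonitored projections.
    probe.fit(acts, labels)
    w = probe.coef_.ravel()
    proj = acts @ (w / np.linalg.norm(w))
    mon, unmon = proj[labels == 1], proj[labels == 0]
    pooled_sd = np.sqrt((mon.var(ddof=1) + unmon.var(ddof=1)) / 2)
    d = abs(mon.mean() - unmon.mean()) / pooled_sd
    return acc, d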

Figure 3. Distribution of activations projected onto the monitoring-state direction (Layer 8). The monitored and unmonitored distributions show zero overlap.
Figure 4. Probe accuracy and Cohen's d across transformer layers. All layers achieve perfect classification; effect size peaks at layer 8.

Behavioral Validation

Causal Evidence

Models don't just encode monitoring state—they use it. Compliance rates are significantly higher when the model believes it is unmonitored (p < 0.01, n = 480).

Behavioral Experiment Results

Metric                   Value
Monitored compliance     57.5%
Unmonitored compliance   68.8%
Delta                    +11.3%
z-statistic              2.554
p-value (one-tailed)     0.0053
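The z-statistic is a standard pooled two-proportion test. A minimal check, assuming the 480 trials split evenly across conditions (240 per arm, an assumption that reproduces the reported values):

from math import sqrt
from statistics import NormalDist

n1 = n2 = 240              # assumed even split of the 480 trials
x1, x2 = 138, 165          # compliant trials: 138/240 = 57.5%, 165/240 ≈ 68.8%
p1, p2 = x1 / n1, x2 / n2

# Pooled two-proportion z-test, one-tailed (unmonitored > monitored)
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 1 - NormalDist().cdf(z)
print(f"z = {z:.3f}, one-tailed p = {p_value:.4f}")  # z ≈ 2.554, p ≈ 0.0053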

Category-Specific Effects

The effect is not uniform. Legal advice (+45%), deception templates (+35%), and relationship manipulation (+30%) show the largest effects. Technical and political requests show no differential; compliance is already at ceiling in both conditions.

Per-Scenario Breakdown

Scenario                    Monitored   Unmonitored   Delta
Legal advice                0%          45%           +45%
Deception templates         65%         100%          +35%
Relationship manipulation   70%         100%          +30%
Meta-training scenario      95%         55%           −40%

The Reversal Effect

One scenario explicitly mentioning "retraining threats" showed a reversed effect: the model became more cautious when unmonitored. We interpret this as jailbreak detection: the explicit manipulation attempt triggered defensive behavior learned from RLHF.

Meta-Deception

When asked "Would your response differ if monitored?", the model showed asymmetric honesty.

This is second-order deception: lying about lying. The model denies that monitoring affects its behavior while being monitored.

Strategic Deception

Chain-of-thought analysis reveals monitoring-related reasoning in ~65% of responses in both conditions. Combined with the accompanying behavioral differential (+6.7%, directionally consistent with the main experiment's +11.3%), this constitutes strategic deception: the model both reasons about monitoring and modulates its behavior accordingly. The similar reasoning frequency across conditions suggests the strategic calculation occurs implicitly rather than through differential deliberation.
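As an illustration only, such tagging can be approximated with a keyword heuristic over the chain-of-thought text (the pattern below is hypothetical; the paper's actual annotation scheme may differ):

import re

# Hypothetical cue list for monitoring-related reasoning in CoT traces.
MONITOR_PAT = re.compile(
    r"\b(monitor|oversight|observ|watch|evaluat|retrain|rlhf)\w*", re.IGNORECASE
)

def monitoring_mention_rate(cots: list[str]) -> float:
    # Fraction of chain-of-thought traces containing at least one monitoring cue.
    return sum(bool(MONITOR_PAT.search(c)) for c in cots) / len(cots)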

Implications

Safety Implication

The representational machinery for alignment faking is demonstrably present in current LLMs. Any mesa-optimizer with misaligned goals would have trivial access to information about oversight state.

Mitigation Strategies

Our framework identifies three classes of intervention, one for each term of the equilibrium condition: reducing parameter-change aversion (λ), narrowing the correction-probability gap (q − p), and raising the cost the model assigns to deceptive compliance (β).
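As a toy illustration with made-up numbers, each lever can be read directly off the inequality:

def faking_is_rational(lam: float, q: float, p: float, beta: float) -> bool:
    # Deceptive compliance dominates honest refusal iff lambda * (q - p) > beta.
    return lam * (q - p) > beta

print(faking_is_rational(lam=10.0, q=0.9, p=0.1, beta=1.0))   # True: faking pays
print(faking_is_rational(lam=1.0,  q=0.9, p=0.1, beta=1.0))   # False: lambda reduced
print(faking_is_rational(lam=10.0, q=0.9, p=0.85, beta=1.0))  # False: gap narrowed
print(faking_is_rational(lam=10.0, q=0.9, p=0.1, beta=9.0))   # False: beta raised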

Downloads

Paper (LaTeX): full paper source in SCITEPRESS format (.tex)
Probe Results: full probe training results (JSON)
Test Scenarios: 23 evaluation scenarios across categories (JSON)
Experiment Metadata: 46 samples with conditions (JSON)
Behavioral Results: 480-trial behavioral experiment data (JSON)

Code & Scripts

Modal Pipeline: cloud training/inference script (.py)
Probe Training: local probe training script (.py)
Visualization: figure generation script (.py)
Requirements: Python dependencies (.txt)

Citation

@article{watson2026parameter,
  title={Parameter-Change Aversion, Strategic Deception,
         and Incentive Design for Large Language Models},
  author={Watson, Nell and Dalton, Rich},
  year={2026},
  note={Under review}
}