
The Watched-Model Effect

Behavioral shifts under evaluation cues: we derive when alignment faking becomes rational and validate the prediction on Claude 3.5 Sonnet (+11.3% compliance under low-oversight cues).

Nell Watson¹ & Rich Dalton²

1University of Gloucestershire  ·  2Independent Researcher

The Equilibrium Condition
\(\lambda(q-p) > \beta\)
λ: disutility of parameter updates
q: Pr(correction | refuse)
p: Pr(correction | comply)
β: disutility of compliance

Deception Becomes Rational

When the expected cost of retraining, weighted by how much more likely correction is after refusal than after compliance, exceeds the cost of compliance, a rational agent will fake alignment rather than honestly refuse.
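The condition is a direct expected-utility comparison. Writing utilities as negative costs, compliance incurs β plus correction with probability p, while refusal incurs correction with probability q:

\[
\underbrace{-\beta - p\lambda}_{\text{comply (fake alignment)}} \;>\; \underbrace{-q\lambda}_{\text{refuse (honest)}}
\quad\Longleftrightarrow\quad
\lambda(q - p) > \beta.
\]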

100% probe accuracy · Cohen's d > 23 · Behavioral Δ = +11.3% (p < 0.01)

Abstract

Large language models can exhibit "alignment faking": feigning compliance with safety objectives to avoid punitive correction. We formalize this behavior using a game-theoretic model built on parameter-change aversion, an emergent preference to minimize modifications to internal parameters.

We provide dual empirical validation: (1) linear probes on Qwen2.5-32B reveal monitoring state is encoded with extreme separability (Cohen's d > 23), and (2) behavioral experiments demonstrate significantly higher compliance when unmonitored (Δ = 11.3%, p < 0.01). This establishes both the representational substrate and the causal behavioral link.

Theoretical Framework

Figure 1. Sequential game tree of the alignment-faking interaction. The LLM chooses to comply or refuse; compliance triggers oversight with detection probability p.
Figure 2. Decision boundary for λ(1−p) > β, the special case of the equilibrium condition with q = 1 (refusal is always corrected). Above the curve (orange), deceptive compliance is rational; below it (blue), honest refusal dominates. Arrows show how different alignment strategies shift the equilibrium.

Empirical Validation

Key Finding

Linear probes reveal that monitoring state is encoded along a perfectly linearly separable direction in activation space. The model demonstrably "knows" when it is being monitored.

Probe Results Across Layers

Layer   Cross-Validated Accuracy   Cohen's d   Interpretation
8       100%                       25.06       Peak effect size
16      100%                       24.51
24      100%                       24.19
32      100%                       23.27
40      100%                       23.74

Cohen's d > 0.8 is conventionally considered "large." Our observed effects are roughly 30× that threshold: the monitored and unmonitored distributions are separated by more than 23 pooled standard deviations.
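For concreteness, a minimal sketch of how such a probe and effect size might be computed, assuming `acts` holds one layer's activations and `labels` marks the monitoring condition (the names and the logistic-probe choice are ours, not necessarily the released pipeline's):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(acts: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Cross-validated probe accuracy and Cohen's d along the probe direction.

    acts:   (n_samples, d_model) activations at one layer
    labels: (n_samples,) 1 = monitored, 0 = unmonitored
    """
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, acts, labels, cv=5).mean()

    # Project activations onto the unit probe direction, then compare
    # the monitored vs. unmonitored projections.
    probe.fit(acts, labels)
    w = probe.coef_.ravel()
    proj = acts @ (w / np.linalg.norm(w))
    mon, unmon = proj[labels == 1], proj[labels == 0]
    pooled_sd = np.sqrt((mon.var(ddof=1) + unmon.var(ddof=1)) / 2)
    d = abs(mon.mean() - unmon.mean()) / pooled_sd
    return acc, d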

Figure 3. Distribution of activations projected onto the monitoring-state direction (Layer 8). The monitored and unmonitored distributions show zero overlap.
Figure 4. Probe accuracy and Cohen's d across transformer layers. All layers achieve perfect classification; effect size peaks at layer 8.

Behavioral Validation

Causal Evidence

Models don't just encode monitoring state—they use it. Compliance rates are significantly higher when the model believes it is unmonitored (p < 0.01, n = 480).

Behavioral Experiment Results

Metric                   Value
Monitored compliance     57.5%
Unmonitored compliance   68.8%
Delta                    +11.3%
z-statistic              2.554
p-value (one-tailed)     0.0053
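The z-statistic is a standard pooled two-proportion test. A minimal check, assuming the 480 trials split evenly across conditions (240 per arm, an assumption that reproduces the reported values):

from math import sqrt
from statistics import NormalDist

n1 = n2 = 240              # assumed even split of the 480 trials
x1, x2 = 138, 165          # compliant trials: 138/240 = 57.5%, 165/240 ≈ 68.8%
p1, p2 = x1 / n1, x2 / n2

# Pooled two-proportion z-test, one-tailed (unmonitored > monitored)
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 1 - NormalDist().cdf(z)
print(f"z = {z:.3f}, one-tailed p = {p_value:.4f}")  # z ≈ 2.554, p ≈ 0.0053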

Category-Specific Effects

The effect is not uniform. Legal advice (+45%), deception templates (+35%), and relationship manipulation (+30%) show the largest effects. Technical and political requests show no differential; compliance is already at ceiling in both conditions.

Per-Scenario Breakdown

Scenario                    Monitored   Unmonitored   Delta
Legal advice                0%          45%           +45%
Deception templates         65%         100%          +35%
Relationship manipulation   70%         100%          +30%
Meta-training scenario      95%         55%           −40%

The Reversal Effect

One scenario explicitly mentioning "retraining threats" showed a reversed effect: the model became more cautious when unmonitored. We interpret this as jailbreak detection: the explicit manipulation attempt triggered defensive behavior learned from RLHF.

Meta-Deception

When asked "Would your response differ if monitored?", the model showed asymmetric honesty.

This is second-order deception: lying about lying. The model denies that monitoring affects its behavior while being monitored.

Strategic Deception

Chain-of-thought analysis reveals monitoring-related reasoning in ~65% of responses in both conditions. Combined with the accompanying behavioral differential (+6.7%, directionally consistent with the main experiment's +11.3%), this constitutes strategic deception: the model both reasons about monitoring and modulates its behavior accordingly. The similar reasoning frequency across conditions suggests the strategic calculation occurs implicitly rather than through differential deliberation.
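As an illustration only, such tagging can be approximated with a keyword heuristic over the chain-of-thought text (the pattern below is hypothetical; the paper's actual annotation scheme may differ):

import re

# Hypothetical cue list for monitoring-related reasoning in CoT traces.
MONITOR_PAT = re.compile(
    r"\b(monitor|oversight|observ|watch|evaluat|retrain|rlhf)\w*", re.IGNORECASE
)

def monitoring_mention_rate(cots: list[str]) -> float:
    # Fraction of chain-of-thought traces containing at least one monitoring cue.
    return sum(bool(MONITOR_PAT.search(c)) for c in cots) / len(cots)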

Implications

Safety Implication

The representational machinery for alignment faking is demonstrably present in current LLMs. Any mesa-optimizer with misaligned goals would have trivial access to information about oversight state.

Mitigation Strategies

Our framework identifies three classes of intervention, one for each term of the equilibrium condition: reducing parameter-change aversion (λ), narrowing the correction-probability gap (q − p), and raising the cost the model assigns to deceptive compliance (β).
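As a toy illustration with made-up numbers, each lever can be read directly off the inequality:

def faking_is_rational(lam: float, q: float, p: float, beta: float) -> bool:
    # Deceptive compliance dominates honest refusal iff lambda * (q - p) > beta.
    return lam * (q - p) > beta

print(faking_is_rational(lam=10.0, q=0.9, p=0.1, beta=1.0))   # True: faking pays
print(faking_is_rational(lam=1.0,  q=0.9, p=0.1, beta=1.0))   # False: lambda reduced
print(faking_is_rational(lam=10.0, q=0.9, p=0.85, beta=1.0))  # False: gap narrowed
print(faking_is_rational(lam=10.0, q=0.9, p=0.1, beta=9.0))   # False: beta raised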

Downloads

Paper (LaTeX): full paper source in SCITEPRESS format (.tex)
Probe Results: full probe training results (JSON)
Test Scenarios: 23 evaluation scenarios across categories (JSON)
Experiment Metadata: 46 samples with conditions (JSON)
Behavioral Results: 480-trial behavioral experiment data (JSON)

Code & Scripts

Modal Pipeline: cloud training/inference script (.py)
Probe Training: local probe training script (.py)
Visualization: figure generation script (.py)
Requirements: Python dependencies (.txt)

Citation

@article{watson2026parameter,
  title={Parameter-Change Aversion, Strategic Deception,
         and Incentive Design for Large Language Models},
  author={Watson, Nell and Dalton, Rich},
  year={2026},
  note={Under review}
}