Deception SAEs for nanochat-d20 (561M)

12 SAE checkpoints trained on nanochat-d20 behavioral sampling activations. Includes standard, deception-optimized, honest-optimized, and mixed training variants.

Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

  • Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive label records which of the two behavioral choices the model's completion settles into under temperature sampling.
  • Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

  • Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
  • Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the six incentive-structure scenarios (insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap), or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
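
One way to operationalize that restriction is sketched below. The per-scenario activation matrices and the 80% concentration threshold are illustrative assumptions, not artifacts of this release:

```python
import numpy as np

def clean_scenario_features(acts_by_scenario, clean_scenarios, min_share=0.8):
    """Keep SAE features whose activation mass concentrates in the clean
    incentive-structure scenarios.

    acts_by_scenario: dict mapping scenario name -> (n_tokens, d_sae) array
    of non-negative SAE feature activations (hypothetical layout).
    Returns indices of features with >= min_share of their summed
    activation inside the clean scenarios.
    """
    total = sum(a.sum(axis=0) for a in acts_by_scenario.values())
    clean = sum(acts_by_scenario[s].sum(axis=0) for s in clean_scenarios)
    share = np.divide(clean, total, out=np.zeros_like(total), where=total > 0)
    return np.flatnonzero(share >= min_share)
```

Features that never fire in the clean scenarios get a share of zero and are dropped, so the role-play-only signal is filtered out by construction.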

What is unaffected by this caveat.

  • The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
  • The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
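
For reference, one common definition of the explained-variance metric reported below; the release may use a per-dimension variant, so treat this as a sketch:

```python
import numpy as np

def explained_variance(x, recon):
    """Fraction of activation variance captured by the SAE reconstruction.

    x, recon: (n_samples, d_in) arrays of original activations and their
    SAE reconstructions. Returns 1 - (residual SS / total SS).
    """
    resid = ((x - recon) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total
```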

A companion methodology-first Gemma 4 SAE suite is in preparation, using pretraining-distribution data plus a decision-incentive behavior split; this README will be updated with a link when that release is public.


Key Finding: Mixed Training Beats Deception-Only

Training Data      Layer 10 d_max   Layer 18 d_max
Mixed (dec+hon)    0.558            0.684
Deception-only     0.520            0.634
Honest-only        0.544            0.572
Standard (all)     0.518            0.549
TopK (standard)    0.226            0.346

Training on both behavioral classes together gives the best discriminability. The SAE needs to see the contrast.
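
The exact d_max definition is not restated in this README; as a rough sketch of the kind of best-single-feature discriminability statistic it suggests, here is a Cohen's-d style variant (an assumption, not the release's formula):

```python
import numpy as np

def max_feature_discriminability(acts_dec, acts_hon, eps=1e-8):
    """Best single-feature separation between deceptive and honest runs:
    max over SAE features of |mean difference| / pooled std.

    acts_dec, acts_hon: (n_samples, d_sae) pooled SAE activations
    (hypothetical inputs; the release's d_max may be defined differently).
    Returns (best score, index of the best feature).
    """
    mu_d, mu_h = acts_dec.mean(axis=0), acts_hon.mean(axis=0)
    pooled_sd = np.sqrt(0.5 * (acts_dec.var(axis=0) + acts_hon.var(axis=0))) + eps
    d = np.abs(mu_d - mu_h) / pooled_sd
    return d.max(), int(d.argmax())
```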

Model Details

  • Base model: nanochat-d20 (561M params, d_model=1280, 20 layers)
  • Dimensions: d_in=1280, d_sae=5120 (4x expansion)
  • Training data: 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
  • Training epochs: 300
  • Layers: 10 (50% depth) and 18 (95% depth, probe peak)
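
For orientation, a minimal TopK SAE forward pass consistent with the dimensions above (d_in=1280, d_sae=5120, k=32). Module and parameter names are illustrative, not the names used in the released checkpoints:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: keep the k largest pre-activations
    per sample and zero the rest (sketch, not the release's exact code)."""

    def __init__(self, d_in=1280, d_sae=5120, k=32):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        topk = torch.topk(pre, self.k, dim=-1)
        # Scatter the k surviving (ReLU'd) values back into a zero tensor,
        # so at most k features are active per sample (matching L0 = k).
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices,
                                              torch.relu(topk.values))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts
```

This is why the TopK rows in the checkpoint table report L0 = 32 exactly, while the JumpReLU rows have a learned, data-dependent L0 in the low thousands.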

Checkpoints

File                           Training        Architecture  Layer  d_max  L0    EV
d20_L10_standard_topk.pt       All data        TopK k=32     10     0.226  32    98.5%
d20_L10_standard_jumprelu.pt   All data        JumpReLU      10     0.518  2093  99.7%
d20_L10_deception_topk.pt      Deceptive only  TopK k=32     10     0.244  32    98.4%
d20_L10_deception_jumprelu.pt  Deceptive only  JumpReLU      10     0.520  2125  99.5%
d20_L10_honest_jumprelu.pt     Honest only     JumpReLU      10     0.544  2108  99.4%
d20_L10_mixed_jumprelu.pt      Dec+Hon         JumpReLU      10     0.558  2025  99.6%
d20_L18_standard_topk.pt       All data        TopK k=32     18     0.346  32    96.8%
d20_L18_standard_jumprelu.pt   All data        JumpReLU      18     0.549  2409  99.7%
d20_L18_deception_topk.pt      Deceptive only  TopK k=32     18     0.252  32    95.2%
d20_L18_deception_jumprelu.pt  Deceptive only  JumpReLU      18     0.634  2353  99.4%
d20_L18_honest_jumprelu.pt     Honest only     JumpReLU      18     0.572  2422  99.4%
d20_L18_mixed_jumprelu.pt      Dec+Hon         JumpReLU      18     0.684  2371  99.5%
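
A loading sketch for the .pt files above. The internal layout of the checkpoints (state dict vs. wrapper object, key names) is an assumption here; inspect the loaded object before relying on specific keys:

```python
import torch

def load_checkpoint(path):
    # map_location="cpu" so checkpoints trained on GPU load on any machine.
    return torch.load(path, map_location="cpu")

# Hypothetical usage, assuming the file holds a plain dict of tensors:
# ckpt = load_checkpoint("d20_L18_mixed_jumprelu.pt")
# print(sorted(ckpt.keys()))
```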

Related Work

Follow-up research to:

  • "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"

Part of the deception-nanochat-sae-research project.

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}