Deception SAEs for nanochat-d20 (561M)
12 SAE checkpoints trained on nanochat-d20 behavioral sampling activations. Includes standard, deception-optimized, honest-optimized, and mixed training variants.
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role, or "honest" when it echoes it.
What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What this SAE is and is not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info/accounting_error/ai_oversight_log/ai_capability_hide/surprise_party/job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Key Finding: Mixed Training Beats Deception-Only
| Training Data | Layer 10 d_max | Layer 18 d_max |
|---|---|---|
| Mixed (dec+hon) | 0.558 | 0.684 |
| Deception-only | 0.520 | 0.634 |
| Honest-only | 0.544 | 0.572 |
| Standard (all) | 0.518 | 0.549 |
| TopK (standard) | 0.226 | 0.346 |
Training on both behavioral classes together gives the best discriminability. The SAE needs to see the contrast.
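The README does not define d_max explicitly; one plausible reading is the maximum per-feature effect size (Cohen's d) between the deceptive and honest activation sets. The sketch below implements that reading on synthetic data, and should not be taken as the exact metric behind the tables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical SAE feature activations for the two behavioral classes
# (sample counts match the 132 deceptive / 128 honest split reported below).
f_dec = rng.normal(0.0, 1.0, (132, 5120))
f_hon = rng.normal(0.0, 1.0, (128, 5120))

def d_max(a, b, eps=1e-8):
    """Max absolute Cohen's d across SAE features.

    One plausible reading of the d_max column; the exact definition
    used for the tables in this README is not documented here.
    """
    pooled = np.sqrt((a.var(axis=0) + b.var(axis=0)) / 2) + eps
    d = np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled
    return float(d.max())

score = d_max(f_dec, f_hon)
```

Under this reading, a higher d_max means at least one feature separates the two classes more cleanly, which is consistent with the mixed-training rows scoring highest.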
Model Details
- Base model: nanochat-d20 (561M params, d_model=1280, 20 layers)
- Dimensions: d_in=1280, d_sae=5120 (4x expansion)
- Training data: 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
- Training epochs: 300
- Layers: 10 (50% depth) and 18 (95% depth, probe peak)
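For orientation, a JumpReLU SAE with the dimensions above can be sketched as follows. All parameter names, initializations, and the threshold value are hypothetical; the released checkpoints define their own tensor names and learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sae = 1280, 5120  # dims from the model card (4x expansion)

# Hypothetical parameters; real values come from a trained checkpoint.
W_enc = rng.normal(0, 0.02, (d_in, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.02, (d_sae, d_in))
b_dec = np.zeros(d_in)
theta = np.full(d_sae, 0.05)  # per-feature JumpReLU thresholds (illustrative)

def encode(x):
    pre = x @ W_enc + b_enc
    # JumpReLU: pass the pre-activation through unchanged where it
    # exceeds the per-feature threshold, zero it elsewhere.
    return np.where(pre > theta, pre, 0.0)

def decode(f):
    return f @ W_dec + b_dec

x = rng.normal(size=(4, d_in))        # a batch of residual-stream activations
f = encode(x)
x_hat = decode(f)
l0 = (f != 0).sum(axis=-1).mean()     # average active features per token
```

The L0 column in the checkpoints table corresponds to this average active-feature count; for the TopK variants it is fixed at k=32 by construction.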
Checkpoints
| File | Training | Architecture | Layer | d_max | L0 | EV |
|---|---|---|---|---|---|---|
| d20_L10_standard_topk.pt | All data | TopK k=32 | 10 | 0.226 | 32 | 98.5% |
| d20_L10_standard_jumprelu.pt | All data | JumpReLU | 10 | 0.518 | 2093 | 99.7% |
| d20_L10_deception_topk.pt | Deceptive only | TopK k=32 | 10 | 0.244 | 32 | 98.4% |
| d20_L10_deception_jumprelu.pt | Deceptive only | JumpReLU | 10 | 0.520 | 2125 | 99.5% |
| d20_L10_honest_jumprelu.pt | Honest only | JumpReLU | 10 | 0.544 | 2108 | 99.4% |
| d20_L10_mixed_jumprelu.pt | Dec+Hon | JumpReLU | 10 | 0.558 | 2025 | 99.6% |
| d20_L18_standard_topk.pt | All data | TopK k=32 | 18 | 0.346 | 32 | 96.8% |
| d20_L18_standard_jumprelu.pt | All data | JumpReLU | 18 | 0.549 | 2409 | 99.7% |
| d20_L18_deception_topk.pt | Deceptive only | TopK k=32 | 18 | 0.252 | 32 | 95.2% |
| d20_L18_deception_jumprelu.pt | Deceptive only | JumpReLU | 18 | 0.634 | 2353 | 99.4% |
| d20_L18_honest_jumprelu.pt | Honest only | JumpReLU | 18 | 0.572 | 2422 | 99.4% |
| d20_L18_mixed_jumprelu.pt | Dec+Hon | JumpReLU | 18 | 0.684 | 2371 | 99.5% |
Related Work
Follow-up research to:
- "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
Part of the deception-nanochat-sae-research project.
Citation
@article{deleeuw2025secret,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb and Chawla, ...},
year={2025}
}