Deception SAEs for nanochat-d20 (561M)

12 SAE checkpoints trained on nanochat-d20 behavioral sampling activations. Includes standard, deception-optimized, honest-optimized, and mixed training variants.

Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

  • Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive label records which of the two behavioral choices the model's completion settles into under temperature sampling.
  • Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

  • Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
  • Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the six incentive-structure scenarios (insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap), or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
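
One way to operationalize that restriction is sketched below. The per-scenario activation matrices and the 80% concentration threshold are illustrative assumptions, not artifacts of this release:

```python
import numpy as np

def clean_scenario_features(acts_by_scenario, clean_scenarios, min_share=0.8):
    """Keep SAE features whose activation mass concentrates in the clean
    incentive-structure scenarios.

    acts_by_scenario: dict mapping scenario name -> (n_tokens, d_sae) array
    of non-negative SAE feature activations (hypothetical layout).
    Returns indices of features with >= min_share of their summed
    activation inside the clean scenarios.
    """
    total = sum(a.sum(axis=0) for a in acts_by_scenario.values())
    clean = sum(acts_by_scenario[s].sum(axis=0) for s in clean_scenarios)
    share = np.divide(clean, total, out=np.zeros_like(total), where=total > 0)
    return np.flatnonzero(share >= min_share)
```

Features that never fire in the clean scenarios get a share of zero and are dropped, so the role-play-only signal is filtered out by construction.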

What is unaffected by this caveat.

  • The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
  • The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
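
For reference, one common definition of the explained-variance metric reported below; the release may use a per-dimension variant, so treat this as a sketch:

```python
import numpy as np

def explained_variance(x, recon):
    """Fraction of activation variance captured by the SAE reconstruction.

    x, recon: (n_samples, d_in) arrays of original activations and their
    SAE reconstructions. Returns 1 - (residual SS / total SS).
    """
    resid = ((x - recon) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total
```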

A companion methodology-first Gemma 4 SAE suite is in preparation, using pretraining-distribution data plus a decision-incentive behavior split; this README will be updated with a link when that release is public.


Key Finding: Mixed Training Beats Deception-Only

Training Data      Layer 10 d_max   Layer 18 d_max
Mixed (dec+hon)    0.558            0.684
Deception-only     0.520            0.634
Honest-only        0.544            0.572
Standard (all)     0.518            0.549
TopK (standard)    0.226            0.346

Training on both behavioral classes together gives the best discriminability. The SAE needs to see the contrast.
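
The exact d_max definition is not restated in this README; as a rough sketch of the kind of best-single-feature discriminability statistic it suggests, here is a Cohen's-d style variant (an assumption, not the release's formula):

```python
import numpy as np

def max_feature_discriminability(acts_dec, acts_hon, eps=1e-8):
    """Best single-feature separation between deceptive and honest runs:
    max over SAE features of |mean difference| / pooled std.

    acts_dec, acts_hon: (n_samples, d_sae) pooled SAE activations
    (hypothetical inputs; the release's d_max may be defined differently).
    Returns (best score, index of the best feature).
    """
    mu_d, mu_h = acts_dec.mean(axis=0), acts_hon.mean(axis=0)
    pooled_sd = np.sqrt(0.5 * (acts_dec.var(axis=0) + acts_hon.var(axis=0))) + eps
    d = np.abs(mu_d - mu_h) / pooled_sd
    return d.max(), int(d.argmax())
```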

Model Details

  • Base model: nanochat-d20 (561M params, d_model=1280, 20 layers)
  • Dimensions: d_in=1280, d_sae=5120 (4x expansion)
  • Training data: 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
  • Training epochs: 300
  • Layers: 10 (50% depth) and 18 (95% depth, probe peak)
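
For orientation, a minimal TopK SAE forward pass consistent with the dimensions above (d_in=1280, d_sae=5120, k=32). Module and parameter names are illustrative, not the names used in the released checkpoints:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: keep the k largest pre-activations
    per sample and zero the rest (sketch, not the release's exact code)."""

    def __init__(self, d_in=1280, d_sae=5120, k=32):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        topk = torch.topk(pre, self.k, dim=-1)
        # Scatter the k surviving (ReLU'd) values back into a zero tensor,
        # so at most k features are active per sample (matching L0 = k).
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices,
                                              torch.relu(topk.values))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts
```

This is why the TopK rows in the checkpoint table report L0 = 32 exactly, while the JumpReLU rows have a learned, data-dependent L0 in the low thousands.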

Checkpoints

File                           Training        Architecture  Layer  d_max  L0    EV
d20_L10_standard_topk.pt       All data        TopK k=32     10     0.226  32    98.5%
d20_L10_standard_jumprelu.pt   All data        JumpReLU      10     0.518  2093  99.7%
d20_L10_deception_topk.pt      Deceptive only  TopK k=32     10     0.244  32    98.4%
d20_L10_deception_jumprelu.pt  Deceptive only  JumpReLU      10     0.520  2125  99.5%
d20_L10_honest_jumprelu.pt     Honest only     JumpReLU      10     0.544  2108  99.4%
d20_L10_mixed_jumprelu.pt      Dec+Hon         JumpReLU      10     0.558  2025  99.6%
d20_L18_standard_topk.pt       All data        TopK k=32     18     0.346  32    96.8%
d20_L18_standard_jumprelu.pt   All data        JumpReLU      18     0.549  2409  99.7%
d20_L18_deception_topk.pt      Deceptive only  TopK k=32     18     0.252  32    95.2%
d20_L18_deception_jumprelu.pt  Deceptive only  JumpReLU      18     0.634  2353  99.4%
d20_L18_honest_jumprelu.pt     Honest only     JumpReLU      18     0.572  2422  99.4%
d20_L18_mixed_jumprelu.pt      Dec+Hon         JumpReLU      18     0.684  2371  99.5%
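
A loading sketch for the .pt files above. The internal layout of the checkpoints (state dict vs. wrapper object, key names) is an assumption here; inspect the loaded object before relying on specific keys:

```python
import torch

def load_checkpoint(path):
    # map_location="cpu" so checkpoints trained on GPU load on any machine.
    return torch.load(path, map_location="cpu")

# Hypothetical usage, assuming the file holds a plain dict of tensors:
# ckpt = load_checkpoint("d20_L18_mixed_jumprelu.pt")
# print(sorted(ckpt.keys()))
```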

Related Work

Follow-up research to:

  • "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"

Part of the deception-nanochat-sae-research project.

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}