lens-gemma-3-4b-first-light-v1

Concept detection lenses for google/gemma-3-4b-pt trained on the first-light concept pack.

Overview

Base Model: google/gemma-3-4b-pt
Concept Pack: first-light
Total Lenses: 7,947
Layers: 0-6
Dtype: bfloat16
Pack Size: ~5.1 GB

What are Lenses?

Lenses are small neural classifiers trained to detect specific concepts in a language model's hidden states. Each lens monitors activations at a specific layer and outputs a probability score indicating whether that concept is active.

This pack contains lenses for ~8,000 concepts from the SUMO ontology mapped to WordNet, covering:

  • Physical objects and entities
  • Actions and processes
  • Abstract concepts
  • AI safety-relevant concepts (deception, manipulation, sycophancy, etc.)
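
A lens of this kind can be sketched as a small MLP probe that reads one hidden-state vector and emits a probability. This is a minimal illustration, not the pack's exact implementation: the layer sizes follow the architecture listed under Training Details below, while the 2560 hidden size is an assumption about gemma-3-4b and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class ConceptLens(nn.Module):
    """Hypothetical concept probe: 3-layer MLP mapping a hidden state to one logit."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # Probability that the concept is active in this hidden state.
        return torch.sigmoid(self.net(hidden_state)).squeeze(-1)

lens = ConceptLens(hidden_dim=2560)  # 2560 is an assumed gemma-3-4b hidden size
h = torch.randn(2560)                # one hidden-state vector from some layer
score = lens(h)                      # scalar probability in (0, 1)
```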

Usage

from huggingface_hub import snapshot_download
import torch

# Download the lens pack
lens_pack_path = snapshot_download("HatCatFTW/lens-gemma-3-4b-first-light-v1")

# Load a specific lens (here: the layer-4 Deception lens)
deception_lens = torch.load(f"{lens_pack_path}/layer4/Deception.pt")
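
To score text, a lens needs the hidden states from the layer it was trained on, typically captured with a forward hook. The pattern can be sketched with a tiny stand-in model; in the real workflow you would hook the matching gemma-3-4b decoder layer, and both the stand-in model and lens head here are illustrative.

```python
import torch
import torch.nn as nn

hidden_dim = 16  # tiny stand-in; the real model's hidden size is much larger

# Stand-in "model": a stack of layers in place of the real transformer.
model = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(6)])

captured = {}
def save_hidden(module, inputs, output):
    # Record this layer's output: the hidden states a lens reads.
    captured["h"] = output.detach()

# Hook the layer the lens was trained on (layer 4 here, matching layer4/).
handle = model[4].register_forward_hook(save_hidden)

x = torch.randn(1, hidden_dim)
model(x)
handle.remove()

# Score the captured activations with a stand-in lens head.
lens = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
prob = lens(captured["h"])  # shape (1, 1), value in (0, 1)
```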

For full integration with the HatCat monitoring system:

from src.map.registry import PackRegistry

registry = PackRegistry()
registry.pull_lens_pack("gemma-3-4b-first-light-v1")

Training Details

  • Validation Mode: Falloff (tests generalization across model layers)
  • Training Samples: 50 positive / 50 negative per concept
  • Test Samples: 20 positive / 20 negative per concept
  • Architecture: 3-layer MLP (input → 128 → 64 → 1)
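
Falloff validation can be pictured as evaluating one lens against activations from every layer and watching accuracy fall off away from the training layer. The toy sketch below uses synthetic per-layer activations and an untrained lens head purely to show the shape of the procedure; the actual HatCat evaluation may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, n_layers = 16, 7  # 7 layers, matching layers 0-6 in this pack

# Toy per-layer activations: 20 positive / 20 negative examples per layer,
# mirroring the 20/20 test split described above.
acts = {l: (torch.randn(20, hidden_dim) + l, torch.randn(20, hidden_dim) - l)
        for l in range(n_layers)}

lens = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

def accuracy(lens, pos, neg):
    preds = torch.cat([lens(pos), lens(neg)]).squeeze(-1) > 0.5
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))]).bool()
    return (preds == labels).float().mean().item()

# Score the same lens at every layer; generalization "falls off" as the
# activation distribution drifts from the layer it was trained on.
falloff = {l: accuracy(lens, *acts[l]) for l in range(n_layers)}
```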

License

MIT
