anthonym21 committed (verified) · commit c26c368 · 1 parent: d3ef19b

Add model card from paper/hf_model_card.md


Upload project model card as README.md for the HuggingFace repo page.

Files changed (1): README.md (+232 −54)
- differential-attention
- bounded-memory
- kv-cache
- landmark
- pytorch
- mistral
- coda-gqa-l
library_name: transformers
pipeline_tag: text-generation
language:
- en
---

# Mistral-7B-v0.3 + CoDA-GQA-L

Mistral 7B v0.3 with standard attention replaced by **CoDA-GQA-L** (Constrained Orthogonal Differential Attention with Grouped-Query Attention and Value-Routed Landmark Banks).

The model uses a fixed-size three-segment KV cache instead of the standard O(L) cache:

| Segment | Size | Function |
|---------|------|----------|
| Recent window | W=256 | Ring buffer of the latest tokens |
| Exact landmark bank | Me=64 | Novelty-filtered LRU cache of important tokens |
| Summary landmark bank | Ms=64 | EMA prototypes compressing older context |

**Total: 384 slots per layer (54 MB across 32 layers)**, regardless of sequence length. Standard Mistral at 4,096 tokens uses 512 MB of KV cache; at 128K tokens the standard cache grows to 64 GB while CoDA stays at 54 MB (1,185x compression).

## Quick start

### Install

```bash
pip install coda-gqa-l transformers accelerate
```

### Bounded mode (constant-memory inference)

Bounded mode requires a manual generation loop because CoDA manages its own KV state internally; HF's `model.generate()` cannot drive it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
from coda_gqa_l import LlamaCoDAAdapter

# 1. Load base model + tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'mistralai/Mistral-7B-v0.3',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.3')

# 2. Swap attention layers to bounded CoDA and load trained weights
adapter_path = hf_hub_download(
    'anthonym21/Mistral-7B-v0.3-CoDA-GQA-L', 'coda_adapters.pt'
)
adapters = torch.load(adapter_path, map_location='cpu', weights_only=True)

for i, layer in enumerate(model.model.layers):
    device = next(layer.parameters()).device
    adapter = LlamaCoDAAdapter.from_llama_attention(
        layer.self_attn,
        bounded=True,
        head_norm_mode='identity',
        rope_interleaved=False,
    )
    adapter.load_state_dict(adapters[f'layer_{i}'], strict=False)
    adapter = adapter.to(device=device, dtype=torch.bfloat16)
    layer.self_attn = adapter

# 3. CRITICAL: call eval() AFTER installing adapters.
# New modules default to training=True, which uses a stateless
# code path (fresh empty state every call). eval() switches to
# the persistent stateful path needed for generation.
model.eval()

# 4. Manual generation loop
prompt = 'The future of AI is'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
temperature = 0.7
generated = input_ids[0].tolist()

with torch.no_grad():
    # Prefill: full prompt in one pass (adapters use prefill_chunked)
    outputs = model(input_ids=input_ids, use_cache=False)
    logits = outputs.logits[:, -1, :]

    next_token = torch.multinomial(
        torch.softmax(logits / temperature, dim=-1), 1
    )
    generated.append(next_token.item())

    # Decode: one token at a time (adapters use step())
    for _ in range(199):
        if next_token.item() == tokenizer.eos_token_id:
            break
        outputs = model(input_ids=next_token, use_cache=False)
        logits = outputs.logits[:, -1, :]
        next_token = torch.multinomial(
            torch.softmax(logits / temperature, dim=-1), 1
        )
        generated.append(next_token.item())

print(tokenizer.decode(generated, skip_special_tokens=True))
```

### Unbounded mode (full-KV differential attention)

For standard generation without memory banks, use `bounded=False` with HF's `model.generate()`:

```python
for i, layer in enumerate(model.model.layers):
    device = next(layer.parameters()).device
    adapter = LlamaCoDAAdapter.from_llama_attention(
        layer.self_attn,
        bounded=False,  # <-- unbounded
        head_norm_mode='identity',
        rope_interleaved=False,
    )
    adapter.load_state_dict(adapters[f'layer_{i}'], strict=False)
    adapter = adapter.to(device=device, dtype=torch.bfloat16)
    layer.self_attn = adapter

model.eval()

# HF generate() works in unbounded mode
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=200, use_cache=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Common pitfalls

**`model.eval()` must come AFTER adapter installation.** New PyTorch modules default to `training=True`. The bounded forward path branches on `self.training`: in training mode it allocates a fresh empty state on every call (for gradient-checkpointing safety), so decode tokens have zero context and the model produces deterministic garbage. Always call `model.eval()` after the adapter swap loop.

**Use the base Mistral tokenizer.** The tokenizer config in this repo has a broken `tokenizer_class` field. Load the tokenizer from `mistralai/Mistral-7B-v0.3` instead.

**Always pass `use_cache=False`.** CoDA manages its own KV cache internally; HF's cache system conflicts with it.

**Use `strict=False` when loading weights.** Bounded adapters have extra parameters (`write_proj`, `summary_eta_logit`) not present in the unbounded-trained checkpoint. These keep their default initialization, which works well for inference.

## How bounded mode works

During generation, each `LlamaCoDAAdapter` manages an internal state machine:

1. **Prefill** (prompt processing): The full prompt passes through the model. Each adapter receives `hidden_states` with `L > 1` tokens and routes through `prefill_chunked()`, which processes the prompt in blocks and populates the bounded KV buffer.

2. **Decode** (token generation): Each new token passes through the model individually. Adapters receive `L == 1` and route through `step()`, which:
   - writes the new token into the recent-window ring buffer;
   - if the ring buffer is full, evicts the oldest token;
   - passes evicted tokens through a write gate;
   - routes tokens above threshold to the memory banks via value cosine similarity;
   - stores novel tokens in the exact bank (LRU eviction when full);
   - blends similar tokens into the summary bank via EMA.

The KV buffer layout is `[recent W | exact Me | summary Ms]` = 384 slots, constant regardless of how many tokens have been generated.
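The write path above can be sketched with toy data structures. This is an illustrative miniature (made-up sizes, one scalar "value" per token, a fixed novelty threshold, a single summary prototype), not the library's implementation:

```python
from collections import OrderedDict, deque

W, ME = 4, 2                 # toy sizes; the real config is W=256, Me=64, Ms=64
recent = deque(maxlen=W)     # ring buffer of exact recent tokens
exact = OrderedDict()        # novelty-filtered LRU bank
summary, ETA = 0.0, 0.1      # single EMA prototype (the real bank keeps Ms of them)

def step(tok_id, value):
    """Decode-time write path: newest token in, oldest token routed out."""
    global summary
    evicted = recent[0] if len(recent) == W else None
    recent.append((tok_id, value))
    if evicted is None:
        return
    eid, ev = evicted
    # novelty filter: far from every stored value -> exact bank, else summary EMA
    if all(abs(ev - v) > 1.5 for v in exact.values()):
        exact[eid] = ev
        if len(exact) > ME:
            exact.popitem(last=False)   # LRU eviction of the oldest entry
    else:
        summary = (1 - ETA) * summary + ETA * ev

for t in range(10):
    step(t, float(t))

# Occupancy never exceeds the fixed budget, no matter how many tokens ran
assert len(recent) + len(exact) <= W + ME
```

The point of the sketch is the invariant in the final assert: total state stays bounded while evicted tokens are either kept exactly, merged into a prototype, or dropped by the gate.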

## Architecture

CoDA-GQA-L replaces standard attention with constrained orthogonal differential attention:

```
x -> q_proj -> q_signal -> RoPE(q, pos) -> q_roped
                                       \-> R(theta) -> q_noise

SDPA(q_roped, K_buf, V_buf) -> out_signal
SDPA(q_noise, K_buf, V_buf) -> out_noise

output = RMSNorm(out_signal - lambda * out_noise)
```

The noise query is produced by rotating the signal query through learnable orthogonal angles (no second Wq projection). Lambda is a learned per-token gate initialized near zero (sigmoid(-6) ~ 0.0025), so the model starts as near-standard attention and the differential mechanism activates gradually during training.
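The combine step can be checked numerically with a toy single-query example. Everything here is illustrative (random data, pairwise Givens rotations standing in for R(theta), plain softmax attention as SDPA), not the adapter's code:

```python
import numpy as np

def sdpa(q, K, V):
    # single-query softmax attention
    w = np.exp(q @ K.T / np.sqrt(len(q)))
    return (w / w.sum()) @ V

def rotate_pairs(q, theta):
    # orthogonal rotation: each (even, odd) coordinate pair by its own angle
    out = q.copy()
    c, s = np.cos(theta), np.sin(theta)
    out[0::2] = c * q[0::2] - s * q[1::2]
    out[1::2] = s * q[0::2] + c * q[1::2]
    return out

def rmsnorm(x):
    return x / np.sqrt((x * x).mean() + 1e-6)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K, V = rng.standard_normal((5, 8)), rng.standard_normal((5, 8))
theta = rng.standard_normal(4) * 0.1     # learnable angles (random here)

lam = 1.0 / (1.0 + np.exp(6.0))          # sigmoid(-6) ~ 0.0025 at init
out = rmsnorm(sdpa(q, K, V) - lam * sdpa(rotate_pairs(q, theta), K, V))

# With lambda near zero, the layer starts as (normalized) standard attention
assert np.allclose(out, rmsnorm(sdpa(q, K, V)), atol=0.02)
```

The final assert demonstrates the initialization claim: at sigmoid(-6), the noise branch perturbs the output by well under a percent.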

**Value-routing** is a deliberate design decision: the memory banks match tokens by cosine similarity on Values, not Keys. Keys have RoPE rotation applied, so identical tokens at different positions have different key vectors. Values are RoPE-free, making their similarity position-invariant, which is the right property for deduplication (exact bank) and clustering (summary bank).
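The position-invariance argument is easy to verify numerically. A toy check with a hand-rolled RoPE (dimensions and base are illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # rotate (even, odd) coordinate pairs by position-dependent angles
    d = len(x)
    ang = pos * base ** (-np.arange(0, d, 2) / d)
    c, s = np.cos(ang), np.sin(ang)
    out = x.copy()
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
token = rng.standard_normal(8)                 # identical token content...
k_a, k_b = rope(token, 10), rope(token, 500)   # ...keyed at two positions

# Keys diverge under RoPE, so key-matching would treat the same token at
# different positions as different entries; values carry no RoPE, so value
# similarity is 1 regardless of position.
key_sim = cos_sim(k_a, k_b)
value_sim = cos_sim(token, token)
```

Here `key_sim` drops well below 1 while `value_sim` stays at 1, which is exactly why the banks route on Values.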

## Training details

| | Phase 1 (unbounded) | Phase 2 (bounded) |
|---|---|---|
| Attention | CoDAGQA (full KV) | CoDAGQALandmarkPerf2 (384 slots) |
| Steps | 2,000 | 2,000 |
| Dataset | WikiText-103 | WikiText-103 |
| Sequence length | 2,048 | 2,048 |
| Learning rate (projections) | 5e-5 | 2.5e-5 |
| Learning rate (CoDA params) | 1e-3 | 5e-4 |
| Batch size | 1 x 8 grad accum | 1 x 8 grad accum |
| Trainable params | ~1.3B / 7.2B (attention only) | ~1.3B / 7.2B |
| Best unbounded PPL | 5.94 | -- |
| Gradient checkpointing | Yes | No (incompatible with grad-through-banks) |
| detach_evicted | N/A | False (gradients flow through bank updates) |

Phase 1 teaches the differential attention mechanism with full context available. Phase 2 adapts the model to bounded memory by training the write gate and bank parameters. Both phases freeze all non-attention parameters (MLP, embeddings, layer norms).

## Memory budget

| Configuration | Per layer | 32 layers total | vs unbounded at 4K |
|---|---|---|---|
| medium-cache (default) | 1.7 MB | 54 MB | 9.5x smaller |
| tiny-cache (W=128, Me=32, Ms=32) | 865 KB | 27 MB | 19x smaller |
| window-only (W=256, Me=0, Ms=0) | 1.0 MB | 32 MB | 16x smaller |

At 128K context, the savings reach 1,185x (54 MB vs 64 GB).
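As a back-of-envelope check on the per-layer figure, assuming Mistral-7B's GQA geometry (8 KV heads x head_dim 128) and bf16; the reported 1.7 MB presumably adds small per-bank metadata on top of raw K/V storage:

```python
kv_heads, head_dim, bytes_per_elem = 8, 128, 2    # Mistral-7B GQA, bf16
slots = 256 + 64 + 64                             # W + Me + Ms = 384

per_layer = 2 * slots * kv_heads * head_dim * bytes_per_elem   # K and V planes
total = 32 * per_layer                                         # all layers

print(per_layer / 2**20)   # 1.5 (MiB of raw K/V per layer)
print(total / 2**20)       # 48.0 (MiB of raw K/V across 32 layers)
```

The raw K/V footprint lands within a few percent of the table's numbers, confirming the budget is dominated by the 384 cached slots themselves.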

## Benchmark numbers (H200, bf16)

From the paper, single-layer throughput at 7B scale:

| Config | Prefill L=2048 | Prefill L=8192 | Per-layer KV |
|---|---|---|---|
| Baseline GQA | 3,096K tok/s | 2,286K tok/s | 32.0 MB |
| CoDA unbounded | 1,852K tok/s | 1,283K tok/s | 32.0 MB |
| CoDA medium-cache | 160K tok/s | 158K tok/s | 1.7 MB |
| CoDA window-only | 392K tok/s | 397K tok/s | 1.0 MB |

Bounded throughput is flat across sequence lengths because bank updates operate on fixed-size buffers. The 2x SDPA cost of differential attention is the constant overhead; bank updates account for the remaining gap.
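Spelling out the ratios implied by the L=2048 column (my arithmetic on the table above, not additional measurements):

```python
baseline, unbounded, medium = 3096, 1852, 160   # K tok/s, from the table above

sdpa_overhead = baseline / unbounded   # ~1.67x: the second SDPA pass, below a
                                       # naive 2x since other layer work is shared
bank_overhead = unbounded / medium     # ~11.6x: cost of the bank-update machinery
```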

## Stateful Neural Database pattern

The bounded state is a fixed-size serializable artifact. You can ingest a document once, save the compressed state, and query it later without re-processing:

```python
# Ingest: process the document into bounded state
for layer in model.model.layers:
    layer.self_attn.reset_state()

with torch.no_grad():
    model(input_ids=document_tokens, use_cache=False)

# Save all layer states (54 MB total, constant regardless of doc length)
states = {}
for i, layer in enumerate(model.model.layers):
    states[i] = layer.self_attn.get_state()
torch.save(states, "document_state.pt")

# Later: load and query without re-reading the document
states = torch.load("document_state.pt")
for i, layer in enumerate(model.model.layers):
    layer.self_attn.set_state(states[i])

with torch.no_grad():
    outputs = model(input_ids=question_tokens, use_cache=False)
    # Decode the answer from outputs.logits
```

100 documents at 7B scale = 5.4 GB of state files. Each query is a decode-phase forward pass with sub-second latency.

## Files in this repo

- `coda_adapters.pt` -- trained CoDA adapter weights for all 32 layers
- `config.json`, `generation_config.json` -- Mistral model configs
- `model-00001-of-00003.safetensors` etc. -- base Mistral weights (identical to `mistralai/Mistral-7B-v0.3`)
- `tokenizer.model`, `tokenizer.json`, `tokenizer_config.json` -- tokenizer files (note: `tokenizer_config.json` has a broken `tokenizer_class`; use the base Mistral tokenizer instead)
- `special_tokens_map.json` -- special token mappings

## Requirements

- PyTorch >= 2.0 (2.5+ recommended for FlashAttention with `causal_lower_right`)
- CUDA GPU with bf16 support
- ~15 GB VRAM for bf16 inference on a single GPU, or ~24 GB across 2 GPUs with `device_map='auto'`

## Links

- **Code**: [github.com/anthony-maio/CoDA-GQA-L](https://github.com/anthony-maio/CoDA-GQA-L)
- **Package**: `pip install coda-gqa-l`
- **Paper**: CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks (Maio, 2026)
## Citation

```bibtex
@article{maio2026coda,
  title={CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks},
  author={Maio, Anthony},
  year={2026}
}
```