Yes, the KV cache is kept separately for each transformer block in an LM, because every block has its own attention heads with their own key and value projections. If you cached the keys and values from a single block and reused them across all blocks, you would not reproduce the representations the model computes without caching.
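To make the per-block point concrete, here is a minimal sketch of a per-layer KV cache. It is illustrative only, not the article's or Hugging Face's actual cache API; the shapes, the `kv_cache` structure, and the `append_to_cache` helper are assumptions for the example.

```python
# Minimal sketch: one KV cache entry per transformer block.
# Each block has its own key/value projections, so the keys and values
# it produces differ from every other block's and cannot be shared.
import torch

n_layers, n_heads, head_dim = 4, 8, 64

# One (keys, values) pair per transformer block, starting empty.
kv_cache = [
    {"k": torch.empty(1, n_heads, 0, head_dim),
     "v": torch.empty(1, n_heads, 0, head_dim)}
    for _ in range(n_layers)
]

def append_to_cache(layer_idx, new_k, new_v):
    """Append this decoding step's keys/values to one specific block's cache."""
    entry = kv_cache[layer_idx]
    entry["k"] = torch.cat([entry["k"], new_k], dim=2)  # concat along the sequence dim
    entry["v"] = torch.cat([entry["v"], new_v], dim=2)
    return entry["k"], entry["v"]
```

During generation, each block only ever reads and appends to its own entry (`kv_cache[layer_idx]`), which is why the cache grows per layer rather than once for the whole model.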