# WASM Interpreter Transformer
A hand-compiled transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — not by gradient descent. No training data, no loss function, no optimizer. Just linear algebra.
## What This Is
This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.
The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. Stack depth is computed internally by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.
50/50 test programs pass with 100% accuracy.
## Architecture
| Parameter | Value |
|---|---|
| `d_model` | 86 |
| `n_heads` | 12 |
| `d_head` | 2 |
| `n_layers` | 7 |
| `d_ffn` | 86 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU (ReLU gate) |
| Attention | Hard-max (argmax) + sum-mode for depth tracking |
| Total parameters | ~600K (all hand-compiled) |
## How It Works

### The 7-Layer Pipeline
- Layer 0: Opcode fetch — 10 paired attention heads identify 19 opcodes by matching one-hot flags, plus 1 head for operand retrieval
- Layer 1: Stack depth accumulation — 1 sum-attention head computes cumulative stack depth as a running sum of push/pop deltas. This replaces precomputed depth values in the positional encoding.
- Layer 2: Depth squaring — FFN-only layer computes `WRITE_DEPTH²` for use as a quadratic key in later retrieval heads
- Layer 3: Bit retrieval — 8 hard-max heads extract individual bits from historical stack values for AND/OR operations
- Layer 4: Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all 17 operations (52 neurons)
- Layer 5: Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
- Layer 6: Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
### Sum-Attention (Cumulative Sums)
Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the sum-mode head accumulates all past value vectors.
Each instruction's value encodes its stack delta: +1 for pushes (e.g., i32.const), -1 for pops (e.g., i32.add). The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.
This is the same "cumulative sums" primitive described in the article: a single attention head can track running totals by summing rather than selecting.
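As a concrete illustration, the running sum can be traced by hand. The per-instruction deltas below are inferred from the instruction semantics (push = +1, pop-two-push-one = -1), not read from the model's weights:

```python
# Stack deltas per instruction: a push adds 1, a binary op nets -1.
# The sum-mode head's output at step t is exactly this prefix sum.
deltas = {"i32.const": +1, "local.get": +1, "i32.add": -1, "i32.mul": -1}

program = ["i32.const", "i32.const", "i32.add", "i32.const", "i32.mul"]

depth = 0
depths = []
for op in program:
    depth += deltas[op]
    depths.append(depth)

print(depths)  # [1, 2, 1, 2, 1]
```

The final value (here 1) is the current stack depth, which downstream layers consume as a quadratic key.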
### SwiGLU Gating
Each neuron has two weight vectors:
- Gate: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- Value: Reads the computation inputs (stack top, stack second, operand, bits).
```
output = max(0, gate · x) × (value · x)
```
Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
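A toy neuron makes the gating behavior concrete. This is an illustrative sketch; the three-element input layout `[FETCH_ADD flag, stack top, stack second]` is an assumption for the example, not the model's actual feature ordering:

```python
# One SwiGLU neuron: ReLU-gated linear value (sketch).
def neuron(x, gate_w, value_w):
    gate = max(0.0, sum(g * xi for g, xi in zip(gate_w, x)))   # ReLU gate
    value = sum(v * xi for v, xi in zip(value_w, x))           # linear value
    return gate * value

# x = [FETCH_ADD flag, stack top, stack second]  (assumed layout)
gate_w = [1.0, 0.0, 0.0]    # fires only when the add opcode is active
value_w = [0.0, 1.0, 1.0]   # computes top + second

add_active = neuron([1.0, 5.0, 3.0], gate_w, value_w)  # -> 8.0
add_silent = neuron([0.0, 5.0, 3.0], gate_w, value_w)  # -> 0.0
```

With the wrong opcode the gate clamps to zero and the neuron is silent, regardless of the operand values.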
## Key Tricks
- Multiplication via gating: For `i32.mul`, the gate equals one operand while the value holds the other: `max(0, TOP) × SECOND = TOP × SECOND`.
- Comparisons via ReLU pairs: Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b`.
- Quadratic unembedding: `logit(t) = 2t·R - t²` is a downward parabola peaking at `t = RESULT`.
- Quadratic key trick: `K = (2j, -j²)`, `Q = (i, 1)` → dot product peaks at `j = i` for exact position matching.
- Sum-attention for depth: Instead of precomputing stack depth in PE, one head sums all past stack deltas. The result feeds into later layers' quadratic keys.
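The quadratic and ReLU tricks above can be verified numerically. The specific values (`R = 42`, `i = 5`, the 256-token range) are illustrative choices, not taken from the model:

```python
import numpy as np

# Quadratic unembedding: logit(t) = 2*t*R - t^2 is a downward parabola
# peaking at t == R, so argmax over byte tokens recovers the result.
R = 42
t = np.arange(256)
best_token = int(np.argmax(2 * t * R - t * t))  # == 42

# Quadratic key trick: K_j = (2j, -j^2), Q_i = (i, 1) gives the score
# 2*i*j - j^2, which peaks at j == i for exact position matching.
i = 5
j = np.arange(16)
best_pos = int(np.argmax(2 * i * j - j * j))  # == 5

# ReLU pair: relu(a-b) - relu(a-b-1) is 1 iff a > b (for integers),
# the step function used for comparisons.
def gt(a, b):
    return max(0, a - b) - max(0, a - b - 1)

results = (gt(4, 3), gt(3, 3), gt(2, 3))  # (1, 0, 0)
```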
## Supported WASM Operations
```
i32.const, i32.add, i32.sub, i32.mul, i32.eq, i32.ne, i32.lt_s, i32.gt_s,
i32.le_s, i32.ge_s, i32.and, i32.or, i32.load, i32.store,
local.get, local.set, local.tee, output, halt
```
## Positional Encoding Note
This model uses program-specific positional encodings computed by a compile-time analysis pass, but stack depth (`WRITE_DEPTH`, `WRITE_DEPTH_SQ`, `BEFORE_DEPTH`) is not part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.
The remaining PE contains: instruction indices, local variable source locations, and memory address mappings — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
## How to Use
This model uses a custom architecture. To run it, use the reference implementation:
```python
# Load with safetensors
from safetensors.torch import load_file
import json

weights = load_file("model.safetensors")
config = json.load(open("config.json"))

# The model requires a custom forward-pass implementation
# with both hard-max and sum-mode attention.
# See config.json's "sum_attention_heads" for which heads use sum mode.
# See the reference TypeScript implementation for the complete specification.
```
A complete TypeScript reference implementation is available in the source repository.
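The two attention modes the custom forward pass needs can be sketched as a single causal head. This is a minimal NumPy sketch under assumed shapes, not the reference implementation:

```python
import numpy as np

def attention_head(Q, K, V, sum_mode=False):
    """One causal head (sketch). Hard-max selects the single
    best-matching past value; sum mode accumulates all past values."""
    T = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(T):
        if sum_mode:
            out[t] = V[: t + 1].sum(axis=0)        # cumulative sum
        else:
            scores = K[: t + 1] @ Q[t]             # causal key matching
            out[t] = V[int(np.argmax(scores))]     # winner-take-all
    return out

# Sum mode over stack deltas yields the running stack depth:
deltas = np.array([[+1.0], [+1.0], [-1.0]])        # const, const, add
depth = attention_head(np.zeros((3, 1)), np.zeros((3, 1)), deltas,
                       sum_mode=True)
# depth[:, 0] == [1.0, 2.0, 1.0]
```

In the real model, per-head mode selection would come from the `sum_attention_heads` list in `config.json`.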
## Live Demos
- Interactive WASM REPL — type WASM instructions line-by-line and watch the transformer execute them in real time: WASM Transformer REPL
- Interactive Article Explorer — explore the concepts behind this model: Can LLMs Be Computers?
- FFN Interpreter Slide Deck — 15-slide visual explanation of how the FFN interprets bytecode: Slide Deck
## Inspiration
This model is inspired by "Can LLMs Be Computers?" by Percepta AI.
## License
MIT