# WASM Interpreter Transformer
A hand-compiled transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — not by gradient descent. No training data, no loss function, no optimizer. Just linear algebra.
## What This Is
This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.
The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. Stack depth is computed internally by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.
50/50 test programs pass with 100% accuracy.
## Architecture
| Parameter | Value |
|---|---|
| `d_model` | 86 |
| `n_heads` | 12 |
| `d_head` | 2 |
| `n_layers` | 7 |
| `d_ffn` | 86 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU (ReLU gate) |
| Attention | Hard-max (argmax) + sum-mode for depth tracking |
| Total parameters | ~600K (all hand-compiled) |
## How It Works

### The 7-Layer Pipeline
- Layer 0: Opcode fetch — 10 paired attention heads identify 19 opcodes by matching one-hot flags, plus 1 head for operand retrieval
- Layer 1: Stack depth accumulation — 1 sum-attention head computes cumulative stack depth as a running sum of push/pop deltas. This replaces precomputed depth values in the positional encoding.
- Layer 2: Depth squaring — FFN-only layer computes `WRITE_DEPTH²` for use as a quadratic key in later retrieval heads
- Layer 3: Bit retrieval — 8 hard-max heads extract individual bits from historical stack values for AND/OR operations
- Layer 4: Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all 17 operations (52 neurons)
- Layer 5: Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
- Layer 6: Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
### Sum-Attention (Cumulative Sums)
Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the sum-mode head accumulates all past value vectors.
Each instruction's value encodes its stack delta: +1 for pushes (e.g., i32.const), -1 for pops (e.g., i32.add). The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.
This is the same "cumulative sums" primitive described in the article: a single attention head can track running totals by summing rather than selecting.
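As a concrete illustration, the running sum can be traced by hand. The per-instruction deltas below are inferred from the instruction semantics (push = +1, pop-two-push-one = -1), not read from the model's weights:

```python
# Stack deltas per instruction: a push adds 1, a binary op nets -1.
# The sum-mode head's output at step t is exactly this prefix sum.
deltas = {"i32.const": +1, "local.get": +1, "i32.add": -1, "i32.mul": -1}

program = ["i32.const", "i32.const", "i32.add", "i32.const", "i32.mul"]

depth = 0
depths = []
for op in program:
    depth += deltas[op]
    depths.append(depth)

print(depths)  # [1, 2, 1, 2, 1]
```

The final value (here 1) is the current stack depth, which downstream layers consume as a quadratic key.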
### SwiGLU Gating
Each neuron has two weight vectors:
- Gate: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- Value: Reads the computation inputs (stack top, stack second, operand, bits).
```
output = max(0, gate · x) × (value · x)
```
Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
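A toy neuron makes the gating behavior concrete. This is an illustrative sketch; the three-element input layout `[FETCH_ADD flag, stack top, stack second]` is an assumption for the example, not the model's actual feature ordering:

```python
# One SwiGLU neuron: ReLU-gated linear value (sketch).
def neuron(x, gate_w, value_w):
    gate = max(0.0, sum(g * xi for g, xi in zip(gate_w, x)))   # ReLU gate
    value = sum(v * xi for v, xi in zip(value_w, x))           # linear value
    return gate * value

# x = [FETCH_ADD flag, stack top, stack second]  (assumed layout)
gate_w = [1.0, 0.0, 0.0]    # fires only when the add opcode is active
value_w = [0.0, 1.0, 1.0]   # computes top + second

add_active = neuron([1.0, 5.0, 3.0], gate_w, value_w)  # -> 8.0
add_silent = neuron([0.0, 5.0, 3.0], gate_w, value_w)  # -> 0.0
```

With the wrong opcode the gate clamps to zero and the neuron is silent, regardless of the operand values.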
## Key Tricks
- Multiplication via gating: For `i32.mul`, the gate equals one operand while the value holds the other: `max(0, TOP) × SECOND = TOP × SECOND`.
- Comparisons via ReLU pairs: Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b`.
- Quadratic unembedding: `logit(t) = 2t·R - t²` is a downward parabola peaking at `t = RESULT`.
- Quadratic key trick: `K = (2j, -j²)`, `Q = (i, 1)` → dot product peaks at `j = i` for exact position matching.
- Sum-attention for depth: Instead of precomputing stack depth in PE, one head sums all past stack deltas. The result feeds into later layers' quadratic keys.
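The quadratic and ReLU tricks above can be verified numerically. The specific values (`R = 42`, `i = 5`, the 256-token range) are illustrative choices, not taken from the model:

```python
import numpy as np

# Quadratic unembedding: logit(t) = 2*t*R - t^2 is a downward parabola
# peaking at t == R, so argmax over byte tokens recovers the result.
R = 42
t = np.arange(256)
best_token = int(np.argmax(2 * t * R - t * t))  # == 42

# Quadratic key trick: K_j = (2j, -j^2), Q_i = (i, 1) gives the score
# 2*i*j - j^2, which peaks at j == i for exact position matching.
i = 5
j = np.arange(16)
best_pos = int(np.argmax(2 * i * j - j * j))  # == 5

# ReLU pair: relu(a-b) - relu(a-b-1) is 1 iff a > b (for integers),
# the step function used for comparisons.
def gt(a, b):
    return max(0, a - b) - max(0, a - b - 1)

results = (gt(4, 3), gt(3, 3), gt(2, 3))  # (1, 0, 0)
```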
## Supported WASM Operations
```
i32.const, i32.add, i32.sub, i32.mul, i32.eq, i32.ne, i32.lt_s, i32.gt_s,
i32.le_s, i32.ge_s, i32.and, i32.or, i32.load, i32.store,
local.get, local.set, local.tee, output, halt
```
## Positional Encoding Note
This model uses program-specific positional encodings computed by a compile-time analysis pass, but stack depth (`WRITE_DEPTH`, `WRITE_DEPTH_SQ`, `BEFORE_DEPTH`) is not part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.
The remaining PE contains: instruction indices, local variable source locations, and memory address mappings — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
## How to Use
This model uses a custom architecture. To run it, use the reference implementation:
```python
# Load with safetensors
from safetensors.torch import load_file
import json

weights = load_file("model.safetensors")
config = json.load(open("config.json"))

# The model requires a custom forward-pass implementation
# with both hard-max and sum-mode attention.
# See config.json's "sum_attention_heads" for which heads use sum mode.
# See the reference TypeScript implementation for the complete specification.
```
A complete TypeScript reference implementation is available in the source repository.
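The two attention modes the custom forward pass needs can be sketched as a single causal head. This is a minimal NumPy sketch under assumed shapes, not the reference implementation:

```python
import numpy as np

def attention_head(Q, K, V, sum_mode=False):
    """One causal head (sketch). Hard-max selects the single
    best-matching past value; sum mode accumulates all past values."""
    T = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(T):
        if sum_mode:
            out[t] = V[: t + 1].sum(axis=0)        # cumulative sum
        else:
            scores = K[: t + 1] @ Q[t]             # causal key matching
            out[t] = V[int(np.argmax(scores))]     # winner-take-all
    return out

# Sum mode over stack deltas yields the running stack depth:
deltas = np.array([[+1.0], [+1.0], [-1.0]])        # const, const, add
depth = attention_head(np.zeros((3, 1)), np.zeros((3, 1)), deltas,
                       sum_mode=True)
# depth[:, 0] == [1.0, 2.0, 1.0]
```

In the real model, per-head mode selection would come from the `sum_attention_heads` list in `config.json`.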
## Live Demos
- Interactive WASM REPL — type WASM instructions line-by-line and watch the transformer execute them in real time: WASM Transformer REPL
- Interactive Article Explorer — explore the concepts behind this model: Can LLMs Be Computers?
- FFN Interpreter Slide Deck — 15-slide visual explanation of how the FFN interprets bytecode: Slide Deck
## Inspiration
This model is inspired by "Can LLMs Be Computers?" by Percepta AI.
## License
MIT