---
license: mit
tags:
- wasm
- interpreter
- hand-compiled
- bytecode-execution
- mechanistic
- not-trained
- sum-attention
- cross-attention
- filesystem
- loops
- control-flow
language:
- en
library_name: transformers
pipeline_tag: text-generation
---
# WASM Interpreter Transformer
A **hand-compiled** transformer that executes WebAssembly bytecode via real forward passes.
Every weight was set by a compiler — **not by gradient descent**. No training data, no loss function, no optimizer. Just linear algebra.
## What This Is
This is a complete WASM bytecode interpreter implemented as a transformer neural network.
Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time —
using real matrix multiplications, real attention, and real feed-forward network computations.
The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating.
The attention heads retrieve operands and stack values through quadratic key matching.
**Stack depth is computed internally** by a cumulative-sum attention head — no precomputed depth values are needed.
The unembedding converts numeric results to token predictions via a quadratic scoring trick.
The transformer supports **filesystem I/O** (open, read, write, and close across 4 file descriptors)
and **structured loops** with `br_if` conditional branching, executing loops of up to 256 iterations
via a Continuous Trace with Cycling Positional Encoding mechanism.
**115/115 test programs pass with 100% accuracy.**
## Architecture
| Parameter | Value |
|---|---|
| `d_model` | 100 |
| `n_layers` | 8 |
| `heads_per_layer` | [13, 1, 0, 8, 2, 1, 1, 4] |
| `total_heads` | 30 |
| `d_head` | 2 |
| `d_ffn` | 100 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU-style gating with a ReLU gate |
| Attention | Hard-max + sum-mode + cross-attention |
| Total parameters | ~316K (all hand-compiled) |
Unlike standard transformers, each layer has a **different number of attention heads** (0 to 13),
tailored to the specific computational role of that layer.
## How It Works
### The 8-Layer Pipeline
- **Layer 0** (13 heads): Opcode fetch — 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand retrieval head and 1 single-opcode head
- **Layer 1** (1 head): Stack depth accumulation — 1 **sum-attention** head computes cumulative stack depth as a running sum of push/pop deltas
- **Layer 2** (0 heads): Depth squaring — FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads
- **Layer 3** (8 heads): Bit retrieval — 8 hard-max heads extract individual bits from stack top and stack second for AND/OR operations
- **Layer 4** (2 heads): Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all arithmetic, comparison, and bitwise operations
- **Layer 5** (1 head): Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
- **Layer 6** (1 head): Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
- **Layer 7** (4 heads): Filesystem — 4 **cross-attention** heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset
### Sum-Attention (Cumulative Sums)
Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the **sum-mode** head accumulates **all** past value vectors.
Each instruction's value encodes its net stack delta: `+1` for pushes (e.g., `i32.const`), `-1` for binary operations that pop two values and push one (e.g., `i32.add`), and `-2` for `fd_write`. The cumulative sum of all past deltas gives the current stack depth, which is exactly what the quadratic key trick needs to find the correct stack position.
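The running-sum idea can be sketched in a few lines of plain Python. This is only an illustration, not the model's code: `sum_attention` and `stack_depths` are hypothetical names, the values are scalars rather than vectors, and including the current position in the sum is an assumption.

```python
# Stack deltas as described above; opcode names are plain strings
# purely for illustration.
STACK_DELTA = {"i32.const": +1, "i32.add": -1, "fd_write": -2}


def sum_attention(values):
    """Causal sum-mode attention over scalars: every visible position gets weight 1."""
    return [sum(values[: i + 1]) for i in range(len(values))]


def stack_depths(program):
    """Stack depth after each instruction, as a cumulative sum of deltas."""
    return sum_attention([STACK_DELTA[op] for op in program])


depths = stack_depths(["i32.const", "i32.const", "i32.add", "i32.const", "i32.add"])
print(depths)  # [1, 2, 1, 2, 1]
```

Unlike softmax or hard-max attention, nothing here is normalized or selected: the output is simply the sum of everything visible so far.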
### Cross-Attention (Filesystem)
Layer 7 uses **cross-attention** heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as `fd_write` operations modify the filesystem.
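A minimal sketch of such a lookup, assuming each head uses the same quadratic key matching that this README describes for other retrieval heads. The `filesystem` dictionary and `hard_max_read` are hypothetical stand-ins for the external key-value store, not the model's actual data structures.

```python
def hard_max_read(file_bytes, offset):
    """Return the byte whose position key best matches the offset query."""
    # Key for stored position j is (2j, -j^2); the query is (offset, 1).
    # Their dot product 2*offset*j - j^2 is largest exactly at j == offset.
    scores = [2 * offset * j - j * j for j in range(len(file_bytes))]
    best = max(range(len(scores)), key=scores.__getitem__)
    return file_bytes[best]


# Hypothetical store: 4 file descriptors with 32 bytes each.
filesystem = {fd: bytearray(32) for fd in range(4)}
filesystem[1][0:4] = b"WASM"  # simulate an earlier fd_write to descriptor 1

print(hard_max_read(filesystem[1], 2))  # 83, the byte value of "S"
```

Because the store sits outside the token sequence, writing to `filesystem` changes what later reads retrieve without touching any earlier tokens.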
### SwiGLU Gating
Each neuron has two weight vectors:
- **Gate**: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- **Value**: Reads the computation inputs (stack top, stack second, operand, bits).
```
output = max(0, gate · x) × (value · x)
```
Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
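As a rough illustration of the formula above, the sketch below evaluates two gated neurons on a toy input. The 4-slot layout and the `FETCH_SUB` flag name are assumptions made for the example; only `FETCH_ADD` appears in this README.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def gated_neuron(gate_w, value_w, x):
    """output = max(0, gate . x) * (value . x)"""
    return max(0, dot(gate_w, x)) * dot(value_w, x)


# Hypothetical 4-slot residual layout: [FETCH_ADD, FETCH_SUB, TOP, SECOND]
x = [1, 0, 5, 3]  # i32.add is active, stack top = 5, stack second = 3

add_neuron = gated_neuron([1, 0, 0, 0], [0, 0, 1, 1], x)   # value: SECOND + TOP
sub_neuron = gated_neuron([0, 1, 0, 0], [0, 0, -1, 1], x)  # value: SECOND - TOP
print(add_neuron, sub_neuron)  # 8 0
```

The subtraction neuron still computes its value, but its gate is zero, so it contributes nothing to the layer output.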
### Loop Execution
Loops use a **Continuous Trace with Cycling Positional Encoding** mechanism:
- `loop` and `end_loop` are structural markers (no-ops in the execution trace)
- `br_if` pops a condition from the stack; if non-zero, execution branches back to the loop body start
- Positional encodings cycle using `virtualIP % loopLength` so the transformer sees correct instruction indices across iterations
- Maximum 256 iterations per loop; nested loops are supported
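The cycling step can be pictured as a mapping from an ever-increasing virtual instruction pointer back to a static instruction index. The helper below is a hypothetical sketch for a single non-nested loop; the `loop_start` offset is an assumption added so the loop body does not have to begin at index 0.

```python
def instruction_index(virtual_ip, loop_start, loop_length):
    """Map a monotonically increasing virtual IP to a static instruction index.

    Assumes one non-nested loop whose body occupies indices
    loop_start .. loop_start + loop_length - 1.
    """
    if virtual_ip < loop_start:
        return virtual_ip  # before the loop, positions are unchanged
    return loop_start + (virtual_ip - loop_start) % loop_length


trace = [instruction_index(ip, loop_start=2, loop_length=3) for ip in range(9)]
print(trace)  # [0, 1, 2, 3, 4, 2, 3, 4, 2]
```

With `loop_start = 0` this reduces to the `virtualIP % loopLength` expression used above.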
### Key Tricks
- **Multiplication via gating**: For `i32.mul`, the gate equals one operand while the value holds the other. For non-negative `TOP`, `max(0, TOP) × SECOND = TOP × SECOND`.
- **Comparisons via ReLU pairs**: Two neurons with gates `(a-b)` and `(a-b-1)` combine as `max(0, a-b) - max(0, a-b-1)`, a step function that equals 1 when `a > b` and 0 otherwise for integer inputs.
- **Quadratic unembedding**: `logit(t) = 2t·R - t²`, where `R` is the result value, is a downward parabola peaking at `t = R`.
- **Quadratic key trick**: `K = (2j, -j²)`, `Q = (i, 1)` → the dot product `2ij - j²` peaks at `j = i` for exact position matching.
- **Sum-attention for depth**: Instead of precomputing stack depth in PE, one head sums all past stack deltas.
- **Dynamic filesystem cursors**: File read/write offsets are tracked inline during execution — no reference VM pre-run needed.
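Three of the tricks in the list above are easy to check numerically. The functions below are hypothetical scalar sketches written for this README, not extracts from the compiled weights, and they assume integer inputs.

```python
def relu(x):
    return max(0, x)


def gated_mul(top, second):
    # Gate carries one operand, value carries the other (valid for top >= 0).
    return relu(top) * second


def greater_than(a, b):
    # Difference of two ReLU gates: 1 if a > b else 0, for integer inputs.
    return relu(a - b) - relu(a - b - 1)


def unembed(result, vocab_size=256):
    # logit(t) = 2*t*result - t^2 = result^2 - (t - result)^2, largest at t == result.
    return max(range(vocab_size), key=lambda t: 2 * t * result - t * t)


print(gated_mul(6, 7))  # 42
print(greater_than(5, 3), greater_than(3, 5), greater_than(4, 4))  # 1 0 0
print(unembed(8))  # 8
```

Completing the square shows why the quadratic scores work: `2tR - t² = R² - (t - R)²`, so every candidate other than `t = R` scores strictly lower.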
## Supported WASM Operations
### Arithmetic & Logic
`i32.const`, `i32.add`, `i32.sub`, `i32.mul`,
`i32.and`, `i32.or`
### Comparisons
`i32.eq`, `i32.ne`, `i32.lt_s`, `i32.gt_s`, `i32.le_s`, `i32.ge_s`
### Memory & Variables
`i32.load`, `i32.store`, `local.get`, `local.set`, `local.tee`
### Filesystem I/O
`fd_open`, `fd_read`, `fd_write`, `fd_close`
(4 file descriptors, 32 bytes per file)
### Control Flow
`loop`, `end_loop`, `br_if` (up to 256 iterations, nested loops supported)
### Output & Termination
`output`, `halt`
## Compliance Test Suite
115 tests across 24 categories, all passing at 100%:
| Category | Tests |
|---|---|
| Core arithmetic & logic | 24 |
| Comparisons | 16 |
| Memory & variables | 14 |
| Filesystem | 8 |
| Filesystem integration | 3 |
| Limits & bounds | 15 |
| Basic loops | 7 |
| Loop + arithmetic | 7 |
| Loop + locals/memory | 4 |
| Loop + filesystem | 6 |
| Loop edge cases | 4 |
| Combined & output | 4 |
## Positional Encoding Note
This model uses **program-specific positional encodings** computed by a compile-time analysis pass,
but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is **not** part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.
The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn.
No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
For loops, positional encodings use **virtual IP cycling** (`instIdx % loopBodyLength`) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.
## How to Use
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("eastlondoner/wasm-transformer", trust_remote_code=True)


# Define a simple instruction helper (op, operand)
class WI:
    def __init__(self, op, operand=0):
        self.op = op
        self.operand = operand


# Run a simple program: push 3, push 5, add, output, halt
program = [
    WI(0x00, 3),  # i32.const 3
    WI(0x00, 5),  # i32.const 5
    WI(0x01),     # i32.add
    WI(0xF0),     # output
    WI(0xFF),     # halt
]
outputs = model.run_program(program)
print(outputs)  # [8]
```
The model uses a custom architecture with hard-max, sum-mode, and cross-attention.
A complete TypeScript reference implementation is available in the source repository.
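For a quick sanity check without downloading the model, the example program can be cross-checked against a tiny conventional interpreter. This is a hypothetical sketch covering only the four opcodes used above: the opcode values come from the example, while the tuple encoding, the `reference_run` name, and the choice that `output` pops its operand are assumptions.

```python
OP_CONST, OP_ADD, OP_OUTPUT, OP_HALT = 0x00, 0x01, 0xF0, 0xFF


def reference_run(program):
    """Execute (op, operand) tuples on an ordinary Python stack."""
    stack, outputs = [], []
    for op, operand in program:
        if op == OP_CONST:
            stack.append(operand)
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            # Standard WASM i32 wraparound; the model's own numeric range may differ.
            stack.append((a + b) & 0xFFFFFFFF)
        elif op == OP_OUTPUT:
            outputs.append(stack.pop())  # assumption: output consumes the value
        elif op == OP_HALT:
            break
        else:
            raise ValueError(f"unsupported opcode: {op:#04x}")
    return outputs


program = [(0x00, 3), (0x00, 5), (0x01, 0), (0xF0, 0), (0xFF, 0)]
print(reference_run(program))  # [8]
```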
## Live Demos
- **Interactive WASM REPL** — type WASM instructions line-by-line and watch the transformer execute them in real time
- **Transformer X-Ray** — step through execution and see every layer, head, and neuron activate
- **Interactive Article Explorer** — explore the concepts behind this model
- **FFN Interpreter Slide Deck** — 15-slide visual explanation of how the FFN interprets bytecode
## Inspiration
This model is inspired by ["Can LLMs Be Computers?"](https://www.percepta.ai/blog/can-llms-be-computers) by Percepta AI.
## License
MIT