wasm-transformer / README.md

eastlondoner

	---
	license: mit
	tags:
	- wasm
	- interpreter
	- hand-compiled
	- bytecode-execution
	- mechanistic
	- not-trained
	- sum-attention
	- cross-attention
	- filesystem
	- loops
	- control-flow
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	---

	# WASM Interpreter Transformer

	A hand-compiled transformer that executes WebAssembly bytecode via real forward passes.
	Every weight was set by a compiler — not by gradient descent. No training data, no loss function, no optimizer. Just linear algebra.

	## What This Is

	This is a complete WASM bytecode interpreter implemented as a transformer neural network.
	Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time —
	using real matrix multiplications, real attention, and real feed-forward network computations.

	The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU-style gating (with a ReLU gate).
	The attention heads retrieve operands and stack values through quadratic key matching.
	Stack depth is computed internally by a cumulative-sum attention head — no precomputed depth values are needed.
	The unembedding converts numeric results to token predictions via a quadratic scoring trick.

	The transformer supports filesystem I/O (open, read, write, close across 4 file descriptors)
	and structured loops with `br_if` conditional branching, executing loops of up to 256 iterations
	via a Continuous Trace with Cycling Positional Encoding mechanism.

	All 115 test programs pass.

	## Architecture

	| Parameter | Value |
	|---|---|
	| `d_model` | 100 |
	| `n_layers` | 8 |
	| `heads_per_layer` | [13, 1, 0, 8, 2, 1, 1, 4] |
	| `total_heads` | 30 |
	| `d_head` | 2 |
	| `d_ffn` | 100 |
	| `vocab_size` | 260 (256 byte tokens + 4 special) |
	| FFN activation | SwiGLU (ReLU gate) |
	| Attention | Hard-max + sum-mode + cross-attention |
	| Total parameters | ~316K (all hand-compiled) |

	Unlike standard transformers, each layer has a different number of attention heads (0 to 13),
	tailored to the specific computational role of that layer.
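
As a sanity check on the table above, the parameter total can be roughly reproduced from the listed dimensions. This is only a back-of-the-envelope sketch: it assumes bias-free projections, four projection matrices (Q, K, V, O) per head, three FFN matrices (gate, value, output) per layer, and separate embedding and unembedding matrices, none of which the README states explicitly.

```python
# Hypothetical parameter tally; the bias-free, untied-embedding layout
# is an assumption, not something confirmed by the model card.
d_model, d_head, d_ffn, vocab_size = 100, 2, 100, 260
heads_per_layer = [13, 1, 0, 8, 2, 1, 1, 4]

total_heads = sum(heads_per_layer)
attn_params = total_heads * 4 * d_model * d_head          # Q, K, V, O per head
ffn_params = len(heads_per_layer) * 3 * d_model * d_ffn   # gate, value, output per layer
embed_params = 2 * vocab_size * d_model                   # embedding + unembedding
total_params = attn_params + ffn_params + embed_params

print(total_heads, total_params)  # 30 316000
```

Under these assumptions the tally lands on 316,000, in line with the ~316K figure in the table.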

	## How It Works

	### The 8-Layer Pipeline

	- Layer 0 (13 heads): Opcode fetch — 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand retrieval head and 1 single-opcode head
	- Layer 1 (1 head): Stack depth accumulation — 1 sum-attention head computes cumulative stack depth as a running sum of push/pop deltas
	- Layer 2 (0 heads): Depth squaring — FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads
	- Layer 3 (8 heads): Bit retrieval — 8 hard-max heads extract individual bits from stack top and stack second for AND/OR operations
	- Layer 4 (2 heads): Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all arithmetic, comparison, and bitwise operations
	- Layer 5 (1 head): Local variables — 1 head finds the matching `local.tee`/`local.set`, FFN gates the retrieved value
	- Layer 6 (1 head): Linear memory — 1 head finds the matching `i32.store` by address, FFN gates the retrieved value
	- Layer 7 (4 heads): Filesystem — 4 cross-attention heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset

	### Sum-Attention (Cumulative Sums)

	Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the sum-mode head accumulates all past value vectors.

	Each instruction's value encodes its stack delta: `+1` for pushes (e.g., `i32.const`), `-1` for pops (e.g., `i32.add`), `-2` for `fd_write`. The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.
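
The running-sum idea can be illustrated outside the transformer with plain Python. This is a scalar sketch, not the model's code: the `STACK_DELTA` table and `depth_after_each` helper are hypothetical names, only the three deltas quoted above are included, and the sum is assumed to include the current instruction.

```python
from itertools import accumulate

# Stack deltas quoted in the text; the dict is a hypothetical stand-in
# for the per-instruction value vectors read by the sum-mode head.
STACK_DELTA = {"i32.const": +1, "i32.add": -1, "fd_write": -2}

def depth_after_each(program):
    """Running sum of stack deltas, the scalar analogue of a sum-mode head."""
    return list(accumulate(STACK_DELTA[op] for op in program))

program = ["i32.const", "i32.const", "i32.add",
           "i32.const", "i32.const", "fd_write"]
print(depth_after_each(program))  # [1, 2, 1, 2, 3, 1]
```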

	### Cross-Attention (Filesystem)

	Layer 7 uses cross-attention heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as `fd_write` operations modify the filesystem.
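
A minimal sketch of an offset-keyed lookup over external file contents follows. It assumes the filesystem heads use the same quadratic key construction listed under Key Tricks (`K = (2j, -j²)`, `Q = (i, 1)`) together with hard-max selection; the README does not confirm this pairing, and `hardmax_lookup` and `file_store` are hypothetical names.

```python
def hardmax_lookup(query_offset, file_bytes):
    """Return the byte whose key scores highest against the query (hard-max)."""
    q = (query_offset, 1)
    best_score, best_value = None, None
    for j, byte in enumerate(file_bytes):
        k = (2 * j, -j * j)                  # quadratic key for file offset j
        score = q[0] * k[0] + q[1] * k[1]    # 2*i*j - j^2, largest when j == i
        if best_score is None or score > best_score:
            best_score, best_value = score, byte
    return best_value

# Hypothetical external store: file descriptor -> contents.
file_store = {0: b"hello", 1: b"wasm!"}
print(hardmax_lookup(1, file_store[0]))  # 101, the byte for 'e'
print(hardmax_lookup(4, file_store[1]))  # 33, the byte for '!'
```

Because the keys come from the file store rather than from earlier tokens, updating `file_store` between steps changes what later lookups return, which mirrors how `fd_write` is described as modifying the filesystem during execution.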

	### SwiGLU Gating

	Each neuron has two weight vectors:
	- Gate: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
	- Value: Reads the computation inputs (stack top, stack second, operand, bits).

	```
	output = max(0, gate · x) × (value · x)
	```

	Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
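
The gating formula can be tried on a toy input. In this sketch the three-slot layout `[FETCH_ADD flag, stack top, stack second]` is a hypothetical simplification of the real 100-dimensional residual stream, and `gated_neuron` is an illustrative helper rather than the model's code.

```python
def gated_neuron(gate_w, value_w, x):
    """output = max(0, gate . x) * (value . x)"""
    gate = max(0.0, sum(g * xi for g, xi in zip(gate_w, x)))
    value = sum(v * xi for v, xi in zip(value_w, x))
    return gate * value

# Hypothetical layout: [FETCH_ADD flag, stack top, stack second]
add_gate = [1.0, 0.0, 0.0]    # fires only when the ADD flag is set
add_value = [0.0, 1.0, 1.0]   # reads top + second

print(gated_neuron(add_gate, add_value, [1.0, 3.0, 5.0]))  # 8.0 (ADD active)
print(gated_neuron(add_gate, add_value, [0.0, 3.0, 5.0]))  # 0.0 (silenced)
```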

	### Loop Execution

	Loops use a Continuous Trace with Cycling Positional Encoding mechanism:
	- `loop` and `end_loop` are structural markers (no-ops in the execution trace)
	- `br_if` pops a condition from the stack; if non-zero, execution branches back to the loop body start
	- Positional encodings cycle using `virtualIP % loopLength` so the transformer sees correct instruction indices across iterations
	- Maximum 256 iterations per loop; nested loops are supported
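
The cycling rule in the list above can be sketched as a simple modulo over an ever-increasing virtual instruction pointer. This assumes the virtual IP is counted from the first instruction of the loop body; `cycled_indices` and `MAX_ITERATIONS` are hypothetical names, and nested loops are not modelled.

```python
MAX_ITERATIONS = 256  # per-loop limit quoted in the text

def cycled_indices(loop_length, iterations):
    """Instruction index seen at each step of an unrolled loop trace."""
    if iterations > MAX_ITERATIONS:
        raise ValueError("loop exceeds the 256-iteration limit")
    steps = loop_length * iterations
    return [virtual_ip % loop_length for virtual_ip in range(steps)]

print(cycled_indices(3, 2))  # [0, 1, 2, 0, 1, 2]
```

The trace keeps growing with every iteration, but the index fed to the positional encoding repeats, so each pass through the body looks structurally identical to the first.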

	### Key Tricks

	- Multiplication via gating: For `i32.mul`, the gate equals one operand while the value holds the other, so `max(0, TOP) × SECOND = TOP × SECOND` whenever `TOP ≥ 0`.
	- Comparisons via ReLU pairs: Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b` for integer inputs: subtracting the second from the first gives 1 when `a > b` and 0 otherwise.
	- Quadratic unembedding: `logit(t) = 2t·R - t²` is a downward parabola in `t` that peaks at `t = R`, where `R` is the numeric result.
	- Quadratic key trick: `K = (2j, -j²)`, `Q = (i, 1)` → the dot product `2ij - j² = i² - (i-j)²` peaks at `j = i` for exact position matching.
	- Sum-attention for depth: Instead of precomputing stack depth in PE, one head sums all past stack deltas.
	- Dynamic filesystem cursors: File read/write offsets are tracked inline during execution — no reference VM pre-run needed.
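
Two of the tricks in the list above are easy to check numerically. The sketch below works on plain integers rather than on the model's weights; `greater_than` assumes the two ReLU neurons are combined with weights +1 and -1, and `unembed` assumes a plain argmax over the 256 byte tokens. Both helper names are hypothetical.

```python
def relu(x):
    return max(0, x)

def greater_than(a, b):
    """Step function from two ReLU neurons: 1 if a > b else 0 (integers)."""
    return relu(a - b) - relu(a - b - 1)

def unembed(result, vocab=range(256)):
    """argmax over logit(t) = 2*t*R - t^2, which peaks at t == R."""
    return max(vocab, key=lambda t: 2 * t * result - t * t)

print(greater_than(7, 3), greater_than(3, 7), greater_than(5, 5))  # 1 0 0
print(unembed(8))  # 8
```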

	## Supported WASM Operations

	### Arithmetic & Logic
	`i32.const`, `i32.add`, `i32.sub`, `i32.mul`,
	`i32.and`, `i32.or`

	### Comparisons
	`i32.eq`, `i32.ne`, `i32.lt_s`, `i32.gt_s`, `i32.le_s`, `i32.ge_s`

	### Memory & Variables
	`i32.load`, `i32.store`, `local.get`, `local.set`, `local.tee`

	### Filesystem I/O
	`fd_open`, `fd_read`, `fd_write`, `fd_close`
	(4 file descriptors, 32 bytes per file)

	### Control Flow
	`loop`, `end_loop`, `br_if` (up to 256 iterations, nested loops supported)

	### Output & Termination
	`output`, `halt`

	## Compliance Test Suite

	115 tests across 24 categories, all passing at 100%:

	| Category | Tests |
	|---|---|
	| Core arithmetic & logic | 24 |
	| Comparisons | 16 |
	| Memory & variables | 14 |
	| Filesystem | 8 |
	| Filesystem integration | 3 |
	| Limits & bounds | 15 |
	| Basic loops | 7 |
	| Loop + arithmetic | 7 |
	| Loop + locals/memory | 4 |
	| Loop + filesystem | 6 |
	| Loop edge cases | 4 |
	| Combined & output | 4 |

	## Positional Encoding Note

	This model uses program-specific positional encodings computed by a compile-time analysis pass,
	but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is not part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.

	The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn.
	No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.

	For loops, positional encodings use virtual IP cycling (`instIdx % loopBodyLength`) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.

	## How to Use

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained("eastlondoner/wasm-transformer", trust_remote_code=True)

	# Define a simple instruction helper (op, operand)
	class WI:
	    def __init__(self, op, operand=0):
	        self.op = op
	        self.operand = operand

	# Run a simple program: push 3, push 5, add, output, halt
	program = [
	    WI(0x00, 3),  # i32.const 3
	    WI(0x00, 5),  # i32.const 5
	    WI(0x01),     # i32.add
	    WI(0xF0),     # output
	    WI(0xFF),     # halt
	]
	outputs = model.run_program(program)
	print(outputs)  # [8]
	```

	The model uses a custom architecture with hard-max, sum-mode, and cross-attention.
	A complete TypeScript reference implementation is available in the source repository.

	## Live Demos

	- Interactive WASM REPL — type WASM instructions line-by-line and watch the transformer execute them in real time
	- Transformer X-Ray — step through execution and see every layer, head, and neuron activate
	- Interactive Article Explorer — explore the concepts behind this model
	- FFN Interpreter Slide Deck — 15-slide visual explanation of how the FFN interprets bytecode

	## Inspiration

	This model is inspired by ["Can LLMs Be Computers?"](https://www.percepta.ai/blog/can-llms-be-computers) by Percepta AI.

	## License

	MIT