Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.
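For reference, a complete Package.swift might look like the sketch below ("MyApp" is a placeholder target name, not part of this repo):

// swift-tools-version: 6.0
// Minimal sketch; "MyApp" is a placeholder name.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v18), .macOS(.v15)],
    dependencies: [
        .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                .product(name: "CoreMLLLM", package: "CoreML-LLM"),
            ]
        ),
    ]
)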

Download + chat (one call, text + image + audio)

import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-multimodal-coreml")

// Text-only
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

// Image + text
let image: CGImage = // ... your image
let stream2 = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Describe this picture.")],
    image: image, maxTokens: 256)

// Audio + text (16 kHz mono PCM Float)
let pcm: [Float] = // ... your audio samples
let stream3 = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "What language is this?")],
    audio: pcm, maxTokens: 256)
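
The snippets above assume you already have a CGImage and 16 kHz mono Float PCM in hand. One way to produce them with standard Apple frameworks (ImageIO and AVFoundation, independent of CoreMLLLM) is sketched below; the 16 kHz resample goes through AVAudioConverter.

import AVFoundation
import CoreGraphics
import ImageIO

// CGImage from a file URL (any format ImageIO understands).
func loadCGImage(at url: URL) -> CGImage? {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else { return nil }
    return CGImageSourceCreateImageAtIndex(source, 0, nil)
}

// 16 kHz mono Float32 PCM from an audio file, resampled with AVAudioConverter.
func loadPCM16kMono(at url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000, channels: 1, interleaved: false)!
    let inBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                    frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: inBuffer)

    let converter = AVAudioConverter(from: file.processingFormat, to: outFormat)!
    let ratio = 16_000.0 / file.processingFormat.sampleRate
    let outCapacity = AVAudioFrameCount(Double(inBuffer.frameLength) * ratio) + 1
    let outBuffer = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: outCapacity)!

    var fed = false
    converter.convert(to: outBuffer, error: nil) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return inBuffer
    }
    let n = Int(outBuffer.frameLength)
    return Array(UnsafeBufferPointer(start: outBuffer.floatChannelData![0], count: n))
}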

Set the Xcode scheme environment variable LLM_VISION_FORCE_ANE=1 to route the vision encoder through the Apple Neural Engine (the encoder is built ANE-targeted and emits 256 tokens per image at the LM hidden dimension).
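
If editing the scheme is inconvenient (for example in a test harness), setting the same variable in-process before the model loads should have the same effect (assumption: the library reads it at load time):

import Darwin

// Must run before CoreMLLLM.load(...) so the vision-encoder path sees it.
setenv("LLM_VISION_FORCE_ANE", "1", 1)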

Gemma 4 E4B (multimodal) – Core ML (INT4, Apple Neural Engine)

Core ML port of google/gemma-4-E4B-it with vision (still image), video, and audio (Conformer) encoders. The decoder is split into sliding-window-attention chunks targeting the Apple Neural Engine; the vision encoder is ANE-targeted; audio runs on the GPU plus a small Swift/Accelerate projection sidecar.

Validated on iPhone 17 Pro (2026-05-03): text decode at 15.7 tok/s with correct outputs across all four input modalities (text / image / video / audio).

Built from john-rocky/CoreML-LLM; see docs/E4B_MULTIMODAL_BUILD.md for the full reproduction guide and scripts/assemble_gemma4_e4b_multimodal.sh for the assembly script.

Files

# Decode chunks (3-chunk Topology II β€” auto-detected by ChunkedEngine)
chunk1.mlmodelc/                # L0-11   - own KV
chunk2_3way.mlmodelc/           # L12-32  - merged 21 layers (own + KV-shared internal)
chunk3_3way.mlmodelc/           # L33-41 + lm_head + argmax

# Prefill chunks (legacy 4-chunk with prefill_b8 multifunction inside)
chunk2.mlmodelc/                # L12-22  prefill (own KV writes via recurrent shift)
chunk3.mlmodelc/                # L23-32  prefill (KV-shared)
chunk4.mlmodelc/                # L33-41  prefill + lm_head

# Vision encoder (ANE-targeted)
vision.ane.mlmodelc/            # SigLIP, output [1, 256, 2560]

# Audio encoder + Swift projection sidecars
audio.mlmodelc/                 # Conformer, output [1, 50, 1024]
audio_config.json
mel_filterbank.bin
output_proj_weight.npy          # 1024 -> 1536 (audio_soft_token_size)
output_proj_bias.npy
embed_proj_weight.npy           # 1536 -> 2560 (LM hidden) - E4B-specific shape

# Token / per-layer embeddings (mmap'd, dequantised on demand by Swift)
embed_tokens_q8.bin             640 MB  - INT8 token embeddings (262144 x 2560)
embed_tokens_scales.bin         512 KB
embed_tokens_per_layer_q8.bin   2.6 GB  - INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin 512 KB
per_layer_projection.bin        53 MB
per_layer_norm_weight.bin       512 B

# RoPE cos/sin tables (pre-baked, mmap'd)
cos_sliding.npy / sin_sliding.npy
cos_full.npy    / sin_full.npy

# Tokenizer + runtime config
hf_model/
  tokenizer.json, tokenizer_config.json, config.json, generation_config.json
model_config.json

Total bundle size: ~7.6 GB.

Engine path on iPhone (what runs where)

Stage                                     Compute              Files used
Token / PLE embed lookup                  Swift CPU (mmap)     embed_tokens*.bin, per_layer_*.bin
Decode (T=1)                              ANE                  chunk1 + chunk2_3way + chunk3_3way
Prefill (batched, T=8)                    ANE                  chunk1 + chunk2 + chunk3 + chunk4 (prefill_b8 multifunction)
Vision encoder                            ANE                  vision.ane.mlmodelc (with LLM_VISION_FORCE_ANE=1)
Audio encoder                             GPU                  audio.mlmodelc
Audio projection (1024 -> 1536 -> 2560)   Swift / Accelerate   output_proj_*.npy, embed_proj_weight.npy

The Swift runtime auto-detects Topology II by the presence of chunk2_3way + chunk3_3way and routes prefill through the legacy 4-chunk prefill_b8 multifunction (the engine's fillBatchMasksVisionAware keeps bidirectional within-image attention working at T=8 batches).
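
As an illustration of that detection rule (not the library's actual code), checking for the two 3-way chunk directories is enough to pick the topology:

import Foundation

// Illustrative only: infer the chunk topology from which compiled models
// exist in the downloaded bundle directory.
enum ChunkTopology { case legacy4Chunk, topologyII }

func detectTopology(in bundleDir: URL) -> ChunkTopology {
    let fm = FileManager.default
    let hasThreeWay =
        fm.fileExists(atPath: bundleDir.appendingPathComponent("chunk2_3way.mlmodelc").path) &&
        fm.fileExists(atPath: bundleDir.appendingPathComponent("chunk3_3way.mlmodelc").path)
    return hasThreeWay ? .topologyII : .legacy4Chunk
}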

Why so many sidecars (vs a single model.mlpackage)?

Gemma 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB for E4B). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and blow up phys_footprint. We mmap the raw INT8 + scale .bin files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
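
A rough sketch of that lookup path is below. The layout (row-major INT8 with one Float16 scale per row) is an assumption that happens to match the published file sizes (262144 rows x 2 bytes = 512 KB of scales), not a statement of the runtime's exact format.

import Foundation

// Illustrative sketch of the mmap'd INT8 embedding lookup.
struct QuantizedEmbeddingTable {
    let rows: Int, cols: Int
    let weights: Data   // rows * cols INT8 values
    let scales: Data    // rows Float16 values (assumed layout)

    init(weightsURL: URL, scalesURL: URL, rows: Int, cols: Int) throws {
        self.rows = rows; self.cols = cols
        // .mappedIfSafe keeps the 640 MB / 2.6 GB files out of the CPU heap.
        self.weights = try Data(contentsOf: weightsURL, options: .mappedIfSafe)
        self.scales  = try Data(contentsOf: scalesURL, options: .mappedIfSafe)
    }

    // Dequantize a single token row on demand.
    func row(_ index: Int) -> [Float] {
        let scale = scales.withUnsafeBytes { buf -> Float in
            Float(buf.bindMemory(to: Float16.self)[index])
        }
        let start = index * cols
        return weights.withUnsafeBytes { buf -> [Float] in
            let q = buf.bindMemory(to: Int8.self)
            return (0..<cols).map { Float(q[start + $0]) * scale }
        }
    }
}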

The .npy RoPE tables are pre-baked at conversion-time so Swift doesn't need to ship a cos/sin builder.

The audio Swift projection (output_proj_* / embed_proj_weight) lives outside the ANE because of a Core ML GPU runtime bug with RMSNorm(with_scale=False) that produces all-zero outputs. Sgemm in Accelerate is fast enough on CPU.
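
For reference, the two projection matmuls reduce to something like the sketch below (cblas_sgemm from Accelerate; the row-major [in x out] weight layout and a bias only on the first stage are assumptions):

import Accelerate

// Illustrative: y = x * W with row-major single-precision GEMM.
func matmul(_ x: [Float], _ w: [Float], rows: Int, inDim: Int, outDim: Int) -> [Float] {
    var y = [Float](repeating: 0, count: rows * outDim)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(rows), Int32(outDim), Int32(inDim),
                1.0, x, Int32(inDim),
                w, Int32(outDim),
                0.0, &y, Int32(outDim))
    return y
}

// [50 x 1024] Conformer rows -> [50 x 1536] soft tokens -> [50 x 2560] LM hidden.
func projectAudio(conformerOut: [Float],                        // rows x 1024
                  outputProjW: [Float], outputProjB: [Float],   // 1024 x 1536, 1536
                  embedProjW: [Float]) -> [Float] {             // 1536 x 2560
    let rows = conformerOut.count / 1024
    var soft = matmul(conformerOut, outputProjW, rows: rows, inDim: 1024, outDim: 1536)
    for r in 0..<rows {                                         // add bias per row
        for c in 0..<1536 { soft[r * 1536 + c] += outputProjB[c] }
    }
    return matmul(soft, embedProjW, rows: rows, inDim: 1536, outDim: 2560)
}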

Tokenizer

The Gemma 4 SentencePiece tokenizer ships in hf_model/. Three multimodal placeholder token IDs:

  • <|image|> = 258880 - image-pad span (256 per still image)
  • <|audio|> = 258881 - audio-pad span (~188 per 2 sec)
  • <|video|> = 258884 - video-pad span (64 per frame)

Vision encoder output rows replace <|image|>/<|video|> rows during prefill (and per-token at decode for tail spans). Audio output rows replace <|audio|>. per_layer_raw is forced to zero at multimodal positions; the chunks compute per_layer_combined entirely from the spliced hidden state.
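
Conceptually the splice is a row replacement keyed on those placeholder IDs; a simplified illustration (not the engine's actual code):

// Simplified illustration of prefill splicing: wherever the token is a
// multimodal placeholder, the embedding row is taken from the corresponding
// encoder output instead of the token-embedding table.
let imageID: Int32 = 258880, audioID: Int32 = 258881, videoID: Int32 = 258884

func splice(tokenIDs: [Int32],
            textRows: [[Float]],        // one 2560-dim row per token
            encoderRows: [[Float]]) -> [[Float]] {
    var rows = textRows
    var next = 0
    for (i, id) in tokenIDs.enumerated()
    where id == imageID || id == audioID || id == videoID {
        rows[i] = encoderRows[next]     // per_layer_raw is zeroed for this position
        next += 1
    }
    return rows
}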

License

This is a derivative work of google/gemma-4-E4B-it. Use is governed by the Gemma Terms of Use. Vision / audio extensions inherit the same license.
