# Qwen2.5-0.5B-Instruct-ANE-int8
A non-stateful Core ML build of Qwen/Qwen2.5-0.5B-Instruct, quantised to
int8 per-channel so that 99.5% of the ops route to the Apple Neural Engine
(verified with MLComputePlan on an iPhone 17 Pro running iOS 26.4.1).
Designed for batch perplexity scoring:

- Fixed input shape: `input_ids = [1, 128]` (int32)
- Output: `logits = [1, 128, 151936]` (fp16) — full per-position logits
- No MLState, no KV cache; a single forward pass per candidate text
Measured ~13 ms / forward on A19 Pro (iPhone 17 Pro) for the full 128-token batch.
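Since the model returns full per-position logits in one pass, perplexity for a candidate text falls out of a single prediction. The helper below is a minimal numpy sketch (not code from this repo) of that post-processing step; the `pad_id` default assumes Qwen2.5's `<|endoftext|>` token id 151643 as the padding token — adjust if your tokenizer config differs.

```python
import numpy as np

def perplexity(logits: np.ndarray, input_ids: np.ndarray,
               pad_id: int = 151643) -> float:
    """Perplexity of `input_ids` under next-token `logits`.

    logits: [1, T, V] (fp16 or fp32), input_ids: [1, T] (int32).
    Position t's logits predict token t+1; padding positions are masked out.
    pad_id 151643 assumes Qwen2.5's <|endoftext|> pad token.
    """
    logits = logits[0].astype(np.float64)   # [T, V]
    ids = input_ids[0]                      # [T]
    # Numerically stable log-softmax over the vocab axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logprobs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    targets = ids[1:]                       # token t+1, predicted from position t
    mask = targets != pad_id                # ignore padded tail of the 128 slots
    nll = -logprobs[np.arange(len(targets)), targets][mask]
    return float(np.exp(nll.mean()))
```

On device the same arithmetic would run over the fp16 `MLMultiArray` returned by the Core ML prediction; doing the reduction in fp32/fp64 avoids precision loss in the log-sum-exp.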
## Files
- `Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage` — the Core ML package
- Tokenizer JSONs copied from the source HF model
## Conversion
Produced with coremltools 8.x via `ct.convert(..., convert_to="mlprogram", minimum_deployment_target=ct.target.iOS17)`,
followed by `linear_quantize_weights` with `granularity="per_channel"` and `dtype="int8"`.
The conversion script lives at `scripts/model-conversion/convert.py` in the consumer project.
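For readers without access to the consumer project, the pipeline above can be sketched as follows. This is an assumption-laden reconstruction, not the actual `convert.py`: the tracing wrapper, input/output names, and quantizer options beyond `dtype`/`granularity` are illustrative, and running it requires `torch`, `transformers`, and `coremltools` plus the source weights.

```python
# Sketch of the conversion described above (hypothetical reconstruction,
# not the repo's convert.py). Requires torch, transformers, coremltools 8.x.
import coremltools as ct
import coremltools.optimize.coreml as cto
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.float32
).eval()

# Fixed [1, 128] int32 input, matching the shapes listed above.
example = torch.zeros((1, 128), dtype=torch.int32)
traced = torch.jit.trace(lambda ids: model(input_ids=ids).logits, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS17,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    outputs=[ct.TensorType(name="logits", dtype=np.float16)],
)

# int8 per-channel weight quantisation, as stated in the card.
op_config = cto.OpLinearQuantizerConfig(
    mode="linear_symmetric", dtype="int8", granularity="per_channel"
)
mlmodel = cto.linear_quantize_weights(
    mlmodel, config=cto.OptimizationConfig(global_config=op_config)
)
mlmodel.save("Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage")
```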