# Qwen2.5-0.5B-Instruct-ANE-int8
A non-stateful Core ML build of Qwen/Qwen2.5-0.5B-Instruct, quantised to
int8 per-channel so that 99.5% of the ops route to the Apple Neural Engine
(verified with MLComputePlan on an iPhone 17 Pro running iOS 26.4.1).
Designed for batch perplexity scoring:

- Fixed input shape: `input_ids = [1, 128]` (int32)
- Output: `logits = [1, 128, 151936]` (fp16) — full per-position logits
- No MLState, no KV cache; a single forward pass per candidate text
Measured ~13 ms / forward on A19 Pro (iPhone 17 Pro) for the full 128-token batch.
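Since the model returns full per-position logits in one pass, perplexity for a candidate text falls out of a single prediction. The helper below is a minimal numpy sketch (not code from this repo) of that post-processing step; the `pad_id` default assumes Qwen2.5's `<|endoftext|>` token id 151643 as the padding token — adjust if your tokenizer config differs.

```python
import numpy as np

def perplexity(logits: np.ndarray, input_ids: np.ndarray,
               pad_id: int = 151643) -> float:
    """Perplexity of `input_ids` under next-token `logits`.

    logits: [1, T, V] (fp16 or fp32), input_ids: [1, T] (int32).
    Position t's logits predict token t+1; padding positions are masked out.
    pad_id 151643 assumes Qwen2.5's <|endoftext|> pad token.
    """
    logits = logits[0].astype(np.float64)   # [T, V]
    ids = input_ids[0]                      # [T]
    # Numerically stable log-softmax over the vocab axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logprobs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    targets = ids[1:]                       # token t+1, predicted from position t
    mask = targets != pad_id                # ignore padded tail of the 128 slots
    nll = -logprobs[np.arange(len(targets)), targets][mask]
    return float(np.exp(nll.mean()))
```

On device the same arithmetic would run over the fp16 `MLMultiArray` returned by the Core ML prediction; doing the reduction in fp32/fp64 avoids precision loss in the log-sum-exp.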
## Files
- `Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage` — the Core ML package
- Tokenizer JSONs copied from the source HF model
## Conversion
Produced with coremltools 8.x via `ct.convert(..., convert_to="mlprogram", minimum_deployment_target=ct.target.iOS17)`,
followed by `linear_quantize_weights` with `granularity="per_channel"` and `dtype="int8"`.
The conversion script lives at `scripts/model-conversion/convert.py` in the consumer project.
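For readers without access to the consumer project, the pipeline above can be sketched as follows. This is an assumption-laden reconstruction, not the actual `convert.py`: the tracing wrapper, input/output names, and quantizer options beyond `dtype`/`granularity` are illustrative, and running it requires `torch`, `transformers`, and `coremltools` plus the source weights.

```python
# Sketch of the conversion described above (hypothetical reconstruction,
# not the repo's convert.py). Requires torch, transformers, coremltools 8.x.
import coremltools as ct
import coremltools.optimize.coreml as cto
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.float32
).eval()

# Fixed [1, 128] int32 input, matching the shapes listed above.
example = torch.zeros((1, 128), dtype=torch.int32)
traced = torch.jit.trace(lambda ids: model(input_ids=ids).logits, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS17,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    outputs=[ct.TensorType(name="logits", dtype=np.float16)],
)

# int8 per-channel weight quantisation, as stated in the card.
op_config = cto.OpLinearQuantizerConfig(
    mode="linear_symmetric", dtype="int8", granularity="per_channel"
)
mlmodel = cto.linear_quantize_weights(
    mlmodel, config=cto.OptimizationConfig(global_config=op_config)
)
mlmodel.save("Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage")
```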