Qwen2.5-0.5B-Instruct-ANE-int8

Non-stateful Core ML build of Qwen/Qwen2.5-0.5B-Instruct, weight-quantised to int8 with per-channel scales so that 99.5% of ops dispatch to the Apple Neural Engine (verified with MLComputePlan on iPhone 17 Pro / iOS 26.4.1).

Designed for batch perplexity scoring:

  • Fixed input shape input_ids = [1, 128] (int32)
  • Output logits = [1, 128, 151936] (fp16) — full per-position logits
  • No MLState, no KV cache, single forward pass per candidate text

Measured ~13 ms / forward on A19 Pro (iPhone 17 Pro) for the full 128-token batch.
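Since the model returns full per-position logits rather than sampling, sequence perplexity is computed host-side from the logits tensor. A minimal sketch of that computation (plain Python, illustrative function name; a real consumer would operate on the fp16 [1, 128, 151936] output array):

```python
import math

def perplexity(logits, token_ids):
    """Perplexity of a token sequence given per-position next-token logits.

    logits[i] holds the raw scores emitted at position i, which predict
    token_ids[i + 1]; the final logit row is unused. This matches a single
    non-cached forward pass that returns logits for every position.
    """
    nll = 0.0
    steps = 0
    for i in range(len(token_ids) - 1):
        row = logits[i]
        # log-softmax denominator via the log-sum-exp trick for stability
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        nll += lse - row[token_ids[i + 1]]
        steps += 1
    return math.exp(nll / steps)
```

For batch scoring, each candidate text is tokenised, padded or truncated to the fixed 128-token window, run through one forward pass, and scored with a routine like the one above.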

Files

  • Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage — the Core ML package
  • Tokenizer JSONs copied from the source HF model

Conversion

Produced via coremltools 8.x with ct.convert(convert_to="mlprogram", minimum_deployment_target=iOS17) followed by linear_quantize_weights(granularity="per_channel", dtype="int8"). Conversion script lives at scripts/model-conversion/convert.py in the consumer project.
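The conversion pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual scripts/model-conversion/convert.py; `traced_model` (a torch.jit trace of the Qwen forward with a fixed [1, 128] int32 input) and the input/output names are assumptions:

```python
import numpy as np
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Convert the traced PyTorch model to an ML Program targeting iOS 17+.
mlmodel = ct.convert(
    traced_model,  # assumed: torch.jit trace with fixed [1, 128] int32 input
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS17,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
)

# int8 per-channel weight quantisation, matching the published package.
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric", dtype="int8", granularity="per_channel"
    )
)
mlmodel = linear_quantize_weights(mlmodel, config=config)
mlmodel.save("Qwen2.5-0.5B-Instruct-ANE-int8.mlpackage")
```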
