# MedGemma-1.5-4b-it ExecuTorch
This repository contains the medgemma-1.5-4b-it model converted to the ExecuTorch format for on-device inference in Android applications.
## Conversion Details
The model was converted using a custom fork of optimum-executorch that includes critical fixes for:

- **Extended Context Window**: enables processing sequences of up to 128K tokens (vs. the default 2048).
- **Correct EOS Handling**: properly sets the End-of-Sequence token IDs `[1, 106]` for correct generation termination.
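The EOS fix matters because decoding must stop as soon as the model emits either terminator. A minimal sketch of the stop condition, using the two token IDs from the fix above (the sampling callback here is a hypothetical stand-in for the model's next-token step):

```python
# Gemma-style models use two terminators; both IDs must be registered.
EOS_TOKEN_IDS = {1, 106}

def decode(sample_next_token, max_new_tokens):
    """Run a decode loop, stopping on any EOS token.

    `sample_next_token` is a stand-in for the model's next-token step.
    """
    tokens = []
    for _ in range(max_new_tokens):
        token = sample_next_token()
        if token in EOS_TOKEN_IDS:  # correct termination needs BOTH ids
            break
        tokens.append(token)
    return tokens

# With only {1} registered, a stream ending in 106 would keep decoding
# until max_new_tokens instead of stopping here.
stream = iter([45, 872, 19, 106, 7, 7])
print(decode(lambda: next(stream), 6))  # -> [45, 872, 19]
```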
## Prerequisites
```bash
# Set up the environment
uv venv --python 3.12
source .venv/bin/activate

# Clone the custom optimum-executorch repository and check out the fix branch
git clone https://github.com/kamalkraj/optimum-executorch.git
cd optimum-executorch
git checkout merge-eos-and-max-seq

# Install dependencies (torch 2.9.0 and torchao 0.14.1 are required for correct tracing)
uv pip install '.[dev]' torch==2.9.0 torchao==0.14.1
```
## Export Command
The model is exported with `optimum-cli` using the XNNPACK recipe, applying 8-bit dynamic-activation, 4-bit-weight quantization (`8da4w`) to the linear layers.
```bash
optimum-cli export executorch \
  --model "google/medgemma-1.5-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --device cpu \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear 8da4w \
  --qlinear_group_size 32 \
  --qlinear_encoder "8da4w,8da8w" \
  --qlinear_encoder_group_size 32 \
  --qembedding "8w" \
  --qembedding_encoder "8w" \
  --max_seq_len 131072 \
  --output_dir="medgemma-1.5-4b-it-8da4w-executorch"
```
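As a rough illustration of what the weight side of `8da4w` does, here is a sketch of symmetric group-wise int4 quantization with group size 32, in plain Python. This mirrors the idea behind `--qlinear 8da4w` / `--qlinear_group_size 32`; it is not the torchao implementation:

```python
def quantize_int4_groupwise(weights, group_size=32):
    """Symmetric int4 quantization: one scale per group of `group_size` values.

    Returns (integer values in [-8, 7], per-group float scales).
    """
    assert len(weights) % group_size == 0
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group onto the int4 range.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales

def dequantize(qvals, scales, group_size=32):
    """Recover approximate float weights from int4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]

# One group of 32 weights: the reconstruction error stays within one scale step.
weights = [i / 10 for i in range(-16, 16)]
qvals, scales = quantize_int4_groupwise(weights)
recon = dequantize(qvals, scales)
```

A smaller group size gives each scale fewer values to cover, improving accuracy at the cost of storing more scales.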
**Memory Usage Note:** The command above uses a maximum context length of 128K tokens, which requires roughly 10-11 GB of RAM on-device. To reduce memory usage, decrease `--max_seq_len` (e.g., to `4096` or `8192`) before exporting; this still allows effective inference while fitting within the constraints of lower-end devices.
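To see why `--max_seq_len` dominates on-device memory, a back-of-the-envelope KV-cache estimate helps. The layer and head dimensions below are illustrative placeholders (read the real values from the model's `config.json`), and the 1-byte-per-element cache is an assumption for illustration, not a confirmed property of `--use_custom_kv_cache`:

```python
def kv_cache_bytes(max_seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Approximate KV-cache footprint: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * max_seq_len * dtype_bytes

# Placeholder dimensions for a ~4B decoder; substitute the real config values.
for seq_len in (4096, 8192, 131072):
    size = kv_cache_bytes(seq_len, n_layers=34, n_kv_heads=4,
                          head_dim=256, dtype_bytes=1)
    print(f"max_seq_len={seq_len:>6}: ~{size / 2**30:.2f} GiB of KV cache")
```

The cache grows linearly with `max_seq_len`, which is why dropping from 128K to 4K-8K tokens cuts the RAM requirement so dramatically.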