OpenMOSE/Qwen3-VL-REAP-24B-A3B
Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-30B.
1. Model Summary
- Base model: Qwen3-VL-30B (vision–language MoE LLM)
- Variant name: Qwen3-VL-REAP-24B-A3B
- Architecture: Decoder-only Transformer + MoE MLP experts, with vision encoder + VL fusion as in Qwen3-VL
- Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
- Expert sparsity: ~25% of MoE experts pruned globally
- Active parameters: “A3B” indicates roughly 3B active parameters per token (MoE sparse activation), while total parameters are reduced to about 24B
- Modality: Text + Vision (VL support kept intact)
- License: Apache 2.0
- Author / Maintainer: OpenMOSE
- Year: 2025
This is an unofficial community variant of Qwen3-VL, not affiliated with or endorsed by Alibaba or Cerebras Systems.
2. What Is REAP and What Did We Change?
REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:
- Router statistics (routing probabilities)
- Expert activation patterns on a calibration set
to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
For this model:
- We applied REAP to Qwen3-VL-30B across its MoE MLP blocks.
- ~25% of experts are pruned, based on router-weighted activation statistics.
- The routing mechanism itself is not conceptually changed; we only changed which experts remain.
- We extended the original REAP implementation to support the Qwen3-VL architecture, i.e. vision encoder + VL fusion layers, so pruning can be applied without breaking VL functionality.
In short: the same REAP algorithm, adapted to Qwen3-VL, with VL functionality left intact.
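For intuition, here is a minimal sketch of the kind of router-weighted saliency score REAP is built around (the exact criterion is defined in the Cerebras reference implementation linked above; the tensor shapes and names here are illustrative assumptions):

```python
import torch

def reap_saliency(gate_probs: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Illustrative router-weighted saliency per expert.

    gate_probs:     (num_tokens, num_experts) routing probabilities
    expert_outputs: (num_tokens, num_experts, hidden) per-expert outputs on
                    calibration tokens (zero where an expert was not routed to)
    Returns a (num_experts,) score; low-scoring experts are pruning candidates.
    """
    # Weight each expert's output magnitude by how strongly the router
    # selected it, then average over the calibration tokens.
    norms = expert_outputs.norm(dim=-1)      # (num_tokens, num_experts)
    return (gate_probs * norms).mean(dim=0)  # (num_experts,)

# Pruning ~25% of experts means keeping the top 75% by saliency:
# scores = reap_saliency(gate_probs, expert_outputs)
# keep = torch.topk(scores, k=int(0.75 * scores.numel())).indices
```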
3. Calibration Data
The REAP pruning statistics were computed using:
Calibration dataset: https://huggingface.co/datasets/OpenMOSE/reap-calib-mix
This dataset is mostly synthetic, generated by Qwen3-235B-Instruct on mixed prompts designed to cover:
- General instruction-following
- Reasoning and long-form text
The calibration set is not used for additional fine-tuning here; it is used to measure router/expert activations to decide which experts to prune.
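As a rough illustration, such statistics can be collected with forward hooks while running the calibration set through the model. This is only a sketch: the `.mlp.gate` module-name filter and the shape of the gate output are assumptions that depend on the exact Qwen3-VL implementation in your transformers version.

```python
import torch

# Assumes `model` is already loaded, as in the usage snippet in Section 7.
router_probs_sum = {}  # module name -> running sum of routing probabilities

def make_hook(name):
    def hook(module, inputs, output):
        # Assumption: the gating module returns router logits whose last
        # dimension indexes experts; adapt to the actual module output.
        probs = torch.softmax(output.float(), dim=-1)
        probs = probs.reshape(-1, probs.shape[-1])  # (num_tokens, num_experts)
        router_probs_sum[name] = router_probs_sum.get(name, 0) + probs.sum(dim=0)
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith(".mlp.gate")  # assumption: MoE gate modules
]

# ... run forward passes over the calibration set here ...

for h in handles:
    h.remove()
```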
4. Why 24B-A3B? (Motivation & Hardware Footprint)
By pruning ~25% of experts while keeping VL intact:
- The model shrinks from ~30B to about 24B total parameters.
- With sparse MoE activation, around 3B parameters are active per token (“A3B”).
In practice, this makes it feasible to deploy on a single 16 GB GPU:
- No CPU offload is strictly required with Q4_K_M quantization
- Lower-VRAM configurations may still need offload or more aggressive quantization
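A rough back-of-envelope check of why a 16 GB GPU is enough (the bits-per-weight figure for Q4_K_M is an approximation, not a measured value):

```python
total_params = 24e9     # total parameters after pruning
bits_per_weight = 4.85  # rough average for Q4_K_M (assumption)

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~14.6 GB, leaving modest headroom
# on a 16 GB GPU for activations and the KV cache.
```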
The goal is to keep the full VL capability of Qwen3-VL-30B, while making it:
- Easier to experiment with REAP-style MoE pruning
- More practical to deploy and fine-tune on high-end but single-node hardware
5. Intended Use
Primary intended uses
Research on:
- MoE pruning and compression (especially REAP)
- Scaling behavior of pruned MoE VL models
- Trade-offs between expert sparsity and performance
Experimental deployment for:
- Vision–language assistants
- Multimodal chatbots
- Document + image understanding
Suitable tasks (examples)
- Multimodal chat (image + text → text)
- Image captioning / description
- Visual question answering
- General instruction-following and long-form text generation
Out-of-scope / high-risk uses
This model should not be used without additional safeguards for:
- Medical, legal, or financial advice
- Safety-critical decision making
- Political persuasion or targeted disinformation
- Any scenario where incorrect or biased outputs can cause real-world harm
6. Limitations & Risks
This model inherits all the limitations of Qwen3-VL-30B plus those introduced by pruning:
Hallucinations: The model can generate plausible but incorrect facts.
Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.
Distribution shift from pruning:
- Some long-tail behaviors may degrade due to pruning ~25% of experts.
- Performance may be uneven across tasks, domains, or languages not well covered in the calibration set.
Multimodal edge cases:
- Complex compositional visual reasoning or extremely high-resolution images may not work reliably.
- VL behavior is preserved but not fully re-tuned after pruning.
Users should perform their own evaluation before relying on the model in any sensitive context.
7. How to Use
Note: Replace class names according to the `transformers` version you use. Here we assume `Qwen3VLForConditionalGeneration` and `AutoProcessor` are available (as in recent Qwen3-VL integrations).
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "OpenMOSE/Qwen3-VL-REAP-24B-A3B"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example: image + text prompt
image = ...  # PIL.Image or numpy array
prompt = "Describe this image and summarize its key elements in one paragraph."

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=512,
    )

output_text = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(output_text)
```
For text-only usage, pass only `text=` to the processor.
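For example (continuing from the snippet above):

```python
# Text-only: omit the images argument entirely.
inputs = processor(
    text="Explain mixture-of-experts routing in two sentences.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```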
8. Evaluation (Status)
This release focuses on making the REAP-pruned VL model available.
Quantitative benchmarks (e.g., MMBench, general QA, reasoning benchmarks) are still work in progress.
Early qualitative checks show:
- VL behavior is preserved after pruning.
- Latency and memory usage are improved compared to Qwen3-VL-30B, especially on single 16 GB GPUs.
Community contributions with detailed benchmarks are very welcome.
9. Training & Distillation Details (High-Level)
Base model: Qwen3-VL-30B
Pruning method: REAP (Router-weighted Expert Activation Pruning)
Calibration data: `OpenMOSE/reap-calib-mix` (mostly generated by Qwen3-235B-Instruct)
Post-processing:
- Router / gating structure retained
- Experts pruned according to REAP scoring
- No additional large-scale pretraining was performed for this release
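Conceptually, pruning an MoE layer means keeping the selected experts and slicing the matching rows out of the router's gate projection so expert indices stay aligned. A minimal sketch under assumed attribute names (not the actual Qwen3-VL module layout):

```python
import torch
import torch.nn as nn

def prune_moe_layer(experts: nn.ModuleList, gate: nn.Linear, keep: torch.Tensor):
    """Keep only the experts indexed by `keep`, and rebuild the router's gate
    projection so its output dimension matches the surviving experts."""
    kept_experts = nn.ModuleList(experts[i] for i in keep.tolist())
    new_gate = nn.Linear(gate.in_features, len(keep), bias=gate.bias is not None)
    with torch.no_grad():
        new_gate.weight.copy_(gate.weight[keep])
        if gate.bias is not None:
            new_gate.bias.copy_(gate.bias[keep])
    return kept_experts, new_gate
```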
Future versions may include post-pruning fine-tuning or distillation to recover more performance.
10. Community & Contribution
Let’s grow this model together as a community.
You are encouraged to:
- Run benchmarks and publish results
- Contribute scripts for:
  - Further pruning experiments
  - Quantization (e.g., GGUF, AWQ, GPTQ)
  - Long-context or domain-specific fine-tuning
- Report issues or findings about failure modes, biases, or surprising behaviors
11. License
- Model & code (this repository): Apache License 2.0
- The license and usage terms of the original Qwen3-VL-30B model also apply to this variant and any downstream use.
12. Acknowledgements
- Qwen team for building the Qwen3-VL family of models.
- Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
- OpenMOSE community for experimentation, engineering, and calibration data generation.
© 2025 OpenMOSE