OpenMOSE/Qwen3-VL-REAP-24B-A3B

Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-30B.


1. Model Summary

  • Base model: Qwen3-VL-30B (vision–language MoE LLM)
  • Variant name: Qwen3-VL-REAP-24B-A3B
  • Architecture: Decoder-only Transformer + MoE MLP experts, with vision encoder + VL fusion as in Qwen3-VL
  • Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
  • Expert sparsity: ~25% of MoE experts pruned globally
  • Active parameters: “A3B” indicates roughly 3B active parameters per token (sparse MoE activation), while total parameters are reduced to about 24B
  • Modality: Text + Vision (VL support kept intact)
  • License: Apache 2.0
  • Author / Maintainer: OpenMOSE
  • Year: 2025

This is an unofficial community variant of Qwen3-VL, not affiliated with or endorsed by Alibaba or Cerebras Systems.


2. What Is REAP and What Did We Change?

REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:

  • Router statistics (routing probabilities)
  • Expert activation patterns on a calibration set

to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
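
As a rough illustration only (not the Cerebras reference implementation linked above), the scoring idea behind REAP can be sketched as follows; the function name, shapes, and random data are made up for the example:

import torch

def reap_style_saliency(gate_probs: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Toy router-weighted saliency score for one MoE layer.

    gate_probs:     (num_tokens, num_experts) router probabilities
                    (dense softmax or sparse top-k weights).
    expert_outputs: (num_tokens, num_experts, hidden) expert MLP outputs,
                    zero where an expert was not evaluated.
    Returns a (num_experts,) score; low-scoring experts are pruning candidates.
    """
    activation_norm = expert_outputs.norm(dim=-1)   # (num_tokens, num_experts)
    weighted = gate_probs * activation_norm         # weight each activation by its routing probability
    return weighted.mean(dim=0)                     # average over calibration tokens

# Example with random stand-in data: 1,000 calibration tokens, 8 experts, hidden size 16
scores = reap_style_saliency(torch.rand(1000, 8).softmax(-1), torch.randn(1000, 8, 16))
keep = scores.topk(6).indices                       # keep the 6 highest-scoring experts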

For this model:

  • We applied REAP to Qwen3-VL-30B across its MoE MLP blocks.
  • ~25% of experts are pruned, based on router-weighted activation statistics.
  • The routing mechanism itself is not conceptually changed; we only changed which experts remain.
  • We extended the original REAP implementation to support the Qwen3-VL architecture, i.e. vision encoder + VL fusion layers, so pruning can be applied without breaking VL functionality.

In short: the same REAP algorithm, adapted to Qwen3-VL, with VL functionality left intact.


3. Calibration Data

The REAP pruning statistics were computed using the OpenMOSE/reap-calib-mix calibration set, generated mostly by Qwen3-235B-Instruct (see Section 9).

The calibration set is not used for additional fine-tuning here; it is used to measure router/expert activations to decide which experts to prune.
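
A minimal, self-contained sketch of what "measuring router activations" means; the toy nn.Linear gate and random batches stand in for the Qwen3-VL MoE gates and the calibration mix:

import torch
import torch.nn as nn

gate = nn.Linear(64, 8, bias=False)          # toy router: hidden size 64, 8 experts
stats = {"prob_sum": torch.zeros(8), "tokens": 0}

def hook(module, inputs, output):
    probs = output.softmax(dim=-1)           # routing probabilities per token
    stats["prob_sum"] += probs.sum(dim=0)
    stats["tokens"] += probs.shape[0]

handle = gate.register_forward_hook(hook)
with torch.no_grad():
    for _ in range(10):                      # 10 "calibration batches"
        gate(torch.randn(32, 64))            # 32 tokens per batch
handle.remove()

print(stats["prob_sum"] / stats["tokens"])   # average routing probability per expert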


4. Why 24B-A3B? (Motivation & Hardware Footprint)

By pruning ~25% of experts and keeping VL:

  • The model shrinks from ~30B total parameters to about 24B total parameters.

  • With sparse MoE activation, around 3B parameters are active per token (“A3B”).

  • In practice, this makes it feasible to deploy on a single 16 GB GPU (see the back-of-envelope estimate after this list):

    • No CPU offload is strictly required with Q4_K_M quantization.
    • Lower-VRAM configurations may still need CPU offload or more aggressive quantization.
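
A rough back-of-envelope estimate of weight memory under these assumptions (treating Q4_K_M as roughly 4.5 bits per weight on average; KV cache and activations need additional headroom):

total_params = 24e9            # total parameters after pruning
bits_per_weight = 4.5          # rough average for Q4_K_M quantization
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB for weights")   # ~13.5 GB, within a 16 GB GPU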

The goal is to keep the full VL capability of Qwen3-VL-30B, while making it:

  • Easier to experiment with REAP-style MoE pruning
  • More practical to deploy and fine-tune on high-end but single-node hardware

5. Intended Use

Primary intended uses

  • Research on:

    • MoE pruning and compression (especially REAP)
    • Scaling behavior of pruned MoE VL models
    • Trade-offs between expert sparsity and performance
  • Experimental deployment for:

    • Vision–language assistants
    • Multimodal chatbots
    • Document + image understanding

Suitable tasks (examples)

  • Multimodal chat (image + text → text)
  • Image captioning / description
  • Visual question answering
  • General instruction-following and long-form text generation

Out-of-scope / high-risk uses

This model should not be used without additional safeguards for:

  • Medical, legal, or financial advice
  • Safety-critical decision making
  • Political persuasion or targeted disinformation
  • Any scenario where incorrect or biased outputs can cause real-world harm

6. Limitations & Risks

This model inherits all the limitations of Qwen3-VL-30B plus those introduced by pruning:

  • Hallucinations: The model can generate plausible but incorrect facts.

  • Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.

  • Distribution shift from pruning:

    • Some long-tail behaviors may degrade due to pruning ~25% of experts.
    • Performance may be uneven across tasks, domains, or languages not well covered in the calibration set.
  • Multimodal edge cases:

    • Complex compositional visual reasoning or extremely high-resolution images may not work reliably.
    • VL behavior is preserved but not fully re-tuned after pruning.

Users should perform their own evaluation before relying on the model in any sensitive context.


7. How to Use

Note: Replace class names according to the transformers version you use. Here we assume Qwen3VLForConditionalGeneration and AutoProcessor are available (as in recent Qwen3-VL integrations).

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "OpenMOSE/Qwen3-VL-REAP-24B-A3B"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example: image + text prompt
image = ...  # PIL.Image or numpy array
prompt = "Describe this image and summarize its key elements in one paragraph."

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=512,
    )

output_text = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(output_text)

For text-only usage, provide only text= to the processor.
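
A minimal text-only sketch, reusing the model and processor loaded above (the prompt is just an example):

text_inputs = processor(
    text="Explain mixture-of-experts pruning in two sentences.",
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    text_out = model.generate(**text_inputs, max_new_tokens=256)
print(processor.batch_decode(text_out, skip_special_tokens=True)[0])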


8. Evaluation (Status)

  • This release focuses on making the REAP-pruned VL model available.

  • Quantitative benchmarks (e.g., MMBench, general QA, reasoning benchmarks) are still work in progress.

  • Early qualitative checks show:

    • VL behavior is preserved after pruning.
    • Latency and memory usage are improved compared to Qwen3-VL-30B, especially on single 16 GB GPUs.

Community contributions with detailed benchmarks are very welcome.


9. Training & Distillation Details (High-Level)

  • Base model: Qwen3-VL-30B

  • Pruning method: REAP (Router-weighted Expert Activation Pruning)

  • Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)

  • Post-processing:

    • Router / gating structure retained
    • Experts pruned according to REAP scoring (see the toy sketch after this list)
    • No additional large-scale pretraining is performed in this release
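
A toy illustration of that post-processing step, with made-up shapes and tensor names (the real checkpoint keys and expert layout follow the Qwen3-VL implementation):

import torch

num_experts, hidden, ffn = 128, 2048, 768
scores = torch.rand(num_experts)                        # REAP saliency per expert (toy values)
keep = scores.topk(int(num_experts * 0.75)).indices.sort().values   # keep ~75%, prune ~25%

expert_up_proj = torch.randn(num_experts, ffn, hidden)  # stacked expert weights (illustrative)
router_weight = torch.randn(num_experts, hidden)        # router projection: one row per expert

pruned_up_proj = expert_up_proj[keep]                   # drop the pruned experts' weights
pruned_router = router_weight[keep]                     # router now scores only the kept experts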

Future versions may include post-pruning fine-tuning or distillation to recover more performance.


10. Community & Contribution

Let’s grow this model together as a community.

You are encouraged to:

  • Run benchmarks and publish results

  • Contribute scripts for:

    • Further pruning experiments
    • Quantization (e.g., GGUF, AWQ, GPTQ)
    • Long-context or domain-specific fine-tuning
  • Report issues or findings about failure modes, biases, or surprising behaviors


11. License

  • Model & code (this repository): Apache License 2.0
  • Use of the original Qwen3-VL-30B model, and any downstream use of this variant, must also respect the respective licenses and usage terms.

12. Acknowledgements

  • Qwen team for building the Qwen3-VL family of models.
  • Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
  • OpenMOSE community for experimentation, engineering, and calibration data generation.

2025 OpenMOSE
