OpenMOSE/Qwen3-VL-REAP-24B-A3B

Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-30B.


1. Model Summary

  • Base model: Qwen3-VL-30B (vision–language MoE LLM)
  • Variant name: Qwen3-VL-REAP-24B-A3B
  • Architecture: Decoder-only Transformer + MoE MLP experts, with vision encoder + VL fusion as in Qwen3-VL
  • Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
  • Expert sparsity: ~25% of MoE experts pruned globally
  • Active parameters: “A3B” indicates roughly 3B active parameters per token (sparse MoE activation), while total parameters are reduced to about 24B
  • Modality: Text + Vision (VL support kept intact)
  • License: Apache 2.0
  • Author / Maintainer: OpenMOSE
  • Year: 2025

This is an unofficial community variant of Qwen3-VL, not affiliated with or endorsed by Alibaba or Cerebras Systems.


2. What Is REAP and What Did We Change?

REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:

  • Router statistics (routing probabilities)
  • Expert activation patterns on a calibration set

to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
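
As a rough illustration only (not the Cerebras reference implementation linked above), the scoring idea behind REAP can be sketched as follows; the function name, shapes, and random data are made up for the example:

import torch

def reap_style_saliency(gate_probs: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Toy router-weighted saliency score for one MoE layer.

    gate_probs:     (num_tokens, num_experts) router probabilities
                    (dense softmax or sparse top-k weights).
    expert_outputs: (num_tokens, num_experts, hidden) expert MLP outputs,
                    zero where an expert was not evaluated.
    Returns a (num_experts,) score; low-scoring experts are pruning candidates.
    """
    activation_norm = expert_outputs.norm(dim=-1)   # (num_tokens, num_experts)
    weighted = gate_probs * activation_norm         # weight each activation by its routing probability
    return weighted.mean(dim=0)                     # average over calibration tokens

# Example with random stand-in data: 1,000 calibration tokens, 8 experts, hidden size 16
scores = reap_style_saliency(torch.rand(1000, 8).softmax(-1), torch.randn(1000, 8, 16))
keep = scores.topk(6).indices                       # keep the 6 highest-scoring experts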

For this model:

  • We applied REAP to Qwen3-VL-30B across its MoE MLP blocks.
  • ~25% of experts are pruned, based on router-weighted activation statistics.
  • The routing mechanism itself is not conceptually changed; we only changed which experts remain.
  • We extended the original REAP implementation to support the Qwen3-VL architecture, i.e. vision encoder + VL fusion layers, so pruning can be applied without breaking VL functionality.

In short: the same REAP algorithm, adapted to Qwen3-VL, with VL functionality left intact.


3. Calibration Data

The REAP pruning statistics were computed using the OpenMOSE/reap-calib-mix calibration set, generated mostly by Qwen3-235B-Instruct (see Section 9).

The calibration set is not used for additional fine-tuning here; it is used to measure router/expert activations to decide which experts to prune.
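
A minimal, self-contained sketch of what "measuring router activations" means; the toy nn.Linear gate and random batches stand in for the Qwen3-VL MoE gates and the calibration mix:

import torch
import torch.nn as nn

gate = nn.Linear(64, 8, bias=False)          # toy router: hidden size 64, 8 experts
stats = {"prob_sum": torch.zeros(8), "tokens": 0}

def hook(module, inputs, output):
    probs = output.softmax(dim=-1)           # routing probabilities per token
    stats["prob_sum"] += probs.sum(dim=0)
    stats["tokens"] += probs.shape[0]

handle = gate.register_forward_hook(hook)
with torch.no_grad():
    for _ in range(10):                      # 10 "calibration batches"
        gate(torch.randn(32, 64))            # 32 tokens per batch
handle.remove()

print(stats["prob_sum"] / stats["tokens"])   # average routing probability per expert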


4. Why 24B-A3B? (Motivation & Hardware Footprint)

By pruning ~25% of experts and keeping VL:

  • The model shrinks from ~30B total parameters to about 24B total parameters.

  • With sparse MoE activation, around 3B parameters are active per token (“A3B”).

  • In practice, this makes it feasible to deploy on a single 16 GB GPU (see the back-of-envelope estimate after this list):

    • No CPU offload is strictly required with Q4_K_M quantization.
    • Lower-VRAM configurations may still need CPU offload or more aggressive quantization.
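
A rough back-of-envelope estimate of weight memory under these assumptions (treating Q4_K_M as roughly 4.5 bits per weight on average; KV cache and activations need additional headroom):

total_params = 24e9            # total parameters after pruning
bits_per_weight = 4.5          # rough average for Q4_K_M quantization
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB for weights")   # ~13.5 GB, within a 16 GB GPU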

The goal is to keep the full VL capability of Qwen3-VL-30B, while making it:

  • Easier to experiment with REAP-style MoE pruning
  • More practical to deploy and fine-tune on high-end but single-node hardware

5. Intended Use

Primary intended uses

  • Research on:

    • MoE pruning and compression (especially REAP)
    • Scaling behavior of pruned MoE VL models
    • Trade-offs between expert sparsity and performance
  • Experimental deployment for:

    • Vision–language assistants
    • Multimodal chatbots
    • Document + image understanding

Suitable tasks (examples)

  • Multimodal chat (image + text → text)
  • Image captioning / description
  • Visual question answering
  • General instruction-following and long-form text generation

Out-of-scope / high-risk uses

This model should not be used without additional safeguards for:

  • Medical, legal, or financial advice
  • Safety-critical decision making
  • Political persuasion or targeted disinformation
  • Any scenario where incorrect or biased outputs can cause real-world harm

6. Limitations & Risks

This model inherits all the limitations of Qwen3-VL-30B plus those introduced by pruning:

  • Hallucinations: The model can generate plausible but incorrect facts.

  • Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.

  • Distribution shift from pruning:

    • Some long-tail behaviors may degrade due to pruning ~25% of experts.
    • Performance may be uneven across tasks, domains, or languages not well covered in the calibration set.
  • Multimodal edge cases:

    • Complex compositional visual reasoning or extremely high-resolution images may not work reliably.
    • VL behavior is preserved but not fully re-tuned after pruning.

Users should perform their own evaluation before relying on the model in any sensitive context.


7. How to Use

Note: Replace class names according to the transformers version you use. Here we assume Qwen3VLForConditionalGeneration and AutoProcessor are available (as in recent Qwen3-VL integrations).

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "OpenMOSE/Qwen3-VL-REAP-24B-A3B"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example: image + text prompt
image = ...  # PIL.Image or numpy array
prompt = "Describe this image and summarize its key elements in one paragraph."

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=512,
    )

output_text = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(output_text)

For text-only usage, provide only text= to the processor.
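
A minimal text-only sketch, reusing the model and processor loaded above (the prompt is just an example):

text_inputs = processor(
    text="Explain mixture-of-experts pruning in two sentences.",
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    text_out = model.generate(**text_inputs, max_new_tokens=256)
print(processor.batch_decode(text_out, skip_special_tokens=True)[0])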


8. Evaluation (Status)

  • This release focuses on making the REAP-pruned VL model available.

  • Quantitative benchmarks (e.g., MMBench, general QA, reasoning benchmarks) are still work in progress.

  • Early qualitative checks show:

    • VL behavior is preserved after pruning.
    • Latency and memory usage are improved compared to Qwen3-VL-30B, especially on single 16 GB GPUs.

Community contributions with detailed benchmarks are very welcome.


9. Training & Distillation Details (High-Level)

  • Base model: Qwen3-VL-30B

  • Pruning method: REAP (Router-weighted Expert Activation Pruning)

  • Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)

  • Post-processing:

    • Router / gating structure retained
    • Experts pruned according to REAP scoring (see the toy sketch after this list)
    • No additional large-scale pretraining is performed in this release
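
A toy illustration of that post-processing step, with made-up shapes and tensor names (the real checkpoint keys and expert layout follow the Qwen3-VL implementation):

import torch

num_experts, hidden, ffn = 128, 2048, 768
scores = torch.rand(num_experts)                        # REAP saliency per expert (toy values)
keep = scores.topk(int(num_experts * 0.75)).indices.sort().values   # keep ~75%, prune ~25%

expert_up_proj = torch.randn(num_experts, ffn, hidden)  # stacked expert weights (illustrative)
router_weight = torch.randn(num_experts, hidden)        # router projection: one row per expert

pruned_up_proj = expert_up_proj[keep]                   # drop the pruned experts' weights
pruned_router = router_weight[keep]                     # router now scores only the kept experts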

Future versions may include post-pruning fine-tuning or distillation to recover more performance.


10. Community & Contribution

Let’s grow this model together as a community.

You are encouraged to:

  • Run benchmarks and publish results

  • Contribute scripts for:

    • Further pruning experiments
    • Quantization (e.g., GGUF, AWQ, GPTQ)
    • Long-context or domain-specific fine-tuning
  • Report issues or findings about failure modes, biases, or surprising behaviors


11. License

  • Model & code (this repository): Apache License 2.0
  • Use of the original Qwen3-VL-30B model, and any downstream use of this variant, must also respect the respective licenses and usage terms.

12. Acknowledgements

  • Qwen team for building the Qwen3-VL family of models.
  • Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
  • OpenMOSE community for experimentation, engineering, and calibration data generation.

2025 OpenMOSE
