P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
Paper | Code | Project Page | HiPhO Leaderboard
Flagship vision-language model achieving No. 3 performance in physics reasoning
Model Description
P1-VL-235B-A22B is the flagship variant of the P1-VL series of high-performance open-source vision-language models specialized in physics reasoning. It was introduced in P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads.
Built on Qwen3-VL-235B-A22B-Thinking and refined through multi-stage reinforcement learning on curated physics competition data, P1-VL-235B-A22B is the first open-source Vision-Language Model (VLM) to secure 12 gold medals on HiPhO, ranking No. 3 on the model leaderboard. The model excels at tasks that require precise diagram-to-logic alignment, demonstrating strong performance across physics Olympiad competitions.
Key Highlights
- HiPhO Excellence: First open-source VLM to secure 12 gold medals, ranking No. 3 globally. When augmented with PhysicsMinions, P1-VL-235B-A22B reaches No. 2.
- IPhO 2025 Gold-tier Performance: Achieves gold-medal performance on the 2025 International Physics Olympiad
- FrontierScience-Olympiad: Total score of 64.3/100, outperforming its text-only sibling (P1-235B-A22B) by 2.3 points. When augmented with PhysicsMinions, it secures state-of-the-art performance among all evaluated open-source models
- STEM Generalization: Consistent improvements over the base model across math and multimodal benchmarks
Performance Benchmarks
HiPhO Results
Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024–2025, P1-VL-235B-A22B demonstrates top-tier physics reasoning capabilities.
| Model | Ranking | Gold Medals | Notes |
|---|---|---|---|
| P1-VL-235B-A22B | No. 3 | 12 | First open-source VLM with 12 gold medals |
| P1-VL-235B-A22B + PhysicsMinions | No. 2 | 12 | Trailing only Gemini-3-Pro globally |
FrontierScience-Olympiad Benchmark
P1-VL-235B-A22B achieves significant gains over its base counterpart across all three scientific domains. Remarkably, even on this predominantly text-based benchmark, the multimodal P1-VL-235B-A22B outperforms its text-only sibling (P1-235B-A22B) by 2.3 points. Domain scores below are percentages; the /10, /40, /50 in the headers are each domain's weight in the 100-point total.
| Model | Biology (/10) | Chemistry (/40) | Physics (/50) | Total (/100) |
|---|---|---|---|---|
| P1-VL-235B-A22B+PhysicsMinions | 26.3 | 77.2 | 67.3 | 67.1 |
| P1-VL-235B-A22B | 30.0 | 71.3 | 65.5 | 64.3 |
| P1-235B-A22B+PhysicsMinions | 30.0 | 71.0 | 68.0 | 65.4 |
| P1-235B-A22B | 22.5 | 67.2 | 65.8 | 62.0 |
| Qwen3-VL-235B-A22B-Thinking | 26.3 | 61.9 | 57.8 | 56.3 |
| Qwen3-235B-A22B-Thinking-2507 | 26.3 | 58.1 | 57.3 | 54.5 |
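As an unofficial sanity check on the scoring scheme, the sketch below recomputes the totals for three rows of the table. The interpretation (inferred from the column headers, not from a published evaluation script) is that domain scores are percentages and /10, /40, /50 are the domains' weights in the 100-point total; this reproduces every total above to within rounding.

```python
# Recompute the weighted FrontierScience-Olympiad totals from the table above.
# Assumption: domain scores are percentages, and /10, /40, /50 are weights
# summing to 100 (this matches every total in the table up to rounding).
WEIGHTS = {"biology": 0.10, "chemistry": 0.40, "physics": 0.50}

scores = {
    "P1-VL-235B-A22B+PhysicsMinions": {"biology": 26.3, "chemistry": 77.2, "physics": 67.3},
    "P1-VL-235B-A22B":                {"biology": 30.0, "chemistry": 71.3, "physics": 65.5},
    "Qwen3-VL-235B-A22B-Thinking":    {"biology": 26.3, "chemistry": 61.9, "physics": 57.8},
}

for model, s in scores.items():
    total = sum(WEIGHTS[d] * s[d] for d in WEIGHTS)
    print(f"{model}: {total:.1f}")
# P1-VL-235B-A22B+PhysicsMinions: 67.2  (table reports 67.1; rounding of inputs)
# P1-VL-235B-A22B: 64.3
# Qwen3-VL-235B-A22B-Thinking: 56.3
```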
STEM Benchmarks
Beyond physics reasoning, P1-VL-235B-A22B demonstrates strong generalization across multiple domains, consistently outperforming its base model Qwen3-VL-235B-A22B-Thinking on both text-only and multimodal benchmarks.
| Benchmark | P1-VL-235B-A22B | Qwen3-VL-235B-A22B-Thinking |
|---|---|---|
| AIME24 | 93.8 | 93.3 |
| AIME25 | 92.1 | 90.8 |
| HMMT-Feb | 83.3 | 72.9 |
| HMMT-Nov | 88.3 | 84.2 |
| IMO-Answerbench | 70.6 | 62.3 |
| AMOBench | 47.5 | 39.0 |
| BeyondAIME | 70.6 | 68.5 |
| Brumo | 93.3 | 90.0 |
| CMICC | 83.1 | 81.6 |
| GPQA | 81.4 | 77.1 |
| LiveBench | 79.9 | 79.4 |
| HLE | 15.9 | 13.9 |
| MMMU | 78.0 | 77.2 |
| MMMU-Pro | 70.2 | 69.7 |
| EMMA-Mini | 71.3 | 69.6 |
| MathVista-Mini | 83.9 | 82.6 |
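To make the "consistent improvements" claim concrete, here is a small, unofficial sketch that recomputes the per-benchmark margins directly from the table above (all scores copied verbatim; this is not part of any official evaluation):

```python
# Per-benchmark improvement of P1-VL-235B-A22B over Qwen3-VL-235B-A22B-Thinking,
# taken from the STEM table above as (P1-VL score, base score) pairs.
results = {
    "AIME24": (93.8, 93.3), "AIME25": (92.1, 90.8), "HMMT-Feb": (83.3, 72.9),
    "HMMT-Nov": (88.3, 84.2), "IMO-Answerbench": (70.6, 62.3), "AMOBench": (47.5, 39.0),
    "BeyondAIME": (70.6, 68.5), "Brumo": (93.3, 90.0), "CMICC": (83.1, 81.6),
    "GPQA": (81.4, 77.1), "LiveBench": (79.9, 79.4), "HLE": (15.9, 13.9),
    "MMMU": (78.0, 77.2), "MMMU-Pro": (70.2, 69.7), "EMMA-Mini": (71.3, 69.6),
    "MathVista-Mini": (83.9, 82.6),
}

# Every delta is positive, i.e. P1-VL improves on all 16 benchmarks.
deltas = {name: p1 - base for name, (p1, base) in results.items()}
for name, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: +{d:.1f}")
print(f"mean improvement: +{sum(deltas.values()) / len(deltas):.1f}")
```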
Usage
```python
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
from PIL import Image

model_name = "PRIME-RL/P1-VL-235B-A22B"

# Load model and processor
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    model_name, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load diagram image
image = Image.open("physics_diagram.png")

# Physics problem with visual input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {
                "type": "text",
                "text": """Analyze this physics diagram and solve the problem:
A block of mass m is placed on an inclined plane with angle θ.
The coefficient of kinetic friction is μ.
Calculate the acceleration of the block down the incline.""",
            },
        ],
    }
]

# Prepare inputs: the chat template tokenizes the text and preprocesses the image
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# For this prompt, the expected final answer is a = g(sin θ − μ cos θ)
```
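For higher-throughput serving, the model can also be deployed with vLLM (the inference stack credited in the acknowledgements below). The following is a minimal sketch, not an official recipe: it assumes a vLLM version with Qwen3-VL support, a server started along the lines of `vllm serve PRIME-RL/P1-VL-235B-A22B` (parallelism flags depend on your hardware), and uses the standard OpenAI-compatible chat API with a base64-encoded image.

```python
# Minimal sketch: query P1-VL served via vLLM's OpenAI-compatible endpoint.
# Assumes a server started with, e.g.:
#   vllm serve PRIME-RL/P1-VL-235B-A22B --tensor-parallel-size 8
# (illustrative; adjust to your hardware and vLLM version)
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the local diagram as a data URL for the chat completions API
with open("physics_diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="PRIME-RL/P1-VL-235B-A22B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Analyze this physics diagram and solve the problem."},
            ],
        }
    ],
    max_tokens=8192,
)
print(response.choices[0].message.content)
```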
Acknowledgements
We are grateful to the open-source community for their invaluable contributions. Special thanks to:
- Qwen3-VL - for providing the foundational base models that powered our research
- verl - for the versatile reinforcement learning framework that enabled our training pipeline
- vLLM - for the efficient LLM serving and inference infrastructure
- Megatron-LM - for the large-scale model training framework
Citation
```bibtex
@misc{p1vl2025,
  title={P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads},
  author={P1 Team},
  year={2026},
  url={https://arxiv.org/abs/2602.09443}
}
```