P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

πŸ“œ Paper | πŸ’» Code | 🌐 Project Page | πŸ† HiPhO Leaderboard

Flagship vision-language model, ranked No.3 on the HiPhO physics-reasoning leaderboard

Model Description

P1-VL-235B-A22B is the flagship variant of the P1-VL series of high-performance open-source vision-language models specialized in physics reasoning. It was introduced in P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads.

Built on Qwen3-VL-235B-A22B-Thinking and refined through multi-stage reinforcement learning on curated physics-competition data, P1-VL-235B-A22B is the first open-source Vision-Language Model (VLM) to secure 12 gold medals on HiPhO, ranking No.3 on the model leaderboard. The model excels at tasks that require precise diagram-to-logic alignment, delivering top-tier performance on physics Olympiad problems.

Key Highlights

  • πŸ₯‡ HiPhO Excellence: First open-source VLM to secure 12 gold medals, ranking No.3 globally; when augmented with PhysicsMinions, P1-VL-235B-A22B rises to No.2.
  • πŸ† IPhO 2025 Gold-tier Performance: Scores at the gold-medal level on the 2025 International Physics Olympiad.
  • πŸ“Š FrontierScience-Olympiad: Total score of 64.3/100, outperforming its text-only sibling (P1-235B-A22B) by 2.3 points; when augmented with PhysicsMinions, it achieves state-of-the-art performance among all evaluated open-source models.
  • 🎯 STEM Generalization: Consistent improvements over the base model across math and multimodal benchmarks.

Performance Benchmarks

HiPhO Results

Evaluated on HiPhO, a rigorous benchmark of 13 physics Olympiad exams from 2024–2025, P1-VL-235B-A22B demonstrates top-tier physics reasoning capabilities.

| Model | Ranking | Gold Medals | Performance |
|---|---|---|---|
| P1-VL-235B-A22B | No. 3 | 12 πŸ₯‡ | First open-source VLM with 12 gold medals |
| P1-VL-235B-A22B + PhysicsMinions | No. 2 | 12 πŸ₯‡ | Trailing only Gemini-3-Pro globally |

FrontierScience-Olympiad Benchmark

P1-VL-235B-A22B achieves significant gains over its base counterpart across all three scientific domains. Remarkably, even on this predominantly text-based benchmark, the multimodal P1-VL-235B-A22B outperforms its text-only sibling (P1-235B-A22B) by a margin of 2.3 points.

| Model | Biology (wt. 10) | Chemistry (wt. 40) | Physics (wt. 50) | Total (/100) |
|---|---|---|---|---|
| P1-VL-235B-A22B + PhysicsMinions | 26.3 | 77.2 | 67.3 | 67.1 |
| P1-VL-235B-A22B | 30.0 | 71.3 | 65.5 | 64.3 |
| P1-235B-A22B + PhysicsMinions | 30.0 | 71.0 | 68.0 | 65.4 |
| P1-235B-A22B | 22.5 | 67.2 | 65.8 | 62.0 |
| Qwen3-VL-235B-A22B-Thinking | 26.3 | 61.9 | 57.8 | 56.3 |
| Qwen3-235B-A22B-Thinking-2507 | 26.3 | 58.1 | 57.3 | 54.5 |

Per-domain scores are on a 0–100 scale; the total is the weighted average with weights 10/40/50 (e.g., 0.10 Γ— 30.0 + 0.40 Γ— 71.3 + 0.50 Γ— 65.5 β‰ˆ 64.3).

STEM Benchmarks

Beyond physics reasoning, P1-VL-235B-A22B demonstrates strong generalization across multiple domains, consistently outperforming its base model Qwen3-VL-235B-A22B-Thinking on both text-only and multimodal benchmarks.

| Benchmark | P1-VL-235B-A22B | Qwen3-VL-235B-A22B-Thinking |
|---|---|---|
| AIME24 | 93.8 | 93.3 |
| AIME25 | 92.1 | 90.8 |
| HMMT-Feb | 83.3 | 72.9 |
| HMMT-Nov | 88.3 | 84.2 |
| IMO-Answerbench | 70.6 | 62.3 |
| AMOBench | 47.5 | 39.0 |
| BeyondAIME | 70.6 | 68.5 |
| Brumo | 93.3 | 90.0 |
| CMICC | 83.1 | 81.6 |
| GPQA | 81.4 | 77.1 |
| LiveBench | 79.9 | 79.4 |
| HLE | 15.9 | 13.9 |
| MMMU | 78.0 | 77.2 |
| MMMU-Pro | 70.2 | 69.7 |
| EMMA-Mini | 71.3 | 69.6 |
| MathVista-Mini | 83.9 | 82.6 |

Usage

from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
from PIL import Image

model_name = "PRIME-RL/P1-VL-235B-A22B"

# Load model and processor
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    model_name, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load diagram image
image = Image.open("physics_diagram.png")

# Physics problem with visual input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {
                "type": "text",
                "text": """Analyze this physics diagram and solve the problem:

A block of mass m is placed on an inclined plane with angle ΞΈ.
The coefficient of kinetic friction is ΞΌ.
Calculate the acceleration of the block down the incline.""",
            },
        ],
    }
]

# Prepare inputs and move them to the model's device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Inference: generate the model's reasoning and final answer
generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
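
As a quick sanity check on the generated solution: for this standard setup the closed-form answer is a = g(sin ΞΈ βˆ’ ΞΌ cos ΞΈ), valid while the block is sliding down the incline (i.e., sin ΞΈ > ΞΌ cos ΞΈ), so the model's final expression can be compared directly against it.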

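For large-scale or multi-GPU deployment, the checkpoint can also be served through vLLM's OpenAI-compatible API. The following is a minimal sketch, not an official recipe: the tensor-parallel size is an assumption and must be adapted to your hardware.

# Launch an OpenAI-compatible server first (tensor-parallel size is illustrative):
#   vllm serve PRIME-RL/P1-VL-235B-A22B --tensor-parallel-size 8

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="PRIME-RL/P1-VL-235B-A22B",
    messages=[
        {
            "role": "user",
            "content": "A ball is thrown straight up at 20 m/s. "
            "How high does it rise? Take g = 10 m/s^2.",
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
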
πŸ™ Acknowledgements

We are grateful to the open-source community for their invaluable contributions. Special thanks to:

  • Qwen3-VL - for providing the foundational base models that powered our research
  • verl - for the versatile reinforcement learning framework that enabled our training pipeline
  • vLLM - for the efficient LLM serving and inference infrastructure
  • Megatron-LM - for the large-scale model training framework

Citation

@misc{p1vl2025,
  title={P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads},
  author={P1 Team},
  year={2026},
  url={https://arxiv.org/abs/2602.09443}
}