
OneThinker: All-in-one Reasoning Model for Image and Video

This repository contains the OneThinker-8B model presented in: OneThinker: All-in-one Reasoning Model for Image and Video

Because OneThinker-8B shares the same architecture as Qwen3-VL-8B, it can be run with the same inference tooling.

For inference, please refer to:

Code: https://github.com/tulerfeng/OneThinker

👀 About OneThinker

OneThinker teaser

We introduce OneThinker, an all-in-one multimodal reasoning generalist that is capable of thinking across a wide range of fundamental visual tasks within a single model.

We construct the large-scale OneThinker-600k multi-task training corpus and build OneThinker-SFT-340k with high-quality CoT annotations for cold-start SFT. Moreover, we propose EMA-GRPO, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of the reward standard deviation.
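The description above does not spell out the EMA-GRPO update rule, so the following is only a minimal sketch of the general idea: keep an exponential moving average of the reward standard deviation per task and use it to scale group-centered rewards. The class name `TaskwiseEmaStd`, the `decay` value, and the exact normalization are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


class TaskwiseEmaStd:
    """Illustrative tracker of a per-task moving average of reward std."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay   # hypothetical smoothing factor (assumption)
        self.eps = eps       # guards against division by zero
        self.ema_std = {}    # task name -> running std estimate

    def update(self, task, rewards):
        """Fold the std of one group of rewards into the task's moving average."""
        batch_std = pstdev(rewards)
        if task not in self.ema_std:
            self.ema_std[task] = batch_std
        else:
            self.ema_std[task] = (
                self.decay * self.ema_std[task] + (1 - self.decay) * batch_std
            )
        return self.ema_std[task]

    def advantages(self, task, rewards):
        """Center rewards on the group mean and scale by the task's EMA std."""
        scale = self.update(task, rewards) + self.eps
        center = mean(rewards)
        return [(r - center) / scale for r in rewards]


tracker = TaskwiseEmaStd(decay=0.5)
qa_adv = tracker.advantages("qa", [0.0, 1.0])           # binary-style rewards
iou_adv = tracker.advantages("tracking", [0.42, 0.48])  # dense, low-variance rewards
```

Because each task keeps its own scale, a task with small reward spread (such as an IoU-style score) is not normalized by the spread of a task with binary rewards.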

OneThinker demonstrates strong performance on 31 benchmarks across 10 fundamental vision tasks, while showing cross-task knowledge transfer and promising zero-shot generalization toward a unified multimodal reasoning generalist.

All code, models, and data are fully released.

🏆 Performance

Our model, trained from Qwen3-VL-Instruct-8B, obtains significant performance gains across diverse visual tasks. For example, OneThinker-8B reaches 70.6% accuracy on MMMU, 64.3% on MathVerse, 66.2% on VideoMMMU, 93.7 on RefCOCO-testA, and 54.9 J&F on ReasonVOS.


We also observe beneficial cross-task and cross-modality knowledge transfer, along with promising preliminary zero-shot generalization under unified training. This highlights the effectiveness and generalization ability of our unified training framework across diverse visual tasks.

🎥 Demo

Some reasoning examples are as follows.

Demo 1 (QA)

Question:
As shown in the figure, AB is the diameter of ⊙O, and points C and D are on ⊙O. If ∠ABD = 50.0, then the degree of ∠BCD is () Choices: (A) 30° (B) 35° (C) 40° (D) 45°

Thinking Process:
<think> Got it, let's try to solve this problem step by step. First, let's recall some circle theorems.

We know that AB is the diameter of circle O. So, any angle inscribed in a semicircle is a right angle. That means ∠ACB should be 90° because triangle ACB is inscribed in the semicircle with AB as the diameter. Now, let's look at ∠ABD. The problem says ∠ABD is 50°. Let's think about quadrilateral or triangles here. Maybe we can use the inscribed angle theorem, which states that an inscribed angle is half the measure of its intercepted arc. First, let's consider arc AD or arc AB. Since AB is the diameter, the arc AB is 180°. ∠ABD is an inscribed angle that intercepts arc AD. So ∠ABD intercepts arc AD, so the measure of arc AD would be 2 × ∠ABD = 2 × 50° = 100°? Wait, maybe not. Let's check another angle. Alternatively, ∠BAD and ∠BCD might be related. Let's see. ∠BAD is an inscribed angle intercepting arc BD. ∠BCD is also an inscribed angle intercepting arc BD. So they might be equal or supplementary?

Wait, AB is the diameter, so ∠ADB is also a right angle (since angle inscribed in a semicircle). So triangle ADB is a right-angled triangle at D. In triangle ADB, ∠ABD is 50°, so ∠BAD = 90° - 50° = 40°.

Now, ∠BCD and ∠BAD are inscribed angles that intercept the same arc BD. Because ∠BAD intercepts arc BD, and ∠BCD also intercepts arc BD. So they should be equal. Therefore, ∠BCD = ∠BAD = 40°. Let's verify again.

AB is diameter, so ∠ACB = 90°, ∠ADB = 90°. In △ABD, ∠BAD = 90° - 50° = 40°. ∠BCD and ∠BAD both subtend arc BD, so they are equal. Hence, ∠BCD is 40°, which is option C.</think>

<answer>C</answer>
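The demos in this section share one response layout: a reasoning trace inside `<think>` tags followed by a final result inside `<answer>` tags. A minimal sketch of splitting such a response is shown here; the helper name `split_response` is hypothetical and not part of the released code.

```python
import re


def split_response(text):
    """Return (thinking, answer) from a response; each is None if its tags are missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


response = "<think> Triangle ADB is right-angled at D. </think>\n\n<answer>C</answer>"
thinking, answer = split_response(response)
```

Returning `None` for a missing tag lets the caller decide how to handle malformed generations instead of raising inside the parser.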

Demo 2 (Tracking)

Question:
Given the bounding box [537,403,768,703] of the target object in the first frame, track this object in each frame.

Thinking Process:
<think> To track the white Ford Mustang convertible across each second up to 32 seconds, observe its movement pattern: the car travels forward along the road, so its position in the frame shifts subtly over time. Initially, the car occupies a central area of the frame. As seconds pass, the car moves closer to the camera, causing its bounding box to adjust incrementally. Over each second, the car's position shifts slightly toward the lower portion of the frame (as it approaches the camera) and its bounding box expands a bit to account for the closer perspective. Throughout the sequence, the car remains the central focus, with smooth, gradual changes in its placement and the size of the bounding box reflecting its forward motion and proximity to the camera.</think>

<answer>{"boxes": {"1": [537, 403, 768, 703], "2": [537, 353, 797, 690], "3": [532, 323, 827, 687], "4": [519, 282, 850, 690], "5": [515, 260, 877, 694], "6": [510, 246, 887, 694], "7": [509, 246, 897, 703], "8": [509, 232, 947, 710], "9": [509, 210, 972, 732], "10": [482, 200, 989, 787], "11": [469, 196, 995, 856], "12": [459, 203, 995, 929], "13": [422, 221, 995, 987], "14": [397, 244, 995, 994], "15": [389, 250, 995, 994], "16": [382, 250, 995, 994], "17": [370, 250, 995, 994], "18": [364, 250, 995, 994], "19": [362, 250, 995, 994], "20": [354, 244, 995, 994], "21": [350, 239, 995, 994], "22": [347, 239, 995, 994], "23": [347, 244, 965, 979], "24": [350, 244, 945, 944], "25": [350, 244, 904, 910], "26": [350, 244, 845, 860], "27": [347, 244, 797, 779], "28": [347, 244, 772, 750], "29": [350, 250, 764, 742], "30": [350, 250, 757, 739], "31": [350, 250, 750, 739], "32": [350, 250, 750, 739]}}</answer>
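The tracking answer is a JSON object whose `boxes` field maps a time index (as a string) to four integers. A small sketch of reading it into Python follows. It assumes the four values are `[x1, y1, x2, y2]` corner coordinates, which the demo values are consistent with but the text does not state; the function names are hypothetical.

```python
import json


def parse_tracking_answer(answer_body):
    """Convert the JSON inside <answer> into {index: (x1, y1, x2, y2)}."""
    data = json.loads(answer_body)
    boxes = {}
    for index, box in data["boxes"].items():
        if len(box) != 4:
            raise ValueError(f"entry {index} has {len(box)} values, expected 4")
        boxes[int(index)] = tuple(box)
    return boxes


def box_area(box):
    """Area of a box under the assumed [x1, y1, x2, y2] convention."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


# First two entries of the demo answer.
body = '{"boxes": {"1": [537, 403, 768, 703], "2": [537, 353, 797, 690]}}'
boxes = parse_tracking_answer(body)
growth = box_area(boxes[2]) / box_area(boxes[1])  # > 1 as the car approaches
```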

Demo 3 (Segmentation)

Question:
Please segment the shorter one.

Thinking Process:
<think> To identify the shorter individual, focus on the person in the yellow dress. At the start of the video (around 0.0 seconds), this person is clearly visible. For positive reference points, select areas on the yellow dress itself, such as the chest, midsection, and skirt. For negative reference points, choose areas outside the person, like the road or the sky in the background.</think>

<answer>{"time": 0.0, "boxes": [426, 300, 528, 775], "positive_points": [[454, 450], [468, 550], [472, 650]], "negative_points": [[300, 500], [600, 500], [400, 200]]}</answer>
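The segmentation answer bundles a timestamp, one box, and positive and negative reference points. The sketch below loads those fields into a small record and checks that the points sit where the reasoning says they should (positives on the target, negatives off it). The names `SegmentationPrompt`, `parse_segmentation_answer`, and `point_in_box` are hypothetical, and the `[x1, y1, x2, y2]` box convention is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class SegmentationPrompt:
    time: float
    box: tuple
    positive_points: list
    negative_points: list


def parse_segmentation_answer(answer_body):
    """Convert the JSON inside <answer> into a SegmentationPrompt."""
    data = json.loads(answer_body)
    return SegmentationPrompt(
        time=float(data["time"]),
        box=tuple(data["boxes"]),
        positive_points=[tuple(p) for p in data["positive_points"]],
        negative_points=[tuple(p) for p in data["negative_points"]],
    )


def point_in_box(point, box):
    """True if (x, y) lies inside the assumed [x1, y1, x2, y2] box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


body = (
    '{"time": 0.0, "boxes": [426, 300, 528, 775], '
    '"positive_points": [[454, 450], [468, 550], [472, 650]], '
    '"negative_points": [[300, 500], [600, 500], [400, 200]]}'
)
prompt = parse_segmentation_answer(body)
positives_inside = all(point_in_box(p, prompt.box) for p in prompt.positive_points)
negatives_outside = not any(point_in_box(p, prompt.box) for p in prompt.negative_points)
```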

Citations

If you find our work helpful for your research, please consider citing it.

@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
Model size: 9B params · Tensor type: BF16 · Format: Safetensors