Instructions for using invincible-jha/Orsta-32B-0321 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use invincible-jha/Orsta-32B-0321 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="invincible-jha/Orsta-32B-0321")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("invincible-jha/Orsta-32B-0321")
model = AutoModelForImageTextToText.from_pretrained("invincible-jha/Orsta-32B-0321")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
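Before loading the model directly, it can help to estimate how much memory the weights alone will need. The sketch below is a back-of-envelope calculation, assuming the rounded figures shown on this page (about 33B parameters stored in BF16, i.e. 2 bytes per parameter). The helper name `weight_memory_gib` is hypothetical, and the estimate ignores activations, the KV cache, and framework overhead.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # 33e9 is the rounded parameter count listed on the model page (an approximation).
    print(f"{weight_memory_gib(33e9):.1f} GiB")
```

If the weights do not fit on a single device, passing `device_map="auto"` to `from_pretrained` (which requires the `accelerate` package) is one common way to spread them across the available GPUs.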

Local Apps

vLLM

How to use invincible-jha/Orsta-32B-0321 with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "invincible-jha/Orsta-32B-0321"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "invincible-jha/Orsta-32B-0321",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
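The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming a server is already listening on `http://localhost:8000` as in the command above. The helper names `build_payload` and `chat` are hypothetical, and `max_tokens` is an optional field of the OpenAI-style chat completions request that the curl example does not set.

```python
import json
import urllib.request

MODEL_ID = "invincible-jha/Orsta-32B-0321"


def build_payload(prompt: str, image_url: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion request with one text part and one image part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the server and return the first choice's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_payload(
        "Describe this image in one sentence.",
        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    )
    # Print the request body; with a server running, call chat(payload) instead.
    print(json.dumps(payload, indent=2))
```

Changing `base_url` to `http://localhost:30000` should let the same helper talk to a server started on that port, since the request format is the same OpenAI-compatible one.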

Use Docker

```bash
docker model run hf.co/invincible-jha/Orsta-32B-0321
```

SGLang

How to use invincible-jha/Orsta-32B-0321 with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "invincible-jha/Orsta-32B-0321" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "invincible-jha/Orsta-32B-0321",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "invincible-jha/Orsta-32B-0321" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "invincible-jha/Orsta-32B-0321",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Docker Model Runner
How to use invincible-jha/Orsta-32B-0321 with Docker Model Runner:
```
docker model run hf.co/invincible-jha/Orsta-32B-0321
```

One RL to See Them All

🐙 GitHub Repo: MiniMax-AI/One-RL-to-See-Them-All
📜 Paper (arXiv): V-Triune: One RL to See Them All (arXiv:2505.18129)
💾 Dataset: Orsta-Data-47k on Hugging Face

Model Overview

Orsta-32B-0321 is a vision-language model (VLM) built for strong performance on both visual reasoning and visual perception tasks. It is the result of post-training with V-Triune, our unified reinforcement learning (RL) system.

The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0321 has been specifically trained using V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.

Training with V-Triune

Orsta-32B-0321's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:

- Unified RL Framework (V-Triune): V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three complementary core components:
  - Sample-Level Data Formatting (to unify diverse task inputs)
  - Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers)
  - Source-Level Metric Monitoring (to diagnose problems at the data-source level)

  It also incorporates a Dynamic IoU reward mechanism, which is crucial for optimizing visual perception tasks. You can find more details in our paper: V-Triune (arXiv:2505.18129).
- Diverse Joint Task Optimization: Orsta-32B-0321 was jointly optimized on the following eight visual tasks:
  - Visual Reasoning Tasks: Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
  - Visual Perception Tasks: Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.

This comprehensive training allows Orsta-32B-0321 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception.
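To make the idea of an IoU-based reward more concrete, here is a minimal sketch of how a reward for a predicted bounding box could tighten as training progresses. This is an illustration only, not the V-Triune implementation: the schedule values, the thresholding rule, and the names `SCHEDULE`, `threshold_at`, and `dynamic_iou_reward` are all assumptions made for this example. The paper describes the actual mechanism.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2

# Hypothetical schedule: (training progress at which the threshold starts, IoU threshold).
# These numbers are placeholders for illustration, not values from the paper.
SCHEDULE = ((0.0, 0.5), (0.3, 0.75), (0.7, 0.9))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_at(progress: float) -> float:
    """Return the IoU threshold in effect at a training progress in [0, 1]."""
    current = SCHEDULE[0][1]
    for start, threshold in SCHEDULE:
        if progress >= start:
            current = threshold
    return current


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Reward the IoU itself when it clears the current threshold, otherwise 0."""
    score = iou(pred, target)
    return score if score >= threshold_at(progress) else 0.0


if __name__ == "__main__":
    pred, target = (0, 0, 10, 10), (0, 0, 10, 8)
    for progress in (0.0, 0.5, 0.9):
        print(progress, dynamic_iou_reward(pred, target, progress))
```

The design intent in this sketch is that a loose threshold early on still rewards roughly correct boxes, while a stricter threshold later only rewards precise ones.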

Performance

| Model | Knowledge | Mathematics | Perception | Coding | Information Extraction | Planning | Science | Metrics | MEGA-Bench Core |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QwenVL-2.5-32B-0321 | 8.48 | 12.62 | 11.99 | 13.59 | 15.44 | 8.61 | 16.78 | 14.91 | 11.87 |
| MM-Eureka-32B 💡 | 12.20 | 20.19 | 21.88 | 15.86 | 21.23 | 15.47 | 19.95 | 22.77 | 18.57 |
| VL-Rethinker-32B 💡 | 12.16 | 28.09 | 22.99 | 11.89 | 21.50 | 15.09 | 28.10 | 15.73 | 19.41 |
| Orsta-32B-0321 (Ours) 💡 | 21.33 | 28.55 | 32.23 | 19.44 | 26.38 | 17.78 | 33.20 | 24.18 | 25.94 |
| Δ (Ours - Backbone) | +12.9 | +15.9 | +20.2 | +5.9 | +10.9 | +9.2 | +16.4 | +9.3 | +14.1 |
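The Δ row is the per-column difference between the Orsta-32B-0321 row and the QwenVL-2.5-32B-0321 backbone row, rounded to one decimal place. The short script below reproduces it from the scores in the table. It uses `Decimal` with half-up rounding, because binary floating point would round a difference such as 12.85 down instead of up. The variable and function names are chosen for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

COLUMNS = [
    "Knowledge", "Mathematics", "Perception", "Coding", "Information Extraction",
    "Planning", "Science", "Metrics", "MEGA-Bench Core",
]
# Scores copied from the table above, kept as strings so Decimal parses them exactly.
BACKBONE = ["8.48", "12.62", "11.99", "13.59", "15.44", "8.61", "16.78", "14.91", "11.87"]
ORSTA = ["21.33", "28.55", "32.23", "19.44", "26.38", "17.78", "33.20", "24.18", "25.94"]


def deltas(ours, base):
    """Return ours - base per column, rounded half-up to one decimal place."""
    return [
        (Decimal(o) - Decimal(b)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        for o, b in zip(ours, base)
    ]


if __name__ == "__main__":
    for name, delta in zip(COLUMNS, deltas(ORSTA, BACKBONE)):
        print(f"{name}: +{delta}")
```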

How to Use

Orsta-32B-0321 was developed by post-training the Qwen2.5-VL-32B-Instruct (0321 checkpoint) model with our V-Triune reinforcement learning system. This checkpoint is a publicly available baseline known for reliable core reasoning abilities, alongside recognized limitations in perception and output formatting (which have been addressed in subsequent Qwen releases). Applying V-Triune to this baseline shows that its post-training can substantially raise the model's performance by refining and amplifying strengths the model already has.

Consequently, the core usage of Orsta-32B-0321, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series. Users familiar with Qwen2.5-VL models should find the interface intuitive.

For details on the general capabilities of Qwen2.5-VL models, including the multi-turn dialogue format and image input specifics, we recommend the official Qwen2.5-VL series documentation (be sure to consult the information relevant to the 32B Instruct version).
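As an illustration of a multi-turn conversation, the sketch below builds a message list in the same list-of-parts format used in the Transformers snippet on this page. The helper names `user_turn` and `assistant_turn` are hypothetical, and the example assumes the processor's chat template accepts assistant turns written in the same way; the Qwen2.5-VL documentation is the authority on the exact format.

```python
from typing import Optional


def user_turn(text: str, image_url: Optional[str] = None) -> dict:
    """Build a user message, optionally with an image placed before the text."""
    content = []
    if image_url is not None:
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text: str) -> dict:
    """Build an assistant message holding a previous model reply."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


if __name__ == "__main__":
    messages = [
        user_turn(
            "What animal is on the candy?",
            image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
        ),
        # Placeholder: in practice this is the text decoded from the previous generation.
        assistant_turn("<previous model reply>"),
        user_turn("What color is it?"),
    ]
    print(messages)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` in the same way as the single-turn example.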

Citation 🏆

If you use Orsta-32B-0321 or the V-Triune system in your research, please cite our work:

@article{ma2025one,
      title={One RL to See Them All: Visual Triple Unified Reinforcement Learning}, 
      author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
      journal={arXiv preprint arXiv:2505.18129},
      year={2025}
}

Downloads last month: 2

Safetensors
- Model size: 33B params
- Tensor type: BF16

Model tree for invincible-jha/Orsta-32B-0321
- Base model: Qwen/Qwen2.5-VL-32B-Instruct
- Finetuned (67): this model

Paper for invincible-jha/Orsta-32B-0321
- One RL to See Them All: Visual Triple Unified Reinforcement Learning (arXiv:2505.18129), published May 23, 2025