# PAPO Trainer

[![model badge](https://img.shields.io/badge/All_models-PAPO-blue)](https://huggingface.co/models?other=papo,trl)

TRL supports Perception-Aware Policy Optimization (PAPO), as described in the paper [Perception-Aware Policy Optimization for Multimodal Reasoning](https://huggingface.co/papers/2507.06448) by [Zhenhailong Wang](https://huggingface.co/mikewang), Xuehang Guo, Sofia Stoica, [Haiyang Xu](https://huggingface.co/xhyandwyy), Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, and Heng Ji.

The abstract from the paper is the following:

> Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.

## PAPOTrainer[[trl.experimental.papo.PAPOTrainer]]

#### trl.experimental.papo.PAPOTrainer[[trl.experimental.papo.PAPOTrainer]]

[Source](https://github.com/huggingface/trl/blob/main/trl/experimental/papo/papo_trainer.py#L27)

Trainer for Perception-Aware Policy Optimization (PAPO).

PAPO extends GRPO/DAPO for multimodal reasoning by adding an implicit perception loss that encourages the model to
better utilize visual information. The key innovation is computing KL divergence between model outputs on original
vs. corrupted (masked) images.
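
As a rough sketch (illustrative only; the tensor names and shapes here are assumptions, not the trainer's internal implementation), the implicit perception loss can be thought of as a per-token KL divergence over completion tokens:

```python
import torch.nn.functional as F

def implicit_perception_loss(logits_original, logits_masked, completion_mask):
    # Per-token KL(p_original || p_masked); a large divergence means the
    # model's predictions genuinely depend on the visual input.
    # Assumed shapes: logits (batch, seq_len, vocab), mask (batch, seq_len).
    log_p = F.log_softmax(logits_original, dim=-1)
    log_q = F.log_softmax(logits_masked, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # (batch, seq_len)
    # Average over real completion tokens only
    return (kl * completion_mask).sum() / completion_mask.sum()
```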

Two variants are supported:
- PAPO-G: PAPO + GRPO (use `loss_type="grpo"`)
- PAPO-D: PAPO + DAPO (use `loss_type="dapo"`)

Example:

```python
from datasets import load_dataset
from trl.experimental.papo import PAPOTrainer, PAPOConfig

dataset = load_dataset("your-vlm-dataset", split="train")

def reward_func(completions, **kwargs):
    # Toy reward so the example runs end to end; replace with your own
    # scoring function for the multimodal reasoning task
    return [1.0 if len(c) > 0 else 0.0 for c in completions]

# PAPO-G
config = PAPOConfig(
    loss_type="grpo",  # use GRPO as the base objective
    perception_loss_weight=0.1,
    mask_ratio=0.3,
)

# PAPO-D (alternative: use DAPO as the base objective)
# config = PAPOConfig(
#     loss_type="dapo",
#     perception_loss_weight=0.1,
#     mask_ratio=0.3,
# )

trainer = PAPOTrainer(
    model="Qwen/Qwen2-VL-2B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=dataset,
)

trainer.train()
```

**Parameters:**

model (`Union[str, PreTrainedModel]`) : Model to be trained (must be a vision-language model).

reward_funcs (`Union[RewardFunc, list[RewardFunc]]`) : Reward functions for computing rewards (same as GRPO).

args (`PAPOConfig`, *optional*, defaults to `None`) : Configuration for this trainer. If `None`, a default configuration is used.

train_dataset ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [IterableDataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.IterableDataset)) : Dataset to use for training. Must include "prompt" and "image" columns.

eval_dataset : Dataset to use for evaluation; same requirements as `train_dataset`.

processing_class : Processing class (tokenizer/processor) for the model.

reward_processing_classes : Processing classes for reward models.

callbacks : Training callbacks.

optimizers : Optimizer and scheduler tuple.

peft_config : PEFT configuration if using parameter-efficient fine-tuning.

#### train[[trl.experimental.papo.PAPOTrainer.train]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L1325)

Main training entry point.

**Parameters:**

resume_from_checkpoint (`str` or `bool`, *optional*) : If a `str`, local path to a saved checkpoint as saved by a previous instance of `Trainer`. If a `bool` and equal to `True`, load the last checkpoint in *args.output_dir* as saved by a previous instance of `Trainer`. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (`optuna.Trial` or `dict[str, Any]`, *optional*) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (`list[str]`, *optional*) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during training.

**Returns:**

`~trainer_utils.TrainOutput`

Object containing the global step count, training loss, and metrics.
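
For example, to resume an interrupted run from the latest checkpoint in `args.output_dir`:

```python
trainer.train(resume_from_checkpoint=True)
```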
#### save_model[[trl.experimental.papo.PAPOTrainer.save_model]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L3752)

Will save the model, so you can reload it using `from_pretrained()`.

Will only save from the main process.
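
A typical call (the directory name is just an example):

```python
# Saves the model weights and config on the main process only
trainer.save_model("qwen2-vl-2b-papo")
```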
#### push_to_hub[[trl.experimental.papo.PAPOTrainer.push_to_hub]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L3999)

Upload `self.model` and `self.processing_class` to the 🤗 model hub on the repo `self.args.hub_model_id`.

**Parameters:**

commit_message (`str`, *optional*, defaults to `"End of training"`) : Message to commit while pushing.

blocking (`bool`, *optional*, defaults to `True`) : Whether the function should return only when the `git push` has finished.

token (`str`, *optional*, defaults to `None`) : Token with write permission to overwrite Trainer's original args.

revision (`str`, *optional*) : The git revision to commit from. Defaults to the head of the "main" branch.

kwargs (`dict[str, Any]`, *optional*) : Additional keyword arguments passed along to `~Trainer.create_model_card`.

**Returns:**

The URL of the repository where the model was pushed if `blocking=True`, or a `Future` object tracking the
progress of the commit if `blocking=False`.
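
For instance (the commit message is arbitrary):

```python
trainer.push_to_hub(commit_message="PAPO training complete")
```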

## PAPOConfig[[trl.experimental.papo.PAPOConfig]]

#### trl.experimental.papo.PAPOConfig[[trl.experimental.papo.PAPOConfig]]

[Source](https://github.com/huggingface/trl/blob/main/trl/experimental/papo/papo_config.py#L22)

Configuration class for PAPOTrainer.

PAPO (Perception-Aware Policy Optimization) extends GRPO/DAPO for multimodal reasoning by adding an implicit
perception loss and double entropy regularization.

**Parameters:**

perception_loss_weight (`float`, *optional*, defaults to `0.1`) : Weight coefficient (γ in the paper) for the perception loss term, which encourages the model to be sensitive to visual changes.

mask_ratio (`float`, *optional*, defaults to `0.3`) : Ratio of the image to mask when computing the perception loss.

mask_type (`Literal["random", "patch", "grid"]`, *optional*, defaults to `"random"`) : Type of masking strategy to use (see the sketch after this parameter list).

der_loss_weight1 (`float`, *optional*, defaults to `0.03`) : Weight coefficient (η₁ in the paper) for the first Double Entropy Regularization (DER) term. Together, the two DER terms encourage confident predictions on original images (low entropy) and uncertain predictions on masked images (high entropy).

der_loss_weight2 (`float`, *optional*, defaults to `0.03`) : Weight coefficient (η₂ in the paper) for the second Double Entropy Regularization (DER) term (see `der_loss_weight1`).

loss_type (`Literal["grpo", "dapo"]`, inherited from GRPOConfig) : Base loss type to use. Set to "grpo" for PAPO-G or "dapo" for PAPO-D.
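
To make the masking and DER options concrete, here is a hedged sketch; the helper names, the exact patch layout, and the sign convention are assumptions based on the descriptions above, not the trainer's internals:

```python
import torch
import torch.nn.functional as F

def mask_image_patches(pixel_values, mask_ratio=0.3, patch_size=14):
    # Zero out a random fraction of square patches (one plausible reading of
    # mask_type="random"). Assumes height and width divide by patch_size.
    b, c, h, w = pixel_values.shape
    keep = torch.rand(b, h // patch_size, w // patch_size,
                      device=pixel_values.device) >= mask_ratio
    mask = keep.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    return pixel_values * mask.unsqueeze(1).to(pixel_values.dtype)

def double_entropy_regularization(logits_original, logits_masked,
                                  completion_mask, eta1=0.03, eta2=0.03):
    # DER: penalize high entropy on original images and reward high entropy
    # on masked images, weighted by der_loss_weight1 / der_loss_weight2.
    def mean_entropy(logits):
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1)  # (batch, seq_len)
        return (entropy * completion_mask).sum() / completion_mask.sum()

    return eta1 * mean_entropy(logits_original) - eta2 * mean_entropy(logits_masked)
```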

