Lang Feng commited on
Update README and Add FAQ (#173)
Browse files* update readme
* clarify data prepare
- README.md +6 -5
- examples/data_preprocess/prepare.py +5 -0
- examples/gigpo_trainer/run_alfworld_lora.sh +1 -0
- examples/gigpo_trainer/run_blackjack.sh +1 -0
- examples/gigpo_trainer/run_ezpoints.sh +1 -0
- examples/gigpo_trainer/run_numberline.sh +1 -0
- examples/gigpo_trainer/run_sokoban.sh +1 -0
- examples/gigpo_trainer/run_webshop_qwen3.sh +1 -0
- examples/grpo_trainer/run_balckjack.sh +1 -0
- examples/grpo_trainer/run_sokoban.sh +1 -0
- examples/ppo_trainer/run_alfworld.sh +1 -0
- examples/ppo_trainer/run_webshop.sh +1 -0
README.md
CHANGED
|
@@ -31,13 +31,14 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
|
|
| 31 |
`verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
|
| 32 |
|
| 33 |
# News
|
|
|
|
|
|
|
| 34 |
- [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
|
| 35 |
- [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
|
| 36 |
- [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
|
| 37 |
- [2025.07] Add modular memory manager. See [here](./agent_system/memory).
|
| 38 |
- [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
|
| 39 |
-
- [2025.05]
|
| 40 |
-
- [2025.05] Code released.
|
| 41 |
|
| 42 |
# Quick Feature Summary
|
| 43 |
| Feature Category | Supported Capabilities|
|
|
@@ -82,7 +83,7 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
|
|
| 82 |
- [6. GiGPO (dynamic)](#6-gigpo-dynamic)
|
| 83 |
- [LoRA](#lora)
|
| 84 |
- [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
|
| 85 |
-
- [
|
| 86 |
- [1. Customize Memory Module](#1-customize-memory-module)
|
| 87 |
- [2. Data Preparation](#2-data-preparation)
|
| 88 |
- [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
|
|
@@ -446,7 +447,7 @@ We also provide a prompt-based GPT-4o agent.
|
|
| 446 |
bash examples/prompt_agent/run_gpt4o_agent.sh
|
| 447 |
```
|
| 448 |
|
| 449 |
-
#
|
| 450 |
|
| 451 |
## 1. Customize Memory Module
|
| 452 |
`verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
|
|
@@ -454,7 +455,7 @@ bash examples/prompt_agent/run_gpt4o_agent.sh
|
|
| 454 |
Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
|
| 455 |
|
| 456 |
## 2. Data Preparation
|
| 457 |
-
For most environments (e.g., AFLWorld, WebShop), we only use data preparation to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for agent input (including task instruction, observation and prompt), we follow the classical RL pipeline. That means the input of LLM agent comes from the environment's feedback through `env.step()`. In the case of search-r1 experiments where tasks are drawn from a dataset, we leverage the env_kwargs parameter to pass tasks into the environment, using:
|
| 458 |
|
| 459 |
## 3. Customize Your Own Prompts
|
| 460 |
We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:
|
|
|
|
| 31 |
`verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
|
| 32 |
|
| 33 |
# News
|
| 34 |
+
- [2025.09] `GiGPO` is now supported by [ROLL](https://github.com/alibaba/ROLL)! [[Document](https://alibaba.github.io/ROLL/docs/English/UserGuide/agentic/agentic_GiGPO)] [[Train Curves](https://github.com/alibaba/ROLL/issues/173#issuecomment-3332106534)].
|
| 35 |
+
- [2025.09] `verl-agent`-style training pipeline is now supported by [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL)!
|
| 36 |
- [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
|
| 37 |
- [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
|
| 38 |
- [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
|
| 39 |
- [2025.07] Add modular memory manager. See [here](./agent_system/memory).
|
| 40 |
- [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
|
| 41 |
+
- [2025.05] Code released and paper on `GiGPO` released.
|
|
|
|
| 42 |
|
| 43 |
# Quick Feature Summary
|
| 44 |
| Feature Category | Supported Capabilities|
|
|
|
|
| 83 |
- [6. GiGPO (dynamic)](#6-gigpo-dynamic)
|
| 84 |
- [LoRA](#lora)
|
| 85 |
- [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
|
| 86 |
+
- [FAQ](#faq)
|
| 87 |
- [1. Customize Memory Module](#1-customize-memory-module)
|
| 88 |
- [2. Data Preparation](#2-data-preparation)
|
| 89 |
- [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
|
|
|
|
| 447 |
bash examples/prompt_agent/run_gpt4o_agent.sh
|
| 448 |
```
|
| 449 |
|
| 450 |
+
# FAQ
|
| 451 |
|
| 452 |
## 1. Customize Memory Module
|
| 453 |
`verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
|
|
|
|
| 455 |
Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
|
| 456 |
|
| 457 |
## 2. Data Preparation
|
| 458 |
+
For most environments (e.g., AFLWorld, WebShop, Sokoban), we only use data preparation to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for agent input (including task instruction, observation and prompt), we follow the classical RL pipeline. That means the input of LLM agent comes from the environment's feedback through `env.step()`. In the case of search-r1 experiments where tasks are drawn from a dataset, we leverage the [env_kwargs](./examples/data_preprocess/preprocess_search_r1_dataset.py#L90) parameter to pass tasks into the environment, using: [envs.reset(kwargs=gen_batch.non_tensor_batch.pop('env_kwargs', None))](./agent_system/multi_turn_rollout/rollout_loop.py#L301).
|
| 459 |
|
| 460 |
## 3. Customize Your Own Prompts
|
| 461 |
We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:
|
examples/data_preprocess/prepare.py
CHANGED
|
@@ -34,6 +34,11 @@ if __name__ == '__main__':
|
|
| 34 |
args.local_dir = os.path.join(args.local_dir, args.mode)
|
| 35 |
|
| 36 |
data_source = 'hiyouga/geometry3k'
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
|
| 38 |
dataset = datasets.load_dataset(data_source)
|
| 39 |
|
|
|
|
| 34 |
args.local_dir = os.path.join(args.local_dir, args.mode)
|
| 35 |
|
| 36 |
data_source = 'hiyouga/geometry3k'
|
| 37 |
+
"""
|
| 38 |
+
**NOTE**: This is a frequently asked question.
|
| 39 |
+
We do NOT use the data in 'hiyouga/geometry3k', instead we only use it to indicate the modality and the data size.
|
| 40 |
+
See details: https://github.com/langfengQ/verl-agent?tab=readme-ov-file#2-data-preparation
|
| 41 |
+
"""
|
| 42 |
|
| 43 |
dataset = datasets.load_dataset(data_source)
|
| 44 |
|
examples/gigpo_trainer/run_alfworld_lora.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'text' \
|
| 14 |
--train_data_size $train_data_size \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'text' \
|
| 15 |
--train_data_size $train_data_size \
|
examples/gigpo_trainer/run_blackjack.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size $train_data_size \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'visual' \
|
| 15 |
--train_data_size $train_data_size \
|
examples/gigpo_trainer/run_ezpoints.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size ${train_data_size} \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'visual' \
|
| 15 |
--train_data_size ${train_data_size} \
|
examples/gigpo_trainer/run_numberline.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size $train_data_size \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'visual' \
|
| 15 |
--train_data_size $train_data_size \
|
examples/gigpo_trainer/run_sokoban.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size $train_data_size \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'visual' \
|
| 15 |
--train_data_size $train_data_size \
|
examples/gigpo_trainer/run_webshop_qwen3.sh
CHANGED
|
@@ -9,6 +9,7 @@ val_data_size=128
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
|
|
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'text' \
|
| 14 |
--train_data_size $train_data_size \
|
|
|
|
| 9 |
group_size=8
|
| 10 |
mode="mean_norm" # "mean_norm" or "mean_std_norm"
|
| 11 |
|
| 12 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 13 |
python3 -m examples.data_preprocess.prepare \
|
| 14 |
--mode 'text' \
|
| 15 |
--train_data_size $train_data_size \
|
examples/grpo_trainer/run_balckjack.sh
CHANGED
|
@@ -8,6 +8,7 @@ train_data_size=32
|
|
| 8 |
val_data_size=128
|
| 9 |
group_size=8
|
| 10 |
|
|
|
|
| 11 |
python3 -m examples.data_preprocess.prepare \
|
| 12 |
--mode 'visual' \
|
| 13 |
--train_data_size $train_data_size \
|
|
|
|
| 8 |
val_data_size=128
|
| 9 |
group_size=8
|
| 10 |
|
| 11 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size $train_data_size \
|
examples/grpo_trainer/run_sokoban.sh
CHANGED
|
@@ -8,6 +8,7 @@ train_data_size=32
|
|
| 8 |
val_data_size=128
|
| 9 |
group_size=8
|
| 10 |
|
|
|
|
| 11 |
python3 -m examples.data_preprocess.prepare \
|
| 12 |
--mode 'visual' \
|
| 13 |
--train_data_size $train_data_size \
|
|
|
|
| 8 |
val_data_size=128
|
| 9 |
group_size=8
|
| 10 |
|
| 11 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 12 |
python3 -m examples.data_preprocess.prepare \
|
| 13 |
--mode 'visual' \
|
| 14 |
--train_data_size $train_data_size \
|
examples/ppo_trainer/run_alfworld.sh
CHANGED
|
@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
|
|
| 7 |
train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
|
| 8 |
val_data_size=128
|
| 9 |
|
|
|
|
| 10 |
python3 -m examples.data_preprocess.prepare \
|
| 11 |
--mode 'text' \
|
| 12 |
--train_data_size $train_data_size \
|
|
|
|
| 7 |
train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
|
| 8 |
val_data_size=128
|
| 9 |
|
| 10 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 11 |
python3 -m examples.data_preprocess.prepare \
|
| 12 |
--mode 'text' \
|
| 13 |
--train_data_size $train_data_size \
|
examples/ppo_trainer/run_webshop.sh
CHANGED
|
@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
|
|
| 7 |
train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
|
| 8 |
val_data_size=128
|
| 9 |
|
|
|
|
| 10 |
python3 -m examples.data_preprocess.prepare \
|
| 11 |
--mode 'text' \
|
| 12 |
--train_data_size $train_data_size \
|
|
|
|
| 7 |
train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
|
| 8 |
val_data_size=128
|
| 9 |
|
| 10 |
+
# We only use data preparation to indicate the modality and the data size.
|
| 11 |
python3 -m examples.data_preprocess.prepare \
|
| 12 |
--mode 'text' \
|
| 13 |
--train_data_size $train_data_size \
|