Lang Feng commited on
Commit
be05a67
·
unverified ·
1 Parent(s): 220e0ab

Update README and Add FAQ (#173)

Browse files

* update readme

* clarify data prepare

README.md CHANGED
@@ -31,13 +31,14 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
31
  `verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
32
 
33
  # News
 
 
34
  - [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
35
  - [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
36
  - [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
37
  - [2025.07] Add modular memory manager. See [here](./agent_system/memory).
38
  - [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
39
- - [2025.05] Our paper on GiGPO released. See [link](https://arxiv.org/abs/2505.10978).
40
- - [2025.05] Code released.
41
 
42
  # Quick Feature Summary
43
  | Feature Category | Supported Capabilities|
@@ -82,7 +83,7 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
82
  - [6. GiGPO (dynamic)](#6-gigpo-dynamic)
83
  - [LoRA](#lora)
84
  - [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
85
- - [Tips](#tips)
86
  - [1. Customize Memory Module](#1-customize-memory-module)
87
  - [2. Data Preparation](#2-data-preparation)
88
  - [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
@@ -446,7 +447,7 @@ We also provide a prompt-based GPT-4o agent.
446
  bash examples/prompt_agent/run_gpt4o_agent.sh
447
  ```
448
 
449
- # Tips
450
 
451
  ## 1. Customize Memory Module
452
  `verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
@@ -454,7 +455,7 @@ bash examples/prompt_agent/run_gpt4o_agent.sh
454
  Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
455
 
456
  ## 2. Data Preparation
457
- For most environments (e.g., AFLWorld, WebShop), we only use data preparation to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for agent input (including task instruction, observation and prompt), we follow the classical RL pipeline. That means the input of LLM agent comes from the environment's feedback through `env.step()`. In the case of search-r1 experiments where tasks are drawn from a dataset, we leverage the env_kwargs parameter to pass tasks into the environment, using: `envs.reset(kwargs=gen_batch.non_tensor_batch.pop('env_kwargs', None))`.
458
 
459
  ## 3. Customize Your Own Prompts
460
  We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:
 
31
  `verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
32
 
33
  # News
34
+ - [2025.09] `GiGPO` is now supported by [ROLL](https://github.com/alibaba/ROLL)! [[Document](https://alibaba.github.io/ROLL/docs/English/UserGuide/agentic/agentic_GiGPO)] [[Train Curves](https://github.com/alibaba/ROLL/issues/173#issuecomment-3332106534)].
35
+ - [2025.09] `verl-agent`-style training pipeline is now supported by [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL)!
36
  - [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
37
  - [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
38
  - [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
39
  - [2025.07] Add modular memory manager. See [here](./agent_system/memory).
40
  - [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
41
+ - [2025.05] Code released and paper on `GiGPO` released.
 
42
 
43
  # Quick Feature Summary
44
  | Feature Category | Supported Capabilities|
 
83
  - [6. GiGPO (dynamic)](#6-gigpo-dynamic)
84
  - [LoRA](#lora)
85
  - [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
86
+ - [FAQ](#faq)
87
  - [1. Customize Memory Module](#1-customize-memory-module)
88
  - [2. Data Preparation](#2-data-preparation)
89
  - [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
 
447
  bash examples/prompt_agent/run_gpt4o_agent.sh
448
  ```
449
 
450
+ # FAQ
451
 
452
  ## 1. Customize Memory Module
453
  `verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
 
455
  Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
456
 
457
  ## 2. Data Preparation
458
+ For most environments (e.g., AFLWorld, WebShop, Sokoban), we only use data preparation to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for agent input (including task instruction, observation and prompt), we follow the classical RL pipeline. That means the input of LLM agent comes from the environment's feedback through `env.step()`. In the case of search-r1 experiments where tasks are drawn from a dataset, we leverage the [env_kwargs](./examples/data_preprocess/preprocess_search_r1_dataset.py#L90) parameter to pass tasks into the environment, using: [envs.reset(kwargs=gen_batch.non_tensor_batch.pop('env_kwargs', None))](./agent_system/multi_turn_rollout/rollout_loop.py#L301).
459
 
460
  ## 3. Customize Your Own Prompts
461
  We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:
examples/data_preprocess/prepare.py CHANGED
@@ -34,6 +34,11 @@ if __name__ == '__main__':
34
  args.local_dir = os.path.join(args.local_dir, args.mode)
35
 
36
  data_source = 'hiyouga/geometry3k'
 
 
 
 
 
37
 
38
  dataset = datasets.load_dataset(data_source)
39
 
 
34
  args.local_dir = os.path.join(args.local_dir, args.mode)
35
 
36
  data_source = 'hiyouga/geometry3k'
37
+ """
38
+ **NOTE**: This is a frequently asked question.
39
+ We do NOT use the data in 'hiyouga/geometry3k', instead we only use it to indicate the modality and the data size.
40
+ See details: https://github.com/langfengQ/verl-agent?tab=readme-ov-file#2-data-preparation
41
+ """
42
 
43
  dataset = datasets.load_dataset(data_source)
44
 
examples/gigpo_trainer/run_alfworld_lora.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'text' \
14
  --train_data_size $train_data_size \
 
9
  group_size=8
10
  mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'text' \
15
  --train_data_size $train_data_size \
examples/gigpo_trainer/run_blackjack.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size $train_data_size \
 
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'visual' \
15
  --train_data_size $train_data_size \
examples/gigpo_trainer/run_ezpoints.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size ${train_data_size} \
 
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'visual' \
15
  --train_data_size ${train_data_size} \
examples/gigpo_trainer/run_numberline.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size $train_data_size \
 
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'visual' \
15
  --train_data_size $train_data_size \
examples/gigpo_trainer/run_sokoban.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size $train_data_size \
 
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'visual' \
15
  --train_data_size $train_data_size \
examples/gigpo_trainer/run_webshop_qwen3.sh CHANGED
@@ -9,6 +9,7 @@ val_data_size=128
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
 
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'text' \
14
  --train_data_size $train_data_size \
 
9
  group_size=8
10
  mode="mean_norm" # "mean_norm" or "mean_std_norm"
11
 
12
+ # We only use data preparation to indicate the modality and the data size.
13
  python3 -m examples.data_preprocess.prepare \
14
  --mode 'text' \
15
  --train_data_size $train_data_size \
examples/grpo_trainer/run_balckjack.sh CHANGED
@@ -8,6 +8,7 @@ train_data_size=32
8
  val_data_size=128
9
  group_size=8
10
 
 
11
  python3 -m examples.data_preprocess.prepare \
12
  --mode 'visual' \
13
  --train_data_size $train_data_size \
 
8
  val_data_size=128
9
  group_size=8
10
 
11
+ # We only use data preparation to indicate the modality and the data size.
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size $train_data_size \
examples/grpo_trainer/run_sokoban.sh CHANGED
@@ -8,6 +8,7 @@ train_data_size=32
8
  val_data_size=128
9
  group_size=8
10
 
 
11
  python3 -m examples.data_preprocess.prepare \
12
  --mode 'visual' \
13
  --train_data_size $train_data_size \
 
8
  val_data_size=128
9
  group_size=8
10
 
11
+ # We only use data preparation to indicate the modality and the data size.
12
  python3 -m examples.data_preprocess.prepare \
13
  --mode 'visual' \
14
  --train_data_size $train_data_size \
examples/ppo_trainer/run_alfworld.sh CHANGED
@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
7
  train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
8
  val_data_size=128
9
 
 
10
  python3 -m examples.data_preprocess.prepare \
11
  --mode 'text' \
12
  --train_data_size $train_data_size \
 
7
  train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
8
  val_data_size=128
9
 
10
+ # We only use data preparation to indicate the modality and the data size.
11
  python3 -m examples.data_preprocess.prepare \
12
  --mode 'text' \
13
  --train_data_size $train_data_size \
examples/ppo_trainer/run_webshop.sh CHANGED
@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
7
  train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
8
  val_data_size=128
9
 
 
10
  python3 -m examples.data_preprocess.prepare \
11
  --mode 'text' \
12
  --train_data_size $train_data_size \
 
7
  train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
8
  val_data_size=128
9
 
10
+ # We only use data preparation to indicate the modality and the data size.
11
  python3 -m examples.data_preprocess.prepare \
12
  --mode 'text' \
13
  --train_data_size $train_data_size \