Update README and Add FAQ (#173)

* update readme

* clarify data prepare

Files changed (12)

README.md +6 -5
examples/data_preprocess/prepare.py +5 -0
examples/gigpo_trainer/run_alfworld_lora.sh +1 -0
examples/gigpo_trainer/run_blackjack.sh +1 -0
examples/gigpo_trainer/run_ezpoints.sh +1 -0
examples/gigpo_trainer/run_numberline.sh +1 -0
examples/gigpo_trainer/run_sokoban.sh +1 -0
examples/gigpo_trainer/run_webshop_qwen3.sh +1 -0
examples/grpo_trainer/run_balckjack.sh +1 -0
examples/grpo_trainer/run_sokoban.sh +1 -0
examples/ppo_trainer/run_alfworld.sh +1 -0
examples/ppo_trainer/run_webshop.sh +1 -0

README.md

@@ -31,13 +31,14 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
 `verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
 # News
 - [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
 - [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
 - [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
 - [2025.07] Add modular memory manager. See [here](./agent_system/memory).
 - [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
-- [2025.05] Our paper on GiGPO released. See [link](https://arxiv.org/abs/2505.10978).
-- [2025.05] Code released.
 # Quick Feature Summary
 | Feature Category | Supported Capabilities|
@@ -82,7 +83,7 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
     - [6. GiGPO (dynamic)](#6-gigpo-dynamic)
   - [LoRA](#lora)
   - [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
-- [Tips](#tips)
   - [1. Customize Memory Module](#1-customize-memory-module)
   - [2. Data Preparation](#2-data-preparation)
   - [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
@@ -446,7 +447,7 @@ We also provide a prompt-based GPT-4o agent.
 bash examples/prompt_agent/run_gpt4o_agent.sh
 ```
-# Tips
 ## 1. Customize Memory Module
 `verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
@@ -454,7 +455,7 @@ bash examples/prompt_agent/run_gpt4o_agent.sh
 Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
 ## 2. Data Preparation
-For most environments (e.g., ALFWorld, WebShop), we only use data preparation to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for agent input (including task instruction, observation and prompt), we follow the classical RL pipeline. That means the input of LLM agent comes from the environment's feedback through `env.step()`. In the case of search-r1 experiments where tasks are drawn from a dataset, we leverage the env_kwargs parameter to pass tasks into the environment, using: `envs.reset(kwargs=gen_batch.non_tensor_batch.pop('env_kwargs', None))`.
 ## 3. Customize Your Own Prompts
 We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:

 `verl-agent` provides a **diverse set of RL algorithms** (including our new algorithm GiGPO) and a **rich suite of agent environments**, enabling the development of reasoning agents in both visual and text-based tasks.
 # News
+- [2025.09] `GiGPO` is now supported by [ROLL](https://github.com/alibaba/ROLL)! [[Document](https://alibaba.github.io/ROLL/docs/English/UserGuide/agentic/agentic_GiGPO)] [[Train Curves](https://github.com/alibaba/ROLL/issues/173#issuecomment-3332106534)].
+- [2025.09] `verl-agent`-style training pipeline is now supported by [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL)!
 - [2025.09] [GiGPO](https://arxiv.org/abs/2505.10978) accepted at [NeurIPS 2025](https://neurips.cc/)! 🎉🎉🎉
 - [2025.08] Add **Search-R1 experiments** and **similarity-based GiGPO**! Check out GiGPO's superior performance in Search-R1 experiments [here](#results).
 - [2025.07] `GiGPO` & `verl-agent` talks at [Agent for SWE meetup](https://lu.ma/e498qhsi) by LF AI & Data Singapore on 7/11.
 - [2025.07] Add modular memory manager. See [here](./agent_system/memory).
 - [2025.06] ***Major update***: Merge all features from the latest [veRL](https://github.com/volcengine/verl). For example, `verl-agent` now supports Qwen3, LoRA, REINFORCE++, and more. Feel free to explore!
+- [2025.05] Code released and paper on `GiGPO` released.
 # Quick Feature Summary
 | Feature Category | Supported Capabilities|
     - [6. GiGPO (dynamic)](#6-gigpo-dynamic)
   - [LoRA](#lora)
   - [Prompt-based Agent with GPT-4o](#prompt-based-agent-with-gpt-4o)
+- [FAQ](#faq)
   - [1. Customize Memory Module](#1-customize-memory-module)
   - [2. Data Preparation](#2-data-preparation)
   - [3. Customize Your Own Prompts](#3-customize-your-own-prompts)
 bash examples/prompt_agent/run_gpt4o_agent.sh
 ```
+# FAQ
 ## 1. Customize Memory Module
 `verl-agent` supports a customizable and flexible memory system for managing and formatting interaction history between the agent and the environment. We provide a [SimpleMemory](./agent_system/memory/memory.py) implementation as a default starting point. This memory module is invoked within [env_manager.py](./agent_system/environments/env_manager.py) (i.e., `build_text_obs()`) to construct the observation at each step.
 Developers are encouraged to extend this module with custom memory strategies, such as dynamic summarization, selective memory retention, or external knowledge integration, to improve the handling of long-horizon interaction histories.
 ## 2. Data Preparation
+For most environments (e.g., ALFWorld, WebShop, Sokoban), data preparation serves only to indicate the modality, either "text" or "visual". For example, if the task is purely text-based, the data will just be an empty string "". If it involves visual input, it will be "\<image\>". As for the agent input (including the task instruction, observation, and prompt), we follow the classical RL pipeline: the input to the LLM agent comes from the environment's feedback through `env.step()`. In the Search-R1 experiments, where tasks are drawn from a dataset, we use the [env_kwargs](./examples/data_preprocess/preprocess_search_r1_dataset.py#L90) parameter to pass tasks into the environment, using: [envs.reset(kwargs=gen_batch.non_tensor_batch.pop('env_kwargs', None))](./agent_system/multi_turn_rollout/rollout_loop.py#L301).
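The `env_kwargs` hand-off described in this paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `rollout_loop.py`: the `ToyEnvs` class and the `question` key are hypothetical, and a plain dict stands in for `gen_batch.non_tensor_batch`.

```python
from typing import Any, Dict, List, Optional


class ToyEnvs:
    """Hypothetical stand-in for a vectorized environment manager."""

    def reset(self, kwargs: Optional[List[Dict[str, Any]]] = None) -> List[str]:
        # Without kwargs the environment produces its own task
        # (the ALFWorld / WebShop / Sokoban case).
        if kwargs is None:
            return ["<task chosen by the environment>"]
        # With kwargs each task comes from a dataset row
        # (the Search-R1 case). The "question" key is an assumption.
        return [kw["question"] for kw in kwargs]


# A plain dict standing in for gen_batch.non_tensor_batch.
non_tensor_batch = {"env_kwargs": [{"question": "Who wrote Dune?"}]}

envs = ToyEnvs()
# pop() falls back to None when the key is absent,
# so the same call covers both cases.
dataset_obs = envs.reset(kwargs=non_tensor_batch.pop("env_kwargs", None))
default_obs = envs.reset(kwargs=non_tensor_batch.pop("env_kwargs", None))
print(dataset_obs)
print(default_obs)
```

Because `pop` supplies a `None` default, environments that generate their own tasks need no dataset-side changes; only dataset-driven setups have to populate `env_kwargs`.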
 ## 3. Customize Your Own Prompts
 We adopt a simple and minimal prompt format in our implementation. For example, in the WebShop environment:

examples/data_preprocess/prepare.py

@@ -34,6 +34,11 @@ if __name__ == '__main__':
     args.local_dir = os.path.join(args.local_dir, args.mode)
     data_source = 'hiyouga/geometry3k'
     dataset = datasets.load_dataset(data_source)

     args.local_dir = os.path.join(args.local_dir, args.mode)
     data_source = 'hiyouga/geometry3k'
+    """
+    **NOTE**: This is a frequently asked question.
+    We do NOT use the data in 'hiyouga/geometry3k'; we only use it to indicate the modality and the data size.
+    See details: https://github.com/langfengQ/verl-agent?tab=readme-ov-file#2-data-preparation
+    """
     dataset = datasets.load_dataset(data_source)
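To make the note in this hunk concrete, here is a rough sketch of what "indicating the modality and the data size" amounts to. The helper `make_placeholder_rows` and its `prompt` field are hypothetical and are not taken from `prepare.py`; the only facts borrowed from the README are that text tasks carry an empty string and visual tasks carry `<image>`.

```python
from typing import Dict, List


def make_placeholder_rows(mode: str, size: int) -> List[Dict[str, str]]:
    """Build `size` placeholder rows whose only job is to signal the modality."""
    if mode not in ("text", "visual"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Per the README: "" for text-only tasks, "<image>" for visual tasks.
    placeholder = "" if mode == "text" else "<image>"
    # The row count conveys the data size; the content is never used as a task.
    return [{"prompt": placeholder} for _ in range(size)]


rows = make_placeholder_rows("visual", 4)
print(len(rows), rows[0])
```

The actual task instructions and observations still come from the environment at rollout time, so the placeholder content never reaches the agent as a task.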

examples/gigpo_trainer/run_alfworld_lora.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

 group_size=8
 mode="mean_std_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

examples/gigpo_trainer/run_blackjack.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

examples/gigpo_trainer/run_ezpoints.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size ${train_data_size} \

 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size ${train_data_size} \

examples/gigpo_trainer/run_numberline.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

examples/gigpo_trainer/run_sokoban.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

examples/gigpo_trainer/run_webshop_qwen3.sh

@@ -9,6 +9,7 @@ val_data_size=128
 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

 group_size=8
 mode="mean_norm" # "mean_norm" or "mean_std_norm"
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

examples/grpo_trainer/run_balckjack.sh

@@ -8,6 +8,7 @@ train_data_size=32
 val_data_size=128
 group_size=8
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

 val_data_size=128
 group_size=8
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

examples/grpo_trainer/run_sokoban.sh

@@ -8,6 +8,7 @@ train_data_size=32
 val_data_size=128
 group_size=8
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

 val_data_size=128
 group_size=8
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'visual' \
     --train_data_size $train_data_size \

examples/ppo_trainer/run_alfworld.sh

@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
 train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
 val_data_size=128
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

 train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
 val_data_size=128
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

examples/ppo_trainer/run_webshop.sh

@@ -7,6 +7,7 @@ num_cpus_per_env_worker=0.1 # The CPU resource allocated for each environment wo
 train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
 val_data_size=128
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \

 train_data_size=128 # match GRPO and GiGPO configuration (16 × 8)
 val_data_size=128
+# We only use data preparation to indicate the modality and the data size.
 python3 -m examples.data_preprocess.prepare \
     --mode 'text' \
     --train_data_size $train_data_size \