---
library_name: peft
base_model: Qwen/Qwen3-0.6B
tags:
- reinforcement-learning
- grpo
- ant-colony
- qwen3
- lora
license: apache-2.0
---

# ant-llm-grpo: GRPO-trained Ant Colony Agent

LoRA adapter for [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) trained with **Group Relative Policy Optimization (GRPO)** on an ant colony foraging simulation.

## Overview

This model is part of the [ant-llm](https://github.com/detrin/ant-llm) project: training small LLMs to exhibit emergent swarm intelligence in a grid-based ant colony environment. Ants must navigate a grid, find food sources, carry food back to the nest, and coordinate via pheromone signals.

## Training Pipeline

1. **Base model**: Qwen/Qwen3-0.6B
2. **SFT stage**: Supervised fine-tuning on heuristic-generated trajectories ([hermanda/ant-llm-sft](https://huggingface.co/hermanda/ant-llm-sft))
3. **GRPO stage** (this model): Reinforcement learning with GRPO on multi-turn environment rollouts

## GRPO Training Details

- **Algorithm**: Group Relative Policy Optimization with KL penalty + PPO clipping
- **Reward**: Composite reward based on food delivery, valid actions, exploration, and pheromone usage
- **LoRA config**: rank=16, alpha=32, dropout=0.05, targets=q_proj/k_proj/v_proj/o_proj
- **Training**: 30 iterations, group_size=4, batch_size=1, lr=5e-6, bf16
- **Curriculum**: 7-tier progressive difficulty (5x5 open grids up to 30x30 mazes with food respawn)
- **Hardware**: Google Colab T4 GPU

### Training Results (30 iterations, Tier 0)

- Valid action rate: 100%
- Average food delivered: ~2.7 per episode
- Best reward: 80.59
- The model did not advance past tier 0 (5x5 grid). More training iterations or hyperparameter tuning needed.

## Usage

This is a LoRA adapter.
To use it, first merge the SFT adapter into the base model, then apply this GRPO adapter:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Step 1: Load base + merge SFT adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype="bfloat16")
sft_model = PeftModel.from_pretrained(base_model, "hermanda/ant-llm-sft")
merged_model = sft_model.merge_and_unload()

# Step 2: Apply GRPO adapter
grpo_model = PeftModel.from_pretrained(merged_model, "hermanda/ant-llm-grpo")

# Step 3: Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("hermanda/ant-llm-grpo")
```

## Action Format

The model generates structured actions for the ant colony environment:

```
THINK: Found food nearby, picking up
ACT: PICK_UP
ACT: REMEMBER "food at [3,2]"
```

Available actions: MOVE, PICK_UP, DROP, DEPOSIT, SIGNAL, REMEMBER, REINFORCE, FORGET.

## Citation

```bibtex
@misc{ant-llm-grpo-2025,
  title={ant-llm: Training LLMs for Emergent Swarm Intelligence via GRPO},
  author={Daniel Herman},
  year={2025},
  url={https://github.com/detrin/ant-llm}
}
```
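
For downstream use of the action format shown above, generated text can be split into a thought and a list of actions. The following is a minimal sketch, not the parser used by the ant-llm environment: `ParsedTurn` and `parse_turn` are hypothetical names, and the argument syntax of each action is not documented here, so everything after the action name is kept as an opaque string.

```python
import re
from dataclasses import dataclass, field

# Action names as listed in this model card.
VALID_ACTIONS = {
    "MOVE", "PICK_UP", "DROP", "DEPOSIT",
    "SIGNAL", "REMEMBER", "REINFORCE", "FORGET",
}

# "ACT: NAME optional-argument" (assumed shape, inferred from the example output).
_ACT_RE = re.compile(r"^ACT:\s*([A-Z_]+)\s*(.*)$")


@dataclass
class ParsedTurn:
    """Hypothetical container for one model response."""
    thought: str = ""
    actions: list = field(default_factory=list)  # (name, raw_argument) pairs
    invalid: list = field(default_factory=list)  # lines that could not be parsed


def parse_turn(text: str) -> ParsedTurn:
    """Split a response into its THINK text and ACT entries."""
    turn = ParsedTurn()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("THINK:"):
            turn.thought = line[len("THINK:"):].strip()
            continue
        match = _ACT_RE.match(line)
        if match and match.group(1) in VALID_ACTIONS:
            # The argument is left unparsed because its format is an unknown here.
            turn.actions.append((match.group(1), match.group(2).strip()))
        else:
            turn.invalid.append(line)
    return turn


if __name__ == "__main__":
    sample = 'THINK: Found food nearby, picking up\nACT: PICK_UP\nACT: REMEMBER "food at [3,2]"'
    print(parse_turn(sample))
```

Collecting unparseable lines in `invalid` rather than raising makes it easy to compute something like a valid action rate over many responses, though how the project itself measures that rate is not specified in this card.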
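
As a rough illustration of the algorithm named under GRPO Training Details (group-relative advantages, PPO-style clipping, and a KL penalty), the sketch below computes the per-token loss for a single sampled token. It is not the project's training code: the function names are hypothetical, and `clip_eps=0.2` and `kl_coef=0.04` are assumed placeholder values that this card does not report. Only the group size of 4 comes from the card.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO's baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one token; clip_eps is an assumed value."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative per-token KL estimate against a reference policy."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Loss to minimize: negative clipped objective plus weighted KL penalty."""
    surrogate = clipped_objective(logp_new, logp_old, advantage, clip_eps)
    return -(surrogate - kl_coef * kl_estimate(logp_new, logp_ref))


if __name__ == "__main__":
    # One group of 4 rollouts (group_size=4 in this card); rewards are made up.
    advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
    print(advantages)
    print(grpo_token_loss(-0.9, -1.0, -1.1, advantages[3]))
```

One consequence of the group normalization is that when every rollout in a group earns the same reward, all advantages are zero and that group contributes no policy-gradient signal.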