ant-llm-grpo: GRPO-trained Ant Colony Agent
LoRA adapter for Qwen/Qwen3-0.6B trained with Group Relative Policy Optimization (GRPO) on an ant colony foraging simulation.
Overview
This model is part of the ant-llm project, which trains small LLMs to exhibit emergent swarm intelligence in a grid-based ant colony environment. Ants must navigate a grid, find food sources, carry food back to the nest, and coordinate via pheromone signals.
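The environment description above implies a shared pheromone field that ants deposit to and that decays over time. A minimal sketch of such a field is below; the class name, evaporation rate, and deposit amount are illustrative assumptions, not taken from the actual ant-llm environment.

```python
class PheromoneGrid:
    """Toy pheromone field: ants deposit scent, which evaporates each tick.
    Illustrative only; parameter values are assumptions, not ant-llm's."""

    def __init__(self, size, evaporation=0.1):
        self.size = size
        self.evaporation = evaporation
        self.levels = [[0.0] * size for _ in range(size)]

    def deposit(self, x, y, amount=1.0):
        # An ant drops pheromone at cell (x, y).
        self.levels[y][x] += amount

    def step(self):
        # Evaporate a fixed fraction of every cell's pheromone per tick.
        for row in self.levels:
            for x in range(len(row)):
                row[x] *= 1.0 - self.evaporation

grid = PheromoneGrid(5)
grid.deposit(3, 2)
grid.step()  # pheromone at (3, 2) decays from 1.0 toward 0
```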
Training Pipeline
- Base model: Qwen/Qwen3-0.6B
- SFT stage: Supervised fine-tuning on heuristic-generated trajectories (hermanda/ant-llm-sft)
- GRPO stage (this model): Reinforcement learning with GRPO on multi-turn environment rollouts
GRPO Training Details
- Algorithm: Group Relative Policy Optimization with a KL penalty and PPO-style clipping
- Reward: Composite reward based on food delivery, valid actions, exploration, and pheromone usage
- LoRA config: rank=16, alpha=32, dropout=0.05, targets=q_proj/k_proj/v_proj/o_proj
- Training: 30 iterations, group_size=4, batch_size=1, lr=5e-6, bf16
- Curriculum: 7-tier progressive difficulty (5x5 open grids up to 30x30 mazes with food respawn)
- Hardware: Google Colab T4 GPU
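The "group relative" part of GRPO means that, instead of a learned value baseline, each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that advantage computation is below (illustrative, not the project's actual training loop):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its group's mean and std.

    rewards: per-rollout scalar rewards for one group of samples drawn
    from the same prompt (group_size=4 in the config above).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Rollouts that beat their group's mean get positive advantage,
# the rest negative; advantages within a group sum to zero.
adv = group_relative_advantages([80.0, 12.0, 35.5, 35.5])
```

These advantages then weight the clipped policy-gradient objective, with the KL penalty keeping the policy close to the SFT reference model.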
Training Results (30 iterations, Tier 0)
- Valid action rate: 100%
- Average food delivered: ~2.7 per episode
- Best reward: 80.59
- The model did not advance past Tier 0 (5x5 grid); more training iterations or hyperparameter tuning would be needed to progress through the curriculum.
Usage
This is a LoRA adapter. To use it, first merge the SFT adapter into the base model, then apply this GRPO adapter:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Step 1: Load base + merge SFT adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype="bfloat16")
sft_model = PeftModel.from_pretrained(base_model, "hermanda/ant-llm-sft")
merged_model = sft_model.merge_and_unload()
# Step 2: Apply GRPO adapter
grpo_model = PeftModel.from_pretrained(merged_model, "hermanda/ant-llm-grpo")
# Step 3: Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("hermanda/ant-llm-grpo")
Action Format
The model generates structured actions for the ant colony environment:
THINK: Found food nearby, picking up
ACT: PICK_UP
ACT: REMEMBER "food at [3,2]"
Available actions: MOVE, PICK_UP, DROP, DEPOSIT, SIGNAL, REMEMBER, REINFORCE, FORGET.
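A response in this format can be split into the thought and its action list with a small parser. The sketch below is a hypothetical implementation for illustration; the environment's real parser (in the ant-llm repo) may differ.

```python
import re

# The action vocabulary listed above.
ACTIONS = {"MOVE", "PICK_UP", "DROP", "DEPOSIT",
           "SIGNAL", "REMEMBER", "REINFORCE", "FORGET"}

def parse_response(text):
    """Return (thought, [(action, argument_or_None), ...]) from a model response."""
    thought = None
    actions = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("THINK:"):
            thought = line[len("THINK:"):].strip()
        elif line.startswith("ACT:"):
            # Match the action name and an optional quoted argument.
            m = re.match(r'ACT:\s*(\w+)(?:\s+"([^"]*)")?', line)
            if m and m.group(1) in ACTIONS:
                actions.append((m.group(1), m.group(2)))
    return thought, actions

thought, acts = parse_response(
    'THINK: Found food nearby, picking up\n'
    'ACT: PICK_UP\n'
    'ACT: REMEMBER "food at [3,2]"'
)
```

Lines with unknown action names are dropped rather than raised on, which mirrors how a reward based on valid-action rate would treat them.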
Citation
@misc{ant-llm-grpo-2025,
title={ant-llm: Training LLMs for Emergent Swarm Intelligence via GRPO},
author={Daniel Herman},
year={2025},
url={https://github.com/detrin/ant-llm}
}