ant-llm-grpo: GRPO-trained Ant Colony Agent

LoRA adapter for Qwen/Qwen3-0.6B trained with Group Relative Policy Optimization (GRPO) on an ant colony foraging simulation.

Overview

This model is part of the ant-llm project — training small LLMs to exhibit emergent swarm intelligence in a grid-based ant colony environment. Ants must navigate a grid, find food sources, carry food back to the nest, and coordinate via pheromone signals.

Training Pipeline

  1. Base model: Qwen/Qwen3-0.6B
  2. SFT stage: Supervised fine-tuning on heuristic-generated trajectories (hermanda/ant-llm-sft)
  3. GRPO stage (this model): Reinforcement learning with GRPO on multi-turn environment rollouts

GRPO Training Details

  • Algorithm: Group Relative Policy Optimization with KL penalty + PPO clipping
  • Reward: Composite reward based on food delivery, valid actions, exploration, and pheromone usage
  • LoRA config: rank=16, alpha=32, dropout=0.05, targets=q_proj/k_proj/v_proj/o_proj
  • Training: 30 iterations, group_size=4, batch_size=1, lr=5e-6, bf16
  • Curriculum: 7-tier progressive difficulty (5x5 open grids up to 30x30 mazes with food respawn)
  • Hardware: Google Colab T4 GPU
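The composite reward listed above can be sketched as a simple weighted sum. The component names (food delivery, valid actions, exploration, pheromone usage) come from this card; the weights, signs, and function signature below are illustrative assumptions, not the values used in training.

```python
def composite_reward(food_delivered: int, valid_action: bool,
                     explored_new_cell: bool, used_pheromone: bool) -> float:
    """Hypothetical per-step reward combining the four components above."""
    reward = 0.0
    reward += 10.0 * food_delivered                # food delivery (assumed weight)
    reward += 0.1 if valid_action else -0.5        # valid-action shaping (assumed)
    reward += 0.05 if explored_new_cell else 0.0   # exploration bonus (assumed)
    reward += 0.02 if used_pheromone else 0.0      # pheromone-usage bonus (assumed)
    return reward
```

In GRPO the absolute scale matters less than usual, since rewards are normalized relative to the other rollouts in each group of size 4.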

Training Results (30 iterations, Tier 0)

  • Valid action rate: 100%
  • Average food delivered: ~2.7 per episode
  • Best reward: 80.59
  • The model did not advance past Tier 0 (the 5x5 grid); more training iterations or hyperparameter tuning would be needed to progress through the curriculum.

Usage

This is a LoRA adapter. To use it, first merge the SFT adapter into the base model, then apply this GRPO adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Step 1: Load base + merge SFT adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
sft_model = PeftModel.from_pretrained(base_model, "hermanda/ant-llm-sft")
merged_model = sft_model.merge_and_unload()

# Step 2: Apply GRPO adapter
grpo_model = PeftModel.from_pretrained(merged_model, "hermanda/ant-llm-grpo")

# Step 3: Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("hermanda/ant-llm-grpo")

Action Format

The model generates structured actions for the ant colony environment:

THINK: Found food nearby, picking up
ACT: PICK_UP
ACT: REMEMBER "food at [3,2]"

Available actions: MOVE, PICK_UP, DROP, DEPOSIT, SIGNAL, REMEMBER, REINFORCE, FORGET.
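A minimal parser for this output format might look like the sketch below. The action names are taken from the list above; the parsing rules (one THINK or ACT directive per line, with an optional double-quoted argument) are assumptions based on the example, not a spec from the project.

```python
import re

# Valid actions as listed in the model card.
VALID_ACTIONS = {"MOVE", "PICK_UP", "DROP", "DEPOSIT",
                 "SIGNAL", "REMEMBER", "REINFORCE", "FORGET"}

def parse_response(text: str):
    """Split a model response into thoughts and (action, argument) pairs."""
    thoughts, actions = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("THINK:"):
            thoughts.append(line[len("THINK:"):].strip())
        elif line.startswith("ACT:"):
            # Action name, optionally followed by a double-quoted argument.
            match = re.match(r'(\w+)(?:\s+"([^"]*)")?', line[len("ACT:"):].strip())
            if match and match.group(1) in VALID_ACTIONS:
                actions.append((match.group(1), match.group(2)))
    return thoughts, actions
```

For the example response above, this yields one thought and two actions: `("PICK_UP", None)` and `("REMEMBER", "food at [3,2]")`.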

Citation

@misc{ant-llm-grpo-2025,
  title={ant-llm: Training LLMs for Emergent Swarm Intelligence via GRPO},
  author={Daniel Herman},
  year={2025},
  url={https://github.com/detrin/ant-llm}
}