---
library_name: peft
base_model: Qwen/Qwen3-0.6B
tags:
  - reinforcement-learning
  - grpo
  - ant-colony
  - qwen3
  - lora
license: apache-2.0
---

# ant-llm-grpo: GRPO-trained Ant Colony Agent

LoRA adapter for [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) trained with **Group Relative Policy Optimization (GRPO)** on an ant colony foraging simulation.

## Overview

This model is part of the [ant-llm](https://github.com/detrin/ant-llm) project, which trains small LLMs to exhibit emergent swarm intelligence in a grid-based ant colony environment. Ants must navigate a grid, find food sources, carry food back to the nest, and coordinate via pheromone signals.

## Training Pipeline

1. **Base model**: Qwen/Qwen3-0.6B
2. **SFT stage**: Supervised fine-tuning on heuristic-generated trajectories ([hermanda/ant-llm-sft](https://huggingface.co/hermanda/ant-llm-sft))
3. **GRPO stage** (this model): Reinforcement learning with GRPO on multi-turn environment rollouts

## GRPO Training Details

- **Algorithm**: Group Relative Policy Optimization with a KL penalty and PPO-style clipping
- **Reward**: Composite reward based on food delivery, valid actions, exploration, and pheromone usage
- **LoRA config**: rank=16, alpha=32, dropout=0.05, targets=q_proj/k_proj/v_proj/o_proj
- **Training**: 30 iterations, group_size=4, batch_size=1, lr=5e-6, bf16
- **Curriculum**: 7-tier progressive difficulty (from 5x5 open grids up to 30x30 mazes with food respawn)
- **Hardware**: Google Colab T4 GPU
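
The "group relative" part of the algorithm can be illustrated in a few lines. The sketch below is not the project's training code. It assumes the common GRPO formulation in which each rollout's reward is standardized against the other rollouts in its group, and the per-token objective is the PPO clipped surrogate. The function names, `eps`, `clip_range`, and the example rewards are illustrative assumptions, and the KL penalty is left out for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts from the same start state."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """PPO-style clipped objective for a single token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical episode rewards for one group of 4 rollouts (group_size=4).
rewards = [10.0, 20.0, 30.0, 40.0]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, rollouts that did better than their siblings are reinforced and the rest are pushed down, without needing a separate value network.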

### Training Results (30 iterations, Tier 0)

- Valid action rate: 100%
- Average food delivered: ~2.7 per episode
- Best reward: 80.59
- The model did not advance past Tier 0 (5x5 grid). More training iterations or hyperparameter tuning are likely needed.

## Usage

This is a LoRA adapter. To use it, first merge the SFT adapter into the base model, then apply this GRPO adapter:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Step 1: Load the base model and fold the SFT LoRA weights into it
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype="bfloat16")
sft_model = PeftModel.from_pretrained(base_model, "hermanda/ant-llm-sft")
merged_model = sft_model.merge_and_unload()

# Step 2: Apply GRPO adapter
grpo_model = PeftModel.from_pretrained(merged_model, "hermanda/ant-llm-grpo")

# Step 3: Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("hermanda/ant-llm-grpo")
```

## Action Format

The model generates structured actions for the ant colony environment:

```
THINK: Found food nearby, picking up
ACT: PICK_UP
ACT: REMEMBER "food at [3,2]"
```

Available actions: MOVE, PICK_UP, DROP, DEPOSIT, SIGNAL, REMEMBER, REINFORCE, FORGET.
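
To consume this output in an environment loop, the response has to be split into its `THINK` and `ACT` lines. The parser below is a minimal sketch based only on the example above, not the project's own parser. The card does not specify the argument syntax for each action, so arguments are kept as raw strings, and silently dropping unknown action names is an assumption (the real environment may instead penalize them).

```python
import re

# Action names listed in this model card.
VALID_ACTIONS = {
    "MOVE", "PICK_UP", "DROP", "DEPOSIT",
    "SIGNAL", "REMEMBER", "REINFORCE", "FORGET",
}

LINE_RE = re.compile(r"^(THINK|ACT):\s*(.*)$")


def parse_response(text):
    """Return (thought, actions), where actions is a list of (name, raw_argument)."""
    thoughts, actions = [], []
    for raw_line in text.splitlines():
        match = LINE_RE.match(raw_line.strip())
        if match is None:
            continue  # ignore anything that is not a THINK or ACT line
        kind, body = match.group(1), match.group(2).strip()
        if kind == "THINK":
            thoughts.append(body)
            continue
        # Split the action name from its (optional) argument string.
        name, _, argument = body.partition(" ")
        if name in VALID_ACTIONS:  # assumption: unknown actions are skipped
            actions.append((name, argument.strip()))
    return " ".join(thoughts), actions


example = (
    "THINK: Found food nearby, picking up\n"
    "ACT: PICK_UP\n"
    'ACT: REMEMBER "food at [3,2]"'
)
thought, actions = parse_response(example)
```

Here `actions` holds one `(name, raw_argument)` pair per valid `ACT` line, in the order the model emitted them.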

## Citation

```bibtex
@misc{ant-llm-grpo-2025,
  title={ant-llm: Training LLMs for Emergent Swarm Intelligence via GRPO},
  author={Daniel Herman},
  year={2025},
  url={https://github.com/detrin/ant-llm}
}
```