# Qwen2.5-7B SFT + GRPO Checkpoint 200
This model was fine-tuned with GRPO (Group Relative Policy Optimization) on multi-hop tool-use tasks, following a supervised fine-tuning (SFT) stage.
## Training Details
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Training Method: SFT followed by GRPO
- Task: Multi-hop tool-use (3-, 6-, and 9-hop)
- Checkpoint: Step 200
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Anna4242/qwen25-7b-sft-grpo-checkpoint-200")
tokenizer = AutoTokenizer.from_pretrained("Anna4242/qwen25-7b-sft-grpo-checkpoint-200")
```
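For chat-style inference, messages should be rendered with the model's chat template via `tokenizer.apply_chat_template`. As a rough illustration of what that template produces, the sketch below builds the ChatML-style prompt string used by Qwen2.5-Instruct models by hand; the exact template is defined by the tokenizer, so treat this as an approximation, not the authoritative format.

```python
# Minimal sketch of a ChatML-style prompt, as used by Qwen2.5-Instruct models.
# In practice, prefer tokenizer.apply_chat_template(messages, add_generation_prompt=True);
# this function only illustrates the general shape of the rendered prompt.
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts into a ChatML-style string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the reply
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Plan the tool calls for a 3-hop query."},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The rendered string (or, equivalently, the token IDs from `apply_chat_template`) is then passed to `model.generate` as usual.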