Qwen2.5-7B SFT + GRPO Checkpoint 200

This model was trained using GRPO (Group Relative Policy Optimization) on multi-hop tool-use tasks.

Training Details

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Training Method: SFT followed by GRPO
  • Task: Multi-hop tool-use (3-, 6-, and 9-hop)
  • Checkpoint: Step 200
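
For context on the training method: GRPO dispenses with a learned value network and instead scores each sampled completion relative to the other completions in its group, normalizing rewards by the group mean and standard deviation. A minimal sketch of that advantage computation (the function name and example rewards are illustrative, not from this repository):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each completion's reward
    by the mean and (population) std of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one prompt, two rewarded 1.0 and two 0.0
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In practice this normalization is handled by the training framework (e.g. TRL's GRPO trainer); the sketch only shows the group-relative idea behind the name.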

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("Anna4242/qwen25-7b-sft-grpo-checkpoint-200")
tokenizer = AutoTokenizer.from_pretrained("Anna4242/qwen25-7b-sft-grpo-checkpoint-200")
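
For inference, prompts should follow the chat format the base Qwen2.5-Instruct model was trained on; the usual route is `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`. As a rough illustration of what that template produces, here is a sketch of the ChatML-style layout Qwen2.5 uses (a hand-rolled helper for illustration only — use the tokenizer's template in real code):

```python
def build_chatml_prompt(messages):
    """Illustrative sketch of the ChatML-style prompt layout used by
    Qwen2.5 chat models: each turn is wrapped in <|im_start|>/<|im_end|>
    markers, with a trailing assistant header to start generation."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Look up the capital of France."},
])
```

The resulting string can then be tokenized and passed to `model.generate`; relying on `apply_chat_template` instead of this sketch keeps the prompt in sync with the tokenizer's own template.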
