PursuitOfDataScience/qwen2.5-0.5b-r1-dpo

This repository contains a three-stage fine-tuned version of Qwen/Qwen2.5-0.5B:

  1. Supervised fine-tuning (SFT) on a local copy of HuggingFaceH4/ultrachat_200k using an instruction-style, multi-turn chat objective.
  2. Reasoning fine-tuning on open-r1/Mixture-of-Thoughts to strengthen step-by-step (Chain of Thought) reasoning, building on the SFT model.
  3. Direct Preference Optimization (DPO) on a local copy of mlabonne/orpo-dpo-mix-40k, using preference pairs derived from instruction-following data.

Model details

  • Base model: Qwen/Qwen2.5-0.5B
  • Stage 1 objective: Supervised fine-tuning for helpful, concise chat responses on Ultrachat-style conversations.
  • Stage 2 objective: Specialized reasoning training to improve logical reasoning and Chain of Thought (CoT) capabilities.
  • Stage 3 objective: DPO on preference pairs (chosen vs. rejected responses) from mlabonne/orpo-dpo-mix-40k.
  • Context length: Up to 32,768 tokens (subject to the base model config; see the snippet after this list).
  • Training data:
    • SFT: multi-turn dialogues from HuggingFaceH4/ultrachat_200k.
    • Reasoning: open-r1/Mixture-of-Thoughts dataset with step-by-step reasoning traces.
    • DPO: preference pairs from mlabonne/orpo-dpo-mix-40k.
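
If you want to verify the advertised context window programmatically, one minimal sketch (using the standard transformers AutoConfig API; the attribute name is the usual one for Qwen2-style configs) is:

from transformers import AutoConfig

# Read the maximum context length directly from the model config.
config = AutoConfig.from_pretrained("PursuitOfDataScience/qwen2.5-0.5b-r1-dpo")
print(config.max_position_embeddings)  # e.g. 32768 for the Qwen2.5-0.5B base config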

Inference usage

The model is trained in a chat-style setup. At inference time, prompts are built as a list of messages and passed through the model's native chat_template via tokenizer.apply_chat_template:

from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "PursuitOfDataScience/qwen2.5-0.5b-r1-dpo"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Explain how someone can build a consistent daily learning habit.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the generated continuation (excluding the prompt tokens)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response)

Multi-turn example

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Describe the main trade-offs between using small and large language models.",
    },
    {
        "role": "assistant",
        "content": "Small models are cheaper and faster, while large models are usually more capable...",
    },
    {
        "role": "user",
        "content": "Give me a bullet-point summary from the perspective of a startup.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)

Chain of Thought (CoT) reasoning example

For reasoning tasks, the model can generate step-by-step thoughts using <think> tags:

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Use Chain of Thought reasoning with <think> tags for complex problems."
        ),
    },
    {
        "role": "user",
        "content": "If a train travels 60 km in 1 hour, how long will it take to travel 180 km?",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
# Example output: <think> The train travels 60 km in 1 hour, so speed is 60 km/h. For 180 km, time = distance / speed = 180 / 60 = 3 hours. </think> It will take 3 hours.
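
If you only want the final answer, the <think>...</think> block can be stripped from the generated text before it is shown to the user. The snippet below is a simple post-processing sketch based on the tag format illustrated above; it is not part of the model's API:

import re

# Drop the <think>...</think> reasoning block and keep only the final answer.
final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
print(final_answer)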

Training pipeline (summary)

  1. Instruction SFT (Ultrachat):

    • Conversations are converted into lists of messages.
    • For each assistant turn, a single training example is built using tokenizer.apply_chat_template.
    • Loss is applied only on assistant tokens; system and user tokens are masked (see the label-masking sketch after this list).
  2. Reasoning Training:

    • Fine-tuning on the open-r1/Mixture-of-Thoughts dataset with step-by-step reasoning traces to enhance CoT capabilities.
    • Supervised or reinforcement-learning-style objectives on these traces align the model with explicit, step-by-step reasoning patterns (e.g. <think>-delimited thoughts).
  3. DPO finetuning (ORPO DPO Mix):

    • Preference pairs (chosen/rejected responses) are pre-processed using the same chat template logic to ensure consistency with inference.
    • The DPO objective is applied using TRL's DPOTrainer, with prompts and chosen/rejected continuations derived from the pre-tokenized data (see the DPO sketch after this list).
    • See the dpo.py script used in this project for full configuration (batch sizing, max length, etc.).
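
The label-masking step from stage 1 can be illustrated with a short sketch. This is not the project's exact training code; it assumes the tokenizer and chat template shown above and simply masks every non-assistant token with -100 so that the cross-entropy loss is computed only on the assistant's reply:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PursuitOfDataScience/qwen2.5-0.5b-r1-dpo")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Token ids for the full conversation (prompt + assistant answer)...
full_ids = tokenizer.apply_chat_template(messages, tokenize=True)
# ...and for the prompt alone, ending with the generation prompt for the assistant turn.
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], tokenize=True, add_generation_prompt=True
)

# Labels mirror the input ids, but prompt positions are set to -100 so the loss
# ignores system and user tokens and is applied only to the assistant tokens.
labels = list(full_ids)
labels[: len(prompt_ids)] = [-100] * len(prompt_ids)

example = {"input_ids": full_ids, "labels": labels}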

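For stage 3, a minimal DPO setup with TRL's DPOTrainer might look like the sketch below. The checkpoint path, hyperparameters, and dataset handling are illustrative assumptions, not the actual configuration from dpo.py:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Starting point for DPO: the reasoning-stage checkpoint (path is a placeholder).
model_name = "path/to/reasoning-stage-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs with chosen and rejected responses.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

training_args = DPOConfig(
    output_dir="qwen2.5-0.5b-r1-dpo",
    beta=0.1,                       # illustrative DPO beta, not the project's value
    per_device_train_batch_size=2,  # illustrative
    max_length=1024,                # illustrative
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
)
trainer.train()
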
Limitations

  • This is a relatively small (0.5B parameter) model and may hallucinate or struggle on complex, multi-step reasoning tasks.
  • Outputs may be inaccurate, unsafe, or biased. Always verify critical information before using it in production.