---
language:
- en
license: other
pipeline_tag: text-generation
tags:
- qwen2.5
- chat
- sft
- dpo
- ultrachat
- orpo-dpo-mix
base_model: Qwen/Qwen2.5-0.5B
library_name: transformers
---

# PursuitOfDataScience/qwen2.5-0.5b-dpo

This repository contains a two-stage fine-tuned version of **Qwen/Qwen2.5-0.5B**:

1. **Supervised fine-tuning (SFT)** on a local copy of **HuggingFaceH4/ultrachat_200k**,
   using an instruction-style, multi-turn chat objective.
2. **Direct Preference Optimization (DPO)** on a local copy of
   **mlabonne/orpo-dpo-mix-40k**, using preference pairs derived from instruction-following data.

## Model details

- **Base model**: `Qwen/Qwen2.5-0.5B`
- **Stage 1 objective**: Supervised fine-tuning for helpful, concise chat responses
  on Ultrachat-style conversations.
- **Stage 2 objective**: DPO on preference pairs (chosen vs. rejected responses)
  from `mlabonne/orpo-dpo-mix-40k`.
- **Context length**: Up to 32,768 tokens (subject to the base model config); see the quick check after this list.
- **Training data**:
  - SFT: multi-turn dialogues from `HuggingFaceH4/ultrachat_200k`.
  - DPO: preference pairs from `mlabonne/orpo-dpo-mix-40k`.
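
As a quick check of the effective context window, you can read the limit from the shipped config. This is a minimal sketch; `max_position_embeddings` is the relevant field for Qwen2-style configs, and the printed value simply reflects whatever the uploaded config specifies:

```python
from transformers import AutoConfig

# Read the maximum context length advertised by the model config.
config = AutoConfig.from_pretrained("PursuitOfDataScience/qwen2.5-0.5b-dpo")
print(config.max_position_embeddings)  # expected to be 32768 for the Qwen2.5-0.5B base config
```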

## Inference usage

The model is trained in a **chat-style** setup. At inference time, prompts are built
as a list of `messages` and passed through the model's native `chat_template`
via `tokenizer.apply_chat_template`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "PursuitOfDataScience/qwen2.5-0.5b-dpo"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Explain how someone can build a consistent daily learning habit.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the generated continuation (excluding the prompt tokens)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response)
```

### Multi-turn example

```python
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Describe the main trade-offs between using small and large language models.",
    },
    {
        "role": "assistant",
        "content": "Small models are cheaper and faster, while large models are usually more capable...",
    },
    {
        "role": "user",
        "content": "Give me a bullet-point summary from the perspective of a startup.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

## Training pipeline (summary)

1. **Instruction SFT (Ultrachat)**:
   - Conversations are converted into lists of `messages`.
   - For each assistant turn, a single training example is built using
     `tokenizer.apply_chat_template`.
   - Loss is applied only to assistant tokens; system and user tokens are masked
     (see the masking sketch after this list).

2. **DPO fine-tuning (ORPO DPO Mix)**:
   - Preference pairs (chosen/rejected responses) are pre-processed using the
     same chat template logic to ensure consistency with inference.
   - The DPO objective is applied using TRL's `DPOTrainer`, with prompts and
     chosen/rejected continuations derived from the pre-tokenized data
     (see the training sketch after this list).
   - See the `dpo.py` script used in this project for the full configuration
     (batch sizing, max length, etc.).
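
The exact preprocessing code lives in this project's training scripts; the snippet below is only a
minimal sketch of the assistant-token masking idea, assuming one example per final assistant turn.
The `build_sft_example` helper and its `max_length` default are illustrative, not the project's
actual configuration:

```python
import torch

def build_sft_example(tokenizer, messages, max_length=2048):
    """Sketch: keep the loss only on the final assistant turn of a conversation."""
    # Token ids for the conversation *up to* the assistant reply (system + user turns).
    prompt_ids = tokenizer.apply_chat_template(
        messages[:-1], tokenize=True, add_generation_prompt=True
    )
    # Token ids for the full conversation, including the assistant reply.
    full_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False
    )
    input_ids = full_ids[:max_length]
    labels = list(input_ids)
    # Mask prompt tokens with -100 so they are ignored by the cross-entropy loss.
    prompt_len = min(len(prompt_ids), len(input_ids))
    labels[:prompt_len] = [-100] * prompt_len
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
    }
```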
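
Likewise, a minimal sketch of the DPO stage with TRL is shown below. The checkpoint path and
hyperparameters are placeholders rather than the values used for this model, and the trainer
argument names may differ slightly across TRL versions; refer to `dpo.py` for the actual setup:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder path: the stage-1 SFT checkpoint, not the raw base model.
sft_checkpoint = "path/to/sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Preference pairs in chosen/rejected format (see the dataset card for the exact schema).
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

training_args = DPOConfig(
    output_dir="qwen2.5-0.5b-dpo",
    per_device_train_batch_size=2,   # placeholder values
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,                        # strength of the preference regularization
    max_length=1024,
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```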

## Limitations

- This is a relatively small (0.5B-parameter) model and may hallucinate or
  struggle on complex, multi-step reasoning tasks.
- Outputs may be inaccurate, unsafe, or biased. Always verify critical
  information before using it in production.