# Qwen3-4B-SFT

This repository hosts the released SFT model Qwen3-4B-SFT, based on Qwen3-4B-Base. In open-source practice, fully reproducible "pre-RL SFT base" releases are still relatively rare: many projects release only final reinforcement-learning (RL) checkpoints or partial pipelines.
This model fills that gap by giving the community a practical intermediate checkpoint that:
- applies math-forward, reasoning-focused, format-aligned SFT
- starts from a public base model (`Qwen3-4B-Base`)
- is suitable as a warm start and clean checkpoint for later RL stages
## Base Info

- Base: `Qwen3-4B-Base`
- This model: `Qwen3-4B-SFT`
- Fine-tuning type: full-parameter SFT
- Training framework: `verl`
## Training Infrastructure

- Cluster: MeluXina Supercomputer (LuxProvide)
- Node config: NVIDIA A100 (4-GPU nodes)
- Final SFT run: 12 node-hours (16× A100 for 3 hours, i.e. 4 nodes × 3 h)
- Total R&D investment: ~700 node-hours

> Includes data ablations, hyperparameter sweeps, and extensive benchmark evaluation.
## Project Links
- Model repository (this page): https://huggingface.co/96kevinli29/Qwen3-4B-SFT
- Dataset card used for SFT: https://huggingface.co/datasets/96kevinli29/SFT-Dataset
- Training code repository: https://github.com/96kevinli29/base-model-sft-verl
## Troubleshooting: model does not stop generating

If the model answers correctly but keeps generating (repetition or looping), the EOS setting is likely mismatched with the ChatML template.
For ChatML SFT, the model ends each turn with `<|im_end|>` (typically ID 151645), while some default configs still use `<|endoftext|>` (typically ID 151643) as the EOS token.
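For reference, the ChatML turn structure described above can be sketched in a few lines; the `to_chatml` helper below is purely illustrative (not part of this release), showing why `<|im_end|>` must be the stop token:

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts in ChatML format."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # Leave the assistant turn open so the model generates the reply;
    # a correctly configured model then emits <|im_end|> to finish the turn.
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = to_chatml([{"role": "user", "content": "What is 2 + 2?"}])
print(prompt)
```

If `<|im_end|>` is not registered as EOS, generation runs past the closed turn, which is exactly the looping behavior described here.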
Please make sure that:

- `tokenizer_config.json` sets `"eos_token": "<|im_end|>"`
- `generation_config.json` sets `"eos_token_id": 151645` (must match `<|im_end|>`)

Also verify the actual token ID in `tokenizer.json`.
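The consistency check above can be sketched as a small helper; `check_eos_alignment` and the inline config values are illustrative assumptions (in practice, load the real JSON from `tokenizer_config.json`, `generation_config.json`, and the vocab inside `tokenizer.json`):

```python
def check_eos_alignment(tokenizer_config, generation_config, vocab):
    """True if the configured EOS token and EOS ID agree with the vocab."""
    eos_token = tokenizer_config.get("eos_token")
    eos_id = generation_config.get("eos_token_id")
    return vocab.get(eos_token) == eos_id

# Illustrative values mirroring the expected ChatML-aligned setup.
vocab = {"<|endoftext|>": 151643, "<|im_end|>": 151645}
aligned = check_eos_alignment(
    {"eos_token": "<|im_end|>"},
    {"eos_token_id": 151645},
    vocab,
)
print(aligned)  # True when the EOS token and ID match
```

A mismatch (e.g. `eos_token_id` left at 151643 while the template ends turns with `<|im_end|>`) makes this check return `False`, which is the failure mode described above.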
## Benchmark Snapshot

The following results compare the base model (4B), the SFT checkpoint reported in RLCER, and this release (Ours).
| Dataset | Base (4B) | SFT (RLCER) | Ours |
|---|---|---|---|
| AIME 2024 | 11.25% | 17.29% | 20.8% |
| AIME 2025 | 6.46% | 18.96% | 19.4% |
| AMC 2023 | 31.09% | 59.53% | 58.0% |
| GPQA-Diamond | 7.77% | 24.43% | 29.1% |
For context, the SFT stage in the RLCER paper (arXiv:2602.10885) was not fully released as an open checkpoint or pipeline. This release fills that practical gap for the community: we open-source a reproducible pre-RL SFT base and report results that reach the SFT-level performance target discussed in that work.
## Limitations
- Not optimized for factual correctness in all domains
- May still produce hallucinations or unsafe outputs
- Performance is sensitive to prompt style and decoding settings
## Citation

If you use this model, please cite this checkpoint. BibTeX for this release:
```bibtex
@misc{qwen3-4b-sft-math-2025,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {96kevinli29},
  year         = {2026},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/96kevinli29/Qwen3-4B-SFT},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research.}
}
```
Also cite, as appropriate:

- The base model (`Qwen3-4B-Base`): use the official Qwen3 / Alibaba citation from its Hugging Face model card.
- The training code repository: https://github.com/96kevinli29/base-model-sft-verl/
- The original source datasets listed in the dataset recipe.