Qwen3-4B-SFT

This repository hosts the released SFT model Qwen3-4B-SFT, built on Qwen3-4B-Base. In open-source practice, fully reproducible "pre-RL SFT base" releases are still relatively rare: many projects release only the final reinforcement-learning (RL) models or partial pipelines.

This model fills that gap by giving the community a practical intermediate checkpoint that:

  • applies math-forward, reasoning-focused, format-aligned SFT
  • starts from a public base model (Qwen3-4B-Base)
  • serves as a clean warm-start checkpoint for later RL stages

Base Info

  • Base: Qwen3-4B-Base
  • This model: Qwen3-4B-SFT
  • Fine-tuning type: full-parameter SFT
  • Training framework: verl
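A minimal inference sketch is below. The model ID is taken from this card's URL; the hand-rolled ChatML formatter is only an illustration (the bundled tokenizer likely ships a chat template usable via `tokenizer.apply_chat_template`), and `<|im_end|>` is passed explicitly as the stop token per the troubleshooting note further down.

```python
MODEL_ID = "96kevinli29/Qwen3-4B-SFT"  # model ID from this card's URL

def build_chatml_prompt(messages):
    """Format a list of {'role', 'content'} dicts into the ChatML layout
    this SFT model was trained on, ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)

if __name__ == "__main__":
    # Requires transformers with the Qwen3 architecture and enough
    # memory for a 4B BF16 checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    prompt = build_chatml_prompt([{"role": "user", "content": "What is 17 * 23?"}])
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        # Stop on the ChatML end-of-turn token, not <|endoftext|>.
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
    )
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```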

Training Infrastructure

Cluster: MeluXina Supercomputer (LuxProvide)

Node Config: NVIDIA A100 (4-GPU nodes)

Final SFT Run: 12 node-hours (4 nodes × 4 A100s = 16 GPUs, for 3 hours)

Total R&D Investment: ~700 node-hours (includes data ablations, hyperparameter sweeps, and extensive benchmark evaluation)

Troubleshooting: Model does not stop generation

If the model answers correctly but continues generating (repetition/looping), the EOS setting is likely mismatched with the ChatML template.

For ChatML SFT, the model usually ends turns with <|im_end|> (often ID 151645), while some default configs still use <|endoftext|> (often ID 151643) as EOS.

Please make sure:

  • tokenizer_config.json

    • "eos_token": "<|im_end|>"
  • generation_config.json

    • "eos_token_id": 151645 (must match <|im_end|>)
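Concretely, after the fix the two files should contain entries along these lines (the IDs shown are the common Qwen vocabulary values quoted above; verify them against your checkpoint):

tokenizer_config.json:

```json
{ "eos_token": "<|im_end|>" }
```

generation_config.json:

```json
{ "eos_token_id": 151645 }
```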

Also verify the actual token ID of <|im_end|> in tokenizer.json, since the vocabulary there is authoritative.
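This consistency check can be scripted. The helper below is a sketch that assumes the three config files sit in a local checkpoint directory and that tokenizer.json follows the standard Hugging Face fast-tokenizer layout (vocabulary under `model.vocab`); the function and file names are illustrative, not part of this release.

```python
import json
from pathlib import Path

CHATML_EOS = "<|im_end|>"  # expected end-of-turn token for ChatML SFT

def check_eos_consistency(tokenizer_cfg, generation_cfg, vocab):
    """Return a list of problems; an empty list means EOS settings are consistent."""
    problems = []

    eos_token = tokenizer_cfg.get("eos_token")
    if isinstance(eos_token, dict):           # older AddedToken serialization
        eos_token = eos_token.get("content")
    if eos_token != CHATML_EOS:
        problems.append(
            f"tokenizer_config eos_token is {eos_token!r}, expected {CHATML_EOS!r}")

    eos_ids = generation_cfg.get("eos_token_id")
    if not isinstance(eos_ids, list):         # may be a single int or a list
        eos_ids = [eos_ids]
    true_id = vocab.get(CHATML_EOS)           # authoritative ID from the vocab
    if true_id not in eos_ids:
        problems.append(
            f"generation_config eos_token_id {eos_ids} does not include the "
            f"actual {CHATML_EOS!r} id {true_id}")
    return problems

def check_checkpoint(path):
    """Load the three JSON files from a checkpoint directory and run the check."""
    path = Path(path)
    tok_cfg = json.loads((path / "tokenizer_config.json").read_text())
    gen_cfg = json.loads((path / "generation_config.json").read_text())
    vocab = json.loads((path / "tokenizer.json").read_text())["model"]["vocab"]
    return check_eos_consistency(tok_cfg, gen_cfg, vocab)
```

For example, `check_checkpoint("./Qwen3-4B-SFT")` on a local download should return an empty list once both files point at <|im_end|>.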

Benchmark Snapshot

The following results compare the public base model ("Base (4B)"), the SFT checkpoint reported in RLCER ("SFT (RLCER)"), and this release ("Ours").

Dataset      | Base (4B) | SFT (RLCER) | Ours
AIME 2024    | 11.25%    | 17.29%      | 20.8%
AIME 2025    |  6.46%    | 18.96%      | 19.4%
AMC 2023     | 31.09%    | 59.53%      | 58.0%
GPQA-Diamond |  7.77%    | 24.43%      | 29.1%

For context, the SFT stage in the RLCER paper (arXiv:2602.10885) was not released as an open checkpoint or pipeline. This release fills that practical gap: we open-source a reproducible pre-RL SFT base and report results that reach the SFT-level performance target discussed in that work.

Limitations

  • Not optimized for factual correctness in all domains
  • May still produce hallucinations or unsafe outputs
  • Performance is sensitive to prompt style and decoding settings

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

@misc{qwen3-4b-sft-math-2025,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {96kevinli29},
  year         = {2026},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/96kevinli29/Qwen3-4B-SFT},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research.}
}

Also cite as appropriate:

  • The base model (Qwen3-4B-Base) — use the official Qwen3 / Alibaba citation from its Hugging Face model card.
  • The training code repository: https://github.com/96kevinli29/base-model-sft-verl/
  • The original source datasets listed in the dataset recipe.