Qwen3-4B-SFT

This repository hosts the released SFT model Qwen3-4B-SFT, built on Qwen3-4B-Base. In open-source practice, fully reproducible "pre-RL SFT base" releases are still relatively rare: many projects release only the final reinforcement-learning (RL) models or partial pipelines.

This model fills that gap by giving the community a practical intermediate checkpoint that:

  • applies math-forward, reasoning-focused, format-aligned SFT
  • starts from a public base model (Qwen3-4B-Base)
  • serves as a clean warm-start checkpoint for later RL stages

Base Info

  • Base: Qwen3-4B-Base
  • This model: Qwen3-4B-SFT
  • Fine-tuning type: full-parameter SFT
  • Training framework: verl
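A minimal inference sketch is below. The model ID is taken from this card's URL; the hand-rolled ChatML formatter is only an illustration (the bundled tokenizer likely ships a chat template usable via `tokenizer.apply_chat_template`), and `<|im_end|>` is passed explicitly as the stop token per the troubleshooting note further down.

```python
MODEL_ID = "96kevinli29/Qwen3-4B-SFT"  # model ID from this card's URL

def build_chatml_prompt(messages):
    """Format a list of {'role', 'content'} dicts into the ChatML layout
    this SFT model was trained on, ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)

if __name__ == "__main__":
    # Requires transformers with the Qwen3 architecture and enough
    # memory for a 4B BF16 checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    prompt = build_chatml_prompt([{"role": "user", "content": "What is 17 * 23?"}])
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        # Stop on the ChatML end-of-turn token, not <|endoftext|>.
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
    )
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```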

Training Infrastructure

Cluster: MeluXina Supercomputer (LuxProvide)

Node Config: NVIDIA A100 (4-GPU nodes)

Final SFT Run: 12 node-hours (4 nodes × 4 A100s = 16 GPUs, for 3 hours)

Total R&D Investment: ~700 node-hours (includes data ablations, hyperparameter sweeps, and extensive benchmark evaluation)

Troubleshooting: Model does not stop generation

If the model answers correctly but continues generating (repetition/looping), the EOS setting is likely mismatched with the ChatML template.

For ChatML SFT, the model usually ends turns with <|im_end|> (often ID 151645), while some default configs still use <|endoftext|> (often ID 151643) as EOS.

Please make sure:

  • tokenizer_config.json

    • "eos_token": "<|im_end|>"
  • generation_config.json

    • "eos_token_id": 151645 (must match <|im_end|>)
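Concretely, after the fix the two files should contain entries along these lines (the IDs shown are the common Qwen vocabulary values quoted above; verify them against your checkpoint):

tokenizer_config.json:

```json
{ "eos_token": "<|im_end|>" }
```

generation_config.json:

```json
{ "eos_token_id": 151645 }
```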

Also verify the actual token ID of <|im_end|> in tokenizer.json, since the vocabulary there is authoritative.
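This consistency check can be scripted. The helper below is a sketch that assumes the three config files sit in a local checkpoint directory and that tokenizer.json follows the standard Hugging Face fast-tokenizer layout (vocabulary under `model.vocab`); the function and file names are illustrative, not part of this release.

```python
import json
from pathlib import Path

CHATML_EOS = "<|im_end|>"  # expected end-of-turn token for ChatML SFT

def check_eos_consistency(tokenizer_cfg, generation_cfg, vocab):
    """Return a list of problems; an empty list means EOS settings are consistent."""
    problems = []

    eos_token = tokenizer_cfg.get("eos_token")
    if isinstance(eos_token, dict):           # older AddedToken serialization
        eos_token = eos_token.get("content")
    if eos_token != CHATML_EOS:
        problems.append(
            f"tokenizer_config eos_token is {eos_token!r}, expected {CHATML_EOS!r}")

    eos_ids = generation_cfg.get("eos_token_id")
    if not isinstance(eos_ids, list):         # may be a single int or a list
        eos_ids = [eos_ids]
    true_id = vocab.get(CHATML_EOS)           # authoritative ID from the vocab
    if true_id not in eos_ids:
        problems.append(
            f"generation_config eos_token_id {eos_ids} does not include the "
            f"actual {CHATML_EOS!r} id {true_id}")
    return problems

def check_checkpoint(path):
    """Load the three JSON files from a checkpoint directory and run the check."""
    path = Path(path)
    tok_cfg = json.loads((path / "tokenizer_config.json").read_text())
    gen_cfg = json.loads((path / "generation_config.json").read_text())
    vocab = json.loads((path / "tokenizer.json").read_text())["model"]["vocab"]
    return check_eos_consistency(tok_cfg, gen_cfg, vocab)
```

For example, `check_checkpoint("./Qwen3-4B-SFT")` on a local download should return an empty list once both files point at <|im_end|>.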

Benchmark Snapshot

The following results compare the public base model ("Base (4B)"), the SFT checkpoint reported in RLCER ("SFT (RLCER)"), and this release ("Ours").

Dataset      | Base (4B) | SFT (RLCER) | Ours
AIME 2024    | 11.25%    | 17.29%      | 20.8%
AIME 2025    |  6.46%    | 18.96%      | 19.4%
AMC 2023     | 31.09%    | 59.53%      | 58.0%
GPQA-Diamond |  7.77%    | 24.43%      | 29.1%

For context, the SFT stage in the RLCER paper (arXiv:2602.10885) was not released as an open checkpoint or pipeline. This release fills that practical gap: we open-source a reproducible pre-RL SFT base and report results that reach the SFT-level performance target discussed in that work.

Limitations

  • Not optimized for factual correctness in all domains
  • May still produce hallucinations or unsafe outputs
  • Performance is sensitive to prompt style and decoding settings

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

@misc{qwen3-4b-sft-math-2025,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {96kevinli29},
  year         = {2026},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/96kevinli29/Qwen3-4B-SFT},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research.}
}

Also cite as appropriate:

  • The base model (Qwen3-4B-Base) — use the official Qwen3 / Alibaba citation from its Hugging Face model card.
  • The training code repository: https://github.com/96kevinli29/base-model-sft-verl/
  • The original source datasets listed in the dataset recipe.