# Moonlight-16B-A3B: fine-tuning compatible code
This repo contains only the code files (no weights) for
moonshotai/Moonlight-16B-A3B,
with patches applied to `modeling_deepseek.py` that are required for
fine-tuning with DeepSpeed (especially with AutoEP + Muon optimizer).
The same patch has been submitted upstream as moonshotai/Moonlight-16B-A3B PR #9.
## What changed in `modeling_deepseek.py`

### 1. Training-path MoE dispatch (`DeepseekV3MoE.forward`)
The original code only had `if not self.training: y = self.moe_infer(...)`,
leaving `y` undefined during training and causing an `UnboundLocalError` when
shared experts were enabled (reported in Discussion #8).
Added a proper `if self.training:` branch using sort-based dispatch:

- Each token is expanded `top_k` times and sorted by expert ID, so each expert receives a contiguous slice. This enables a single GPU-to-CPU sync for all expert boundaries instead of one per expert.
- Supports gradient flow (no `torch.no_grad()`).
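The sort-based dispatch idea can be sketched in plain Python (a toy illustration with hypothetical scalar tokens and lambda "experts"; the real patch operates on GPU tensors):

```python
# Toy sketch of sort-based MoE dispatch: expand/sort token positions by
# expert ID so every expert's tokens form one contiguous run, process
# run-by-run, then scatter results back to the original order.

def sort_based_dispatch(tokens, expert_ids, experts):
    """tokens: list of values; expert_ids[i] is the expert assigned to tokens[i]."""
    # Sort token positions by their assigned expert ID.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    outputs = [None] * len(tokens)
    # One pass over the sorted order finds every expert boundary at once
    # (a single "sync"), instead of querying boundaries once per expert.
    start = 0
    while start < len(order):
        eid = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == eid:
            end += 1
        for i in order[start:end]:  # contiguous slice for this expert
            outputs[i] = experts[eid](tokens[i])
        start = end
    return outputs

# Two hypothetical experts: expert 0 doubles, expert 1 negates.
experts = {0: lambda x: 2 * x, 1: lambda x: -x}
print(sort_based_dispatch([1, 2, 3, 4], [1, 0, 1, 0], experts))
# -> [-1, 4, -3, 8]
```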
### 2. `assert not self.training` removed from `moe_infer`

Deleted so that evaluation steps inside a training loop do not crash.
### 3. KV cache API compatibility (`get_usable_length` → `get_seq_length`)

`past_key_value.get_usable_length()` and `past_key_values.seen_tokens` are
deprecated in transformers >= 4.40. Replaced with `get_seq_length()`.
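The call-pattern change looks roughly like this (`ToyCache` below is a hypothetical stand-in, not the real transformers cache class, so the snippet stays self-contained):

```python
# Hypothetical stand-in for a transformers KV cache, illustrating the
# patched call pattern.

class ToyCache:
    def __init__(self):
        self.cached_keys = []  # one entry per cached token position

    def get_seq_length(self, layer_idx=0):
        # Non-deprecated way to ask how many tokens are already cached.
        return len(self.cached_keys)

past_key_value = ToyCache()
past_key_value.cached_keys.extend(["k0", "k1", "k2"])

# Old (deprecated): past_length = past_key_value.get_usable_length(q_len)
# New (patched):
past_length = past_key_value.get_seq_length()
print(past_length)  # -> 3
```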
## Usage

### Option A: use the upstream PR branch directly (simplest)

If the PR has not been merged yet, you can load the patched code straight from the PR branch. Weights are downloaded from the original repo as usual.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    revision="refs/pr/9",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    revision="refs/pr/9",
    trust_remote_code=True,
)
```
Once PR #9 is merged, remove `revision="refs/pr/9"` and use the repo as normal.
### Option B: copy the patched file into your local snapshot
Useful for air-gapped clusters or when you want a permanent local copy.
```python
from huggingface_hub import snapshot_download, hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
import shutil

# Download the original model weights
model_path = snapshot_download("moonshotai/Moonlight-16B-A3B")

# Overwrite modeling_deepseek.py with the patched version from this repo
patched = hf_hub_download("delock/Moonlight-16B-A3B-finetune-fixed", "modeling_deepseek.py")
shutil.copy(patched, model_path + "/modeling_deepseek.py")

# Load from the local path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```
The weights (~32 GB) are downloaded only once and stay in your local Hugging Face cache. You only need to overwrite `modeling_deepseek.py` once.
## DeepSpeed fine-tuning
This repo was developed alongside the DeepSpeed AutoEP branch. For fine-tuning with DeepSpeed + AutoEP + Muon optimizer, use either option above as the model source.
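For reference, a minimal DeepSpeed config along these lines uses only standard keys (any AutoEP- or Muon-specific settings come from the DeepSpeed AutoEP branch itself and are not shown here; the batch-size and ZeRO-stage values below are illustrative assumptions):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
```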
## License
`modeling_deepseek.py` is derived from the original DeepSeek-AI /
HuggingFace code and is redistributed under the
Apache 2.0 License.