Moonlight-16B-A3B: fine-tuning-compatible code

This repo contains only the code files (no weights) for moonshotai/Moonlight-16B-A3B, with patches applied to modeling_deepseek.py that are required for fine-tuning with DeepSpeed (especially with AutoEP + Muon optimizer).

The same patch has been submitted upstream as moonshotai/Moonlight-16B-A3B PR #9.

What changed in modeling_deepseek.py

1. Training-path MoE dispatch (DeepseekV3MoE.forward)

The original code only had if not self.training: y = self.moe_infer(...), leaving y undefined during training and causing UnboundLocalError when shared experts were enabled (reported in Discussion #8).

Added a proper if self.training: branch that uses sort-based dispatch (a standalone sketch follows this list):

  • Each token is expanded top_k times and sorted by expert ID so that each expert receives a contiguous slice, enabling a single GPU→CPU sync for all expert boundaries instead of one per expert.
  • Supports gradient flow (no torch.no_grad()).
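
Below is a minimal, self-contained sketch of the sort-based dispatch idea. The function name, argument names, and the experts container are illustrative and do not match the exact identifiers in modeling_deepseek.py; the real patch integrates with the module's gating code rather than a free function.

import torch

def sort_based_dispatch(hidden_states, topk_idx, topk_weight, experts):
    # hidden_states: (num_tokens, hidden_dim)
    # topk_idx:      (num_tokens, top_k) long tensor of expert ids
    # topk_weight:   (num_tokens, top_k) routing weights
    # experts:       list / nn.ModuleList of per-expert MLPs
    num_tokens, hidden_dim = hidden_states.shape
    top_k = topk_idx.shape[1]
    num_experts = len(experts)

    # Expand each token top_k times so every (token, expert) pair is one row.
    flat_idx = topk_idx.reshape(-1)
    expanded = hidden_states.repeat_interleave(top_k, dim=0)

    # Sort rows by expert id: each expert now owns one contiguous slice.
    sorted_idx, order = torch.sort(flat_idx)
    expanded = expanded[order]

    # A single GPU->CPU sync fetches all slice boundaries at once.
    boundaries = torch.cumsum(torch.bincount(sorted_idx, minlength=num_experts), dim=0).cpu()

    # Run each expert on its contiguous slice (no torch.no_grad(), so gradients flow).
    outputs, start = [], 0
    for e in range(num_experts):
        end = int(boundaries[e])
        outputs.append(experts[e](expanded[start:end]))
        start = end
    out = torch.cat(outputs, dim=0)

    # Undo the sort, then combine the top_k expert outputs per token.
    out = out[torch.argsort(order)].view(num_tokens, top_k, hidden_dim)
    return (out * topk_weight.unsqueeze(-1)).sum(dim=1)

Sorting turns per-expert token selection into plain slicing, so the only host-side values needed are the cumulative counts, fetched in a single transfer.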

2. assert not self.training removed from moe_infer

Deleted so that eval steps inside a training loop do not crash.

3. KV cache API compatibility (get_usable_length → get_seq_length)

past_key_value.get_usable_length() and past_key_values.seen_tokens are deprecated in transformers >= 4.40. Replaced with get_seq_length().
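
For reference, this is the shape of the change at a typical attention call site. The variable names are illustrative and the exact call sites in modeling_deepseek.py may differ slightly.

# Old (deprecated Cache API), inside the attention forward:
#   kv_seq_len = key_states.shape[-2] + past_key_value.get_usable_length(kv_seq_len, self.layer_idx)

# New:
if past_key_value is not None:
    kv_seq_len = key_states.shape[-2] + past_key_value.get_seq_length(self.layer_idx)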

Usage

Option A: Use the upstream PR branch directly (simplest)

If the PR has not been merged yet, you can load the patched code straight from the PR branch. Weights are downloaded from the original repo as usual.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    revision="refs/pr/9",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    revision="refs/pr/9",
    trust_remote_code=True,
)

Once PR #9 is merged, remove revision="refs/pr/9" and use the repo as normal.

Option B: Copy the patched file into your local snapshot

Useful for air-gapped clusters or when you want a permanent local copy.

from huggingface_hub import snapshot_download, hf_hub_download
import os
import shutil

# Download the original model weights
model_path = snapshot_download("moonshotai/Moonlight-16B-A3B")

# Overwrite modeling_deepseek.py with the patched version from this repo
patched = hf_hub_download("delock/Moonlight-16B-A3B-finetune-fixed", "modeling_deepseek.py")
shutil.copy(patched, os.path.join(model_path, "modeling_deepseek.py"))

# Load from the local path
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

The weights (~32 GB) are downloaded only once and stay in your local HuggingFace cache. You only need to overwrite modeling_deepseek.py once.

DeepSpeed fine-tuning

This repo was developed alongside the DeepSpeed AutoEP branch. For fine-tuning with DeepSpeed + AutoEP + Muon optimizer, use either option above as the model source.
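
As a starting point, here is a minimal, hedged sketch of wiring the patched model into a DeepSpeed engine. The ds_config below is a generic ZeRO/bf16 example with a placeholder AdamW optimizer; any AutoEP- or Muon-specific settings depend on the DeepSpeed AutoEP branch you use and are intentionally not shown.

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Model source: the PR branch (Option A) or a local patched snapshot path (Option B).
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    revision="refs/pr/9",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Generic config; swap the placeholder optimizer for Muon and add AutoEP settings
# per the DeepSpeed AutoEP branch documentation.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

Launch the script with the deepspeed launcher (e.g. deepspeed train.py) so the distributed environment is set up before deepspeed.initialize is called.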

License

modeling_deepseek.py is derived from the original DeepSeek-AI / HuggingFace code and is redistributed under the Apache 2.0 License.
