Changelog for Qwen 3

#20
by danielhanchen - opened
Unsloth AI org
•
edited May 23

For a full list of changes:

  1. Updated quants because the chat template wasn't working in llama.cpp / LM Studio due to `[::-1]` and other Jinja template issues - it now works in llama.cpp.
  2. Updated again since LM Studio didn't accept llama.cpp's chat template - we will work with LM Studio in the future to test templates.
  3. Updated with a revised Dynamic 2.0 quant methodology (2.1), upgrading our dataset to over 1 million tokens with both short and long context lengths to improve accuracy. Also fixed the 235B imatrix quants - in fact, we're the only provider of imatrix quants for the 235B model.
  4. Updated again due to tool-calling issues as mentioned in https://www.reddit.com/r/LocalLLaMA/comments/1klltt4/the_qwen3_chat_template_is_still_bugged/ - we believe other providers' quants are still affected.
  5. Updated all quants because speculative decoding wasn't working (mismatched BOS tokens).
  6. The quants should now be fully stable.

FAQ:

  1. Why are there no IQ1, IQ2, IQ3 quants? We found NaNs and other issues with the imatrix, hence we can only provide some dynamic quants. For now, we're also the only provider of quants with imatrix.

Thank you very much for the quants & articles!
I'm a big fan of what you've been doing over the past months/years!

I have a small, respectful observation / suggestion that may be useful or interesting for your future GGUF conversion workflow.

I'll explain the context / use case as it relates to the present models:

The changelog in this present discussion thread includes this item:
"Updated all quants due to speculative decoding not working (BOS tokens mismatched)" --
though neither the commit ID nor the date is listed (a possible area for improvement in changelogs).

Looking at the commit history I see one of the most recent commits is:

https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/commit/90e0d618386f1be1e11a604c2e0484d44c09737e

Which looks like it changes README.md and config.json (okay, no problem - these are small and easy to download, and their diffs are shown, so the changes can easily be seen).

Also changed are five GGUF files, each of them only the "part1" of a split group, e.g.
Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf

The split parts after "part1" were not changed in this commit. That is consistent with what one would expect
when ONLY the GGUF metadata is changed (presumably due to the deletion of `"bos_token_id": 151643` in the config.json delta)
and NOT the tensor data that follows the metadata.

However, due to LFS, the commit only shows that the LFS content of the five part1 GGUF files - 250+ GB of deleted/re-added data - was modified by the delta, with the size of each part1 LFS file changing by merely ~36 bytes or so.

So I'm guessing that changing ~36 bytes of GGUF header metadata has caused you and your GGUF users to upload/download
potentially 250 GB of GGUF files to effect only a ~36-byte metadata change, which itself isn't fully diffable or documented (in the sense of precisely which GGUF metadata field(s) / value(s) were affected).

So, looking at the metadata of the current branch of unsloth/Qwen3-235B-A22B-GGUF (sadly, diffs of the old vs. new metadata aren't possible)
via:
https://huggingface.co/spaces/CISCai/gguf-editor

I see this metadata is present now:
`tokenizer.ggml.add_bos_token, BOOL, false`

whereas `tokenizer.ggml.bos_token_id` is a valid metadata key, but it doesn't seem to be present in the part1 GGUFs I inspected.

So I'm guessing that the latter was previously set per `"bos_token_id": 151643` in the old config.json, and that the former, `tokenizer.ggml.add_bos_token = false`, is now set in the GGUFs instead.
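
For anyone who wants to check their own copies locally, here's a minimal sketch using the `gguf` Python package that ships with llama.cpp (`pip install gguf`); the filename is just an example, and the field-decoding details may vary slightly between package versions:

```python
# Sketch: inspect BOS-related metadata in a local GGUF part1 file.
# Assumes the `gguf` package from llama.cpp; the filename below is
# an example -- point it at your own part1 file.
from gguf import GGUFReader

reader = GGUFReader("Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf")

for key in ("tokenizer.ggml.bos_token_id", "tokenizer.ggml.add_bos_token"):
    field = reader.get_field(key)
    if field is None:
        print(f"{key}: not present")
    else:
        # Scalar values live in parts[data[0]] as a one-element array.
        print(f"{key}: {field.parts[field.data[0]][0]}")
```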

So if such a needed delta were documented clearly, users with old, now-outdated GGUFs would have the option of using a local GGUF metadata editor script (see the sketch just below) to alter the relevant GGUF files without downloading 50 GB × N files to get a ~36-byte change.
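
For same-size scalar values this can even be done in place; here's a hedged sketch modeled on my reading of llama.cpp's gguf-py `gguf_set_metadata.py` script (GGUFReader memory-maps the file, so writes in `r+` mode land directly on disk; note this cannot add or remove keys, since that changes the header length - the ~36-byte case discussed here):

```python
# Sketch: flip an existing boolean metadata field in place.
# This only works for same-size scalar values; adding or deleting
# a key changes the header size and requires rewriting the file.
from gguf import GGUFReader

reader = GGUFReader("Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf", "r+")
field = reader.get_field("tokenizer.ggml.add_bos_token")
if field is not None:
    field.parts[field.data[0]][0] = False  # write-through via the memmap
    print("tokenizer.ggml.add_bos_token set to false")
```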

Even better -- and this is the essence of my suggestion for your future GGUF composition workflows --
would be to consider the merits of using the following GGUF creation option. AFAICT it allows any GGUF conversion to be created as a split output where part1 has NO tensor content, ONLY header/metadata, so part1 would ALWAYS be a file of just a few kilobytes that could trivially be edited / uploaded / downloaded / replaced with zero inconvenience / overhead. The present case may exemplify how this slightly different GGUF creation option could save hundreds or even thousands of GB of unnecessary uploads / downloads whenever the metadata has to be updated independently of the tensors - a common case, especially in the days / weeks after a new model release while the metadata isn't yet stable.

The option that enables the metadata-only part1 GGUF split is shown here:

https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20--no-tensor-first-split&type=code

"--no-tensor-first-split do not add tensors to the first split (disabled by default)"

It seems like there's zero downside to using this standard GGUF conversion / format option and an enormous upside: I've seen this kind of problem occur many times now, with various quantizers' GGUF files being changed 2, 3, 4 or more times over the days / weeks after a model release before the metadata for tokens, templates, RoPE, and so on is stable.

So perhaps you might consider defaulting to this option in the future, if you agree with the potential benefits to yourself as well as your GGUF consumers? Enabling people to pull an updated metadata file without touching their old tensor files would be a big win. The only practical alternative I know of is to use a script or patch mechanism to edit the metadata in a combined header+tensor file (as sketched above), which seems much less ideal than relying on what appears to be a standard GGUF converter / loader feature.

XET would have the potential to let people download mostly only the changed parts of a large file that combines header/metadata + tensor data in part1, but ONLY if they used XET to download the original GGUFs AND kept the full associated cache directory containing the segment / rolling-hash data needed to identify changed chunks and retrieve them selectively. AFAIK that would imply a large space overhead for saving such data, which might not even be saved by default across a series of XET transfers (the cache can't grow forever...), and only a portion of clients and repos even use / enable XET. It also, sadly, lacks the capability to efficiently handle partial file updates if one did not use XET and keep the chunk cache data when the older version was downloaded (unlike rsync, which just always works). So I don't see XET (which only works on HF anyway) as a practical amelioration.

Unsloth AI org
•
edited Jun 10

@ideosphere Oh wait, I just noticed it was you! I like your idea! Will add it for large uploads from now on! Sorry, I missed your reply!
