magiccodingman committed on Nov 16, 2025

Commit 1e7d5b0 (1 parent: 2b5e172)

Add GGUF models + tokenizer with LFS

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +2 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_code.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_general.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_math.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +174 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +173 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +175 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +175 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +175 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
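
The two added rules route every `*.gguf` file and `tokenizer.json` through Git LFS. A rough way to check which paths in the commit those rules cover is sketched below. It is only an approximation: it assumes slash-free patterns matched against the file's basename with Python's `fnmatchcase`, which covers the rules in this hunk but not the full gitattributes pattern syntax. `lfs_patterns` and `is_lfs_tracked` are hypothetical helper names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Rules visible in this hunk (context lines plus the two additions).
GITATTRIBUTES = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.gguf filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
"""

def lfs_patterns(text):
    """Return the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns

def is_lfs_tracked(path, patterns):
    """Approximate check: slash-free patterns are tested against the basename."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, p) for p in patterns if "/" not in p)

PATTERNS = lfs_patterns(GITATTRIBUTES)
for path in (
    "Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf",
    "tokenizer.json",
    "Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/llamabench.txt",
):
    print(path, is_lfs_tracked(path, PATTERNS))
```

Under these assumptions the benchmark `.txt` logs stay as ordinary Git blobs, while the model weights and tokenizer are stored as LFS pointers.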

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |             pp8 |        160.47 ± 2.46 |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |           tg128 |         21.03 ± 0.10 |
+build: 92bb442ad (7040)
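
Because each `llamabench.txt` in this commit holds a Markdown table, the rows can be turned into records to compare quantizations. The following is a minimal sketch, assuming the column layout shown above and the leading `+` diff markers. `parse_llamabench` is a hypothetical helper name, not part of llama.cpp.

```python
import re

def parse_llamabench(text):
    """Parse llama-bench Markdown rows into dicts (hypothetical helper)."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.lstrip("+").strip()       # drop the diff '+' marker
        if not line.startswith("|"):
            continue                           # skip ggml_cuda_init / build lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells                     # first table row holds column names
            continue
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue                           # Markdown separator row
        row = dict(zip(header, cells))
        mean, _, std = row["t/s"].partition("±")
        row["t/s_mean"], row["t/s_std"] = float(mean), float(std)
        rows.append(row)
    return rows

# Rows copied from the F16 llamabench.txt hunk above.
SAMPLE = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |             pp8 |        160.47 ± 2.46 |
+| qwen3vl 8B F16                 |  15.26 GiB |     8.19 B | CUDA       |  35 |           tg128 |         21.03 ± 0.10 |
"""

ROWS = parse_llamabench(SAMPLE)
for r in ROWS:
    print(r["model"], r["test"], r["t/s_mean"], r["t/s_std"])
```

Here `pp8` is the prompt-processing test and `tg128` the token-generation test, so the two `t/s` values measure different phases and are not directly comparable with each other.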

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_code.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21167 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 107.906 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 3.12 seconds per pass - ETA 2.28 minutes
+[1]2.6509,[2]2.1039,[3]1.6432,[4]1.5273,[5]1.6200,[6]1.6770,[7]1.6384,[8]1.6182,[9]1.5503,[10]1.5079,[11]1.4801,[12]1.4844,[13]1.4567,[14]1.4394,[15]1.4524,[16]1.4358,[17]1.4224,[18]1.4252,[19]1.4135,[20]1.3983,[21]1.3922,[22]1.3897,[23]1.4105,[24]1.4004,[25]1.4054,[26]1.3923,[27]1.3842,[28]1.3817,[29]1.3954,[30]1.3965,[31]1.3872,[32]1.3794,[33]1.3801,[34]1.3784,[35]1.3768,[36]1.4000,[37]1.4096,[38]1.4145,[39]1.4215,[40]1.4224,[41]1.4174,[42]1.4309,[43]1.4308,[44]1.4317,
+Final estimate: PPL = 1.4317 +/- 0.00925
+llama_perf_context_print:        load time =    2083.42 ms
+llama_perf_context_print: prompt eval time =  115427.31 ms / 90112 tokens (    1.28 ms per token,   780.68 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =  117016.09 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 15713 + ( 5252 =  3680 +      80 +    1491) +        3141 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |
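
The headline numbers in a perplexity log like the one above can be extracted and cross-checked with a few regular expressions. This is a minimal sketch, assuming the line formats shown in this log. `summarize_perplexity` is a hypothetical helper name, and the three sample lines are copied from the hunk above.

```python
import re

def summarize_perplexity(log):
    """Extract PPL, chunk count, and prompt throughput from a log (hypothetical helper)."""
    final = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", log)
    chunks = re.search(r"over (\d+) chunks, n_ctx=(\d+)", log)
    perf = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log)
    ms, tokens = float(perf.group(1)), int(perf.group(2))
    n_chunks, n_ctx = int(chunks.group(1)), int(chunks.group(2))
    return {
        "ppl": float(final.group(1)),
        "ppl_err": float(final.group(2)),
        "n_chunks": n_chunks,
        "n_ctx": n_ctx,
        "tokens": tokens,
        # Every chunk is one full context window, so chunks * n_ctx should equal tokens.
        "tokens_consistent": n_chunks * n_ctx == tokens,
        "tokens_per_s": tokens / (ms / 1000.0),
    }

SAMPLE = """\
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+Final estimate: PPL = 1.4317 +/- 0.00925
+llama_perf_context_print: prompt eval time =  115427.31 ms / 90112 tokens (    1.28 ms per token,   780.68 tokens per second)
"""

SUMMARY = summarize_perplexity(SAMPLE)
print(SUMMARY)
```

For this run, 44 chunks of 2048 tokens give the 90112 tokens reported by `llama_perf_context_print`, and dividing by the prompt eval time reproduces the logged rate of about 780.68 tokens per second.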

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_general.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21167 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 46.688 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.89 seconds per pass - ETA 0.72 minutes
+[1]7.5799,[2]9.1122,[3]9.4288,[4]9.1032,[5]8.9461,[6]7.6063,[7]6.8620,[8]6.9145,[9]7.2421,[10]7.3723,[11]7.3883,[12]7.7442,[13]7.8003,[14]7.9170,[15]7.9733,
+Final estimate: PPL = 7.9733 +/- 0.17204
+llama_perf_context_print:        load time =    2089.58 ms
+llama_perf_context_print: prompt eval time =   39479.57 ms / 30720 tokens (    1.29 ms per token,   778.12 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   39987.46 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 15715 + ( 5252 =  3680 +      80 +    1491) +        3139 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |
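Each perplexity log in this benchmark set ends with a `Final estimate: PPL = ... +/- ...` line. A minimal sketch for pulling that value out of a saved log is shown here; the helper name `parse_final_ppl` and the regular expression are illustrative assumptions, not part of llama.cpp.

```python
import re

# Pattern for the summary line printed at the end of a perplexity run.
# The regex is an assumption based on the log format shown in this diff.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(log_text):
    """Return (ppl, uncertainty) from a perplexity log, or None if absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Line copied from the log above, including the leading '+' of the diff.
sample = "+Final estimate: PPL = 7.9733 +/- 0.17204\n"
print(parse_final_ppl(sample))
```

Because the pattern uses `search`, the leading `+` added by the diff view does not need to be stripped first.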

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_math.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21168 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:                          general.file_type u32              = 1
+llama_model_loader: - kv  23:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  24:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  25:               general.quantization_version u32              = 2
+llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = F16
+print_info: file size   = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors:        CUDA0 model buffer size =  3680.32 MiB
+load_tensors:        CUDA1 model buffer size =  3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.167 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.98 seconds per pass - ETA 0.78 minutes
+[1]5.0611,[2]5.6021,[3]5.7480,[4]5.8819,[5]6.0666,[6]6.0144,[7]5.9922,[8]5.9096,[9]5.9390,[10]5.9285,[11]5.9540,[12]5.9421,[13]6.0108,[14]6.0192,[15]6.0065,[16]5.9985,
+Final estimate: PPL = 5.9985 +/- 0.10966
+llama_perf_context_print:        load time =    2085.72 ms
+llama_perf_context_print: prompt eval time =   42414.95 ms / 32768 tokens (    1.29 ms per token,   772.56 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   42855.58 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 15717 + ( 5252 =  3680 +      80 +    1491) +        3137 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 19612 + ( 3858 =  3680 +      80 +      98) +         653 |
+llama_memory_breakdown_print: |   - Host               |                  15763 = 15623 +     128 +      12                |
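The `llama_kv_cache` line in this log reports 288.00 MiB in total, split as 144.00 MiB for K and 144.00 MiB for V. The figure can be reproduced from the `print_info` values in the same log, assuming the cache holds one f16 vector of width `n_embd_k_gqa` (and one of width `n_embd_v_gqa`) per layer per cell. The variable names below mirror the log fields; the formula is a sketch under that assumption.

```python
# Values copied from the log above.
n_ctx = 2048          # llama_kv_cache: 2048 cells
n_layer = 36          # print_info: n_layer
n_embd_k_gqa = 1024   # print_info: n_embd_k_gqa
n_embd_v_gqa = 1024   # print_info: n_embd_v_gqa
bytes_per_f16 = 2     # K (f16) and V (f16)

MIB = 1024 * 1024

# One K vector and one V vector per layer per cache cell.
k_mib = n_ctx * n_layer * n_embd_k_gqa * bytes_per_f16 / MIB
v_mib = n_ctx * n_layer * n_embd_v_gqa * bytes_per_f16 / MIB
total_mib = k_mib + v_mib

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
```

The per-device split in the log (128 MiB on CPU, 80 MiB on each GPU) also sums to the same 288 MiB.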

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |             pp8 |        171.51 ± 2.69 |
+| qwen3vl 8B MXFP4 MoE           |   9.72 GiB |     8.19 B | CUDA       |  35 |           tg128 |         23.90 ± 1.21 |
+build: 92bb442ad (7040)
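The `size` and `params` columns of the table above are enough to estimate bits per weight (BPW), which the perplexity logs in this diff report as 10.19 for this file and 16.00 for the F16 file. A rough check follows, assuming BPW is file size in bits divided by parameter count. Both inputs are rounded in the logs, so the result is only approximate. The function name `bits_per_weight` is hypothetical.

```python
GIB = 2 ** 30


def bits_per_weight(size_gib, params_billion):
    """Approximate BPW from a file size in GiB and a parameter count in billions."""
    return size_gib * GIB * 8 / (params_billion * 1e9)


# Sizes and parameter count copied from the logs in this diff.
mxfp4_bpw = bits_per_weight(9.72, 8.19)   # MXFP4_MOE-F16 file
f16_bpw = bits_per_weight(15.26, 8.19)    # F16 reference file

print(f"MXFP4_MOE-F16: ~{mxfp4_bpw:.2f} BPW")
print(f"F16:           ~{f16_bpw:.2f} BPW")
```

Small deviations from the logged values are expected because 9.72 GiB, 15.26 GiB, and 8.19 B are themselves rounded figures.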

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21382 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 113.902 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.21 seconds per pass - ETA 1.62 minutes
+[1]2.6487,[2]2.0993,[3]1.6410,[4]1.5260,[5]1.6196,[6]1.6767,[7]1.6385,[8]1.6178,[9]1.5501,[10]1.5076,[11]1.4798,[12]1.4842,[13]1.4565,[14]1.4393,[15]1.4523,[16]1.4357,[17]1.4224,[18]1.4251,[19]1.4134,[20]1.3983,[21]1.3922,[22]1.3896,[23]1.4106,[24]1.4005,[25]1.4055,[26]1.3924,[27]1.3842,[28]1.3818,[29]1.3955,[30]1.3966,[31]1.3872,[32]1.3794,[33]1.3801,[34]1.3785,[35]1.3770,[36]1.4001,[37]1.4097,[38]1.4147,[39]1.4216,[40]1.4226,[41]1.4176,[42]1.4311,[43]1.4310,[44]1.4318,
+Final estimate: PPL = 1.4318 +/- 0.00925
+llama_perf_context_print:        load time =    3555.99 ms
+llama_perf_context_print: prompt eval time =   84152.72 ms / 90112 tokens (    0.93 ms per token,  1070.82 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   85447.98 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17506 + (3677 =  2105 +      80 +    1491) +        2923 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
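The timing lines in this log are internally consistent: the token count equals the chunk count times `n_ctx`, and the reported throughput follows from the prompt eval time. A short sketch of that arithmetic, using values copied from the log above (variable names are illustrative):

```python
# Values copied from the perplexity and llama_perf_context_print lines above.
n_chunks = 44             # "calculating perplexity over 44 chunks"
n_ctx = 2048              # "n_ctx=2048"
prompt_eval_ms = 84152.72 # "prompt eval time = 84152.72 ms"

# Each chunk is one full context window of tokens.
tokens = n_chunks * n_ctx

tokens_per_second = tokens / (prompt_eval_ms / 1000.0)
ms_per_token = prompt_eval_ms / tokens

print(f"tokens evaluated : {tokens}")
print(f"throughput       : {tokens_per_second:.2f} tokens/s")
print(f"latency          : {ms_per_token:.2f} ms/token")
```

The same relation holds for the F16 logs earlier in the diff, where 15 and 16 chunks correspond to 30720 and 32768 tokens.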

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21382 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 195.846 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.16 seconds per pass - ETA 0.53 minutes
+[1]7.5980,[2]9.1402,[3]9.4494,[4]9.1211,[5]8.9602,[6]7.6188,[7]6.8745,[8]6.9243,[9]7.2510,[10]7.3825,[11]7.3992,[12]7.7564,[13]7.8131,[14]7.9294,[15]7.9857,
+Final estimate: PPL = 7.9857 +/- 0.17241
+llama_perf_context_print:        load time =    1740.10 ms
+llama_perf_context_print: prompt eval time =   28979.72 ms / 30720 tokens (    0.94 ms per token,  1060.05 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   29591.66 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17508 + (3677 =  2105 +      80 +    1491) +        2921 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
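
The `llama_kv_cache` size line in the log above can be reproduced from the hyperparameters printed earlier in the same log (`n_ctx = 2048`, `n_layer = 36`, `n_embd_k_gqa = n_embd_v_gqa = 1024`, f16 K and V). The sketch below assumes one K and one V tensor per layer with 2-byte elements; the helper name `kv_cache_mib` is illustrative and not part of llama.cpp.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return (K, V, total) KV-cache sizes in MiB.

    Assumes every layer stores n_ctx cells of width n_embd_*_gqa,
    each element taking bytes_per_elem bytes (2 for f16).
    """
    mib = 1024 * 1024
    k = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem / mib
    v = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem / mib
    return k, v, k + v


# Values copied from the log: 2048 cells, 36 layers, 1024-wide K and V.
k_mib, v_mib, total_mib = kv_cache_mib(2048, 36, 1024, 1024)
print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
```

Under these assumptions each layer accounts for 8 MiB, which is consistent with the reported split of 128 MiB on the CPU and 80 MiB on each GPU if the 20 offloaded layers are divided evenly between the two devices.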

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type  f16:   38 tensors
+llama_model_loader: - type q8_0:  216 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 9.72 GiB (10.19 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  5742.53 MiB
+load_tensors:        CUDA0 model buffer size =  2105.32 MiB
+load_tensors:        CUDA1 model buffer size =  2105.32 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =  1491.75 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 44.116 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.28 seconds per pass - ETA 0.60 minutes
+[1]5.0583,[2]5.5938,[3]5.7474,[4]5.8798,[5]6.0643,[6]6.0127,[7]5.9910,[8]5.9104,[9]5.9396,[10]5.9283,[11]5.9538,[12]5.9433,[13]6.0120,[14]6.0208,[15]6.0074,[16]6.0000,
+Final estimate: PPL = 6.0000 +/- 0.10970
+llama_perf_context_print:        load time =    1630.49 ms
+llama_perf_context_print: prompt eval time =   31058.33 ms / 32768 tokens (    0.95 ms per token,  1055.05 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   31573.35 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 17506 + (3677 =  2105 +      80 +    1491) +        2923 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21188 + (2283 =  2105 +      80 +      98) +         652 |
+llama_memory_breakdown_print: |   - Host               |                  5882 =  5742 +     128 +      12                |
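
The summary lines of a perplexity log like the one above can be pulled out with a few regular expressions, which also makes it easy to cross-check the reported throughput against the chunk count. This is a rough sketch: the function name `parse_perplexity_log` and the returned keys are hypothetical, and the sample text is three lines copied from the log with the leading diff `+` removed.

```python
import re

LOG = """\
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
Final estimate: PPL = 6.0000 +/- 0.10970
llama_perf_context_print: prompt eval time =   31058.33 ms / 32768 tokens (    0.95 ms per token,  1055.05 tokens per second)
"""


def parse_perplexity_log(text):
    """Extract chunk count, final PPL and prompt-eval timing from a log."""
    chunks = re.search(r"over (\d+) chunks, n_ctx=(\d+)", text)
    final = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", text)
    perf = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", text)
    n_chunks, n_ctx = int(chunks.group(1)), int(chunks.group(2))
    eval_ms, n_tokens = float(perf.group(1)), int(perf.group(2))
    return {
        "ppl": float(final.group(1)),
        "ppl_err": float(final.group(2)),
        "expected_tokens": n_chunks * n_ctx,  # chunks times context length
        "tokens": n_tokens,
        "tokens_per_s": n_tokens / (eval_ms / 1000.0),
    }


summary = parse_perplexity_log(LOG)
print(summary)
```

For this run the chunk count times `n_ctx` matches the token count in the timing line, and the recomputed throughput agrees with the logged figure to within rounding.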

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |             pp8 |       283.49 ± 18.12 |
+| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |           tg128 |         42.70 ± 2.51 |
+build: 92bb442ad (7040)
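
The two result rows of the table above can be read into a small dictionary for comparison across quantization variants. This is a minimal sketch that assumes the seven-column layout shown here; `parse_bench_rows` is a hypothetical helper, and the sample rows are copied from the log with the leading diff `+` removed. The test names are read here as prompt processing over 8 tokens (`pp8`) and text generation of 128 tokens (`tg128`).

```python
ROWS = """\
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |             pp8 |       283.49 ± 18.12 |
| qwen3vl 8B MXFP4 MoE           |   7.24 GiB |     8.19 B | CUDA       |  35 |           tg128 |         42.70 ± 2.51 |
"""


def parse_bench_rows(text):
    """Map each test name to a (mean, stddev) tuple in tokens per second."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header and separator rows: only data rows carry "±".
        if len(cells) != 7 or "±" not in cells[6]:
            continue
        mean, std = (float(part) for part in cells[6].split("±"))
        results[cells[5]] = (mean, std)
    return results


bench = parse_bench_rows(ROWS)
for test, (mean, std) in bench.items():
    print(f"{test}: {mean} t/s (stddev {std})")
```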

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 116.358 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 1.33 minutes
+[1]2.6816,[2]2.1245,[3]1.6554,[4]1.5356,[5]1.6287,[6]1.6862,[7]1.6462,[8]1.6246,[9]1.5563,[10]1.5128,[11]1.4852,[12]1.4892,[13]1.4615,[14]1.4438,[15]1.4562,[16]1.4394,[17]1.4261,[18]1.4292,[19]1.4176,[20]1.4021,[21]1.3959,[22]1.3936,[23]1.4149,[24]1.4050,[25]1.4103,[26]1.3972,[27]1.3891,[28]1.3868,[29]1.4009,[30]1.4020,[31]1.3927,[32]1.3847,[33]1.3852,[34]1.3835,[35]1.3821,[36]1.4054,[37]1.4148,[38]1.4197,[39]1.4270,[40]1.4278,[41]1.4227,[42]1.4364,[43]1.4362,[44]1.4371,
+Final estimate: PPL = 1.4371 +/- 0.00935
+llama_perf_context_print:        load time =    3129.92 ms
+llama_perf_context_print: prompt eval time =   70001.58 ms / 90112 tokens (    0.78 ms per token,  1287.29 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   71174.51 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18889 + (2593 =  1875 +      80 +     638) +        2623 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |
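
The token count and throughput in the run above follow from the chunk settings it prints. A minimal sketch of that arithmetic, using only numbers copied from the log (44 chunks, n_ctx=2048, 70001.58 ms of prompt eval time); the helper name `tokens_per_second` is hypothetical and is not part of llama.cpp:

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)

# Values copied from the perplexity log above.
n_chunks, n_ctx = 44, 2048
n_tokens = n_chunks * n_ctx          # tokens evaluated across all chunks
rate = tokens_per_second(n_tokens, 70001.58)

print(n_tokens, round(rate, 2))
```

The result should land within rounding distance of the "tokens per second" figure that llama_perf_context_print reports.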

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.748 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.51 seconds per pass - ETA 0.62 minutes
+[1]7.8037,[2]9.3697,[3]9.6610,[4]9.3034,[5]9.1287,[6]7.7470,[7]6.9907,[8]7.0282,[9]7.3555,[10]7.4973,[11]7.5223,[12]7.8922,[13]7.9420,[14]8.0671,[15]8.1239,
+Final estimate: PPL = 8.1239 +/- 0.17640
+llama_perf_context_print:        load time =    5914.73 ms
+llama_perf_context_print: prompt eval time =   25837.30 ms / 30720 tokens (    0.84 ms per token,  1188.98 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26918.12 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18688 + (2593 =  1875 +      80 +     638) +        2824 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |
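
The KV-cache sizes in the log above can be reproduced from the hyperparameters it prints (n_ctx = 2048, n_layer = 36, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 cache). A minimal sketch, assuming 2 bytes per f16 element and that the 20 offloaded layers are split evenly across the two GPUs, which the matching 80 MiB buffers suggest; `kv_cache_mib` is a hypothetical helper, not a llama.cpp function:

```python
def kv_cache_mib(n_ctx, n_layers, n_embd_kv, bytes_per_elem=2):
    """Combined size in MiB of the K and V caches for n_layers layers."""
    per_tensor = n_ctx * n_embd_kv * bytes_per_elem  # bytes for K (or V) in one layer
    return 2 * n_layers * per_tensor / (1024 ** 2)   # factor 2 covers K and V

total = kv_cache_mib(2048, 36, 1024)    # all 36 layers
per_gpu = kv_cache_mib(2048, 10, 1024)  # assumed 10 layers on each GPU
on_cpu = kv_cache_mib(2048, 16, 1024)   # the 16 layers left on the host

print(total, per_gpu, on_cpu)
```

Under these assumptions the three values correspond to the total, per-GPU, and CPU KV buffer sizes listed by llama_kv_cache.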

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.24 GiB (7.60 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3668.22 MiB
+load_tensors:        CUDA0 model buffer size =  1875.32 MiB
+load_tensors:        CUDA1 model buffer size =  1875.32 MiB
+.............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   638.59 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 43.903 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 0.48 minutes
+[1]5.1482,[2]5.6855,[3]5.8356,[4]5.9563,[5]6.1539,[6]6.0893,[7]6.0613,[8]5.9814,[9]6.0097,[10]5.9931,[11]6.0209,[12]6.0079,[13]6.0755,[14]6.0845,[15]6.0698,[16]6.0594,
+Final estimate: PPL = 6.0594 +/- 0.11154
+llama_perf_context_print:        load time =    1298.78 ms
+llama_perf_context_print: prompt eval time =   25637.97 ms / 32768 tokens (    0.78 ms per token,  1278.10 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26066.72 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18895 + (2593 =  1875 +      80 +     638) +        2618 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21430 + (2053 =  1875 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3808 =  3668 +     128 +      12                |
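
When comparing quantization variants, only the "Final estimate" line of each perplexity log is usually needed. A minimal sketch of extracting it, assuming the line format shown in these logs stays the same; `parse_final_ppl` is a hypothetical helper and the sample string is copied from the log above:

```python
import re

# Matches e.g. "Final estimate: PPL = 6.0594 +/- 0.11154"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def parse_final_ppl(log_text):
    """Return (ppl, stderr) from a perplexity log, or None if no estimate is found."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

sample = "+Final estimate: PPL = 6.0594 +/- 0.11154\n"
print(parse_final_ppl(sample))
```

Because the pattern uses `search`, the leading `+` that the diff view adds to each line does not need to be stripped first.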

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.46 GiB |     8.19 B | CUDA       |  35 |             pp8 |       244.46 ± 53.49 |
+| qwen3vl 8B MXFP4 MoE           |   7.46 GiB |     8.19 B | CUDA       |  35 |           tg128 |         35.54 ± 2.99 |
+build: 92bb442ad (7040)

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 111.604 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.83 seconds per pass - ETA 1.33 minutes
+[1]2.6617,[2]2.1069,[3]1.6451,[4]1.5298,[5]1.6233,[6]1.6807,[7]1.6417,[8]1.6206,[9]1.5522,[10]1.5094,[11]1.4820,[12]1.4857,[13]1.4580,[14]1.4408,[15]1.4535,[16]1.4370,[17]1.4237,[18]1.4266,[19]1.4151,[20]1.3997,[21]1.3936,[22]1.3911,[23]1.4120,[24]1.4019,[25]1.4068,[26]1.3937,[27]1.3854,[28]1.3830,[29]1.3967,[30]1.3978,[31]1.3885,[32]1.3806,[33]1.3814,[34]1.3796,[35]1.3780,[36]1.4014,[37]1.4109,[38]1.4159,[39]1.4229,[40]1.4239,[41]1.4189,[42]1.4324,[43]1.4323,[44]1.4330,
+Final estimate: PPL = 1.4330 +/- 0.00925
+llama_perf_context_print:        load time =    1333.63 ms
+llama_perf_context_print: prompt eval time =   75378.35 ms / 90112 tokens (    0.84 ms per token,  1195.46 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   76965.11 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18803 + (2688 =  1895 +      80 +     712) +        2615 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
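
The KV cache figures in the log above can be reproduced from the printed hyperparameters. The following is a minimal sketch, not llama.cpp's own accounting code: it assumes the f16 cache stores `n_ctx * n_embd_k_gqa` values per layer for K and the same for V, and that the 20 offloaded repeating layers are split evenly across the two GPUs (an assumption that is consistent with the two 80 MiB CUDA buffers).

```python
# Values copied from the print_info / llama_context lines of the log.
n_ctx = 2048          # llama_context: n_ctx
n_layer = 36          # print_info: n_layer
n_embd_k_gqa = 1024   # print_info: n_embd_k_gqa (n_head_kv 8 * n_embd_head_k 128)
n_embd_v_gqa = 1024   # print_info: n_embd_v_gqa
BYTES_F16 = 2
MIB = 1024 ** 2


def kv_cache_mib(layers: int) -> tuple[float, float]:
    """Return (K, V) cache sizes in MiB for the given number of layers."""
    k = n_ctx * layers * n_embd_k_gqa * BYTES_F16 / MIB
    v = n_ctx * layers * n_embd_v_gqa * BYTES_F16 / MIB
    return k, v


k_mib, v_mib = kv_cache_mib(n_layer)
total_mib = k_mib + v_mib
per_layer_mib = total_mib / n_layer

# Assumption: 20 offloaded layers, 10 per GPU; the remaining 16 stay on the CPU.
gpu_layers_each = 10
cpu_layers = n_layer - 2 * gpu_layers_each

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"per GPU = {per_layer_mib * gpu_layers_each:.2f} MiB")
print(f"CPU     = {per_layer_mib * cpu_layers:.2f} MiB")
```

Under these assumptions the sketch yields 144 MiB each for K and V, 80 MiB per GPU and 128 MiB on the CPU, which matches the `llama_kv_cache` lines.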

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 72.35 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.13 seconds per pass - ETA 0.53 minutes
+[1]7.6646,[2]9.1815,[3]9.5227,[4]9.1665,[5]8.9984,[6]7.6519,[7]6.8977,[8]6.9411,[9]7.2758,[10]7.4113,[11]7.4226,[12]7.7769,[13]7.8292,[14]7.9405,[15]8.0006,
+Final estimate: PPL = 8.0006 +/- 0.17302
+llama_perf_context_print:        load time =    2975.11 ms
+llama_perf_context_print: prompt eval time =   24772.57 ms / 30720 tokens (    0.81 ms per token,  1240.08 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   25238.18 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18803 + (2688 =  1895 +      80 +     712) +        2615 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
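
To compare runs across these log files, the headline numbers can be pulled out with a few regular expressions. This is a minimal sketch: the function name `parse_perplexity_log` and the returned dictionary keys are hypothetical, and the patterns are written only against the line formats visible in the log above, so other llama.cpp builds may need different patterns.

```python
import re

# Excerpt copied from the log above (leading "+" diff markers kept).
LOG = """\
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+Final estimate: PPL = 8.0006 +/- 0.17302
+llama_perf_context_print: prompt eval time =   24772.57 ms / 30720 tokens (    0.81 ms per token,  1240.08 tokens per second)
"""


def parse_perplexity_log(text: str) -> dict:
    """Extract chunk count, final perplexity and prompt throughput from a log."""
    chunks = re.search(r"over (\d+) chunks, n_ctx=(\d+)", text)
    final = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", text)
    perf = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", text)
    if not (chunks and final and perf):
        raise ValueError("log does not contain the expected perplexity lines")

    n_chunks, n_ctx = int(chunks.group(1)), int(chunks.group(2))
    elapsed_ms, n_tokens = float(perf.group(1)), int(perf.group(2))
    return {
        "ppl": float(final.group(1)),
        "ppl_err": float(final.group(2)),
        "expected_tokens": n_chunks * n_ctx,   # chunks times context length
        "reported_tokens": n_tokens,
        "tokens_per_second": n_tokens / (elapsed_ms / 1000.0),
    }


result = parse_perplexity_log(LOG)
print(result)
```

For this excerpt the chunk count times `n_ctx` equals the reported token count (30720), and the recomputed throughput agrees with the 1240.08 tokens per second printed by `llama_perf_context_print`.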

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q5_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.46 GiB (7.82 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3848.59 MiB
+load_tensors:        CUDA0 model buffer size =  1895.32 MiB
+load_tensors:        CUDA1 model buffer size =  1895.32 MiB
+............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   712.78 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 74.866 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.97 seconds per pass - ETA 0.52 minutes
+[1]5.0711,[2]5.6158,[3]5.7610,[4]5.9023,[5]6.0909,[6]6.0439,[7]6.0261,[8]5.9441,[9]5.9726,[10]5.9623,[11]5.9863,[12]5.9753,[13]6.0463,[14]6.0555,[15]6.0422,[16]6.0353,
+Final estimate: PPL = 6.0353 +/- 0.11051
+llama_perf_context_print:        load time =    1656.13 ms
+llama_perf_context_print: prompt eval time =   27793.88 ms / 32768 tokens (    0.85 ms per token,  1178.96 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   29083.79 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18799 + (2688 =  1895 +      80 +     712) +        2619 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21410 + (2073 =  1895 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  3988 =  3848 +     128 +      12                |
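
The `file size` and tensor-count lines in the log above can be cross-checked with simple arithmetic. The sketch below assumes that BPW means file size in bits divided by the parameter count; both inputs are the rounded values printed in the log, so the result can only be expected to agree to roughly two decimal places.

```python
GIB = 1024 ** 3

# Rounded values copied from the print_info lines of the log.
file_size_gib = 7.46      # print_info: file size
n_params = 8.19e9         # print_info: model params = 8.19 B

# Assumption: bits per weight = total file bits / parameter count.
bpw = file_size_gib * GIB * 8 / n_params

# Tensor counts per type, from the llama_model_loader lines.
tensor_counts = {"f32": 145, "q8_0": 216, "q5_K": 38}
total_tensors = sum(tensor_counts.values())

print(f"bits per weight ~ {bpw:.2f}")
print(f"total tensors   = {total_tensors}")
```

With these inputs the estimate comes out at about 7.82 BPW and the per-type counts sum to the 399 tensors reported by the loader.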

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |             pp8 |        267.72 ± 8.60 |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |           tg128 |         41.84 ± 1.10 |
+build: 92bb442ad (7040)
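
The results table above is plain Markdown, so it can be turned into records for side-by-side comparison with other quantizations. This is a rough sketch only: `parse_bench_table` is a hypothetical helper, it assumes the column layout shown in this file (including a `t/s` column of the form `mean ± std`), and the derived `ms_per_token` field is simply 1000 divided by the mean.

```python
# Table copied from the llamabench.txt diff above (leading "+" markers kept).
TABLE = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |             pp8 |        267.72 ± 8.60 |
+| qwen3vl 8B MXFP4 MoE           |   7.69 GiB |     8.19 B | CUDA       |  35 |           tg128 |         41.84 ± 1.10 |
"""


def parse_bench_table(text: str) -> list[dict]:
    """Parse a llama-bench Markdown table into a list of row dictionaries."""
    rows = []
    for raw in text.splitlines():
        line = raw.lstrip("+").strip()      # drop the diff marker, if any
        if line.startswith("|"):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])

    header, body = rows[0], rows[2:]        # rows[1] is the alignment row
    records = []
    for cells in body:
        rec = dict(zip(header, cells))
        mean, std = (float(part) for part in rec["t/s"].split("±"))
        rec["tps_mean"] = mean
        rec["tps_std"] = std
        rec["ms_per_token"] = 1000.0 / mean
        records.append(rec)
    return records


results = parse_bench_table(TABLE)
for rec in results:
    print(rec["test"], rec["tps_mean"], round(rec["ms_per_token"], 2))
```

For the `tg128` row this works out to roughly 23.9 ms per generated token.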

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21587 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 110.499 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.88 seconds per pass - ETA 1.37 minutes
+[1]2.6473,[2]2.0992,[3]1.6406,[4]1.5254,[5]1.6189,[6]1.6766,[7]1.6381,[8]1.6178,[9]1.5504,[10]1.5079,[11]1.4801,[12]1.4846,[13]1.4570,[14]1.4398,[15]1.4528,[16]1.4361,[17]1.4227,[18]1.4253,[19]1.4137,[20]1.3985,[21]1.3925,[22]1.3900,[23]1.4110,[24]1.4009,[25]1.4058,[26]1.3927,[27]1.3845,[28]1.3821,[29]1.3959,[30]1.3969,[31]1.3876,[32]1.3797,[33]1.3804,[34]1.3788,[35]1.3773,[36]1.4006,[37]1.4101,[38]1.4152,[39]1.4221,[40]1.4231,[41]1.4180,[42]1.4316,[43]1.4315,[44]1.4324,
+Final estimate: PPL = 1.4324 +/- 0.00926
+llama_perf_context_print:        load time =    1355.02 ms
+llama_perf_context_print: prompt eval time =   73422.56 ms / 90112 tokens (    0.81 ms per token,  1227.31 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   74695.84 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18703 + (2788 =  1916 +      80 +     791) +        2615 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |
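
The throughput figures in the log above can be cross-checked from the log itself. The following is a minimal sketch, not part of the benchmark tooling: it copies three lines from `perplexity_code.txt` (without the leading `+` diff marker), extracts the numbers with regular expressions written for this illustration, and confirms that 44 chunks of 2048 tokens account for the 90112 tokens reported, and that the reported tokens-per-second follows from the prompt eval time.

```python
import re

# Three lines copied from the perplexity_code.txt log (diff "+" prefix dropped).
log = """\
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
Final estimate: PPL = 1.4324 +/- 0.00926
llama_perf_context_print: prompt eval time =   73422.56 ms / 90112 tokens (    0.81 ms per token,  1227.31 tokens per second)
"""

# The regular expressions below are illustrative; they match only this log layout.
chunks, n_ctx = map(int, re.search(r"over (\d+) chunks, n_ctx=(\d+)", log).groups())
ppl, ppl_err = map(float, re.search(r"PPL = ([\d.]+) \+/- ([\d.]+)", log).groups())
ms_text, tok_text = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log).groups()
eval_ms, tokens = float(ms_text), int(tok_text)

# Every chunk is one full context window, so chunks * n_ctx gives the token count.
expected_tokens = chunks * n_ctx

# Throughput is tokens divided by elapsed seconds.
tokens_per_second = tokens / (eval_ms / 1000.0)

print(f"tokens evaluated : {tokens} (chunks * n_ctx = {expected_tokens})")
print(f"throughput       : {tokens_per_second:.2f} tokens/s")
print(f"final perplexity : {ppl} +/- {ppl_err}")
```

The same check applies to the general and math runs by swapping in their corresponding log lines.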

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21587 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 46.557 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.89 seconds per pass - ETA 0.47 minutes
+[1]7.6213,[2]9.1568,[3]9.4635,[4]9.1397,[5]8.9830,[6]7.6382,[7]6.8894,[8]6.9382,[9]7.2656,[10]7.3986,[11]7.4152,[12]7.7717,[13]7.8277,[14]7.9478,[15]8.0054,
+Final estimate: PPL = 8.0054 +/- 0.17304
+llama_perf_context_print:        load time =    1397.79 ms
+llama_perf_context_print: prompt eval time =   24926.15 ms / 30720 tokens (    0.81 ms per token,  1232.44 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   25331.66 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18701 + (2788 =  1916 +      80 +     791) +        2617 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |
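
The KV cache sizes reported by `llama_kv_cache` can be reproduced from the values printed earlier in the same log. This is a rough sketch under two assumptions that the log does not state outright: that each cache cell stores `n_embd_k_gqa` (or `n_embd_v_gqa`) f16 values per layer, and that the 20 offloaded layers are split evenly between the two GPUs. Both assumptions are consistent with the 80 MiB / 80 MiB / 128 MiB buffers shown.

```python
# Values read from the log above.
n_ctx = 2048           # llama_context: n_ctx (number of KV cells)
n_layer = 36           # print_info: n_layer
n_embd_k_gqa = 1024    # print_info: n_embd_k_gqa
n_embd_v_gqa = 1024    # print_info: n_embd_v_gqa
gpu_layers = 20        # load_tensors: offloading 20 repeating layers to GPU
n_gpus = 2             # ggml_cuda_init: found 2 CUDA devices

BYTES_F16 = 2          # the log reports K (f16) and V (f16)
MIB = 1024 * 1024

# Size of the K and V caches across all layers.
k_mib = n_ctx * n_layer * n_embd_k_gqa * BYTES_F16 / MIB
v_mib = n_ctx * n_layer * n_embd_v_gqa * BYTES_F16 / MIB
total_mib = k_mib + v_mib

# Per-layer share, then the split between GPU and CPU buffers.
per_layer_mib = total_mib / n_layer
per_gpu_mib = (gpu_layers // n_gpus) * per_layer_mib   # assumes an even split
cpu_mib = (n_layer - gpu_layers) * per_layer_mib

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"per GPU = {per_gpu_mib:.2f} MiB, CPU = {cpu_mib:.2f} MiB")
```

Under these assumptions the numbers line up with the `K (f16)`, `V (f16)`, `CUDA0`/`CUDA1` and `CPU KV buffer size` entries in the log.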

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt ADDED

	@@ -0,0 +1,174 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q6_K:   38 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.69 GiB (8.06 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4040.24 MiB
+load_tensors:        CUDA0 model buffer size =  1916.57 MiB
+load_tensors:        CUDA1 model buffer size =  1916.57 MiB
+..........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   791.61 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.629 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.92 seconds per pass - ETA 0.50 minutes
+[1]5.0688,[2]5.6108,[3]5.7595,[4]5.8901,[5]6.0777,[6]6.0242,[7]6.0018,[8]5.9206,[9]5.9485,[10]5.9376,[11]5.9650,[12]5.9543,[13]6.0222,[14]6.0285,[15]6.0150,[16]6.0078,
+Final estimate: PPL = 6.0078 +/- 0.10988
+llama_perf_context_print:        load time =    1414.36 ms
+llama_perf_context_print: prompt eval time =   27052.06 ms / 32768 tokens (    0.83 ms per token,  1211.29 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   27492.46 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18702 + (2788 =  1916 +      80 +     791) +        2615 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21388 + (2094 =  1916 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  4180 =  4040 +     128 +      12                |

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |             pp8 |        235.73 ± 5.61 |
+| qwen3vl 8B MXFP4 MoE           |   8.11 GiB |     8.19 B | CUDA       |  35 |           tg128 |         32.05 ± 2.12 |
+build: 92bb442ad (7040)

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_code.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21588 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 109.879 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.92 seconds per pass - ETA 1.40 minutes
+[1]2.6483,[2]2.1013,[3]1.6419,[4]1.5265,[5]1.6199,[6]1.6775,[7]1.6392,[8]1.6186,[9]1.5508,[10]1.5083,[11]1.4804,[12]1.4847,[13]1.4571,[14]1.4399,[15]1.4527,[16]1.4360,[17]1.4228,[18]1.4255,[19]1.4138,[20]1.3986,[21]1.3925,[22]1.3898,[23]1.4108,[24]1.4007,[25]1.4056,[26]1.3925,[27]1.3844,[28]1.3819,[29]1.3956,[30]1.3967,[31]1.3875,[32]1.3796,[33]1.3803,[34]1.3787,[35]1.3772,[36]1.4003,[37]1.4099,[38]1.4150,[39]1.4219,[40]1.4229,[41]1.4179,[42]1.4315,[43]1.4314,[44]1.4322,
+Final estimate: PPL = 1.4322 +/- 0.00926
+llama_perf_context_print:        load time =    1451.83 ms
+llama_perf_context_print: prompt eval time =   75459.88 ms / 90112 tokens (    0.84 ms per token,  1194.17 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   76786.52 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18519 + (2970 =  1955 +      80 +     935) +        2617 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |
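
The KV cache figures in the log above can be cross-checked with a little arithmetic. The following is a minimal sketch, not llama.cpp's own accounting: it assumes the cache size is simply cells × layers × per-layer K (or V) width × 2 bytes for f16, and that the 20 offloaded layers are split evenly across the two GPUs. All numeric inputs are copied from the log; the variable and function names are illustrative.

```python
# Inputs copied from the log; the formula is an assumption that happens to
# reproduce the reported numbers, not a statement of llama.cpp internals.
n_cells = 2048        # llama_kv_cache: 2048 cells (n_ctx = 2048)
n_layer = 36          # print_info: n_layer = 36
n_embd_k_gqa = 1024   # print_info: n_embd_k_gqa (n_head_kv 8 * n_embd_head_k 128)
n_embd_v_gqa = 1024   # print_info: n_embd_v_gqa
bytes_per_f16 = 2     # K and V are stored as f16 in this run

MIB = 1024 * 1024


def cache_mib(n_embd_gqa: int) -> float:
    """Size in MiB of one cache (K or V) across all layers."""
    return n_cells * n_layer * n_embd_gqa * bytes_per_f16 / MIB


k_mib = cache_mib(n_embd_k_gqa)
v_mib = cache_mib(n_embd_v_gqa)
total_mib = k_mib + v_mib
per_layer_mib = total_mib / n_layer

# Assumed placement: 20 of 36 layers offloaded, 10 per GPU, 16 left on the CPU.
cpu_mib = per_layer_mib * 16
gpu_mib = per_layer_mib * 10

print(f"K = {k_mib} MiB, V = {v_mib} MiB, total = {total_mib} MiB")
print(f"CPU = {cpu_mib} MiB, each GPU = {gpu_mib} MiB")
```

Under these assumptions the results agree with the logged values: 144 MiB each for K and V, 288 MiB in total, 128 MiB on the CPU and 80 MiB on each CUDA device.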

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_general.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21588 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 47.478 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.95 seconds per pass - ETA 0.48 minutes
+[1]7.6034,[2]9.1323,[3]9.4408,[4]9.1136,[5]8.9542,[6]7.6151,[7]6.8714,[8]6.9203,[9]7.2476,[10]7.3810,[11]7.3968,[12]7.7530,[13]7.8096,[14]7.9281,[15]7.9846,
+Final estimate: PPL = 7.9846 +/- 0.17232
+llama_perf_context_print:        load time =    1541.75 ms
+llama_perf_context_print: prompt eval time =   26161.57 ms / 30720 tokens (    0.85 ms per token,  1174.24 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26597.65 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18520 + (2970 =  1955 +      80 +     935) +        2615 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |
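
For comparing runs, the headline numbers can be pulled out of a perplexity log with a few regular expressions. This is a rough sketch that assumes the line formats shown in the log above; the `summarize` helper and its field names are hypothetical, and the sample text is three lines copied from that log.

```python
import re

# Three lines copied from the perplexity_general log above.
log_text = """\
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
Final estimate: PPL = 7.9846 +/- 0.17232
llama_perf_context_print: prompt eval time =   26161.57 ms / 30720 tokens (    0.85 ms per token,  1174.24 tokens per second)
"""


def summarize(text: str) -> dict:
    """Extract the final PPL, its error bar, and throughput from one log."""
    chunks = re.search(r"over (\d+) chunks, n_ctx=(\d+)", text)
    ppl = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", text)
    perf = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", text)
    if not (chunks and ppl and perf):
        raise ValueError("log does not contain the expected perplexity lines")

    n_chunks, n_ctx = int(chunks.group(1)), int(chunks.group(2))
    elapsed_ms, tokens = float(perf.group(1)), int(perf.group(2))
    return {
        "ppl": float(ppl.group(1)),
        "ppl_err": float(ppl.group(2)),
        "expected_tokens": n_chunks * n_ctx,  # chunks * context length
        "tokens": tokens,
        "tokens_per_second": tokens / (elapsed_ms / 1000.0),
    }


summary = summarize(log_text)
print(summary)
```

The `expected_tokens` field is a sanity check: the number of chunks times `n_ctx` should equal the token count reported on the prompt eval line, as it does here (15 × 2048 = 30720).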

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_math.txt ADDED

	@@ -0,0 +1,173 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21578 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 8.11 GiB (8.50 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  4389.72 MiB
+load_tensors:        CUDA0 model buffer size =  1955.32 MiB
+load_tensors:        CUDA1 model buffer size =  1955.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   935.34 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 79.689 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.14 seconds per pass - ETA 0.57 minutes
+[1]5.0604,[2]5.5995,[3]5.7505,[4]5.8841,[5]6.0706,[6]6.0197,[7]5.9978,[8]5.9144,[9]5.9425,[10]5.9319,[11]5.9574,[12]5.9453,[13]6.0137,[14]6.0216,[15]6.0091,[16]6.0019,
+Final estimate: PPL = 6.0019 +/- 0.10972
+llama_perf_context_print:        load time =    1653.03 ms
+llama_perf_context_print: prompt eval time =   30129.05 ms / 32768 tokens (    0.92 ms per token,  1087.59 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   30616.99 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18511 + (2970 =  1955 +      80 +     935) +        2624 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21350 + (2133 =  1955 +      80 +      98) +         640 |
+llama_memory_breakdown_print: |   - Host               |                  4529 =  4389 +     128 +      12                |
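
The summary numbers in a perplexity log like the one above can be pulled out with a short script. This is a minimal sketch, assuming the log keeps the same `calculating perplexity over ... chunks` and `Final estimate` lines shown here. The helper name `parse_ppl_log` and the dictionary keys are hypothetical and not part of llama.cpp.

```python
import re

# Hypothetical helper (not part of llama.cpp): extracts summary numbers
# from the text of a perplexity log shaped like the one above.
def parse_ppl_log(text):
    final = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", text)
    setup = re.search(r"calculating perplexity over (\d+) chunks, n_ctx=(\d+)", text)
    if final is None or setup is None:
        raise ValueError("log does not contain the expected perplexity lines")
    chunks, n_ctx = int(setup.group(1)), int(setup.group(2))
    return {
        "ppl": float(final.group(1)),
        "uncertainty": float(final.group(2)),
        "chunks": chunks,
        "n_ctx": n_ctx,
        # Each chunk fills one context window, so this is the evaluated token count.
        "tokens": chunks * n_ctx,
    }

# Two lines copied from the log above; the leading "+" is the diff marker.
sample = """\
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+Final estimate: PPL = 6.0019 +/- 0.10972
"""
result = parse_ppl_log(sample)
print(result)
```

The derived token count (16 x 2048 = 32768) agrees with the `prompt eval time` line in the same log.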

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt ADDED

The diff for this file is too large to render.

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt ADDED

	@@ -0,0 +1,11 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |             pp8 |        290.93 ± 5.73 |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |           tg128 |         45.82 ± 1.57 |
+build: 92bb442ad (7040)
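
The pipe-delimited result table above can be read into dictionaries for comparison across quantization variants. The following is a minimal sketch, assuming the table keeps the column names shown here. The helper name `parse_bench_table` and the extra `t/s_spread` key are hypothetical.

```python
# Hypothetical helper: turns a pipe-delimited benchmark table, as printed
# in the log above, into a list of dicts keyed by the header names.
def parse_bench_table(text):
    rows = []
    header = None
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # drop the diff marker and padding
        if not line.startswith("|"):
            continue  # not a table line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row made of dashes and colons
        row = dict(zip(header, cells))
        mean, _, spread = row["t/s"].partition("±")
        row["t/s"] = float(mean)
        row["t/s_spread"] = float(spread)
        rows.append(row)
    return rows

# Table lines copied from the log above.
sample = """\
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |             pp8 |        290.93 ± 5.73 |
+| qwen3vl 8B MXFP4 MoE           |   7.21 GiB |     8.19 B | CUDA       |  35 |           tg128 |         45.82 ± 1.57 |
"""
rows = parse_bench_table(sample)
for row in rows:
    print(row["test"], row["t/s"], row["t/s_spread"])
```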

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 150.435 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 2.86 seconds per pass - ETA 2.08 minutes
+[1]2.7260,[2]2.1441,[3]1.6679,[4]1.5442,[5]1.6418,[6]1.7019,[7]1.6613,[8]1.6423,[9]1.5712,[10]1.5282,[11]1.4980,[12]1.4998,[13]1.4703,[14]1.4518,[15]1.4659,[16]1.4485,[17]1.4342,[18]1.4372,[19]1.4250,[20]1.4094,[21]1.4027,[22]1.3996,[23]1.4210,[24]1.4103,[25]1.4153,[26]1.4017,[27]1.3934,[28]1.3906,[29]1.4037,[30]1.4047,[31]1.3951,[32]1.3871,[33]1.3876,[34]1.3857,[35]1.3841,[36]1.4080,[37]1.4179,[38]1.4232,[39]1.4310,[40]1.4315,[41]1.4260,[42]1.4404,[43]1.4401,[44]1.4410,
+Final estimate: PPL = 1.4410 +/- 0.00923
+llama_perf_context_print:        load time =    1390.22 ms
+llama_perf_context_print: prompt eval time =   72975.93 ms / 90112 tokens (    0.81 ms per token,  1234.82 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   74201.95 ms / 90113 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18909 + (2570 =  1870 +      80 +     620) +        2626 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
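
Two figures reported in the log above can be checked by hand: the KV cache size and the bits-per-weight (BPW) value. The sketch below only does arithmetic on numbers copied from the log. It assumes the f16 cache stores one K and one V vector of width `n_embd_k_gqa` / `n_embd_v_gqa` per cell per layer, and that BPW is file size in bits divided by the parameter count. Both assumptions appear consistent with the printed values, but the parameter count is itself rounded, so the BPW result is approximate.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the log above.
n_ctx = 2048          # KV cache cells
n_layer = 36
n_embd_k_gqa = 1024
n_embd_v_gqa = 1024
bytes_per_f16 = 2

# Assumed layout: one K and one V vector per cell per layer, stored as f16.
k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_f16
v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_f16
kv_mib = (k_bytes + v_bytes) / MIB
print(f"K = {k_bytes / MIB:.2f} MiB, V = {v_bytes / MIB:.2f} MiB, total = {kv_mib:.2f} MiB")

# Assumed definition: bits per weight = file size in bits / parameter count.
file_size_gib = 7.21  # "file size = 7.21 GiB"
params = 8.19e9       # "model params = 8.19 B" (rounded in the log)
bpw = file_size_gib * GIB * 8 / params
print(f"BPW ~ {bpw:.2f}")
```

Under these assumptions the totals match the `llama_kv_cache: size = 288.00 MiB` line and the `7.56 BPW` figure in the log.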

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 74.452 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.90 seconds per pass - ETA 0.47 minutes
+[1]7.8150,[2]9.3362,[3]9.6859,[4]9.3691,[5]9.1857,[6]7.8080,[7]7.0354,[8]7.0834,[9]7.4342,[10]7.5682,[11]7.6025,[12]7.9602,[13]8.0216,[14]8.1600,[15]8.2160,
+Final estimate: PPL = 8.2160 +/- 0.17522
+llama_perf_context_print:        load time =    1397.05 ms
+llama_perf_context_print: prompt eval time =   24635.38 ms / 30720 tokens (    0.80 ms per token,  1246.99 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   25079.68 ms / 30721 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18911 + (2570 =  1870 +      80 +     620) +        2625 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
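
The throughput and memory figures in the log above are internally consistent, which can be confirmed with a few lines of arithmetic. This is a minimal sketch using only numbers copied from the log. The variable names are illustrative.

```python
# Values copied from the "prompt eval time" line of the log above.
prompt_eval_ms = 24635.38
prompt_tokens = 30720

ms_per_token = prompt_eval_ms / prompt_tokens
tokens_per_second = prompt_tokens / (prompt_eval_ms / 1000.0)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")

# Memory breakdown rows from the log above, in MiB:
# (self, model, context, compute); "self" should be the sum of the other three.
breakdown = {
    "CUDA0": (2570, 1870, 80, 620),
    "CUDA1": (2048, 1870, 80, 98),
    "Host": (3781, 3641, 128, 12),
}
sums_match = {
    name: self_mib == model + context + compute
    for name, (self_mib, model, context, compute) in breakdown.items()
}
print(sums_match)
```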

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt ADDED

	@@ -0,0 +1,175 @@

+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21580 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
+llama_model_loader: - kv   1:                               general.type str              = model
+llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv   3:                           general.finetune str              = Thinking-unsloth
+llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
+llama_model_loader: - kv   5:                         general.size_label str              = 8B
+llama_model_loader: - kv   6:                            general.license str              = apache-2.0
+llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
+llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 VL 8B Thinking
+llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
+llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv  12:                        qwen3vl.block_count u32              = 36
+llama_model_loader: - kv  13:                     qwen3vl.context_length u32              = 262144
+llama_model_loader: - kv  14:                   qwen3vl.embedding_length u32              = 4096
+llama_model_loader: - kv  15:                qwen3vl.feed_forward_length u32              = 12288
+llama_model_loader: - kv  16:               qwen3vl.attention.head_count u32              = 32
+llama_model_loader: - kv  17:            qwen3vl.attention.head_count_kv u32              = 8
+llama_model_loader: - kv  18:                     qwen3vl.rope.freq_base f32              = 5000000.000000
+llama_model_loader: - kv  19:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
+llama_model_loader: - kv  20:               qwen3vl.attention.key_length u32              = 128
+llama_model_loader: - kv  21:             qwen3vl.attention.value_length u32              = 128
+llama_model_loader: - kv  22:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
+llama_model_loader: - kv  23:                 qwen3vl.n_deepstack_layers u32              = 3
+llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
+llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
+llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
+llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151654
+llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
+llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
+llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - kv  34:               general.quantization_version u32              = 2
+llama_model_loader: - kv  35:                          general.file_type u32              = 38
+llama_model_loader: - type  f32:  145 tensors
+llama_model_loader: - type q8_0:  216 tensors
+llama_model_loader: - type q4_K:    1 tensors
+llama_model_loader: - type mxfp4:   37 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type   = MXFP4 MoE
+print_info: file size   = 7.21 GiB (7.56 BPW)
+load: printing all EOG tokens:
+load:   - 151643 ('<|endoftext|>')
+load:   - 151645 ('<|im_end|>')
+load:   - 151662 ('<|fim_pad|>')
+load:   - 151663 ('<|repo_name|>')
+load:   - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch             = qwen3vl
+print_info: vocab_only       = 0
+print_info: n_ctx_train      = 262144
+print_info: n_embd           = 4096
+print_info: n_embd_inp       = 16384
+print_info: n_layer          = 36
+print_info: n_head           = 32
+print_info: n_head_kv        = 8
+print_info: n_rot            = 128
+print_info: n_swa            = 0
+print_info: is_swa_any       = 0
+print_info: n_embd_head_k    = 128
+print_info: n_embd_head_v    = 128
+print_info: n_gqa            = 4
+print_info: n_embd_k_gqa     = 1024
+print_info: n_embd_v_gqa     = 1024
+print_info: f_norm_eps       = 0.0e+00
+print_info: f_norm_rms_eps   = 1.0e-06
+print_info: f_clamp_kqv      = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale    = 0.0e+00
+print_info: f_attn_scale     = 0.0e+00
+print_info: n_ff             = 12288
+print_info: n_expert         = 0
+print_info: n_expert_used    = 0
+print_info: n_expert_groups  = 0
+print_info: n_group_used     = 0
+print_info: causal attn      = 1
+print_info: pooling type     = 0
+print_info: rope type        = 40
+print_info: rope scaling     = linear
+print_info: freq_base_train  = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn  = 262144
+print_info: rope_finetuned   = unknown
+print_info: mrope sections   = [24, 20, 20, 0]
+print_info: model type       = 8B
+print_info: model params     = 8.19 B
+print_info: general.name     = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type       = BPE
+print_info: n_vocab          = 151936
+print_info: n_merges         = 151387
+print_info: BOS token        = 151643 '<|endoftext|>'
+print_info: EOS token        = 151645 '<|im_end|>'
+print_info: EOT token        = 151645 '<|im_end|>'
+print_info: PAD token        = 151654 '<|vision_pad|>'
+print_info: LF token         = 198 'Ċ'
+print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
+print_info: FIM MID token    = 151660 '<|fim_middle|>'
+print_info: FIM PAD token    = 151662 '<|fim_pad|>'
+print_info: FIM REP token    = 151663 '<|repo_name|>'
+print_info: FIM SEP token    = 151664 '<|file_sep|>'
+print_info: EOG token        = 151643 '<|endoftext|>'
+print_info: EOG token        = 151645 '<|im_end|>'
+print_info: EOG token        = 151662 '<|fim_pad|>'
+print_info: EOG token        = 151663 '<|repo_name|>'
+print_info: EOG token        = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors:   CPU_Mapped model buffer size =  3641.67 MiB
+load_tensors:        CUDA0 model buffer size =  1870.32 MiB
+load_tensors:        CUDA1 model buffer size =  1870.32 MiB
+..............................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max     = 1
+llama_context: n_ctx         = 2048
+llama_context: n_ctx_seq     = 2048
+llama_context: n_batch       = 2048
+llama_context: n_ubatch      = 512
+llama_context: causal_attn   = 1
+llama_context: flash_attn    = auto
+llama_context: kv_unified    = false
+llama_context: freq_base     = 5000000.0
+llama_context: freq_scale    = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context:        CPU  output buffer size =     0.58 MiB
+llama_kv_cache:        CPU KV buffer size =   128.00 MiB
+llama_kv_cache:      CUDA0 KV buffer size =    80.00 MiB
+llama_kv_cache:      CUDA1 KV buffer size =    80.00 MiB
+llama_kv_cache: size =  288.00 MiB (  2048 cells,  36 layers,  1/1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context:      CUDA0 compute buffer size =   620.05 MiB
+llama_context:      CUDA1 compute buffer size =    98.02 MiB
+llama_context:  CUDA_Host compute buffer size =    12.02 MiB
+llama_context: graph nodes  = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 43.944 ms
+perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.88 seconds per pass - ETA 0.50 minutes
+[1]5.2676,[2]5.9139,[3]6.0729,[4]6.2064,[5]6.4182,[6]6.3748,[7]6.3456,[8]6.2774,[9]6.2990,[10]6.2919,[11]6.3062,[12]6.2931,[13]6.3551,[14]6.3618,[15]6.3496,[16]6.3384,
+Final estimate: PPL = 6.3384 +/- 0.11579
+llama_perf_context_print:        load time =    1409.68 ms
+llama_perf_context_print: prompt eval time =   26373.45 ms / 32768 tokens (    0.80 ms per token,  1242.46 tokens per second)
+llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
+llama_perf_context_print:       total time =   26876.81 ms / 32769 tokens
+llama_perf_context_print:    graphs reused =          0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
+llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24107 = 18908 + (2570 =  1870 +      80 +     620) +        2627 |
+llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24124 = 21434 + (2048 =  1870 +      80 +      98) +         641 |
+llama_memory_breakdown_print: |   - Host               |                  3781 =  3641 +     128 +      12                |
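
The KV cache and file-size figures in this log follow from the model dimensions it prints. The sketch below recomputes them as a sanity check. It assumes f16 cache entries (2 bytes each, as the log states) and, as an assumption not stated in the log, that the 20 offloaded layers are split evenly across the two GPUs, leaving 16 on the CPU. The helper name `kv_cache_mib` is hypothetical.

```python
# Values taken from the log above.
N_CTX = 2048          # llama_context: n_ctx
N_LAYER = 36          # print_info: n_layer
N_EMBD_K_GQA = 1024   # print_info: n_embd_k_gqa
N_EMBD_V_GQA = 1024   # print_info: n_embd_v_gqa
BYTES_F16 = 2         # K (f16), V (f16)
MIB = 1024 ** 2

def kv_cache_mib(n_layers: int) -> float:
    """KV cache size in MiB for n_layers layers at N_CTX cells (K plus V)."""
    k_bytes = N_CTX * n_layers * N_EMBD_K_GQA * BYTES_F16
    v_bytes = N_CTX * n_layers * N_EMBD_V_GQA * BYTES_F16
    return (k_bytes + v_bytes) / MIB

# Whole model: the log reports 288.00 MiB (144 MiB K + 144 MiB V).
print("all 36 layers:", kv_cache_mib(N_LAYER), "MiB")

# Assumed split: 10 layers per GPU, 16 layers on the CPU.
print("10 layers (one GPU):", kv_cache_mib(10), "MiB")  # log: 80.00 MiB
print("16 layers (CPU):", kv_cache_mib(16), "MiB")      # log: 128.00 MiB

# Bits per weight from the rounded file size and parameter count in the log.
file_size_gib = 7.21
n_params = 8.19e9
bits_per_weight = file_size_gib * 2**30 * 8 / n_params
print(f"bits per weight: {bits_per_weight:.2f}")        # log: 7.56 BPW
```

Because the file size and parameter count are rounded in the log, the bits-per-weight result is only expected to match to about two decimals.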

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt ADDED

The diff for this file is too large to render. See raw diff

Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt ADDED

The diff for this file is too large to render. See raw diff