Add GGUF models + tokenizer with LFS
- .gitattributes +2 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_code.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_general.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_math.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +174 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +173 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +175 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0
.gitattributes
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
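The two added lines route every `.gguf` file and `tokenizer.json` through Git LFS. As a rough illustration of which paths the visible patterns would catch, here is a minimal Python sketch. It assumes simplified matching (a pattern without a slash is compared against the file's base name using `fnmatchcase`), which approximates but does not reproduce Git's full gitattributes rules. `LFS_PATTERNS`, `is_lfs_tracked`, and the sample paths are hypothetical names chosen for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Only the five patterns visible in the hunk; the rest of .gitattributes is not shown.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "*.gguf", "tokenizer.json"]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: compare each slash-free pattern to the base name."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Sample paths are illustrative, not a listing of the repository.
    for candidate in [
        "models/example.gguf",
        "tokenizer.json",
        "Benchmarks/example/llamabench.txt",
    ]:
        print(candidate, is_lfs_tracked(candidate))
```

Under these assumptions the benchmark text files stay as ordinary Git blobs, while model weights and the tokenizer are stored as LFS pointers.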
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/llamabench.txt
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | pp8 | 160.47 ± 2.46 |
+| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | tg128 | 21.03 ± 0.10 |
+
+build: 92bb442ad (7040)
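In llama-bench output, `pp8` is prompt processing over an 8-token prompt and `tg128` is generation of 128 tokens, each reported as mean ± standard deviation in tokens per second. For comparing this run against the other quantization variants in the commit, the rows can be pulled into numbers. A minimal sketch, assuming the seven-column layout shown in the table; `BENCH_ROWS` and `parse_bench_rows` are illustrative names, not part of llama.cpp.

```python
import re

# Rows copied from the llama-bench table for the F16 model.
BENCH_ROWS = """\
| model | size | params | backend | ngl | test | t/s |
| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | pp8 | 160.47 ± 2.46 |
| qwen3vl 8B F16 | 15.26 GiB | 8.19 B | CUDA | 35 | tg128 | 21.03 ± 0.10 |
"""


def parse_bench_rows(text):
    """Map each test name to a (mean, stddev) pair in tokens per second."""
    results = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 7:
            continue  # not a table row
        match = re.fullmatch(r"([\d.]+)\s*±\s*([\d.]+)", cells[6])
        if match is None:
            continue  # header or separator row
        results[cells[5]] = (float(match.group(1)), float(match.group(2)))
    return results


if __name__ == "__main__":
    for test, (mean, stddev) in parse_bench_rows(BENCH_ROWS).items():
        # Convert throughput into average latency per token.
        print(f"{test}: {mean} ± {stddev} t/s, {1000.0 / mean:.2f} ms/token")
```

Keying the result by test name makes it straightforward to line up the same test across several `llamabench.txt` files.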
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_code.txt
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21167 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 25: general.quantization_version u32 = 2
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 35: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors: CUDA0 model buffer size = 3680.32 MiB
+load_tensors: CUDA1 model buffer size = 3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1491.75 MiB
+llama_context: CUDA1 compute buffer size = 98.02 MiB
+llama_context: CUDA_Host compute buffer size = 12.02 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 107.906 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 3.12 seconds per pass - ETA 2.28 minutes
+[1]2.6509,[2]2.1039,[3]1.6432,[4]1.5273,[5]1.6200,[6]1.6770,[7]1.6384,[8]1.6182,[9]1.5503,[10]1.5079,[11]1.4801,[12]1.4844,[13]1.4567,[14]1.4394,[15]1.4524,[16]1.4358,[17]1.4224,[18]1.4252,[19]1.4135,[20]1.3983,[21]1.3922,[22]1.3897,[23]1.4105,[24]1.4004,[25]1.4054,[26]1.3923,[27]1.3842,[28]1.3817,[29]1.3954,[30]1.3965,[31]1.3872,[32]1.3794,[33]1.3801,[34]1.3784,[35]1.3768,[36]1.4000,[37]1.4096,[38]1.4145,[39]1.4215,[40]1.4224,[41]1.4174,[42]1.4309,[43]1.4308,[44]1.4317,
+Final estimate: PPL = 1.4317 +/- 0.00925
+
+llama_perf_context_print: load time = 2083.42 ms
+llama_perf_context_print: prompt eval time = 115427.31 ms / 90112 tokens ( 1.28 ms per token, 780.68 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 117016.09 ms / 90113 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15713 + ( 5252 = 3680 + 80 + 1491) + 3141 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
+llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
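Several figures in this perplexity log can be cross-checked with simple arithmetic: the KV cache size follows from the context length, layer count, and per-layer K/V width, and the reported throughput follows from the chunk count and prompt eval time. The sketch below is only a sanity check on the logged values, assuming an f16 cache at 2 bytes per element; the constant and function names are illustrative and not taken from llama.cpp.

```python
MIB = 1024 ** 2

# Values read from the log.
N_CTX = 2048                # llama_context: n_ctx
N_LAYER = 36                # print_info: n_layer
N_EMBD_K_GQA = 1024         # print_info: n_embd_k_gqa (n_embd_v_gqa is the same)
F16_BYTES = 2               # K and V are reported as f16
N_CHUNKS = 44               # perplexity: calculating perplexity over 44 chunks
PROMPT_EVAL_MS = 115427.31  # llama_perf_context_print: prompt eval time


def kv_tensor_mib(n_ctx, n_layer, n_embd_gqa, bytes_per_element):
    """Size in MiB of the K (or V) cache summed over all layers."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_element / MIB


k_mib = kv_tensor_mib(N_CTX, N_LAYER, N_EMBD_K_GQA, F16_BYTES)
total_kv_mib = 2 * k_mib                # K plus V
per_layer_mib = total_kv_mib / N_LAYER  # share held by a single layer

tokens = N_CHUNKS * N_CTX               # each chunk is one full context
tokens_per_second = tokens / (PROMPT_EVAL_MS / 1000.0)

if __name__ == "__main__":
    print(f"K cache: {k_mib:.2f} MiB, K+V: {total_kv_mib:.2f} MiB")
    print(f"per layer: {per_layer_mib:.2f} MiB")
    print(f"tokens: {tokens}, throughput: {tokens_per_second:.2f} tok/s")
```

The results line up with the log: 144 MiB each for K and V, 288 MiB in total, 90112 tokens, and about 780.68 tokens per second. At 8 MiB per layer, the 80 MiB KV buffer on each GPU is consistent with 10 layers per card and the 128 MiB CPU buffer with the remaining 16, although that split is inferred rather than stated in the log.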
Benchmarks/Qwen3-VL-8B-Thinking-unsloth-F16/perplexity_general.txt
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21167 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3vl
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
+llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
+llama_model_loader: - kv 4: general.basename str = Qwen3-VL
+llama_model_loader: - kv 5: general.size_label str = 8B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
+llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
+llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
+llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
+llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
+llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
+llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
+llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
+llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
+llama_model_loader: - kv 25: general.quantization_version u32 = 2
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 35: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type f16: 254 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 15.26 GiB (16.00 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3vl
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 4096
+print_info: n_embd_inp = 16384
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 12288
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 40
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: mrope sections = [24, 20, 20, 0]
+print_info: model type = 8B
+print_info: model params = 8.19 B
+print_info: general.name = Qwen3 VL 8B Thinking Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
+load_tensors: CUDA0 model buffer size = 3680.32 MiB
+load_tensors: CUDA1 model buffer size = 3680.32 MiB
+.......................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1491.75 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 46.688 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 2.89 seconds per pass - ETA 0.72 minutes
|
| 162 |
+
[1]7.5799,[2]9.1122,[3]9.4288,[4]9.1032,[5]8.9461,[6]7.6063,[7]6.8620,[8]6.9145,[9]7.2421,[10]7.3723,[11]7.3883,[12]7.7442,[13]7.8003,[14]7.9170,[15]7.9733,
|
| 163 |
+
Final estimate: PPL = 7.9733 +/- 0.17204
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 2089.58 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 39479.57 ms / 30720 tokens ( 1.29 ms per token, 778.12 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 39987.46 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15715 + ( 5252 = 3680 + 80 + 1491) + 3139 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
|
|
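The KV-cache figures in the log above can be reproduced from the logged hyperparameters. The following is a minimal sketch, not llama.cpp's own accounting code: the helper name `kv_cache_mib` is hypothetical, and the split of 10 offloaded layers per GPU (with the remaining 16 on the CPU) is an assumption inferred from "offloaded 20/37 layers to GPU" and the two equal CUDA buffers.

```python
def kv_cache_mib(n_cells: int, n_layers: int, n_embd_gqa: int, bytes_per_elem: int = 2) -> float:
    """Size in MiB of the K cache alone (or the V cache alone) for n_layers layers.

    bytes_per_elem=2 corresponds to the f16 cache type reported in the log.
    """
    return n_cells * n_layers * n_embd_gqa * bytes_per_elem / (1024 ** 2)


# Values taken from the log: n_ctx = 2048 cells, 36 layers, n_embd_k_gqa = n_embd_v_gqa = 1024.
k_mib = kv_cache_mib(2048, 36, 1024)
v_mib = kv_cache_mib(2048, 36, 1024)
total_mib = k_mib + v_mib

# Assumed layer placement (not stated explicitly in the log): 10 layers per GPU, 16 on the CPU.
per_gpu_mib = 2 * kv_cache_mib(2048, 10, 1024)
cpu_mib = 2 * kv_cache_mib(2048, 16, 1024)

print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {total_mib:.2f} MiB")
print(f"per GPU: {per_gpu_mib:.2f} MiB, CPU: {cpu_mib:.2f} MiB")
```

Under those assumptions the result matches the logged 144.00 MiB each for K and V, 80.00 MiB per CUDA device, and 128.00 MiB on the CPU.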
@@ -0,0 +1,173 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21168 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/Qwen3-VL-8B-Thinking-unsloth-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: general.file_type u32 = 1
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 35 |
+
llama_model_loader: - kv 24: qwen3vl.n_deepstack_layers u32 = 3
|
| 36 |
+
llama_model_loader: - kv 25: general.quantization_version u32 = 2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 151643
|
| 45 |
+
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type f16: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = F16
|
| 51 |
+
print_info: file size = 15.26 GiB (16.00 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 15623.18 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 3680.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 3680.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1491.75 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 44.167 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 2.98 seconds per pass - ETA 0.78 minutes
|
| 162 |
+
[1]5.0611,[2]5.6021,[3]5.7480,[4]5.8819,[5]6.0666,[6]6.0144,[7]5.9922,[8]5.9096,[9]5.9390,[10]5.9285,[11]5.9540,[12]5.9421,[13]6.0108,[14]6.0192,[15]6.0065,[16]5.9985,
|
| 163 |
+
Final estimate: PPL = 5.9985 +/- 0.10966
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 2085.72 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 42414.95 ms / 32768 tokens ( 1.29 ms per token, 772.56 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 42855.58 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15717 + ( 5252 = 3680 + 80 + 1491) + 3137 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19612 + ( 3858 = 3680 + 80 + 98) + 653 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 15763 = 15623 + 128 + 12 |
|
|
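The token count and throughput reported at the end of the F16 log above follow directly from the chunk count and context size. This is a small sketch of that arithmetic using only numbers printed in the logs; the function names are hypothetical and not part of llama.cpp.

```python
def tokens_evaluated(n_chunks: int, n_ctx: int) -> int:
    """Total prompt tokens processed by a perplexity run: one full context per chunk."""
    return n_chunks * n_ctx


def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput derived from the logged prompt eval time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# F16 run above: 16 chunks at n_ctx=2048, prompt eval time 42414.95 ms.
n_tok = tokens_evaluated(16, 2048)
tps = tokens_per_second(n_tok, 42414.95)
print(n_tok, round(tps, 2))
```

The same two functions apply to the other runs, for example 15 chunks and 39479.57 ms for the preceding log.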
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | pp8 | 171.51 ± 2.69 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | tg128 | 23.90 ± 1.21 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
|
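For comparing benchmark runs, the llama-bench table above can be read into structured records. The parser below is an illustrative sketch: `parse_bench_row` is a hypothetical helper that assumes the seven-column layout shown above (model, size, params, backend, ngl, test, t/s) and a `mean ± std` throughput cell.

```python
def parse_bench_row(line: str) -> dict:
    """Parse one data row of the llama-bench markdown table into a dict."""
    # Drop the outer pipes, then split the remaining text into its seven cells.
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    model, size, params, backend, ngl, test, tps = cells
    mean, _, std = tps.partition("±")
    return {
        "model": model,
        "size": size,
        "params": params,
        "backend": backend,
        "ngl": int(ngl),
        "test": test,
        "tps_mean": float(mean),
        "tps_std": float(std),
    }


rows = [
    "| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | pp8 | 171.51 ± 2.69 |",
    "| qwen3vl 8B MXFP4 MoE | 9.72 GiB | 8.19 B | CUDA | 35 | tg128 | 23.90 ± 1.21 |",
]
for row in rows:
    rec = parse_bench_row(row)
    print(rec["test"], rec["tps_mean"], rec["tps_std"])
```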
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21382 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type f16: 38 tensors
|
| 49 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 9.72 GiB (10.19 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 2105.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 2105.32 MiB
|
| 126 |
+
...............................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 1491.75 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 113.902 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 2.21 seconds per pass - ETA 1.62 minutes
|
| 163 |
+
[1]2.6487,[2]2.0993,[3]1.6410,[4]1.5260,[5]1.6196,[6]1.6767,[7]1.6385,[8]1.6178,[9]1.5501,[10]1.5076,[11]1.4798,[12]1.4842,[13]1.4565,[14]1.4393,[15]1.4523,[16]1.4357,[17]1.4224,[18]1.4251,[19]1.4134,[20]1.3983,[21]1.3922,[22]1.3896,[23]1.4106,[24]1.4005,[25]1.4055,[26]1.3924,[27]1.3842,[28]1.3818,[29]1.3955,[30]1.3966,[31]1.3872,[32]1.3794,[33]1.3801,[34]1.3785,[35]1.3770,[36]1.4001,[37]1.4097,[38]1.4147,[39]1.4216,[40]1.4226,[41]1.4176,[42]1.4311,[43]1.4310,[44]1.4318,
|
| 164 |
+
Final estimate: PPL = 1.4318 +/- 0.00925
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 3555.99 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 84152.72 ms / 90112 tokens ( 0.93 ms per token, 1070.82 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 85447.98 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17506 + (3677 = 2105 + 80 + 1491) + 2923 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
|
|
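The bits-per-weight (BPW) figure printed next to the file size in these logs can be approximated from the file size and parameter count. This is a rough sketch rather than llama.cpp's exact computation: `bits_per_weight` is a hypothetical helper, and because the logged size and parameter count are rounded, the result only agrees with the log to about two decimal places.

```python
def bits_per_weight(file_size_gib: float, params_billion: float) -> float:
    """Approximate average bits per weight from file size (GiB) and parameter count (billions)."""
    total_bits = file_size_gib * (1024 ** 3) * 8
    return total_bits / (params_billion * 1e9)


# Values from the logs: 9.72 GiB for the MXFP4 MoE file and 15.26 GiB for F16, both with 8.19 B params.
mxfp4_bpw = bits_per_weight(9.72, 8.19)
f16_bpw = bits_per_weight(15.26, 8.19)
print(f"MXFP4 MoE: {mxfp4_bpw:.2f} BPW, F16: {f16_bpw:.2f} BPW")
```

The logs report 10.19 BPW and 16.00 BPW for these two files; the small differences come from the rounded inputs.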
@@ -0,0 +1,174 @@
|
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21382 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 38
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 38 tensors
llama_model_loader: - type q8_0: 216 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 9.72 GiB (10.19 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vl
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_embd_inp = 16384
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
load_tensors: CUDA0 model buffer size = 2105.32 MiB
load_tensors: CUDA1 model buffer size = 2105.32 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1491.75 MiB
llama_context: CUDA1 compute buffer size = 98.02 MiB
llama_context: CUDA_Host compute buffer size = 12.02 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 195.846 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.16 seconds per pass - ETA 0.53 minutes
[1]7.5980,[2]9.1402,[3]9.4494,[4]9.1211,[5]8.9602,[6]7.6188,[7]6.8745,[8]6.9243,[9]7.2510,[10]7.3825,[11]7.3992,[12]7.7564,[13]7.8131,[14]7.9294,[15]7.9857,
Final estimate: PPL = 7.9857 +/- 0.17241

llama_perf_context_print: load time = 1740.10 ms
llama_perf_context_print: prompt eval time = 28979.72 ms / 30720 tokens ( 0.94 ms per token, 1060.05 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 29591.66 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17508 + (3677 = 2105 + 80 + 1491) + 2921 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
|
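The KV cache sizes in the run above (288.00 MiB total, 144.00 MiB each for K and V) follow from the logged dimensions. The sketch below is an illustrative Python calculation, assuming one f16 row of `n_embd_k_gqa` (or `n_embd_v_gqa`) values per cell per layer; the helper name `kv_cache_mib` is hypothetical. The per-device split at the end additionally assumes the 20 offloaded layers are divided evenly between the two GPUs, which the log does not state explicitly.

```python
def kv_cache_mib(n_cells: int, n_layer: int, n_embd_gqa: int,
                 bytes_per_elem: int = 2) -> float:
    """Size in MiB of one cache tensor set (K or V), assuming f16 elements."""
    return n_cells * n_layer * n_embd_gqa * bytes_per_elem / (1024 * 1024)

# Values copied from the log: 2048 cells, 36 layers, n_embd_k_gqa = n_embd_v_gqa = 1024.
k_mib = kv_cache_mib(2048, 36, 1024)
v_mib = kv_cache_mib(2048, 36, 1024)
total_mib = k_mib + v_mib

# Per-layer cost (K + V), then an assumed 10 / 10 / 16 layer split
# across CUDA0, CUDA1 and CPU (20 of 36 repeating layers offloaded).
per_layer_mib = total_mib / 36
gpu_mib = 10 * per_layer_mib
cpu_mib = 16 * per_layer_mib
print(k_mib, v_mib, total_mib, gpu_mib, cpu_mib)
```

Under these assumptions the per-device figures line up with the `CUDA0`, `CUDA1` and `CPU` KV buffer sizes reported by `llama_kv_cache`.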
@@ -0,0 +1,174 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 38
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 38 tensors
llama_model_loader: - type q8_0: 216 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 9.72 GiB (10.19 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vl
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_embd_inp = 16384
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 5742.53 MiB
load_tensors: CUDA0 model buffer size = 2105.32 MiB
load_tensors: CUDA1 model buffer size = 2105.32 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1491.75 MiB
llama_context: CUDA1 compute buffer size = 98.02 MiB
llama_context: CUDA_Host compute buffer size = 12.02 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.116 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.28 seconds per pass - ETA 0.60 minutes
[1]5.0583,[2]5.5938,[3]5.7474,[4]5.8798,[5]6.0643,[6]6.0127,[7]5.9910,[8]5.9104,[9]5.9396,[10]5.9283,[11]5.9538,[12]5.9433,[13]6.0120,[14]6.0208,[15]6.0074,[16]6.0000,
Final estimate: PPL = 6.0000 +/- 0.10970

llama_perf_context_print: load time = 1630.49 ms
llama_perf_context_print: prompt eval time = 31058.33 ms / 32768 tokens ( 0.95 ms per token, 1055.05 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 31573.35 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17506 + (3677 = 2105 + 80 + 1491) + 2923 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21188 + (2283 = 2105 + 80 + 98) + 652 |
llama_memory_breakdown_print: | - Host | 5882 = 5742 + 128 + 12 |
|
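The "BPW" (bits per weight) figure printed next to the file size in the run above can be approximated from the two rounded values in the log. This is a rough Python sketch with a hypothetical helper name, `bits_per_weight`; llama.cpp presumably works from exact byte and parameter counts, so the result here only matches the logged 10.19 to within rounding.

```python
def bits_per_weight(file_size_gib: float, n_params: float) -> float:
    """Approximate average bits stored per model parameter."""
    total_bits = file_size_gib * 1024**3 * 8
    return total_bits / n_params

# Rounded values copied from the log: 9.72 GiB file, 8.19 B parameters.
bpw = bits_per_weight(9.72, 8.19e9)
print(round(bpw, 2))
```

Because both inputs are rounded to two decimals in the log, small deviations in the last digit are expected when the same check is applied to other files.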
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 8B MXFP4 MoE | 7.24 GiB | 8.19 B | CUDA | 35 | pp8 | 283.49 ± 18.12 |
| qwen3vl 8B MXFP4 MoE | 7.24 GiB | 8.19 B | CUDA | 35 | tg128 | 42.70 ± 2.51 |

build: 92bb442ad (7040)
@@ -0,0 +1,174 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 38
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q8_0: 216 tensors
llama_model_loader: - type q4_K: 38 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 7.24 GiB (7.60 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vl
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_embd_inp = 16384
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 116.358 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.83 seconds per pass - ETA 1.33 minutes
|
| 163 |
+
[1]2.6816,[2]2.1245,[3]1.6554,[4]1.5356,[5]1.6287,[6]1.6862,[7]1.6462,[8]1.6246,[9]1.5563,[10]1.5128,[11]1.4852,[12]1.4892,[13]1.4615,[14]1.4438,[15]1.4562,[16]1.4394,[17]1.4261,[18]1.4292,[19]1.4176,[20]1.4021,[21]1.3959,[22]1.3936,[23]1.4149,[24]1.4050,[25]1.4103,[26]1.3972,[27]1.3891,[28]1.3868,[29]1.4009,[30]1.4020,[31]1.3927,[32]1.3847,[33]1.3852,[34]1.3835,[35]1.3821,[36]1.4054,[37]1.4148,[38]1.4197,[39]1.4270,[40]1.4278,[41]1.4227,[42]1.4364,[43]1.4362,[44]1.4371,
|
| 164 |
+
Final estimate: PPL = 1.4371 +/- 0.00935
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 3129.92 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 70001.58 ms / 90112 tokens ( 0.78 ms per token, 1287.29 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 71174.51 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18889 + (2593 = 1875 + 80 + 638) + 2623 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
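The KV-cache figures in the log above can be cross-checked against the hyperparameters it prints. This is a minimal sketch, assuming the cache stores one f16 element (2 bytes) per K or V channel, per cell, per layer, and that the 20 offloaded layers are split evenly across the two GPUs (inferred from the equal buffer sizes, not stated in the log). The helper name `kv_cache_mib` is illustrative and not part of llama.cpp.

```python
MIB = 1024 ** 2

def kv_cache_mib(n_cells, n_layer, n_embd_gqa, bytes_per_elem=2):
    """Size in MiB of one side (K or V) of the cache, assuming f16 storage."""
    return n_cells * n_layer * n_embd_gqa * bytes_per_elem / MIB

# Values taken from the log: n_ctx = 2048, n_layer = 36,
# n_embd_k_gqa = n_embd_v_gqa = 1024.
k_mib = kv_cache_mib(2048, 36, 1024)
v_mib = kv_cache_mib(2048, 36, 1024)
total_mib = k_mib + v_mib

# Assumption: 10 layers on each GPU, the remaining 16 layers on the CPU.
per_layer_mib = total_mib / 36
gpu_mib_each = per_layer_mib * 10
cpu_mib = per_layer_mib * 16

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"per GPU = {gpu_mib_each:.2f} MiB, CPU = {cpu_mib:.2f} MiB")
```

Under these assumptions the numbers agree with the `llama_kv_cache` lines: 144 MiB each for K and V, 80 MiB per GPU, and 128 MiB on the CPU.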
|
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21380 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.24 GiB (7.60 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.748 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 2.51 seconds per pass - ETA 0.62 minutes
|
| 163 |
+
[1]7.8037,[2]9.3697,[3]9.6610,[4]9.3034,[5]9.1287,[6]7.7470,[7]6.9907,[8]7.0282,[9]7.3555,[10]7.4973,[11]7.5223,[12]7.8922,[13]7.9420,[14]8.0671,[15]8.1239,
|
| 164 |
+
Final estimate: PPL = 8.1239 +/- 0.17640
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 5914.73 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 25837.30 ms / 30720 tokens ( 0.84 ms per token, 1188.98 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 26918.12 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18688 + (2593 = 1875 + 80 + 638) + 2824 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
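The timing lines in this run can be sanity-checked with a little arithmetic: the token count is the number of chunks times the context size, and throughput is tokens divided by elapsed time. This is a small sketch using the values printed above; the function names are illustrative only.

```python
def tokens_processed(n_chunks, n_ctx):
    """Total tokens evaluated when every chunk fills the context."""
    return n_chunks * n_ctx

def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and a wall time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)

# From the log: 15 chunks at n_ctx=2048, prompt eval time = 25837.30 ms.
n_tokens = tokens_processed(15, 2048)
tps = tokens_per_second(n_tokens, 25837.30)
ms_per_token = 25837.30 / n_tokens

print(n_tokens, round(tps, 2), round(ms_per_token, 2))
```

The results line up with the reported 30720 tokens, 0.84 ms per token, and roughly 1188.98 tokens per second.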
|
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.24 GiB (7.60 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3668.22 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1875.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1875.32 MiB
|
| 126 |
+
.............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 638.59 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 43.903 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.83 seconds per pass - ETA 0.48 minutes
|
| 163 |
+
[1]5.1482,[2]5.6855,[3]5.8356,[4]5.9563,[5]6.1539,[6]6.0893,[7]6.0613,[8]5.9814,[9]6.0097,[10]5.9931,[11]6.0209,[12]6.0079,[13]6.0755,[14]6.0845,[15]6.0698,[16]6.0594,
|
| 164 |
+
Final estimate: PPL = 6.0594 +/- 0.11154
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1298.78 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 25637.97 ms / 32768 tokens ( 0.78 ms per token, 1278.10 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 26066.72 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18895 + (2593 = 1875 + 80 + 638) + 2618 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21430 + (2053 = 1875 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3808 = 3668 + 128 + 12 |
|
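The throughput figures in the `llama_perf_context_print` lines above can be cross-checked by hand. The following is a minimal sketch, not part of the original log: it copies the numbers from the log as constants, and the helper name `throughput` is made up for illustration.

```python
# Values copied from the perplexity log above; nothing here calls llama.cpp.
N_CHUNKS = 16               # "calculating perplexity over 16 chunks"
N_CTX = 2048                # "n_ctx=2048"
PROMPT_EVAL_MS = 25637.97   # "prompt eval time = 25637.97 ms"


def throughput(tokens: int, elapsed_ms: float) -> float:
    """Return tokens per second for a token count and an elapsed time in ms."""
    return tokens / (elapsed_ms / 1000.0)


# Each chunk is one full context window, so the token total is chunks * n_ctx.
tokens = N_CHUNKS * N_CTX
tps = throughput(tokens, PROMPT_EVAL_MS)
ms_per_token = PROMPT_EVAL_MS / tokens

print(f"tokens evaluated: {tokens}")
print(f"tokens per second: {tps:.2f}")
print(f"ms per token: {ms_per_token:.2f}")
```

The recomputed token count, tokens per second, and milliseconds per token agree with the values the log reports (32768 tokens, 1278.10 tokens per second, 0.78 ms per token).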
|
The diff for this file is too large to render.
See raw diff
|
@@ -0,0 +1,11 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3vl 8B MXFP4 MoE | 7.46 GiB | 8.19 B | CUDA | 35 | pp8 | 244.46 ± 53.49 |
|
| 9 |
+
| qwen3vl 8B MXFP4 MoE | 7.46 GiB | 8.19 B | CUDA | 35 | tg128 | 35.54 ± 2.99 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 111.604 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.83 seconds per pass - ETA 1.33 minutes
|
| 163 |
+
[1]2.6617,[2]2.1069,[3]1.6451,[4]1.5298,[5]1.6233,[6]1.6807,[7]1.6417,[8]1.6206,[9]1.5522,[10]1.5094,[11]1.4820,[12]1.4857,[13]1.4580,[14]1.4408,[15]1.4535,[16]1.4370,[17]1.4237,[18]1.4266,[19]1.4151,[20]1.3997,[21]1.3936,[22]1.3911,[23]1.4120,[24]1.4019,[25]1.4068,[26]1.3937,[27]1.3854,[28]1.3830,[29]1.3967,[30]1.3978,[31]1.3885,[32]1.3806,[33]1.3814,[34]1.3796,[35]1.3780,[36]1.4014,[37]1.4109,[38]1.4159,[39]1.4229,[40]1.4239,[41]1.4189,[42]1.4324,[43]1.4323,[44]1.4330,
|
| 164 |
+
Final estimate: PPL = 1.4330 +/- 0.00925
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1333.63 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 75378.35 ms / 90112 tokens ( 0.84 ms per token, 1195.46 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 76965.11 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18803 + (2688 = 1895 + 80 + 712) + 2615 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
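The `llama_kv_cache` sizes reported in the log above follow from the model dimensions printed earlier in the same log. The sketch below reproduces that arithmetic under stated assumptions: all constants are copied from the log, the helper `cache_mib` is hypothetical, and the 10/10 split of the 20 offloaded layers across the two GPUs is inferred from the buffer sizes rather than stated in the log.

```python
# Values copied from the log above; nothing here queries llama.cpp.
N_CTX = 2048          # llama_context: n_ctx
N_LAYER = 36          # print_info: n_layer
N_EMBD_K_GQA = 1024   # print_info: n_embd_k_gqa
N_EMBD_V_GQA = 1024   # print_info: n_embd_v_gqa
BYTES_F16 = 2         # the log reports K (f16) and V (f16)
MIB = 1024 * 1024


def cache_mib(n_layers: int, width: int) -> float:
    """Size in MiB of the K (or V) cache for n_layers layers of the given width."""
    return n_layers * N_CTX * width * BYTES_F16 / MIB


k_mib = cache_mib(N_LAYER, N_EMBD_K_GQA)
v_mib = cache_mib(N_LAYER, N_EMBD_V_GQA)
total_mib = k_mib + v_mib

# Assumption: the 20 offloaded layers are divided evenly, 10 per GPU,
# which leaves 16 layers with their KV cache on the CPU.
gpu_layers_each = 10
cpu_layers = N_LAYER - 2 * gpu_layers_each
per_gpu_mib = (cache_mib(gpu_layers_each, N_EMBD_K_GQA)
               + cache_mib(gpu_layers_each, N_EMBD_V_GQA))
cpu_mib = (cache_mib(cpu_layers, N_EMBD_K_GQA)
           + cache_mib(cpu_layers, N_EMBD_V_GQA))

print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {total_mib:.2f} MiB")
print(f"per GPU: {per_gpu_mib:.2f} MiB, CPU: {cpu_mib:.2f} MiB")
```

Under these assumptions the results line up with the logged 144.00 MiB for K and for V, 288.00 MiB in total, 80.00 MiB per CUDA device, and 128.00 MiB on the CPU.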
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 72.35 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 2.13 seconds per pass - ETA 0.53 minutes
|
| 163 |
+
[1]7.6646,[2]9.1815,[3]9.5227,[4]9.1665,[5]8.9984,[6]7.6519,[7]6.8977,[8]6.9411,[9]7.2758,[10]7.4113,[11]7.4226,[12]7.7769,[13]7.8292,[14]7.9405,[15]8.0006,
|
| 164 |
+
Final estimate: PPL = 8.0006 +/- 0.17302
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 2975.11 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 24772.57 ms / 30720 tokens ( 0.81 ms per token, 1240.08 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 25238.18 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18803 + (2688 = 1895 + 80 + 712) + 2615 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
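The `llama_memory_breakdown_print` rows above encode two sums: `total = free + self + unaccounted` and `self = model + context + compute`. The sketch below parses the CUDA0 row and checks both. It is an illustration only: the regular expression and variable names are assumptions, not a format llama.cpp documents. The sums can differ by about 1 MiB, presumably because each column is rounded to whole MiB independently (the unrounded buffer sizes earlier in the log are 1895.32, 80.00, and 712.78 MiB).

```python
import re

# One row copied from the memory breakdown above.
row = "| - CUDA0 (RTX 3090) | 24107 = 18803 + (2688 = 1895 + 80 + 712) + 2615 |"

# total = free + (self = model + context + compute) + unaccounted
pattern = re.compile(
    r"(\d+)\s*=\s*(\d+)\s*\+\s*\((\d+)\s*=\s*(\d+)\s*\+\s*(\d+)\s*\+\s*(\d+)\)\s*\+\s*(\d+)"
)
match = pattern.search(row)
if match is None:
    raise ValueError("row does not look like a device memory breakdown")

total, free, self_mib, model, context, compute, unaccounted = map(int, match.groups())

self_sum = model + context + compute
total_sum = free + self_mib + unaccounted

print(f"self: reported {self_mib}, recomputed {self_sum}")
print(f"total: reported {total}, recomputed {total_sum}")
```

A tolerance of 1 to 2 MiB is a reasonable choice when validating these rows automatically, since an exact equality check would fail on this row.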
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q5_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.46 GiB (7.82 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 3848.59 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1895.32 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1895.32 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 712.78 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 74.866 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.97 seconds per pass - ETA 0.52 minutes
|
| 163 |
+
[1]5.0711,[2]5.6158,[3]5.7610,[4]5.9023,[5]6.0909,[6]6.0439,[7]6.0261,[8]5.9441,[9]5.9726,[10]5.9623,[11]5.9863,[12]5.9753,[13]6.0463,[14]6.0555,[15]6.0422,[16]6.0353,
|
| 164 |
+
Final estimate: PPL = 6.0353 +/- 0.11051
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1656.13 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 27793.88 ms / 32768 tokens ( 0.85 ms per token, 1178.96 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 29083.79 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18799 + (2688 = 1895 + 80 + 712) + 2619 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21410 + (2073 = 1895 + 80 + 98) + 640 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 3988 = 3848 + 128 + 12 |
|
|
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,11 @@
|
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 8B MXFP4 MoE | 7.69 GiB | 8.19 B | CUDA | 35 | pp8 | 267.72 ± 8.60 |
| qwen3vl 8B MXFP4 MoE | 7.69 GiB | 8.19 B | CUDA | 35 | tg128 | 41.84 ± 1.10 |

build: 92bb442ad (7040)
|
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21587 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 110.499 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.88 seconds per pass - ETA 1.37 minutes
|
| 163 |
+
[1]2.6473,[2]2.0992,[3]1.6406,[4]1.5254,[5]1.6189,[6]1.6766,[7]1.6381,[8]1.6178,[9]1.5504,[10]1.5079,[11]1.4801,[12]1.4846,[13]1.4570,[14]1.4398,[15]1.4528,[16]1.4361,[17]1.4227,[18]1.4253,[19]1.4137,[20]1.3985,[21]1.3925,[22]1.3900,[23]1.4110,[24]1.4009,[25]1.4058,[26]1.3927,[27]1.3845,[28]1.3821,[29]1.3959,[30]1.3969,[31]1.3876,[32]1.3797,[33]1.3804,[34]1.3788,[35]1.3773,[36]1.4006,[37]1.4101,[38]1.4152,[39]1.4221,[40]1.4231,[41]1.4180,[42]1.4316,[43]1.4315,[44]1.4324,
|
| 164 |
+
Final estimate: PPL = 1.4324 +/- 0.00926
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1355.02 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 73422.56 ms / 90112 tokens ( 0.81 ms per token, 1227.31 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 74695.84 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18703 + (2788 = 1916 + 80 + 791) + 2615 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
|
|
@@ -0,0 +1,174 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21587 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
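
The 288.00 MiB KV cache total logged above can be reproduced from the cache shape in the same line. The sketch below is a back-of-the-envelope check, not llama.cpp's allocation code. The helper name `kv_cache_mib` is hypothetical. It assumes an f16 cache (2 bytes per element) and a K/V width of 1024 per cell per layer, which is the `n_embd_k_gqa` / `n_embd_v_gqa` value that `print_info` reports for this model. The 10/10/16 layer split between CUDA0, CUDA1 and CPU is inferred from the equal GPU buffer sizes and is also an assumption.

```python
def kv_cache_mib(n_cells, n_layer, n_embd_kv, bytes_per_elem=2):
    """Size in MiB of the K (or V) half of the cache for the given shape."""
    return n_cells * n_layer * n_embd_kv * bytes_per_elem / (1024 ** 2)

# Values taken from the log: 2048 cells, 36 layers, f16 K and V.
k_mib = kv_cache_mib(2048, 36, 1024)
v_mib = kv_cache_mib(2048, 36, 1024)
total_mib = k_mib + v_mib

# K + V cost of a single layer, used to check the per-device buffers.
per_layer_mib = total_mib / 36

# Assumed split: 20 offloaded layers shared evenly by two GPUs, 16 left on CPU.
gpu_mib = per_layer_mib * 10
cpu_mib = per_layer_mib * 16

print(k_mib, v_mib, total_mib, gpu_mib, cpu_mib)
```

Under these assumptions the per-device figures line up with the 80.00 MiB CUDA buffers and the 128.00 MiB CPU buffer in the log.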
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 46.557 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.89 seconds per pass - ETA 0.47 minutes
|
| 163 |
+
[1]7.6213,[2]9.1568,[3]9.4635,[4]9.1397,[5]8.9830,[6]7.6382,[7]6.8894,[8]6.9382,[9]7.2656,[10]7.3986,[11]7.4152,[12]7.7717,[13]7.8277,[14]7.9478,[15]8.0054,
|
| 164 |
+
Final estimate: PPL = 8.0054 +/- 0.17304
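
When comparing several of these logs, it can help to pull the perplexity numbers out programmatically. The following is a minimal sketch that assumes the two line formats seen above (the running `[chunk]value,` list and the `Final estimate` line). The function names `parse_running_ppl` and `parse_final_ppl` are hypothetical helpers and not part of llama.cpp.

```python
import re

def parse_running_ppl(line):
    """Return [(chunk_index, running_ppl), ...] from a '[1]7.6213,[2]9.1568,' line."""
    return [(int(i), float(v)) for i, v in re.findall(r"\[(\d+)\]([0-9.]+)", line)]

def parse_final_ppl(line):
    """Return (ppl, stderr) from a 'Final estimate' line, or None if it does not match."""
    m = re.search(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)", line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

# Sample lines copied (and shortened) from the log above.
running = parse_running_ppl("[1]7.6213,[2]9.1568,[15]8.0054,")
final = parse_final_ppl("Final estimate: PPL = 8.0054 +/- 0.17304")
print(running[-1], final)
```

In this log the last running value equals the final estimate, since the running figure is cumulative over all chunks processed so far.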
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1397.79 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 24926.15 ms / 30720 tokens ( 0.81 ms per token, 1232.44 tokens per second)
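
The token count and throughput in the line above follow from the run parameters logged earlier: 15 chunks at `n_ctx=2048`. A short arithmetic check, written as a sketch with illustrative variable names:

```python
# Values copied from the perplexity log.
n_chunks = 15
n_ctx = 2048
prompt_eval_ms = 24926.15

# Every chunk is one full context window of tokens.
n_tokens = n_chunks * n_ctx

tokens_per_second = n_tokens / (prompt_eval_ms / 1000.0)
ms_per_token = prompt_eval_ms / n_tokens

print(n_tokens, round(tokens_per_second, 2), round(ms_per_token, 2))
```

The results agree with the logged 30720 tokens, roughly 1232 tokens per second, and 0.81 ms per token.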
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 25331.66 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18701 + (2788 = 1916 + 80 + 791) + 2617 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
|
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21589 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q6_K: 38 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 7.69 GiB (8.06 BPW)
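
The bits-per-weight (BPW) figure in the line above is approximately the file size in bits divided by the parameter count. The sketch below checks this with the rounded values printed in these logs (7.69 GiB for the Q6_K hybrid and 8.11 GiB for the Q8 hybrid, both with 8.19 B parameters). The helper name `bits_per_weight` is hypothetical, and because the inputs are rounded the results only match the logged values to about two decimal places.

```python
def bits_per_weight(file_size_gib, n_params_billion):
    """Approximate bits per weight from a file size in GiB and a parameter count in billions."""
    file_bits = file_size_gib * (1024 ** 3) * 8
    return file_bits / (n_params_billion * 1e9)

bpw_q6k = bits_per_weight(7.69, 8.19)  # log reports 8.06 BPW
bpw_q8 = bits_per_weight(8.11, 8.19)   # log reports 8.50 BPW

print(round(bpw_q6k, 3), round(bpw_q8, 3))
```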
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3vl
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 4096
|
| 65 |
+
print_info: n_embd_inp = 16384
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 12288
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = 0
|
| 90 |
+
print_info: rope type = 40
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 97 |
+
print_info: model type = 8B
|
| 98 |
+
print_info: model params = 8.19 B
|
| 99 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 4040.24 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1916.57 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1916.57 MiB
|
| 126 |
+
..........................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 791.61 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.629 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.92 seconds per pass - ETA 0.50 minutes
|
| 163 |
+
[1]5.0688,[2]5.6108,[3]5.7595,[4]5.8901,[5]6.0777,[6]6.0242,[7]6.0018,[8]5.9206,[9]5.9485,[10]5.9376,[11]5.9650,[12]5.9543,[13]6.0222,[14]6.0285,[15]6.0150,[16]6.0078,
|
| 164 |
+
Final estimate: PPL = 6.0078 +/- 0.10988
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 1414.36 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 27052.06 ms / 32768 tokens ( 0.83 ms per token, 1211.29 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 27492.46 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18702 + (2788 = 1916 + 80 + 791) + 2615 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21388 + (2094 = 1916 + 80 + 98) + 641 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 4180 = 4040 + 128 + 12 |
|
|
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 8B MXFP4 MoE | 8.11 GiB | 8.19 B | CUDA | 35 | pp8 | 235.73 ± 5.61 |
| qwen3vl 8B MXFP4 MoE | 8.11 GiB | 8.19 B | CUDA | 35 | tg128 | 32.05 ± 2.12 |

build: 92bb442ad (7040)
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21588 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 109.879 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.92 seconds per pass - ETA 1.40 minutes
|
| 162 |
+
[1]2.6483,[2]2.1013,[3]1.6419,[4]1.5265,[5]1.6199,[6]1.6775,[7]1.6392,[8]1.6186,[9]1.5508,[10]1.5083,[11]1.4804,[12]1.4847,[13]1.4571,[14]1.4399,[15]1.4527,[16]1.4360,[17]1.4228,[18]1.4255,[19]1.4138,[20]1.3986,[21]1.3925,[22]1.3898,[23]1.4108,[24]1.4007,[25]1.4056,[26]1.3925,[27]1.3844,[28]1.3819,[29]1.3956,[30]1.3967,[31]1.3875,[32]1.3796,[33]1.3803,[34]1.3787,[35]1.3772,[36]1.4003,[37]1.4099,[38]1.4150,[39]1.4219,[40]1.4229,[41]1.4179,[42]1.4315,[43]1.4314,[44]1.4322,
|
| 163 |
+
Final estimate: PPL = 1.4322 +/- 0.00926
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1451.83 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 75459.88 ms / 90112 tokens ( 0.84 ms per token, 1194.17 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 76786.52 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18519 + (2970 = 1955 + 80 + 935) + 2617 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
|
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21588 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 47.478 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.95 seconds per pass - ETA 0.48 minutes
|
| 162 |
+
[1]7.6034,[2]9.1323,[3]9.4408,[4]9.1136,[5]8.9542,[6]7.6151,[7]6.8714,[8]6.9203,[9]7.2476,[10]7.3810,[11]7.3968,[12]7.7530,[13]7.8096,[14]7.9281,[15]7.9846,
|
| 163 |
+
Final estimate: PPL = 7.9846 +/- 0.17232
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1541.75 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 26161.57 ms / 30720 tokens ( 0.85 ms per token, 1174.24 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 26597.65 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18520 + (2970 = 1955 + 80 + 935) + 2615 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
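The KV-cache figures in the log above can be cross-checked by hand. The following is a minimal sketch, assuming the f16 cache stores `n_embd_k_gqa` (1024) values per cell per layer at 2 bytes each, and that the cache is divided between devices layer by layer. The helper name `kv_half_mib` is hypothetical and not part of llama.cpp.

```python
# Cross-check of the KV-cache sizes reported in the log above.
# Assumption: an f16 K (or V) cache holds n_embd_gqa values per cell
# per layer, at 2 bytes per value.

MIB = 1024 * 1024


def kv_half_mib(n_cells: int, n_layers: int, n_embd_gqa: int,
                bytes_per_value: int = 2) -> float:
    """Size in MiB of the K cache alone (V is identical when the widths match)."""
    return n_cells * n_layers * n_embd_gqa * bytes_per_value / MIB


# Values taken from the log: 2048 cells, 36 layers, n_embd_k_gqa = 1024.
k_mib = kv_half_mib(n_cells=2048, n_layers=36, n_embd_gqa=1024)
total_mib = 2 * k_mib            # K plus V
per_layer_mib = total_mib / 36   # share of the cache owned by one layer

# Buffer sizes per device as printed by llama_kv_cache in the log.
reported_mib = {"CPU": 128.0, "CUDA0": 80.0, "CUDA1": 80.0}
layers_by_device = {name: size / per_layer_mib
                    for name, size in reported_mib.items()}

print(f"K: {k_mib:.2f} MiB, K+V: {total_mib:.2f} MiB")
print(f"per layer: {per_layer_mib:.2f} MiB")
print(layers_by_device)
```

Under these assumptions the result agrees with the logged 144.00 MiB each for K and V, and the implied per-device layer counts add up to the 36 layers of the model, with 20 of them on the two GPUs.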
|
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21578 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 254 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 8.11 GiB (8.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3vl
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 4096
|
| 64 |
+
print_info: n_embd_inp = 16384
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 12288
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = 0
|
| 89 |
+
print_info: rope type = 40
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 96 |
+
print_info: model type = 8B
|
| 97 |
+
print_info: model params = 8.19 B
|
| 98 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 4389.72 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1955.32 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1955.32 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 935.34 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 79.689 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 2.14 seconds per pass - ETA 0.57 minutes
|
| 162 |
+
[1]5.0604,[2]5.5995,[3]5.7505,[4]5.8841,[5]6.0706,[6]6.0197,[7]5.9978,[8]5.9144,[9]5.9425,[10]5.9319,[11]5.9574,[12]5.9453,[13]6.0137,[14]6.0216,[15]6.0091,[16]6.0019,
|
| 163 |
+
Final estimate: PPL = 6.0019 +/- 0.10972
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1653.03 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 30129.05 ms / 32768 tokens ( 0.92 ms per token, 1087.59 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 30616.99 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18511 + (2970 = 1955 + 80 + 935) + 2624 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21350 + (2133 = 1955 + 80 + 98) + 640 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 4529 = 4389 + 128 + 12 |
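The timing lines in the log above follow from the chunk count and context size. This is a small sketch of that arithmetic, assuming every chunk is evaluated at the full `n_ctx` of 2048 tokens; the variable names are illustrative only.

```python
# Cross-check of the perplexity-run timing reported in the log above.
# Assumption: each of the 16 chunks is a full n_ctx-token window.

n_chunks = 16              # "calculating perplexity over 16 chunks"
n_ctx = 2048               # "n_ctx=2048"
prompt_eval_ms = 30129.05  # "prompt eval time = 30129.05 ms"

n_tokens = n_chunks * n_ctx
ms_per_token = prompt_eval_ms / n_tokens
tokens_per_second = n_tokens / (prompt_eval_ms / 1000.0)

print(f"tokens evaluated:  {n_tokens}")
print(f"ms per token:      {ms_per_token:.2f}")
print(f"tokens per second: {tokens_per_second:.2f}")
```

The token count and both rates can be compared directly with the `prompt eval time` line printed by `llama_perf_context_print`.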
|
|
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| qwen3vl 8B MXFP4 MoE | 7.21 GiB | 8.19 B | CUDA | 35 | pp8 | 290.93 ± 5.73 |
+| qwen3vl 8B MXFP4 MoE | 7.21 GiB | 8.19 B | CUDA | 35 | tg128 | 45.82 ± 1.57 |
+
+build: 92bb442ad (7040)
|
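The size and parameter count in the llama-bench table above are enough to estimate the average bits per weight of the file. This is a rough sketch, assuming the size is in binary gibibytes and the parameter count in decimal billions; both inputs are already rounded in the table, so the result is approximate. The function name `bits_per_weight` is hypothetical.

```python
# Estimate average bits per weight (BPW) from the llama-bench table above.
# Assumptions: size is in GiB (2**30 bytes), params are in units of 1e9.

GIB = 1024 ** 3


def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average number of bits stored per model parameter."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billion * 1e9)


bpw = bits_per_weight(size_gib=7.21, params_billion=8.19)
print(f"{bpw:.2f} BPW")

# Time implied by the tg128 row: 128 generated tokens at 45.82 tokens/s.
tg128_seconds = 128 / 45.82
print(f"tg128 takes about {tg128_seconds:.2f} s")
```

The same helper can be applied to any other size and parameter pair to compare quantization variants on a common scale.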
@@ -0,0 +1,175 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 150.435 ms
|
| 162 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 2.86 seconds per pass - ETA 2.08 minutes
|
| 164 |
+
[1]2.7260,[2]2.1441,[3]1.6679,[4]1.5442,[5]1.6418,[6]1.7019,[7]1.6613,[8]1.6423,[9]1.5712,[10]1.5282,[11]1.4980,[12]1.4998,[13]1.4703,[14]1.4518,[15]1.4659,[16]1.4485,[17]1.4342,[18]1.4372,[19]1.4250,[20]1.4094,[21]1.4027,[22]1.3996,[23]1.4210,[24]1.4103,[25]1.4153,[26]1.4017,[27]1.3934,[28]1.3906,[29]1.4037,[30]1.4047,[31]1.3951,[32]1.3871,[33]1.3876,[34]1.3857,[35]1.3841,[36]1.4080,[37]1.4179,[38]1.4232,[39]1.4310,[40]1.4315,[41]1.4260,[42]1.4404,[43]1.4401,[44]1.4410,
|
| 165 |
+
Final estimate: PPL = 1.4410 +/- 0.00923
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1390.22 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 72975.93 ms / 90112 tokens ( 0.81 ms per token, 1234.82 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 74201.95 ms / 90113 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18909 + (2570 = 1870 + 80 + 620) + 2626 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
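The KV-cache figures in the log above (288.00 MiB total, 144.00 MiB each for K and V, split 128/80/80 MiB across CPU, CUDA0, and CUDA1) can be reproduced from the printed hyperparameters. The following is a minimal sketch, assuming each cache cell stores `n_embd_k_gqa` (or `n_embd_v_gqa`) f16 elements per layer and that the 20 offloaded layers are divided evenly between the two GPUs. The helper name `kv_cache_mib` is hypothetical and not part of llama.cpp.

```python
# Values copied from the log: n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa.
N_CTX = 2048
N_LAYER = 36
N_EMBD_K_GQA = 1024
N_EMBD_V_GQA = 1024
F16_BYTES = 2          # the log reports K (f16) and V (f16)
MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_layer, n_embd_gqa, bytes_per_elem=F16_BYTES):
    """Size in MiB of one cache tensor (K or V) summed over all layers."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


k_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_K_GQA)
v_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_V_GQA)
total_mib = k_mib + v_mib
per_layer_mib = total_mib / N_LAYER

# Assumption: the 20 offloaded layers are split 10/10 over the two GPUs,
# and the remaining 16 layers keep their cache on the CPU.
GPU_LAYERS = 20
per_gpu_mib = per_layer_mib * GPU_LAYERS / 2
cpu_mib = per_layer_mib * (N_LAYER - GPU_LAYERS)

print(f"K={k_mib} MiB, V={v_mib} MiB, total={total_mib} MiB")
print(f"per GPU={per_gpu_mib} MiB, CPU={cpu_mib} MiB")
```

Under these assumptions the computed values agree with the `llama_kv_cache` lines in the log, which suggests the per-device buffers scale with the number of layers placed on each device.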
|
|
@@ -0,0 +1,175 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21581 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 74.452 ms
|
| 162 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 1.90 seconds per pass - ETA 0.47 minutes
|
| 164 |
+
[1]7.8150,[2]9.3362,[3]9.6859,[4]9.3691,[5]9.1857,[6]7.8080,[7]7.0354,[8]7.0834,[9]7.4342,[10]7.5682,[11]7.6025,[12]7.9602,[13]8.0216,[14]8.1600,[15]8.2160,
|
| 165 |
+
Final estimate: PPL = 8.2160 +/- 0.17522
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1397.05 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 24635.38 ms / 30720 tokens ( 0.80 ms per token, 1246.99 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 25079.68 ms / 30721 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18911 + (2570 = 1870 + 80 + 620) + 2625 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
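The token count and throughput reported for this run follow from the chunk count and context size printed by the `perplexity:` lines. Below is a small sketch that recomputes them, assuming every chunk holds exactly `n_ctx` tokens. The function name `perplexity_run_stats` is hypothetical, and the inputs are copied from the log above.

```python
def perplexity_run_stats(n_chunks, n_ctx, prompt_eval_ms):
    """Return (tokens, ms per token, tokens per second) for a perplexity run."""
    tokens = n_chunks * n_ctx
    ms_per_token = prompt_eval_ms / tokens
    tokens_per_second = tokens / (prompt_eval_ms / 1000.0)
    return tokens, ms_per_token, tokens_per_second


# 15 chunks at n_ctx=2048, prompt eval time = 24635.38 ms (from the log).
tokens, ms_tok, tps = perplexity_run_stats(15, 2048, 24635.38)
print(tokens, round(ms_tok, 2), round(tps, 2))
```

The same check applies to the other runs in this diff by swapping in their chunk counts and prompt eval times.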
|
|
@@ -0,0 +1,175 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21580 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/world8/AI/Models/Qwen3-VL-8B-Thinking-unsloth/GGUF/MXFP4/Qwen3-VL-8B-Thinking-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3vl
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 VL 8B Thinking Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Thinking-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 8B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 VL 8B Thinking
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
|
| 23 |
+
llama_model_loader: - kv 12: qwen3vl.block_count u32 = 36
|
| 24 |
+
llama_model_loader: - kv 13: qwen3vl.context_length u32 = 262144
|
| 25 |
+
llama_model_loader: - kv 14: qwen3vl.embedding_length u32 = 4096
|
| 26 |
+
llama_model_loader: - kv 15: qwen3vl.feed_forward_length u32 = 12288
|
| 27 |
+
llama_model_loader: - kv 16: qwen3vl.attention.head_count u32 = 32
|
| 28 |
+
llama_model_loader: - kv 17: qwen3vl.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: qwen3vl.rope.freq_base f32 = 5000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: qwen3vl.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: qwen3vl.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
|
| 34 |
+
llama_model_loader: - kv 23: qwen3vl.n_deepstack_layers u32 = 3
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 49 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 50 |
+
llama_model_loader: - type mxfp4: 37 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 7.21 GiB (7.56 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3vl
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 4096
|
| 66 |
+
print_info: n_embd_inp = 16384
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 12288
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = 0
|
| 91 |
+
print_info: rope type = 40
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: mrope sections = [24, 20, 20, 0]
|
| 98 |
+
print_info: model type = 8B
|
| 99 |
+
print_info: model params = 8.19 B
|
| 100 |
+
print_info: general.name = Qwen3 VL 8B Thinking Unsloth
|
| 101 |
+
print_info: vocab type = BPE
|
| 102 |
+
print_info: n_vocab = 151936
|
| 103 |
+
print_info: n_merges = 151387
|
| 104 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 105 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 107 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 108 |
+
print_info: LF token = 198 'Ċ'
|
| 109 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 110 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 111 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 112 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 113 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 114 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 115 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 116 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 117 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 118 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 119 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 120 |
+
print_info: max token length = 256
|
| 121 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 122 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 123 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 124 |
+
load_tensors: CPU_Mapped model buffer size = 3641.67 MiB
|
| 125 |
+
load_tensors: CUDA0 model buffer size = 1870.32 MiB
|
| 126 |
+
load_tensors: CUDA1 model buffer size = 1870.32 MiB
|
| 127 |
+
..............................................................................................
|
| 128 |
+
llama_context: constructing llama_context
|
| 129 |
+
llama_context: n_seq_max = 1
|
| 130 |
+
llama_context: n_ctx = 2048
|
| 131 |
+
llama_context: n_ctx_seq = 2048
|
| 132 |
+
llama_context: n_batch = 2048
|
| 133 |
+
llama_context: n_ubatch = 512
|
| 134 |
+
llama_context: causal_attn = 1
|
| 135 |
+
llama_context: flash_attn = auto
|
| 136 |
+
llama_context: kv_unified = false
|
| 137 |
+
llama_context: freq_base = 5000000.0
|
| 138 |
+
llama_context: freq_scale = 1
|
| 139 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 140 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 141 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 144 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 145 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 146 |
+
llama_context: CUDA0 compute buffer size = 620.05 MiB
|
| 147 |
+
llama_context: CUDA1 compute buffer size = 98.02 MiB
|
| 148 |
+
llama_context: CUDA_Host compute buffer size = 12.02 MiB
|
| 149 |
+
llama_context: graph nodes = 1267
|
| 150 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 151 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 155 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 156 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 157 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 158 |
+
|
| 159 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 160 |
+
perplexity: tokenizing the input ..
|
| 161 |
+
perplexity: tokenization took 43.944 ms
|
| 162 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 163 |
+
perplexity: 1.88 seconds per pass - ETA 0.50 minutes
|
| 164 |
+
[1]5.2676,[2]5.9139,[3]6.0729,[4]6.2064,[5]6.4182,[6]6.3748,[7]6.3456,[8]6.2774,[9]6.2990,[10]6.2919,[11]6.3062,[12]6.2931,[13]6.3551,[14]6.3618,[15]6.3496,[16]6.3384,
|
| 165 |
+
Final estimate: PPL = 6.3384 +/- 0.11579
|
| 166 |
+
|
| 167 |
+
llama_perf_context_print: load time = 1409.68 ms
|
| 168 |
+
llama_perf_context_print: prompt eval time = 26373.45 ms / 32768 tokens ( 0.80 ms per token, 1242.46 tokens per second)
|
| 169 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 170 |
+
llama_perf_context_print: total time = 26876.81 ms / 32769 tokens
|
| 171 |
+
llama_perf_context_print: graphs reused = 0
|
| 172 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18908 + (2570 = 1870 + 80 + 620) + 2627 |
|
| 174 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21434 + (2048 = 1870 + 80 + 98) + 641 |
|
| 175 |
+
llama_memory_breakdown_print: | - Host | 3781 = 3641 + 128 + 12 |
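To compare runs such as the ones logged in this diff, the `Final estimate` line can be pulled out of each log with a regular expression. This is a minimal sketch, assuming the line keeps the exact `Final estimate: PPL = <value> +/- <error>` shape seen above. The helper `parse_final_estimate` is hypothetical and not part of llama.cpp.

```python
import re

# Matches e.g. "Final estimate: PPL = 6.3384 +/- 0.11579"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_estimate(log_text):
    """Return (ppl, error) from a perplexity log, or None if no line matches."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample = "Final estimate: PPL = 6.3384 +/- 0.11579"
ppl, err = parse_final_estimate(sample)
print(f"PPL={ppl} +/- {err} ({100 * err / ppl:.2f}% relative uncertainty)")
```

Reporting the error relative to the estimate makes it easier to judge whether a difference between two quantizations exceeds the measurement noise.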
|
|