Quantized Qwen3-4B-Thinking-2507

This repository provides Q4 and Q5 quantized versions of the Qwen3-4B-Thinking-2507 model. The model emphasizes depth of reasoning ("thinking mode"): it emits an internal thought segment (delimited by `<think>` … `</think>` boundaries) followed by the final generated content. It offers significantly improved reasoning and general capabilities, including logic, math, science, coding, instruction following, tool usage, text generation, alignment with human preferences, and 256K long-context understanding. The quantized models can run on CPUs and edge devices, making the model's capabilities usable without high-end hardware.

Model Overview

  • Original Model: Qwen3-4B-Thinking-2507
  • Thinking Mode: Enabled by default; no explicit `<think>` tag is required in the prompt
  • Architecture: decoder-only model
  • Base Model: Qwen3-4B-Thinking-2507
  • Quantized Version:
    • Q4_K_M
    • Q5_K_M
  • Modalities: Text
  • Developer: Qwen
  • License: Apache 2.0 License
  • Languages: English

Quantization Details

Q4_K_M Version

  • ~71% size reduction relative to the original model
  • Low memory footprint (~2.33 GB)
  • Best suited for deployment on edge devices or low-resource GPUs
  • Slight performance degradation in complex reasoning scenarios

Q5_K_M Version

  • ~66% size reduction relative to the original model
  • Low memory footprint (~2.69 GB)
  • Better performance retention, recommended when quality is a priority
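The stated size reductions can be sanity-checked with a quick back-of-the-envelope calculation. This sketch assumes a ~4.0B-parameter model stored in 16-bit precision (2 bytes per parameter, ~8 GB) as the baseline; the exact baseline size may differ slightly.

```python
# Approximate check of the quoted size reductions.
# Assumption: baseline is 16-bit weights for ~4.0B parameters (~8.0 GB).
PARAMS = 4.0e9
baseline_gb = PARAMS * 2 / 1e9  # 2 bytes per parameter -> 8.0 GB

quant_sizes_gb = {"Q4_K_M": 2.33, "Q5_K_M": 2.69}

for name, size_gb in quant_sizes_gb.items():
    reduction = 1 - size_gb / baseline_gb
    print(f"{name}: {size_gb} GB, ~{reduction:.0%} smaller than baseline")
# Q4_K_M works out to roughly 71% smaller, Q5_K_M to roughly 66%,
# matching the figures above.
```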

Key Features

  • Significantly improved reasoning: logical reasoning, mathematics, science, coding, and academic benchmarks.
  • Better general capabilities: instruction following, tool usage, text generation, and alignment with human preferences.
  • Long-Context Understanding: Can handle up to 256K tokens, enabling analysis of very large documents.
  • Large-Scale Transformer: 36-layer decoder-only Transformer with Grouped Query Attention for efficient computation.
  • Deployment Ready: Compatible with CPUs, lightweight GPUs, and frameworks such as vLLM and llama.cpp.

Usage Example

Using llama.cpp for inference:

./llama-cli -hf SandLogicTechnologies/Qwen3-4B-Thinking-2507-GGUF -p "Give me a short introduction to large language models."
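Because the model emits its reasoning inside `<think>` … `</think>` boundaries before the final answer, downstream code usually wants to separate the two. A minimal sketch in plain Python (note: depending on the chat template, the opening `<think>` tag may be absent from the output, so only the closing tag is treated as the split point; `split_thinking` is a hypothetical helper name):

```python
def split_thinking(output: str) -> tuple[str, str]:
    """Split raw model output into (thought, answer).

    Handles both forms: with a full <think>...</think> pair, or with
    only the closing </think> tag (as some chat templates produce).
    """
    marker = "</think>"
    if marker in output:
        thought, _, answer = output.partition(marker)
        # Drop the opening tag if it is present.
        thought = thought.replace("<think>", "", 1).strip()
        return thought, answer.strip()
    # No thinking segment found: treat everything as the answer.
    return "", output.strip()


# Example usage:
raw = "<think>First recall the definition...</think>An LLM is a neural network..."
thought, answer = split_thinking(raw)
print(answer)  # -> "An LLM is a neural network..."
```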

Recommended Use Cases

  • Advanced Reasoning Tasks: Logical reasoning, mathematics, science, and coding problems.
  • Academic Assistance: Solving benchmark questions, research summarization, and educational content generation.
  • Instruction Following: Chatbots and virtual assistants that respond accurately to user instruction.
  • Long-Context Applications: Analyzing and generating content from very large documents (up to 256K tokens).
  • Deployment in Low-Resource Environments: Running efficiently on CPUs, edge devices, or lightweight GPUs.

Acknowledgments

These quantized models are based on the original work by the Qwen development team.

Special thanks to:

  • The Qwen team for developing and releasing the Qwen3-4B-Thinking-2507 model.

  • Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.


Contact

For any inquiries or support, please contact us at [email protected] or visit our Website.
