Quantized Qwen3-4B-Thinking-2507
This repository provides Q4 and Q5 quantized versions of the Qwen3-4B-Thinking-2507 model, which emphasizes depth of reasoning ("thinking mode"). The model outputs an internal thought segment (delimited by `<think>...</think>` boundaries) followed by the final generated content. It offers significantly improved reasoning and general capabilities, including logic, math, science, coding, instruction following, tool usage, text generation, alignment with human preferences, and 256K long-context understanding. These quantized models can run on CPUs and edge devices, making the model's capabilities usable without high-end hardware.
Model Overview
- Original Model: Qwen3-4B-Thinking-2507
- Thinking Mode: Enabled by default; no special tag is required to activate it
- Architecture: decoder-only model
- Base Model: Qwen3-4B-Thinking-2507
- Quantized Version:
- Q4_K_M
- Q5_K_M
- Modalities: Text
- Developer: Qwen
- License: Apache 2.0
- Languages: English
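Since thinking mode is on by default, downstream code usually needs to separate the reasoning segment from the final answer. A minimal sketch (the helper name and the assumption that the model may emit only the closing `</think>` tag are illustrative, not part of any official API):

```python
def split_thinking(output: str) -> tuple[str, str]:
    """Separate the reasoning segment from the final answer.

    Assumes the thought portion ends at a closing </think> tag; we split
    on the closing tag alone rather than requiring a matched pair.
    """
    marker = "</think>"
    if marker in output:
        thought, _, answer = output.partition(marker)
        return thought.replace("<think>", "").strip(), answer.strip()
    return "", output.strip()

# Example: a typical thinking-mode completion
sample = "<think>2 + 2 is 4.</think>The answer is 4."
thought, answer = split_thinking(sample)
```

Splitting on the closing tag only is deliberate: it degrades gracefully when the opening tag is absent from the decoded text.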
Quantization Details
Q4_K_M Version
- ~71% size reduction
- Lower memory footprint (~2.33 GB)
- Best suited for deployment on edge devices or low-resource GPUs
- Slight performance degradation in complex reasoning scenarios
Q5_K_M Version
- ~66% size reduction
- Lower memory footprint (~2.69 GB)
- Better performance retention, recommended when quality is a priority
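The size-reduction figures above can be sanity-checked against an FP16 baseline of roughly 8 GB (4B parameters × 2 bytes each; an assumption for illustration, since the exact original file size varies):

```python
# Assumed FP16 baseline: 4B params * 2 bytes ≈ 8 GB
FP16_GB = 4e9 * 2 / 1e9

for name, size_gb in [("Q4_K_M", 2.33), ("Q5_K_M", 2.69)]:
    reduction = 1 - size_gb / FP16_GB
    print(f"{name}: {size_gb} GB, ~{reduction:.0%} smaller")
# Q4_K_M works out to ~71%, Q5_K_M to ~66%, matching the figures above
```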
Key Features
- Significantly improved reasoning: logical reasoning, mathematics, science, coding, and academic benchmarks.
- Better general capabilities: instruction following, tool usage, text generation, and alignment with human preferences.
- Long-Context Understanding: Can handle up to 256K tokens, enabling analysis of very large documents.
- Large-Scale Transformer: 36-layer decoder-only Transformer with Grouped Query Attention for efficient computation.
- Deployment Ready: Compatible with CPUs, lightweight GPUs, and frameworks such as llama.cpp and vLLM.
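Grouped Query Attention lets several query heads share one key/value head, which shrinks the KV cache and speeds up inference. A minimal NumPy sketch of the mechanism (the head counts here are illustrative, not the model's actual configuration):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped Query Attention: q has more heads than k/v; each K/V head
    is shared by a contiguous group of query heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    # Repeat each K/V head so every query head in a group attends to the same K/V
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads (illustrative)
k = rng.standard_normal((2, 4, 16))  # 2 KV heads -> groups of 4 query heads
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)
```

With 8 query heads and 2 KV heads, the KV cache is a quarter the size of standard multi-head attention at the same query-head count.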
Usage Example
Using llama.cpp for inference:

```shell
./llama-cli -hf SandLogicTechnologies/Qwen3-4B-Thinking-2507-GGUF -p "Give me a short introduction to large language models."
```
Recommended Use Cases
- Advanced Reasoning Tasks: Logical reasoning, mathematics, science, and coding problems.
- Academic Assistance: Solving benchmark questions, research summarization, and educational content generation.
- Instruction Following: Chatbots and virtual assistants that respond accurately to user instruction.
- Long-Context Applications: Analyzing and generating content from very large documents (up to 256K tokens).
- Deployment in Low-Resource Environments: Running efficiently on CPUs, edge devices, or lightweight GPUs.
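For long-context applications, it helps to estimate whether a document will fit in the 256K-token window before sending it. A rough sketch using a chars-per-token heuristic (the ~4 chars/token ratio and the helper name are assumptions for illustration, not properties of Qwen's tokenizer):

```python
CONTEXT_TOKENS = 256_000   # Qwen3-4B-Thinking-2507 context window
CHARS_PER_TOKEN = 4        # rough heuristic; the real ratio depends on the tokenizer

def fits_in_context(text: str, reserve_for_output: int = 8_000) -> bool:
    """Estimate whether `text` plus a response budget fits the context window."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_TOKENS
```

For precise counts, tokenize with the model's actual tokenizer instead of the heuristic.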
Acknowledgments
These quantized models are based on the original work by the Qwen development team.
Special thanks to:
The Qwen team for developing and releasing the Qwen3-4B-Thinking-2507 model.
Georgi Gerganov and the llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.
Contact
For any inquiries or support, please contact us at [email protected] or visit our Website.