Quantized Qwen3-4B-Thinking-2507

This repository provides Q4 and Q5 quantized versions of the Qwen3-4B-Thinking-2507 model. The model emphasizes depth of reasoning ("thinking mode"): it emits an internal thought segment (delimited by `<think>` … `</think>` boundaries) followed by the final generated content. It offers significantly improved reasoning and general capabilities, including logic, math, science, coding, instruction following, tool usage, text generation, alignment with human preferences, and 256K long-context understanding. The quantized models can run on CPUs and edge devices, making the model's capabilities usable without high-end hardware.

Model Overview

  • Original Model: Qwen3-4B-Thinking-2507
  • Thinking Mode: Enabled by default; no explicit `<think>` tag is required in the prompt
  • Architecture: decoder-only model
  • Base Model: Qwen3-4B-Thinking-2507
  • Quantized Version:
    • Q4_K_M
    • Q5_K_M
  • Modalities: Text
  • Developer: Qwen
  • License: Apache 2.0 License
  • Languages: English

Quantization Details

Q4_K_M Version

  • ~71% size reduction relative to the original model
  • Low memory footprint (~2.33 GB)
  • Best suited for deployment on edge devices or low-resource GPUs
  • Slight performance degradation in complex reasoning scenarios

Q5_K_M Version

  • ~66% size reduction relative to the original model
  • Low memory footprint (~2.69 GB)
  • Better performance retention, recommended when quality is a priority
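The stated size reductions can be sanity-checked with a quick back-of-the-envelope calculation. This sketch assumes a ~4.0B-parameter model stored in 16-bit precision (2 bytes per parameter, ~8 GB) as the baseline; the exact baseline size may differ slightly.

```python
# Approximate check of the quoted size reductions.
# Assumption: baseline is 16-bit weights for ~4.0B parameters (~8.0 GB).
PARAMS = 4.0e9
baseline_gb = PARAMS * 2 / 1e9  # 2 bytes per parameter -> 8.0 GB

quant_sizes_gb = {"Q4_K_M": 2.33, "Q5_K_M": 2.69}

for name, size_gb in quant_sizes_gb.items():
    reduction = 1 - size_gb / baseline_gb
    print(f"{name}: {size_gb} GB, ~{reduction:.0%} smaller than baseline")
# Q4_K_M works out to roughly 71% smaller, Q5_K_M to roughly 66%,
# matching the figures above.
```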

Key Features

  • Significantly improved reasoning: logical reasoning, mathematics, science, coding, and academic benchmarks.
  • Better general capabilities: instruction following, tool usage, text generation, and alignment with human preferences.
  • Long-Context Understanding: Can handle up to 256K tokens, enabling analysis of very large documents.
  • Large-Scale Transformer: 36-layer decoder-only Transformer with Grouped Query Attention for efficient computation.
  • Deployment Ready: Compatible with CPUs, lightweight GPUs, and frameworks such as vLLM and llama.cpp.

Usage Example

Using llama.cpp for inference:

./llama-cli -hf SandLogicTechnologies/Qwen3-4B-Thinking-2507-GGUF -p "Give me a short introduction to large language models."
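Because the model emits its reasoning inside `<think>` … `</think>` boundaries before the final answer, downstream code usually wants to separate the two. A minimal sketch in plain Python (note: depending on the chat template, the opening `<think>` tag may be absent from the output, so only the closing tag is treated as the split point; `split_thinking` is a hypothetical helper name):

```python
def split_thinking(output: str) -> tuple[str, str]:
    """Split raw model output into (thought, answer).

    Handles both forms: with a full <think>...</think> pair, or with
    only the closing </think> tag (as some chat templates produce).
    """
    marker = "</think>"
    if marker in output:
        thought, _, answer = output.partition(marker)
        # Drop the opening tag if it is present.
        thought = thought.replace("<think>", "", 1).strip()
        return thought, answer.strip()
    # No thinking segment found: treat everything as the answer.
    return "", output.strip()


# Example usage:
raw = "<think>First recall the definition...</think>An LLM is a neural network..."
thought, answer = split_thinking(raw)
print(answer)  # -> "An LLM is a neural network..."
```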

Recommended Use Cases

  • Advanced Reasoning Tasks: Logical reasoning, mathematics, science, and coding problems.
  • Academic Assistance: Solving benchmark questions, research summarization, and educational content generation.
  • Instruction Following: Chatbots and virtual assistants that respond accurately to user instruction.
  • Long-Context Applications: Analyzing and generating content from very large documents (up to 256K tokens).
  • Deployment in Low-Resource Environments: Running efficiently on CPUs, edge devices, or lightweight GPUs.

Acknowledgments

These quantized models are based on the original work by the Qwen development team.

Special thanks to:

  • The Qwen team for developing and releasing the Qwen3-4B-Thinking-2507 model.

  • Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.


Contact

For any inquiries or support, please contact us at [email protected] or visit our Website.
