Crow-9B-Opus-4.6-Distill-Heretic – Qwen3.5 W4G128 (AutoRound / Compressed-Tensors)
4-bit group-128 quantization of DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED, produced with AutoRound and stored in the compressed-tensors (pack-quantized) format.
Qwen3.5 is a hybrid-architecture model that interleaves full-attention layers (every 4th layer) with linear/GatedDeltaNet (Mamba-style) layers. This requires specific vLLM patches to run correctly; all patches and instructions are documented below.
Model Details
| Property | Value |
|---|---|
| Base model | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED |
| Architecture | Qwen3_5ForCausalLM (text-only, 28 layers: 21× GatedDeltaNet + 7× full attention) |
| Parameters | ~9B |
| Quantization | W4G128 (INT4 weights, group_size=128, asymmetric, AutoRound) |
| Format | compressed-tensors pack-quantized (Marlin-compatible) |
| Kernel | CompressedTensorsWNA16 → Marlin via vLLM |
| lm_head | Not quantized (BF16) |
| Disk size | ~5.5 GB (model.safetensors + extra_weights.safetensors) |
| Max context | 32,768 tokens tested; up to 262,144 tokens possible |
| VRAM required | ~8 GB (fp16 KV cache, 32K context) |
Quickstart
1 – Environment
Qwen3.5 requires the vLLM nightly build (≥ 0.17.0rc1.dev206). Use a fresh conda or uv environment:
conda create -n vllm-qwen35 python=3.11 -y
conda activate vllm-qwen35
# Install vLLM nightly with auto torch backend
pip install vllm --torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
⚠️ The standard PyPI release of vLLM does not support Qwen3.5. You must use the nightly wheel.
2 – Required vLLM Patches
Qwen3.5 is a new hybrid architecture that is not yet fully integrated into the vLLM nightly as of March 2025. Four patches are required. Apply them to the installed vLLM package:
VLLM_SITE=$(python -c "import vllm; import os; print(os.path.dirname(vllm.__file__))")
echo "Patching vLLM at: $VLLM_SITE"
Patch 1 – Config registry (transformers_utils/config.py)
The qwen3_5_text model type is not registered in vLLM's config lookup table.
# File: $VLLM_SITE/transformers_utils/config.py
# Find the _CONFIG_REGISTRY dict and add this entry:
_CONFIG_REGISTRY = {
...
"qwen3_5_text": "Qwen3_5TextConfig", # β ADD THIS LINE
...
}
Patch 2 – Model registry (model_executor/models/registry.py)
The Qwen3_5ForCausalLM architecture is not mapped to its module.
# File: $VLLM_SITE/model_executor/models/registry.py
# Find _TEXT_GENERATION_MODELS and add:
_TEXT_GENERATION_MODELS = {
...
"Qwen3_5ForCausalLM": ("qwen3_5", "Qwen3_5ForCausalLM"), # β ADD THIS LINE
...
}
Patch 3 – Hybrid KV cache (model_executor/models/qwen3_5.py)
Qwen3.5 is a hybrid model (attention + Mamba layers), but its Qwen3_5ForCausalLMBase class does not implement the IsHybrid interface. Without it, vLLM cannot auto-compute a compatible block_size (which must be 272 here, not the default 16), causing a fatal NotImplementedError at startup.
Find class Qwen3_5ForCausalLMBase and make these changes:
# BEFORE:
class Qwen3_5ForCausalLMBase(
nn.Module,
HasInnerState,
SupportsLoRA,
SupportsPP,
):
...
# AFTER:
class Qwen3_5ForCausalLMBase(
nn.Module,
HasInnerState,
IsHybrid,  # ← ADD
SupportsMRoPE,  # ← ADD
SupportsLoRA,
SupportsPP,
):
is_hybrid = True  # ← ADD
supports_mrope: ClassVar[Literal[True]] = True  # ← ADD
# ← ADD these three classmethods (Mamba state shape/dtype/copy):
@classmethod
def get_mamba_state_dtype_from_config(cls, vllm_config):
return MambaStateDtypeCalculator.gated_delta_net_state_dtype(
vllm_config.model_config.dtype,
vllm_config.cache_config.mamba_cache_dtype,
vllm_config.cache_config.mamba_ssm_cache_dtype,
)
@classmethod
def get_mamba_state_shape_from_config(cls, vllm_config):
cfg = vllm_config.model_config.hf_text_config
tp = vllm_config.parallel_config.tensor_parallel_size
num_spec = (vllm_config.speculative_config.num_speculative_tokens
if vllm_config.speculative_config else 0)
return MambaStateShapeCalculator.gated_delta_net_state_shape(
tp, cfg.linear_num_key_heads, cfg.linear_num_value_heads,
cfg.linear_key_head_dim, cfg.linear_value_head_dim,
cfg.linear_conv_kernel_dim, num_spec,
)
@classmethod
def get_mamba_state_copy_func(cls):
return MambaStateCopyFuncCalculator.gated_delta_net_state_copy_func()
# ← ADD text-only M-RoPE (config inherits mrope_section from VL parent):
def get_mrope_input_positions(self, input_tokens, mm_features):
n = len(input_tokens)
positions = torch.arange(n).unsqueeze(0).expand(3, -1)
return positions, 0
Also add SupportsMRoPE to the imports at the top of the file:
from .interfaces import (
HasInnerState,
IsHybrid,
MixtureOfExperts,
MultiModalEmbeddings,
SupportsLoRA,
SupportsMRoPE,  # ← ADD
SupportsPP,
_require_is_multimodal,
)
Patch 4 – Weight prefix remapping (model_executor/models/qwen3_5.py)
This checkpoint was quantized from the full VL model (Qwen3_5ForConditionalGeneration), so weight keys carry the model.language_model. prefix instead of the text-only model. prefix. The vision encoder weights must also be silently ignored.
Find class Qwen3_5ForCausalLMBase's load_weights method and replace it:
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
# Remap VL-model weight prefix → text-only prefix; skip vision encoder.
mapper = WeightsMapper(
orig_to_new_prefix={"model.language_model.": "model."},
)
loader = AutoWeightsLoader(
self,
skip_prefixes=["mtp."],
ignore_unexpected_prefixes=["model.visual."],
)
return loader.load_weights(weights, mapper=mapper)
Make sure WeightsMapper is imported from .utils:
from .utils import (
AutoWeightsLoader,
PPMissingLayer,
WeightsMapper,  # ← ADD if not already present
...
)
Note: You will still see `WARNING: Parameter xxx.weight not found in params_dict` for every layer. These come from the original BF16 weights stored in `extra_weights.safetensors` alongside the quantized versions – they are harmless. The quantized `weight_packed` / `weight_scale` / `weight_zero_point` tensors load correctly and silently.
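To sanity-check the remapping logic before touching vLLM itself, here is a standalone sketch of what Patch 4 asks the loader to do (pure Python; the function name is mine, and this is not vLLM's actual WeightsMapper code):

```python
def remap_key(key):
    """Patch 4 in miniature: skip MTP heads, ignore vision-encoder
    weights, and rename the VL text-stack prefix to the text-only one.
    Returns None for keys the loader should drop."""
    if key.startswith(("mtp.", "model.visual.")):
        return None
    if key.startswith("model.language_model."):
        return "model." + key[len("model.language_model."):]
    return key  # e.g. lm_head.weight passes through unchanged
```

For example, `model.language_model.layers.0.mlp.gate_proj.weight_packed` becomes `model.layers.0.mlp.gate_proj.weight_packed`, while any `model.visual.*` key is dropped.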
3 – Serve with vLLM
conda activate vllm-qwen35
CUDA_VISIBLE_DEVICES=0 CUDA_DEVICE_ORDER=PCI_BUS_ID \
vllm serve groxaxo/Crow-9B-Opus-4.6-Distill-Heretic-Qwen3.5-W4G128 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--dtype bfloat16 \
--trust-remote-code
Expected startup output:
Setting attention block size to 272 tokens to ensure that attention page
size is >= mamba page size.
Padding mamba page size by 1.49% ...
Loading safetensors checkpoint shards: 100% | 2/2 [00:05]
Loading weights took 5.82 seconds
Application startup complete.
The `block_size=272` message confirms the hybrid KV cache patch is working correctly.
4 – API Usage
The server exposes an OpenAI-compatible API at http://localhost:8000/v1.
Basic Chat
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "groxaxo/Crow-9B-Opus-4.6-Distill-Heretic-Qwen3.5-W4G128"
response = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
max_tokens=512,
temperature=0.6,
)
print(response.choices[0].message.content)
# With --reasoning-parser qwen3, the <think>...</think> block is returned
# in a separate field (named reasoning_content in recent vLLM releases):
print(response.choices[0].message.reasoning_content)
With Tool Calling
tools = [{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for current information",
"parameters": {
"type": "object",
"properties": {"query": {"type": "string", "description": "Search query"}},
"required": ["query"],
},
},
}]
response = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": "What is the latest news about fusion energy?"}],
tools=tools,
tool_choice="auto",
max_tokens=512,
)
msg = response.choices[0].message
if msg.tool_calls:
print("Tool call:", msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
else:
print("Direct answer:", msg.content)
Disabling Thinking (faster, cheaper)
# Add /no-thinking to the system prompt or set temperature=0
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "/no-thinking"},
{"role": "user", "content": "What is 2+2?"},
],
max_tokens=64,
temperature=0,
)
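If the server runs without `--reasoning-parser`, the `<think>...</think>` block arrives inline in `message.content` instead of a separate field. A client-side strip is then a one-liner (a sketch, assuming the standard Qwen think tags):

```python
import re

def strip_think(text):
    """Remove inline <think>...</think> reasoning blocks, plus any
    whitespace immediately following them."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
```

For example, `strip_think("<think>carry nothing...</think>2+2 = 4")` returns just `"2+2 = 4"`.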
Architecture Notes
Qwen3.5 uses a hybrid attention pattern (introduced in Qwen3-Next):
- Layer types: `linear_attention` (GatedDeltaNet / Mamba2-style) × 21 layers + `full_attention` × 7 layers (every 4th)
- GatedDeltaNet layers maintain a recurrent state (conv state + temporal state) requiring Mamba-style KV cache allocation
- Block size: vLLM must use `block_size=272` (not the default 16) so that the Mamba page size (1,097,728 bytes) aligns with the FlashAttention page size
| Component | Size |
|---|---|
| Conv state (per GDN layer) | 3 × 8192 × BF16 = 49,152 bytes |
| Temporal state (per GDN layer) | 32 × 128 × 128 × BF16 = 1,048,576 bytes |
| Total Mamba page | 1,097,728 bytes |
| FA page @ block_size=272 | 272 × 4,096 = 1,114,112 bytes (padded 1.49%) |
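Every number in the table can be reproduced from the layer dimensions. The sketch below redoes the arithmetic; the 4,096 bytes-per-token attention KV figure and the round-up-to-a-multiple-of-16 rule are assumptions inferred from the table and vLLM's block granularity:

```python
import math

BF16 = 2  # bytes per element

# Per-GDN-layer recurrent state (dimensions from the table above)
conv_state = 3 * 8192 * BF16              # 49,152 bytes
temporal_state = 32 * 128 * 128 * BF16    # 1,048,576 bytes
mamba_page = conv_state + temporal_state  # 1,097,728 bytes

# Assumed full-attention KV bytes per token (from 272 * 4,096 in the table)
KV_BYTES_PER_TOKEN = 4096

# Smallest block size (multiple of 16) whose attention page covers the
# Mamba page -- this is where block_size=272 comes from
block_size = math.ceil(mamba_page / KV_BYTES_PER_TOKEN / 16) * 16
fa_page = block_size * KV_BYTES_PER_TOKEN
padding_pct = (fa_page / mamba_page - 1) * 100  # the "1.49%" in the startup log
```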
Quantization Details
Quantized with AutoRound (Intel Neural Compressor) using:
- Method: `compressed-tensors` pack-quantized
- Bits: INT4, group_size=128, asymmetric (zero-point), `zp_dtype=torch.int8`
- Targets: all `Linear` layers except `lm_head` and the vision encoder
- Calibration: 400 iterations, batch_size=1
- Scale dtype: BF16 (default), zero-point: INT8
The extra_weights.safetensors file contains the original BF16 copies of quantized layers. These are used by AutoRound for reference and are not loaded by vLLM (vLLM uses the weight_packed / weight_scale tensors from model.safetensors).
Perplexity (WikiText-2 test set)
Evaluated on 8,000 tokens of WikiText-2 with 512-token chunks, 256-token stride:
| Model | PPL |
|---|---|
| DavidAU/Qwen3.5-9B-Heretic (BF16) | 9.6540 |
| This model (W4G128) | 9.9664 |
| Δ degradation | +3.24% |
The quantization cost of 4-bit W4G128 is only +3.24% perplexity – excellent quality for roughly 4× weight compression (INT4 vs. BF16).
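The card doesn't publish its evaluation script, but the 512-token chunk / 256-token stride setup implies the standard strided-perplexity scheme, where each window scores only the tokens the previous window did not cover. A sketch of the windowing (function name mine):

```python
def chunk_spans(n_tokens, chunk=512, stride=256):
    """Return (begin, end, score_from) windows for strided perplexity.
    Tokens in [score_from, end) contribute to the loss; earlier tokens
    in the window are context only, so every token is scored exactly once."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + chunk, n_tokens)
        score_from = 0 if begin == 0 else begin + chunk - stride
        spans.append((begin, end, score_from))
        if end == n_tokens:
            break
    return spans

# Sanity check of the reported delta from the two PPL values above:
delta_pct = (9.9664 / 9.6540 - 1) * 100  # ~= +3.24%
```

Over the stated 8,000 tokens this yields 31 windows whose scored spans tile the text exactly once.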
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `KeyError: qwen3_5_text` | Config registry missing entry | Apply Patch 1 |
| `KeyError: Qwen3_5ForCausalLM` | Model registry missing entry | Apply Patch 2 |
| `NotImplementedError: page size not divisible` | `IsHybrid` not implemented | Apply Patch 3 |
| `AssertionError: M-RoPE support is not implemented` | `SupportsMRoPE` not implemented | Apply Patch 3 |
| `Parameter xxx.weight not found in params_dict` | BF16 backup weights in `extra_weights.safetensors` | Harmless – ignore |
| `Parameter xxx not found` (non-`.weight`) | Weight prefix mismatch (VL → text) | Apply Patch 4 |
Related Models
- DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED – original BF16 source
- groxaxo/Qwen3.5-27B-heretic-AWQ-W4A16 – 27B variant
Citation
If you use this model, please cite the original Qwen3.5 paper and AutoRound:
@misc{qwen3.5,
title = {Qwen3.5 Technical Report},
author = {Qwen Team},
year = {2025},
}
@misc{autoround,
title = {AutoRound: Optimizing LLM Quantization},
author = {Intel Neural Compressor Team},
year = {2024},
}
Base model: Qwen/Qwen3.5-9B-Base