Crow-9B-Opus-4.6-Distill-Heretic — Qwen3.5 W4G128 (AutoRound / Compressed-Tensors)

4-bit group-128 quantization of DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED, produced with AutoRound and stored in the compressed-tensors (pack-quantized) format.

Qwen3.5 is a hybrid architecture model that interleaves full attention layers (every 4th layer) with linear/GatedDeltaNet (Mamba-style) layers. This requires specific vLLM patches to run correctly — all patches and instructions are documented below.
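As a sketch of that interleaving (the 1-based "every 4th" placement here is an assumption; the authoritative layout comes from the checkpoint's layer_types config):

```python
# Hypothetical reconstruction of the 28-layer hybrid pattern: a full-attention
# layer every 4th position, linear attention (GatedDeltaNet) everywhere else.
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(28)
]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))  # 21 7
```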


Model Details

| Property | Value |
|---|---|
| Base model | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED |
| Architecture | Qwen3_5ForCausalLM (text-only, 28 layers: 21× GatedDeltaNet + 7× full attention) |
| Parameters | ~9B |
| Quantization | W4G128 (INT4 weights, group_size=128, asymmetric, AutoRound) |
| Format | compressed-tensors pack-quantized (Marlin-compatible) |
| Kernel | CompressedTensorsWNA16 → Marlin via vLLM |
| lm_head | Not quantized (BF16) |
| Disk size | ~5.5 GB (model.safetensors + extra_weights.safetensors) |
| Max context | 32,768 tokens tested; up to 262,144 tokens possible |
| VRAM required | ~8 GB (FP16 KV cache, 32K context) |

Quickstart

1 — Environment

Qwen3.5 requires the vLLM nightly build (≥ 0.17.0rc1.dev206). Use a fresh conda or uv environment:

conda create -n vllm-qwen35 python=3.11 -y
conda activate vllm-qwen35

# Install vLLM nightly with auto torch backend
pip install vllm --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly

⚠️ The standard PyPI release of vLLM does not support Qwen3.5. You must use the nightly wheel.


2 — Required vLLM Patches

Qwen3.5 is a new hybrid architecture that is not yet fully integrated into vLLM nightly as of March 2025. Four patches are required. Apply them to the installed vLLM package:

VLLM_SITE=$(python -c "import vllm; import os; print(os.path.dirname(vllm.__file__))")
echo "Patching vLLM at: $VLLM_SITE"

Patch 1 — Config registry (transformers_utils/config.py)

The qwen3_5_text model type is not registered in vLLM's config lookup table.

# File: $VLLM_SITE/transformers_utils/config.py
# Find the _CONFIG_REGISTRY dict and add this entry:
_CONFIG_REGISTRY = {
    ...
    "qwen3_5_text": "Qwen3_5TextConfig",   # ← ADD THIS LINE
    ...
}

Patch 2 — Model registry (model_executor/models/registry.py)

The Qwen3_5ForCausalLM architecture is not mapped to its module.

# File: $VLLM_SITE/model_executor/models/registry.py
# Find _TEXT_GENERATION_MODELS and add:
_TEXT_GENERATION_MODELS = {
    ...
    "Qwen3_5ForCausalLM": ("qwen3_5", "Qwen3_5ForCausalLM"),  # ← ADD THIS LINE
    ...
}

Patch 3 — Hybrid KV cache (model_executor/models/qwen3_5.py)

Qwen3.5 is a hybrid model (attention + Mamba layers), but its Qwen3_5ForCausalLMBase class does not implement the IsHybrid interface. Without it, vLLM cannot auto-compute a compatible block_size (which must be 272, not the default 16), and startup fails with a fatal NotImplementedError.

Find class Qwen3_5ForCausalLMBase and make these changes:

# BEFORE:
class Qwen3_5ForCausalLMBase(
    nn.Module,
    HasInnerState,
    SupportsLoRA,
    SupportsPP,
):
    ...

# AFTER:
class Qwen3_5ForCausalLMBase(
    nn.Module,
    HasInnerState,
    IsHybrid,       # ← ADD
    SupportsMRoPE,  # ← ADD
    SupportsLoRA,
    SupportsPP,
):
    is_hybrid = True                                # ← ADD
    supports_mrope: ClassVar[Literal[True]] = True  # ← ADD (ClassVar, Literal from typing)

    # ← ADD these three classmethods (Mamba state shape/dtype/copy):
    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        return MambaStateDtypeCalculator.gated_delta_net_state_dtype(
            vllm_config.model_config.dtype,
            vllm_config.cache_config.mamba_cache_dtype,
            vllm_config.cache_config.mamba_ssm_cache_dtype,
        )

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        cfg = vllm_config.model_config.hf_text_config
        tp  = vllm_config.parallel_config.tensor_parallel_size
        num_spec = (vllm_config.speculative_config.num_speculative_tokens
                    if vllm_config.speculative_config else 0)
        return MambaStateShapeCalculator.gated_delta_net_state_shape(
            tp, cfg.linear_num_key_heads, cfg.linear_num_value_heads,
            cfg.linear_key_head_dim, cfg.linear_value_head_dim,
            cfg.linear_conv_kernel_dim, num_spec,
        )

    @classmethod
    def get_mamba_state_copy_func(cls):
        return MambaStateCopyFuncCalculator.gated_delta_net_state_copy_func()

    # ← ADD text-only M-RoPE (config inherits mrope_section from VL parent):
    def get_mrope_input_positions(self, input_tokens, mm_features):
        n = len(input_tokens)
        positions = torch.arange(n).unsqueeze(0).expand(3, -1)
        return positions, 0

Also add SupportsMRoPE to the imports at the top of the file:

from .interfaces import (
    HasInnerState,
    IsHybrid,
    MixtureOfExperts,
    MultiModalEmbeddings,
    SupportsLoRA,
    SupportsMRoPE,   # ← ADD
    SupportsPP,
    _require_is_multimodal,
)
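The M-RoPE shim in Patch 3 does nothing exotic: with no multimodal features, the positions are plain 0..n-1 replicated across the three rope sections. A torch-free sketch of the same logic (standalone illustration, not vLLM code):

```python
def mrope_input_positions(input_tokens):
    # Text-only M-RoPE: identical 0..n-1 positions in each of the 3 sections,
    # and an mrope position delta of 0 (mirrors the patched method above).
    n = len(input_tokens)
    positions = [list(range(n)) for _ in range(3)]  # shape (3, n)
    return positions, 0

positions, delta = mrope_input_positions(["What", " is", " 2+2", "?"])
print(positions[0], delta)  # [0, 1, 2, 3] 0
```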

Patch 4 — Weight prefix remapping (model_executor/models/qwen3_5.py)

This checkpoint was quantized from the full VL model (Qwen3_5ForConditionalGeneration), so weight keys carry the model.language_model. prefix instead of the text-only model. prefix. The vision encoder weights must also be silently ignored.

Find class Qwen3_5ForCausalLMBase's load_weights method and replace it:

def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    # Remap VL-model weight prefix → text-only prefix; skip vision encoder.
    mapper = WeightsMapper(
        orig_to_new_prefix={"model.language_model.": "model."},
    )
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=["mtp."],
        ignore_unexpected_prefixes=["model.visual."],
    )
    return loader.load_weights(weights, mapper=mapper)

Make sure WeightsMapper is imported from .utils:

from .utils import (
    AutoWeightsLoader,
    PPMissingLayer,
    WeightsMapper,   # ← ADD if not already present
    ...
)

Note: You will still see WARNING: Parameter xxx.weight not found in params_dict for every layer. These come from the original BF16 weights stored in extra_weights.safetensors alongside the quantized versions — they are harmless. The quantized weight_packed / weight_scale / weight_zero_point tensors load correctly and silently.
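The remapping rules in Patch 4 can be exercised in isolation. A hypothetical standalone helper (not vLLM's actual loader) applying the same prefix rewrite and skip rules to weight names:

```python
def remap_weight_names(names):
    """Mirror Patch 4: drop vision/MTP weights, strip the VL prefix."""
    kept = []
    for name in names:
        if name.startswith("model.visual.") or name.startswith("mtp."):
            continue  # ignored, as in skip_prefixes / ignore_unexpected_prefixes
        if name.startswith("model.language_model."):
            # orig_to_new_prefix={"model.language_model.": "model."}
            name = "model." + name[len("model.language_model."):]
        kept.append(name)
    return kept

names = [
    "model.language_model.layers.0.mlp.gate_proj.weight_packed",
    "model.visual.patch_embed.proj.weight",
    "lm_head.weight",
]
print(remap_weight_names(names))
# ['model.layers.0.mlp.gate_proj.weight_packed', 'lm_head.weight']
```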


3 — Serve with vLLM

conda activate vllm-qwen35

CUDA_VISIBLE_DEVICES=0 CUDA_DEVICE_ORDER=PCI_BUS_ID \
vllm serve groxaxo/Crow-9B-Opus-4.6-Distill-Heretic-Qwen3.5-W4G128 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype bfloat16 \
  --trust-remote-code

Expected startup output:

Setting attention block size to 272 tokens to ensure that attention page
  size is >= mamba page size.
Padding mamba page size by 1.49% ...
Loading safetensors checkpoint shards: 100% | 2/2 [00:05]
Loading weights took 5.82 seconds
Application startup complete.

The block_size=272 message confirms the hybrid KV cache patch is working correctly.


4 — API Usage

The server exposes an OpenAI-compatible API at http://localhost:8000/v1.

Basic Chat

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "groxaxo/Crow-9B-Opus-4.6-Distill-Heretic-Qwen3.5-W4G128"

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
# With --reasoning-parser qwen3, the <think>...</think> block is returned
# separately in reasoning_content:
print(response.choices[0].message.reasoning_content)

With Tool Calling

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is the latest news about fusion energy?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)

msg = response.choices[0].message
if msg.tool_calls:
    print("Tool call:", msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
else:
    print("Direct answer:", msg.content)

Disabling Thinking (faster, cheaper)

# A /no-thinking system prompt suppresses the reasoning block;
# temperature=0 keeps the short answer deterministic.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "/no-thinking"},
        {"role": "user", "content": "What is 2+2?"},
    ],
    max_tokens=64,
    temperature=0,
)

Architecture Notes

Qwen3.5 uses a hybrid attention pattern (introduced in Qwen3-Next):

  • Layer types: linear_attention (GatedDeltaNet / Mamba2-style) × 21 layers + full_attention × 7 layers (every 4th)
  • GatedDeltaNet layers maintain a recurrent state (conv state + temporal state) requiring Mamba-style KV cache allocation
  • Block size: vLLM must use block_size=272 (not the default 16) so that the Mamba page size (1,097,728 bytes) aligns with the FlashAttention page size

| Component | Size |
|---|---|
| Conv state (per GDN layer) | 3 × 8,192 × 2 B (BF16) = 49,152 bytes |
| Temporal state (per GDN layer) | 32 × 128 × 128 × 2 B (BF16) = 1,048,576 bytes |
| Total Mamba page | 1,097,728 bytes |
| FA page @ block_size=272 | 272 × 4,096 B = 1,114,112 bytes (padded 1.49%) |
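The figures in this table can be reproduced with a few lines of arithmetic (a sanity check on the stated sizes, not vLLM code):

```python
BF16 = 2  # bytes per element

conv_state = 3 * 8192 * BF16            # 49,152 B per GDN layer
temporal_state = 32 * 128 * 128 * BF16  # 1,048,576 B per GDN layer
mamba_page = conv_state + temporal_state

fa_page = 272 * 4096                    # block_size=272 at 4,096 B per token
padding_pct = (fa_page / mamba_page - 1) * 100

print(mamba_page, fa_page, f"{padding_pct:.2f}%")  # 1097728 1114112 1.49%
```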

Quantization Details

Quantized with AutoRound (Intel Neural Compressor) using:

  • Method: compressed-tensors pack-quantized
  • Bits: INT4, group_size=128, asymmetric (zero-point), zp_dtype=torch.int8
  • Targets: all Linear layers except lm_head and vision encoder
  • Calibration: 400 iterations, batch_size=1
  • Scale dtype: BF16 (default), zero-point: INT8
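A plain-Python sketch of what W4G128 asymmetric quantization and nibble packing look like. This illustrates the scheme only; the actual compressed-tensors bit layout and AutoRound's scale/zero-point optimization are more involved:

```python
def quantize_group(weights, scale, zero_point):
    # Asymmetric INT4: q = clamp(round(w / scale) + zp, 0, 15)
    return [max(0, min(15, round(w / scale) + zero_point)) for w in weights]

def pack_int4(vals):
    # Pack 8 unsigned 4-bit values into one 32-bit word, lowest nibble first.
    assert len(vals) % 8 == 0
    words = []
    for i in range(0, len(vals), 8):
        word = 0
        for j, v in enumerate(vals[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words

group = [0.5, -0.25, 0.0, 1.0, -1.0, 0.75, 0.25, -0.5]
q = quantize_group(group, scale=0.25, zero_point=8)
print(q)  # [10, 7, 8, 12, 4, 11, 9, 6]
packed = pack_int4(q)
```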

The extra_weights.safetensors file contains the original BF16 copies of quantized layers. These are used by AutoRound for reference and are not loaded by vLLM (vLLM uses the weight_packed / weight_scale tensors from model.safetensors).


Perplexity (WikiText-2 test set)

Evaluated on 8,000 tokens of WikiText-2 with 512-token chunks, 256-token stride:

| Model | PPL |
|---|---|
| DavidAU/Qwen3.5-9B-Heretic (BF16) | 9.6540 |
| This model (W4G128) | 9.9664 |
| Δ degradation | +3.24% |

The 4-bit W4G128 quantization costs only +3.24% perplexity, an excellent trade for roughly 4× smaller weights (INT4 vs BF16).
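The overlapping-window evaluation described above can be sketched as follows (hypothetical helper mirroring the stated 512-token chunk / 256-token stride setup):

```python
def chunk_spans(n_tokens, chunk=512, stride=256):
    # Slide a 512-token window forward 256 tokens at a time, so successive
    # evaluation chunks overlap by half a window.
    spans = []
    start = 0
    while start + chunk <= n_tokens:
        spans.append((start, start + chunk))
        start += stride
    return spans

print(chunk_spans(1024))  # [(0, 512), (256, 768), (512, 1024)]
```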


Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| KeyError: qwen3_5_text | Config registry missing entry | Apply Patch 1 |
| KeyError: Qwen3_5ForCausalLM | Model registry missing entry | Apply Patch 2 |
| NotImplementedError: page size not divisible | IsHybrid not implemented | Apply Patch 3 |
| AssertionError: M-RoPE support is not implemented | SupportsMRoPE not implemented | Apply Patch 3 |
| Parameter xxx.weight not found in params_dict | BF16 backup weights in extra_weights.safetensors | Harmless, ignore |
| Parameter xxx not found (non-.weight) | Weight prefix mismatch (VL → text) | Apply Patch 4 |

Citation

If you use this model, please cite the original Qwen3.5 paper and AutoRound:

@misc{qwen3.5,
  title  = {Qwen3.5 Technical Report},
  author = {Qwen Team},
  year   = {2025},
}
@misc{autoround,
  title  = {AutoRound: Optimizing LLM Quantization},
  author = {Intel Neural Compressor Team},
  year   = {2024},
}