---
license: gpl-3.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model:
- unsloth/gemma-4-26B-A4B-it
base_model_relation: finetune
datasets:
- angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
tags:
- unsloth
- gemma
- gemma-4
- gemma4
- moe
- mixture-of-experts
- multimodal
- vision-language
- audio
- reasoning
- thinking
- chain-of-thought
- distillation
- claude
- text-generation
- conversational
- fine-tuned
- bfloat16
model_type: gemma4
inference: false
---

# Gemma-4-26B-A4B-IT — Claude Opus 4.6/4.7 Reasoning Fine-tune (Unsloth)

This is a fine-tune of [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) (via the Unsloth-fixed checkpoint [`unsloth/gemma-4-26b-a4b-it`](https://huggingface.co/unsloth/gemma-4-26b-a4b-it)) trained on [`angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k`](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) — a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.

The goal of this fine-tune is to **strengthen multi-step reasoning, planning, and self-reflection** on top of Gemma‑4's native `<|channel>thought` reasoning channel, while preserving its multimodal (text + image + audio + video), tool-calling, and long-context capabilities.

> Trained with [Unsloth](https://github.com/unslothai/unsloth) — 2× faster training, lower VRAM, identical accuracy.

---

## Model Summary

| Property | Details |
|---|---|
| **Base model** | [`unsloth/gemma-4-26b-a4b-it`](https://huggingface.co/unsloth/gemma-4-26b-a4b-it) ([`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it)) |
| **Architecture** | `Gemma4ForConditionalGeneration` (Mixture‑of‑Experts, multimodal) |
| **Model type** | `gemma4` |
| **Total parameters** | ~26 B |
| **Active parameters / token** | ~4 B (MoE: 128 experts, top‑8 routing) |
| **Modalities** | Text, Image, Audio, Video (inputs) → Text (output) |
| **Max context length** | 262,144 tokens (262K) |
| **Sliding window** | 1,024 (every 6th layer is full attention) |
| **Vocab size** | 262,144 |
| **Tensor dtype** | `bfloat16` |
| **Tokenizer** | Gemma‑4 SentencePiece (multimodal special tokens for `<|image|>`, `<|audio|>`, `<|video|>`) |
| **Chat template** | Gemma‑4 conversational template with `<|channel>thought` reasoning channel and native tool‑calling |
| **Training framework** | [Unsloth](https://github.com/unslothai/unsloth) `2026.5.7` |
| **Fine-tuning dataset** | [`angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k`](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) |
| **License** | GPL‑3.0 (this fine‑tune); base model under [Gemma Terms of Use](https://ai.google.dev/gemma/terms) |

### Architecture Details

- **Text backbone** (`gemma4_text`):
  - 30 hidden layers, hidden size 2816, intermediate size 2112
  - 16 attention heads, 8 KV heads, head dim 256, global head dim 512
  - **MoE blocks**: 128 experts, top‑8 routing, MoE intermediate size 704
  - Hybrid attention pattern: 5× sliding (window=1024) + 1× full attention, repeated
  - Final‑logit softcap = 30.0, RMSNorm ε = 1e‑6
  - RoPE: θ=1e6 (full‑attn, partial rotary 0.25), θ=1e4 (sliding)
  - Tied input/output embeddings
- **Vision tower** (`gemma4_vision`): 27 layers, hidden 1152, 16 heads, patch 16, 280 soft tokens per image, pooling kernel 3
- **Audio tower** (`Gemma4AudioFeatureExtractor`): 16 kHz, 128 mel bins, 40 ms / token, up to 750 audio tokens
- **Video processor**: 32 sampled frames, 70 soft tokens per frame max
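
A few derived quantities follow directly from the figures listed above. The sketch below works them out. It assumes that the full‑attention layer is the last layer of each group of six, which the card does not state explicitly, and all variable names are illustrative.

```python
# Figures copied from the architecture tables above.
NUM_LAYERS = 30
SLIDING_PER_BLOCK = 5        # sliding-window layers per repeating block
FULL_PER_BLOCK = 1           # full-attention layers per repeating block
NUM_EXPERTS = 128
TOP_K_EXPERTS = 8
AUDIO_MS_PER_TOKEN = 40
MAX_AUDIO_TOKENS = 750
VIDEO_FRAMES = 32
MAX_TOKENS_PER_FRAME = 70

block = SLIDING_PER_BLOCK + FULL_PER_BLOCK

# Assumption: the full-attention layer closes each block of six (0-indexed).
full_attention_layers = [i for i in range(NUM_LAYERS) if (i + 1) % block == 0]
sliding_layers = NUM_LAYERS - len(full_attention_layers)

# Share of experts consulted per token, longest audio clip, video token budget.
expert_fraction = TOP_K_EXPERTS / NUM_EXPERTS
max_audio_seconds = MAX_AUDIO_TOKENS * AUDIO_MS_PER_TOKEN / 1000
max_video_tokens = VIDEO_FRAMES * MAX_TOKENS_PER_FRAME

print(full_attention_layers)  # [5, 11, 17, 23, 29]
print(sliding_layers)         # 25
print(expert_fraction)        # 0.0625
print(max_audio_seconds)      # 30.0
print(max_video_tokens)       # 2240
```

Under these figures, a single audio input tops out at roughly 30 seconds and a fully sampled video contributes at most 2,240 soft tokens to the context.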

---

## Training

| Property | Details |
|---|---|
| **Method** | Supervised Fine‑Tuning (SFT) on reasoning traces |
| **Device** | Nvidia DGX Spark (x1) |
| **Framework** | Unsloth + 🤗 Transformers / TRL |
| **Precision** | bf16 |
| **Dataset size** | ~8,700 multi‑turn reasoning examples |
| **Dataset source** | Reasoning rollouts distilled from Claude Opus 4.6 / 4.7 |
| **Reasoning format** | Preserves Gemma‑4's native `<|channel>thought ... <channel|>` block; the reasoning inside it is learned from Claude‑style chain‑of‑thought traces |

The training corpus emphasizes:

- Long, structured chain‑of‑thought reasoning
- Math, code, logic and step‑wise problem decomposition
- Self‑verification and answer revision patterns
- Instruction following with explicit thinking → answer separation

> Reasoning data is distilled from Anthropic's Claude models. Outputs may reflect stylistic patterns of Claude (e.g. hedged tone, explicit step labels, "Let me think…" preambles). Use accordingly.

---

## Intended Use

**Primary use cases**

- Reasoning‑heavy assistants (math, coding, agentic planning)
- Multimodal Q&A over images / audio / video
- Long‑context (up to 262K) summarization, retrieval, and document analysis
- Tool‑calling / function‑calling agents (native in chat template)
- Research on MoE + multimodal reasoning distillation

**Out‑of‑scope / not recommended**

- High‑stakes decisions (medical, legal, financial advice without human review)
- Generation of disallowed content under the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy)
- Safety‑critical autonomous deployments without guardrails

---

## How to Use

### Install

```bash
pip install -U transformers accelerate
# Optional (recommended) for faster inference / fine-tuning:
pip install -U unsloth
```

> Requires a recent `transformers` build with `gemma4` model support.

### Text generation (Transformers)

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a careful, step-by-step reasoner."},
    {"role": "user", "content": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,                # return token tensors instead of a prompt string
    enable_thinking=True,         # turn on the <|channel>thought block
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```

### Multimodal (image + text)

```python
# Reuses `processor` and `model` from the text-generation example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},
            {"type": "text",  "text": "Explain what this diagram shows and reason about any inconsistencies."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,                # return token tensors instead of a prompt string
    enable_thinking=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The processor also accepts `{"type": "audio", ...}` and `{"type": "video", ...}` content parts (see `processor_config.json`).

### Tool calling

The chat template natively supports OpenAI‑style `tools` and Gemma‑native `tool_calls` / `tool_responses`. Pass tools to `apply_chat_template(..., tools=[...])`: the template renders them as `<|tool>…<tool|>` declarations, and tool calls appear in the conversation as `<|tool_call>…<tool_call|>` blocks.
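
As a minimal sketch of the tool‑passing step, the snippet below builds an OpenAI‑style tool schema and the keyword arguments for `apply_chat_template`. The `get_weather` tool, its parameters, and the `build_template_kwargs` helper are hypothetical and exist only for illustration. The actual template call needs a loaded `processor`, so it is shown as a comment.

```python
import json

# Hypothetical tool schema in the OpenAI function-calling style.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}


def build_template_kwargs(user_text: str) -> dict:
    """Collect the arguments that would be passed to apply_chat_template."""
    return {
        "conversation": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
        "add_generation_prompt": True,
        "tokenize": True,
        "return_dict": True,
        "return_tensors": "pt",
    }


kwargs = build_template_kwargs("What's the weather in Oslo?")
print(json.dumps(kwargs["tools"], indent=2))

# With a loaded processor (not done here):
# inputs = processor.apply_chat_template(**kwargs).to(model.device)
```

Executing the tool and feeding its result back as a tool response is left to the calling application.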

### Faster inference with Unsloth

```python
from unsloth import FastModel

model, processor = FastModel.from_pretrained(
    "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    max_seq_length = 8192,
    load_in_4bit  = False,   # bf16 here; set True for 4-bit
    dtype = None,
)
```

### Recommended sampling

From `generation_config.json`:

| Param | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `do_sample` | true |
| `eos_token_id` | `[1, 106, 50]` |
| `pad_token_id` | 0 |
| `bos_token_id` | 2 |

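For more consistent reasoning, lower the temperature to ~0.3–0.6. For fully deterministic (greedy) output, set `do_sample=False` instead; temperature, `top_p`, and `top_k` are then ignored.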
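
The two modes can be expressed as keyword arguments for `model.generate`. This is a small sketch: the `generation_kwargs` helper is hypothetical, and the default values are the ones from the table above. With greedy decoding (`do_sample=False`) the temperature has no effect, so it is omitted.

```python
def generation_kwargs(greedy: bool = False, temperature: float = 1.0) -> dict:
    """Return keyword arguments for model.generate (hypothetical helper)."""
    if greedy:
        # Greedy decoding: sampling parameters are not used.
        return {"do_sample": False}
    return {
        "do_sample": True,
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 64,
    }


print(generation_kwargs())                   # card defaults
print(generation_kwargs(temperature=0.4))    # lower-variance sampling
print(generation_kwargs(greedy=True))        # deterministic decoding

# Usage with a loaded model (not done here):
# out = model.generate(**inputs, max_new_tokens=1024, **generation_kwargs(greedy=True))
```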

---

## Chat Template & Reasoning Channel

This model uses Gemma‑4's structured chat template (`chat_template.jinja`) with:

- Role turns: `<|turn>system|user|model<turn|>`
- Thinking channel: `<|channel>thought ... <channel|>` (gated by `enable_thinking=True`)
- Tool declarations: `<|tool>…<tool|>`
- Tool calls / responses: `<|tool_call>…<tool_call|>` / `<|tool_response>…<tool_response|>`
- Multimodal placeholders: `<|image|>`, `<|audio|>`, `<|video|>`

When `add_generation_prompt=True` is used **without** `enable_thinking`, the template emits an empty `<|channel>thought<channel|>` to suppress reasoning. Pass `enable_thinking=True` to enable the model's full chain‑of‑thought.
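
When decoding with `skip_special_tokens=False`, the reasoning block can be separated from the final answer by matching the channel markers. The helper below is a minimal sketch that assumes the decoded text contains the markers exactly as written above; `split_reasoning` and the sample string are illustrative, not part of the model's tooling.

```python
import re

# Matches the thinking channel, including an empty block.
THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def split_reasoning(decoded: str) -> tuple[str, str]:
    """Split decoded output into (thought, answer); thought is '' if absent."""
    match = THOUGHT_RE.search(decoded)
    if match is None:
        return "", decoded.strip()
    thought = match.group(1).strip()
    # Everything outside the first thinking block is treated as the answer.
    answer = (decoded[:match.start()] + decoded[match.end():]).strip()
    return thought, answer


sample = "<|channel>thought\n9:15 + 2h 47m = 12:02.<channel|>The train arrives at 12:02."
thought, answer = split_reasoning(sample)
print(thought)  # 9:15 + 2h 47m = 12:02.
print(answer)   # The train arrives at 12:02.
```

Other special tokens, such as turn delimiters, are left in the answer and may need to be stripped separately.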

---

## Files

| File | Purpose |
|---|---|
| `config.json` | Full model config (text + vision + audio sub‑configs) |
| `generation_config.json` | Default sampling parameters |
| `processor_config.json` | Image / audio / video processor settings |
| `tokenizer.json`, `tokenizer_config.json` | Gemma‑4 multimodal tokenizer |
| `chat_template.jinja` | Conversational + tool‑calling + reasoning template |
| `model-00001-of-00002.safetensors`, `model-00002-of-00002.safetensors` | bf16 weights (~51.6 GB total) |
| `model.safetensors.index.json` | Sharding index |
| `export_metadata.json` | Export provenance |
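
The checkpoint size gives a rough lower bound on the memory needed for the weights alone. The sketch below assumes the ~51.6 GB figure is in decimal gigabytes and that every parameter is stored in 2 bytes (bf16). The 4‑bit estimate ignores quantization overhead and any tensors that are kept in higher precision, so treat it as an optimistic approximation.

```python
BF16_BYTES_PER_PARAM = 2
FOUR_BIT_BYTES_PER_PARAM = 0.5

checkpoint_gb = 51.6  # total shard size from the table above (assumed decimal GB)

# Parameter count implied by the checkpoint size, in billions.
approx_params_b = checkpoint_gb / BF16_BYTES_PER_PARAM

# Idealized weight footprint if every parameter were stored in 4 bits.
approx_4bit_gb = approx_params_b * FOUR_BIT_BYTES_PER_PARAM

print(round(approx_params_b, 1))  # 25.8
print(round(approx_4bit_gb, 1))   # 12.9
```

Activations and the KV cache come on top of these figures and grow with context length.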

---

## Limitations & Biases

- **Hallucinations**: Like all LLMs, this model can produce confident but incorrect answers, particularly outside its training distribution.
- **Reasoning style transfer**: Because the SFT data is distilled from Claude, stylistic and refusal patterns of Claude may leak into outputs.
- **Dataset size**: ~8.7k examples is small; expect targeted improvements on reasoning style rather than broad capability uplift over the base model.
- **Multimodal grounding**: Vision/audio/video capabilities are inherited from the base model and were not specifically targeted by this fine‑tune.
- **Safety**: No additional safety fine‑tuning was performed. Safety behavior is inherited from the base Gemma‑4 model and may have shifted during fine‑tuning, so downstream users should add their own guardrails.

---

## License

- **This fine-tune**: [GPL‑3.0](https://www.gnu.org/licenses/gpl-3.0.html)
- **Base model**: subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). You must comply with both when using or redistributing this model.
- **Training data**: see the [dataset card](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) for terms.

---

## Citation

If you use this model, please cite the base model, the Unsloth project, and the dataset:

```bibtex
@misc{gemma4_2025,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://ai.google.dev/gemma}
}

@misc{unsloth,
  title  = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  year   = {2024-2026},
  url    = {https://github.com/unslothai/unsloth}
}

@misc{claude_reasoning_8k7,
  title  = {claude-opus-4.6-4.7-reasoning-8.7k},
  author = {angrygiraffe},
  year   = {2026},
  url    = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}
```

---

## Acknowledgements

- **Google DeepMind** — Gemma‑4 base model
- **Unsloth team** — Quant‑fixed checkpoint, training framework, and inference acceleration
- **angrygiraffe** — Reasoning distillation dataset
- **Anthropic** — Source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces