Gemma-4-26B-A4B-IT — Claude Opus 4.6/4.7 Reasoning Fine-tune (Unsloth)

This is a fine-tune of google/gemma-4-26B-A4B-it (via the Unsloth-fixed checkpoint unsloth/gemma-4-26b-a4b-it) trained on angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.

The goal of this fine-tune is to strengthen multi-step reasoning, planning, and self-reflection on top of Gemma‑4's native <|channel>thought reasoning channel, while preserving its multimodal (text + image + audio + video), tool-calling, and long-context capabilities.

Trained with Unsloth, which its authors report gives roughly 2× faster training and lower VRAM use with no loss of accuracy.


Model Summary

Property                    Details
Base model                  unsloth/gemma-4-26b-a4b-it (google/gemma-4-26B-A4B-it)
Architecture                Gemma4ForConditionalGeneration (Mixture‑of‑Experts, multimodal)
Model type                  gemma4
Total parameters            ~26 B
Active parameters / token   ~4 B (MoE: 128 experts, top‑8 routing)
Modalities                  Text, Image, Audio, Video (inputs) → Text (output)
Max context length          262,144 tokens (262K)
Sliding window              1,024 (every 6th layer is full attention)
Vocab size                  262,144
Tensor dtype                bfloat16
Tokenizer                   Gemma‑4 SentencePiece (multimodal special tokens for `<
Chat template               Gemma‑4 conversational template with `<
Training framework          Unsloth 2026.5.7
Fine-tuning dataset         angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
License                     GPL‑3.0 (this fine‑tune); base model under Gemma Terms of Use

Architecture Details

  • Text backbone (gemma4_text):
    • 30 hidden layers, hidden size 2816, intermediate size 2112
    • 16 attention heads, 8 KV heads, head dim 256, global head dim 512
    • MoE blocks: 128 experts, top‑8 routing, MoE intermediate size 704
    • Hybrid attention pattern: 5× sliding (window=1024) + 1× full attention, repeated
    • Final‑logit softcap = 30.0, RMSNorm ε = 1e‑6
    • RoPE: θ=1e6 (full‑attn, partial rotary 0.25), θ=1e4 (sliding)
    • Tied input/output embeddings
  • Vision tower (gemma4_vision): 27 layers, hidden 1152, 16 heads, patch 16, 280 soft tokens per image, pooling kernel 3
  • Audio tower (Gemma4AudioFeatureExtractor): 16 kHz, 128 mel bins, 40 ms / token, up to 750 audio tokens
  • Video processor: 32 sampled frames, 70 soft tokens per frame max
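
The hybrid attention pattern listed above can be illustrated with a short sketch. It assumes the 5× sliding + 1× full cycle starts at the first layer, so that every 6th layer uses full attention; the card does not state the offset explicitly, and the helper name `attention_layout` is hypothetical rather than part of any library.

```python
from collections import Counter

NUM_LAYERS = 30        # hidden layers in the text backbone
CYCLE = 6              # 5 sliding-window layers followed by 1 full-attention layer
SLIDING_WINDOW = 1024  # tokens visible to each sliding-window layer


def attention_layout(num_layers: int = NUM_LAYERS, cycle: int = CYCLE) -> list[str]:
    """Return the per-layer attention type, assuming the cycle starts at layer 0."""
    return [
        "full" if (i + 1) % cycle == 0 else "sliding"
        for i in range(num_layers)
    ]


layout = attention_layout()
counts = Counter(layout)
full_layers = [i for i, kind in enumerate(layout) if kind == "full"]

print(counts["sliding"], "sliding layers,", counts["full"], "full-attention layers")
print("full-attention layer indices (0-based):", full_layers)
```

Under that assumption, 25 of the 30 layers attend only to the most recent 1,024 tokens, and 5 layers attend to the whole context.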

Training

Property            Details
Method              Supervised Fine‑Tuning (SFT) on reasoning traces
Device              Nvidia DGX Spark (x1)
Framework           Unsloth + 🤗 Transformers / TRL
Precision           bf16
Dataset size        ~8,700 multi‑turn reasoning examples
Dataset source      Reasoning rollouts distilled from Claude Opus 4.6 / 4.7
Reasoning format    Preserves Gemma‑4's native `<

The training corpus emphasizes:

  • Long, structured chain‑of‑thought reasoning
  • Math, code, logic and step‑wise problem decomposition
  • Self‑verification and answer revision patterns
  • Instruction following with explicit thinking → answer separation

Reasoning data is distilled from Anthropic's Claude models, so outputs may reflect Claude's stylistic patterns (e.g. hedged tone, explicit step labels, "Let me think…" preambles). Account for this when evaluating outputs or building on them.


Intended Use

Primary use cases

  • Reasoning‑heavy assistants (math, coding, agentic planning)
  • Multimodal Q&A over images / audio / video
  • Long‑context (up to 262K) summarization, retrieval, and document analysis
  • Tool‑calling / function‑calling agents (native in chat template)
  • Research on MoE + multimodal reasoning distillation

Out‑of‑scope / not recommended

  • High‑stakes decisions (medical, legal, financial advice without human review)
  • Generation of disallowed content under the Gemma Prohibited Use Policy
  • Safety‑critical autonomous deployments without guardrails

How to Use

Install

pip install -U transformers accelerate
# Optional (recommended) for faster inference / fine-tuning:
pip install -U unsloth

Requires a recent transformers build with gemma4 model support.
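
One way to check the installed build before downloading the weights is sketched below. It assumes that a transformers release with Gemma‑4 support registers the model type `gemma4` (the value shown in the Model Summary) in its auto-config name mapping; the helper name `has_gemma4_support` is hypothetical.

```python
def has_gemma4_support() -> bool:
    """Return True if the installed transformers registers the `gemma4` model type."""
    try:
        # Plain dict of model_type -> config class name used by the Auto* classes.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        return False  # transformers is not installed or does not expose the mapping
    return "gemma4" in CONFIG_MAPPING_NAMES


print("gemma4 supported:", has_gemma4_support())
```

If this prints `False`, upgrading transformers is the likely fix.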

Text generation (Transformers)

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a careful, step-by-step reasoner."},
    {"role": "user", "content": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,                # return token IDs rather than the rendered prompt string
    enable_thinking=True,         # turn on the <|channel>thought block
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))

Multimodal (image + text)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},
            {"type": "text",  "text": "Explain what this diagram shows and reason about any inconsistencies."},
        ],
    },
]

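# Reuses `processor` and `model` from the text-generation example above.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,                # return tensors, not the rendered prompt string
    enable_thinking=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)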

out = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

The processor also accepts {"type": "audio", ...} and {"type": "video", ...} content parts (see processor_config.json).
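
When mixing modalities in a long prompt, it can help to estimate how much of the context window the non-text inputs consume. The sketch below uses the per-modality figures from Architecture Details and assumes they add linearly, that each video always uses the maximum of 70 tokens per frame, and that a single audio clip is truncated at the 750-token cap. None of this is confirmed by the card, and `multimodal_tokens` is a hypothetical helper.

```python
import math

IMAGE_SOFT_TOKENS = 280      # soft tokens per image
AUDIO_MS_PER_TOKEN = 40      # one audio token covers 40 ms
AUDIO_MAX_TOKENS = 750       # cap on audio tokens (about 30 s of audio)
VIDEO_FRAMES = 32            # frames sampled per video
VIDEO_TOKENS_PER_FRAME = 70  # maximum soft tokens per frame
MAX_CONTEXT = 262_144        # maximum context length in tokens


def multimodal_tokens(n_images: int = 0, audio_seconds: float = 0.0, n_videos: int = 0) -> int:
    """Upper-bound estimate of the tokens used by non-text inputs (one audio clip)."""
    audio_tokens = min(
        AUDIO_MAX_TOKENS,
        math.ceil(audio_seconds * 1000 / AUDIO_MS_PER_TOKEN),
    )
    video_tokens = n_videos * VIDEO_FRAMES * VIDEO_TOKENS_PER_FRAME
    return n_images * IMAGE_SOFT_TOKENS + audio_tokens + video_tokens


used = multimodal_tokens(n_images=2, audio_seconds=10, n_videos=1)
print("multimodal tokens:", used)
print("left for text:", MAX_CONTEXT - used)
```

Even a prompt with two images, ten seconds of audio, and one video uses only a small fraction of the window under these assumptions.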

Tool calling

The chat template natively supports OpenAI‑style tools and Gemma‑native tool_calls / tool_responses. Pass tools to apply_chat_template(..., tools=[...]) and the template will emit <|tool>…<tool|> declarations and parse <|tool_call>…<tool_call|> blocks.
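
A minimal sketch of that flow follows. The `get_weather` tool is hypothetical, and so are the helper names `render_prompt` and `extract_tool_calls`. The helper only pulls out the raw text between the `<|tool_call>` and `<tool_call|>` delimiters; it makes no assumption about the payload format inside them, which this card does not specify.

```python
import re

# Hypothetical tool in the OpenAI-style function schema (illustration only).
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
            },
            "required": ["city"],
        },
    },
}

# Delimiters as listed in this card's chat-template section.
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)


def render_prompt(processor, messages, tools):
    """Render the prompt text with tool declarations injected by the chat template."""
    return processor.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        tokenize=False,  # return the rendered text so the declarations can be inspected
    )


def extract_tool_calls(generated_text: str) -> list[str]:
    """Return the raw payload of every tool-call block, without interpreting it."""
    return [payload.strip() for payload in TOOL_CALL_RE.findall(generated_text)]
```

With a loaded processor, `render_prompt(processor, messages, [GET_WEATHER_TOOL])` should show the tool declaration in the prompt. Decode generations with `skip_special_tokens=False` so the delimiters survive for `extract_tool_calls`.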

Faster inference with Unsloth

from unsloth import FastModel

model, processor = FastModel.from_pretrained(
    "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    max_seq_length = 8192,
    load_in_4bit  = False,   # bf16 here; set True for 4-bit
    dtype = None,
)
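
To decide between bf16 and 4-bit loading, a rough estimate of weight memory can help. The sketch below scales the ~51.6 GB bf16 checkpoint size listed under Files by the bit width. It is only an approximation: it ignores quantization metadata, any layers kept in higher precision, the KV cache, and activations. The helper name `weights_gb` is hypothetical.

```python
BF16_CHECKPOINT_GB = 51.6  # total size of the bf16 shards listed under Files
BF16_BITS = 16


def weights_gb(bits_per_param: int, checkpoint_gb: float = BF16_CHECKPOINT_GB) -> float:
    """Scale the bf16 checkpoint size to another bit width (weights only)."""
    return checkpoint_gb * bits_per_param / BF16_BITS


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weights_gb(bits):.1f} GB")
```

By this estimate, 4-bit weights need roughly a quarter of the bf16 footprint, before runtime overheads.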

Recommended sampling

From generation_config.json:

Param           Value
temperature     1.0
top_p           0.95
top_k           64
do_sample       true
eos_token_id    [1, 106, 50]
pad_token_id    0
bos_token_id    2

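For more deterministic reasoning, lower temperature to ~0.3–0.6. For fully deterministic (greedy) decoding, set do_sample=False instead; temperature, top_p, and top_k then have no effect.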
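
The defaults above, a lower-temperature variant, and a greedy alternative can be expressed as keyword arguments for `model.generate`. This is a sketch: the helper name `generation_kwargs` and its `mode` argument are hypothetical, and some transformers versions may warn about unused sampling parameters from generation_config.json when `do_sample=False`.

```python
# Sampling defaults copied from generation_config.json as listed in this card.
RECOMMENDED_SAMPLING = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}


def generation_kwargs(mode: str = "sample", temperature=None, max_new_tokens: int = 1024) -> dict:
    """Build keyword arguments for `model.generate` in 'sample' or 'greedy' mode."""
    if mode == "greedy":
        # Greedy decoding: sampling parameters are omitted because they have no effect.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    if mode != "sample":
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs = dict(RECOMMENDED_SAMPLING, max_new_tokens=max_new_tokens)
    if temperature is not None:
        kwargs["temperature"] = temperature  # e.g. 0.3 to 0.6 for steadier output
    return kwargs


print(generation_kwargs())                 # card defaults
print(generation_kwargs(temperature=0.4))  # lower-variance sampling
print(generation_kwargs("greedy"))         # deterministic decoding
```

Usage would look like `model.generate(**inputs, **generation_kwargs("greedy"))`.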


Chat Template & Reasoning Channel

This model uses Gemma‑4's structured chat template (chat_template.jinja) with:

  • Role turns: <|turn>system|user|model<turn|>
  • Thinking channel: <|channel>thought ... <channel|> (gated by enable_thinking=True)
  • Tool declarations: <|tool>…<tool|>
  • Tool calls / responses: <|tool_call>…<tool_call|> / <|tool_response>…<tool_response|>
  • Multimodal placeholders: <|image|>, <|audio|>, <|video|>

When add_generation_prompt=True is used without enable_thinking, the template emits an empty <|channel>thought<channel|> to suppress reasoning. Pass enable_thinking=True to enable the model's full chain‑of‑thought.
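
To separate the reasoning from the final answer after generation, the thinking block can be split out of the decoded text. The sketch below assumes the output was decoded with `skip_special_tokens=False` and contains the `<|channel>thought ... <channel|>` markers exactly as written above; `split_thought` is a hypothetical helper, and other special tokens (such as turn markers) are left in the answer untouched.

```python
import re

# Thinking-channel delimiters as listed in this card.
THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer); thought is empty if no thinking block is present."""
    match = THOUGHT_RE.search(text)
    if match is None:
        return "", text.strip()
    thought = match.group(1).strip()
    # Everything outside the first thinking block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thought, answer


sample = "<|channel>thought\n9:15 plus 2h 47m\n<channel|>The train arrives at 12:02."
thought, answer = split_thought(sample)
print("thought:", thought)
print("answer:", answer)
```

An empty block, as emitted when thinking is disabled, yields an empty thought string and the full answer.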


Files

File                                                                Purpose
config.json                                                         Full model config (text + vision + audio sub‑configs)
generation_config.json                                              Default sampling parameters
processor_config.json                                               Image / audio / video processor settings
tokenizer.json, tokenizer_config.json                               Gemma‑4 multimodal tokenizer
chat_template.jinja                                                 Conversational + tool‑calling + reasoning template
model-00001-of-00002.safetensors, model-00002-of-00002.safetensors  bf16 weights (~51.6 GB total)
model.safetensors.index.json                                        Sharding index
export_metadata.json                                                Export provenance

Limitations & Biases

  • Hallucinations: Like all LLMs, this model can produce confident but incorrect answers, particularly outside its training distribution.
  • Reasoning style transfer: Because the SFT data is distilled from Claude, stylistic and refusal patterns of Claude may leak into outputs.
  • Dataset size: ~8.7k examples is small; expect targeted improvements on reasoning style rather than broad capability uplift over the base model.
  • Multimodal grounding: Vision/audio/video capabilities are inherited from the base model and were not specifically targeted by this fine‑tune.
  • Safety: No additional safety fine‑tuning was performed. The base Gemma‑4 safety guarantees apply, but downstream users should add their own guardrails.

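License

This fine-tune is released under GPL-3.0; the base model remains subject to the Gemma Terms of Use.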

Citation

If you use this model, please cite the base model, the Unsloth project, and the dataset:

@misc{gemma4_2025,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://ai.google.dev/gemma}
}

@misc{unsloth,
  title  = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  year   = {2024-2026},
  url    = {https://github.com/unslothai/unsloth}
}

@misc{claude_reasoning_8k7,
  title  = {claude-opus-4.6-4.7-reasoning-8.7k},
  author = {angrygiraffe},
  year   = {2026},
  url    = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}

Acknowledgements

  • Google DeepMind — Gemma‑4 base model
  • Unsloth team — Quant‑fixed checkpoint, training framework, and inference acceleration
  • angrygiraffe — Reasoning distillation dataset
  • Anthropic — Source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces