---
license: gpl-3.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model:
- unsloth/gemma-4-26B-A4B-it
base_model_relation: finetune
datasets:
- angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
tags:
- unsloth
- gemma
- gemma-4
- gemma4
- moe
- mixture-of-experts
- multimodal
- vision-language
- audio
- reasoning
- thinking
- chain-of-thought
- distillation
- claude
- text-generation
- conversational
- fine-tuned
- bfloat16
model_type: gemma4
inference: false
---

# Gemma-4-26B-A4B-IT: Claude Opus 4.6/4.7 Reasoning Fine-tune (Unsloth)

This is a fine-tune of [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) (via the Unsloth-fixed checkpoint [`unsloth/gemma-4-26b-a4b-it`](https://huggingface.co/unsloth/gemma-4-26b-a4b-it)) trained on [`angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k`](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k), a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.

The goal of this fine-tune is to **strengthen multi-step reasoning, planning, and self-reflection** on top of Gemma‑4's native `<|channel>thought` reasoning channel, while preserving its multimodal (text + image + audio + video), tool-calling, and long-context capabilities.

> Trained with [Unsloth](https://github.com/unslothai/unsloth), which the Unsloth project describes as giving 2× faster training and lower VRAM use with identical accuracy.


---

## Model Summary

| Property | Details |
|---|---|
| **Base model** | [`unsloth/gemma-4-26b-a4b-it`](https://huggingface.co/unsloth/gemma-4-26b-a4b-it) ([`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it)) |
| **Architecture** | `Gemma4ForConditionalGeneration` (Mixture‑of‑Experts, multimodal) |
| **Model type** | `gemma4` |
| **Total parameters** | ~26 B |
| **Active parameters / token** | ~4 B (MoE: 128 experts, top‑8 routing) |
| **Modalities** | Text, Image, Audio, Video (inputs) → Text (output) |
| **Max context length** | 262,144 tokens (256K) |
| **Sliding window** | 1,024 tokens (every 6th layer is full attention) |
| **Vocab size** | 262,144 |
| **Tensor dtype** | `bfloat16` |
| **Tokenizer** | Gemma‑4 SentencePiece (multimodal special tokens for `<\|image\|>`, `<\|audio\|>`, `<\|video\|>`) |
| **Chat template** | Gemma‑4 conversational template with `<\|channel>thought` reasoning channel and native tool‑calling |
| **Training framework** | [Unsloth](https://github.com/unslothai/unsloth) `2026.5.7` |
| **Fine-tuning dataset** | [`angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k`](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) |
| **License** | GPL‑3.0 (this fine‑tune); base model under [Gemma Terms of Use](https://ai.google.dev/gemma/terms) |

### Architecture Details

- **Text backbone** (`gemma4_text`):
  - 30 hidden layers, hidden size 2816, intermediate size 2112
  - 16 attention heads, 8 KV heads, head dim 256, global head dim 512
  - **MoE blocks**: 128 experts, top‑8 routing, MoE intermediate size 704
  - Hybrid attention pattern: 5× sliding (window=1024) + 1× full attention, repeated
  - Final‑logit softcap = 30.0, RMSNorm ε = 1e‑6
  - RoPE: θ=1e6 (full‑attn, partial rotary 0.25), θ=1e4 (sliding)
  - Tied input/output embeddings
- **Vision tower** (`gemma4_vision`): 27 layers, hidden 1152, 16 heads, patch 16, 280 soft tokens per image, pooling kernel 3
- **Audio tower** (`Gemma4AudioFeatureExtractor`): 16 kHz, 128 mel bins, 40 ms / token, up to 750 audio tokens
- **Video processor**: 32 sampled frames, 70 soft tokens per frame max

---

## Training

| Property | Details |
|---|---|
| **Method** | Supervised Fine‑Tuning (SFT) on reasoning traces |
| **Device** | NVIDIA DGX Spark (×1) |
| **Framework** | Unsloth + 🤗 Transformers / TRL |
| **Precision** | bf16 |
| **Dataset size** | ~8,700 multi‑turn reasoning examples |
| **Dataset source** | Reasoning rollouts distilled from Claude Opus 4.6 / 4.7 |
| **Reasoning format** | Preserves Gemma‑4's native `<\|channel>thought ... ` block, trained on Claude‑style chain‑of‑thought |

The training corpus emphasizes:

- Long, structured chain‑of‑thought reasoning
- Math, code, logic, and step‑wise problem decomposition
- Self‑verification and answer revision patterns
- Instruction following with explicit thinking → answer separation

> Reasoning data is distilled from Anthropic's Claude models. Outputs may reflect stylistic patterns of Claude (e.g. hedged tone, explicit step labels, "Let me think…" preambles). Use accordingly.

---

## Intended Use

**Primary use cases**

- Reasoning‑heavy assistants (math, coding, agentic planning)
- Multimodal Q&A over images / audio / video
- Long‑context (up to 262,144 tokens) summarization, retrieval, and document analysis
- Tool‑calling / function‑calling agents (native in chat template)
- Research on MoE + multimodal reasoning distillation

**Out‑of‑scope / not recommended**

- High‑stakes decisions (medical, legal, financial advice without human review)
- Generation of disallowed content under the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy)
- Safety‑critical autonomous deployments without guardrails

---

## How to Use

### Install

```bash
pip install -U transformers accelerate
# Optional (recommended) for faster inference / fine-tuning:
pip install -U unsloth
```

> Requires a recent `transformers` build with `gemma4` model support.

### Text generation (Transformers)

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Content is given as a list of typed parts, the same format the multimodal example uses.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a careful, step-by-step reasoner."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"}],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,         # return token tensors rather than the rendered prompt string
    enable_thinking=True,  # turn on the <|channel>thought block
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)

# Decode only the newly generated tokens; special tokens are kept so the thought channel stays visible.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```

### Multimodal (image + text)

```python
# Reuses `processor` and `model` from the text-generation example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},
            {"type": "text", "text": "Explain what this diagram shows and reason about any inconsistencies."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,  # return token tensors rather than the rendered prompt string
    enable_thinking=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The processor also accepts `{"type": "audio", ...}` and `{"type": "video", ...}` content parts (see `processor_config.json`).

### Tool calling

The chat template natively supports OpenAI‑style `tools` and Gemma‑native `tool_calls` / `tool_responses`. Pass tools to `apply_chat_template(..., tools=[...])` and the template will emit `<|tool>…` declarations and render `<|tool_call>…` blocks for tool calls already present in the conversation.

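As an illustration of the `tools=[...]` argument described above, here is a minimal sketch of an OpenAI-style function schema. The `get_weather` tool, its parameters, and its canned return value are hypothetical and exist only for this example; the commented-out call at the end mirrors the `apply_chat_template` usage from the earlier examples and is not run here.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool: return a canned weather report for `city`."""
    return {"city": city, "unit": unit, "temperature": 21}


# OpenAI-style function schema describing the hypothetical tool.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

tools = [weather_tool]

# The schema should be JSON-serialisable, since the template renders it into the prompt.
print(json.dumps(tools, indent=2))

# With `processor` and `messages` from the examples above, the list would be passed as:
# inputs = processor.apply_chat_template(
#     messages, tools=tools, add_generation_prompt=True,
#     tokenize=True, return_dict=True, return_tensors="pt",
# )
```

Executing the tool and feeding its result back as a tool response is left to the calling application; the exact message shape for responses is defined by `chat_template.jinja`.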
### Faster inference with Unsloth

```python
from unsloth import FastModel

model, processor = FastModel.from_pretrained(
    "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    max_seq_length = 8192,
    load_in_4bit = False,  # bf16 here; set True for 4-bit
    dtype = None,          # None lets Unsloth pick the dtype automatically
)
```

### Recommended sampling

From `generation_config.json`:

| Param | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `do_sample` | true |
| `eos_token_id` | `[1, 106, 50]` |
| `pad_token_id` | 0 |
| `bos_token_id` | 2 |

For more focused, less varied reasoning, lower the temperature to ~0.3–0.6. For fully deterministic (greedy) decoding, set `do_sample=False` instead; the temperature, `top_p`, and `top_k` settings then have no effect.

---

## Chat Template & Reasoning Channel

This model uses Gemma‑4's structured chat template (`chat_template.jinja`) with:

- Role turns: `<|turn>system|user|model`
- Thinking channel: `<|channel>thought ... ` (gated by `enable_thinking=True`)
- Tool declarations: `<|tool>…`
- Tool calls / responses: `<|tool_call>…` / `<|tool_response>…`
- Multimodal placeholders: `<|image|>`, `<|audio|>`, `<|video|>`

When `add_generation_prompt=True` is used **without** `enable_thinking`, the template emits an empty `<|channel>thought` to suppress reasoning. Pass `enable_thinking=True` to enable the model's full chain‑of‑thought.

---

## Files

| File | Purpose |
|---|---|
| `config.json` | Full model config (text + vision + audio sub‑configs) |
| `generation_config.json` | Default sampling parameters |
| `processor_config.json` | Image / audio / video processor settings |
| `tokenizer.json`, `tokenizer_config.json` | Gemma‑4 multimodal tokenizer |
| `chat_template.jinja` | Conversational + tool‑calling + reasoning template |
| `model-00001-of-00002.safetensors`, `model-00002-of-00002.safetensors` | bf16 weights (~51.6 GB total) |
| `model.safetensors.index.json` | Sharding index |
| `export_metadata.json` | Export provenance |

---

## Limitations & Biases

- **Hallucinations**: Like all LLMs, this model can produce confident but incorrect answers, particularly outside its training distribution.
- **Reasoning style transfer**: Because the SFT data is distilled from Claude, stylistic and refusal patterns of Claude may leak into outputs.
- **Dataset size**: ~8.7k examples is small; expect targeted improvements on reasoning style rather than broad capability uplift over the base model.
- **Multimodal grounding**: Vision/audio/video capabilities are inherited from the base model and were not specifically targeted by this fine‑tune.
- **Safety**: No additional safety fine‑tuning was performed. Any safety behavior is inherited from the base Gemma‑4 model and is not guaranteed to be unchanged by fine‑tuning, so downstream users should add their own guardrails.

---

## License

- **This fine-tune**: [GPL‑3.0](https://www.gnu.org/licenses/gpl-3.0.html)
- **Base model**: subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). You must comply with both when using or redistributing this model.
- **Training data**: see the [dataset card](https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k) for terms.


---

## Citation

If you use this model, please cite the base model, the Unsloth project, and the dataset:

```bibtex
@misc{gemma4_2025,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://ai.google.dev/gemma}
}

@misc{unsloth,
  title  = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  year   = {2024-2026},
  url    = {https://github.com/unslothai/unsloth}
}

@misc{claude_reasoning_8k7,
  title  = {claude-opus-4.6-4.7-reasoning-8.7k},
  author = {angrygiraffe},
  year   = {2026},
  url    = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}
```

---

## Acknowledgements

- **Google DeepMind**: Gemma‑4 base model
- **Unsloth team**: Quant‑fixed checkpoint, training framework, and inference acceleration
- **angrygiraffe**: Reasoning distillation dataset
- **Anthropic**: Source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces