Gemma-4-26B-A4B-IT — Claude Opus 4.6/4.7 Reasoning Fine-tune (Unsloth)
This is a fine-tune of google/gemma-4-26B-A4B-it (via the Unsloth-fixed checkpoint unsloth/gemma-4-26b-a4b-it) trained on angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.
The goal of this fine-tune is to strengthen multi-step reasoning, planning, and self-reflection on top of Gemma‑4's native <|channel>thought reasoning channel, while preserving its multimodal (text + image + audio + video), tool-calling, and long-context capabilities.
Trained with Unsloth, which advertises roughly 2× faster training and lower VRAM use with no loss in accuracy.
Model Summary
| Property | Details |
|---|---|
| Base model | unsloth/gemma-4-26b-a4b-it (google/gemma-4-26B-A4B-it) |
| Architecture | Gemma4ForConditionalGeneration (Mixture‑of‑Experts, multimodal) |
| Model type | gemma4 |
| Total parameters | ~26 B |
| Active parameters / token | ~4 B (MoE: 128 experts, top‑8 routing) |
| Modalities | Text, Image, Audio, Video (inputs) → Text (output) |
| Max context length | 262,144 tokens (262K) |
| Sliding window | 1,024 (every 6th layer is full attention) |
| Vocab size | 262,144 |
| Tensor dtype | bfloat16 |
| Tokenizer | Gemma‑4 SentencePiece (multimodal special tokens for `< |
| Chat template | Gemma‑4 conversational template with `< |
| Training framework | Unsloth 2026.5.7 |
| Fine-tuning dataset | angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k |
| License | GPL‑3.0 (this fine‑tune); base model under Gemma Terms of Use |
Architecture Details
- Text backbone (`gemma4_text`):
  - 30 hidden layers, hidden size 2816, intermediate size 2112
  - 16 attention heads, 8 KV heads, head dim 256, global head dim 512
  - MoE blocks: 128 experts, top‑8 routing, MoE intermediate size 704
  - Hybrid attention pattern: 5× sliding (window=1024) + 1× full attention, repeated
  - Final‑logit softcap = 30.0, RMSNorm ε = 1e‑6
  - RoPE: θ=1e6 (full‑attn, partial rotary 0.25), θ=1e4 (sliding)
  - Tied input/output embeddings
- Vision tower (`gemma4_vision`): 27 layers, hidden 1152, 16 heads, patch 16, 280 soft tokens per image, pooling kernel 3
- Audio tower (`Gemma4AudioFeatureExtractor`): 16 kHz, 128 mel bins, 40 ms / token, up to 750 audio tokens
- Video processor: 32 sampled frames, 70 soft tokens per frame max
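The layer layout and token budgets listed above can be checked with a little arithmetic. The sketch below is illustrative only: the helper and constant names are made up here, and it assumes the repeating unit is ordered as five sliding-window layers followed by one full-attention layer (the card states the ratio, not the exact ordering).

```python
# Numbers are copied from the architecture list. The ordering inside the
# repeating unit (5 sliding layers, then 1 full layer) is an assumption.
NUM_LAYERS = 30
SLIDING_PER_BLOCK = 5
FULL_PER_BLOCK = 1


def layer_types(num_layers: int = NUM_LAYERS) -> list[str]:
    """Return the assumed attention type of each layer, in order."""
    block = ["sliding"] * SLIDING_PER_BLOCK + ["full"] * FULL_PER_BLOCK
    repeats, remainder = divmod(num_layers, len(block))
    return block * repeats + block[:remainder]


layers = layer_types()
num_full = layers.count("full")
num_sliding = layers.count("sliding")

# Grouped-query attention: 16 query heads share 8 KV heads.
queries_per_kv_head = 16 // 8

# Upper bounds on multimodal soft tokens, from the processor settings.
max_video_tokens = 32 * 70            # 32 frames x 70 tokens per frame
max_audio_seconds = 750 * 40 / 1000   # 750 tokens x 40 ms per token

print(num_sliding, num_full, queries_per_kv_head)  # 25 5 2
print(max_video_tokens, max_audio_seconds)         # 2240 30.0
```

Under these assumptions, only 5 of the 30 layers attend over the full context, a video clip costs at most 2,240 soft tokens, and the audio budget covers about 30 seconds per clip.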
Training
| Property | Details |
|---|---|
| Method | Supervised Fine‑Tuning (SFT) on reasoning traces |
| Device | 1× NVIDIA DGX Spark |
| Framework | Unsloth + 🤗 Transformers / TRL |
| Precision | bf16 |
| Dataset size | ~8,700 multi‑turn reasoning examples |
| Dataset source | Reasoning rollouts distilled from Claude Opus 4.6 / 4.7 |
| Reasoning format | Preserves Gemma‑4's native `< |
The training corpus emphasizes:
- Long, structured chain‑of‑thought reasoning
- Math, code, logic and step‑wise problem decomposition
- Self‑verification and answer revision patterns
- Instruction following with explicit thinking → answer separation
Reasoning data is distilled from Anthropic's Claude models. Outputs may reflect stylistic patterns of Claude (e.g. hedged tone, explicit step labels, "Let me think…" preambles). Use accordingly.
Intended Use
Primary use cases
- Reasoning‑heavy assistants (math, coding, agentic planning)
- Multimodal Q&A over images / audio / video
- Long‑context (up to 262K) summarization, retrieval, and document analysis
- Tool‑calling / function‑calling agents (native in chat template)
- Research on MoE + multimodal reasoning distillation
Out‑of‑scope / not recommended
- High‑stakes decisions (medical, legal, financial advice without human review)
- Generation of disallowed content under the Gemma Prohibited Use Policy
- Safety‑critical autonomous deployments without guardrails
How to Use
Install
pip install -U transformers accelerate
# Optional (recommended) for faster inference / fine-tuning:
pip install -U unsloth
Requires a recent `transformers` build with `gemma4` model support.
Text generation (Transformers)
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a careful, step-by-step reasoner."},
{"role": "user", "content": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
enable_thinking=True, # turn on the <|channel>thought block
tokenize=True, # return token tensors rather than the rendered prompt string
return_tensors="pt",
return_dict=True,
).to(model.device)
out = model.generate(
**inputs,
max_new_tokens=1024,
do_sample=True,
temperature=1.0,
top_p=0.95,
top_k=64,
)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
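To sanity-check the model's answer to the example prompt, the arithmetic can be reproduced with the standard library. This is only a verification sketch; the calendar date is arbitrary and not part of the prompt.

```python
from datetime import datetime, timedelta

# Arbitrary date; only the 9:15 departure time matters.
departure = datetime(2024, 1, 1, 9, 15)
arrival = departure + timedelta(hours=2, minutes=47)

expected = arrival.strftime("%H:%M")
print(expected)  # 12:02
```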
Multimodal (image + text)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://example.com/diagram.png"},
{"type": "text", "text": "Explain what this diagram shows and reason about any inconsistencies."},
],
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
enable_thinking=True,
tokenize=True, # return token tensors rather than the rendered prompt string
return_tensors="pt",
return_dict=True,
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
The processor also accepts {"type": "audio", ...} and {"type": "video", ...} content parts (see processor_config.json).
Tool calling
The chat template natively supports OpenAI‑style tools and Gemma‑native tool_calls / tool_responses. Pass tools to apply_chat_template(..., tools=[...]) and the template will emit <|tool>…<tool|> declarations and parse <|tool_call>…<tool_call|> blocks.
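As a rough illustration of the OpenAI-style path, the sketch below declares one tool and passes it through `tools=[...]` as described above. The `get_weather` tool, the `render_tool_prompt` helper, and the `dispatch` helper are hypothetical names introduced for this example; how tool results are fed back into the conversation depends on `chat_template.jinja` and is not shown.

```python
import json


# Hypothetical tool, used only for illustration.
def get_weather(city: str) -> dict:
    """Stand-in implementation; a real tool would call a weather service."""
    return {"city": city, "temperature_c": 21}


# OpenAI-style function declaration describing the tool above.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}


def render_tool_prompt(processor, messages):
    """Render the prompt text with the tool declared (needs a loaded processor)."""
    return processor.apply_chat_template(
        messages,
        tools=[WEATHER_TOOL],
        add_generation_prompt=True,
        tokenize=False,  # return the rendered string so the declaration can be inspected
    )


def dispatch(name: str, arguments: dict) -> str:
    """Run a requested tool by name and return its result as JSON text."""
    registry = {"get_weather": get_weather}
    return json.dumps(registry[name](**arguments))


print(dispatch("get_weather", {"city": "Paris"}))
```

Printing the string returned by `render_tool_prompt(processor, messages)` is a simple way to confirm that the declaration appears between the `<|tool>` and `<tool|>` markers before wiring up a full agent loop.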
Faster inference with Unsloth
from unsloth import FastModel
model, processor = FastModel.from_pretrained(
"glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
max_seq_length = 8192,
load_in_4bit = False, # bf16 here; set True for 4-bit
dtype = None,
)
Recommended sampling
From generation_config.json:
| Param | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `do_sample` | true |
| `eos_token_id` | [1, 106, 50] |
| `pad_token_id` | 0 |
| `bos_token_id` | 2 |
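For more repeatable reasoning, lower temperature to ~0.3–0.6. For fully deterministic (greedy) decoding, set do_sample=False instead; temperature, top_p, and top_k are then ignored.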
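One way to keep these choices in one place is a small helper that returns keyword arguments for `model.generate`. This is a sketch, not part of the model repository: the function name, the mode names, and the 0.4 temperature (picked from within the 0.3–0.6 range) are assumptions.

```python
# Defaults copied from generation_config.json as listed in this card.
CARD_DEFAULTS = {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "top_k": 64}


def generation_kwargs(mode: str = "default") -> dict:
    """Return sampling kwargs for model.generate (mode names are illustrative)."""
    if mode == "default":
        return dict(CARD_DEFAULTS)
    if mode == "focused":
        # Still samples, but with a lower temperature for less varied output.
        return {**CARD_DEFAULTS, "temperature": 0.4}
    if mode == "greedy":
        # Greedy decoding: sampling parameters are omitted because they are unused.
        return {"do_sample": False}
    raise ValueError(f"unknown mode: {mode!r}")


print(generation_kwargs("focused"))
print(generation_kwargs("greedy"))
```

Usage would look like `model.generate(**inputs, max_new_tokens=1024, **generation_kwargs("greedy"))`.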
Chat Template & Reasoning Channel
This model uses Gemma‑4's structured chat template (chat_template.jinja) with:
- Role turns: `<|turn>system|user|model<turn|>`
- Thinking channel: `<|channel>thought ... <channel|>` (gated by `enable_thinking=True`)
- Tool declarations: `<|tool>…<tool|>`
- Tool calls / responses: `<|tool_call>…<tool_call|>` / `<|tool_response>…<tool_response|>`
- Multimodal placeholders: `<|image|>`, `<|audio|>`, `<|video|>`
When add_generation_prompt=True is used without enable_thinking, the template emits an empty <|channel>thought<channel|> to suppress reasoning. Pass enable_thinking=True to enable the model's full chain‑of‑thought.
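When the output is decoded with `skip_special_tokens=False`, the reasoning can be separated from the final answer by looking for the channel markers. The helper below is a minimal sketch: `split_reasoning` is a name invented here, the marker spellings are taken from the list above, and it assumes the answer is whatever follows the closing `<channel|>` marker (any trailing turn markers are left in place).

```python
import re

# Marker spellings follow the template description in this card.
THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split decoded output into (thought, answer); thought is '' if absent or empty."""
    match = THOUGHT_RE.search(text)
    if match is None:
        return "", text.strip()
    thought = match.group(1).strip()
    answer = text[match.end():].strip()
    return thought, answer


sample = "<|channel>thought\n9:15 + 2:47 = 12:02<channel|>The train arrives at 12:02."
thought, answer = split_reasoning(sample)
print(thought)  # 9:15 + 2:47 = 12:02
print(answer)   # The train arrives at 12:02.
```

An empty `<|channel>thought<channel|>` block, as emitted when thinking is disabled, yields an empty thought string and leaves the answer untouched.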
Files
| File | Purpose |
|---|---|
| `config.json` | Full model config (text + vision + audio sub‑configs) |
| `generation_config.json` | Default sampling parameters |
| `processor_config.json` | Image / audio / video processor settings |
| `tokenizer.json`, `tokenizer_config.json` | Gemma‑4 multimodal tokenizer |
| `chat_template.jinja` | Conversational + tool‑calling + reasoning template |
| `model-00001-of-00002.safetensors`, `model-00002-of-00002.safetensors` | bf16 weights (~51.6 GB total) |
| `model.safetensors.index.json` | Sharding index |
| `export_metadata.json` | Export provenance |
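The ~51.6 GB figure is consistent with roughly 26 B parameters stored at 2 bytes each. The back-of-the-envelope sketch below estimates weight size at a few precisions; it is an approximation that ignores quantization metadata, the KV cache, and activations, and the function and dictionary names are illustrative.

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


total_params = 26e9  # "~26 B" total parameters from the model summary

print(weight_size_gb(total_params, "bf16"))  # 52.0
print(weight_size_gb(total_params, "4bit"))  # 13.0
```

Although only ~4 B parameters are active per token, all experts generally need to be resident unless some form of offloading is used, so memory planning should start from the total parameter count rather than the active one.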
Limitations & Biases
- Hallucinations: Like all LLMs, this model can produce confident but incorrect answers, particularly outside its training distribution.
- Reasoning style transfer: Because the SFT data is distilled from Claude, stylistic and refusal patterns of Claude may leak into outputs.
- Dataset size: ~8.7k examples is small; expect targeted improvements on reasoning style rather than broad capability uplift over the base model.
- Multimodal grounding: Vision/audio/video capabilities are inherited from the base model and were not specifically targeted by this fine‑tune.
- Safety: No additional safety fine‑tuning was performed. The base Gemma‑4 safety guarantees apply, but downstream users should add their own guardrails.
License
- This fine-tune: GPL‑3.0
- Base model: subject to the Gemma Terms of Use and Gemma Prohibited Use Policy. You must comply with both when using or redistributing this model.
- Training data: see the dataset card for terms.
Citation
If you use this model, please cite the base model, the Unsloth project, and the dataset:
@misc{gemma4_2025,
title = {Gemma 4},
author = {Google DeepMind},
year = {2025},
url = {https://ai.google.dev/gemma}
}
@misc{unsloth,
title = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
author = {Daniel Han and Michael Han and {Unsloth team}},
year = {2024-2026},
url = {https://github.com/unslothai/unsloth}
}
@misc{claude_reasoning_8k7,
title = {claude-opus-4.6-4.7-reasoning-8.7k},
author = {angrygiraffe},
year = {2026},
url = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}
Acknowledgements
- Google DeepMind — Gemma‑4 base model
- Unsloth team — Quant‑fixed checkpoint, training framework, and inference acceleration
- angrygiraffe — Reasoning distillation dataset
- Anthropic — Source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces