Instructions for using glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with libraries, inference servers, and local apps.

Libraries

How to use glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with Transformers:

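```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

# The prompt mixes an image and text, so use the image-text-to-text task
pipe = pipeline(
    "image-text-to-text",
    model="glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    torch_dtype=torch.bfloat16,  # keep the bf16 weights instead of upcasting to fp32
    device_map="auto",           # place layers on the available GPUs/CPU
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
print(pipe(text=messages))
```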

```python
# Load model directly
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep the bf16 weights instead of upcasting to fp32
    device_map="auto",           # place layers on the available GPUs/CPU
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (drop the prompt)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```

Use Docker

```bash
docker model run hf.co/glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled
```

SGLang

How to use glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```
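Both the vLLM and SGLang servers expose the same OpenAI-compatible `/v1/chat/completions` route, so the curl calls can also be made from Python using only the standard library. This is a minimal sketch: `build_chat_request` and `chat` are hypothetical helper names, and the optional `enable_thinking` switch assumes the server forwards a `chat_template_kwargs` field to the chat template, which you should confirm for your vLLM or SGLang version.

```python
import json
import urllib.request

MODEL_ID = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"


def build_chat_request(prompt, base_url="http://localhost:8000", enable_thinking=None):
    """Return (url, body) for an OpenAI-compatible chat completion request.

    base_url: http://localhost:8000 matches the vLLM example, http://localhost:30000 SGLang.
    enable_thinking: ASSUMPTION, sent as `chat_template_kwargs`; the server must pass
    it through to the chat template. Omitted entirely when None.
    """
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    if enable_thinking is not None:
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return base_url.rstrip("/") + "/v1/chat/completions", body


def chat(prompt, base_url="http://localhost:8000", enable_thinking=None, timeout=600):
    """POST the request and return the first choice's message content."""
    url, body = build_chat_request(prompt, base_url, enable_thinking)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

For example, `print(chat("What is the capital of France?", base_url="http://localhost:30000"))` targets the SGLang server; the default base URL matches the vLLM example.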

Unsloth Studio

How to use glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled to start chatting
```

Install Unsloth Studio (Windows)

```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled to start chatting
```

Using Hugging Face Spaces for Unsloth

No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled to start chatting.

Load model with FastModel

```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    max_seq_length=2048,
)
```

Docker Model Runner
How to use glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled with Docker Model Runner:
```
docker model run hf.co/glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled
```

Gemma-4-26B-A4B-IT — Claude Opus 4.6/4.7 Reasoning Fine-tune (Unsloth)

This is a fine-tune of google/gemma-4-26B-A4B-it (via the Unsloth-fixed checkpoint unsloth/gemma-4-26b-a4b-it) trained on angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.

The goal of this fine-tune is to strengthen multi-step reasoning, planning, and self-reflection on top of Gemma‑4's native <|channel>thought reasoning channel, while preserving its multimodal (text + image + audio + video), tool-calling, and long-context capabilities.

Trained with Unsloth, which advertises roughly 2× faster training and lower VRAM use with no loss of accuracy.

Model Summary

| Property | Details |
|---|---|
| Base model | `unsloth/gemma-4-26b-a4b-it` (`google/gemma-4-26B-A4B-it`) |
| Architecture | `Gemma4ForConditionalGeneration` (Mixture‑of‑Experts, multimodal) |
| Model type | `gemma4` |
| Total parameters | ~26 B |
| Active parameters / token | ~4 B (MoE: 128 experts, top‑8 routing) |
| Modalities | Text, Image, Audio, Video (inputs) → Text (output) |
| Max context length | 262,144 tokens (262K) |
| Sliding window | 1,024 tokens (every 6th layer is full attention) |
| Vocab size | 262,144 |
| Tensor dtype | `bfloat16` |
| Tokenizer | Gemma‑4 SentencePiece (multimodal special tokens for `< |
| Chat template | Gemma‑4 conversational template with `< |
| Training framework | Unsloth `2026.5.7` |
| Fine-tuning dataset | `angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k` |
| License | GPL‑3.0 (this fine‑tune); base model under Gemma Terms of Use |

Architecture Details

Text backbone (`gemma4_text`):
- 30 hidden layers, hidden size 2816, intermediate size 2112
- 16 attention heads, 8 KV heads, head dim 256, global head dim 512
- MoE blocks: 128 experts, top‑8 routing, MoE intermediate size 704
- Hybrid attention pattern: 5× sliding (window = 1024) + 1× full attention, repeated
- Final‑logit softcap = 30.0, RMSNorm ε = 1e‑6
- RoPE: θ = 1e6 (full attention, partial rotary 0.25), θ = 1e4 (sliding)
- Tied input/output embeddings

Vision tower (`gemma4_vision`): 27 layers, hidden size 1152, 16 heads, patch size 16, 280 soft tokens per image, pooling kernel 3

Audio tower (`Gemma4AudioFeatureExtractor`): 16 kHz, 128 mel bins, 40 ms per token, up to 750 audio tokens

Video processor: 32 sampled frames, at most 70 soft tokens per frame
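The hybrid pattern matters for memory at long context: sliding-window layers only cache the most recent 1,024 tokens, while full-attention layers cache the whole sequence. Below is a rough sketch of the layer schedule and a per-sequence KV-cache estimate. It assumes every layer keeps 8 KV heads, full-attention layers use the 512 global head dim, the cache is bf16, and no KV sharing or cache quantization is applied. The real implementation may differ (for example in the number of KV heads on the global layers), and the helper names are hypothetical.

```python
NUM_LAYERS = 30        # "30 hidden layers"
PERIOD = 6             # 5 sliding-window layers followed by 1 full-attention layer
SLIDING_WINDOW = 1024  # tokens a sliding-window layer can attend to (and must cache)


def layer_types(num_layers=NUM_LAYERS, period=PERIOD):
    """Return 'sliding' or 'full' for each 0-based layer index."""
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def kv_cache_bytes(context_len, kv_heads=8, sliding_head_dim=256,
                   full_head_dim=512, bytes_per_value=2):
    """Rough KV-cache size for one sequence.

    ASSUMPTIONS: 8 KV heads on every layer, head dim 512 on full-attention layers
    and 256 on sliding layers, bf16 cache (2 bytes), no KV sharing or quantization.
    """
    total = 0
    for kind in layer_types():
        if kind == "sliding":
            tokens, head_dim = min(context_len, SLIDING_WINDOW), sliding_head_dim
        else:
            tokens, head_dim = context_len, full_head_dim
        # Keys and values: two tensors of tokens x kv_heads x head_dim values each
        total += 2 * tokens * kv_heads * head_dim * bytes_per_value
    return total


print([i for i, t in enumerate(layer_types()) if t == "full"])  # [5, 11, 17, 23, 29]
print(f"{kv_cache_bytes(262_144) / 1e9:.1f} GB at 262K context")  # 21.7 GB at 262K context
```

Under these assumptions, almost all of the estimated ~21.7 GB for a single 262K-token sequence sits in the five full-attention layers; the 25 sliding layers contribute about 0.2 GB.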

Training

| Property | Details |
|---|---|
| Method | Supervised Fine‑Tuning (SFT) on reasoning traces |
| Device | 1× NVIDIA DGX Spark |
| Framework | Unsloth + 🤗 Transformers / TRL |
| Precision | bf16 |
| Dataset size | ~8,700 multi‑turn reasoning examples |
| Dataset source | Reasoning rollouts distilled from Claude Opus 4.6 / 4.7 |
| Reasoning format | Preserves Gemma‑4's native `< |

The training corpus emphasizes:

- Long, structured chain‑of‑thought reasoning
- Math, code, logic and step‑wise problem decomposition
- Self‑verification and answer revision patterns
- Instruction following with explicit thinking → answer separation

Because the reasoning data is distilled from Anthropic's Claude models, outputs may reflect Claude's stylistic patterns (e.g. hedged tone, explicit step labels, "Let me think…" preambles); account for this if your application needs a different voice.

Intended Use

Primary use cases

- Reasoning‑heavy assistants (math, coding, agentic planning)
- Multimodal Q&A over images / audio / video
- Long‑context (up to 262K) summarization, retrieval, and document analysis
- Tool‑calling / function‑calling agents (native in chat template)
- Research on MoE + multimodal reasoning distillation

Out‑of‑scope / not recommended

- High‑stakes decisions (medical, legal, financial advice without human review)
- Generation of disallowed content under the Gemma Prohibited Use Policy
- Safety‑critical autonomous deployments without guardrails

How to Use

Install

```bash
pip install -U transformers accelerate
# Optional (recommended) for faster inference / fine-tuning:
pip install -U unsloth
```

Requires a recent transformers build with gemma4 model support.
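To check whether the installed `transformers` already recognizes the `gemma4` model type before downloading ~51.6 GB of weights, you can look it up in the library's config registry. A small sketch, assuming `CONFIG_MAPPING` (exported by `transformers`) lists registered model types; `has_gemma4_support` is a hypothetical helper:

```python
def has_gemma4_support():
    """Return True if the installed transformers registers the `gemma4` model type."""
    try:
        # CONFIG_MAPPING maps model_type strings (e.g. "gemma4") to config classes
        from transformers import CONFIG_MAPPING
    except ImportError:
        return False  # transformers is not installed
    return "gemma4" in CONFIG_MAPPING


print(has_gemma4_support())
```

If this prints `False`, upgrade with `pip install -U transformers` (or install from source) before loading the model.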

Text generation (Transformers)

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Multimodal processors expect content as a list of typed parts, even for text-only turns
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a careful, step-by-step reasoner."}]},
    {"role": "user", "content": [{"type": "text", "text": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,         # turn on the <|channel>thought block
    tokenize=True,                # return token ids rather than a prompt string
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
# Keep special tokens so the thought-channel delimiters remain visible
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```
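The example prompt has a single correct answer, so a reference value is easy to compute when checking the model's final answer. A minimal standard-library sketch; `arrival_time` is a hypothetical helper:

```python
from datetime import datetime, timedelta


def arrival_time(departure="9:15", hours=2, minutes=47):
    """Reference answer for the train prompt: departure time plus travel time, as HH:MM."""
    start = datetime.strptime(departure, "%H:%M")
    return (start + timedelta(hours=hours, minutes=minutes)).strftime("%H:%M")


print(arrival_time())  # 12:02
```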

Multimodal (image + text)

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},
            {"type": "text",  "text": "Explain what this diagram shows and reason about any inconsistencies."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,
    tokenize=True,                # return token ids rather than a prompt string
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# skip_special_tokens=True strips special tokens from the printed text
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The processor also accepts `{"type": "audio", ...}` and `{"type": "video", ...}` content parts (see `processor_config.json`).
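The per-modality figures under Architecture Details give a rough sense of how much of the context window media consumes. A back-of-the-envelope sketch; the helper names are hypothetical, and the counts are soft tokens only (they ignore placeholder and boundary tokens as well as the processor's actual resizing and frame-sampling logic):

```python
import math

# Per-modality costs quoted in Architecture Details
IMAGE_TOKENS = 280           # soft tokens per image
AUDIO_MS_PER_TOKEN = 40      # one audio token per 40 ms
MAX_AUDIO_TOKENS = 750       # cap per audio clip
VIDEO_MAX_FRAMES = 32        # frames sampled per video
VIDEO_TOKENS_PER_FRAME = 70  # upper bound per frame


def audio_tokens(seconds):
    """Audio tokens for a clip of the given length, capped at MAX_AUDIO_TOKENS."""
    return min(math.ceil(seconds * 1000 / AUDIO_MS_PER_TOKEN), MAX_AUDIO_TOKENS)


def video_tokens(frames):
    """Upper-bound soft tokens for a video, given how many frames it has."""
    return min(frames, VIDEO_MAX_FRAMES) * VIDEO_TOKENS_PER_FRAME


def estimate_media_tokens(num_images=0, audio_seconds=(), video_frames=()):
    """Upper-bound soft-token cost of the media parts of one request."""
    return (num_images * IMAGE_TOKENS
            + sum(audio_tokens(s) for s in audio_seconds)
            + sum(video_tokens(f) for f in video_frames))


print(estimate_media_tokens(num_images=2, audio_seconds=[10], video_frames=[100]))  # 3050
```

At 40 ms per token, the 750-token audio cap corresponds to 30 s of audio.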

Tool calling

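The chat template natively supports OpenAI‑style tools and Gemma‑native `tool_calls` / `tool_responses`. Pass tools via `apply_chat_template(..., tools=[...])`: the template renders the declarations as `<|tool>…<tool|>` blocks and formats tool calls already present in the conversation as `<|tool_call>…<tool_call|>` blocks. Extracting new tool calls from generated text is left to your own code or to the serving framework.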
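To illustrate the flow, here is a sketch with a hypothetical `get_weather` tool described as an OpenAI-style JSON schema, plus a helper that pulls raw `<|tool_call>…<tool_call|>` payloads out of text decoded with `skip_special_tokens=False`. The payload format between the delimiters is not specified in this card, so the helper returns each payload as an opaque string for your own parser.

```python
import re

# OpenAI-style tool definition; `get_weather` is a hypothetical tool for illustration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

# Delimiters as documented in this card; the text between them is treated as opaque
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)


def extract_tool_calls(decoded_text):
    """Return the stripped raw payload of every tool-call block, in order of appearance."""
    return [payload.strip() for payload in TOOL_CALL_RE.findall(decoded_text)]
```

In use, pass `tools=[WEATHER_TOOL]` to `processor.apply_chat_template(...)` together with the arguments from the text-generation example, then run `extract_tool_calls` on the decoded output.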

Faster inference with Unsloth

```python
from unsloth import FastModel

model, processor = FastModel.from_pretrained(
    "glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled",
    max_seq_length=8192,
    load_in_4bit=False,  # bf16 here; set True for 4-bit
    dtype=None,          # None lets Unsloth choose the dtype automatically
)
```
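Whether `load_in_4bit` is needed depends mostly on weight memory. Because this is a Mixture-of-Experts model, all ~26 B parameters must be resident even though only ~4 B are active per token (unless experts are offloaded). A back-of-the-envelope sketch with a hypothetical helper `weight_memory_gb`:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in decimal GB.

    Ignores KV cache, activations, framework overhead, and the scale/zero-point
    storage that real 4-bit formats add, so treat the result as a lower bound.
    """
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # "~26 B" total parameters from the Model Summary

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
# bf16: ~52 GB of weights
# 4-bit: ~13 GB of weights
```

The bf16 estimate lines up with the ~51.6 GB of shards listed under Files (about 25.8 B parameters at 2 bytes each).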

Recommended sampling

From `generation_config.json`:

| Param | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `do_sample` | true |
| `eos_token_id` | `[1, 106, 50]` |
| `pad_token_id` | 0 |
| `bos_token_id` | 2 |

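For more deterministic reasoning, lower `temperature` to about 0.3–0.6 while keeping sampling on, or set `do_sample=False` for fully deterministic greedy decoding, in which case `temperature`, `top_p`, and `top_k` are ignored.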
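To make the table concrete, the sketch below shows what `temperature`, `top_k`, and `top_p` do to a single next-token distribution, and how the three `eos_token_id` values end a generated sequence. It is a pure-Python illustration that applies temperature, then top-k, then top-p (a common ordering); it is not the Transformers implementation, and the function names are hypothetical.

```python
import math
import random

EOS_IDS = {1, 106, 50}  # eos_token_id values from generation_config.json


def sample_next_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Sample a token id from raw logits: temperature, then top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("use greedy_next_token (do_sample=False) instead of temperature <= 0")
    # Temperature: divide logits; values below 1 sharpen the distribution
    scaled = [value / temperature for value in logits]
    # Top-k: keep the k highest-scoring token ids, best first
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability)
    best = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - best) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    ids, kept_probs = zip(*kept)
    return rng.choices(ids, weights=kept_probs, k=1)[0]


def greedy_next_token(logits):
    """do_sample=False: always take the argmax; temperature/top_k/top_p have no effect."""
    return max(range(len(logits)), key=lambda i: logits[i])


def truncate_at_eos(token_ids, eos_ids=EOS_IDS):
    """Cut a generated id sequence at the first stop token (the stop token is dropped)."""
    for position, token_id in enumerate(token_ids):
        if token_id in eos_ids:
            return token_ids[:position]
    return token_ids


print(greedy_next_token([2.0, 1.0, 0.5, -1.0]))  # 0
```

Passing a seeded `random.Random(...)` as `rng` makes the sampled sequence reproducible, which is convenient when comparing settings.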

Chat Template & Reasoning Channel

This model uses Gemma‑4's structured chat template (`chat_template.jinja`) with:

- Role turns: `<|turn>` followed by the role (`system`, `user`, or `model`), closed by `<turn|>`
- Thinking channel: `<|channel>thought ... <channel|>` (gated by `enable_thinking=True`)
- Tool declarations: `<|tool>…<tool|>`
- Tool calls / responses: `<|tool_call>…<tool_call|>` / `<|tool_response>…<tool_response|>`
- Multimodal placeholders: `<|image|>`, `<|audio|>`, `<|video|>`

When `add_generation_prompt=True` is used without `enable_thinking`, the template emits an empty `<|channel>thought<channel|>` block, which suppresses reasoning. Pass `enable_thinking=True` to enable the model's full chain‑of‑thought.
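When decoding with `skip_special_tokens=False`, as in the text-generation example, the reasoning and the final answer arrive in one string. A minimal sketch for separating them, assuming the decoded text contains the `<|channel>thought` and `<channel|>` delimiters literally and may end with `<turn|>`; `split_thought` is a hypothetical helper, not part of the processor API.

```python
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"
TURN_CLOSE = "<turn|>"


def split_thought(decoded):
    """Split decoded model output into (thought, answer).

    ASSUMPTION: the text was decoded with skip_special_tokens=False, so the
    channel delimiters are still present. Without a complete thinking block,
    the whole text is returned as the answer.
    """
    start = decoded.find(THOUGHT_OPEN)
    end = decoded.find(THOUGHT_CLOSE, start + len(THOUGHT_OPEN)) if start != -1 else -1
    if start == -1 or end == -1:
        thought, answer = "", decoded
    else:
        thought = decoded[start + len(THOUGHT_OPEN):end]
        answer = decoded[:start] + decoded[end + len(THOUGHT_CLOSE):]
    # Drop a trailing end-of-turn marker if the decoder kept it
    answer = answer.strip()
    if answer.endswith(TURN_CLOSE):
        answer = answer[:-len(TURN_CLOSE)]
    return thought.strip(), answer.strip()


print(split_thought("<|channel>thought\nCheck the minutes.<channel|>12:02<turn|>"))
# ('Check the minutes.', '12:02')
```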

Files

| File | Purpose |
|---|---|
| `config.json` | Full model config (text + vision + audio sub‑configs) |
| `generation_config.json` | Default sampling parameters |
| `processor_config.json` | Image / audio / video processor settings |
| `tokenizer.json`, `tokenizer_config.json` | Gemma‑4 multimodal tokenizer |
| `chat_template.jinja` | Conversational + tool‑calling + reasoning template |
| `model-00001-of-00002.safetensors`, `model-00002-of-00002.safetensors` | bf16 weights (~51.6 GB total) |
| `model.safetensors.index.json` | Sharding index |
| `export_metadata.json` | Export provenance |

Limitations & Biases

- Hallucinations: Like all LLMs, this model can produce confident but incorrect answers, particularly outside its training distribution.
- Reasoning style transfer: Because the SFT data is distilled from Claude, Claude's stylistic and refusal patterns may leak into outputs.
- Dataset size: ~8.7k examples is small; expect targeted improvements in reasoning style rather than a broad capability uplift over the base model.
- Multimodal grounding: Vision/audio/video capabilities are inherited from the base model and were not specifically targeted by this fine‑tune.
- Safety: No additional safety fine‑tuning was performed. The safety behavior inherited from the base Gemma‑4 model still applies, but downstream users should add their own guardrails.

License

- This fine-tune: GPL‑3.0
- Base model: subject to the Gemma Terms of Use and Gemma Prohibited Use Policy. You must comply with both when using or redistributing this model.
- Training data: see the dataset card for terms.

Citation

If you use this model, please cite the base model, the Unsloth project, and the dataset:

```bibtex
@misc{gemma4_2025,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://ai.google.dev/gemma}
}

@misc{unsloth,
  title  = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  year   = {2024-2026},
  url    = {https://github.com/unslothai/unsloth}
}

@misc{claude_reasoning_8k7,
  title  = {claude-opus-4.6-4.7-reasoning-8.7k},
  author = {angrygiraffe},
  year   = {2026},
  url    = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}
```

Acknowledgements

- Google DeepMind: Gemma‑4 base model
- Unsloth team: quant‑fixed checkpoint, training framework, and inference acceleration
- angrygiraffe: reasoning distillation dataset
- Anthropic: source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces


Safetensors model size: 27B params · Tensor type: BF16

Model tree for glyphsoftware/gemma-4-26b-a4b-opus-4.7-distilled: google/gemma-4-26B-A4B (base model) → google/gemma-4-26B-A4B-it → unsloth/gemma-4-26B-A4B-it → this model
