fix(chat_template): Emit multimodal placeholders in tool response content-parts
What does this PR do?
→ When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template only extracts the text parts and silently drops the image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:
Failed to apply prompt replacement for mm_items['image'][0]
→ Images in user messages work fine because the captured_content block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
→ vLLM PR: https://github.com/vllm-project/vllm/pull/41459
Fixed by?
→ After rendering the tool response text block, emit <|image|>, <|audio|>, and <|video|> placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the captured_content block later in the template.
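The logic of the fix can be sketched outside Jinja. The Python below is a minimal illustration and not the template itself: `render_tool_content` and `PLACEHOLDERS` are hypothetical names, and the `response:name{value:...}` serialization is simplified from the token dump posted later in this thread.

```python
# Hypothetical sketch of the fix's logic; the real change lives in chat_template.jinja.
PLACEHOLDERS = {"image": "<|image|>", "audio": "<|audio|>", "video": "<|video|>"}


def render_tool_content(name, content):
    """Render a tool response, then emit placeholders for multimodal parts."""
    if isinstance(content, str):
        # Text-only tool response: nothing extra to emit.
        text, media = content, []
    else:
        text = "".join(p["text"] for p in content if p.get("type") == "text")
        media = [p["type"] for p in content if p.get("type") in PLACEHOLDERS]
    # Simplified serialization (assumption), modeled on the dump in this thread.
    block = f'<|tool_response>response:{name}{{value:<|"|>{text}<|"|>}}<tool_response|>'
    # Before the fix this step was missing, so the media parts were dropped.
    return block + "".join(PLACEHOLDERS[m] for m in media)


parts = [{"type": "text", "text": "Image downloaded successfully."}, {"type": "image"}]
rendered = render_tool_content("download_image", parts)
print(rendered.endswith("<|image|>"))
```

A plain string passed as `content` takes the first branch, so text-only tool responses render without any placeholder.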
hey @harshaljanjani how are you doing? I'm sorry for the bad experience you are facing!
I tried to reproduce the token duplication with tools + text + image, but I couldn't.
could you please share:
- the transformers package version you are using
- a reproduction code (without exposing anything confidential or not public)
- confirmation that you are using the latest chat template, as in the official google/gemma-4-<size>-it
thanks a lot!
Hi @lucianommartins !
Yeah sure, I'm leaving a detailed repro below here, along with the other details you asked for :)
→ The transformers package version you are using: transformers==5.8.0.dev0 (installed from source)
→ The vLLM version you are using (just in case): vllm==0.19.1rc1.dev72+g7b9de7c89
→ Confirmed that I'm using the latest chat_template.jinja of the official google/gemma-4-26B-A4B-it
Repro
The full analysis (with local screenshots) is also in the vLLM PR description: vllm-project/vllm#41459
Step 1: Start the vLLM server
Without --chat-template, vLLM uses the upstream (buggy) template and the tool-image request returns 500. With the fixed template (from the PR branch), both requests succeed:
# Run from the vllm repo root on the PR branch
vllm serve google/gemma-4-E2B-it \
--port 8199 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--max-model-len 1024 \
--gpu-memory-utilization 0.95 # optional, may be needed on smaller GPUs (e.g. L4 24GB)
Step 2: Reproduction script
import requests
import sys

BASE_URL = "http://localhost:8199"
TINY_PNG_B64 = (
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAA"
    "jCB0C8AAAAASUVORK5CYII="
)
IMAGE_DATA_URL = f"data:image/png;base64,{TINY_PNG_B64}"


def get_model():
    resp = requests.get(f"{BASE_URL}/v1/models")
    return resp.json()["data"][0]["id"]


def send_chat(model, messages):
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json={
        "model": model,
        "messages": messages,
        "max_tokens": 32,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def main():
    try:
        model = get_model()
    except Exception as e:
        print(f"Could not connect to server: {e}")
        sys.exit(1)
    print(f"Using model: {model}")

    print("\nSending image in a user message...")
    try:
        reply = send_chat(model, [
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {"type": "image_url", "image_url": {"url": IMAGE_DATA_URL}},
            ]},
        ])
        print(f"Response: {reply[:120]}")
    except Exception as e:
        print(f"Failed: {e}")

    print("\nSending image in a tool message...")
    try:
        reply = send_chat(model, [
            {"role": "user", "content": "Download and describe the image."},
            {"role": "assistant", "content": "", "tool_calls": [
                {"id": "call_1", "type": "function",
                 "function": {"name": "download_image",
                              "arguments": '{"url": "https://example.com/x.png"}'}},
            ]},
            {"role": "tool", "tool_call_id": "call_1", "content": [
                {"type": "text", "text": "Image downloaded successfully."},
                {"type": "image_url", "image_url": {"url": IMAGE_DATA_URL}},
            ]},
        ])
        print(f"Response: {reply[:120]}")
    except Exception as e:
        print(f"Failed: {e}")


if __name__ == "__main__":
    main()
Expected behavior
Before the fix, images in user messages worked correctly, but those in tool messages triggered a 500 Internal Server Error. Following the update, both message types now return a 200 OK status!
Template-level verification (no server needed)
If you wish to minimally confirm the root cause, no GPU or vLLM server is required:
import jinja2

# Load the upstream template (from any Gemma4-it model repo)
with open("chat_template.jinja") as f:
    template = jinja2.Environment().from_string(f.read())

messages = [
    {"role": "user", "content": "Download and describe the image."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "download_image",
                      "arguments": '{"url": "https://example.com/x.png"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": [
        {"type": "text", "text": "Image downloaded successfully."},
        {"type": "image"},
    ]},
]

result = template.render(messages=messages, bos_token="<bos>", add_generation_prompt=True)
# False on upstream, True after fix
print("<|image|> present:", "<|image|>" in result)
Could you also please look into these Hub PRs? All four Gemma4-it presets ship the same buggy template.
→ gemma-4-31B-it Hub PR
→ gemma-4-26B-A4B-it Hub PR
→ gemma-4-E4B-it Hub PR
→ gemma-4-E2B-it Hub PR
Best,
Harshal
I can confirm a tool call response that passes in an image gives a 500 as described without this chat template fix.
Here are the dumped input and output tokens, as text, for the image tool response after the proposed chat template fix here:
!!! input:
<bos><|turn>user
Download and describe the image.<turn|>
<|turn>model
<|tool_call>call:download_image{url:<|"|>https://example.com/x.png<|"|>}<tool_call|><|tool_response>response:download_image{value:<|"|>Image downloaded successfully.<|"|>}<tool_response|>
<|image><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|i
mage|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><|image|><image|>
<turn|>
<|channel>thought
<channel|>
!!! output:
The image is a solid black square with no discernible features, patterns, or objects. It is entirely uniform in color and lacks any light or detail.<turn|>
However, I was under the impression that the OpenAI Chat Completions API only supports text tool responses. OpenAI's own docs say this explicitly, and https://github.com/openai/openai-python/blob/38d75d74a5626472cd7d1be9705ea8aba29a6b22/src/openai/types/chat/chat_completion_tool_message_param.py#L13-L15 only allows text content parts:
https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create#(resource)%20chat.completions%20%3E%20(model)%20chat_completion_tool_message_param%20%3E%20(schema)%20%3E%20(property)%20content%20%3E%20(variant)%201 - An array of content parts with a defined type. For tool messages, only type text is supported.
The Responses API does support multimodal tool responses. I haven't dug into exactly how we translate Responses API requests to chat template kwargs for models like Gemma 4, but the fact that Responses API allows this means it's probably reasonable to have the chat template handle this since we don't have a separate chat template just to handle Responses API specifics.
@harshaljanjani do you have a real-world inference client or app that uses multimodal tool responses and works with other models? In other words, is this more than a theoretical case that needs to be supported?
Is it weird that the tool response image is outside of the <|tool_response>...<tool_response|> tags? Intuitively, I'd expect the image to be inside those tags since the image is part of the tool response. And that would mean the chat template changes here are not quite right if we wanted the image inside that tool_response block.
But, I guess that depends on how the model was trained to handle multimodal tool responses (if it was).
Hi @bmbrowning , thanks for confirming the bug :)
To answer your questions:
→ Real-world use case? Yeah, the original issue (vllm-project/vllm#41452) was filed by a user who was likely hitting this in practice. In any case, agentic workflows where a tool fetches an image and returns it alongside text are a common, natural use case.
→ Image placement outside <|tool_response> tags: Actually this is consistent with how the template handles images in non-tool messages. <|image|> is always a standalone token. It has to be: format_tool_response_block serializes the response as structured text (response:name{value:...}), and <|image|> is a special token the vision encoder locates and replaces with embeddings. Putting it inside that serialization would break the format :)
→ Chat Completions spec: Fair point that the spec says text-only for tool responses. But in our case, vLLM already implements the Responses API, whose function_call_output accepts an array of image/file objects, not just strings, which indicates multimodal support (see also this community thread with a working example). And at the pipeline level, vLLM already parses and preserves multimodal content in tool messages, so having the template match what the rest of the pipeline supports felt like the right call.
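The placement argument can be illustrated with a toy version of the processor's replacement step. This sketch assumes one standalone `<|image|>` placeholder per image, each expanded into `<|image>`, a run of `<|image|>` soft tokens, and `<image|>`, as in the token dump above. `expand_image_placeholders` and `tokens_per_image` are hypothetical, and real processors (including vLLM's) are considerably more involved.

```python
def expand_image_placeholders(prompt, num_images, tokens_per_image=4):
    """Replace each standalone <|image|> with a begin/soft-tokens/end run."""
    found = prompt.count("<|image|>")
    if found != num_images:
        # Loosely analogous to vLLM's "Failed to apply prompt replacement" error.
        raise ValueError(f"expected {num_images} image placeholder(s), found {found}")
    expanded = "<|image>" + "<|image|>" * tokens_per_image + "<image|>"
    # str.replace does not rescan its own output, so the run is not re-expanded.
    return prompt.replace("<|image|>", expanded)


# Fixed template: the placeholder follows the tool response block.
fixed_prompt = "<tool_response|>\n<|image|>"
print(expand_image_placeholders(fixed_prompt, num_images=1, tokens_per_image=2))

# Upstream template: the placeholder was dropped, so the lookup fails.
try:
    expand_image_placeholders("<tool_response|>", num_images=1)
except ValueError as err:
    print(err)
```

Because the lookup is a plain token match, the placeholder has to survive rendering as its own token. Where it sits relative to the serialized tool response is a separate question that depends on how the model was trained.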
@harshaljanjani thanks for all the digging on that.
@bmbrowning - I understand that we won't have issues with vLLM, but do you foresee issues with other consumers, e.g. Transformers?
@osanseviero - I think we need to check this with other partners using this chat_template, and potentially against the Responses API spec. vLLM overcomes the limitation with custom code; the others may not.
@harshaljanjani @bmbrowning - do you see a way to fix the tool-call limitations with multimodal content while still complying with the Responses API spec?
@lucianommartins No worries :)
Regarding the compliance point, I think the fix should already be compliant with both specs, given that it's purely additive and doesn't change the text-only path at all. If the tool content is a string (Chat Completions), the existing code should handle it without any changes. The fix only applies when content is a list with non-text parts (Responses API). It's backward compatible with Chat Completions and forward compatible with the Responses API afaik, but I'm happy to hear if there's another consideration you were looking for (cc: @bmbrowning ).
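The "purely additive" claim can be phrased as a small regression check. The two functions below are hypothetical stand-ins for the template before and after the fix (they are not the Jinja code), and they only model which parts of the tool content end up in the output.

```python
MEDIA = {"image": "<|image|>", "audio": "<|audio|>", "video": "<|video|>"}


def text_of(content):
    """Text carried by a tool message, for both string and list content."""
    if isinstance(content, str):
        return content
    return "".join(p["text"] for p in content if p.get("type") == "text")


def render_before(content):
    # Upstream behavior as described in the PR: text parts only.
    return text_of(content)


def render_after(content):
    # Fixed behavior: same text, plus one placeholder per multimodal part.
    if isinstance(content, str):
        return text_of(content)
    extra = "".join(MEDIA[p["type"]] for p in content if p.get("type") in MEDIA)
    return text_of(content) + extra


# Text-only inputs (string or list) render identically before and after.
text_only = ["done", [{"type": "text", "text": "done"}]]
print(all(render_before(c) == render_after(c) for c in text_only))

# Only content with non-text parts changes.
mixed = [{"type": "text", "text": "done"}, {"type": "image"}]
print(render_before(mixed), "->", render_after(mixed))
```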
Also as mentioned in the PR desc, I had previously run the tests on the Transformers end too and there are no breakages. The Transformers Gemma 4 tests there currently use a hardcoded template and not the actual chat template, so for testing I temporarily swapped it out with the updated/fixed template (feel free to repro) and all the tests pass as shown in the screenshot.
Transformers env:
- transformers version: 5.8.0.dev0
- Platform: Linux-6.8.0-1053-gcp-x86_64-with-glibc2.35
- Python version: 3.12.13
- huggingface_hub version: 1.14.0
- safetensors version: 0.7.0
- accelerate version: 1.13.0
- PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
- Using distributed or parallel setup in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA L4
Thank you for the fixes, merged!


