fix(chat_template): Emit multimodal placeholders in tool response content-parts

#38

What does this PR do?

→ When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template only extracts text parts and silently drops image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:

Failed to apply prompt replacement for mm_items['image'][0]

→ Images in user messages work fine because the captured_content block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
→ vLLM PR: https://github.com/vllm-project/vllm/pull/41459
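
To illustrate the failure mode, here is a minimal sketch of the kind of consistency a multimodal processor relies on: one placeholder in the prompt per media part in the messages. This is not vLLM's actual code. The helper names (count_parts, placeholders_match) and the two shortened prompts are hypothetical and only stand in for real template output.

```python
def count_parts(messages, media_type):
    """Count content parts of one media type across all messages."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # string content carries no media parts
            total += sum(1 for part in content if part.get("type") == media_type)
    return total


def placeholders_match(prompt, messages, media_type="image"):
    """True when the prompt has exactly one placeholder per media part."""
    placeholder = f"<|{media_type}|>"
    return prompt.count(placeholder) == count_parts(messages, media_type)


messages = [
    {"role": "user", "content": "Download and describe the image."},
    {"role": "tool", "content": [
        {"type": "text", "text": "Image downloaded successfully."},
        {"type": "image"},
    ]},
]

# Hypothetical, shortened prompts; real template output has more structure.
prompt_without_placeholder = "Image downloaded successfully."
prompt_with_placeholder = "Image downloaded successfully.<|image|>"

print(placeholders_match(prompt_without_placeholder, messages))  # False
print(placeholders_match(prompt_with_placeholder, messages))     # True
```

When the count is off by one, as in the first prompt, there is no position in the prompt for the image item to be attached to.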

Fixed by?

→ After rendering the tool response text block, emit <|image|>, <|audio|>, and <|video|> placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the captured_content block later in the template.
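
The logic described above can be sketched in plain Python. This only mirrors the idea and is not the Jinja code from the template: render_tool_response is a hypothetical name, the way multiple text parts are joined is an assumption, and any whitespace the real template emits around placeholders is left out.

```python
PLACEHOLDERS = {"image": "<|image|>", "audio": "<|audio|>", "video": "<|video|>"}


def render_tool_response(tool_name, content):
    """Render the tool response text block, then append media placeholders."""
    if isinstance(content, str):
        # Plain string content: nothing multimodal to emit.
        text, media_types = content, []
    else:
        # Content-parts list: collect the text and remember media parts in order.
        text = "".join(p.get("text", "") for p in content if p.get("type") == "text")
        media_types = [p["type"] for p in content if p.get("type") in PLACEHOLDERS]
    block = f'<|tool_response>response:{tool_name}{{value:<|"|>{text}<|"|>}}<tool_response|>'
    # Placeholders go after the closed block, one per media part.
    return block + "".join(PLACEHOLDERS[t] for t in media_types)


rendered = render_tool_response("download_image", [
    {"type": "text", "text": "Image downloaded successfully."},
    {"type": "image"},
])
print(rendered)
# <|tool_response>response:download_image{value:<|"|>Image downloaded successfully.<|"|>}<tool_response|><|image|>
```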

hey @harshaljanjani how are you doing? I'm sorry for the bad experience you are facing!

I tried reproducing the token duplication with tools + text + image, but I couldn't.

could you please share:

  • the transformers package version you are using
  • reproduction code (without exposing anything confidential or not public)
  • confirmation that you are using the latest chat template, as in the official google/gemma-4-<size>-it

thanks a lot!

Hi @lucianommartins !

Yeah sure, I'm leaving a detailed repro below here, along with the other details you asked for :)

The transformers package version you are using: transformers==5.8.0.dev0 (installed from source)
The vLLM version you are using (just in case): vllm==0.19.1rc1.dev72+g7b9de7c89
→ Confirmed that I'm using the latest chat_template.jinja of the official google/gemma-4-26B-A4B-it

Repro

The full analysis (with local screenshots) is also in the vLLM PR description: vllm-project/vllm#41459

Step 1: Start the vLLM server

Without --chat-template, vLLM uses the upstream (buggy) template and the tool-image request returns 500. With the fixed template (from the PR branch), both requests succeed:

# Run from the vllm repo root on the PR branch
vllm serve google/gemma-4-E2B-it \
    --port 8199 \
    --chat-template examples/tool_chat_template_gemma4.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.95    # optional, may be needed on smaller GPUs (e.g. L4 24GB)

Step 2: Reproduction script

import requests
import sys

BASE_URL = "http://localhost:8199"

TINY_PNG_B64 = (
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAA"
    "jCB0C8AAAAASUVORK5CYII="
)
IMAGE_DATA_URL = f"data:image/png;base64,{TINY_PNG_B64}"

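def get_model():
    # Ask the server which model it is serving; fail fast on HTTP errors.
    resp = requests.get(f"{BASE_URL}/v1/models", timeout=30)
    resp.raise_for_status()
    return resp.json()["data"][0]["id"]

def send_chat(model, messages):
    # Send one chat completion request and return the assistant's text reply.
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json={
        "model": model,
        "messages": messages,
        "max_tokens": 32,
    }, timeout=300)
    # A 500 from the buggy template surfaces here as requests.HTTPError.
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]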

def main():
    try:
        model = get_model()
    except Exception as e:
        print(f"Could not connect to server: {e}")
        sys.exit(1)
    print(f"Using model: {model}")
    print("\nSending image in a user message...")
    try:
        reply = send_chat(model, [
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {"type": "image_url", "image_url": {"url": IMAGE_DATA_URL}},
            ]},
        ])
        print(f"Response: {reply[:120]}")
    except Exception as e:
        print(f"Failed: {e}")
    print("\nSending image in a tool message...")
    try:
        reply = send_chat(model, [
            {"role": "user", "content": "Download and describe the image."},
            {"role": "assistant", "content": "", "tool_calls": [
                {"id": "call_1", "type": "function",
                 "function": {"name": "download_image",
                              "arguments": '{"url": "https://example.com/x.png"}'}},
            ]},
            {"role": "tool", "tool_call_id": "call_1", "content": [
                {"type": "text", "text": "Image downloaded successfully."},
                {"type": "image_url", "image_url": {"url": IMAGE_DATA_URL}},
            ]},
        ])
        print(f"Response: {reply[:120]}")
    except Exception as e:
        print(f"Failed: {e}")

if __name__ == "__main__":
    main()

Expected behavior

Before the fix, images in user messages worked correctly, but those in tool messages triggered a 500 Internal Server Error. Following the update, both message types now return a 200 OK status!

Template-level verification (no server needed)

To confirm the root cause with a minimal check, no GPU or vLLM server is required:

import jinja2

# Load the upstream template (from any Gemma4-it model repo)
with open("chat_template.jinja") as f:
    template = jinja2.Environment().from_string(f.read())

messages = [
    {"role": "user", "content": "Download and describe the image."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "download_image",
                      "arguments": '{"url": "https://example.com/x.png"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": [
        {"type": "text", "text": "Image downloaded successfully."},
        {"type": "image"},
    ]},
]

result = template.render(messages=messages, bos_token="<bos>", add_generation_prompt=True)
# False on upstream, True after fix
print("<|image|> present:", "<|image|>" in result)

Could you also please look into these Hub PRs? All four Gemma4-it presets ship the same buggy template.

  • gemma-4-31B-it Hub PR
  • gemma-4-26B-A4B-it Hub PR
  • gemma-4-E4B-it Hub PR
  • gemma-4-E2B-it Hub PR

Best,
Harshal

I can confirm that, without this chat template fix, a tool call response that passes in an image gives a 500 as described.

Here are the dumped input and output tokens, as text, for the image tool response after the proposed chat template fix here:

!!! input:
<bos><|turn>user
Download and describe the image.<turn|>
<|turn>model
<|tool_call>call:download_image{url:<|"|>https://example.com/x.png<|"|>}<tool_call|><|tool_response>response:download_image{value:<|"|>Image downloaded successfully.<|"|>}<tool_response|>

<|image><|image|><|image|>[... run of <|image|> soft tokens elided ...]<|image|><image|>

<turn|>
<|channel>thought
<channel|>
!!! output:
The image is a solid black square with no discernible features, patterns, or objects. It is entirely uniform in color and lacks any light or detail.<turn|>

However, I was under the impression that the OpenAI Chat Completions API only supports text tool responses. OpenAI's own docs explicitly say this. See also https://github.com/openai/openai-python/blob/38d75d74a5626472cd7d1be9705ea8aba29a6b22/src/openai/types/chat/chat_completion_tool_message_param.py#L13-L15, where only text content parts are allowed.

The Responses API does support multimodal tool responses. I haven't dug into exactly how we translate Responses API requests to chat template kwargs for models like Gemma 4, but the fact that Responses API allows this means it's probably reasonable to have the chat template handle this since we don't have a separate chat template just to handle Responses API specifics.

@harshaljanjani do you have a real-world inference client or app that uses multimodal tool responses and works with other models? In other words, is this more than a theoretical case that needs to be supported?

Is it weird that the tool response image is outside of the <|tool_response>...<tool_response|> tags? Intuitively, I'd expect the image to be inside those tags since the image is part of the tool response. And that would mean the chat template changes here are not quite right if we wanted the image inside that tool_response block.

But, I guess that depends on how the model was trained to handle multimodal tool responses (if it was).

Hi @bmbrowning, thanks for confirming the bug :)
To answer your questions:
Real-world use case? Yeah, the original issue (vllm-project/vllm#41452) was filed by a user likely hitting this in practice. In any case, agentic workflows where a tool fetches an image and returns it alongside text are a common, natural use case.
Image placement outside <|tool_response> tags: Actually, this is consistent with how the template handles images in non-tool messages. <|image|> is always a standalone token. It has to be: format_tool_response_block serializes the response as structured text (response:name{value:...}), and <|image|> is a special token the vision encoder locates and replaces with embeddings. Putting it inside that serialization would break the format :)
Chat Completions spec: Fair point that the spec says text-only for tool responses. But in our case, vLLM already implements the Responses API, whose function_call_output accepts an array of image/file objects, not just strings, which indicates multimodal support (see also this community thread with a working example). And at the pipeline level, vLLM already parses and preserves multimodal content in tool messages, so having the template match what the rest of the pipeline supports felt like the right call.
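
As a rough illustration of why the placeholder has to stay a standalone token, here is a sketch of the expansion step that follows the shape of the dumped input earlier in this thread (<|image>, a run of <|image|>, then <image|>). It is not the actual processor code: expand_image_placeholders is a hypothetical name, and the soft-token count of 3 is chosen only to keep the output short.

```python
def expand_image_placeholders(prompt, tokens_per_image):
    """Replace each <|image|> with a delimited run of soft tokens.

    tokens_per_image holds one soft-token count per image item, in order.
    """
    placeholder = "<|image|>"
    pieces = prompt.split(placeholder)
    found = len(pieces) - 1
    if found != len(tokens_per_image):
        # Same class of failure as a missing placeholder in the rendered prompt.
        raise ValueError(
            f"prompt has {found} image placeholder(s) but "
            f"{len(tokens_per_image)} image item(s) were supplied"
        )
    out = [pieces[0]]
    for count, tail in zip(tokens_per_image, pieces[1:]):
        out.append("<|image>" + placeholder * count + "<image|>")
        out.append(tail)
    return "".join(out)


expanded = expand_image_placeholders("<tool_response|><|image|><turn|>", [3])
print(expanded)
# <tool_response|><|image><|image|><|image|><|image|><image|><turn|>
```

If the placeholder were buried inside the value:<|"|>...<|"|> serialization, this kind of expansion would inject the soft tokens into the middle of the structured text.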

@harshaljanjani thanks for all the digging on that.

@bmbrowning - I understand that we won't have issues with vLLM... but do you foresee issues with, e.g., Transformers?

@osanseviero - I think we need to check this with other partners using this chat_template, and potentially against the Responses API spec. vLLM works around the limitation with custom code; maybe the others don't.

@harshaljanjani @bmbrowning - do you see a way to fix the tool-call limitations with multimodal content while still complying with the Responses API spec?

@lucianommartins No worries :)
Regarding the compliance point, I think the fix should already be compliant with both specs, given it's purely additive and doesn't change the text-only path at all. If tool content is a string (Chat Completions), the existing code should handle it without any changes. The fix only applies when content is a list with non-text parts (Responses API). It's backward-compatible with Chat Completions and forward-compatible with the Responses API as far as I know, but I'm happy to hear if there's another consideration you were looking for (cc: @bmbrowning).
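
A small sketch of what "purely additive" means here, under the assumption that the only behavioral difference is the extra placeholder emission. The render function and its emit_media flag are hypothetical stand-ins for the template before and after the fix, and the tool_response wrapper is omitted for brevity.

```python
MEDIA = {"image": "<|image|>", "audio": "<|audio|>", "video": "<|video|>"}


def render(content, emit_media):
    """Render tool content; emit_media=False models the pre-fix behavior."""
    if isinstance(content, str):
        return content
    text = "".join(p.get("text", "") for p in content if p.get("type") == "text")
    if not emit_media:
        return text
    return text + "".join(MEDIA[p["type"]] for p in content if p.get("type") in MEDIA)


# Text-only inputs must render identically with and without the fix.
text_only_inputs = ["plain string result", [{"type": "text", "text": "ok"}], []]
for content in text_only_inputs:
    assert render(content, emit_media=False) == render(content, emit_media=True)

# Only content lists with non-text parts change.
mixed = [{"type": "text", "text": "ok"}, {"type": "image"}]
print(render(mixed, emit_media=False))  # ok
print(render(mixed, emit_media=True))   # ok<|image|>
```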
Also, as mentioned in the PR description, I had previously run the tests on the Transformers end too, and there are no breakages. The Transformers Gemma 4 tests currently use a hardcoded template rather than the actual chat template, so for testing I temporarily swapped it out for the fixed template (feel free to repro), and all the tests pass as shown in the screenshot.

Transformers env:

  • transformers version: 5.8.0.dev0
  • Platform: Linux-6.8.0-1053-gcp-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • huggingface_hub version: 1.14.0
  • safetensors version: 0.7.0
  • accelerate version: 1.13.0
  • PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
  • Using distributed or parallel setup in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA L4

[Screenshot: Transformers Gemma 4 tests passing with the fixed template]

osanseviero changed pull request status to merged
Thank you for the fixes, merged!