arxiv:2511.22663

Architecture Decoupling Is Not All You Need For Unified Multimodal Model

Published on Nov 27 · Submitted by Kaituo Feng on Dec 1
Abstract

AI-generated summary

The proposed Attention Interaction Alignment (AIA) loss improves cross-modal attention and performance in unified multimodal models for image generation and understanding without architecture decoupling.

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in establishing an optimal training paradigm despite the inherently conflicting objectives of the understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., dual image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent this behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
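Since the abstract describes AIA only at a high level, here is a minimal, hypothetical sketch of how an attention-interaction-alignment style auxiliary loss could be wired into training, assuming a PyTorch setup: the unified model's cross-modal attention mass is pushed toward a task-specific target pattern (for example, one measured from a more decoupled reference model). The function names, tensor shapes, and the MSE penalty are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of an attention-interaction-alignment style auxiliary loss.
# Names and shapes are assumptions; the paper's actual AIA formulation may differ.
import torch
import torch.nn.functional as F


def cross_modal_attention_mass(attn, query_idx, key_idx):
    """Average attention that a set of query tokens pays to a set of key tokens.

    attn: (num_heads, seq_len, seq_len) softmax attention map of one layer.
    query_idx / key_idx: index tensors selecting, e.g., image queries and text keys.
    """
    cross = attn[:, query_idx][:, :, key_idx]  # (heads, |Q|, |K|)
    return cross.sum(dim=-1).mean(dim=-1)      # (heads,) mass per head, in [0, 1]


def aia_loss(attn_maps, target_patterns, query_idx, key_idx):
    """Align per-layer, per-head cross-modal attention mass to a task-specific target.

    attn_maps: list of (heads, seq, seq) attention maps, one per layer.
    target_patterns: list of (heads,) target masses (e.g., from a reference model).
    """
    losses = []
    for attn, target in zip(attn_maps, target_patterns):
        mass = cross_modal_attention_mass(attn, query_idx, key_idx)
        losses.append(F.mse_loss(mass, target))
    return torch.stack(losses).mean()


# total_loss = task_loss + lambda_aia * aia_loss(attn_maps, targets, img_idx, txt_idx)
```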

Community


Very interesting idea! A few questions for my understanding:

  1. In Eq. (1), for the attention map $\mathrm{Attn}_{l}\left(h, q, k\right)$, I was assuming $q$ indexes the query tokens and $k$ indexes the key tokens, since they can have different numbers of tokens coming from different modalities. I am not sure what $N$ is, and why does the denominator not normalize over $K$ as well? Is this a typo, or done purposefully for some reason?
  2. I really liked your attempt at explaining why decoupled multimodal models are able to compete with task-specific (T2I-only or I2T-only) models; the hunch is that they treat these tasks as separate and do not generalize across them. However, the proposed "cross-modal interaction intensity" metric is not at all clear to me. Why should this metric, computed in this way, indicate whether or not these models are truly treating the tasks separately?
  3. Minor comment: is the color scheme in Figure 5 flipped?
Paper author

Hi, thank you for your attention and for pointing out our typos.

First, you are correct that $N$ is unnecessary; this was a writing error on our part. As for $K$, the formulation is correct. Taking image generation as an example, we aim to calculate the attention of each generated image token towards all text tokens, so we only need to sum these scores. Since the attention scores come from a softmax over all keys, each per-token sum already lies between 0 and 1, and no extra normalization over $K$ is needed.

Second, we believe that the conflict between understanding and generation tasks is reflected in the cross-modal interactions within the network. For instance, generation involves creating images based on text, while understanding involves generating text based on images. Thus, observing these cross-modal interaction patterns allows us to gain insights into their underlying mechanisms.

Finally, the color description in the caption of Figure 5 is indeed flipped. Thank you for pointing this out.
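As a concrete reading of the first point above, the following is a small sketch of the described computation; the tensor layout and function name are assumptions, not the paper's code. Because the softmax normalizes each query row over all keys, summing a generated image token's attention over the text keys already yields a value in [0, 1], so no additional 1/K factor is required.

```python
# Minimal sketch of the image-to-text attention sum discussed above (assumed shapes).
import torch


def image_to_text_attention(attn, image_idx, text_idx):
    """Sum of each generated image token's attention over all text tokens.

    attn: one layer's softmax attention map, shape (num_heads, seq_len, seq_len).
    image_idx / text_idx: index tensors selecting image-query and text-key positions.
    """
    cross = attn[:, image_idx][:, :, text_idx]  # (heads, num_image_tokens, num_text_tokens)
    # Each query row of a softmax sums to 1 over *all* keys, so the partial sum
    # over the text keys is already bounded in [0, 1]; no 1/K normalization needed.
    return cross.sum(dim=-1)                    # (heads, num_image_tokens)


# Toy check with random attention rows.
heads, seq = 4, 10
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
text_idx = torch.arange(0, 4)    # pretend the first 4 positions are text tokens
image_idx = torch.arange(4, 10)  # and the rest are generated image tokens
scores = image_to_text_attention(attn, image_idx, text_idx)
assert scores.min() >= 0 and scores.max() <= 1
```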
