---
license: cc-by-nc-4.0
task_categories:
- image-to-image
tags:
- image-editing
---

# [ICLR 2026] Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

[Ziyun Zeng](https://stdkonjac.icu/), [David Junhao Zhang](https://junhaozhang98.github.io/), Wei Li, and [Mike Zheng Shou](https://cde.nus.edu.sg/ece/staff/shou-zheng-mike/)

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg?logo=arxiv)](https://arxiv.org/abs/2509.01986)
[![Project Page](https://img.shields.io/badge/Website-Project%20Page-green?logo=googlechrome&logoColor=white)](https://showlab.github.io/DIM/)
[![Code](https://img.shields.io/badge/Code-GitHub%20Repo-blue?logo=github)](https://github.com/showlab/DIM)
[![Hugging Face Datasets](https://img.shields.io/badge/🤗%20Dataset-DIM--Edit-orange.svg)](https://huggingface.co/datasets/stdKonjac/DIM-Edit)
[![Hugging Face Datasets](https://img.shields.io/badge/🤗%20Dataset-DIM--T2I-orange.svg)](https://huggingface.co/datasets/stdKonjac/DIM-T2I)
[![Hugging Face Models](https://img.shields.io/badge/🤗%20Model-DIM--4.6B--Edit-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-Edit)
[![Hugging Face Models](https://img.shields.io/badge/🤗%20Model-DIM--4.6B--T2I-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-T2I)

![DIM-Edit](assets/dim_edit.png)

## 📰 News

- **`[2026-05-12]`** The **DIM** [project page](https://showlab.github.io/DIM/) is available.
- **`[2026-01-26]`** 🎉 **DIM** is accepted to **ICLR 2026**!
- **`[2025-10-08]`** 🚀 Released the **DIM-Edit** dataset and the **DIM-4.6B-T2I** / **DIM-4.6B-Edit** models.
- **`[2025-09-02]`** 📝 The **DIM** paper is released on arXiv.

## 🌟 Highlights

- 🧠 **Rebalanced architecture**: Let the understanding module be the *designer*, while the generation module focuses on *painting*.
- 📚 **Two complementary datasets**: **DIM-T2I** (long-context T2I pairs) and **DIM-Edit** (CoT imaginations from GPT-4o).
- ⚡ **Lightweight & efficient**: A ❄️frozen 3.0B VLM and a 🔥trainable 1.6B DiT connected via a single MLP (4.6B params in total).
- 🏆 **SOTA-competitive**: DIM-4.6B-Edit matches or surpasses much larger models on **ImgEdit** and **GEdit-Bench**.

## 💡 Introduction

Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation arises from an *imbalanced division of responsibilities*. The understanding module is usually treated as a translator that encodes instructions into conditions, while the generation module must act as both **designer** and **painter**. The result is that the generation module carries too much responsibility, even though it is not optimized for complex reasoning.

To address this, we introduce **Draw-In-Mind (DIM)**, a dataset with two complementary parts:

- 🖼️ **DIM-T2I**: Millions of long-context image–text pairs that strengthen instruction comprehension.
- ✏️ **DIM-Edit**: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.

We connect a frozen **Qwen2.5-VL-3B** with a trainable **SANA1.5-1.6B** via a lightweight MLP, forming **DIM-4.6B-T2I/Edit**. With this setup, the understanding module takes on the *designer responsibility*, while the generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on **ImgEdit** and **GEdit-Bench**, outperforming much larger models.

## 📊 Performance
### 📈 GenEval & MJHQ-30K

> † denotes using an LLM rewriter. For MJHQ(-30K), we report FID (lower is better).
> GenEval columns: Sin. = single object, Two = two objects, CT. = counting, Pos. = position, Attr. = color attribution.
> In the Params column, ❄️ marks frozen parameters and 🔥 marks trainable parameters.

| Model             |      Params      | Sin. | Two  | CT.  | Colors | Pos. | Attr. | Overall |   MJHQ   |
|-------------------|:----------------:|:----:|:----:|:----:|:------:|:----:|:-----:|:-------:|:--------:|
| Gen. Only         |                  |      |      |      |        |      |       |         |          |
| PixArt-α          |      0.6B🔥      | 0.98 | 0.50 | 0.44 |  0.80  | 0.08 | 0.07  |  0.48   |   6.14   |
| SDXL              |      2.6B🔥      | 0.98 | 0.74 | 0.39 |  0.85  | 0.15 | 0.23  |  0.55   |   8.76   |
| DALL-E·3          |        -         | 0.96 | 0.87 | 0.47 |  0.83  | 0.43 | 0.45  |  0.67   |    -     |
| SD3-Medium        |      2.0B🔥      | 0.99 | 0.94 | 0.72 |  0.89  | 0.33 | 0.60  |  0.74   |  11.92   |
| Unified           |                  |      |      |      |        |      |       |         |          |
| Janus             |      1.3B🔥      | 0.97 | 0.68 | 0.30 |  0.84  | 0.46 | 0.42  |  0.61   |  10.10   |
| Emu3-Gen†         |      8.0B🔥      | 0.99 | 0.81 | 0.42 |  0.80  | 0.49 | 0.45  |  0.66   |    -     |
| Show-o            |      1.3B🔥      | 0.98 | 0.80 | 0.66 |  0.84  | 0.31 | 0.50  |  0.68   |  15.18   |
| Show-o2-7B        |      7.0B🔥      | 1.00 | 0.87 | 0.58 |  0.92  | 0.52 | 0.62  |  0.76   |    -     |
| Janus-Pro-7B      |      7.0B🔥      | 0.99 | 0.89 | 0.59 |  0.90  | 0.79 | 0.66  |  0.80   |  13.48   |
| BAGEL             |     14.0B🔥      | 0.99 | 0.94 | 0.81 |  0.88  | 0.64 | 0.63  |  0.82   |    -     |
| MetaQuery-L†      | 3.0B❄️ \| 3.2B🔥 |  -   |  -   |  -   |   -    |  -   |   -   |  0.78   |   6.35   |
| **DIM-4.6B-T2I†** | 3.0B❄️ \| 1.6B🔥 | 0.99 | 0.89 | 0.63 |  0.86  | 0.62 | 0.61  |  0.77   | **5.50** |
### 🖌️ ImgEdit Overall

> Q3/7B indicates using Qwen2.5-VL-3/7B as the external designer during inference. By default, GPT-4o is employed as the external designer to ensure the best performance. All models are evaluated using GPT-4.1.
> Columns: Adj. = adjust, Ext. = extract, Rep. = replace, Rem. = remove, Back. = background, Sty. = style, Hyb. = hybrid, Act. = action.

| Model             | Add  | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall  |
|-------------------|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:--------:|
| MagicBrush        | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75  | 2.38 | 1.62 | 1.22 |   1.83   |
| Instruct-P2P      | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44  | 3.55 | 1.20 | 1.46 |   1.88   |
| AnyEdit           | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24  | 2.85 | 1.56 | 2.65 |   2.45   |
| UltraEdit         | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83  | 3.76 | 1.91 | 2.98 |   2.70   |
| Step1X-Edit       | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16  | 4.63 | 2.64 | 2.52 |   3.06   |
| BAGEL             | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24  | 4.49 | 2.38 | 4.17 |   3.20   |
| UniWorld-V1       | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99  | 4.21 | 2.96 | 2.74 |   3.26   |
| Janus-4o          | 3.35 | 3.35 | 2.25 | 3.01 | 2.18 | 3.32  | 4.71 | 2.49 | 4.04 |   3.19   |
| GPT-4o-Image      | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57  | 4.93 | 3.96 | 4.89 |   4.20   |
| **DIM-4.6B-Edit** | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87  | 4.92 | 2.85 | 4.08 | **3.67** |
### 🔬 ImgEdit Designer Ablation

> † The default setting.

| Designer       | Add  | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall  |
|:---------------|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:--------:|
| –              | 3.53 | 3.23 | 2.01 | 3.49 | 1.47 | 3.42  | 4.79 | 2.35 | 3.64 |   3.10   |
| Qwen2.5-VL-3B  | 3.80 | 3.24 | 2.03 | 3.89 | 3.21 | 3.52  | 4.92 | 2.71 | 4.05 |   3.49   |
| Qwen2.5-VL-7B  | 3.95 | 3.35 | 2.25 | 3.85 | 3.31 | 3.57  | 4.88 | 2.81 | 4.02 |   3.55   |
| MiMo-VL-7B     | 3.95 | 3.32 | 2.20 | 3.75 | 2.46 | 3.82  | 4.88 | 2.52 | 3.93 |   3.43   |
| InternVL3.5-8B | 3.98 | 3.40 | 2.05 | 4.14 | 3.30 | 3.84  | 4.94 | 2.77 | 3.89 |   3.59   |
| GLM-4.1V-9B    | 3.95 | 3.27 | 2.23 | 3.90 | 2.64 | 3.81  | 4.92 | 2.23 | 4.02 |   3.44   |
| GPT-4o†        | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87  | 4.92 | 2.85 | 4.08 | **3.67** |
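The Overall column in the ablation table above appears to be the unweighted mean of the nine category scores. This is an observation from the reported numbers, not something the source states. A minimal sketch that recomputes it and prints each designer's gain over the no-designer baseline (the dictionary keys and the `overall` helper are illustrative names):

```python
# Category scores copied from the designer ablation table above
# (order: Add, Adj., Ext., Rep., Rem., Back., Sty., Hyb., Act.).
ABLATION = {
    "none":           [3.53, 3.23, 2.01, 3.49, 1.47, 3.42, 4.79, 2.35, 3.64],
    "Qwen2.5-VL-3B":  [3.80, 3.24, 2.03, 3.89, 3.21, 3.52, 4.92, 2.71, 4.05],
    "Qwen2.5-VL-7B":  [3.95, 3.35, 2.25, 3.85, 3.31, 3.57, 4.88, 2.81, 4.02],
    "MiMo-VL-7B":     [3.95, 3.32, 2.20, 3.75, 2.46, 3.82, 4.88, 2.52, 3.93],
    "InternVL3.5-8B": [3.98, 3.40, 2.05, 4.14, 3.30, 3.84, 4.94, 2.77, 3.89],
    "GLM-4.1V-9B":    [3.95, 3.27, 2.23, 3.90, 2.64, 3.81, 4.92, 2.23, 4.02],
    "GPT-4o":         [4.09, 3.47, 2.30, 4.00, 3.43, 3.87, 4.92, 2.85, 4.08],
}


def overall(scores):
    """Unweighted mean of the category scores, rounded to two decimals.

    Assumption: this matches how the table's Overall column was derived.
    """
    return round(sum(scores) / len(scores), 2)


baseline = overall(ABLATION["none"])
for name, scores in ABLATION.items():
    score = overall(scores)
    print(f"{name:<15} overall={score:.2f} gain={score - baseline:+.2f}")
```

Under this reading, the largest single-category change between the no-designer row and the GPT-4o row is in Rem. (1.47 to 3.43).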
### 🖼️ Qualitative Visualization

> 🟢 **Green** and 🔵 **Blue** denote the edits of *Janus-4o* and *Step1X-Edit* respectively;
> 🔴 **Red** denotes the edits of our models trained on different data corpora.

![Overall](assets/vis_overall.png)

![Add](assets/vis_add.png)

![Change](assets/vis_change.png)

![Remove](assets/vis_remove.png)

![Replace](assets/vis_replace.png)

![Transfer](assets/vis_transfer.png)
## 📦 Dataset

### DIM-Edit

**Step 1.** Download [**DIM-Edit**](https://huggingface.co/datasets/stdKonjac/DIM-Edit) from our 🤗 HF repo using the `hf` CLI:

```bash
# 1. Install the huggingface_hub library (>= 0.32.0 for hf_xet support)
pip install -U huggingface_hub

# 2. Log in with your Hugging Face account token
hf auth login

# 3. Download the dataset
hf download stdKonjac/DIM-Edit --repo-type dataset --local-dir ./DIM-Edit
```

**Step 2.** Merge and extract the split archives:

```bash
cd DIM-Edit
cat images.tar.gz.part* > images.tar.gz
tar -xvzf images.tar.gz
```

**Step 3.** Each line of `tos_dataset_edit.jsonl` corresponds to a single sample with four fields:

| Field               | Description                                                                       |
|:--------------------|:----------------------------------------------------------------------------------|
| `id`                | Unique identifier for each sample.                                                |
| `image_path`        | Path to the **source** image, beginning with `image/`.                            |
| `image_path_target` | Path to the **target** image, beginning with `image/`.                            |
| `prompt`            | The CoT-style instruction describing how to transform the source into the target. |

**Step 4.** Load the dataset with the 🤗 `datasets` library:

```python
from datasets import load_dataset, Features, Value

features = Features({
    "id": Value("string"),
    "image_path": Value("string"),
    "image_path_target": Value("string"),
    "prompt": Value("string"),
})

ds = load_dataset(
    "json",
    data_files="DIM-Edit/tos_dataset_edit.jsonl",
    features=features,
    split="train",
)

print(ds[0])
```

#### 📜 DIM-Edit License

The **DIM-Edit** dataset is released under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.

### DIM-T2I

Please refer to [`T2I_DATASET.md`](https://github.com/showlab/DIM/blob/main/data/T2I_DATASET.md) for download instructions and licensing details.
## 🚀 Model

### ⚙️ Environment Setup

```bash
pip install -r requirements.txt
```

### 🦙 Model Zoo

Create a `checkpoints` folder in the root directory, then download the models from our 🤗 HF repo and move them into `checkpoints/`.

```bash
mkdir checkpoints
```

> 💡 To facilitate reproducibility, we release [**DIM-4.6B-Edit-Stage1**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit-Stage1),
> which is trained solely on the **UltraEdit** dataset. Fine-tuning this checkpoint on our proposed
> [**DIM-Edit**](https://huggingface.co/datasets/stdKonjac/DIM-Edit) dataset should reproduce
> [**DIM-4.6B-Edit**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit).

| Model                                                                             |     Task      |       Training Data        | ImgEdit |   Parameters    |
|:----------------------------------------------------------------------------------|:-------------:|:--------------------------:|:-------:|:---------------:|
| [**DIM-4.6B-T2I**](https://huggingface.co/stdKonjac/DIM-4.6B-T2I)                 | Text-to-Image | DIM-T2I + 6.9M Public Data |    –    | 3.0B❄️ + 1.6B🔥 |
| [**DIM-4.6B-Edit-Stage1**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit-Stage1) | Image Editing |         UltraEdit          |  2.76   | 3.0B❄️ + 1.6B🔥 |
| [**DIM-4.6B-Edit**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit)               | Image Editing |    UltraEdit → DIM-Edit    |  3.67   | 3.0B❄️ + 1.6B🔥 |

Organize the checkpoints as follows:

```
DIM/
└── checkpoints/
    ├── DIM-4.6B-T2I/
    │   ├── model.safetensors
    │   └── ...
    ├── DIM-4.6B-Edit-Stage1/
    │   ├── model.safetensors
    │   └── ...
    └── DIM-4.6B-Edit/
        ├── model.safetensors
        └── ...
```

### 🔮 Inference
#### 🎨 T2I Generation Demo

T2I instructions are provided in `cache/demo/tos_dataset_demo.jsonl`. Each line is a JSON instruction, e.g.:

```json
{
  "id": "0000",
  "image_path": "./cache/demo/edit_demo_0000.png",
  "prompt": "A yummy cupcake floating in the air dark background"
}
```

> The `image_path` is a placeholder; modify `prompt` to generate your own image.

Run:

```bash
bash scripts/demo_t2i.sh
```

Generated images will be saved to `cache/inference/demo/DIM-4.6B-T2I/{id}_gen.jpg`.
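To try several prompts at once, the demo file can be regenerated programmatically. The sketch below writes one JSON object per line in the format shown above. The helper name `write_t2i_prompts` and the zero-padded id scheme are illustrative assumptions, and it presumes `scripts/demo_t2i.sh` reads whichever jsonl file you write to.

```python
import json
from pathlib import Path


def write_t2i_prompts(prompts, jsonl_path,
                      placeholder_image="./cache/demo/edit_demo_0000.png"):
    """Write one JSON instruction per line, mirroring the demo format above."""
    jsonl_path = Path(jsonl_path)
    jsonl_path.parent.mkdir(parents=True, exist_ok=True)
    with jsonl_path.open("w", encoding="utf-8") as f:
        for index, prompt in enumerate(prompts):
            record = {
                "id": f"{index:04d}",              # zero-padded, like "0000" in the example
                "image_path": placeholder_image,   # placeholder field, per the note above
                "prompt": prompt,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return jsonl_path
```

For example, `write_t2i_prompts(["A red bicycle leaning on a brick wall"], "cache/demo/tos_dataset_demo.jsonl")` would overwrite the demo file with a single instruction, so back up the original first if you want to keep it.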
#### ✂️ Image Editing Demo

Edit instructions are provided in `cache/demo/tos_dataset_edit_demo.jsonl`. Each line looks like:

```json
{
  "id": "0",
  "image_path": "./cache/demo/edit_demo_0000.png",
  "prompt": "Remove the lemons on the table.",
  "image_path_target": "./cache/demo/edit_demo_0000.png"
}
```

`image_path` is the source image and `prompt` is the edit instruction; `image_path_target` is a placeholder.

In `infer/demo_edit.py`, use the `set_designer_gpt` API with your own key to set GPT-4o as the external designer for optimal performance:

```python
import os

# GPT-4o as external designer (reads the key from the OPENAI_API_KEY environment variable)
model.set_designer_gpt(api_key=os.environ['OPENAI_API_KEY'])
```

Alternatively, call one of the `set_designer_X` APIs to use an open-source VLM (auto-downloaded to local disk):

```python
# Qwen2.5-VL as external designer (choose one of the two sizes)
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-3B-Instruct')
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-7B-Instruct')

# InternVL3.5 as external designer (recommend using transformers==4.53.0)
model.set_designer_internvl(version='OpenGVLab/InternVL3_5-8B-HF')

# MiMo-VL as external designer
model.set_designer_mimo(version='XiaomiMimo/MiMo-VL-7B-RL-2508')

# GLM-4.1V as external designer (recommend using transformers==4.53.1)
model.set_designer_glm(version='THUDM/GLM-4.1V-9B-Thinking')
```

Run:

```bash
bash scripts/demo_edit.sh
```

The model first generates a CoT-guided edit instruction for each prompt (saved to `cache/inference/demo/DIM-4.6B-Edit/tos_dataset_edit_cot_demo_gen.jsonl`), then produces edited images at `cache/inference/demo/DIM-4.6B-Edit/{id}_edited.jpg`. A sample GPT-4o-generated CoT jsonl is provided at `cache/demo/tos_dataset_edit_cot_demo.jsonl` for reference.
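Since only one designer should be active per run, it can help to pick it by name in one place. The sketch below maps short labels to the `set_designer_*` calls listed above. The labels (`"gpt-4o"`, `"qwen-3b"`, and so on) and the `configure_designer` helper are hypothetical, and it assumes `model` exposes exactly the methods and keyword arguments shown in the snippets above.

```python
import os

# Label -> (setter method name, keyword arguments). Method names and model
# identifiers come from the snippets above; the labels are illustrative.
DESIGNERS = {
    "gpt-4o":   ("set_designer_gpt", None),  # api_key is filled in at call time
    "qwen-3b":  ("set_designer_qwen", {"version": "Qwen/Qwen2.5-VL-3B-Instruct"}),
    "qwen-7b":  ("set_designer_qwen", {"version": "Qwen/Qwen2.5-VL-7B-Instruct"}),
    "internvl": ("set_designer_internvl", {"version": "OpenGVLab/InternVL3_5-8B-HF"}),
    "mimo":     ("set_designer_mimo", {"version": "XiaomiMimo/MiMo-VL-7B-RL-2508"}),
    "glm":      ("set_designer_glm", {"version": "THUDM/GLM-4.1V-9B-Thinking"}),
}


def configure_designer(model, name):
    """Call exactly one set_designer_* method on `model` for the chosen label."""
    if name not in DESIGNERS:
        raise ValueError(f"unknown designer {name!r}; choose from {sorted(DESIGNERS)}")
    method_name, kwargs = DESIGNERS[name]
    if kwargs is None:
        # GPT-4o needs an API key; read it from the environment as in the snippet above.
        kwargs = {"api_key": os.environ["OPENAI_API_KEY"]}
    getattr(model, method_name)(**kwargs)
    return method_name
```

In `infer/demo_edit.py` this would replace the direct call, for example `configure_designer(model, "qwen-7b")`.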
### 📜 Model License

The models are developed based on [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) (subject to the [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE)) and [SANA1.5_1.6B_1024px](https://huggingface.co/Efficient-Large-Model/SANA1.5_1.6B_1024px) (subject to the [NVIDIA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_1.6B_1024px/blob/main/LICENSE.txt)). We retain ownership of all intellectual property rights in and to any derivative works and modifications that we made.

## 🧪 Evaluation
### GenEval

We provide two evaluation jsonl files in `cache/GenEval` based on prompt type:

1. `tos_dataset.jsonl`: original prompts.
2. `tos_dataset_rewritten.jsonl`: LLM-rewritten prompts.

> The `image_path` field is a placeholder. Before running, replace it with the path to any dummy image on your local disk.

Run:

```bash
bash scripts/eval_geneval.sh
```

Generated images will be saved to `cache/inference/DIM-4.6B-T2I/GenEval(_rewritten)`. Follow the [GenEval official repo](https://github.com/djghosh13/geneval) for metric calculation.
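Replacing the placeholder by hand is tedious for a full benchmark file. A minimal sketch, assuming each line is a JSON object with an `image_path` key as described above: it copies the jsonl to a new file and points every record at one local image. The helper name `set_placeholder_image` is illustrative, and the dummy image itself must already exist on disk.

```python
import json
from pathlib import Path


def set_placeholder_image(src_jsonl, dst_jsonl, image_path):
    """Copy a jsonl file, pointing every record's `image_path` at one local image.

    Returns the number of records written.
    """
    src_jsonl, dst_jsonl = Path(src_jsonl), Path(dst_jsonl)
    if src_jsonl.resolve() == dst_jsonl.resolve():
        # Writing in place would truncate the file while it is being read.
        raise ValueError("dst_jsonl must differ from src_jsonl")
    count = 0
    with src_jsonl.open(encoding="utf-8") as src, \
            dst_jsonl.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            record["image_path"] = str(image_path)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

After checking the output, swap it in for the original file (keeping a backup) so the evaluation script picks it up.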
### 🖼️ MJHQ-30K

Download [MJHQ-30K](https://huggingface.co/datasets/playgroundai/MJHQ-30K) (only `mjhq30k_imgs.zip` is needed), extract under `cache/` as:

```
cache
└── MJHQ-30K
    ├── animals
    │   ├── {id}.jpg
    │   └── ...
    ├── art
    ├── fashion
    ├── food
    ├── indoor
    ├── landscape
    ├── logo
    ├── people
    ├── plants
    └── vehicles
```

All MJHQ-30K prompts are in `cache/MJHQ-30K/tos_dataset.jsonl`.

Run:

```bash
bash scripts/eval_mjhq30k.sh
```

Generated images will be saved to `cache/inference/DIM-4.6B-T2I/MJHQ-30K`. We use [pytorch-fid](https://github.com/mseitzer/pytorch-fid) to compute FID.
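Before launching a long evaluation run, it may be worth confirming that the extracted folder matches the tree above. The following is a small sketch, not part of the repository: `check_mjhq_layout` is a hypothetical helper, and the category names are taken directly from the directory listing.

```python
from pathlib import Path

# Category folders taken from the directory tree above.
MJHQ_CATEGORIES = [
    "animals", "art", "fashion", "food", "indoor",
    "landscape", "logo", "people", "plants", "vehicles",
]


def check_mjhq_layout(root="cache/MJHQ-30K"):
    """Return {category: number of .jpg files}; raise if a category folder is missing."""
    root = Path(root)
    missing = [c for c in MJHQ_CATEGORIES if not (root / c).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing category folders under {root}: {missing}")
    return {c: sum(1 for _ in (root / c).glob("*.jpg")) for c in MJHQ_CATEGORIES}
```

Calling `check_mjhq_layout()` from the repository root returns per-category image counts, which should sum to roughly 30K as the benchmark name suggests.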
### ✏️ ImgEdit

Download [ImgEdit](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main) and organize under `cache/`:

```
cache
└── ImgEdit
    └── Benchmark
        ├── hard
        ├── multiturn
        └── singleturn
            ├── animal
            │   ├── {id}.jpg
            │   └── ...
            ├── architecture
            ├── clothes
            ├── compose
            ├── daily object
            ├── for_add
            ├── human
            ├── style
            ├── transport
            ├── judge_prompt.json
            └── singleturn.json
```

Four evaluation jsonl files are provided in `cache/ImgEdit`:

1. `tos_dataset_edit.jsonl`: original prompts.
2. `tos_dataset_edit_cot.jsonl`: CoT-style prompts from GPT-4o.
3. `tos_dataset_edit_cot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts from Qwen2.5-VL-3B.
4. `tos_dataset_edit_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts from Qwen2.5-VL-7B.

Run:

```bash
bash scripts/eval_imgedit.sh
```

Generated images will be saved to `cache/inference/DIM-4.6B-Edit/ImgEdit`. Follow the [ImgEdit official repo](https://github.com/PKU-YuanGroup/ImgEdit) for metric calculation.
### GEdit-Bench-EN

Download [GEdit-Bench](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench), extract raw images, and organize under `cache/`:

```
cache
└── GEdit-Bench
    └── input_image_raw
        ├── {id}.png
        ├── {id}.png
        ├── {id}.png
        └── ...
```

Four evaluation jsonl files are provided in `cache/GEdit-Bench`:

1. `tos_dataset_edit_en.jsonl`: original prompts.
2. `tos_dataset_edit_en_cot.jsonl`: CoT-style prompts from GPT-4o.
3. `tos_dataset_edit_en_cot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts from Qwen2.5-VL-3B.
4. `tos_dataset_edit_en_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts from Qwen2.5-VL-7B.

Run:

```bash
bash scripts/eval_gedit_bench.sh
```

Generated images will be saved to `cache/inference/DIM-4.6B-Edit/GEdit-Bench`. Follow the [GEdit-Bench official repo](https://github.com/stepfun-ai/Step1X-Edit) for metric calculation.
## 📖 Citation

If you find **DIM** useful for your research, please consider citing our paper:

```bibtex
@misc{zeng2025draw,
  title         = {Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing},
  author        = {Zeng, Ziyun and Zhang, David Junhao and Li, Wei and Shou, Mike Zheng},
  year          = {2025},
  eprint        = {2509.01986},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2509.01986}
}
```