Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Paper • 2506.00421 • Published • 5
This repository contains the Retrieval Module of our model, which enables chatbots to understand and respond to both visual and audio inputs in immersive, real-time interactions.

Our model is built on Qwen2-VL-2B-Instruct and extends it with audio understanding using CLAP via a lightweight linear-layer adapter. This design allows the model to process and respond over textual, visual, and audio inputs within a unified framework. The full model comprises two modules, a Dialogue Module and a Retrieval Module, both of which share this audio-extended backbone.
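The adapter design described above can be sketched as follows. This is a minimal, hypothetical illustration, not the released implementation: the embedding dimensions (512 for pooled CLAP audio embeddings, 1536 for the Qwen2-VL-2B hidden size) and the module name `AudioAdapter` are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not taken from the released code):
CLAP_DIM = 512     # pooled CLAP audio embedding size
QWEN_HIDDEN = 1536  # Qwen2-VL-2B hidden size

class AudioAdapter(nn.Module):
    """Linear adapter projecting CLAP audio embeddings into the
    LLM's token-embedding space, so audio can be consumed alongside
    text and image tokens."""

    def __init__(self, clap_dim: int = CLAP_DIM, hidden: int = QWEN_HIDDEN):
        super().__init__()
        self.proj = nn.Linear(clap_dim, hidden)

    def forward(self, clap_emb: torch.Tensor) -> torch.Tensor:
        # clap_emb: (batch, clap_dim) pooled audio embedding from CLAP.
        # Returns (batch, hidden), usable as one audio "token" per clip.
        return self.proj(clap_emb)

adapter = AudioAdapter()
audio_tokens = adapter(torch.randn(2, CLAP_DIM))
print(tuple(audio_tokens.shape))  # (2, 1536)
```

In practice the projected embedding would be inserted into the input sequence at an audio placeholder position before being passed to the frozen or fine-tuned backbone; the sketch only shows the projection step.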
If you find this work useful, please cite:
@inproceedings{jang-etal-2025-enabling,
title = "Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions",
author = {Jang, Jihyoung and
Bae, Minwook and
Kim, Minji and
Hakkani-T{\"u}r, Dilek and
Kim, Hyounghun},
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1519/",
doi = "10.18653/v1/2025.acl-long.1519",
pages = "31481--31512",
ISBN = "979-8-89176-251-0"
}