Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Paper • 2506.00421 • Published • 5
This repository contains the Retrieval Module of our model, which enables chatbots to understand and respond to both visual and audio inputs in immersive, real-time interactions.

Our model is built on Qwen2-VL-2B-Instruct and extends it with audio understanding using CLAP via a lightweight linear-layer adapter. This design allows the model to process and respond over textual, visual, and audio inputs within a unified framework. The full model comprises two modules, a Dialogue Module and a Retrieval Module, both of which share this audio-extended backbone.
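The adapter design described above can be sketched as follows. This is a minimal, hypothetical illustration, not the released implementation: the embedding dimensions (512 for pooled CLAP audio embeddings, 1536 for the Qwen2-VL-2B hidden size) and the module name `AudioAdapter` are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not taken from the released code):
CLAP_DIM = 512     # pooled CLAP audio embedding size
QWEN_HIDDEN = 1536  # Qwen2-VL-2B hidden size

class AudioAdapter(nn.Module):
    """Linear adapter projecting CLAP audio embeddings into the
    LLM's token-embedding space, so audio can be consumed alongside
    text and image tokens."""

    def __init__(self, clap_dim: int = CLAP_DIM, hidden: int = QWEN_HIDDEN):
        super().__init__()
        self.proj = nn.Linear(clap_dim, hidden)

    def forward(self, clap_emb: torch.Tensor) -> torch.Tensor:
        # clap_emb: (batch, clap_dim) pooled audio embedding from CLAP.
        # Returns (batch, hidden), usable as one audio "token" per clip.
        return self.proj(clap_emb)

adapter = AudioAdapter()
audio_tokens = adapter(torch.randn(2, CLAP_DIM))
print(tuple(audio_tokens.shape))  # (2, 1536)
```

In practice the projected embedding would be inserted into the input sequence at an audio placeholder position before being passed to the frozen or fine-tuned backbone; the sketch only shows the projection step.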
If you find this work useful, please cite:
@inproceedings{jang-etal-2025-enabling,
title = "Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions",
author = {Jang, Jihyoung and
Bae, Minwook and
Kim, Minji and
Hakkani-T{\"u}r, Dilek and
Kim, Hyounghun},
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1519/",
doi = "10.18653/v1/2025.acl-long.1519",
pages = "31481--31512",
ISBN = "979-8-89176-251-0"
}