Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation stems from the intrinsic data imbalance between text and video and is difficult to remedy because collecting and annotating counterfactual data is expensive. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video-editing and QA-generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. On this basis, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime in which the RL phase applies pair-wise ℓ1 advantage normalization, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization. We will open-source our dataset and code.
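To make the pair-wise ℓ1 advantage normalization mentioned above concrete, here is a minimal PyTorch sketch. The function name `pairwise_l1_normalize` and the specific formulation (dividing each original/edited pair's advantages by their summed absolute values) are assumptions for illustration only; the paper's released implementation may differ.

```python
import torch

def pairwise_l1_normalize(adv_orig: torch.Tensor, adv_edit: torch.Tensor, eps: float = 1e-8):
    """Normalize each (original, edited) advantage pair by its L1 norm.

    Hypothetical sketch of pair-wise L1 advantage normalization: each tensor has
    shape (num_pairs,), holding one advantage per rollout on the original or the
    counterfactually edited video. Dividing by the per-pair L1 norm puts all pairs
    on a comparable scale before the policy-gradient update.
    """
    l1 = adv_orig.abs() + adv_edit.abs() + eps   # per-pair L1 norm (eps avoids division by zero)
    return adv_orig / l1, adv_edit / l1


# Toy usage: baseline-subtracted rewards from rollouts on three original/edited video pairs.
if __name__ == "__main__":
    adv_o = torch.tensor([0.6, -0.2, 1.5])
    adv_e = torch.tensor([-0.4, 0.3, -0.5])
    print(pairwise_l1_normalize(adv_o, adv_e))
```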
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives (2025)
- SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding (2025)
- VACoT: Rethinking Visual Data Augmentation with VLMs (2025)
- NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models (2025)
- Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings (2025)
- Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment (2025)
- Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination (2025)