Stateful Visual Encoders for Vision-Language Models Paper • 2606.04433 • Published 12 days ago • 8
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions Paper • 2308.09936 • Published Aug 19, 2023 • 1
Matryoshka Query Transformer for Large Vision-Language Models Paper • 2405.19315 • Published May 29, 2024 • 1
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models Paper • 2410.08182 • Published Oct 10, 2024
Verbalized Representation Learning for Interpretable Few-Shot Generalization Paper • 2411.18651 • Published Nov 27, 2024
Interleaving Reasoning for Better Text-to-Image Generation Paper • 2509.06945 • Published Sep 8, 2025 • 16
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models Paper • 2509.25143 • Published Sep 29, 2025
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping Paper • 2510.08457 • Published Oct 9, 2025 • 14
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence Paper • 2512.10863 • Published Dec 11, 2025 • 22
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks Paper • 2604.08539 • Published Apr 9 • 50
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction Paper • 2603.05888 • Published Mar 6 • 2
Language Models Meet World Models: Embodied Experiences Enhance Language Models Paper • 2305.10626 • Published May 18, 2023 • 1
On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning Paper • 2210.10763 • Published Oct 19, 2022 • 1
OmniControlNet: Dual-stage Integration for Conditional Image Generation Paper • 2406.05871 • Published Jun 9, 2024
YOLO-Count: Differentiable Object Counting for Text-to-Image Generation Paper • 2508.00728 • Published Aug 1, 2025
FrontierCS: Evolving Challenges for Evolving Intelligence Paper • 2512.15699 • Published Dec 17, 2025 • 5
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published Jan 23 • 40