CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents Paper • 2603.24440 • Published 17 days ago • 96
EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings Paper • 2603.13594 • Published 29 days ago • 147
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper • 2602.05400 • Published Feb 5 • 352
Grounding and Enhancing Informativeness and Utility in Dataset Distillation Paper • 2601.21296 • Published Jan 29 • 19
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique Paper • 2511.09067 • Published Nov 12, 2025 • 2
Grounding Computer Use Agents on Human Demonstrations Paper • 2511.07332 • Published Nov 10, 2025 • 107
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Paper • 2511.02778 • Published Nov 4, 2025 • 103
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 39
MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems Paper • 2404.09486 • Published Apr 15, 2024 • 2
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness Paper • 2507.01702 • Published Jul 2, 2025 • 4
BigCodeArena: Judging code generations end to end with code executions Article • Published Oct 7, 2025 • 22
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 179
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20, 2025 • 43