Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights Paper • 2512.01816 • Published 15 days ago • 88
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5 • 34
P1: Mastering Physics Olympiads with Reinforcement Learning Paper • 2511.13612 • Published 29 days ago • 132
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning Paper • 2511.01833 • Published Nov 3 • 15
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! Paper • 2509.26495 • Published Sep 30 • 10
Symbolic Graphics Programming with Large Language Models Paper • 2509.05208 • Published Sep 5 • 46 • 7
SGP-Generation Collection Symbolic Graphic Programming with Large Language Model • 5 items • Updated Sep 11 • 3