A collection of benchmarks for evaluating LMs or VLMs under multi-turn interaction
Young-Jun Lee
passing2961
AI & ML interests
Social Dialogue System, Multi-Modal Dialogue
Recent Activity
upvoted a paper 1 day ago
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
upvoted a paper 2 days ago
iOSWorld: A Benchmark for Personally Intelligent Phone Agents
upvoted a paper 2 days ago
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents