🌱VitaBench: Benchmarking LLM Agents
with Versatile Interactive Tasks in Real-world Applications

Wei He†,*, Yueqing Sun*, Hongyan Hao*, Xueyuan Hao*, Zhikang Xia, Qi Gu,
Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su,
Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
Meituan LongCat Team
†Correspondence to: whe23@m.fudan.edu.cn, guqi03@meituan.com
*Indicates equal contribution.

Abstract

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture the inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks and less than a 50% success rate on the others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications.
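To make the rubric-based sliding window evaluator mentioned above more concrete, the snippet below is a minimal sketch of the general idea under our own assumptions: each task carries a set of rubric items, and a fixed-size window slides over the agent's conversation and tool-call trajectory, marking an item as satisfied if it holds in any window. The names (`RubricItem`, `evaluate_trajectory`, the window size) are hypothetical and do not reflect the actual VitaBench implementation described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical rubric item: a human-readable criterion plus a check
# (programmatic or LLM-judged) applied to a window of trajectory events.
@dataclass
class RubricItem:
    description: str
    check: Callable[[List[dict]], bool]  # True if the criterion holds in the window

def evaluate_trajectory(events: List[dict],
                        rubric: List[RubricItem],
                        window_size: int = 8) -> float:
    """Slide a fixed-size window over the trajectory; a rubric item counts as
    satisfied if it holds in at least one window. Returns the fraction of
    satisfied items (a task might then count as solved only if all are met)."""
    satisfied = [False] * len(rubric)
    for start in range(0, max(1, len(events) - window_size + 1)):
        window = events[start:start + window_size]
        for i, item in enumerate(rubric):
            if not satisfied[i] and item.check(window):
                satisfied[i] = True
    return sum(satisfied) / len(rubric) if rubric else 1.0
```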


Overview

This plot summarizes per-model success rates under our two evaluation regimes. The most striking observation is the consistent and large performance gap: even the best model reaches only 48.3% success on the 300 single-scenario tasks, while performance falls to 30.0% on the 100 cross-scenario tasks. This sharp drop highlights challenges in domain switching, expanded tool selection, and long-horizon coordination. Relative model rankings are broadly preserved, but absolute success rates remain low—indicating substantial headroom for improvement.


Method

VitaBench is built through a two-stage pipeline. Stage I (Framework Design) abstracts the real life-serving scenarios, namely Delivery, In-store Consumption, and Online Travel (OTA), into a directed graph of simplified API tools with explicit pre-/post-conditions and inter-tool dependencies, thereby encoding domain rules directly into tool structures and enabling cross-domain composition. Stage II (Task Creation) constructs tasks from anonymized real user profiles, composite instructions, and realistic environments augmented with curated distractors and transaction histories. Each task is iteratively validated with human checks to ensure clarity while preserving multiple valid solutions, resulting in 400 tasks with comprehensive databases.
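As a rough illustration of the Stage I abstraction, the sketch below models tools as nodes in a directed graph whose edges encode inter-tool dependencies, with explicit pre- and post-conditions attached to each node, so that domain rules live in the tool structure rather than in scenario-specific policy documents. This is a minimal sketch under our own assumptions, not the actual VitaBench implementation; the tool names and condition strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Tool:
    name: str
    preconditions: Set[str] = field(default_factory=set)   # state facts required before calling
    postconditions: Set[str] = field(default_factory=set)  # state facts established by the call
    depends_on: List[str] = field(default_factory=list)    # tools that must have been called earlier

class ToolGraph:
    """Directed graph of tools; encoding rules in the tools themselves allows
    scenarios to be composed without domain-specific policies."""
    def __init__(self):
        self.tools: Dict[str, Tool] = {}

    def add(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def callable_now(self, state: Set[str], called: Set[str]) -> List[str]:
        """Tools whose preconditions hold in the current state and whose
        dependencies have already been invoked."""
        return [t.name for t in self.tools.values()
                if t.preconditions <= state and set(t.depends_on) <= called]

# Hypothetical cross-scenario composition: delivery tools chained behind a search tool.
graph = ToolGraph()
graph.add(Tool("search_restaurants", postconditions={"restaurant_selected"}))
graph.add(Tool("create_delivery_order",
               preconditions={"restaurant_selected"},
               depends_on=["search_restaurants"],
               postconditions={"order_created"}))
print(graph.callable_now(state=set(), called=set()))  # ['search_restaurants']
```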

Overview of the VitaBench construction pipeline

🏆Leaderboard🏆

Here we present comprehensive evaluation results on VitaBench. Last updated on 2025-09-30. For more detailed analysis, please refer to our paper on arXiv.

Thinking Models

| Rank | Model | Cross-Scenario Avg@4 | Cross-Scenario Pass@4 | Cross-Scenario Pass^4 | Delivery Avg@4 | Delivery Pass@4 | Delivery Pass^4 | In-store Avg@4 | In-store Pass@4 | In-store Pass^4 | OTA Avg@4 | OTA Pass@4 | OTA Pass^4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o3 (high) | 30.0 | 61.0 | 6.0 | 53.5 | 83.0 | 24.0 | 53.5 | 86.0 | 19.0 | 37.8 | 66.0 | 10.0 |
| 2 | Claude-4.1-Opus (w/ thinking) | 29.0 | 56.0 | 6.0 | 47.5 | 80.0 | 17.0 | 52.5 | 78.0 | 20.0 | 32.3 | 57.0 | 9.0 |
| 3 | LongCat-Flash-Thinking | 24.3 | 54.0 | 3.0 | 42.3 | 71.0 | 13.0 | 56.8 | 85.0 | 25.0 | 28.3 | 59.0 | 6.0 |
| 4 | Gemini-2.5-Pro | 23.5 | 53.0 | 5.0 | 49.0 | 81.0 | 16.0 | 43.8 | 78.0 | 12.0 | 26.5 | 54.0 | 6.0 |
| 5 | Claude-4-Sonnet (w/ thinking) | 23.0 | 51.0 | 6.0 | 46.0 | 78.0 | 15.0 | 51.5 | 80.0 | 21.0 | 29.0 | 55.0 | 9.0 |
| 6 | GLM-4.5 (w/ thinking) | 22.8 | 48.0 | 2.0 | 44.5 | 77.0 | 14.0 | 52.8 | 80.0 | 22.0 | 28.8 | 55.0 | 7.0 |
| 7 | o4-mini (high) | 19.5 | 49.0 | 1.0 | 44.5 | 80.0 | 15.0 | 46.5 | 81.0 | 15.0 | 23.5 | 50.0 | 5.0 |
| 8 | Qwen3-235B-A22B-Thinking-2507 | 18.8 | 45.0 | 2.0 | 44.0 | 78.0 | 9.0 | 46.0 | 80.0 | 9.0 | 17.5 | 41.0 | 2.0 |
| 9 | Doubao-Seed-1.6-Thinking | 17.0 | 42.0 | 1.0 | 30.3 | 59.0 | 10.0 | 43.3 | 78.0 | 10.0 | 18.0 | 45.0 | 2.0 |
| 10 | DeepSeek-R1-0528 | 14.5 | 39.0 | 0.0 | 40.3 | 72.0 | 11.0 | 41.3 | 79.0 | 7.0 | 13.0 | 32.0 | 2.0 |
| 11 | Gemini-2.5-Flash (think on) | 5.3 | 14.0 | 0.0 | 32.0 | 62.0 | 9.0 | 23.0 | 57.0 | 3.0 | 18.3 | 39.0 | 1.0 |
| 12 | Qwen3-32B (w/ thinking) | 5.0 | 24.0 | 0.0 | 22.8 | 53.0 | 4.0 | 26.5 | 60.0 | 3.0 | 7.3 | 18.0 | 1.0 |

Non-thinking Models

| Rank | Model | Cross-Scenario Avg@4 | Cross-Scenario Pass@4 | Cross-Scenario Pass^4 | Delivery Avg@4 | Delivery Pass@4 | Delivery Pass^4 | In-store Avg@4 | In-store Pass@4 | In-store Pass^4 | OTA Avg@4 | OTA Pass@4 | OTA Pass^4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude-4.1-Opus (w/o thinking) | 21.8 | 47.0 | 3.0 | 46.0 | 78.0 | 13.0 | 53.8 | 85.0 | 21.0 | 30.8 | 60.0 | 9.0 |
| 2 | Claude-4-Sonnet (w/o thinking) | 21.3 | 49.0 | 4.0 | 39.0 | 69.0 | 17.0 | 46.3 | 78.0 | 10.0 | 25.0 | 49.0 | 7.0 |
| 3 | LongCat-Flash-Chat | 20.3 | 45.0 | 2.0 | 39.5 | 71.0 | 15.0 | 50.5 | 84.0 | 15.0 | 22.8 | 49.0 | 2.0 |
| 4 | GLM-4.5 (w/o thinking) | 20.0 | 47.0 | 1.0 | 45.8 | 72.0 | 20.0 | 48.3 | 82.0 | 13.0 | 20.3 | 45.0 | 2.0 |
| 5 | DeepSeek-V3.1 (w/o thinking) | 16.3 | 40.0 | 1.0 | 34.0 | 67.0 | 6.0 | 42.5 | 76.0 | 7.0 | 18.3 | 47.0 | 1.0 |
| 6 | Kimi-K2-0905 | 15.5 | 39.0 | 2.0 | 35.3 | 68.0 | 9.0 | 42.5 | 78.0 | 10.0 | 22.0 | 46.0 | 4.0 |
| 7 | Qwen3-235B-A22B-Instruct-2507 | 14.3 | 38.0 | 0.0 | 34.3 | 66.0 | 6.0 | 44.8 | 87.0 | 13.0 | 20.0 | 45.0 | 1.0 |
| 8 | GPT-4.1 | 13.8 | 35.0 | 0.0 | 37.8 | 67.0 | 11.0 | 42.5 | 71.0 | 17.0 | 19.8 | 42.0 | 1.0 |
| 9 | Doubao-Seed-1.6 | 10.5 | 29.0 | 0.0 | 37.8 | 65.0 | 12.0 | 39.5 | 73.0 | 9.0 | 18.8 | 39.0 | 3.0 |
| 10 | Gemini-2.5-Flash (think off) | 5.8 | 17.0 | 1.0 | 31.0 | 65.0 | 6.0 | 22.8 | 46.0 | 3.0 | 18.5 | 44.0 | 1.0 |
| 11 | Qwen3-32B (w/o thinking) | 4.0 | 12.0 | 0.0 | 16.5 | 37.0 | 3.0 | 21.3 | 47.0 | 2.0 | 3.0 | 11.0 | 0.0 |
| 12 | GPT-5 (minimal) | 4.0 | 9.0 | 0.0 | 30.0 | 64.0 | 6.0 | 27.0 | 60.0 | 2.0 | 7.8 | 22.0 | 0.0 |
| 13 | DeepSeek-V3-0324 | 3.8 | 12.0 | 0.0 | 25.3 | 53.0 | 5.0 | 34.3 | 71.0 | 5.0 | 10.3 | 26.0 | 1.0 |
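All values in the tables above are percentages computed over 4 independent trials per task. As a reading aid only, the snippet below shows one plausible way such Avg@4 / Pass@4 / Pass^4 numbers can be aggregated from per-task trial outcomes; the exact definitions used by the leaderboard are given in the paper, so treat this as an illustrative sketch under assumed metric definitions.

```python
from typing import List

def aggregate(results: List[List[bool]]) -> dict:
    """results[i][j] is whether trial j of task i succeeded (4 trials per task).
    Assumed definitions: Avg@4 = mean per-trial success rate,
    Pass@4 = task solved in at least one trial, Pass^4 = solved in all trials."""
    n = len(results)
    avg_at_4 = sum(sum(r) / len(r) for r in results) / n
    pass_at_4 = sum(any(r) for r in results) / n
    pass_hat_4 = sum(all(r) for r in results) / n
    return {"Avg@4": 100 * avg_at_4, "Pass@4": 100 * pass_at_4, "Pass^4": 100 * pass_hat_4}

# Toy example with 2 tasks x 4 trials each.
print(aggregate([[True, False, True, True], [False, False, True, False]]))
# {'Avg@4': 50.0, 'Pass@4': 100.0, 'Pass^4': 0.0}
```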

1. We will periodically refresh the dataset by correcting errors, replacing outdated samples, and adding new challenging tasks. All leaderboard metrics are updated concurrently to reflect these changes.
2. Due to API stability concerns, we are currently unable to evaluate some models on this benchmark. We are actively working to resolve these issues and include the latest models.
3. While the tasks are grounded in real-world life-serving platforms where the majority of the data is originally in Chinese, we are also preparing an English version of the dataset to facilitate broader research use.

BibTeX

@article{he2025vitabench,
    title={VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications}, 
    author={He, Wei and Sun, Yueqing and Hao, Hongyan and Hao, Xueyuan and Xia, Zhikang and Gu, Qi and Han, Chengcheng and Zhao, Dengchang and Su, Hui and Zhang, Kefeng and Gao, Man and Su, Xi and Cai, Xiaodong and Cai, Xunliang and Yang, Yu and Zhao, Yunke},
    journal={arXiv preprint arXiv:2509.26490},
    year={2025}
}