Reliable Evaluation of Interactive LLM Agents in a World of Apps and People: AppWorld

Harsh Trivedi (Stony Brook University)

11 December 2024
from 10:00

Tomorrow in the Lamarr NLP Colloquium, we have the pleasure of hosting Harsh Trivedi from Stony Brook University.

His recent work, AppWorld, received a Best Resource Paper award at ACL’24, and his work on AI safety via debate received a Best Paper award at the ML Safety workshop at NeurIPS’22. His work has made waves at Stanford, Google, Apple, and many other places (https://appworld.dev/talks).

Abstract: We envision a world where AI agents (assistants) are widely used for complex tasks in our digital and physical worlds and are broadly integrated into our society. To move towards such a future, we need an environment for a robust evaluation of agents’ capability, reliability, and trustworthiness.

In this talk, I’ll introduce AppWorld, which is a step towards this goal in the context of day-to-day digital tasks. AppWorld is a high-fidelity simulated world of people and their digital activities on nine apps like Amazon, Gmail, and Venmo. On top of this fully controllable world, we build a benchmark of complex day-to-day tasks such as splitting Venmo bills with roommates, which agents have to solve via interactive coding and API calls. One of the fundamental challenges with complex tasks lies in accounting for the different ways in which the tasks can be completed. I will describe how we address this challenge using a reliable and programmatic evaluation framework. Our benchmarking evaluations show that even the best LLMs, like GPT-4o, can only solve ~30% of such tasks, highlighting the challenging nature of the AppWorld benchmark. I will conclude by laying out future research that can be conducted on the foundation of AppWorld, such as the evaluation and development of multimodal, collaborative, safe, socially intelligent, resourceful, and fail-tolerant agents that can plan, adapt, and learn from environment feedback.

Project Website:

https://appworld.dev/

Bio: Harsh Trivedi is a final-year PhD researcher at Stony Brook University, advised by Niranjan Balasubramanian. He is broadly interested in the development of reliable, explainable AI systems and their rigorous evaluation.

Specifically, his research spans the domains of AI agents, multi-step reasoning, AI safety, and efficient NLP. He has interned at AI2 and was a visiting researcher at NYU. If you’re interested, you can get in touch with him at hjtrivedi@cs.stonybrook.edu for follow-ups.
