MoNaCo: Exhaustively Benchmarking AI Reasoning Across Dozens of Documents
Dr. Tomer Wolfson
Abstract:
The landscape of AI-powered web search is being transformed, evolving from static large language models to LLM-powered "agents" that surface hard-to-find information through multiple rounds of reasoning and retrieval. Systems such as the Deep Research Agents by OpenAI and Gemini are increasingly embedded in users’ workflows, tasked with everything from simple question answering to writing complex, multi-step reports.
Evaluating these agents requires measuring their ability to formulate and execute long-horizon plans that demand integrating information from dozens or even hundreds of web pages. Yet existing benchmarks are largely limited to tasks involving only a handful of sources, failing to capture the recall-intensive challenges found in many real-world applications.
In this talk, I will introduce MoNaCo, the first benchmark that evaluates AI agents’ ability to answer long-horizon questions that span dozens, and in some cases hundreds, of web pages. MoNaCo consists of 1,315 human-authored questions, each paired with ground-truth answers, supporting evidence, and gold-standard reasoning chains. We evaluate 15 frontier LLMs and identify substantial performance gaps: the top-performing model, o3, achieves a perfect score on just 38.7% of the benchmark.
Using MoNaCo, I will show that deep reasoning across large document collections remains an open challenge for current AI agents, and I will discuss promising directions for bridging this gap.
Bio:
Tomer Wolfson is a Postdoctoral Fellow at the University of Pennsylvania, working with Prof. Dan Roth. His research lies at the intersection of Natural Language Processing and Data Management. He is interested in language models capable of understanding complex tasks that involve reasoning over hundreds of documents in order to piece together an answer. Tomer has been named a "Postdoctoral Fellow for AI and Data Science" by the Israeli Council for Higher Education.
Previously, Tomer completed his PhD at Tel Aviv University, advised by Prof. Jonathan Berant and Prof. Daniel Deutch. During his PhD, he also served as a research intern at the Allen Institute for AI (Ai2), and was awarded the "PhD Fellowship for Data Science" by the Israeli Council for Higher Education.

