The latest in generative artificial intelligence includes AI agents that can access the web to find answers to questions. While promising, agentic technology is very much a work in progress.
In a paper published last week, OpenAI researchers relate how the company's Deep Research technology, which was built to use the web, far outperforms OpenAI's other models at answering web questions. It also does far better than humans on tasks requiring hours of searching.
Also: What are AI agents? How to access a team of personalized assistants
But Deep Research still stumbles almost half the time.
OpenAI's new test suggests Deep Research can be more tenacious than human researchers in pursuit of an answer on some tasks, but it still often fails to come up with one.
Called BrowseComp, the test is described by authors Jason Wei and team as "a simple yet challenging benchmark for measuring the ability of agents to browse the web."
The premise is that AI agents -- meaning, AI models that can browse "thousands of web pages" -- could be much more resourceful than humans, who have limited memory, get fatigued surfing the web, and "can only attend to one thing at a time and cannot be parallelized," meaning they can't direct their brains to operate on data in parallel streams of thought.
"Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted," write Wei and team.
Also: OpenAI's Deep Research can save you hours of work - and now it's a lot cheaper to access
Wei and team built on their prior work from last year, "SimpleQA," which tests AI models' ability to answer "short, fact-seeking questions." The questions covered TV and movie trivia, science, history, music, video games, politics, and other topics.
The BrowseComp set of 1,266 questions is designed to go beyond simple information retrieval, the authors relate. Instead, they are questions for which the answers are hard to find -- or, as they put it, questions that are "challenging because they require searching through a large space of potential answers and matching them to constraints posed in the question," demanding "hard-to-find, deeply entangled information on the web."
For example, one question-answer pair is the following:
Identify the title of a research publication published before June 2023, that mentions cultural traditions, scientific processes, and culinary innovations. It is co-authored by three individuals: one of them was an assistant professor in West Bengal and another one holds a Ph.D.
(Answer: The Fundamentals of Bread Making: The Science of Bread)
They emphasize that such a question is easy to verify because the answer is a single, "self-contained" phrase.
The questions and answers were developed by human "trainers," and they were selected as being impossible to solve with just OpenAI's ChatGPT, with or without browsing abilities. The questions were also impossible for an "early version" of Deep Research.
Demonstrating just how weak humans are at searching the web, the team first asked humans who were "familiar with the dataset" to answer the questions.
The results were not good for the humans. They gave up on 70% of the questions after two hours of effort and answered only about 30% of them, and 14% of the answers they did propose did not match the actual answer.
Wei and team hypothesize that humans with higher searching skills could do better: "It is possible that many of the problems that they gave up on would be solvable by experienced professionals (e.g., detectives or investigative journalists) with ample time."
After the humans, they tested Deep Research against OpenAI's GPT-4o (with and without browsing abilities), GPT-4.5, and the o1 model.
The results were abysmal. "GPT-4o and GPT-4.5 achieved near-zero accuracy, highlighting the difficulty of the benchmark," they write. "Without strong reasoning or tool use, models fail to retrieve the kinds of obscure, multi-hop facts BrowseComp targets."
The o1 model fared better, which "[suggests] that some BrowseComp answers can be surfaced through inference over internal knowledge."
Also: AI unleashes more advanced scams. Here's what to look out for (and how to stay protected)
With a score of 51.5%, Deep Research was "significantly better," and "it is particularly effective at answering the niche, non-intuitive questions that require browsing numerous websites," Wei and team write.
However, they also found that GPT-4o using browsing and Deep Research could err by being "overconfident" about wrong answers, which is known as a calibration error.
"Models with browsing capabilities such as GPT-4o with browsing and Deep Research exhibit higher calibration error," they write, "suggesting that access to web tools may increase the model's confidence in incorrect answers. This aligns with observations that Deep Research struggles with confidence calibration and often fails to convey uncertainty accurately at present."
To correct for calibration error, they ran another test in which Deep Research had to output as many as 64 answers to each question and then pick the best of them. When it did so, Deep Research was quite good at choosing the right answer from among its own proposals.
That, write Wei and team, suggests that "the model frequently 'knows' when it's right, even if it struggles to express that certainty as a calibrated probability."
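The procedure Wei and team describe resembles best-of-N sampling followed by self-selection. Below is a minimal, hypothetical sketch of that idea in Python; generate_answer and choose_best are placeholder callables standing in for whatever model calls an actual implementation would use, not OpenAI's API.

from typing import Callable

def best_of_n(question: str,
              generate_answer: Callable[[str], str],
              choose_best: Callable[[str, list[str]], str],
              n: int = 64) -> str:
    # Draw up to n independent candidate answers for the same question.
    candidates = [generate_answer(question) for _ in range(n)]
    # Ask the model to select the candidate it judges most likely correct,
    # rather than trusting any single attempt's stated confidence.
    return choose_best(question, candidates)

The point of the design is that the final selection step leans on the model's ability to recognize a correct answer among its own samples, which, per the authors' observation, is stronger than its ability to report calibrated confidence on a single attempt.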
Also: Google's latest chip is all about reducing one huge hidden cost in AI
They note, too, that Deep Research's accuracy improves as more computing power is devoted to it while it searches the web. Put differently, "performance scales smoothly as a function of the amount of test-time compute used." That squares with an increasing trend of throwing more GPU chips at the task of inference.
Wei and team don't directly offer any hypothesis about why Deep Research fails almost half the time, but the implicit answer is in the scaling of its ability with more compute. As they run more parallel tasks and ask the model to evaluate multiple answers, accuracy rises to more than 75% of questions answered.
The implication is that it is essential to choose strategies that force the model to evaluate its own efforts rather than simply chasing a single answer. Without that evaluation stage, the model struggles a good deal of the time.
Also: With AI models clobbering every benchmark, it's time for human evaluation
A big hole in BrowseComp, the authors acknowledge, is that it is limited to questions that are easy for the computer to parse and whose answers are easy to verify. None of the 1,266 questions required "long responses or ability to resolve ambiguity in user queries."
As a result, BrowseComp, they argue, tests "core" functions of AI agents but is not comprehensive. "The model must be very proficient at locating hard-to-find pieces of information, but it's not guaranteed that this generalizes to all tasks that require browsing."
Deep Research is available to users of OpenAI's Plus and Pro subscriptions.