
Companies like Google and Microsoft love to brag about the generative AI tools that power their deep research agents and search engines. In some cases, those AI tools are forced upon the consumer, such as the controversial AI summaries that appear atop Google search results.
Most people will take the AI search results provided by these products at face value, even when those results aren’t factually correct. That’s a big problem, because a new study that tested several generative AI search engines found that around one-third of the claims they make are one-sided or not backed up by the sources they cite.
In the study, published recently on the pre-print server arXiv, the researchers tested several AI search engines, including OpenAI’s GPT-4.5 and GPT-5, You.com, Perplexity, and Microsoft’s Bing Chat, as well as several deep research agents, including GPT-5’s Deep Research feature, Bing Chat’s Think Deeper option, and the tools offered by You.com, Google Gemini and Perplexity.
According to New Scientist, 23 percent of the claims made by the Bing Chat search engine were unsupported, as were 31 percent of the claims made by the You.com and Perplexity AI search engines. GPT-4.5 produced unsupported claims 47 percent of the time, and Perplexity’s deep research agent made unsupported claims a whopping 97.5 percent of the time.
“Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices,” the researchers wrote in their report.
The researchers concluded, “Our evaluation demonstrates that current public systems fall short of their promise to deliver trustworthy, source-grounded synthesis. Generative search engines tend to produce concise and relevant answers but consistently exhibit one-sided framing and frequent overconfidence, particularly on debate-style queries.
“Deep research agents, while reducing overconfidence and improving citation thoroughness, often overwhelm users with verbose, low-relevance responses and large fractions of unsupported claims. Importantly, our findings show that increasing the number of sources or length of responses does not reliably improve grounding or accuracy; instead, it can exacerbate user fatigue and obscure transparency.
“Citation practices remain a persistent weakness across both classes of systems,” they continued. “Many citations are either inaccurate or incomplete, with some models listing sources that are never cited or irrelevant to their claims. This creates a misleading impression of evidential rigor while undermining user trust.”
OpenAI, You.com, Microsoft and Google either didn’t respond to requests for comment or declined to comment on the study. Perplexity disputed the study’s methodology but otherwise declined to comment.