# GPTZero Found 100+ Hallucinated Citations in NeurIPS 2025 Papers—Voice AI for Demos Proves Why Reading Beats Generating

GPTZero just published their analysis of 4,841 papers accepted by NeurIPS 2025, one of the world's most prestigious AI conferences. **The result: at least 100 confirmed hallucinations in 53 papers.**

These aren't papers that got rejected. These are **accepted papers** that beat out 15,000+ other submissions (24.52% acceptance rate), survived 3+ rounds of peer review, and were presented live at the conference. And they all contain hallucinated citations—fake authors like "John Doe and Jane Smith," fabricated arXiv IDs that link to different papers, and references to articles that don't exist.

The HN discussion (495 points, 264 comments in 5 hours) is split between outrage at the authors and despair at the reviewers. But there's a deeper pattern here that applies directly to Voice AI for demos: **reading ground truth prevents hallucinations; generating from memory creates them.**

NeurIPS papers hallucinated citations because LLMs generated text from their training data instead of reading actual bibliographies. Voice AI for demos works because it reads the DOM structure instead of generating navigation instructions. Both succeed by reading what exists. Both fail when they try to generate what should exist.

## The NeurIPS Hallucination Tsunami: When Peer Review Meets AI Slop

Here's what GPTZero found in the 4,841 accepted NeurIPS 2025 papers—**100+ confirmed hallucinated citations across 53 papers:**

- Fabricated authors: "John Doe and Jane Smith" appearing in multiple papers
- Fake arXiv IDs: "arXiv:2401.00001" linking to completely different papers
- Non-existent papers: titles that sound plausible but don't exist
- Invented DOIs: URLs that return 404s or link to unrelated articles

**Example from "SimWorld: An Open-ended Simulator for Agents":**

> "John Doe and Jane Smith. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.00001, 2024."

The title exists ([actual paper here](https://aclanthology.org/2024.acl-long.371/)), but the authors are fake and the arXiv ID links to a [different article](https://arxiv.org/abs/2401.00001).

**Example from "Unmasking Puppeteers: Leveraging Biometric Leakage":**

> "John Smith and Jane Doe. Deep learning techniques for avatar-based interaction in virtual environments. IEEE Transactions on Neural Networks and Learning Systems, 32(12):5600-5612, 2021. doi: 10.1109/TNNLS.2021.3071234."

No author match. No title match. The paper doesn't exist in that publication, and the URL and DOI are completely fabricated.

These papers passed peer review at **the most prestigious AI conference in the world.** Each had 3+ reviewers. Each beat a 75% rejection rate. And all of them contain citations that were generated, not read.

## Why This Happened: The Submission Tsunami Overwhelmed Human Review

NeurIPS submissions grew **128% between 2020 and 2025:**

- 2020: 9,467 submissions
- 2025: 21,575 submissions

That's 12,108 additional papers to review in five years. Conference organizers had to recruit thousands of new reviewers, resulting in:

- Oversight gaps (too many papers per reviewer)
- Expertise misalignment (reviewers outside their domain)
- Negligence (reviewers skimming instead of reading)
- Fraud (collusion rings gaming the system)

GPTZero notes: "Our purpose in publishing these results is to illuminate a critical vulnerability in the peer review pipeline, not criticize the specific organizers, area chairs, or reviewers who participated in NeurIPS 2025."

But the vulnerability is structural. Peer review was designed for human-written papers at human submission rates. Generative AI creates a 10x volume increase with 0.1x the verification effort. **The result: a system trying to defend against challenges it was never designed for.**

Sound familiar? That's exactly the problem Voice AI for demos solves.
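Fabricated arXiv IDs of the kind shown above are mechanically checkable: arXiv's public export API returns the real metadata for any ID, so a claimed title can be compared against what the ID actually resolves to. Here is a minimal sketch of that check; the `titles_match` and `arxiv_title` helpers are illustrative, not part of GPTZero's actual tooling.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

def titles_match(claimed: str, fetched: str) -> bool:
    """True if two citation titles agree, ignoring case,
    punctuation, and whitespace differences."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return norm(claimed) == norm(fetched)

def arxiv_title(arxiv_id: str) -> str:
    """Fetch the real title for an arXiv ID from the public export API
    (Atom feed at export.arxiv.org)."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        feed = ET.fromstring(resp.read())
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    entry = feed.find("atom:entry", ns)
    if entry is None:
        return ""  # ID does not resolve at all
    return entry.findtext("atom:title", default="", namespaces=ns)
```

For the SimWorld citation above, `titles_match(claimed_title, arxiv_title("2401.00001"))` would come back `False`, flagging the reference for a human to fix, because the ID resolves to a different paper than the one the bibliography claims.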
## The Parallel to Voice AI: Read the DOM, Don't Generate Navigation

Voice AI for demos reads the DOM structure instead of generating navigation instructions. This prevents the same hallucination failure mode that plagued the NeurIPS papers.

**What NeurIPS authors did:**

1. Use an LLM to write the related work section
2. The LLM generates plausible-sounding citations from its training data
3. Authors paste the citations without verifying they exist
4. Result: "John Doe and Jane Smith" appear in published papers

**What DOM-naive AI demos do:**

1. User asks "Click the login button"
2. The LLM generates a plausible-sounding selector from its training data
3. The demo tries `.login-button` or `#userLogin` without reading the page
4. Result: navigation fails or clicks the wrong element

**What DOM-reading Voice AI does:**

1. User asks "Click the login button"
2. Voice AI reads the accessibility tree to find actual login elements
3. Voice AI returns ground truth: `button[aria-label="Sign in"]` at coordinates (123, 456)
4. Result: navigation succeeds because it read what exists

The architecture is identical to what would have prevented the NeurIPS hallucinations.

**If NeurIPS authors had read their bibliographies:**

1. LLM suggests a citation: "John Doe and Jane Smith. Webvoyager..."
2. Author checks arXiv, Google Scholar, and publication databases
3. Author discovers: the arXiv ID points to the wrong paper, and the authors are fake
4. Author replaces it with an actual citation or removes the reference

**If Voice AI generated navigation without reading the DOM:**

1. LLM suggests a selector: `.login-button`
2. Demo tries to click without verifying the element exists
3. Demo discovers: the element doesn't exist, or worse, it clicks the wrong button
4. Demo fails or executes the wrong action

Reading prevents both failures. Generating creates both.

## The Three Types of Hallucinations in Both Contexts

GPTZero's NeurIPS analysis reveals three categories of hallucinated citations that map directly to Voice AI failure modes:

### 1. Plausible But Nonexistent (Generated from Training Data)

**NeurIPS example:**

> "Deep learning techniques for avatar-based interaction in virtual environments."

The title sounds reasonable. The topic fits the paper. But it doesn't exist—the LLM generated it from patterns in its training data about "deep learning" + "virtual environments" + "avatar interaction."

**Voice AI equivalent:** Generating `.submit-button` because most forms have a submit button, without reading the markup this specific form actually uses.
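The difference between guessing a selector and reading one can be made concrete with a toy model: represent a snapshot of the page's accessibility tree as nested dicts and resolve requests against it, so a lookup can only ever return elements that actually exist. The tree shape and the `find_by_role` helper below are illustrative assumptions, not a real browser or Voice AI API.

```python
def find_by_role(node: dict, role: str, name: str):
    """Depth-first search of a toy accessibility tree for an element
    whose role matches and whose accessible name contains `name`."""
    if node.get("role") == role and name.lower() in node.get("name", "").lower():
        return node
    for child in node.get("children", []):
        hit = find_by_role(child, role, name)
        if hit is not None:
            return hit
    return None  # nothing on the page matches: no element, no click

# Snapshot of what the page actually contains: the login control is a
# button labeled "Sign in". There is no `.login-button` anywhere.
page = {
    "role": "document", "name": "Acme App", "children": [
        {"role": "textbox", "name": "Email", "children": []},
        {"role": "button", "name": "Sign in",
         "selector": 'button[aria-label="Sign in"]', "children": []},
    ],
}

target = find_by_role(page, "button", "sign in")
print(target["selector"])  # button[aria-label="Sign in"]
```

A generated guess like `.submit-button` has no counterpart in this tree to click; the read-based lookup can only return what the tree contains. That is the same property a bibliography lookup has and a from-memory citation lacks.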