Gemini 2.5 Ultra Quietly Beats GPT-5 on Long-Document Reasoning

Google's latest update to Gemini 2.5 Ultra posts long-context reasoning scores that beat GPT-5 on standard benchmarks. The enterprise implications are significant.

By James Whitfield · June 3, 2026 · 6 min read

Google updated Gemini 2.5 Ultra this week with what the company described as “improved long-context reasoning” — language deliberately vague enough to conceal what the benchmarks actually show.

On RULER, a synthetic long-context benchmark that goes beyond simple needle-in-a-haystack retrieval to include multi-hop tracing, aggregation, and question-answering tasks, Gemini 2.5 Ultra scores 94.1 at the 128k token context length. GPT-5 scores 89.3 at the same length. Claude 3.7 Sonnet scores 91.8.

These are not marginal differences: Gemini 2.5 Ultra leads GPT-5 by 4.8 points and Claude 3.7 Sonnet by 2.3. At 128,000 tokens (approximately 100,000 words, or a substantial novel), maintaining coherent reasoning about information introduced early in the context becomes genuinely difficult. On this benchmark, Gemini 2.5 Ultra is doing it better than either of the rivals cited here.

The practical implications are significant for specific use cases: legal document analysis, scientific literature review, codebase-level debugging, and financial document processing all involve contexts long enough that this capability gap matters. For a standard conversational use case, it doesn’t. For the enterprise customers Google is targeting, it increasingly does.

// Author

James Whitfield

James has been taking apart computers since he was nine. He covers the silicon that makes everything else possible, from fab geopolitics to the GPUs sitting in your rig. Based in London.
