AI Giants Pt. 2: How Google Fixed Gemini's Blurry Vision
A look into the model's blurry past and a glimpse into Gemini 3.0's promising future.
This is Part 2 of our AI Giants series, where we examine the successes and shortcomings of today’s largest AI firms. In Part 1, we dove into Claude’s recent reliability crisis and the pitfalls that come with relying on cloud-hosted AI systems.
In October 2025, Google achieved something remarkable, posting its first-ever $100 billion revenue quarter. With cloud services accelerating at 34% year-over-year growth, AI products drawing in more users and revenue, and the early success of its newly released Gemini 3.0, Google seems primed to capture a large share of the AI market. Yet before 3.0’s release, Google’s top model, Gemini 2.5, ranked a distant eighth on SWE-bench Verified, the industry’s premier software engineering benchmark, lagging behind (former) first place Claude Sonnet 4.5 by 17 percentage points. Google had built an AI juggernaut that excelled at analyzing six-hour videos and processing two-million-token context windows, but developers would increasingly turn to Claude when mission-critical code was on the line.
This was Gemini’s blurry vision problem: extraordinary capabilities in some dimensions, concerning gaps in others, all while the parent company printed money. Gemini 3.0 aims to fill these gaps, with early reviews showing real progress and chart-topping benchmark scores. Google now has the benchmark supremacy to attract more users while controlling the world’s largest digital ecosystem. If Gemini 3.0 continues to impress, it may trigger a massive shift in the AI ecosystem.
The AI Born from DeepMind’s Merger
To understand how Google reached this point, we must first explore the company’s history of AI development, diving into the advantages and limitations of Gemini that landed it in the market position it is now seeking to break out of.
Google’s path to Gemini began in 2023. By merging Google Brain and DeepMind into a single organization (Google DeepMind), they ended years of internal AI competition. DeepMind brought the pedigree of AlphaGo and AlphaFold, breakthrough systems that respectively defeated world champions in the game of Go and solved protein folding, while Google Brain contributed TensorFlow and the Transformer architecture that powers modern AI. The merger aimed to focus Google’s resources on competing with OpenAI and Anthropic rather than competing with itself.
Gemini 1.0 launched in December 2023 with bold claims about multimodal superiority. Google marketed it as “built from the ground up for multimodality,” trained simultaneously on text, images, audio, and video rather than stitching separate models together. The technical approach differed fundamentally from competitors who added vision capabilities to text-first models. Gemini 2.5 Pro arrived in March 2025, topping the LMArena leaderboard with the largest debut score jump in benchmark history. The family came in three sizes: Ultra for complex tasks, Pro for versatility, and Flash for speed and cost efficiency.
By early November 2025, Gemini reported 650 million monthly active users and powered 1.5 billion AI Overview interactions in Google Search. The company processed 7 billion API tokens per minute and counted 1.5 million developers who had tried the models. Google embedded Gemini across Search, Android, YouTube, Gmail, Maps, and Workspace, pursuing an “ambient AI” strategy rather than standalone chatbot dominance.
Where Gemini Has Thrived
Google’s models have achieved genuine technical differentiation in two domains: multimodal capabilities and context window size. Multimodal architecture is of particular importance, as Google pre-trained Gemini on text, images, audio, video, and code simultaneously.
Gemini 2.5 Flash Image generates images natively within conversations, edits uploaded photos conversationally, and maintains character consistency across generation rounds. DALL-E 3 requires separate workflows; Midjourney operates through Discord; Stable Diffusion demands technical setup. Only Gemini combines generation and analysis in natural conversation.
Video capabilities create the starkest competitive gap. Gemini 2.5 accepts video uploads of up to five minutes, processes YouTube URLs directly via API, and analyzes up to six hours of content through its two-million-token context window. The model achieves 81.3% accuracy (with subtitles) on VideoMME’s overall benchmarks, demonstrating temporal reasoning like counting discrete events across timelines. GPT-4o offers real-time camera interaction but cannot process uploaded video files. Claude doesn’t support video at all.
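To make the workflow concrete, here is a minimal sketch of passing a YouTube URL to Gemini through the google-genai Python SDK. It assumes the SDK is installed and an API key is configured; the video URL and prompt are placeholders, not examples from Google’s documentation.

```python
# Minimal sketch: asking Gemini to analyze a YouTube video by URL.
# Assumes the google-genai SDK and a GEMINI_API_KEY environment variable;
# the URL and prompt below are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(parts=[
        types.Part(file_data=types.FileData(file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Summarize this video and count how many distinct product demos appear."),
    ]),
)
print(response.text)
```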
Veo 3, Google’s text-to-video model integrated into Gemini Advanced, achieved synchronized AI-generated audio alongside video in May 2025. The model generates 8-second clips at 1080p in under two minutes. Comparisons with OpenAI’s Sora 2 show mixed results: Sora produces longer clips with smoother motion, while Veo offers excellent cinematic coherence and faster generation. Gemini remains the only major LLM platform offering both video understanding and video generation in a unified interface.
Context window supremacy provides a quantitative advantage. Gemini 2.5 Pro accepts 1,048,576 tokens as standard input with experimental support extending to two million tokens (the equivalent of 1,500 pages of text or 30,000 lines of code), while performance benchmarks show near-perfect retrieval accuracy even at maximum context length. In comparison, GPT-4 Turbo handles 128,000 tokens and Claude supports 200,000. Gemini’s context window isn’t merely larger; it enables qualitatively different applications, allowing legal teams to load entire depositions, developers to process complete codebases without chunking, and researchers to synthesize massive document collections.
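The whole-codebase pattern is simple in practice. Below is a minimal sketch, again assuming the google-genai Python SDK; the repository path, model name, and question are hypothetical placeholders.

```python
# Minimal sketch: concatenating a repository into one long-context prompt.
# Assumes the google-genai SDK and GEMINI_API_KEY; the path and question
# are illustrative placeholders.
from pathlib import Path
from google import genai

client = genai.Client()

# Gather every Python file in the repo into a single prompt string.
repo = Path("./my_project")
source = "\n\n".join(
    f"# file: {path}\n{path.read_text(encoding='utf-8', errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=source + "\n\nQuestion: where is request authentication handled, and are there gaps?",
)
print(response.text)
```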
Fourth Place Where Developers Care Most
Gemini 2.5 was not without its faults, however. The benchmark reality check arrived throughout 2025 as competitors raised the bar. On SWE-bench Verified, which tests AI on authentic GitHub issues requiring multi-file edits and systems thinking, Gemini 2.5 Pro achieved a 53.6% accuracy score, while Claude Sonnet 4.5 reached 70.6%, a gap representing hundreds of failed pull requests in production environments. Terminal-Bench showed Claude 4.5 scoring 42.8% on command-line automation while Gemini 2.5 managed just 32.6%.
However, Gemini dominated elsewhere. Humanity’s Last Exam showed 21.6% for Gemini 2.5 versus Claude’s 13.7%, establishing it as a leading model for graduate-level scientific reasoning and mathematics competitions. Google had built the world’s best AI for expert knowledge while failing to break the top three in practical engineering tasks.
The competitive landscape intensified throughout 2025. Claude Opus 4.1 arrived August 5, and Claude Sonnet 4.5 launched September 29, both demonstrating significant coding capability improvements. Google’s response was lackluster, pushing out incremental Flash model updates rather than fundamental leaps. Developers consistently reported that Claude delivered fewer bugs, more consistent architecture, and better test scaffolding in head-to-head comparisons.
Powerful But Unreliable
Developer sentiment oscillated between impressed and frustrated. The most consistent praise centered on capabilities competitors could not match. Developers routinely loaded 30,000+ lines of code into Gemini’s million-token context window for whole-codebase analysis, generated complete presentations with images in single prompts, and analyzed multi-hour videos without preprocessing.
But reliability concerns dominated criticism. The most frequent complaint was responses that terminated mid-sentence, not from token limits, but from apparent completion signaling bugs. One experienced developer articulated the frustration precisely: “Reliability matters more than peak performance. I’d rather work with a model that consistently delivers complete responses than one that gives me half-thoughts I have to constantly prompt to continue.”
An October 2025 study published by Deutsche Welle and 21 other international public broadcasters compared the ability of Gemini 2.5, ChatGPT, Perplexity, and Copilot to accurately represent the news. The study found Gemini 2.5 performed the worst, with 72% of its responses having “significant sourcing issues.” The model hallucinated frequently enough that multiple reviews reported the Deep Research feature sometimes fabricating references, and users reported incorrect calculations despite strong performance on formal mathematics benchmarks.
The September 2025 introduction of aggressive usage limits triggered community backlash. Free users received just five Gemini 2.5 Pro prompts daily. AI Pro subscribers ($19.99/month) faced 100-prompt daily caps. Developers accustomed to extensive coding sessions hit limits rapidly, perceiving this change as calculated monetization pressure precisely when Google needed developer mindshare.
Tom’s Guide crystallized the competitive positioning after testing Claude 4.5 against Gemini 2.5: “Claude consistently excelled when the task required precision, structure, or atmospheric storytelling, while Gemini shined in situations that called for creativity, playfulness, or practical developer workflows.” Essentially, mission-critical work should go to Claude; everything else could go to Gemini.
The 3.0 Upgrade
Despite these past issues, Gemini 3.0 might trigger a change of opinion. As developers have experimented with the new model over the last 48 hours, many have reported massive improvements in its agentic capabilities and usefulness as a coding tool. A report from Google DeepMind shows Gemini 3.0 dominates the benchmarks that once held it back.
The report pitted Gemini 3.0 against Gemini 2.5, Claude Sonnet 4.5, and GPT-5.1, with Gemini 3.0 beating the other models on 19 of the 20 industry benchmarks tested. Some key comparisons are outlined below.
Humanity’s Last Exam (General Test of 2,500 Questions)
Gemini 3.0 Pro - 37.5%
GPT-5.1 - 26.5%
Claude Sonnet 4.5 - 13.7%
Terminal-Bench (Agentic Terminal Coding)
Gemini 3.0 Pro - 54.2%
GPT-5.1 - 47.6%
Claude Sonnet 4.5 - 42.8%
LiveCodeBench Pro (Competitive Coding Problems)
Gemini 3.0 Pro - 2,439 Elo
GPT-5.1 - 2,243 Elo
Claude Sonnet 4.5 - 1,418 Elo
SimpleQA Verified (Simple Question Fact Checking)
Gemini 3.0 Pro - 72.1%
GPT-5.1 - 34.9%
Claude Sonnet 4.5 - 29.3%
According to Google, the only test in which Gemini 3.0 was beaten was SWE-bench Verified, losing to Sonnet 4.5 and GPT-5.1 by 1% and 0.1% respectively. Yet according to SWE-bench’s own website, Gemini 3.0 ranks first with a 74.2% rating compared to second place Sonnet 4.5’s 70.6%. Seeing as we are still in the early stages of Gemini 3.0’s release, discrepancies are to be expected, and the numbers are bound to change, but all signs point to Gemini matching or surpassing the competition’s coding capabilities. A shift in AI market share appears imminent.
Third Place in the Market… For Now
Gemini holds 13.5% market share in AI chatbots as of early November 2025, ranking third behind ChatGPT’s commanding 61% and Microsoft Copilot’s 14%. That figure understates actual reach, as Google embeds Gemini across properties with billions of users. Monthly active users grew to 650 million, while daily active users passed 40 million.
Enterprise adoption shows the strongest traction. 46% of U.S. enterprises deploy Gemini in productivity workflows, double 2024’s level, and Fortune 500 penetration reached 41%, with Gemini in use in at least one department. Through the first half of 2025 alone, Google hosted 27 million enterprise users processing 2.3 billion document interactions through Workspace, with 92% of accounts including Gemini features. The value proposition initially centered on seamless integration; the raw capabilities are now arriving.
The developer ecosystem remains Google’s struggle point, but as more users experiment with Gemini 3.0, that will likely change. Google counts 420,000 active API users, up 61% annually, with 310 million daily API requests. Developer adoption skews toward startups attracted by cost advantages: Gemini 1.5 Flash costs $0.07 per million tokens compared to GPT-4’s $3 per million, prioritizing price-performance and multimodal breadth over specialized excellence. Gemini 3.0 cannot deliver the same price advantage as 1.5 Flash, but its improved capabilities should attract developers looking to upgrade their workflows. Time will tell whether Google captures more of this market, but the continued praise for Gemini 3.0 is a strong indicator it will.
Clearing Gemini’s Blurry Vision
Gemini occupies an intriguing position in late 2025. The model leads in benchmarks measuring expert knowledge, multimodal reasoning, and mathematics. It offers the largest context windows commercially available, processes video better than competitors, and integrates across Google’s ecosystem. And it now appears to have finally found the key to success in software engineering, with Gemini 3.0 leading developer benchmarks, providing quality terminal automation, and freeing itself of the accuracy and reliability issues that long frustrated developers.
Gemini 3.0’s release revealed Google’s strategy. They first optimized Gemini for ecosystem integration and multimodal breadth over specialized coding excellence. They focused on building a model that excelled at tasks that leverage Google’s unique advantages, like analyzing YouTube videos, understanding context across Workspace, and processing massive documents. Now, Gemini can seemingly handle tasks requiring sustained logical reasoning through complex codebases, the precise area where competitors such as Claude have dominated.
Google’s financial strength and ecosystem control allowed strategic patience competitors could not afford. Alphabet prints money while investing billions in AI infrastructure. The company didn’t need immediate benchmark supremacy because it monetizes AI through subscriptions, advertising, and enterprise cloud services. OpenAI and Anthropic must justify their lofty valuations through direct AI product revenue.
This strategy has set Google up for success. Competitors can no longer offer capabilities that Gemini cannot match, while Workspace integration, cost efficiency, and multimodal capabilities make its array of models perfect for enterprise customers. Real demand existed even before Gemini 3.0’s release, exemplified by expansive enterprise deployments, such as KPMG reaching 90% employee adoption within two weeks, and the November 2025 Apple deal valuing Gemini at $1 billion annually.
All that remains to be seen is whether the optimism surrounding Gemini 3.0’s coding abilities continues as more developers test it for themselves. Developer sentiment matters because individual choices compound into enterprise standards. Google controls distribution through Workspace and Android but doesn’t yet control developer mindshare. Initial reactions, however, suggest it may win that mindshare very soon.
The next six months will show how well Google’s bet pays off. The coding benchmark gaps appear to have been closed while preserving multimodal advantages. Enterprise customers deploying Gemini Enterprise must demonstrate productivity gains justifying premium pricing. The Apple partnership launching Spring 2026 must succeed. And developer sentiment must continue to shift in Gemini’s favor.
Google enters this position with financial strength, record revenues, and industry-leading cloud growth. The parent company thrives while the AI product gains ground on the competition. Other large cloud-based models remain prosperous, but Google seems to have found its corrective lenses. Gemini’s vision is now clear.
This article was written by Max Kozhevnikov, Data and Software Engineer at Frontier Foundry. Visit his LinkedIn here.
To stay up to date with our work, visit our website, or follow us on LinkedIn, X, and Bluesky. To learn more about the services we offer, please visit our product page.



