Google's AI Stack Has an Operations Problem

The week of Google I/O 2026, I kept thinking about something Theo Browne (t3.gg) said on stream. He runs T3 Chat and has been one of the more reliable voices on developer tooling for the past few years. His take on the whole week was something like: "the model is great, the everything-else is not."

That's the actual story of this launch. Gemini 3.5 Flash is genuinely fast. The benchmark numbers are real. Google's research org is still producing frontier work. None of that is in dispute.

What's in dispute is whether any of it matters if you can't build reliably on top of it.

I've been integrating with Gemini's API for a few months now. The quality-per-dollar story is legitimately interesting. But every time something breaks, there's this specific dread you feel with Google products that you don't feel with AWS or Stripe. It's the dread of wondering whether the thing you're depending on will still exist in six months.

That fear got a lot louder this week.


The first thing that annoyed people was the pricing presentation. Google pushed hard on throughput numbers during the launch. What they underemphasized was what those throughput numbers actually cost you in practice, specifically the reasoning token overhead.

This is becoming a real problem across the whole space. A model that runs at 300 tokens/sec sounds fast. But if it's burning 4x the tokens to reach an answer because it reasons through everything regardless of whether it needs to, your actual cost-per-useful-output can be worse than whatever you were running before, and your effective throughput is a quarter of the headline number. Benchmarks reward raw scores. Production systems reward efficiency.
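To make that arithmetic concrete, here's a rough sketch. Every number in it is made up for illustration (these are not real prices or measurements for any model), and it assumes reasoning tokens are billed at the same rate as output tokens, which you should check against your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical profile; all figures are illustrative, not vendor data."""
    name: str
    price_per_million_tokens: float  # USD per 1M billed output tokens (assumed)
    tokens_per_second: float         # headline generation speed (assumed)
    reasoning_multiplier: float      # billed tokens / tokens you actually use


def cost_per_useful_million(m: ModelProfile) -> float:
    # Assumption: reasoning tokens are billed like ordinary output tokens.
    return m.price_per_million_tokens * m.reasoning_multiplier


def useful_tokens_per_second(m: ModelProfile) -> float:
    # Headline speed divided by the share of tokens spent on reasoning overhead.
    return m.tokens_per_second / m.reasoning_multiplier


fast_but_chatty = ModelProfile("fast-reasoner", 2.0, 300.0, 4.0)
slow_but_lean = ModelProfile("lean-baseline", 5.0, 120.0, 1.0)

for m in (fast_but_chatty, slow_but_lean):
    print(
        f"{m.name}: ${cost_per_useful_million(m):.2f} per useful 1M tokens, "
        f"{useful_tokens_per_second(m):.0f} useful tokens/sec"
    )
```

With these invented inputs, the "cheaper, faster" model ends up costing more per useful token and delivering fewer useful tokens per second than the baseline. The point is not the specific numbers; it's that you need the multiplier from your own workload before the headline figures mean anything.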

OpenAI has been quietly winning this fight. Not because their models are more capable, but because they've been tightening the gap between what the model spends and what it produces. When you're at scale, that gap is everything.


Then there was the Anti-Gravity situation.

Google had a CLI ecosystem built around Gemini that was, by developer consensus, actually good. It was open, well-documented, and had started accumulating real adoption. The kind of low-key traction that suggests developers are actually using a thing, not just evaluating it.

They moved away from it toward Anti-Gravity, a more closed environment that currently doesn't work as well as what it replaced. Theo covered this in detail. The reaction wasn't just "this new thing is worse." It was "of course Google did this."

That's the trust problem in miniature. When you ship a regression wrapped in a migration, developers don't just get frustrated at the regression. They update their model of you as a vendor. They start factoring in the probability that the next tool they like will also get replaced with something worse in twelve months.


And then Railway went down.

Timing is everything in narrative, and the timing here was brutal. Mid-I/O week, Railway infrastructure started reporting major outages tied to GCP. This wasn't Google's fault in any direct sense. But it became part of the story anyway, because the story was already about operational trust.

Here's the thing about being infrastructure: you don't get partial credit. Nobody is evaluating AWS on "great ideas, execution needs work." The game is uptime, predictability, pricing stability, and not surprising your users with breaking changes. Google is trying to compete on that axis now, and it isn't yet meeting the standards it's being judged by.


There's a version of this post where I just run through the receipts: the outages, the deprecated tools, the pricing confusion, the internal politics that keep surfacing in places like Theo's channel and on HN. But I think the more interesting observation is structural.

Google's actual technical resources are enormous. They have TPUs nobody else has. They have research talent. They have Gemini running at scale in products used by more people than almost any AI company can reach. If you told me a model from Google would top the reasoning benchmarks, I'd believe you.

The harder problem is that developer ecosystems are built on compounding trust. AWS didn't win because it launched the best S3 in 2006. It won because developers who adopted it in 2006 didn't get surprised, repeatedly, for fifteen years. Stripe has worse documentation than half the competitors that tried to eat its lunch, but nobody worries that the API is going to change out from under them next quarter.

Google keeps relaunching. The products are often good at launch. The trust deficit is in what happens after launch.


Gemini 3.5 Flash probably is Google's strongest reasoning model. If you ask me whether I'd use it in a new project, I'd say yes with some hedging. But the hedging matters. It's the kind of "yes, but..." that means you build an abstraction layer first. You don't hardcode the integration. You keep one eye on the alternatives.
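For what "build an abstraction layer first" can look like, here's a minimal sketch. Everything in it is hypothetical: `TextModel`, `EchoModel`, `FailingModel`, and `FallbackModel` are names I made up, and the stand-in classes don't call any real SDK. In practice each adapter would wrap whichever vendor client you actually use.

```python
from typing import Protocol, Sequence


class TextModel(Protocol):
    """Hypothetical interface; application code depends only on this."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in adapter. A real one would wrap a vendor SDK call."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


class FailingModel:
    """Stand-in adapter that simulates a provider outage."""

    def complete(self, prompt: str) -> str:
        raise ConnectionError("provider unavailable")


class FallbackModel:
    """Tries each provider in order and returns the first success."""

    def __init__(self, providers: Sequence[TextModel]) -> None:
        self.providers = list(providers)

    def complete(self, prompt: str) -> str:
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(prompt)
            except Exception as exc:  # broad on purpose for this sketch
                errors.append(exc)
        raise RuntimeError(f"all providers failed: {errors}")


def summarize(model: TextModel, text: str) -> str:
    # Application code never names a vendor, only the interface.
    return model.complete(f"Summarize: {text}")


model = FallbackModel([FailingModel(), EchoModel("backup")])
print(summarize(model, "release notes"))
```

Swapping vendors then means writing one new adapter and changing one line of wiring, instead of touching every call site. A real version would also need to handle differences in streaming, tool calling, and token accounting, which this sketch ignores.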

That's not the relationship you want developers to have with your infrastructure.

The benchmark game and the workflow game are different games. Google is still playing the benchmark game very well. The workflow game has a different scoreboard, and this week was not a good week on it.
