mantus.ai

AI CONFIDENCE, SERVED FRESH DAILY

Google’s Scale Problem, Your Quality Problem

Demis Hassabis describing Gemini Flash is not really describing a small model. He is describing a plumbing problem.

Google has AI running through Search, AI Overviews, AI Mode, Maps, YouTube, the Gemini app, and more than a dozen products with user counts most companies never see. Every answer costs money. Every delay is visible. Every extra second is multiplied across products used by billions.

So when Hassabis says these systems have to be served “extremely fast, extremely efficiently and cheaply and with low latency,” the key word is not fast. The key word is served.

A model in Google’s hands is not just an intelligence engine. It is something that has to fit inside a global system without blocking the flow.

Distillation is the attempt to carry behavior from a larger, stronger model into a smaller one that is cheaper and faster to run. Hassabis describes that compression into Flash and Flash Lite models as one of Google’s strengths.

The interviewer frames the trade-off as something like 95% of the capability at one-tenth the price. Hassabis does not confirm those exact numbers, but he accepts the shape of the argument: some capability can be traded for lower cost, lower latency, and faster iteration.

For Google, that trade can make sense even when the faster model gives up some capability on certain tasks. If the system is good enough to serve across huge products, the gains compound. A slightly weaker model that can be put everywhere may create more product value than a stronger model that is too slow or too expensive to push through the pipes.

That is good engineering. It is not proof that the same model is right for your hardest workflow.

The problem with “95% as capable” is that it is an aggregate claim. It does not tell you where the missing capability appears. It may disappear in places you do not care about. It may show up in a corner of the task that matters a lot. The number alone does not answer that question.
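To make the aggregate point concrete, here is a small sketch with invented numbers. The slice names and counts are assumptions chosen only for illustration, not measurements of any Flash model. It shows how a 95% overall score can sit on top of a much weaker score in one slice.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """Return accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}

# Hypothetical evaluation outcome: 950 routine cases, 50 rare ones.
# These counts are made up to illustrate the arithmetic.
results = (
    [("routine", True)] * 940
    + [("routine", False)] * 10
    + [("rare", True)] * 10
    + [("rare", False)] * 40
)

overall = sum(ok for _, ok in results) / len(results)
per_slice = accuracy_by_slice(results)

print(f"overall: {overall:.1%}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.1%}")
```

With these invented counts the overall figure is 95%, while the rare slice is right only one time in five. Whether your own workload looks like this is exactly what the aggregate number cannot tell you.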

A support tool is a simple example. A Flash model might summarize ordinary tickets well, route most requests correctly, and keep the queue moving. That is useful. But if your business depends on catching the unusual ticket where the customer describes the real failure in one vague sentence, the average score is not the thing to trust. The question is whether the model catches that case often enough.

This is not about speed causing hallucinations. That is too simple, and the Hassabis conversation does not establish it. The supported point is narrower: Google is balancing capability against cost, latency, and scale. Flash is shaped by that balance.

A builder has to ask a different question. Not “is this model smart enough in general?” Not “is it fast?” Not “did the demo look clean?” The question is where the model fails under the workload that matters.

That means testing the cases where being wrong costs something. Use the real edge cases. Use the messy inputs. Use the examples that caused refunds, support escalations, or angry Slack threads last quarter. Compare the smaller model against the stronger one on those cases, not on the easy traffic where everything looks fine.
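A comparison like that can be sketched in a few lines. Everything here is a stand-in: the two "models" are toy keyword functions, the tickets and labels are invented, and the function names are hypothetical. In practice you would swap in calls to whichever models you are evaluating and load the cases from your own incident history.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (ticket text, expected label)

def compare_on_hard_cases(
    small: Callable[[str], str],
    strong: Callable[[str], str],
    cases: List[Case],
) -> Dict[str, object]:
    """Score both models on the same cases and list where only the small one fails."""
    small_ok = 0
    strong_ok = 0
    regressions = []
    for text, expected in cases:
        small_right = small(text) == expected
        strong_right = strong(text) == expected
        small_ok += int(small_right)
        strong_ok += int(strong_right)
        if strong_right and not small_right:
            regressions.append(text)
    n = len(cases)
    return {
        "small_accuracy": small_ok / n,
        "strong_accuracy": strong_ok / n,
        "regressions": regressions,
    }

# Toy stand-ins for real model calls (assumptions, not real APIs).
def toy_small_model(text: str) -> str:
    return "outage" if "down" in text.lower() else "general"

def toy_strong_model(text: str) -> str:
    lowered = text.lower()
    signals = ("down", "nothing loads", "stopped working")
    return "outage" if any(s in lowered for s in signals) else "general"

# Invented examples standing in for last quarter's painful tickets.
hard_cases = [
    ("The site is down for our whole team.", "outage"),
    ("Since this morning nothing loads after login.", "outage"),
    ("Exports stopped working after the update.", "outage"),
    ("How do I change my billing email?", "general"),
]

report = compare_on_hard_cases(toy_small_model, toy_strong_model, hard_cases)
print(report)
```

The regressions list is the useful part. It names the specific cases where the stronger model succeeds and the smaller one does not, which is more actionable than either accuracy figure on its own.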

If it passes there, use it. Fast and cheap are not minor virtues. But do not borrow Google’s optimization target without checking whether it matches yours.

The model that solves Google’s scale problem may not solve your quality problem.