Daniel Kosbab · 3 min read

The case for small models

Most production AI does not need a frontier model.

That statement is controversial in 2025, but it is mostly just true. The default reflex to reach for GPT-5 or Claude 4 for every task is an industry-wide engineering error, driven partly by hype and partly by the convenience of API-first development. For a large fraction of the problems people are solving with AI right now, a small model does the job faster, cheaper, more predictably, and on the user's hardware.

This is not a prediction. It is already how shipped software should be built.

What I mean by small

1B to 20B parameters, either open-weight and self-hosted or a specialized API offering. Something you can run on a consumer GPU or a modest cloud instance. Not "small" in a research sense. Small relative to the frontier.

At this size, a lot of real work gets done:

  • Classification. Is this review positive? Does this message need escalation? Is this image a receipt?
  • Structured extraction. Pull the address, date, and total from this invoice.
  • Embedding generation. You don't want a 200B-parameter model for this. Dedicated embedding models are small for a reason.
  • Routing and intent detection. First-pass decisions about which downstream system to call.
  • Simple summarization. Not "write me an essay in my voice." Just "three bullets of this email."
  • Translation between structured formats. Convert this JSON to that schema, deterministically.
  • Most agentic sub-steps. Individual tool calls inside an agent loop are usually simple. The agent only needs a frontier model for the hard parts, if it needs one at all.
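The format-translation case is the clearest illustration: once the schema mapping is fixed, the conversion is plain deterministic code, with a model needed at most to fill fuzzy fields. A minimal sketch, where the vendor field names and internal schema are invented for illustration:

```python
import json

# Hypothetical mapping from a vendor invoice payload to an internal schema.
# Field names on both sides are illustrative, not from any real system.
FIELD_MAP = {
    "billing_address": "address",
    "invoice_date": "date",
    "amount_due": "total",
}

def translate(vendor_json: str) -> str:
    """Deterministically convert a vendor invoice JSON to the internal schema."""
    record = json.loads(vendor_json)
    out = {internal: record[vendor] for vendor, internal in FIELD_MAP.items()}
    return json.dumps(out, sort_keys=True)

print(translate(
    '{"billing_address": "1 Main St", "invoice_date": "2025-01-02", "amount_due": 42.5}'
))
# {"address": "1 Main St", "date": "2025-01-02", "total": 42.5}
```

Nothing in that function needs 200B parameters; the model's only job upstream is extracting the fields in the first place, which is exactly the kind of task a small specialist handles.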

None of these tasks benefit from a 200B-parameter generalist. A 7B specialist wins.

Why defaulting to frontier is a mistake

Four structural reasons, in order of impact.

  1. Latency. A local 7B model returns in 100 ms. A frontier API round-trip is 1 to 4 seconds. That is one to two orders of magnitude. For real-time UX, 100 ms is the difference between usable and unusable.
  2. Cost. At production volume, frontier API costs compound. A small self-hosted model has a fixed GPU bill and handles millions of requests for roughly what ten thousand frontier calls cost.
  3. Determinism and testability. A pinned version of an open-weight model is just a file on disk. It doesn't change underneath you when the provider ships an update. You can write regression tests that stay meaningful across months. Frontier APIs cannot promise this, and mostly don't try.
  4. Privacy and offline. A local model never sends user data anywhere. For anything involving medical, legal, or personal content, that alone forces the choice.
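The cost claim is easy to sanity-check with back-of-the-envelope numbers. The prices below are assumptions for illustration, not quotes: say a blended $0.15 per frontier call and a $1,500/month dedicated GPU instance:

```python
# Back-of-the-envelope cost comparison. All prices are illustrative
# assumptions, not real quotes from any provider.
frontier_cost_per_call = 0.15      # dollars, assumed blended price per request
gpu_monthly_cost = 1500.0          # dollars, assumed dedicated GPU instance
requests_per_month = 5_000_000

frontier_bill = frontier_cost_per_call * requests_per_month
break_even_calls = gpu_monthly_cost / frontier_cost_per_call

print(f"frontier bill:    ${frontier_bill:,.0f}/month")      # $750,000/month
print(f"small-model bill: ${gpu_monthly_cost:,.0f}/month (fixed, any volume)")
print(f"GPU bill equals   {break_even_calls:,.0f} frontier calls")  # 10,000
```

Under these assumptions the fixed GPU bill covers millions of requests for the price of ten thousand frontier calls; shift the per-call price and the break-even moves, but the shape of the curve doesn't.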

None of these are properties of capability. They are properties of where and how the model runs. A better frontier model does not fix any of them.

When frontier genuinely wins

I'm not claiming small models do everything. They don't.

Use a frontier model when the task requires:

  • Genuinely hard reasoning. Multi-step planning across long contexts, novel problem solving, code generation above boilerplate.
  • Breadth. Open-ended conversation across arbitrary domains you don't know in advance.
  • Frontier-only capabilities. Very long contexts (1M+ tokens), reliable tool use, some multimodal reasoning. Some of this will migrate down over time. Right now it is frontier-gated.
  • Exploration. When you don't yet know what the task needs. Frontier first, specialize later.

The pattern: frontier for prototypes, frontier for the hard 10%, small models for the 90% that is actually boring.
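That split can be written down as an explicit router. The sketch below is a toy: `small_model` and `frontier_model` are stand-in callables, and `is_hard` is a placeholder for whatever classifier or heuristic fits the product, not a real difficulty detector:

```python
from typing import Callable

def make_router(small_model: Callable[[str], str],
                frontier_model: Callable[[str], str],
                is_hard: Callable[[str], bool]) -> Callable[[str], str]:
    """Send each request to the small model unless the heuristic flags it hard."""
    def route(request: str) -> str:
        model = frontier_model if is_hard(request) else small_model
        return model(request)
    return route

# Toy stand-ins: a real system would wire in a local 7B model and a
# frontier API client here.
route = make_router(
    small_model=lambda r: f"small:{r}",
    frontier_model=lambda r: f"frontier:{r}",
    is_hard=lambda r: len(r) > 40,  # placeholder heuristic
)

print(route("classify this review"))  # small:classify this review
print(route("plan a multi-step refactor across these nine modules"))  # frontier:...
```

The interesting engineering lives in `is_hard`: in practice it can itself be a small classifier, which is the routing bullet from the task list above.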

The position

Most AI systems being built right now are over-spec'd. They use frontier models for tasks a 7B-parameter model would handle at a tenth the latency and a hundredth the cost.

The correct engineering move: start with a frontier model to prove the task is solvable, then measure what a smaller model can do. In most cases the smaller one is sufficient. In most of the cases where it isn't, a small fine-tuned model is.

Building on frontier APIs is easier. Building on small models is better. Both are true, and the gap between those two sentences is where the industry sits right now.

© 2026 Daniel Kosbab
