Daniel Kosbab · 2 min read

Attention is just a weighted average

At its heart, the attention mechanism in a Transformer is a weighted average.

That sentence sounds reductive. It isn't. It's exact, and the details it omits are interesting for a reason other than what most explainers claim.

The formula, actually

Attention, in the standard formulation, is softmax(QK^T / √d) · V, where d is the dimension of the key vectors.

Unpack it:

  • V is a matrix. Each row is a value vector for one token in the input.
  • softmax(QK^T / √d) produces the weights: one weight vector per query position, each non-negative and summing to one.
  • Multiplying the weight vector by V gives a weighted average of the value vectors.
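The three bullets above fit in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention, not a library implementation; the shapes and random inputs are made up for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to one
    return weights @ V                               # each output row is a weighted average of V's rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out = attention(Q, K, V)      # (4, 8): one averaged value vector per query
```

Because each weight row is non-negative and sums to one, every output row is a convex combination of the value vectors: it can never leave the range the values span.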

That's all. You had this in second-semester linear algebra: a weighted average of vectors, where the weights come from somewhere.

The somewhere is what's interesting.

Where the weights come from

The weights are dot products. For each query vector q and each key vector k, the unnormalized weight is q · k. High dot product means high alignment. High alignment means high weight.

This is geometry. Two vectors have a high dot product when they point in roughly the same direction. So attention is saying: "pay more attention to the values whose keys point in the same direction as my query."

That framing is not mystical. It's a similarity search, scored by angle, then used to weight an average.
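The geometry is easy to see with toy vectors. In this sketch (hand-picked 2D vectors, purely illustrative), the key that points the same way as the query captures most of the weight:

```python
import numpy as np

q = np.array([1.0, 0.0])            # query pointing along x
keys = np.array([[1.0, 0.0],        # aligned with q
                 [0.0, 1.0],        # orthogonal to q
                 [-1.0, 0.0]])      # pointing the opposite way

scores = keys @ q                   # dot products: alignment with the query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the three keys
# the aligned key gets the largest weight; all weights stay positive and sum to one
```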

The interesting part is not that the math is simple. The interesting part is what keys and queries get trained to represent.

What gets weighted, and why it matters

In a Transformer, the keys, queries, and values are learned projections of the input tokens. The training signal is next-token prediction. What that signal does to the projection matrices that produce Q, K, and V: it pushes them into a geometry where the dot product aligns keys with the queries that predict the same context.
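Concretely, "learned projections" means three matrices multiplied against the same token representations. The W matrices below are random stand-ins for what gradient descent would actually learn, and all shapes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_head = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))    # one row per input token

# These projection matrices are what training actually adjusts.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

# Queries, keys, and values are all views of the same tokens,
# projected through different learned lenses.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
```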

This is load-bearing. The attention mechanism doesn't decide what to attend to. It attends according to the geometry the rest of the training pressure built.

You can see this in practice. Attention heads in a trained model will reliably align on things like syntactic structure, semantic role, or positional offset. None of this is hand-specified. It is what the gradient converged on, given the freedom to learn any geometry that predicts well.

Why the simplicity matters

Two reasons.

  1. It demystifies one of the most intimidating-looking formulas in modern ML. Once you see it as a weighted average with a particular weight-generating mechanism, it stops being magic.
  2. It reframes where the intelligence lives. Not in the attention operation itself, which is linear algebra. In the projection matrices, which are where the model stores what it has learned about what to pay attention to.

The attention mechanism is a mixer. The projections are the content.

If you want to understand what a Transformer is doing, staring at the attention formula is the wrong end of the problem. The attention weights aren't where the model's knowledge lives. Look at what the keys and queries got trained to be.

© 2026 Daniel Kosbab
