Softmax is usually presented as a formula. It is better understood as a dial.
The formula is simple: for a vector x, softmax returns exp(x_i) / sum(exp(x_j)). The output is a probability distribution: raw scores turned into things that look like confidences.
The part most explanations skip: there is a temperature hidden in there, and it controls everything about the output's shape.
The hidden parameter
The full formula is softmax(x_i / T) where T is the temperature. When people write softmax without a temperature, they are writing softmax with T = 1. It is still there, just implicit.
Vary T and the output shape changes:
- T = 1. Default softmax. Scores become probabilities with moderate sharpness.
- T → 0. The largest score takes essentially all the probability mass. In the limit, softmax becomes argmax.
- T → ∞. All probabilities become roughly equal. In the limit, softmax becomes uniform.
This is the dial. Low temperature says "I am certain, the top answer is the answer." High temperature says "I am uncertain, the answers are close to equally likely."
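The three regimes are easy to see directly. A minimal sketch, using made-up scores:

```python
import numpy as np

def softmax(x, T=1.0):
    """softmax(x / T); subtracting the max keeps exp() from overflowing."""
    z = np.asarray(x, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])  # invented scores for illustration

for T in (0.1, 1.0, 10.0):
    print(f"T={T}: {np.round(softmax(scores, T), 3)}")
```

At T = 0.1 the top score takes essentially all the mass, at T = 1 the distribution is moderately sharp, and at T = 10 the three probabilities land within a few points of each other.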
Where this matters
Most places softmax shows up in modern ML involve a temperature, even if it is not named.
- Language model sampling. Generating tokens at T = 0.8 gives coherent text with some variation. At T = 1.5, text becomes creative and sometimes incoherent. At T = 0, you get the greedy-decoded output every time. The temperature setting literally exposes the dial to the user.
- Attention. The /√d scaling in softmax(QK^T / √d) is a temperature adjustment. Without it, dot products in high dimensions become very large, softmax collapses to effectively argmax, and training breaks. The √d keeps the dial in a usable range.
- Knowledge distillation. Teacher models produce softened probabilities at high temperature. Students learn from the shape of the distribution, not just the top answer. The extra information in the "almost right" answers is useful training signal.
- Calibration. A model is calibrated when, over the outputs it gives 80% confidence, 80% are actually right. Temperature scaling is a standard trick to recalibrate an uncalibrated model. One scalar, applied to the pre-softmax logits, can fix systematic over- or under-confidence.
Same mechanism. Different applications.
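The calibration case is concrete enough to sketch. Everything below is synthetic: invented "validation" logits from a deliberately overconfident model, and a one-dimensional grid search standing in for a proper optimizer. The idea is just to find the single scalar T that minimizes negative log-likelihood on held-out data:

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of the true labels
    under softmax(logits / T), computed stably in log space."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Synthetic validation set from an overconfident model:
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
logits = rng.normal(size=(200, 3))
logits[np.arange(200), labels] += 1.0  # true class scores run higher...
logits *= 5.0                          # ...but scaled up: overconfident

# One scalar, fit by grid search, recalibrates the whole model:
Ts = np.linspace(0.5, 10.0, 96)
best_T = Ts[np.argmin([nll(logits, labels, T) for T in Ts])]
print(best_T)
```

Because the logits were inflated by a constant factor, the fitted temperature comes out well above 1: dividing by it undoes the systematic overconfidence without changing which answer ranks first.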
What this changes about how you read softmax
When you see softmax(x) in code or a paper, ask two questions.
- What is the temperature. If it is not written, it is 1, and that is a choice.
- What does the choice say about the writer's assumption. T = 1 says "scores and probabilities are in similar ranges." T = 0.1 says "top answer dominates." T = 10 says "I do not trust these scores, average them out."
This reading reveals something not obvious from the formula: softmax is not just a probability-making function. It is an opinion about how confident the model should be, imposed on top of the raw scores.
The intuition to carry
Raw scores have arbitrary scale. Temperature is how you translate that scale into confidence.
Write softmax with an explicit temperature when you can. softmax(x / T) where T is a hyperparameter. Then you have to think about what value makes sense for your problem, instead of getting T = 1 by default and treating it as if it were neutral.
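In code, this is a one-line habit: divide by T at the call site, so the choice is visible. A sketch with invented scores and an invented T = 0.5, chosen only to show the sharpening:

```python
import numpy as np

def softmax(x):
    """Plain softmax; subtracting the max avoids overflow in exp()."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.5])  # made-up scores
T = 0.5                        # hypothetical choice for this problem

p = softmax(x / T)  # the dial made explicit at the call site
print(p)
```

Calling softmax(x) instead would be the same code with T = 1 left implicit, which is the point: the division is where the choice lives.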
Default T = 1 is neutral the way a default font size of 16px is neutral. Reasonable, but a choice.