So we've trained a neural network on mountains of text — but how does it actually generate a response when you type a message? This step is called inference: the model takes your input and predicts what comes next, one token at a time.
[Diagram: your input ("Tell me about...") → inference (🧠 LLM) → output ("Here's what...")]
The network outputs a probability for every possible next token. It could always pick the highest one (greedy decoding), but that produces repetitive, boring text. Instead, LLMs sample from the probability distribution, like rolling a weighted die. A token with 30% probability gets picked 30% of the time, and one with 5% still has a chance. This is why the same prompt can produce different answers each time.
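A minimal sketch of the difference between greedy decoding and sampling, using a made-up six-token distribution (the tokens and probabilities are illustrative, not from a real model):

```python
import random

# Hypothetical next-token probabilities after some prompt.
probs = {
    "sunny": 0.30, "nice": 0.25, "cold": 0.20,
    "rainy": 0.15, "weird": 0.05, "purple": 0.05,
}

# Greedy decoding: always take the single most likely token.
greedy = max(probs, key=probs.get)  # "sunny", every single time

# Sampling: pick a token in proportion to its probability,
# like rolling a weighted die.
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]
```

Run the sampling line repeatedly and "sunny" comes up about 30% of the time, "purple" about 5% — which is exactly why the same prompt can produce different answers.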
[Diagram: a temperature dial from ❄️ T = 0 (predictable) to 🔥 T = 2 (creative)]
Temperature: controlling the randomness
But how much randomness do you want? That's what the temperature setting controls: a dial between 'boring but reliable' and 'creative but chaotic.' At temperature 0, the model skips sampling entirely and always picks the single most likely token (greedy decoding). At temperature 2, the distribution gets flattened, so even rare tokens have a fair shot.
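Under the hood, temperature works by dividing the model's raw scores (logits) before converting them to probabilities with a softmax. A sketch with four invented logit values:

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw logits into next-token probabilities, scaled by temperature.
    Toy version: a real LLM does this over a vocabulary of ~50k+ tokens."""
    if temperature == 0:
        # Greedy mode: put all probability mass on the top token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    # Divide every logit by the temperature, then softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]           # made-up scores for four tokens
cold = apply_temperature(logits, 0.2)    # sharply peaked: top token dominates
warm = apply_temperature(logits, 2.0)    # flattened: rare tokens get a real shot
```

Low temperature sharpens the distribution (the top token's probability approaches 1), while high temperature flattens it — the same weighted-die roll, but with the weights redistributed.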
[Interactive: Temperature Lab. Prompt: "The weather today is ___". Challenge: set the temperature so the output is fully predictable (always the top token); very low temperature (0-0.2) makes the distribution extremely peaked.]