4

Inference

Generating text with temperature and sampling


So we've trained a neural network on mountains of text — but how does it actually generate a response when you type a message? This step is called inference: the model takes your input and predicts what comes next, one token at a time.

[Diagram: your input ("Tell me about...") flows through the LLM at inference time, which produces the output ("Here's what...").]

The network outputs a probability for every possible next token. It could always pick the highest one (greedy decoding), but that produces repetitive, boring text. Instead, LLMs sample from the probability distribution, like rolling a weighted die: a token with a 30% probability gets picked about 30% of the time, and one at 5% still has a chance. This is why the same prompt can produce different answers each time.
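A minimal sketch of the difference between greedy decoding and sampling, assuming the model has already produced a probability for each candidate token (the distribution below is invented for illustration):

```python
import random

# Hypothetical next-token distribution for illustration (sums to 1.0).
next_token_probs = {"blue": 0.30, "clear": 0.25, "gray": 0.20,
                    "falling": 0.15, "purple": 0.10}

def greedy_token(probs):
    """Greedy decoding: always take the single most likely token."""
    return max(probs, key=probs.get)

def sample_token(probs):
    """Sampling: roll a weighted die over the whole distribution."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(greedy_token(next_token_probs))  # always "blue"
# sample_token picks "blue" about 30% of the time, "purple" about 10%, etc.
```

Calling `greedy_token` twice always gives the same word; calling `sample_token` twice can give two different ones, which is exactly why the same prompt can yield different answers.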

[Diagram: a dial from ❄️ T = 0 (predictable) to 🔥 T = 2 (creative).]

Temperature: controlling the randomness

But how much randomness do you want? That's what the temperature setting controls: a dial between 'boring but reliable' and 'creative but chaotic.' At temperature 0, the model skips sampling entirely and always picks the #1 token (greedy decoding). At temperature 2, the probabilities get flattened so that even rare tokens have a fair shot.
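Under the hood, temperature works by dividing the model's raw scores (logits) by T before the softmax turns them into probabilities. A sketch with made-up logit values:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply softmax.
    Low T sharpens the distribution toward the top token; high T flattens it."""
    if temperature == 0:
        # Greedy mode: put all probability mass on the top token.
        top = logits.index(max(logits))
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical raw model scores
for t in (0, 0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Running this shows the dial in action: at T = 0 the top token gets 100% of the mass, and as T rises its share shrinks while the rare tokens' shares grow.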

Temperature Lab

Prompt: "The weather today is ___"

Next-token probabilities at temperature 1.0:

sunny      30.0%
cloudy     20.0%
warm       15.0%
cold       10.0%
rainy       8.0%
beautiful   7.0%
perfect     5.0%
terrible    3.0%
foggy       2.0%
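The lab above can be reproduced in a few lines. Assuming the listed percentages are the model's output at T = 1.0, raising each probability to the power 1/T and renormalizing is mathematically equivalent to dividing the underlying logits by T:

```python
def apply_temperature(probs, temperature):
    """Rescale a T=1.0 distribution to a new temperature.
    p ** (1/T), renormalized, is equivalent to dividing the logits by T."""
    powered = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(powered.values())
    return {tok: p / total for tok, p in powered.items()}

# The Temperature Lab's distribution at T = 1.0.
weather = {"sunny": 0.30, "cloudy": 0.20, "warm": 0.15, "cold": 0.10,
           "rainy": 0.08, "beautiful": 0.07, "perfect": 0.05,
           "terrible": 0.03, "foggy": 0.02}

cold = apply_temperature(weather, 0.5)  # sharper: "sunny" pulls ahead
hot = apply_temperature(weather, 2.0)   # flatter: "foggy" catches up
```

At T = 0.5, "sunny" jumps from 30% to roughly half the mass; at T = 2.0 it drops below 20% while "foggy" more than doubles, matching the dial behavior described above.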

Your turn — try it out!

Challenge 1 of 3 (Predictable): set the temperature so the output is fully predictable, i.e. the model always picks the top token.