So we've trained a neural network on mountains of text — but how does it actually generate a response when you type a message? This step is called inference: the model takes your input and predicts what comes next, one token at a time.
[Diagram: your input ("Tell me about...") → inference (🧠 LLM) → output ("Here's what...")]
The network outputs a probability for every possible next token. It could always pick the highest one (greedy decoding), but that produces repetitive, boring text. Instead, LLMs sample from the probability distribution, like rolling a weighted die. A token with 30% probability gets picked 30% of the time, and one with 5% still has a chance. This is why the same prompt can produce different answers each time.
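A minimal sketch of the difference between greedy decoding and sampling, using a made-up six-token distribution (the tokens and probabilities are illustrative, not from a real model):

```python
import random

# Hypothetical next-token probabilities after some prompt.
probs = {
    "sunny": 0.30, "nice": 0.25, "cold": 0.20,
    "rainy": 0.15, "weird": 0.05, "purple": 0.05,
}

# Greedy decoding: always take the single most likely token.
greedy = max(probs, key=probs.get)  # "sunny", every single time

# Sampling: pick a token in proportion to its probability,
# like rolling a weighted die.
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]
```

Run the sampling line repeatedly and "sunny" comes up about 30% of the time, "purple" about 5% — which is exactly why the same prompt can produce different answers.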
[Diagram: a temperature dial from ❄️ T = 0 (predictable) to 🔥 T = 2 (creative)]
Temperature: controlling the randomness
But how much randomness do you want? That's what the temperature setting controls: a dial between 'boring but reliable' and 'creative but chaotic.' At temperature 0, the model skips sampling entirely and always picks the single most likely token (greedy decoding). At temperature 2, the distribution gets flattened, so even rare tokens have a fair shot.
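Under the hood, temperature works by dividing the model's raw scores (logits) before converting them to probabilities with a softmax. A sketch with four invented logit values:

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw logits into next-token probabilities, scaled by temperature.
    Toy version: a real LLM does this over a vocabulary of ~50k+ tokens."""
    if temperature == 0:
        # Greedy mode: put all probability mass on the top token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    # Divide every logit by the temperature, then softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]           # made-up scores for four tokens
cold = apply_temperature(logits, 0.2)    # sharply peaked: top token dominates
warm = apply_temperature(logits, 2.0)    # flattened: rare tokens get a real shot
```

Low temperature sharpens the distribution (the top token's probability approaches 1), while high temperature flattens it — the same weighted-die roll, but with the weights redistributed.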
[Interactive: Temperature Lab. Prompt: "The weather today is ___". Challenge: set the temperature so the output is fully predictable (always the top token); very low temperature (0-0.2) makes the distribution extremely peaked.]