8

Reinforcement Learning

How RL and human feedback push LLMs beyond supervised learning


Training AI on human-written examples hits a ceiling — the model can only be as good as its teachers. Reinforcement Learning (RL) breaks through this limit with a simple but powerful idea: instead of showing the model the "right" answer, let it try many different answers and learn from the ones that work.

Think of learning to ride a bike. Nobody hands you a textbook on "The Correct Way to Balance." Instead, you get on, wobble, fall, adjust — and eventually discover what works through trial and error. That's reinforcement learning: learn by trying, not by copying.

🎲 Generate: 100 different answers
✅ Verify: which ones are correct?
🏆 Reward: train on the best ones

The RL loop: generate many candidates, check which ones are good, and reinforce the successful strategies.
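The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real training run: the "model" is a random guesser, the reward is exact-match verification, and "training" is simply collecting the winners.

```python
import random

TARGET = 17 * 23  # 391: a problem with an automatically checkable answer

def generate(n=100):
    """Toy 'model': propose n candidate answers (here, random guesses)."""
    return [random.randint(380, 400) for _ in range(n)]

def verify(answer):
    """Automatic checker: True exactly when the answer is correct."""
    return answer == TARGET

candidates = generate()
winners = [a for a in candidates if verify(a)]
# In real RL, the model's weights would now be updated to make the
# winning answers more likely; here we just keep them as training data.
```

A real run replaces the random guesser with an LLM sampling at nonzero temperature, but the generate/verify/reinforce shape is the same.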

For this loop to work, you need one crucial ingredient: a way to check if the answer is right. A math problem has a correct answer you can verify. A piece of code either passes its tests or it doesn't. But "write a beautiful poem"? That's subjective — there's no automatic checker.

🔢 Math: check the answer (easy to verify)
💻 Code: run the tests (easy to verify)
🧩 Logic puzzles: verify the solution (easy to verify)
🌍 Translation: compare to a reference (moderate)
📝 Summarization: check key facts (moderate)
✏️ Creative writing: subjective taste (hard to verify)
💬 Life advice: no clear right answer (hard to verify)

RL works best where answers can be automatically verified. The easier the verification, the tighter the feedback loop.

This is a major reason AI coding has improved so dramatically. Code is the perfect RL playground: you write a function, run the test suite, and know instantly whether it works. No human reviewer needed — the computer checks the answer for you, millions of times.

Example: RL for code generation

Prompt: "Write a function that reverses a linked list"

The model generates 100 different solutions:

31 pass (all tests green)
54 fail (wrong output)
15 crash (runtime error)

Train on the 31 that passed — the model learns what good code looks like.

No human labeler needed. The test suite is the reward signal — fast, cheap, and objective.
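A minimal sketch of this test-suite-as-reward idea, simplified to plain Python lists instead of a linked list. The candidate sources and the `reverse_list` name are illustrative, not from any real pipeline.

```python
def reward(candidate_src, tests):
    """Return 1.0 if the candidate defines reverse_list and passes every
    test, 0.0 on wrong output or any crash -- no human judgment involved."""
    namespace = {}
    try:
        exec(candidate_src, namespace)
        fn = namespace["reverse_list"]
        return 1.0 if all(fn(list(inp)) == out for inp, out in tests) else 0.0
    except Exception:
        return 0.0  # crashes count as failures

tests = [([1, 2, 3], [3, 2, 1]), ([], []), ([7], [7])]

candidates = [
    "def reverse_list(xs):\n    return xs[::-1]",        # correct
    "def reverse_list(xs):\n    return xs",               # wrong output
    "def reverse_list(xs):\n    return xs.reverse()[0]",  # crashes
]
rewards = [reward(src, tests) for src in candidates]  # [1.0, 0.0, 0.0]
```

Every candidate lands in one of the three buckets from the example above, and the whole scoring pass runs without a human in the loop.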

The same principle applies to math (check the answer), logic puzzles (verify the solution), and even games (did you win?). Anywhere a computer can judge the result, RL can generate millions of training examples — far more than any human could write.

A surprising discovery: DeepSeek-R1 showed that through RL, models don't just get better answers — they learn to think harder on difficult problems. Without being explicitly taught, they start spending more reasoning steps on tough questions and breezing through easy ones. The model discovers chain-of-thought reasoning on its own.

But look at the hard-to-verify end of the spectrum above. "Is this poem beautiful?" "Is this explanation clear to a 10-year-old?" "Is this advice helpful without being harmful?" There's no test suite for these questions. Yet humans intuitively know a good answer when they see one.

This is where RLHF — Reinforcement Learning from Human Feedback — comes in. Instead of an automated checker, you use people as the reward signal.

📋 Two responses: A and B
👤 Human judges: pick the better one
🎯 Reward model: learns preferences

RLHF: show humans two AI responses, ask which is better, and train a reward model on thousands of these comparisons.

The key insight: it's much easier to judge which response is better than to write a perfect response yourself. You don't need to be a chef to know which of two dishes tastes better. Thousands of these comparisons train a "reward model" that learns to predict what humans prefer — and that model becomes the automated judge for RL. This is what teaches models to say "I'm not sure" instead of confidently making things up, and to decline dangerous requests while staying helpful for everything else.
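The reward-model idea can be sketched under heavy simplification: here responses are reduced to hypothetical hand-picked features (real reward models score raw text with a fine-tuned LLM), and a linear scorer is fit with the Bradley-Terry preference loss commonly used in RLHF pipelines.

```python
import numpy as np

def score(w, x):
    """Reward model: a scalar score for a response (a linear model over
    toy features here; real systems use a fine-tuned LLM)."""
    return float(np.dot(w, x))

def prob_a_preferred(w, xa, xb):
    """Bradley-Terry model: P(A beats B) = sigmoid(score_A - score_B)."""
    return 1.0 / (1.0 + np.exp(-(score(w, xa) - score(w, xb))))

def train_reward_model(comparisons, lr=0.5, steps=200):
    """Fit w by gradient ascent on the log-likelihood of human choices."""
    w = np.zeros(len(comparisons[0][0]))
    for _ in range(steps):
        for x_win, x_lose in comparisons:
            p = prob_a_preferred(w, x_win, x_lose)
            w += lr * (1.0 - p) * (np.asarray(x_win) - np.asarray(x_lose))
    return w

# Hypothetical features: [hedges when unsure, declines risky requests, length]
comparisons = [
    ([1, 0, 2], [0, 0, 2]),  # humans preferred the answer that hedged
    ([0, 1, 1], [0, 0, 1]),  # ...and the one that declined a risky request
]
w = train_reward_model(comparisons)
```

After training, the model assigns higher scores to hedged and safety-conscious responses, which is exactly the kind of preference that later steers the main model toward "I'm not sure" over confident fabrication.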

🎯 Your turn — try it out!


RL Tournament (1/2)

Play the role of the automated verifier. The model generated 6 solutions — select the correct ones to train on:

Problem: What is 17 x 23?

1. 17 x 23 = 17 x 20 + 17 x 3 = 340 + 51 = 391
2. 17 x 23 = 391
3. 17 x 23 = 17 x 20 + 17 x 3 = 340 + 41 = 381
4. 17 x 23 = 20 x 23 - 3 x 23 = 460 - 69 = 391
5. 17 x 23 is approximately 400
6. 17 x 23 = 17 + 17 + ... (23 times) = 401
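The check you just performed by hand can itself be automated. A minimal sketch, assuming the verifier simply extracts the last number a solution states and compares it to the true answer (illustrative parsing only):

```python
import re

def verify(solution, correct=391):
    """Accept a solution only if its final stated number is the answer."""
    numbers = re.findall(r"\d+", solution)
    return bool(numbers) and int(numbers[-1]) == correct

solutions = [
    "17 x 23 = 17 x 20 + 17 x 3 = 340 + 51 = 391",
    "17 x 23 = 391",
    "17 x 23 = 17 x 20 + 17 x 3 = 340 + 41 = 381",
    "17 x 23 = 20 x 23 - 3 x 23 = 460 - 69 = 391",
    "17 x 23 is approximately 400",
    "17 x 23 = 17 + 17 + ... (23 times) = 401",
]
training_set = [s for s in solutions if verify(s)]
```

Note that this checker inspects only the final number, so it would also reward a lucky guess with no working shown — one reason verifier design matters in real RL pipelines.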