Reinforcement Learning
How RL and human feedback push LLMs beyond supervised learning
Training AI on human-written examples hits a ceiling — the model can only be as good as its teachers. Reinforcement Learning (RL) breaks through this limit with a simple but powerful idea: instead of showing the model the "right" answer, let it try many different answers and learn from the ones that work.
Think of learning to ride a bike. Nobody hands you a textbook on "The Correct Way to Balance." Instead, you get on, wobble, fall, adjust — and eventually discover what works through trial and error. That's reinforcement learning: learn by trying, not by copying.
🎲
Generate
100 different answers
✅
Verify
Which ones are correct?
🏆
Reward
Train on the best ones
The RL loop: generate many candidates, check which ones are good, and reinforce the successful strategies.
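In code, the loop reads as a generate-verify-filter cycle. A minimal sketch, where `generate` and `verify` are stand-ins for a real model and a real checker (here: noisy guesses at an arithmetic question with a known answer):

```python
import random

def generate(prompt, n=100):
    # Stand-in for sampling n candidate answers from a model.
    # Here: random guesses near the true answer.
    return [random.randint(380, 400) for _ in range(n)]

def verify(answer):
    # Stand-in for an automatic checker with known ground truth.
    return answer == 17 * 23  # 391

def rl_step(prompt):
    candidates = generate(prompt)
    winners = [a for a in candidates if verify(a)]
    # In real RL training, `winners` becomes the reinforcement signal:
    # the model is updated to make outputs like these more likely.
    return winners

winners = rl_step("What is 17 x 23?")
print(f"{len(winners)} of 100 candidates passed verification")
```

The whole loop runs without a human in it — that is what makes it cheap to repeat millions of times.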
For this loop to work, you need one crucial ingredient: a way to check if the answer is right. A math problem has a correct answer you can verify. A piece of code either passes its tests or it doesn't. But "write a beautiful poem"? That's subjective — there's no automatic checker.
RL works best where answers can be automatically verified. The easier the verification, the tighter the feedback loop.
This is a major reason AI coding has improved so dramatically. Code is the perfect RL playground: you write a function, run the test suite, and know instantly whether it works. No human reviewer needed — the computer checks the answer for you, millions of times.
Example: RL for code generation
Prompt
"Write a function that reverses a linked list"
Model generates 100 different solutions
31 pass
all tests green
54 fail
wrong output
15 crash
runtime error
Train on the 31 that passed — the model learns what good code looks like
No human labeler needed. The test suite is the reward signal — fast, cheap, and objective.
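The tally above can be reproduced mechanically: run every candidate against the same test suite and keep the survivors. A minimal sketch — it uses Python lists rather than a true linked list for brevity, and the three hand-written candidates stand in for model samples:

```python
def run_tests(reverse_fn):
    """Reward signal: does the candidate pass the test suite?"""
    cases = [([], []), ([1], [1]), ([1, 2, 3], [3, 2, 1])]
    try:
        return all(reverse_fn(list(xs)) == expected for xs, expected in cases)
    except Exception:
        return False  # a crash scores the same as a wrong answer

# Three hypothetical candidates (real RL would sample these from the model).
def candidate_a(xs):   # correct: reverses the list
    return xs[::-1]

def candidate_b(xs):   # wrong output: returns the list unchanged
    return xs

def candidate_c(xs):   # runtime error: index out of range
    return xs[len(xs)]

candidates = [candidate_a, candidate_b, candidate_c]
passed = [f for f in candidates if run_tests(f)]
print(f"{len(passed)} of {len(candidates)} candidates pass")
```

Only `candidate_a` survives, and only the survivors are trained on — wrong outputs and crashes both get zero reward.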
The same principle applies to math (check the answer), logic puzzles (verify the solution), and even games (did you win?). Anywhere a computer can judge the result, RL can generate millions of training examples — far more than any human could write.
A surprising discovery: DeepSeek-R1 showed that through RL, models don't just get better answers — they learn to think harder on difficult problems. Without being explicitly taught, they start spending more reasoning steps on tough questions and breezing through easy ones. The model discovers chain-of-thought reasoning on its own.
But look at the red zone on the spectrum above. "Is this poem beautiful?" "Is this explanation clear to a 10-year-old?" "Is this advice helpful without being harmful?" There's no test suite for these questions. Yet humans intuitively know a good answer when they see one.
This is where RLHF — Reinforcement Learning from Human Feedback — comes in. Instead of an automated checker, you use people as the reward signal.
📋
Two Responses
A and B
👤
Human Judges
Pick the better one
🎯
Reward Model
Learns preferences
RLHF: show humans two AI responses, ask which is better, and train a reward model on thousands of these comparisons.
The key insight: it's much easier to judge which response is better than to write a perfect response yourself. You don't need to be a chef to know which of two dishes tastes better. Thousands of these comparisons train a "reward model" that learns to predict what humans prefer — and that model becomes the automated judge for RL. This is what teaches models to say "I'm not sure" instead of confidently making things up, and to decline dangerous requests while staying helpful for everything else.
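Reward models of this kind are commonly trained with a pairwise (Bradley-Terry style) loss: given the response humans preferred and the one they rejected, push the model's score for the first above its score for the second. A minimal sketch with made-up numeric "features" standing in for a real language model's representation of each response:

```python
import math

def reward(features, w):
    # Toy reward model: a linear score over response features.
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical comparisons: each pair is (features of the response humans
# preferred, features of the response they rejected).
comparisons = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.3, 0.7])]

w = [0.0, 0.0]
lr = 0.5
for _ in range(200):  # plain gradient descent on -log sigmoid(margin)
    for pref, rej in comparisons:
        margin = reward(pref, w) - reward(rej, w)
        p = 1 / (1 + math.exp(-margin))   # P(humans prefer `pref`)
        grad_scale = -(1 - p)             # d(loss)/d(margin)
        for i in range(len(w)):
            w[i] -= lr * grad_scale * (pref[i] - rej[i])

# After training, the reward model ranks every preferred response higher.
for pref, rej in comparisons:
    assert reward(pref, w) > reward(rej, w)
```

Once trained, this scorer replaces the human: RL can query it millions of times, just like a test suite.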
Your turn — try it out!
RL Tournament (1/2)
Play the role of the automated verifier. The model generated 6 solutions — select the correct ones to train on:
Problem:
What is 17 x 23?
1. 17 x 23 = 17 x 20 + 17 x 3 = 340 + 51 = 391
2. 17 x 23 = 391
3. 17 x 23 = 17 x 20 + 17 x 3 = 340 + 41 = 381
4. 17 x 23 = 20 x 23 - 3 x 23 = 460 - 69 = 391
5. 17 x 23 is approximately 400
6. 17 x 23 = 17 + 17 + ... (23 times) = 401
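An automated verifier for this problem needs only the ground truth, not the reasoning: it checks each candidate's final answer against 17 x 23 = 391. A minimal sketch, where `candidate_answers` holds the final answers extracted from the six solutions above:

```python
GROUND_TRUTH = 17 * 23  # 391

# Final answers extracted from the six candidate solutions.
candidate_answers = [391, 391, 381, 391, "approximately 400", 401]

selected = [i for i, ans in enumerate(candidate_answers, start=1)
            if ans == GROUND_TRUTH]
print(f"Train on candidates {selected}")
```

Note that the verifier accepts candidate 2 even though it shows no work, and rejects the "approximately 400" answer outright — exact-match checking is blunt but objective.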