9

The Grand Finale

Where LLMs are headed and everything you've learned in one pipeline


You've learned how LLMs go from raw internet text to helpful chat assistants — through pre-training, fine-tuning, and reinforcement learning. But the field is evolving at breakneck speed. Here's where things are heading.

👁️🎤🎬

Multimodal Models

Seeing, hearing, and creating

Modern LLMs don't just read text — they see images, understand audio, and generate pictures and video. Models like GPT-4o, Claude, and Gemini can analyze photos, read handwriting, interpret charts, and describe scenes. This works by converting images and audio into token-like representations that the model processes alongside text.
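To make "token-like representations" concrete, here is a minimal, hypothetical sketch of the patch-tokenization idea: the image is cut into small patches, and each flattened patch becomes one "visual token" that (after a learned linear projection, omitted here) would sit in the sequence alongside text-token embeddings. The function name and sizes are illustrative, not any specific model's API.

```python
# Hypothetical sketch: how an image becomes "token-like" inputs.
# Names and sizes are illustrative, not any specific model's API.

def image_to_patch_tokens(image, patch_size=2):
    """Split an H x W grayscale image into flattened patches.

    Each patch becomes one "visual token": a flat vector that a
    learned linear projection would map into the model's embedding
    space, so the transformer treats it like a text-token embedding.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = [image[i + di][j + dj]
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            tokens.append(patch)
    return tokens

# A 4x4 image -> four 2x2 patches -> four 4-dim "tokens"
img = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
tokens = image_to_patch_tokens(img)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

Real vision encoders use much larger patches (e.g. 14x14 or 16x16 pixels) over RGB channels, but the principle is the same: the image becomes a sequence, and sequences are what transformers already know how to process.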

🔍💻🛠️

Tool Use & Agents

From answering to doing

Early LLMs could only generate text. Today's models can use tools: search the web, write and run code, call APIs, read files, and even operate your computer. This transforms them from "things that answer questions" into agents that take action. An agent can research a topic across multiple sources, write a report, and email it — all from a single instruction.
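The agent pattern above boils down to a loop: the model emits either a tool call or a final answer, and the loop executes tools and feeds results back into the context. A minimal sketch, with a hard-coded `fake_model` standing in for a real LLM and toy tools (all names here are illustrative, not a real API):

```python
# Hypothetical sketch of an agent loop: the model emits either a
# final answer or a tool call, and the loop feeds results back.
# `fake_model` and the tool names are illustrative, not a real API.

TOOLS = {
    "search": lambda q: f"Top result for {q!r}",
    "calculator": lambda expr: str(eval(expr)),  # toy only; never eval untrusted input
}

def fake_model(history):
    """Stands in for an LLM: decides the next action from context."""
    if not any(msg.startswith("TOOL_RESULT") for msg in history):
        return {"action": "tool", "name": "calculator", "args": "2 + 2"}
    return {"action": "answer", "text": "The result is " + history[-1].split(": ")[1]}

def run_agent(instruction, max_steps=5):
    history = [instruction]
    for _ in range(max_steps):
        step = fake_model(history)
        if step["action"] == "answer":
            return step["text"]
        result = TOOLS[step["name"]](step["args"])
        history.append(f"TOOL_RESULT: {result}")
    return "Gave up."

print(run_agent("What is 2 + 2?"))  # -> The result is 4
```

A real agent replaces `fake_model` with an LLM call and lets the model choose tools over many steps; the control flow, though, is essentially this loop.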

🧠💡

Reasoning Models

Thinking longer, answering better

Remember the DeepSeek-R1 discovery from the last chapter? That idea has become a major frontier. Reasoning models (like o1, o3, Claude with extended thinking, and DeepSeek-R1) spend more compute at inference time — "thinking" through a chain of steps before answering. For hard math, science, and coding problems, this dramatically improves accuracy. The trade-off: they're slower and more expensive, but they can solve problems that were previously out of reach.
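One way to see the "more compute at inference time" trade-off is best-of-n sampling: draw many candidate solutions and keep the one a checker scores highest. This is a toy sketch under stated assumptions (the sampler and checker are stand-ins; real reasoning models instead generate a long chain of thought before answering), but it shows why more samples cost more yet land closer to the right answer:

```python
import random

# Hypothetical sketch of inference-time compute: sample many candidate
# answers and keep the one a checker scores highest (best-of-n).
# The sampler and checker are toys; real reasoning models instead
# generate a long hidden chain of thought before the final answer.

def sample_candidate(rng):
    """Toy stand-in for one sampled solution attempt."""
    return rng.randint(0, 10)

def checker(candidate, target=7):
    """Toy verifier: closer to the true answer scores higher."""
    return -abs(candidate - target)

def best_of_n(n, seed=0):
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n)]
    return max(candidates, key=checker)

# Spending more samples (compute) can only improve the checker's score.
print(best_of_n(1), best_of_n(50))
```

Since the 50 samples include the first sample, the score of `best_of_n(50)` is never worse than `best_of_n(1)`: accuracy scales with compute, which is exactly the trade reasoning models make.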

🌐🔓

Open Source Revolution

Models for everyone

Models like LLaMA, Mistral, and DeepSeek are released openly — anyone can download, modify, and run them. This means AI isn't locked behind a few companies. Researchers, startups, and hobbyists can fine-tune models for their own needs, run them locally for privacy, or build entirely new applications on top. The gap between open and closed models is shrinking fast.

The big picture: LLMs started as text predictors, became conversational assistants, and are now evolving into multimodal agents that can reason, use tools, and take actions in the real world. The core mechanism — predict the next token — hasn't changed. What's changed is what counts as a token (images, audio, tool calls) and how much thinking happens before each prediction.
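The "what counts as a token" point can be sketched in a few lines: text words, image patches, and tool calls all become IDs in one flat sequence, and the model's job is still to predict the next ID. The vocabulary layout below is purely illustrative, not any real model's:

```python
# Hypothetical sketch of the "everything is a token" idea: text,
# image patches, and tool calls all become IDs in one sequence.
# Vocabulary layout is illustrative, not any real model's.

TEXT_VOCAB = {"describe": 0, "this": 1, "image": 2}
SPECIAL = {"<img_patch>": 100, "<tool_call>": 200}

def build_sequence(words, num_image_patches):
    seq = [SPECIAL["<img_patch>"]] * num_image_patches  # visual tokens first
    seq += [TEXT_VOCAB[w] for w in words]               # then text tokens
    return seq

seq = build_sequence(["describe", "this", "image"], num_image_patches=4)
print(seq)  # [100, 100, 100, 100, 0, 1, 2]
```

Whatever ends up in the stream, next-token prediction over that stream is the unchanged core mechanism.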

Final Exercise: Build the Complete Pipeline

One last test: can you put all 12 steps of the LLM pipeline in the right order from memory?

1. Supervised fine-tuning (SFT)
2. Evaluate base model on benchmarks
3. Deploy chat model with safety filters
4. Crawl the internet for text data
5. RL optimization using reward model (RLHF)
6. Collect human preference comparisons
7. Train reward model on preferences
8. Tokenize the training corpus
9. Collect human conversation examples
10. Pre-train transformer with next-token prediction
11. Train BPE tokenizer on the data
12. Filter and deduplicate data