The Grand Finale
Where LLMs are headed and everything you've learned in one pipeline
You've learned how LLMs go from raw internet text to helpful chat assistants — through pre-training, fine-tuning, and reinforcement learning. But the field is evolving at breakneck speed. Here's where things are heading.
Multimodal Models
Seeing, hearing, and creating
Modern LLMs don't just read text — they see images, understand audio, and generate pictures and video. Models like GPT-4o, Claude, and Gemini can analyze photos, read handwriting, interpret charts, and describe scenes. This works by converting images and audio into token-like representations that the model processes alongside text.
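To make "token-like representations" concrete, here is a toy sketch of how a vision encoder might turn an image into tokens: cut the image into fixed-size patches and project each patch into the model's embedding space. All shapes and the random projection below are illustrative assumptions; real models use learned weights and more elaborate encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # a fake 224x224 RGB image
patch = 16                          # 16x16 pixel patches (assumed size)
d_model = 64                        # embedding width (made up for the demo)

# Cut the image into non-overlapping patches and flatten each one
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# In a real model this projection is learned; here it's random
W = rng.standard_normal((patch * patch * 3, d_model))
image_tokens = patches @ W

print(image_tokens.shape)  # (196, 64): 196 image "tokens" of width 64
```

Those 196 vectors can then sit in the same sequence as text token embeddings, which is what lets one model attend over pixels and words together.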
Tool Use & Agents
From answering to doing
Early LLMs could only generate text. Today's models can use tools: search the web, write and run code, call APIs, read files, and even operate your computer. This transforms them from "things that answer questions" into agents that take action. An agent can research a topic across multiple sources, write a report, and email it — all from a single instruction.
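The core of an agent is a simple loop: the model either emits a tool call or a final answer, and the loop executes the tool and feeds the result back. The sketch below illustrates that shape with a stubbed-out model; every name here (`fake_model`, `run_agent`, the message format) is invented for illustration, not a real API.

```python
def fake_model(messages):
    """Stand-in for an LLM call. A real model decides dynamically;
    this stub requests the calculator once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "calculator", "args": {"expr": "17 * 24"}}
    return {"answer": f"The result is {messages[-1]['content']}."}

# Toy tool registry -- eval() is fine for a demo, never for real input
TOOLS = {"calculator": lambda args: str(eval(args["expr"]))}

def run_agent(user_request):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(5):  # cap the loop so a confused model can't run forever
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": result})

print(run_agent("What is 17 * 24?"))  # The result is 408.
```

Swap the stub for a real model and the registry for real tools (search, code execution, email) and you have the basic architecture behind today's agents.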
Reasoning Models
Thinking longer, answering better
Remember the DeepSeek-R1 discovery from the last chapter? That idea has become a major frontier. Reasoning models (like o1, o3, Claude with extended thinking, and DeepSeek-R1) spend more compute at inference time — "thinking" through a chain of steps before answering. For hard math, science, and coding problems, this dramatically improves accuracy. The trade-off: they're slower and more expensive, but they can solve problems that were previously out of reach.
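One way to see why extra inference-time compute helps, without any real model: take a noisy solver that is right only 60% of the time, sample it many times, and majority-vote the answers (a technique known as self-consistency). The solver and numbers below are a toy assumption, not how any production reasoning model works internally.

```python
import random
from collections import Counter

def noisy_solver(rng, correct=408):
    # Right 60% of the time; otherwise a random wrong 3-digit answer
    return correct if rng.random() < 0.6 else rng.randrange(100, 999)

def solve(rng, samples):
    # Spend more compute: draw several answers, return the most common
    votes = Counter(noisy_solver(rng) for _ in range(samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
trials = 1000
one_shot = sum(solve(rng, 1) == 408 for _ in range(trials)) / trials
voted = sum(solve(rng, 15) == 408 for _ in range(trials)) / trials
print(f"1 sample: {one_shot:.0%}   15 samples: {voted:.0%}")
```

The voted accuracy lands far above the single-sample rate: each extra sample is more "thinking" before the final answer, which is the same trade (slower, pricier, more accurate) that reasoning models make.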
Open Source Revolution
Models for everyone
Models like LLaMA, Mistral, and DeepSeek are released openly — anyone can download, modify, and run them. This means AI isn't locked behind a few companies. Researchers, startups, and hobbyists can fine-tune models for their own needs, run them locally for privacy, or build entirely new applications on top. The gap between open and closed models is shrinking fast.
The big picture: LLMs started as text predictors, became conversational assistants, and are now evolving into multimodal agents that can reason, use tools, and take actions in the real world. The core mechanism — predict the next token — hasn't changed. What's changed is what counts as a token (images, audio, tool calls) and how much thinking happens before each prediction.
Final Exercise: Build the Complete Pipeline
One last test: can you put all 12 steps of the LLM pipeline in the right order from memory?