The Grand Finale
Where LLMs are headed and everything you've learned in one pipeline
You've learned how LLMs go from raw internet text to helpful chat assistants — through pre-training, fine-tuning, and reinforcement learning. But the field is evolving at breakneck speed. Here's where things are heading.
Multimodal Models
Seeing, hearing, and creating
Modern LLMs don't just read text — they see images, understand audio, and generate pictures and video. Models like GPT-4o, Claude, and Gemini can analyze photos, read handwriting, interpret charts, and describe scenes. This works by converting images and audio into token-like representations that the model processes alongside text.
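To make "token-like representations" concrete, here is a toy sketch of how a vision encoder might turn an image into tokens: cut the image into fixed-size patches and project each patch into the model's embedding space. All shapes and the random projection below are illustrative assumptions; real models use learned weights and more elaborate encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # a fake 224x224 RGB image
patch = 16                          # 16x16 pixel patches (assumed size)
d_model = 64                        # embedding width (made up for the demo)

# Cut the image into non-overlapping patches and flatten each one
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# In a real model this projection is learned; here it's random
W = rng.standard_normal((patch * patch * 3, d_model))
image_tokens = patches @ W

print(image_tokens.shape)  # (196, 64): 196 image "tokens" of width 64
```

Those 196 vectors can then sit in the same sequence as text token embeddings, which is what lets one model attend over pixels and words together.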
Tool Use & Agents
From answering to doing
Early LLMs could only generate text. Today's models can use tools: search the web, write and run code, call APIs, read files, and even operate your computer. This transforms them from "things that answer questions" into agents that take action. An agent can research a topic across multiple sources, write a report, and email it — all from a single instruction.
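The core of an agent is a simple loop: the model either emits a tool call or a final answer, and the loop executes the tool and feeds the result back. The sketch below illustrates that shape with a stubbed-out model; every name here (`fake_model`, `run_agent`, the message format) is invented for illustration, not a real API.

```python
def fake_model(messages):
    """Stand-in for an LLM call. A real model decides dynamically;
    this stub requests the calculator once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "calculator", "args": {"expr": "17 * 24"}}
    return {"answer": f"The result is {messages[-1]['content']}."}

# Toy tool registry -- eval() is fine for a demo, never for real input
TOOLS = {"calculator": lambda args: str(eval(args["expr"]))}

def run_agent(user_request):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(5):  # cap the loop so a confused model can't run forever
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": result})

print(run_agent("What is 17 * 24?"))  # The result is 408.
```

Swap the stub for a real model and the registry for real tools (search, code execution, email) and you have the basic architecture behind today's agents.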
Reasoning Models
Thinking longer, answering better
Remember the DeepSeek-R1 discovery from the last chapter? That idea has become a major frontier. Reasoning models (like o1, o3, Claude with extended thinking, and DeepSeek-R1) spend more compute at inference time — "thinking" through a chain of steps before answering. For hard math, science, and coding problems, this dramatically improves accuracy. The trade-off: they're slower and more expensive, but they can solve problems that were previously out of reach.
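One way to see why extra inference-time compute helps, without any real model: take a noisy solver that is right only 60% of the time, sample it many times, and majority-vote the answers (a technique known as self-consistency). The solver and numbers below are a toy assumption, not how any production reasoning model works internally.

```python
import random
from collections import Counter

def noisy_solver(rng, correct=408):
    # Right 60% of the time; otherwise a random wrong 3-digit answer
    return correct if rng.random() < 0.6 else rng.randrange(100, 999)

def solve(rng, samples):
    # Spend more compute: draw several answers, return the most common
    votes = Counter(noisy_solver(rng) for _ in range(samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
trials = 1000
one_shot = sum(solve(rng, 1) == 408 for _ in range(trials)) / trials
voted = sum(solve(rng, 15) == 408 for _ in range(trials)) / trials
print(f"1 sample: {one_shot:.0%}   15 samples: {voted:.0%}")
```

The voted accuracy lands far above the single-sample rate: each extra sample is more "thinking" before the final answer, which is the same trade (slower, pricier, more accurate) that reasoning models make.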
Open Source Revolution
Models for everyone
Models like LLaMA, Mistral, and DeepSeek are released openly — anyone can download, modify, and run them. This means AI isn't locked behind a few companies. Researchers, startups, and hobbyists can fine-tune models for their own needs, run them locally for privacy, or build entirely new applications on top. The gap between open and closed models is shrinking fast.
The big picture: LLMs started as text predictors, became conversational assistants, and are now evolving into multimodal agents that can reason, use tools, and take actions in the real world. The core mechanism — predict the next token — hasn't changed. What's changed is what counts as a token (images, audio, tool calls) and how much thinking happens before each prediction.
Final Exercise: Build the Complete Pipeline
One last test: can you put all 12 steps of the LLM pipeline in the right order from memory?