1

Training Data

How internet data is collected and curated for training

To build an AI that understands language, you need an almost unimaginable amount of text. We're not talking about a library — we're talking about a significant chunk of the entire internet.

How much data?

📚 HuggingFace's FineWeb dataset: ~44 TB of text
📖 That's roughly equivalent to: ~10 million books
⏱️ Reading 24/7 at 250 words/min would take: ~100,000 years

FineWeb is one of the largest public training datasets — and it's still just one part of what goes into a model.
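
The reading-time figure can be sanity-checked with a back-of-envelope calculation. The sketch below is only an estimate: `reading_years` is a hypothetical helper, and the 6 bytes per word (about five letters plus a space) is an assumption, not a number from the dataset's documentation.

```python
def reading_years(dataset_bytes: float,
                  bytes_per_word: float = 6.0,      # assumption: ~5 letters + 1 space
                  words_per_minute: float = 250.0) -> float:
    """Estimate years of nonstop reading needed to get through a text dataset."""
    words = dataset_bytes / bytes_per_word
    minutes = words / words_per_minute
    return minutes / (60 * 24 * 365)  # minutes in a 365-day year


# 44 TB taken as 44 * 10^12 bytes
years = reading_years(44e12)
print(f"{years:,.0f} years of nonstop reading")
```

Under these assumptions the answer lands in the tens of thousands of years, the same order of magnitude as the figure above. The result is very sensitive to `bytes_per_word`: a smaller value, such as compressed storage would imply, pushes the estimate higher.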

But you can't just dump raw web pages into a model. The internet is messy — spam, duplicates, broken HTML, toxic content. Data preparation is where most of the work happens.

  • 🌐 Crawl: raw HTML pages
  • 🧹 Extract: strip HTML, ads, and navigation
  • 🔍 Deduplicate: same article 1,000 times? Keep one copy
  • Filter: quality and safety

Most raw data gets thrown away. FineWeb kept only ~15% of the pages Common Crawl collected.
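
The extract, deduplicate, and filter stages described above can be sketched in a few lines of standard-library Python. This is a toy illustration, not FineWeb's actual pipeline: the names `TextExtractor`, `fingerprint`, `looks_ok`, and `run_pipeline` are hypothetical, the crawl step is replaced by a hard-coded list of pages, and deduplication uses exact hashing of normalized text, whereas large-scale pipelines typically rely on fuzzy methods such as MinHash.

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping scripts, styles, and navigation-like tags."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fingerprint(text: str) -> str:
    # Normalize case and whitespace so trivially different copies collide.
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_ok(text: str, min_words: int = 5) -> bool:
    # Stand-in quality filter: drop pages with almost no text.
    return len(text.split()) >= min_words


def run_pipeline(pages):
    seen = set()
    kept = []
    for html in pages:
        text = extract_text(html)      # Extract
        fp = fingerprint(text)
        if fp in seen:                 # Deduplicate
            continue
        seen.add(fp)
        if looks_ok(text):             # Filter
            kept.append(text)
    return kept


# Stand-in for the crawl step: three tiny "pages".
pages = [
    "<html><nav>Home | About</nav><p>Plants convert light energy into chemical energy.</p></html>",
    "<html><p>Plants  convert light energy into chemical energy.</p><script>track()</script></html>",
    "<html><p>Loading...</p></html>",
]
cleaned = run_pipeline(pages)
print(cleaned)
```

Of the three input pages, the second is a duplicate of the first once markup and whitespace are normalized, and the third has too little text, so only one document survives.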

What makes good training data?

Keep:

  • Accurate, factual content
  • Well-written text
  • Educational or informative
  • Multiple languages

Discard:

  • SEO spam / clickbait
  • Toxic or hateful content
  • Garbled HTML / code artifacts
  • Duplicate boilerplate
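
Some of the discard criteria above can be approximated with simple text statistics. The sketch below is a rough heuristic under stated assumptions: `quality_signals`, `keep`, the phrase list, and every threshold are made up for illustration, and real pipelines tune such rules on data or use trained classifiers. Checking accuracy or toxicity is out of reach for rules this simple.

```python
import re

# Hypothetical phrases that suggest page furniture or spam rather than content.
BOILERPLATE_PHRASES = (
    "cookie consent",
    "subscribe to our newsletter",
    "page not found",
    "click here",
)


def quality_signals(text: str) -> dict:
    words = re.findall(r"\w+", text.lower())
    letters = [c for c in text if c.isalpha()]
    lowered = text.lower()
    return {
        # Spam tends to shout: many "!" and many capital letters.
        "exclaim_per_word": text.count("!") / max(len(words), 1),
        "upper_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        # Garbled or repetitive text reuses the same few words.
        "unique_ratio": len(set(words)) / max(len(words), 1),
        "boilerplate_hits": sum(p in lowered for p in BOILERPLATE_PHRASES),
    }


def keep(text: str) -> bool:
    s = quality_signals(text)
    if s["exclaim_per_word"] > 0.2:   # illustrative threshold
        return False
    if s["upper_ratio"] > 0.5:        # illustrative threshold
        return False
    if s["unique_ratio"] < 0.5:       # illustrative threshold
        return False
    if s["boilerplate_hits"] >= 2:    # illustrative threshold
        return False
    return True


samples = [
    "Water boils at 100 degrees Celsius at sea level.",
    "BUY NOW!!! FREE iPhone!!! Click here!!!",
    "the the the the the the the the the Error 404 Page not found",
    "Die Quantenmechanik beschreibt das Verhalten von Teilchen.",
]
for text in samples:
    print("KEEP   " if keep(text) else "DISCARD", text)
```

None of the signals depend on the language of the text, so the German sentence is kept alongside the English one, while the spam and the repetitive debris are discarded.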

🎯 Your turn: try it out!

Review these text samples and decide which to keep for training and which to discard.

Photosynthesis is the process by which plants convert light energy into chemical energy. This process occurs primarily in the chloroplasts of plant cells, using chlorophyll to absorb sunlight.

BUY NOW!!! Best deals on cheap electronics!!! Click here for FREE iPhone!!! Limited time offer!!! You won't believe these prices!!! Act now before it's too late!!!

Die Quantenmechanik beschreibt das Verhalten von Teilchen auf atomarer und subatomarer Ebene. Sie wurde im frühen 20. Jahrhundert entwickelt und revolutionierte unser Verständnis der Physik.

asdf jkl; asdf jkl; the the the the the the the the the. Error 404. Page not found. Cookie consent banner. Subscribe to our newsletter. Loading... Please wait...