What is training data? (Explained for kids and parents)

Updated May 8, 2026 · 360 words

Training data is the collection of examples you show a computer to teach it something. To teach a computer what a cat looks like, the training data is photos of cats, usually thousands of them, each labeled "cat." In general, the bigger and cleaner the training data, the better the AI performs.

How to explain it to a 7-year-old

🧒 "It's the homework you give the computer. You show it lots of examples — like 'this is a cat, this is a dog, this is a fish' — until it learns the difference."

How to explain it to a 14-year-old

🎒 "Training data is the dataset used to teach a model. Each example has an input (a photo) and a label (what's in it). The model guesses, gets corrected, and adjusts. The quality and diversity of the training data are among the biggest factors in how good the model becomes."
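
For curious teens and parents, the guess, get corrected, adjust loop can be sketched in a few lines of Python. This is only a toy illustration, not how any particular product works: the measurements, the cat/rabbit labels, and the function names (predict, train) are all made up for the example, and the simple update rule used here is a classic one called a perceptron.

```python
# Toy training data: each example is (input, label).
# Inputs are invented measurements (body weight in kg, ear length in cm).
# Label 1 means "cat", label 0 means "rabbit".
TRAINING_DATA = [
    ((4.0, 6.0), 1),
    ((5.0, 5.0), 1),
    ((4.5, 7.0), 1),
    ((2.0, 10.0), 0),
    ((1.5, 12.0), 0),
    ((2.5, 11.0), 0),
]


def predict(weights, bias, features):
    """Guess 1 ("cat") if the weighted sum is positive, otherwise 0 ("rabbit")."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(data, epochs=20, learning_rate=0.1):
    """Run the guess / get corrected / adjust loop over the data several times."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            guess = predict(weights, bias, features)  # the model guesses
            error = label - guess                     # it gets corrected
            # It adjusts: a wrong guess nudges the weights toward the right answer.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(TRAINING_DATA)
correct = sum(predict(weights, bias, f) == y for f, y in TRAINING_DATA)
accuracy = correct / len(TRAINING_DATA)
print(f"Correct on {correct} of {len(TRAINING_DATA)} training examples")
```

Notice that the loop never sees anything except the examples in TRAINING_DATA. Change the examples or their labels and the model that comes out changes too.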

A real-world example

When you tag your friend's face in a photo app such as Google Photos, you're giving the system a labeled example, which is exactly what training data is made of. That tag tells the system "this face = this name," and it helps the system recognize that person in other photos.

Why training data quality matters

🟢 Good training data is diverse, balanced, and labeled correctly
🔴 Bad training data is biased, is missing examples, or has wrong labels, and it produces AI that's biased or wrong
⚠️ Famous example: early face-recognition tools were trained mostly on white faces and worked poorly for people with darker skin. A large part of the fix was more diverse training data.
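
One simple way to check whether a dataset is balanced is to count how often each group appears. The sketch below assumes a made-up list of labels and an arbitrary 30% cutoff; the group names, the counts, and the helper names (label_shares, underrepresented) are invented for illustration, and real bias audits look at much more than counts.

```python
from collections import Counter

# Hypothetical labels for a tiny face dataset. The counts are invented.
labels = ["lighter skin"] * 80 + ["darker skin"] * 20


def label_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.3):
    """List labels whose share is below min_share (an arbitrary cutoff)."""
    shares = label_shares(labels)
    return [label for label, share in shares.items() if share < min_share]


shares = label_shares(labels)
flagged = underrepresented(labels)
print(shares)
print("Needs more examples:", flagged)
```

A count like this does not prove a model will be fair, but a very lopsided result is an early warning that some groups may be recognized less accurately.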

Where this comes up in Chippu

Band A explores it through play (a2-1, "How Does AI Learn"). Band C goes into how to spot bias in training data (c3-1).

Related terms

Machine learning — the process that uses training data
Neural network — what training data trains
Bias — what happens when training data isn't diverse enough

Frequently asked questions

Why does training data matter so much?

It's all the AI has to learn from. If training data is biased, missing categories, or poorly labeled, the AI inherits those flaws. A good model trained on bad data still produces bad AI, so the data often matters as much as the algorithm, or more.
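
To see how wrong labels get inherited, here is a small sketch that uses the same simple method twice: once with correct labels and once with two labels entered wrongly. Everything in it is invented for illustration (the ear-length numbers, the test animals, and the helper names nearest_label and accuracy), and the method, copying the label of the closest known example, is just one easy-to-read choice.

```python
def nearest_label(training_data, value):
    """Predict by copying the label of the closest training example."""
    _, label = min(training_data, key=lambda example: abs(example[0] - value))
    return label


def accuracy(training_data, test_set):
    """Fraction of test examples that get the right label."""
    hits = sum(nearest_label(training_data, value) == label
               for value, label in test_set)
    return hits / len(test_set)


# Invented data: ear length in cm -> animal.
clean = [(5.0, "cat"), (6.0, "cat"), (7.0, "cat"),
         (10.0, "rabbit"), (11.0, "rabbit"), (12.0, "rabbit")]

# The same examples, but two labels were entered wrongly.
noisy = [(5.0, "cat"), (6.0, "rabbit"), (7.0, "cat"),
         (10.0, "rabbit"), (11.0, "cat"), (12.0, "rabbit")]

# New animals the model has not seen, with their true labels.
test_set = [(5.1, "cat"), (5.9, "cat"), (6.2, "cat"),
            (10.8, "rabbit"), (11.3, "rabbit"), (12.2, "rabbit")]

clean_accuracy = accuracy(clean, test_set)
noisy_accuracy = accuracy(noisy, test_set)
print(f"Trained on clean labels: {clean_accuracy:.2f}")
print(f"Trained on wrong labels: {noisy_accuracy:.2f}")
```

The code that makes the predictions is identical in both runs. Only the labels changed, and that alone is enough to make the second run get more answers wrong.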

Who makes the training data?

It comes from several sources: content scraped from the public web, examples labeled by hand, data generated synthetically, and contributions from users (a photo you tag can become training data).