The Little Book of llm.c
Version 0.1.1
Contents
Chapter 1 — Orientation
- What llm.c Is (scope, goals, philosophy)
- Repository Tour (folders, files, structure)
- Makefile Targets & Flags (CPU, CUDA, options)
- Quickstart: CPU Reference Path (train_gpt2.c)
- Quickstart: 1-GPU Legacy Path (train_gpt2_fp32.cu)
- Quickstart: Modern CUDA Path (train_gpt2.cu)
- Starter Artifacts & Data Prep (dev/download_starter_pack.sh, dev/data/)
- Debugging Tips & IDE Stepping (-g, gdb, lldb, IDEs)
- Project Constraints & Readability Contract
- Community, Discussions, and Learning Path
Chapter 2 — Data, Tokenization, and Loaders
- GPT-2 Tokenizer Artifacts (gpt2_tokenizer.bin)
- Binary Dataset Format (.bin with header + tokens)
- Dataset Scripts in dev/data/ (Tiny Shakespeare, OpenWebText)
- DataLoader Design (batching, strides, epochs)
- EvalLoader and Validation Workflow
- Sequence Length and Memory Budgeting
- Reproducibility and Seeding Across Runs
- Error Surfaces from Bad Data (bounds, asserts)
- Tokenization Edge Cases (UNKs, EOS, BOS)
- Data Hygiene and Logging
Chapter 3 — Model Definition & Weights
- GPT-2 Config: vocab, layers, heads, channels
- Parameter Tensors and Memory Layout
- Embedding Tables: token + positional
- Attention Stack: QKV projections and geometry
- MLP Block: linear layers + activation
- LayerNorm: theory and implementation (doc/layernorm)
- Residual Streams: skip connections explained
- Loss Head: tied embeddings and logits
- Checkpoint Loading from PyTorch
- Parameter Counting and Sanity Checks
Chapter 4 — CPU Inference (Forward only)
- Forward Pass Walkthrough
- Token and Positional Embedding Lookup
- Attention: matmuls, masking, softmax on CPU
- MLP: GEMMs and activation functions
- LayerNorm on CPU (step-by-step)
- Residual Adds and Signal Flow
- Cross-Entropy Loss on CPU
- Putting It All Together: The gpt2_forward Function
- OpenMP Pragmas for Parallel Loops
- CPU Memory Footprint and Performance
Chapter 5 — Training Loop (CPU Path)
- Skeleton of Training Loop
- AdamW Implementation in C
- Learning Rate Schedulers (cosine, warmup)
- Gradient Accumulation and Micro-Batching
- Logging and Progress Reporting
- Validation Runs in Training Loop
- Checkpointing Parameters and Optimizer State
- Reproducibility and Small Divergences
- Command-Line Flags and Defaults
- Example CPU Training Logs and Outputs
Chapter 6 — Testing, Profiling, & Parity
- Debug State Structs and Their Role
- test_gpt2.c: CPU vs PyTorch
- test_gpt2cu.cu: CUDA vs PyTorch
- Matching Outputs Within Tolerances
- Profiling with profile_gpt2.cu
- Measuring FLOPs and GPU Utilization
- Reproducing Known Loss Curves
- Common CUDA Pitfalls (toolchain, PTX)
- cuDNN FlashAttention Testing (USE_CUDNN)
- From Unit Test to Full Training Readiness
Chapter 7 — CUDA Training Internals (train_gpt2.cu)
- CUDA Architecture Overview (streams, kernels)
- Matrix Multiplication via cuBLAS/cuBLASLt
- Attention Kernels: cuDNN FlashAttention
- Mixed Precision: FP16/BF16 with Master FP32 Weights
- Loss Scaling in Mixed Precision Training
- Activation Checkpointing and Memory Tradeoffs
- GPU Memory Planning: params, grads, states
- Kernel Launch Configurations and Occupancy
- CUDA Error Handling and Debugging
- dev/cuda/: From Simple Kernels to High Performance
Chapter 8 — Multi-GPU & Multi-Node Training
- Data Parallelism in llm.c
- MPI Process Model and GPU Affinity
- NCCL All-Reduce for Gradient Sync
- Building and Running Multi-GPU Trainers
- Multi-Node Bootstrapping with MPI
- SLURM and PMIx Caveats
- Debugging Multi-GPU Hangs and Stalls
- Scaling Stories: GPT-2 124M → 774M → 1.6B
- NCCL Tuning and Overlap Opportunities
- Common Multi-GPU Errors and Fixes
Chapter 9 — Extending the Codebase
- The dev/cuda Library for Custom Kernels
- Adding New Dataset Pipelines (dev/data/*)
- Adding a New Optimizer to the Codebase
- Adding a New Scheduler (cosine, step, etc.)
- Alternative Attention Mechanisms
- Profiling and Testing New Kernels
- Using PyTorch Reference as Oracle
- Exploring Beyond GPT-2: LLaMA Example
- Porting Playbook: C → Go/Rust/Metal
- Keeping the Repo Minimal and Clean
Chapter 10 — Reproductions, Community, and Roadmap
- Reproducing GPT-2 124M on Single Node
- Reproducing GPT-2 355M (constraints and tricks)
- Reproducing GPT-2 774M (scaling up)
- Reproducing GPT-2 1.6B on 8×H100 (24h run)
- CPU-only Fine-Tune Demo (Tiny Shakespeare)
- Cost and Time Estimation for Runs
- Hyperparameter Sweeps (sweep.sh)
- Validating Evaluation and Loss Curves
- Future Work: Kernel Library, Less cuDNN Dependence
- Community, GitHub Discussions, and Suggested Learning Path