The Little Book of llm.c

Version 0.1.1

Author: Duc-Tam Nguyen

Published: September 24, 2025

Contents

Chapter 1 — Orientation

  1. What llm.c Is (scope, goals, philosophy)
  2. Repository Tour (folders, files, structure)
  3. Makefile Targets & Flags (CPU, CUDA, options)
  4. Quickstart: CPU Reference Path (train_gpt2.c)
  5. Quickstart: 1-GPU Legacy Path (train_gpt2_fp32.cu)
  6. Quickstart: Modern CUDA Path (train_gpt2.cu)
  7. Starter Artifacts & Data Prep (dev/download_starter_pack.sh, dev/data/)
  8. Debugging Tips & IDE Stepping (-g, gdb, lldb, IDEs)
  9. Project Constraints & Readability Contract
  10. Community, Discussions, and Learning Path

Chapter 2 — Data, Tokenization, and Loaders

  1. GPT-2 Tokenizer Artifacts (gpt2_tokenizer.bin)
  2. Binary Dataset Format (.bin with header + tokens)
  3. Dataset Scripts in dev/data/ (Tiny Shakespeare, OpenWebText)
  4. DataLoader Design (batching, strides, epochs)
  5. EvalLoader and Validation Workflow
  6. Sequence Length and Memory Budgeting
  7. Reproducibility and Seeding Across Runs
  8. Error Surfaces from Bad Data (bounds, asserts)
  9. Tokenization Edge Cases (UNKs, EOS, BOS)
  10. Data Hygiene and Logging

Chapter 3 — Model Definition & Weights

  1. GPT-2 Config: vocab, layers, heads, channels
  2. Parameter Tensors and Memory Layout
  3. Embedding Tables: token + positional
  4. Attention Stack: QKV projections and geometry
  5. MLP Block: linear layers + activation
  6. LayerNorm: theory and implementation (doc/layernorm)
  7. Residual Streams: skip connections explained
  8. Loss Head: tied embeddings and logits
  9. Checkpoint Loading from PyTorch
  10. Parameter Counting and Sanity Checks

Chapter 4 — CPU Inference (Forward Only)

  1. Forward Pass Walkthrough
  2. Token and Positional Embedding Lookup
  3. Attention: matmuls, masking, softmax on CPU
  4. MLP: GEMMs and activation functions
  5. LayerNorm on CPU (step-by-step)
  6. Residual Adds and Signal Flow
  7. Cross-Entropy Loss on CPU
  8. Putting It All Together: The gpt2_forward Function
  9. OpenMP Pragmas for Parallel Loops
  10. CPU Memory Footprint and Performance

Chapter 5 — Training Loop (CPU Path)

  1. Skeleton of Training Loop
  2. AdamW Implementation in C
  3. Learning Rate Schedulers (cosine, warmup)
  4. Gradient Accumulation and Micro-Batching
  5. Logging and Progress Reporting
  6. Validation Runs in Training Loop
  7. Checkpointing Parameters and Optimizer State
  8. Reproducibility and Small Divergences
  9. Command-Line Flags and Defaults
  10. Example CPU Training Logs and Outputs

Chapter 6 — Testing, Profiling, & Parity

  1. Debug State Structs and Their Role
  2. test_gpt2.c: CPU vs PyTorch
  3. test_gpt2.cu: CUDA vs PyTorch
  4. Matching Outputs Within Tolerances
  5. Profiling with profile_gpt2.cu
  6. Measuring FLOPs and GPU Utilization
  7. Reproducing Known Loss Curves
  8. Common CUDA Pitfalls (toolchain, PTX)
  9. cuDNN FlashAttention Testing (USE_CUDNN)
  10. From Unit Test to Full Training Readiness

Chapter 7 — CUDA Training Internals (train_gpt2.cu)

  1. CUDA Architecture Overview (streams, kernels)
  2. Matrix Multiplication via cuBLAS/cuBLASLt
  3. Attention Kernels: cuDNN FlashAttention
  4. Mixed Precision: FP16/BF16 with Master FP32 Weights
  5. Loss Scaling in Mixed Precision Training
  6. Activation Checkpointing and Memory Tradeoffs
  7. GPU Memory Planning: params, grads, states
  8. Kernel Launch Configurations and Occupancy
  9. CUDA Error Handling and Debugging
  10. dev/cuda/: From Simple Kernels to High Performance

Chapter 8 — Multi-GPU & Multi-Node Training

  1. Data Parallelism in llm.c
  2. MPI Process Model and GPU Affinity
  3. NCCL All-Reduce for Gradient Sync
  4. Building and Running Multi-GPU Trainers
  5. Multi-Node Bootstrapping with MPI
  6. SLURM and PMIx Caveats
  7. Debugging Multi-GPU Hangs and Stalls
  8. Scaling Stories: GPT-2 124M → 774M → 1.6B
  9. NCCL Tuning and Overlap Opportunities
  10. Common Multi-GPU Errors and Fixes

Chapter 9 — Extending the Codebase

  1. The dev/cuda Library for Custom Kernels
  2. Adding New Dataset Pipelines (dev/data/*)
  3. Adding a New Optimizer to the Codebase
  4. Adding a New Scheduler (cosine, step, etc.)
  5. Alternative Attention Mechanisms
  6. Profiling and Testing New Kernels
  7. Using PyTorch Reference as Oracle
  8. Exploring Beyond GPT-2: LLaMA Example
  9. Porting Playbook: C → Go/Rust/Metal
  10. Keeping the Repo Minimal and Clean

Chapter 10 — Reproductions, Community, and Roadmap

  1. Reproducing GPT-2 124M on Single Node
  2. Reproducing GPT-2 355M (constraints and tricks)
  3. Reproducing GPT-2 774M (scaling up)
  4. Reproducing GPT-2 1.6B on 8×H100 (24h run)
  5. CPU-only Fine-Tune Demo (Tiny Shakespeare)
  6. Cost and Time Estimation for Runs
  7. Hyperparameter Sweeps (sweep.sh)
  8. Validating Evaluation and Loss Curves
  9. Future Work: Kernel Library, Less cuDNN Dependence
  10. Community, GitHub Discussions, and Suggested Learning Path