The Little Book of llm.c
Version 0.1.1
Contents
Chapter 1 — Orientation
- What llm.c Is (scope, goals, philosophy)
- Repository Tour (folders, files, structure)
- Makefile Targets & Flags (CPU, CUDA, options)
- Quickstart: CPU Reference Path (train_gpt2.c)
- Quickstart: 1-GPU Legacy Path (train_gpt2_fp32.cu)
- Quickstart: Modern CUDA Path (train_gpt2.cu)
- Starter Artifacts & Data Prep (dev/download_starter_pack.sh, dev/data/)
- Debugging Tips & IDE Stepping (-g, gdb, lldb, IDEs)
- Project Constraints & Readability Contract
- Community, Discussions, and Learning Path
Chapter 2 — Data, Tokenization, and Loaders
- GPT-2 Tokenizer Artifacts (gpt2_tokenizer.bin)
- Binary Dataset Format (.bin with header + tokens)
- Dataset Scripts in dev/data/ (Tiny Shakespeare, OpenWebText)
- DataLoader Design (batching, strides, epochs)
- EvalLoader and Validation Workflow
- Sequence Length and Memory Budgeting
- Reproducibility and Seeding Across Runs
- Error Surfaces from Bad Data (bounds, asserts)
- Tokenization Edge Cases (UNKs, EOS, BOS)
- Data Hygiene and Logging
Chapter 3 — Model Definition & Weights
- GPT-2 Config: vocab, layers, heads, channels
- Parameter Tensors and Memory Layout
- Embedding Tables: token + positional
- Attention Stack: QKV projections and geometry
- MLP Block: linear layers + activation
- LayerNorm: theory and implementation (doc/layernorm)
- Residual Streams: skip connections explained
- Loss Head: tied embeddings and logits
- Checkpoint Loading from PyTorch
- Parameter Counting and Sanity Checks
Chapter 4 — CPU Inference (Forward only)
- Forward Pass Walkthrough
- Token and Positional Embedding Lookup
- Attention: matmuls, masking, softmax on CPU
- MLP: GEMMs and activation functions
- LayerNorm on CPU (step-by-step)
- Residual Adds and Signal Flow
- Cross-Entropy Loss on CPU
- Putting It All Together: The gpt2_forward Function
- OpenMP Pragmas for Parallel Loops
- CPU Memory Footprint and Performance
Chapter 5 — Training Loop (CPU Path)
- Skeleton of Training Loop
- AdamW Implementation in C
- Learning Rate Schedulers (cosine, warmup)
- Gradient Accumulation and Micro-Batching
- Logging and Progress Reporting
- Validation Runs in Training Loop
- Checkpointing Parameters and Optimizer State
- Reproducibility and Small Divergences
- Command-Line Flags and Defaults
- Example CPU Training Logs and Outputs
Chapter 6 — Testing, Profiling, & Parity
- Debug State Structs and Their Role
- test_gpt2.c: CPU vs PyTorch
- test_gpt2cu.cu: CUDA vs PyTorch
- Matching Outputs Within Tolerances
- Profiling with profile_gpt2.cu
- Measuring FLOPs and GPU Utilization
- Reproducing Known Loss Curves
- Common CUDA Pitfalls (toolchain, PTX)
- cuDNN FlashAttention Testing (USE_CUDNN)
- From Unit Test to Full Training Readiness
Chapter 7 — CUDA Training Internals (train_gpt2.cu)
- CUDA Architecture Overview (streams, kernels)
- Matrix Multiplication via cuBLAS/cuBLASLt
- Attention Kernels: cuDNN FlashAttention
- Mixed Precision: FP16/BF16 with Master FP32 Weights
- Loss Scaling in Mixed Precision Training
- Activation Checkpointing and Memory Tradeoffs
- GPU Memory Planning: params, grads, states
- Kernel Launch Configurations and Occupancy
- CUDA Error Handling and Debugging
- dev/cuda/: From Simple Kernels to High Performance
Chapter 8 — Multi-GPU & Multi-Node Training
- Data Parallelism in llm.c
- MPI Process Model and GPU Affinity
- NCCL All-Reduce for Gradient Sync
- Building and Running Multi-GPU Trainers
- Multi-Node Bootstrapping with MPI
- SLURM and PMIx Caveats
- Debugging Multi-GPU Hangs and Stalls
- Scaling Stories: GPT-2 124M → 774M → 1.6B
- NCCL Tuning and Overlap Opportunities
- Common Multi-GPU Errors and Fixes
Chapter 9 — Extending the Codebase
- The dev/cuda Library for Custom Kernels
- Adding New Dataset Pipelines (dev/data/*)
- Adding a New Optimizer to the Codebase
- Adding a New Scheduler (cosine, step, etc.)
- Alternative Attention Mechanisms
- Profiling and Testing New Kernels
- Using PyTorch Reference as Oracle
- Exploring Beyond GPT-2: LLaMA Example
- Porting Playbook: C → Go/Rust/Metal
- Keeping the Repo Minimal and Clean
Chapter 10 — Reproductions, Community, and Roadmap
- Reproducing GPT-2 124M on Single Node
- Reproducing GPT-2 355M (constraints and tricks)
- Reproducing GPT-2 774M (scaling up)
- Reproducing GPT-2 1.6B on 8×H100 (24h run)
- CPU-only Fine-Tune Demo (Tiny Shakespeare)
- Cost and Time Estimation for Runs
- Hyperparameter Sweeps (sweep.sh)
- Validating Evaluation and Loss Curves
- Future Work: Kernel Library, Less cuDNN Dependence
- Community, GitHub Discussions, and Suggested Learning Path