Chapter 9. Systems, Databases and Distributed Algorithms
Section 81. Concurrency Control
801 Two-Phase Locking (2PL)
Two-Phase Locking (2PL) is the cornerstone of concurrency control in databases. It ensures that multiple transactions can run safely together without violating consistency. The core idea is simple: a transaction first acquires all the locks it needs, then releases them only after it’s done. Once it starts unlocking, it can’t lock anything new again; that’s what gives it two distinct “phases.”
What Problem Are We Solving?
In a database, transactions often run in parallel. Without coordination, they can interfere with each other:
One transaction reads stale data
Another overwrites uncommitted changes
Or two transactions deadlock trying to grab each other’s locks
We need a way to serialize concurrent transactions, ensuring results are the same as if they had run one after another.
That’s where 2PL comes in. It guarantees conflict-serializable schedules, meaning no race conditions or interleaving chaos.
How Does It Work (Plain Language)?
Imagine a transaction as a careful shopper:
Growing Phase – Grab all the items (locks) you’ll need.
Shrinking Phase – Once you start putting items back (releasing locks), you can’t grab any more.
This two-phase rule ensures order: no transaction can “sneak in” between lock changes to break serializability.
Let’s walk through an example:
| Step | Transaction A | Transaction B | Locks Held | Notes |
|------|---------------|---------------|------------|-------|
| 1 | Lock(X) | – | A:X | A starts growing |
| 2 | – | Lock(Y) | A:X, B:Y | B starts growing |
| 3 | Lock(Y) | – | A:X,Y, B:Y | Conflict, wait |
| 4 | Unlock(X), Unlock(Y) | – | – | A finishes (shrinking) |
| 5 | – | Lock(X) | B:X,Y | B continues |
Once A starts unlocking, it can’t lock again. That’s the “two phases”: acquire, then release.
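To make the rule concrete, here is a minimal C sketch (an assumption, not from the original text) of one transaction's lock discipline: once any lock is released, a `shrinking` flag is set and later acquisitions are rejected. The `Txn` type and function names are illustrative.

```c
#include <stdio.h>
#include <stdbool.h>

/* Minimal sketch of the 2PL phase rule for a single transaction:
   after the first release, no new locks may be acquired. */
typedef struct {
    bool shrinking;          /* becomes true after the first release */
} Txn;

bool acquire(Txn *t) {
    if (t->shrinking) {
        printf("Cannot acquire lock after release phase!\n");
        return false;
    }
    printf("Lock acquired.\n");
    return true;
}

void release(Txn *t) {
    t->shrinking = true;     /* shrinking phase begins */
    printf("Lock released.\n");
}

int main(void) {
    Txn t = { false };
    acquire(&t);             /* growing phase: Lock(X) */
    release(&t);             /* shrinking phase: Unlock(X) */
    acquire(&t);             /* attempted Lock(Y): rejected by the 2PL rule */
    return 0;
}
```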
Output:

Lock acquired.
Lock released.
Cannot acquire lock after release phase!
Why It Matters
Ensures Serializability: All schedules are equivalent to some serial order
Foundational Principle: Basis for stricter variants like Strict 2PL and Conservative 2PL
Prevents Dirty Reads/Writes: Guarantees consistency under concurrency
Widely Used: Core in relational databases (MySQL, PostgreSQL)
Types of 2PL
| Variant | Rule | Benefit |
|---------|------|---------|
| Basic 2PL | Acquire then release | Serializability |
| Strict 2PL | Hold locks until commit | Avoids cascading aborts |
| Conservative 2PL | Lock all at once | Deadlock-free |
Try It Yourself
Simulate two transactions with overlapping data (X, Y) and apply 2PL.
Draw a lock timeline: when each transaction acquires/releases locks.
Compare results if locks were not used.
Add Strict 2PL: hold locks until commit, what changes?
Test Cases
| Scenario | Behavior | Result |
|----------|----------|--------|
| T1 locks X → T2 locks Y → T1 locks Y | Deadlock (waits for Y) | Conflict |
| T1 locks X, Y → releases → T2 locks X | OK | Serializable |
| T1 unlocks X, then locks Y | Invalid (violates 2PL) | Error |
| All locks held until commit (Strict 2PL) | Safe | Serializable |
Complexity
Time: O(n) per transaction (lock/unlock operations)
Space: O(#locks) for lock table
Two-Phase Locking is your guardrail for concurrency: it keeps transactions from trampling over each other, ensuring every result is consistent, predictable, and correct.
802 Strict Two-Phase Locking (Strict 2PL)
Strict Two-Phase Locking (Strict 2PL) is a stronger version of 2PL designed to simplify recovery and prevent cascading aborts. It follows the same two-phase rule (grow, then shrink) but with one important twist: no locks are released until the transaction commits or aborts.
What Problem Are We Solving?
Basic 2PL ensures serializability, but it still allows a subtle problem: if one transaction reads data written by another, still-uncommitted transaction and that writer later aborts, we’re left with dirty reads. Other transactions might have built on data that never should have existed.
Strict 2PL solves this by delaying all unlocks until the transaction ends. That way, no other transaction can read or write a value that’s not fully committed.
How Does It Work (Plain Language)?
Picture a cautious chef preparing a dish:
During the growing phase, the chef grabs all the ingredients (locks).
In the shrinking phase, they release everything, but only after serving (commit or abort).
This ensures no one tastes (reads) or borrows (writes) from a half-finished meal.
Let’s compare behaviors:
| Transaction | Step | Action | Lock Held | Notes |
|-------------|------|--------|-----------|-------|
| T1 | 1 | Lock(X) | X | Acquires lock |
| T2 | 2 | Lock(X) | Wait | Must wait for T1 |
| T1 | 3 | Write(X), Commit | X | Still holds lock |
| T1 | 4 | Unlock(X) | – | Unlocks only at commit |
| T2 | 5 | Lock(X), Read(X) | X | Reads committed value |
No transaction can see uncommitted data, guaranteeing safety even if a crash happens mid-run.
Why It Matters
Prevents Cascading Aborts: uncommitted data is never read.
Simplifies Recovery: rollback only affects the failed transaction.
Ensures Strict Schedules: all reads/writes follow commit order.
Industry Standard: used in major DBMS engines for ACID safety.
Example Timeline
| Time | T1 Action | T2 Action | Shared Data | Notes |
|------|-----------|-----------|-------------|-------|
| 1 | Lock(X), Write(X=5) | – | X=5 (locked) | T1 owns X |
| 2 | – | Read(X)? | Wait | T2 must wait |
| 3 | Commit | – | X=5 (committed) | Safe |
| 4 | Unlock(X) | – | – | Lock freed |
| 5 | – | Read(X=5) | OK | T2 reads clean data |
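The timeline above can be simulated with a small C sketch (an illustration under assumed names, not a real lock manager): the lock on X is released only inside `commit()`, so T2 cannot read X until T1 has committed.

```c
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

/* Sketch of the strict rule: locks are held until commit. */
typedef struct {
    const char *owner;       /* NULL when the lock is free */
    int value;
} Item;

bool lock(Item *x, const char *txn) {
    if (x->owner != NULL && strcmp(x->owner, txn) != 0) {
        printf("%s must wait: %s still holds the lock\n", txn, x->owner);
        return false;
    }
    x->owner = txn;
    return true;
}

void commit(Item *x, const char *txn) {
    if (x->owner && strcmp(x->owner, txn) == 0) {
        x->owner = NULL;     /* locks are released only at commit */
        printf("%s commits and releases its lock\n", txn);
    }
}

int main(void) {
    Item X = { NULL, 0 };

    lock(&X, "T1");
    X.value = 5;             /* T1 writes X = 5 and keeps the lock */

    lock(&X, "T2");          /* T2 must wait: no dirty read possible */
    commit(&X, "T1");        /* commit releases the lock */

    if (lock(&X, "T2"))
        printf("T2 reads committed X = %d\n", X.value);
    return 0;
}
```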
Try It Yourself
Simulate T1 writing X=10 and T2 reading X before T1 commits.
Show that Strict 2PL blocks T2.
Add a rollback before commit, confirm T2 never reads dirty data.
Visualize the lock table (resource → owner).
Compare with basic 2PL, what happens if T1 releases early?
Test Cases
| Scenario | Strict 2PL Behavior | Outcome |
|----------|---------------------|---------|
| T1 writes X, T2 reads X before commit | T2 waits | No dirty read |
| T1 aborts after T2 reads | Impossible | Safe |
| T1 unlocks before commit (violating rule) | Error | Inconsistent |
| All locks released on commit | OK | Serializable + Recoverable |
Complexity
Time: O(n) (per lock/unlock)
Space: O(#locked items)
Strict 2PL trades a bit of concurrency for guaranteed safety. It’s the gold standard for ACID compliance: all reads are clean, all writes durable, all schedules strict.
803 Conservative Two-Phase Locking (C2PL)
Conservative Two-Phase Locking (C2PL) takes the 2PL principle one step further: it prevents deadlocks entirely by forcing a transaction to lock everything it needs at the very start, before doing any work. If even one lock isn’t available, the transaction waits instead of partially locking and risking circular waits.
What Problem Are We Solving?
In basic 2PL, transactions grab locks as they go. That flexibility is convenient but risky: if two transactions grab resources in different orders, they can deadlock (each waiting on the other forever).
Example deadlock:
T1 locks X, wants Y
T2 locks Y, wants X
Both wait forever, a classic standoff.
Conservative 2PL avoids this mess by planning ahead. It says: “If you can’t get all your locks now, don’t start.” It waits early rather than late, trading throughput for certainty.
How Does It Work (Plain Language)?
Think of it like a chess game: before you move, you must claim all the pieces you’ll touch this turn. If any piece is taken, you sit out and try again later.
Steps:
Declare all locks needed (e.g., {X, Y}).
Request them all at once.
If all granted, run the transaction.
If any denied, release and wait.
Release all locks only after commit or abort.
This “all-or-nothing” lock acquisition ensures no circular wait because no transaction partially holds anything.
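Below is a rough C sketch of this all-or-nothing acquisition, assuming a tiny in-memory lock table; the names (`acquire_all`, `release_all`) are illustrative rather than any real DBMS API.

```c
#include <stdio.h>
#include <stdbool.h>

/* Conservative (static) 2PL sketch: a transaction requests its whole
   lock set up front; if any item is held, it takes nothing and waits. */
#define ITEMS 3
const char *names[ITEMS] = {"X", "Y", "Z"};
int holder[ITEMS];               /* 0 = free, otherwise transaction id */

bool acquire_all(int txn, const int *wanted, int n) {
    for (int i = 0; i < n; i++)              /* check phase */
        if (holder[wanted[i]] != 0 && holder[wanted[i]] != txn) {
            printf("T%d waits: %s is held by T%d\n",
                   txn, names[wanted[i]], holder[wanted[i]]);
            return false;                    /* take nothing */
        }
    for (int i = 0; i < n; i++)              /* grant phase */
        holder[wanted[i]] = txn;
    printf("T%d acquired all locks\n", txn);
    return true;
}

void release_all(int txn) {
    for (int i = 0; i < ITEMS; i++)
        if (holder[i] == txn) holder[i] = 0;
    printf("T%d released its locks\n", txn);
}

int main(void) {
    int t1_set[] = {0, 1};       /* T1 wants {X, Y} */
    int t2_set[] = {1, 2};       /* T2 wants {Y, Z} */

    acquire_all(1, t1_set, 2);   /* granted */
    acquire_all(2, t2_set, 2);   /* waits: Y is held */
    release_all(1);              /* T1 commits */
    acquire_all(2, t2_set, 2);   /* retry succeeds */
    return 0;
}
```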
| Step | Transaction | Action | Lock Table | Notes |
|------|-------------|--------|------------|-------|
| 1 | T1 requests {X, Y} | Granted | X:T1, Y:T1 | T1 can proceed |
| 2 | T2 requests {Y, Z} | Waits | Y locked | Avoids partial lock |
| 3 | T1 commits, releases | Free | – | Locks cleared |
| 4 | T2 retries {Y, Z} | Granted | Y:T2, Z:T2 | Safe |
No deadlock is possible: every transaction either gets everything or nothing.
Why It Matters
Deadlock-Free: no transaction ever blocks another mid-lock.
Predictable Behavior: transactions either run or wait.
Safe Scheduling: ideal for real-time or critical systems.
Simple Recovery: fewer mid-flight dependencies.
Trade-off: less concurrency; waiting happens upfront, even if resources would have freed up later.
Comparison Table
| Feature | Basic 2PL | Strict 2PL | Conservative 2PL |
|---------|-----------|------------|------------------|
| Serializability | ✅ | ✅ | ✅ |
| Cascading Abort Prevention | ❌ | ✅ | ✅ |
| Deadlock Prevention | ❌ | ❌ | ✅ |
| Lock Timing | As needed | Hold till commit | All at start |
Try It Yourself
Simulate two transactions (T1:{X,Y}, T2:{Y,X}).
Try with basic 2PL → deadlock.
Try with C2PL → one waits early, no deadlock.
Build a lock request queue for each resource.
Experiment with partial lock denial → transaction retries.
Test Cases
| Scenario | Lock Requests | Result |
|----------|---------------|--------|
| T1:{X,Y}, T2:{Y,X} | All-or-nothing | No deadlock |
| T1:{A,B,C} granted, T2:{B} waits | Ordered access | Safe |
| Partial lock grant | Denied | Wait |
| All locks free | Granted | Run immediately |
Complexity
Time: O(n) per lock request (checking availability)
Space: O(#resources × #transactions)
Conservative 2PL is your peacekeeper: by thinking ahead, it avoids the traps of mid-flight contention. It’s cautious, yes, but in systems where predictability matters more than speed, it’s a wise choice.
804 Timestamp Ordering (TO)
Timestamp Ordering (TO) is a non-locking concurrency control method that orders all transactions by timestamps. Instead of locks, it uses logical time to ensure that the result of concurrent execution is equivalent to some serial order based on when transactions started.
What Problem Are We Solving?
Lock-based protocols like 2PL prevent conflicts by blocking transactions, which can lead to deadlocks or waiting. Timestamp ordering avoids that.
The idea: each transaction gets a timestamp when it begins. Every read or write must respect that order. If an operation would violate it, the transaction rolls back and restarts.
So rather than blocking, TO says:
“If you’re too late, you restart. No waiting in line.”
How Does It Work (Plain Language)?
Think of a library checkout system where every reader has a ticket number. You can only borrow or return a book if your number fits the time order; if you come late but try to rewrite history, the librarian (scheduler) denies your request.
Each data item X keeps:
RT(X): Read Timestamp (largest timestamp that read X)
WT(X): Write Timestamp (largest timestamp that wrote X)
When a transaction T with timestamp TS(T) tries to read or write, we compare timestamps:
| Operation | Condition | Action |
|-----------|-----------|--------|
| Read(X) | If TS(T) < WT(X) | Abort (too late, stale data) |
| Write(X) | If TS(T) < RT(X) or TS(T) < WT(X) | Abort (conflict) |
| Otherwise | Safe | Execute and update timestamps |
No locks, no waits, just immediate validation against logical time.
Example Walkthrough
| Step | Transaction | Operation | Condition | Action | Notes |
|------|-------------|-----------|-----------|--------|-------|
| 1 | T1 (TS=5) | Write(X) | OK | WT(X)=5 | Writes X |
| 2 | T2 (TS=10) | Read(X) | 10 > 5 | OK | Reads X |
| 3 | T1 (TS=5) | Read(Y) | OK | RT(Y)=5 | Reads Y |
| 4 | T2 (TS=10) | Write(X) | 10 ≥ RT(X)=10 | OK | WT(X)=10 |
| 5 | T1 (TS=5) | Write(X) | 5 < WT(X)=10 | Abort | Too late |
T1 tries to write after a newer transaction has modified X; this is not allowed.
Why It Matters
Optimistic: transactions proceed freely and are validated on each access.
Good for Read-Mostly Workloads: fewer conflicts, more throughput.
Drawback: high abort rates if many concurrent writes conflict on the same data.
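A compact C sketch of the basic TO checks, using the RT/WT rules from the table above; the `Item` struct and function names are assumptions made for the illustration.

```c
#include <stdio.h>
#include <stdbool.h>

/* Basic timestamp-ordering checks for a single item X. */
typedef struct {
    int value;
    int RT;    /* largest timestamp that read the item  */
    int WT;    /* largest timestamp that wrote the item */
} Item;

bool to_read(Item *x, int ts) {
    if (ts < x->WT) {                      /* would read a "future" write */
        printf("T(ts=%d) read rejected: abort\n", ts);
        return false;
    }
    if (ts > x->RT) x->RT = ts;
    printf("T(ts=%d) reads X=%d\n", ts, x->value);
    return true;
}

bool to_write(Item *x, int ts, int val) {
    if (ts < x->RT || ts < x->WT) {        /* conflicts with newer operations */
        printf("T(ts=%d) write rejected: abort\n", ts);
        return false;
    }
    x->value = val;
    x->WT = ts;
    printf("T(ts=%d) writes X=%d\n", ts, val);
    return true;
}

int main(void) {
    Item X = {0, 0, 0};
    to_write(&X, 5, 1);    /* T1 (TS=5) writes X       */
    to_read(&X, 10);       /* T2 (TS=10) reads X       */
    to_write(&X, 10, 2);   /* T2 (TS=10) writes X      */
    to_write(&X, 5, 3);    /* T1 (TS=5) writes: abort  */
    return 0;
}
```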
Variants
| Variant | Description | Use Case |
|---------|-------------|----------|
| Basic TO | Check timestamps at each operation | Simple databases |
| Thomas Write Rule | Ignore obsolete writes instead of aborting | Reduces aborts |
| Multiversion TO | Combine with snapshots (MVCC) | Modern systems (e.g., PostgreSQL) |
Try It Yourself
Assign timestamps to T1=5, T2=10.
Let T1 write X, then T2 write X, allowed.
Now T1 tries to write X again, should abort.
Add a third transaction T3 (TS=15), reading and writing, trace timestamp updates.
Compare results with 2PL, how do waiting and aborting differ?
Test Cases
| Scenario | Condition | Outcome |
|----------|-----------|---------|
| T1 (TS=5) reads after T2 (TS=10) writes | 5 < WT(X)=10 | Abort |
| T2 (TS=10) writes after T1 (TS=5) reads | 10 > RT(X)=5 | OK |
| T1 (TS=5) writes after T2 (TS=10) writes | 5 < WT(X)=10 | Abort |
| T2 (TS=10) reads X written by T1 (TS=5) | 10 > WT(X)=5 | OK |
Complexity
Time: O(1) per access (timestamp check)
Space: O(#items) for RT and WT
Timestamp Ordering swaps waiting for rewinding: transactions race ahead but may be rolled back if they arrive out of order. It’s an elegant balance between optimism and order, well suited to systems that favor speed over contention.
805 Multiversion Concurrency Control (MVCC)
Multiversion Concurrency Control (MVCC) is a snapshot-based concurrency method that lets readers and writers coexist peacefully. Instead of blocking each other with locks, every write creates a new version of the data, and every reader sees a consistent snapshot of the database as of when it started.
What Problem Are We Solving?
In traditional locking schemes, readers block writers and writers block readers, slowing down workloads that mix reads and writes.
MVCC flips the script. Readers don’t block writers because they read old committed versions, and writers don’t block readers because they write new versions.
The result: high concurrency, no dirty reads, and a consistent view for each transaction.
How Does It Work (Plain Language)?
Imagine a library where no one fights over a single copy. Each time a writer updates a book, they make a new edition. Readers keep reading the edition they checked out when they entered.
Each version has:
WriteTS – when it was written
ValidFrom, ValidTo – version’s time range
Data value
When a transaction starts, it gets a snapshot timestamp (its view of time).
Readers see the latest version whose WriteTS ≤ snapshot.
Writers create new versions at commit time, marking older ones as expired.
Example Walkthrough
| Step | Transaction | Operation | Version Table (X) | Visible To |
|------|-------------|-----------|-------------------|------------|
| 1 | T1 (TS=5) | Write X=10 | X₁: {value=10, WriteTS=5} | All TS ≥ 5 |
| 2 | T2 (TS=8) | Read X | Sees X₁ (TS=5) | OK |
| 3 | T3 (TS=12) | Write X=20 | X₂: {value=20, WriteTS=12} | All TS ≥ 12 |
| 4 | T2 (TS=8) | Read X again | Still X₁ | Snapshot isolation |
| 5 | T2 commits | – | – | Consistent snapshot |
Even if T3 writes new data, T2 keeps seeing its old snapshot, no inconsistency, no blocking.
Tiny Code (Conceptual Example)
C (Simplified Version Table)
```c
#include <stdio.h>

typedef struct {
    int value;
    int writeTS;
} Version;

Version versions[10];
int version_count = 0;

/* Append a new version; older versions remain readable. */
void write_value(int ts, int val) {
    versions[version_count].value = val;
    versions[version_count].writeTS = ts;
    version_count++;
    printf("Write: X=%d at TS=%d\n", val, ts);
}

/* Return the latest version visible to a snapshot timestamp. */
int read_value(int ts) {
    int visible = -1;
    for (int i = 0; i < version_count; i++) {
        if (versions[i].writeTS <= ts) visible = i;
    }
    if (visible < 0) {
        printf("Read: no visible version (TS=%d)\n", ts);
        return -1;
    }
    printf("Read: X=%d (TS=%d)\n", versions[visible].value, ts);
    return versions[visible].value;
}

int main(void) {
    write_value(5, 10);
    write_value(12, 20);
    read_value(8);    // sees the version written at TS=5
    read_value(15);   // sees the version written at TS=12
    return 0;
}
```
Why It Matters
Readers never block: they read consistent snapshots.
Writers never block readers: they add new versions.
Try It Yourself
Let T2 read before and after T3’s write: the snapshot stays stable.
Add garbage collection: remove versions with WriteTS < min(active TS).
Compare with locking: what’s different in behavior and concurrency?
Test Cases
| Scenario | Behavior | Result |
|----------|----------|--------|
| Reader starts before writer | Reads old version | Consistent |
| Writer starts before reader | Reader sees writer only if committed | No dirty read |
| Concurrent writes | New version chain | Conflict detection |
| Long-running read | Snapshot stays fixed | Repeatable reads |
Complexity
Time: O(#versions per item) to find visible version
Space: O(total versions) until garbage collected
MVCC is like a time-traveling database: every transaction gets its own consistent world. By turning conflict into coexistence, it powers the high-performance, non-blocking systems behind modern relational and distributed databases.
806 Optimistic Concurrency Control (OCC)
Optimistic Concurrency Control (OCC) assumes that conflicts are rare, so transactions can run without locks, and only check for conflicts at the end, during validation. If no conflicts are found, the transaction commits; if conflicts exist, it rolls back and retries.
What Problem Are We Solving?
Locking (like 2PL) prevents conflicts by blocking access, but that means waiting and deadlocks. In read-heavy workloads where collisions are infrequent, that’s wasteful.
OCC flips the mindset:
“Let everyone run freely. We’ll check for trouble at the finish line.”
This approach maximizes concurrency, especially in low-contention systems, by separating execution from validation.
How Does It Work (Plain Language)?
Think of a group project where everyone edits their own copy, then at the end, a teacher compares notes. If no two people changed the same part, all merges succeed; if not, someone has to redo.
OCC runs each transaction through three phases:
| Phase | Description |
|-------|-------------|
| 1. Read Phase | Transaction reads data, makes local copies, computes changes. |
| 2. Validation Phase | Before committing, check for conflicts with committed transactions. |
| 3. Write Phase | If validation passes, apply updates atomically. Otherwise, abort. |
No locks are used while reading or writing locally, only a validation check before commit decides success.
Example Walkthrough
| Step | Transaction | Phase | Action | Result |
|------|-------------|-------|--------|--------|
| 1 | T1 | Read | Read X=5 | Local copy |
| 2 | T2 | Read | Read X=5 | Local copy |
| 3 | T1 | Compute | X=5+1 | Local change (6) |
| 4 | T2 | Compute | X=5+2 | Local change (7) |
| 5 | T1 | Validate | No conflict (T2 not committed) | Commit X=6 |
| 6 | T2 | Validate | Conflict: X changed since read | Abort and retry |
Both worked in parallel; T2 must retry since its read set overlapped with a changed item.
Tiny Code (Conceptual Simulation)
C (Simple OCC Example)
```c
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int value;
    int version;
} DataItem;

/* Validation: the item must not have changed since it was read. */
bool validate(DataItem *item, int readVersion) {
    return item->version == readVersion;
}

/* Write phase: apply the local copy only if validation passes. */
bool commit(DataItem *item, int *localValue, int readVersion) {
    if (!validate(item, readVersion)) {
        printf("Abort: data changed by another transaction.\n");
        return false;
    }
    item->value = *localValue;
    item->version++;
    printf("Commit successful. New value = %d\n", item->value);
    return true;
}

int main(void) {
    DataItem X = {5, 1};

    int local = X.value;        /* T1 reads X at version 1 */
    local += 1;
    commit(&X, &local, 1);      // OK

    int local2 = X.value;       /* T2 computes from its read at version 1 */
    local2 += 2;
    commit(&X, &local2, 1);     // Abort (version changed)
    return 0;
}
```
Why It Matters
High Concurrency: no locks during execution.
No Deadlocks: transactions don’t block each other.
Ideal for Read-Heavy Workloads: conflicts are rare, so validation almost always passes.
Clear Validation Logic: easy to reason about correctness.
Trade-off: wasted work under high conflict; transactions may retry repeatedly.
Validation Rules (Simplified)
Each transaction T has:
Read Set (RS) – items read
Write Set (WS) – items written
TS(T) – timestamp
At commit, T passes validation if for every committed Tᵢ:
Tᵢ finishes before T starts, or
RS(T) ∩ WS(Tᵢ) = ∅ (no overlap)
If conflict found → abort and retry.
Try It Yourself
Run two transactions reading X, both writing new values.
Validate sequentially, see which one passes.
Add a third read-only transaction, should always pass.
Vary overlap between read/write sets to test conflict detection.
Test Cases
| Scenario | Conflict | Outcome |
|----------|----------|---------|
| Two transactions read same data, one writes | No | Both commit |
| Two write same data | Yes | One aborts |
| Read-only transactions | None | Always commit |
| High contention | Frequent | Many retries |
Complexity
Time: O(#active transactions) during validation
Space: O(#read/write sets) per transaction
Optimistic Concurrency Control is “trust but verify” for databases: let transactions race ahead, then double-check before sealing the deal. In workloads where contention is rare, OCC shines with near-lock-free performance and clean serializable results.
807 Serializable Snapshot Isolation (SSI)
Serializable Snapshot Isolation (SSI) is a hybrid concurrency control scheme that merges the speed of MVCC with the safety of full serializability. It builds on snapshot isolation (SI), where every transaction sees a consistent snapshot, and adds conflict detection to prevent anomalies that SI alone can’t catch.
What Problem Are We Solving?
Snapshot Isolation (like in MVCC) avoids dirty reads and non-repeatable reads, but it is not fully serializable. It can still produce write skew anomalies, where two transactions read overlapping data and write disjoint but conflicting updates.
Example of write skew:
T1: reads X, Y → updates X
T2: reads X, Y → updates Y.
Both think the condition holds and commit, but together they break an invariant (e.g., “X + Y ≥ 1”).
SSI fixes this by detecting dangerous dependency patterns and aborting transactions that would violate serializability.
How Does It Work (Plain Language)?
Imagine every transaction walks through the database wearing snapshot glasses. They see the world as it was when they started. If two walkers make changes that can’t coexist in any real order, one gets stopped at the gate.
Steps:
Read Phase – Transaction reads from its snapshot and records dependencies.
Write Phase – Tentative writes are stored, visible only after validation.
Commit Phase – Check the recorded dependencies; if a dangerous pattern (an rw-antidependency cycle) is found, abort one transaction.
Why It Matters
Anomaly-Free – prevents write skew, phantoms, and dangerous cycles.
Used by PostgreSQL – default for the SERIALIZABLE isolation level.
Trade-off: needs dependency tracking and conflict analysis, which add overhead.
SSI Dependency Types
| Dependency | Description |
|------------|-------------|
| rw-conflict | T1 reads X, T2 later writes X |
| wr-conflict | T1 writes X, T2 later reads X |
| ww-conflict | Both write the same X |
SSI looks for rw-cycles (T1 → T2 → T3 → T1) as signs of non-serializability.
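The following C sketch is a deliberately simplified illustration of that check: it compares the read and write sets of two concurrent transactions and flags a two-way rw-antidependency, the write-skew pattern. Real engines such as PostgreSQL track this with SIREAD locks and a full dependency graph; the `Txn` struct and `rw_edge` function here are invented for the example.

```c
#include <stdio.h>
#include <stdbool.h>

#define ITEMS 2                   /* item 0 = X, item 1 = Y */

typedef struct {
    bool reads[ITEMS];
    bool writes[ITEMS];
} Txn;

/* true if something `from` read is written by `to` (rw-antidependency) */
bool rw_edge(const Txn *from, const Txn *to) {
    for (int i = 0; i < ITEMS; i++)
        if (from->reads[i] && to->writes[i]) return true;
    return false;
}

int main(void) {
    Txn t1 = { .reads = {true, true}, .writes = {true, false} };  /* reads X,Y; writes X */
    Txn t2 = { .reads = {true, true}, .writes = {false, true} };  /* reads X,Y; writes Y */

    bool t1_to_t2 = rw_edge(&t1, &t2);
    bool t2_to_t1 = rw_edge(&t2, &t1);

    if (t1_to_t2 && t2_to_t1)
        printf("Dangerous rw-cycle detected: abort one transaction\n");
    else
        printf("No dangerous structure: both may commit\n");
    return 0;
}
```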
Try It Yourself
Simulate two transactions reading overlapping data and writing disjoint updates.
Check if invariant (e.g., X + Y ≥ 1) still holds after both commit.
Add conflict detection logic, abort one when cycle found.
Compare with plain SI, see anomaly disappear under SSI.
Test Cases
| Scenario | Isolation | Outcome |
|----------|-----------|---------|
| Write skew (X+Y ≥ 1) | SI | Violated |
| Write skew (X+Y ≥ 1) | SSI | Prevented (abort) |
| Concurrent readers only | SSI | No aborts |
| Overlapping writes | SSI | Conflict → abort one |
Complexity
Time: O(#dependencies) per validation
Space: O(#active transactions × #reads/writes)
Serializable Snapshot Isolation is like giving every transaction a time bubble, then checking afterward if those bubbles can line up without overlapping in illegal ways. It delivers serializable safety with snapshot performance, a modern best-of-both-worlds solution for databases.
808 Lock-Free Algorithm
Lock-Free algorithms are the superheroes of concurrency: they coordinate threads without using locks, avoiding deadlocks, priority inversion, and context-switch overhead. Instead of mutual exclusion, they rely on atomic operations (like Compare-and-Swap) to ensure correctness even when many threads race ahead together.
What Problem Are We Solving?
In traditional concurrency, locks are used to protect shared data. But locks can cause:
Deadlocks – when threads wait on each other forever
Starvation – some threads never get a chance
Blocking delays – a slow or paused thread holds everyone back
Lock-free algorithms fix this by ensuring progress: at least one thread always makes forward progress, no matter what others do.
The system never freezes; it keeps moving.
How Does It Work (Plain Language)?
Instead of locking a resource, a lock-free algorithm optimistically updates shared data using atomic primitives. If a conflict occurs, the thread retries: no waiting, no blocking.
Key primitive: Compare-And-Swap (CAS)
CAS(address, expected, new)
Atomically checks if *address == expected. If yes → replace with new and return true. If not → return false (someone else changed it).
Threads keep looping until their CAS succeeds; that’s the “lock-free dance.”
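Before the stack example, here is a minimal sketch of that CAS retry loop on a shared counter, using C11 atomics. It runs in a single thread purely to show the pattern; in a real test you would call `increment()` from several threads and watch some CAS attempts fail and retry.

```c
#include <stdio.h>
#include <stdatomic.h>

atomic_int counter = 0;

void increment(void) {
    int expected = atomic_load(&counter);
    /* If another thread changed counter, the CAS fails, `expected` is
       refreshed with the current value, and we try again. */
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* retry */
    }
}

int main(void) {
    for (int i = 0; i < 5; i++)
        increment();
    printf("counter = %d\n", atomic_load(&counter));
    return 0;
}
```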
Example: Lock-Free Stack
| Step | Thread A | Thread B | Stack Top | Notes |
|------|----------|----------|-----------|-------|
| 1 | Reads top = X | – | X | A plans push(Y) |
| 2 | – | Reads top = X | X | B plans push(Z) |
| 3 | CAS(X, Y) succeeds | – | Y | Y → X |
| 4 | – | CAS(X, Z) fails | Y | Top changed |
| 5 | – | Retries: CAS(Y, Z) succeeds | Z | Z → Y → X |
No locks: both threads push safely via atomic retries.
Progress guarantees compared:

| Guarantee | Meaning |
|-----------|---------|
| Wait-Free | Every thread completes in a finite number of its own steps |
| Lock-Free | System makes progress (at least one thread succeeds) |
| Obstruction-Free | Progress if no interference |

Lock-Free sits in the middle: a good balance between safety and performance.
Try It Yourself
Implement a lock-free stack or counter using atomic_compare_exchange_weak.
Add two threads incrementing a shared counter, watch CAS retries in action.
Compare with a mutex-based version, note CPU usage and fairness.
Simulate interference, ensure at least one thread always moves forward.
Test Cases
| Scenario | Description | Result |
|----------|-------------|--------|
| Single thread push | No conflict | Success |
| Two threads push | CAS retry loop | Both succeed |
| CAS failure | Detected by compare | Retry |
| Pause one thread | Other continues | Progress |
Complexity
Time: O(1) average per operation (with retries)
Space: O(n) for data structure
Lock-Free algorithms are the art of optimism in concurrency: no waiting, no locking, just atomic cooperation. They shine in high-throughput, low-latency systems where speed and liveness matter more than simplicity.
809 Wait-Die / Wound-Wait
The Wait-Die and Wound-Wait schemes are classic deadlock prevention strategies in timestamp-based concurrency control. They use transaction timestamps to decide who waits and who aborts, keeping the system moving and avoiding circular waits entirely.
What Problem Are We Solving?
When multiple transactions compete for the same resources, deadlocks can occur:
T1 locks X and wants Y
T2 locks Y and wants X → both wait forever
We need a rule that breaks these cycles before they form.
The trick? Give every transaction a timestamp (its “age”) and use it to resolve conflicts deterministically: no cycles, no guessing.
How Does It Work (Plain Language)?
Each transaction T gets a timestamp TS(T) when it starts. Whenever T requests a lock held by another transaction U, we apply one of two strategies:
| Scheme | Rule | Intuition |
|--------|------|-----------|
| Wait-Die | If T is older than U → wait; else (younger) → abort (die) | Old ones wait, young ones restart |
| Wound-Wait | If T is older than U → wound (abort U); else (younger) → wait | Old ones preempt, young ones wait |
Because timestamps never change, cycles cannot form, one direction always wins.
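The two rules fit in a few lines of C. This sketch (with invented names) only encodes the decision table; a real lock manager would act on the decision by enqueuing or aborting the transaction.

```c
#include <stdio.h>

/* Smaller timestamp = older transaction. */
typedef enum { WAIT, ABORT_SELF, WOUND_HOLDER } Decision;

Decision wait_die(int ts_requester, int ts_holder) {
    return (ts_requester < ts_holder) ? WAIT : ABORT_SELF;      /* old waits, young dies */
}

Decision wound_wait(int ts_requester, int ts_holder) {
    return (ts_requester < ts_holder) ? WOUND_HOLDER : WAIT;    /* old wounds, young waits */
}

const char *show(Decision d) {
    return d == WAIT ? "wait" : d == ABORT_SELF ? "abort requester" : "abort holder";
}

int main(void) {
    int T1 = 5, T2 = 10;   /* T1 is older */
    printf("Wait-Die,   T1 requests lock held by T2: %s\n", show(wait_die(T1, T2)));
    printf("Wait-Die,   T2 requests lock held by T1: %s\n", show(wait_die(T2, T1)));
    printf("Wound-Wait, T1 requests lock held by T2: %s\n", show(wound_wait(T1, T2)));
    printf("Wound-Wait, T2 requests lock held by T1: %s\n", show(wound_wait(T2, T1)));
    return 0;
}
```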
Example Walkthrough
Let TS(T1)=5 (older), TS(T2)=10 (younger)
| Scenario | Scheme | Outcome |
|----------|--------|---------|
| T1 (old) wants lock held by T2 | Wait-Die: T1 waits. Wound-Wait: T2 aborts | Safe either way |
| T2 (young) wants lock held by T1 | Wait-Die: T2 aborts. Wound-Wait: T2 waits | Safe either way |
No cycle is possible: all waits flow from older to younger (or the younger transaction aborts), so loops cannot form.
Trade-off: younger transactions may abort frequently in high-contention systems.
Comparison
| Feature | Wait-Die | Wound-Wait |
|---------|----------|------------|
| Older wants lock | Wait | Abort younger |
| Younger wants lock | Abort | Wait |
| Starvation | Possible for young | Rare |
| Aggressiveness | Conservative | Aggressive |
| Implementation | Easier | Slightly complex |
Try It Yourself
Assign timestamps T1=5, T2=10.
T1 wants T2’s lock → compare rules.
T2 wants T1’s lock → compare rules.
Add a third transaction T3=15 and simulate conflicts.
Observe how order always flows older → younger, never forming cycles.
Try integrating with 2PL: apply rules before acquiring locks.
Test Cases
| Conflict | Wait-Die | Wound-Wait | Result |
|----------|----------|------------|--------|
| Old wants lock from young | Wait | Abort young | No deadlock |
| Young wants lock from old | Abort | Wait | No deadlock |
| Equal timestamps | Choose order | Choose order | Deterministic |
| Multiple waits | Directed by TS | Directed by TS | Acyclic graph |
Complexity
Time: O(1) per conflict check (compare timestamps)
Space: O(#active transactions) for timestamp table
Wait-Die and Wound-Wait are elegant timestamp rules that turn potential deadlocks into quick decisions: old transactions keep their dignity, young ones retry politely.
810 Deadlock Detection (Wait-for Graph)
Deadlock Detection is the watchdog of concurrency control. Instead of preventing deadlocks in advance, it allows them to occur and then detects and resolves them automatically. This strategy is ideal for systems where deadlocks are rare but possible, and where concurrency should remain as high as possible.
What Problem Are We Solving?
When multiple transactions compete for shared resources, they may enter a circular wait that halts progress entirely.
Example:
T₁ locks X, then requests Y
T₂ locks Y, then requests X
Now neither can proceed. Both are waiting on each other, forming a deadlock.
If we cannot avoid such patterns ahead of time, we must detect them dynamically and recover by aborting one of the transactions.
How Does It Work (Plain Language)
We represent the system’s waiting relationships as a Wait-For Graph (WFG):
Each node represents a transaction.
A directed edge \(T_i \rightarrow T_j\) means transaction \(T_i\) is waiting for \(T_j\) to release a resource.
A cycle in this graph implies a deadlock.
The detection algorithm:
Construct the wait-for graph from the current lock table.
Run a cycle detection algorithm (e.g., DFS or Tarjan’s SCC).
If a cycle exists, abort one transaction (the victim).
Release its locks, allowing other transactions to proceed.
This guarantees system liveness, deadlocks never persist indefinitely.
Example Walkthrough
| Step | Transaction | Locks Held | Waiting For | Graph Edge |
|------|-------------|------------|-------------|------------|
| 1 | T₁ locks X | X | – | – |
| 2 | T₂ locks Y | Y | – | – |
| 3 | T₁ requests Y (held by T₂) | X | Y | T₁ → T₂ |
| 4 | T₂ requests X (held by T₁) | Y | X | T₂ → T₁ |
The resulting wait-for graph contains a cycle:
\[
T_1 \rightarrow T_2 \rightarrow T_1
\]
A deadlock has formed. The detector aborts one transaction (e.g., the youngest) to break the cycle.
Tiny Code (Conceptual Example)
```c
#include <stdio.h>
#include <stdbool.h>

#define N 3                          // number of transactions

int graph[N][N];                     // adjacency matrix of wait-for edges
bool visited[N], stack[N];

bool dfs(int v) {
    visited[v] = stack[v] = true;
    for (int u = 0; u < N; u++) {
        if (graph[v][u]) {
            if (!visited[u] && dfs(u)) return true;
            else if (stack[u]) return true;   // cycle found
        }
    }
    stack[v] = false;
    return false;
}

bool has_cycle(void) {
    for (int i = 0; i < N; i++) visited[i] = stack[i] = false;
    for (int i = 0; i < N; i++)
        if (!visited[i] && dfs(i)) return true;
    return false;
}

int main(void) {
    graph[0][1] = 1;                 // T1 -> T2
    graph[1][0] = 1;                 // T2 -> T1
    if (has_cycle()) printf("Deadlock detected.\n");
    else printf("No deadlock.\n");
    return 0;
}
```
Output:
Deadlock detected.
Why It Matters
Detects all deadlocks, including multi-transaction cycles
Maximizes concurrency, since no locks are preemptively withheld
Ensures progress, by selecting and aborting a victim
Used in databases and operating systems where concurrency is complex
Trade-off: deadlocks must actually occur before being resolved, which may waste partial work.
Deadlock Resolution Strategy
Victim Selection: choose a transaction to abort based on:
Age (youngest first)
Cost (least work done)
Priority (lowest first)
Rollback: abort the victim and release its locks.
Restart: retry the aborted transaction after a short delay.
A Gentle Proof (Why It Works)
Let the Wait-For Graph be \(G = (V, E)\), where:
\(V\) = set of active transactions
\(E\) = set of edges \((T_i, T_j)\), meaning \(T_i\) waits for \(T_j\)
A deadlock exists if and only if there is a cycle in \(G\).
Proof sketch:
(If) Suppose a cycle exists: \[
T_1 \rightarrow T_2 \rightarrow \cdots \rightarrow T_k \rightarrow T_1
\] Each transaction in the cycle waits for the next. No transaction can proceed since each holds a resource another needs. Therefore, they are all blocked, a deadlock.
(Only if) Conversely, if a set of transactions is deadlocked, each must be waiting for another in the set. Because the set is finite and every member has an outgoing wait edge within it, following those edges must eventually revisit a transaction, forming a cycle.
Thus, detecting cycles in \(G\) is equivalent to detecting deadlocks. Once a cycle is found, removing any vertex \(T_v\) (aborting a transaction) breaks the cycle:
\[
G' = G \setminus \{T_v\}
\]
and allows progress to resume.
Try It Yourself
Construct a graph: \[
T_1 \rightarrow T_2, \quad T_2 \rightarrow T_3, \quad T_3 \rightarrow T_1
\] Detect the cycle using DFS.
Abort \(T_3\), remove its edges, and verify that no cycles remain.
Compare with Wait-Die and Wound-Wait:
Those prevent cycles.
This approach detects and breaks them after the fact.
Experiment with victim selection rules and measure system throughput.
Complexity
Let \(V\) be the number of transactions and \(E\) the number of edges.
Time Complexity: \[ O(V + E) \] (using Depth-First Search)
Space Complexity: \[ O(V^2) \] (for adjacency matrix representation)
Deadlock Detection acts as a runtime safety net. It accepts that deadlocks may arise in high-concurrency systems and ensures they never persist by identifying cycles in the wait-for graph and removing one transaction. This keeps the system live, responsive, and deadlock-free over time.
Section 82. Logging, Recovery, and Commit Protocols
811 Write-Ahead Logging (WAL)
Write-Ahead Logging (WAL) is the foundation of reliable storage systems. It ensures that updates to data are never applied before being recorded, allowing recovery after crashes. The golden rule of WAL: log first, write later.
What Problem Are We Solving?
In any durable database or file system, failures can strike mid-write. Without precautions, we might end up with partially applied updates, corrupting the data.
Example:
T₁ updates record X
System crashes before writing X to disk
After restart, we must replay or undo changes to restore consistency. WAL provides the structure for doing exactly that.
By writing intentions to a log before applying them, WAL guarantees that every update is reproducible or reversible.
How Does It Work (Plain Language)
WAL maintains a sequential log on stable storage. Each log record describes:
The transaction ID
The old value (for undo)
The new value (for redo)
The WAL protocol enforces two key rules:
Write-Ahead Rule: Before modifying any data page on disk, its log record must be flushed to the log.
Commit Rule: A transaction is committed only after all its log records are safely on disk.
So, even if a crash happens, the log can replay all completed operations.
Example Walkthrough
| Step | Transaction | Action | Log | Data |
|------|-------------|--------|-----|------|
| 1 | T₁ | Start | [BEGIN T₁] | – |
| 2 | T₁ | Update X: 10 → 20 | [T₁, X, old=10, new=20] | In memory |
| 3 | T₁ | Flush log | [T₁, X, old=10, new=20] | Persisted |
| 4 | T₁ | Write X=20 to disk | – | Updated |
| 5 | T₁ | Commit | [COMMIT T₁] | Durable |
If crash occurs:
Before Step 3 → no log record → no action
After Step 3 → log says what to redo → recovery replays X=20
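The crash cases above can be sketched with a toy log in C: the “disk” is a pair of in-memory structures, data writes are allowed only after the matching log record exists, and recovery redoes committed records and undoes the rest. All structures and names here are illustrative, not a real WAL implementation.

```c
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

typedef struct { const char *txn; const char *item; int oldv, newv; bool committed; } LogRec;

LogRec log_disk[10];             /* the durable log */
int log_len = 0;
int data_X = 10;                 /* value of X on the data "disk" */

void wal_update(const char *txn, int oldv, int newv) {
    /* Write-Ahead Rule: the log record is flushed before any data write. */
    log_disk[log_len++] = (LogRec){ txn, "X", oldv, newv, false };
    /* the data page write may happen later (or be lost in a crash) */
}

void wal_commit(const char *txn) {
    for (int i = 0; i < log_len; i++)
        if (strcmp(log_disk[i].txn, txn) == 0)
            log_disk[i].committed = true;      /* commit record is durable */
}

void recover(void) {
    for (int i = 0; i < log_len; i++)
        if (log_disk[i].committed) data_X = log_disk[i].newv;   /* redo */
        else data_X = log_disk[i].oldv;                         /* undo */
    printf("Recovered X = %d\n", data_X);
}

int main(void) {
    wal_update("T1", 10, 20);    /* log <T1, X, old=10, new=20> */
    wal_commit("T1");            /* <COMMIT T1> reaches the log */
    /* crash before the data page holding X was written */
    recover();                   /* redo brings X to 20 */
    return 0;
}
```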
Why It Matters
Durability (D in ACID): no committed change is ever lost.
Atomicity (A in ACID): uncommitted changes can be undone via the log.
Crash Recovery: replay committed updates (redo), roll back uncommitted ones (undo).
Efficiency: sequential log writes are faster than random data writes.
A Gentle Proof (Why It Works)
Suppose \(L\) is the log, \(D\) the data pages, and \(T_i\) a transaction.
For each update \(u\):
Log record \(r(u)\) is flushed to \(L\) before \(u\) is applied to \(D\). \[
\text{write}(L, r(u)) \Rightarrow \text{write}(D, u)
\]
On commit, \(L\) contains every record \(r(u)\) of \(T_i\).
If the system crashes:
For committed \(T_i\): redo all \(r(u)\) from \(L\).
For uncommitted \(T_i\): undo using old values in \(L\).
Thus, after recovery: \[
D' = \text{REDO(committed)} + \text{UNDO(uncommitted)}
\] ensuring the database matches a valid serial state.
Try It Yourself
Simulate a transaction updating \(X=10 \to 20\).
Log before data write → crash → recover via redo.
Reverse order (data before log). Observe recovery failure.
Add [BEGIN, UPDATE, COMMIT] records and test recovery logic.
Experiment with undo logging vs redo logging.
Test Cases
| Scenario | Log State | Recovery Action |
|----------|-----------|-----------------|
| Crash before log write | – | Ignore (no record) |
| Crash after log, before data | [T₁, X, 10, 20] | Redo |
| Crash before commit | [BEGIN, UPDATE] | Undo |
| Crash after commit | [BEGIN, UPDATE, COMMIT] | Redo |
Complexity
Time: \(O(n)\) per recovery (scan log)
Space: \(O(n)\) log records per transaction
Write-Ahead Logging is the journal of truth in a database. Every change is written down before it happens, ensuring that even after a crash, the system can always find its way back to a consistent state.
812 ARIES Recovery (Algorithms for Recovery and Isolation Exploiting Semantics)
ARIES is the gold standard of database recovery algorithms. It builds on Write-Ahead Logging (WAL) and combines redo, undo, and checkpoints to guarantee atomicity and durability, even in the face of crashes. ARIES powers major systems like DB2, SQL Server, and PostgreSQL variants.
What Problem Are We Solving?
When a database crashes, we face three tasks:
Redo committed work (to ensure durability).
Undo uncommitted work (to maintain atomicity).
Resume from a recent checkpoint (to avoid scanning the entire log).
Earlier systems often had to choose between efficiency and correctness. ARIES solves this by combining three principles:
Write-Ahead Logging (WAL): log before data writes.
Repeating History: redo everything since the last checkpoint, exactly as it originally happened.
Physiological Logging: log changes physically to a page but logically within the page.
How Does It Work (Plain Language)
Think of ARIES as a time machine for your database. After a crash, it replays the past exactly as it happened, then rewinds uncommitted changes.
The ARIES recovery process runs in three phases:
Analysis Phase
Scan the log forward from the last checkpoint.
Reconstruct the Transaction Table (TT) and Dirty Page Table (DPT).
Identify transactions that were active at crash time.
Redo Phase
Reapply all updates from the earliest log sequence number (LSN) in the DPT.
Repeat history to bring the database to the exact pre-crash state.
Undo Phase
Roll back uncommitted transactions using Compensation Log Records (CLRs).
Each undo writes a CLR for idempotent recovery.
After undo completes, the database reflects only committed work.
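As a rough illustration only (not real ARIES, which works with LSNs, a dirty page table, and compensation log records), the redo and undo passes can be sketched over a tiny in-memory log; the analysis result, i.e. which transactions committed, is given directly.

```c
#include <stdio.h>
#include <stdbool.h>

typedef struct { int txn; int item; int oldv; int newv; } Update;

int main(void) {
    int data[2]   = {10, 50};                 /* on-disk values of X (0) and Y (1) */
    Update log_[] = { {1, 0, 10, 20},         /* T1 updates X: 10 -> 20 */
                      {2, 1, 50, 99} };       /* T2 updates Y: 50 -> 99 */
    bool committed[3] = { false, true, false };  /* analysis: T1 committed, T2 did not */
    int n = 2;

    /* Redo phase: repeat history, reapplying every logged update in order. */
    for (int i = 0; i < n; i++)
        data[log_[i].item] = log_[i].newv;

    /* Undo phase: roll back loser transactions in reverse log order. */
    for (int i = n - 1; i >= 0; i--)
        if (!committed[log_[i].txn])
            data[log_[i].item] = log_[i].oldv;

    printf("After recovery: X=%d, Y=%d\n", data[0], data[1]);  /* X=20, Y=50 */
    return 0;
}
```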
ARIES is the architecture of resilience: it carefully replays the past, repairs the present, and preserves the future. By blending redo, undo, and checkpoints, it guarantees that every crash leads not to chaos but to consistency restored.
813 Shadow Paging
Shadow Paging is a copy-on-write recovery technique that eliminates the need for a log. Instead of writing to existing pages, it creates new copies (shadows) and atomically switches to them at commit. If a crash occurs before the switch, the old pages remain untouched, guaranteeing consistency.
What Problem Are We Solving?
Traditional recovery methods (like WAL and ARIES) maintain logs to replay or undo changes. This adds overhead and complexity.
Shadow Paging offers a simpler alternative:
No undo or redo phase
No need for logs
Commit = pointer swap
By using page versioning, it ensures that uncommitted changes never overwrite stable data.
How Does It Work (Plain Language)
Imagine the database as a tree of pages, with a root page pointing to all others. Instead of updating in place, each modification creates a new copy (shadow page). The transaction updates pointers privately, and when ready to commit, it atomically replaces the root.
Steps:
Maintain a page table (mapping logical pages → physical pages).
On update, copy the target page, modify the copy, and update the page table.
At commit, atomically update the root pointer to the new page table.
If crash before commit, old root is still valid → automatic rollback.
No log replay, no undo, no redo, just pointer swaps.
Example Walkthrough
Suppose we have a page table:
| Logical Page | Physical Page |
|--------------|---------------|
| A | 1 |
| B | 2 |
Transaction wants to update B:
Copy page 2 → page 3
Update page 3
Update table to point B → 3
On commit, replace old root pointer with new table
If crash occurs before commit, system uses old root → B=2 (old version). If commit succeeds, root now points to table with B=3 (new version).
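Here is a copy-on-write sketch of those steps in C, with invented structures: logical pages map to physical slots through a page table, and commit is a single root-pointer assignment.

```c
#include <stdio.h>

#define LOGICAL 2                         /* pages A (0) and B (1) */

int physical[4] = {111, 222, 0, 0};       /* physical page contents */
int next_free = 2;

typedef struct { int map[LOGICAL]; } PageTable;

PageTable current = { {0, 1} };           /* A -> slot 0, B -> slot 1 */
PageTable *root = &current;               /* root pointer: the committed state */

int main(void) {
    /* The transaction updates B without touching committed pages. */
    PageTable shadow = *root;             /* private copy of the page table */
    int copy = next_free++;
    physical[copy] = 333;                 /* new version of B (shadow page) */
    shadow.map[1] = copy;                 /* only the shadow table points to it */

    printf("Before commit: B = %d\n", physical[root->map[1]]);   /* old value */

    /* Commit = atomic root swap (a single pointer assignment). */
    static PageTable committed;
    committed = shadow;
    root = &committed;

    printf("After commit:  B = %d\n", physical[root->map[1]]);   /* new value */
    return 0;
}
```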
A Gentle Proof (Why It Works)
Let \(R_0\) be the root pointer before the transaction and \(R_1\) the root after commit.
During updates:
All changes occur in shadow pages, leaving \(R_0\) untouched.
Commit = atomic pointer swap: \[
R \leftarrow R_1
\] If crash occurs:
If before swap → \(R = R_0\) (old consistent state)
If after swap → \(R = R_1\) (new consistent state)
Thus the invariant holds: \[
R \in \{ R_0, R_1 \}
\] No intermediate state is ever visible, ensuring atomicity and durability.
Try It Yourself
Build a page table mapping {A, B, C}.
Perform shadow copies on update.
Simulate crash before and after commit, verify recovery.
Extend with multiple levels (root → branch → leaf).
Compare performance with WAL: write amplification vs simplicity.
Test Cases
| Scenario | Action | Result |
|----------|--------|--------|
| Update page, crash before commit | Root not swapped | Old data visible |
| Update page, commit succeeds | Root swapped | New data visible |
| Partial write during swap | Swap atomic | One valid root |
| Multiple updates | All or none | Atomic commit |
Complexity
Time: \(O(n)\) (copy modified pages)
Space: \(O(\text{\#updated pages})\) (new copies)
Shadow Paging is copy-on-write recovery made simple. By treating updates as new versions and using atomic root swaps, it turns complex recovery logic into a single pointer swap: a clean, elegant path to consistency.
814 Two-Phase Commit (2PC)
The Two-Phase Commit (2PC) protocol ensures atomic commitment across distributed systems. It coordinates multiple participants so that either all commit or all abort, preserving atomicity even when nodes fail.
It’s the cornerstone of distributed transactions in databases, message queues, and microservices.
What Problem Are We Solving?
In a distributed transaction, multiple nodes (participants) must agree on a single outcome. If one commits and another aborts, global inconsistency results.
We need a coordination protocol that ensures:
All or nothing outcome
Agreement despite failures
Durable record of the decision
The Two-Phase Commit protocol achieves this by introducing a coordinator that manages a two-step handshake across all participants.
How Does It Work (Plain Language)
Think of a conductor leading an orchestra:
First, they ask each musician, “Are you ready?”
Only when everyone says yes, the conductor signals “Play!”
If any musician says “No”, the performance stops.
Similarly, 2PC proceeds in two phases:
Phase 1: Prepare (Voting Phase)
Coordinator sends PREPARE to all participants.
Each participant:
Writes its prepare record to disk (for durability).
Replies YES if ready to commit, NO if not.
Phase 2: Commit (Decision Phase)
If all voted YES:
Coordinator writes COMMIT record and sends COMMIT to all.
If any voted NO or timeout:
Coordinator writes ABORT record and sends ABORT.
Each participant follows the coordinator’s final decision.
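A toy C sketch of one coordinator round follows, with messages modeled as function calls and votes hard-coded; a real implementation would also write PREPARE/COMMIT log records and handle timeouts.

```c
#include <stdio.h>
#include <stdbool.h>

#define PARTICIPANTS 3

/* Phase 1: each participant votes YES (true) or NO (false). */
bool prepare(int id) {
    bool vote = (id != 1);                 /* participant 1 votes NO in this run */
    printf("Participant %d votes %s\n", id, vote ? "YES" : "NO");
    return vote;
}

/* Phase 2: every participant applies the same global decision. */
void decide(int id, bool commit) {
    printf("Participant %d %s\n", id, commit ? "COMMITs" : "ABORTs");
}

int main(void) {
    bool all_yes = true;

    for (int i = 0; i < PARTICIPANTS; i++)     /* voting phase */
        if (!prepare(i)) all_yes = false;

    printf("Coordinator decision: %s\n", all_yes ? "COMMIT" : "ABORT");

    for (int i = 0; i < PARTICIPANTS; i++)     /* decision phase */
        decide(i, all_yes);
    return 0;
}
```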
Messages ensure all commit or all abort: \[
\forall i, j:\; \text{state}(P_i) = \text{state}(P_j)
\]
Hence, atomicity and agreement hold, even under partial failures.
Try It Yourself
Simulate two participants and one coordinator.
All YES → commit
One NO → abort
Add a timeout in the coordinator before Phase 2.
What happens? (Participants block)
Add logging: [BEGIN, PREPARE, COMMIT]
Recover coordinator after crash, reissue final decision.
Compare behavior with 3PC (non-blocking variant).
Test Cases
| Scenario | Votes | Result | Notes |
|----------|-------|--------|-------|
| All YES | YES, YES, YES | Commit | All agree |
| One NO | YES, NO, YES | Abort | Atomic abort |
| Timeout (no response) | YES, –, YES | Abort | Safety over progress |
| Crash after prepare | All YES | Blocked | Wait for coordinator |
Complexity
Message Complexity: \(O(n)\) per phase
Log Writes: 1 per phase (prepare, commit)
Blocking: possible if coordinator fails post-prepare
Two-Phase Commit is the atomic handshake of distributed systems: a simple, rigorous guarantee that all participants move together, or not at all. It laid the groundwork for the fault-tolerant consensus protocols that followed.
815 Three-Phase Commit (3PC)
The Three-Phase Commit (3PC) protocol extends Two-Phase Commit (2PC) to avoid its main weakness, blocking. It ensures that no participant ever remains stuck waiting for a decision, even if the coordinator crashes. 3PC achieves this by inserting a pre-commit phase, separating agreement from execution.
What Problem Are We Solving?
2PC guarantees atomicity, but not liveness. If the coordinator fails after all participants vote YES, everyone waits indefinitely; the system stalls.
3PC fixes this by ensuring that:
All participants move through the same sequence of states
No state is ambiguous after a crash
Timeouts always lead to safe progress (abort or commit)
This makes 3PC a non-blocking atomic commitment protocol, assuming no network partitions and bounded message delays.
How Does It Work (Plain Language)
3PC divides the commit process into three phases, adding a pre-commit handshake before final commit.
| Phase | Name | Description |
|-------|------|-------------|
| 1 | CanCommit? | Coordinator asks if participants can commit. |
| 2 | PreCommit | If all vote YES, coordinator sends pre-commit (a promise). Participants prepare to commit and acknowledge. |
| 3 | DoCommit | Coordinator sends the final commit. Participants complete and acknowledge. |
If any participant or coordinator times out waiting for a message, it can safely decide (commit or abort) based on its last known state.
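That timeout rule is easy to encode. The sketch below (states and names invented for illustration) shows the safe default a participant takes when it times out in WAIT versus PRECOMMIT.

```c
#include <stdio.h>

typedef enum { INIT, WAIT, PRECOMMIT, COMMITTED, ABORTED } State;

/* Safe local decision on timeout, per the 3PC state analysis. */
State on_timeout(State s) {
    switch (s) {
    case WAIT:      return ABORTED;     /* no pre-commit promise seen: safe to abort  */
    case PRECOMMIT: return COMMITTED;   /* everyone acknowledged: safe to commit      */
    default:        return s;
    }
}

const char *name(State s) {
    static const char *n[] = {"INIT", "WAIT", "PRECOMMIT", "COMMITTED", "ABORTED"};
    return n[s];
}

int main(void) {
    printf("Timeout in WAIT      -> %s\n", name(on_timeout(WAIT)));
    printf("Timeout in PRECOMMIT -> %s\n", name(on_timeout(PRECOMMIT)));
    return 0;
}
```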
Crash-tolerant: Safe state transitions after recovery
3PC improves availability compared to 2PC, but requires synchronous assumptions (bounded delays). In real-world networks with partitions, Paxos Commit or Raft are preferred.
A Gentle Proof (Why It Works)
Let \(P_i\) be a participant with state \(s_i(t)\) at time \(t\).
Invariant: \[
\forall i, j:\ \neg\bigl(s_i(t) = \text{COMMIT} \land s_j(t) = \text{ABORT}\bigr)
\] That is, no process commits while another aborts.
In WAIT, if timeout occurs → abort (safe).
In PRECOMMIT, all participants acknowledged → all can safely commit.
Hence, no uncertain or contradictory outcomes arise.
Each state implies a safe local decision: \[
\begin{cases}
\text{WAIT} \implies \text{ABORT} \\
\text{PRECOMMIT} \implies \text{COMMIT}
\end{cases}
\] Therefore, even with timeouts, global consistency is preserved.
Try It Yourself
Simulate all participants voting YES, system commits.
Make one participant vote NO, all abort.
Crash the coordinator during PRECOMMIT, participants commit safely.
Compare with 2PC, where would blocking occur?
Test Cases
| Scenario | Votes | Failure | Result |
|----------|-------|---------|--------|
| All YES | None | None | Commit |
| One NO | P2 | – | Abort |
| Crash in WAIT | Timeout | Abort | Safe |
| Crash in PRECOMMIT | Timeout | Commit | Safe |
| Network delay (bounded) | – | – | Non-blocking |
Complexity
Phases: 3 rounds of messages
Message Complexity: \(O(n)\) per phase
Time: One more phase than 2PC, but no blocking
Three-Phase Commit is the non-blocking evolution of 2PC: by inserting a pre-commit handshake, it ensures that agreement and action are separate, so failures can never leave the system waiting in limbo.
816 Checkpointing
Checkpointing is the process of periodically saving a consistent snapshot of a system’s state so that recovery after a crash can resume from the checkpoint instead of starting from the very beginning. It’s the backbone of fast recovery in databases, operating systems, and distributed systems.
What Problem Are We Solving?
Without checkpoints, recovery after a crash requires replaying the entire log, which can be slow and inefficient. Imagine a database with millions of operations; restarting from scratch would take forever.
By creating checkpoints, we mark safe positions in the log so that:
Only actions after the last checkpoint need to be redone or undone
Recovery time is bounded and predictable
System can resume faster after failure
Checkpointing trades a small amount of runtime overhead for massive recovery speedup.
How Does It Work (Plain Language)
A checkpoint captures a snapshot of all necessary recovery information:
The transaction table (TT) (active transactions)
The dirty page table (DPT) (pages not yet written to disk)
The log position marking where recovery can resume
During normal operation:
System runs and appends log records (like WAL).
Periodically, a checkpoint is written:
[BEGIN CHECKPOINT]
Snapshot TT and DPT
[END CHECKPOINT]
During recovery:
Scan log from the last checkpoint, not from the beginning.
Redo or undo only what happened afterward.
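A toy illustration of that idea in C: the log is an array of strings, and recovery scans forward from the most recent END CHECKPOINT record instead of from record 0. Real systems (e.g., ARIES) actually resume from the begin-checkpoint record and the dirty page table's minimum recLSN; this sketch only shows the skip.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *log_[] = {
        "BEGIN T1", "UPDATE T1 X 10->20",
        "BEGIN CHECKPOINT", "END CHECKPOINT",
        "UPDATE T1 Y 5->7", "COMMIT T1"
    };
    int n = 6, last_ckpt = 0;

    /* Find the most recent completed checkpoint. */
    for (int i = 0; i < n; i++)
        if (strcmp(log_[i], "END CHECKPOINT") == 0) last_ckpt = i;

    printf("Recovery starts at record %d instead of 0:\n", last_ckpt + 1);
    for (int i = last_ckpt + 1; i < n; i++)
        printf("  replay: %s\n", log_[i]);
    return 0;
}
```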
Example Walkthrough
| Step | Log Entry | Description |
|------|-----------|-------------|
| 1 | [BEGIN T₁] | Transaction starts |
| 2 | [UPDATE T₁, X, 10→20] | Modify data |
| 3 | [BEGIN CHECKPOINT] | Capture snapshot |
| 4 | {TT: {T₁}, DPT: {X}} | Write table states |
| 5 | [END CHECKPOINT] | Finish checkpoint |
| 6 | [UPDATE T₁, Y, 5→7] | Continue operations |
If a crash occurs after Step 6, recovery starts after Step 3, not Step 1.
Types of Checkpointing
| Type | Description | Example |
|------|-------------|---------|
| Consistent | All transactions paused | Simpler, slower |
| Fuzzy | Taken while system runs | Used in ARIES |
| Coordinated | Global sync in distributed systems | Snapshot algorithm |
| Uncoordinated | Each node independent | Risk of inconsistent states |
Most modern systems use fuzzy checkpointing: no global pause, only metadata consistency.
Why It Matters
Works with WAL and ARIES: a key building block of recovery.
Trade-offs:
Overhead during checkpointing
Extra disk writes
Must ensure snapshot consistency
A Gentle Proof (Why It Works)
Let \(L = [r_1, r_2, \ldots, r_n]\) be the log and \(C_k\) a checkpoint after record \(r_k\).
The recovery rule:
Redo all log records after \(C_k\)
Undo incomplete transactions from \(C_k\) forward
Since \(C_k\) captures all prior committed states, we have: \[
\text{state}(C_k) = \text{apply}(r_1, \ldots, r_k)
\]
So after crash: \[
\text{Recover} = \text{Redo}(r_{k+1}, \ldots, r_n)
\]
Checkpoint ensures we never need to revisit \(r_1, \ldots, r_k\) again.
Try It Yourself
Simulate a log with 10 updates and 2 checkpoints.
Recover starting from last checkpoint.
Compare runtime with full log replay.
Add dirty pages to checkpoint, redo only affected ones.
Implement fuzzy checkpoint (no pause, capture snapshot metadata).
Test Cases
| Scenario | Action | Recovery Start | Result |
|----------|--------|----------------|--------|
| No checkpoint | Replay all | \(r_1\) | Slow |
| One checkpoint | Start after checkpoint | \(r_k\) | Faster |
| Multiple checkpoints | Use last one | \(r_m\) | Fastest |
| Fuzzy checkpoint | No pause | \(r_m\) | Efficient |
Complexity
Checkpointing Time: \(O(\text{\#dirty pages})\)
Recovery Time: \(O(\text{log length after checkpoint})\)
Space: small metadata overhead
Checkpointing is the pause button for recovery: a snapshot of safety that lets systems bounce back quickly. By remembering where it last stood firm, a database can restart with confidence, skipping over the past and diving straight into the present.
817 Undo Logging
Undo Logging is one of the earliest and simplest recovery mechanisms in database systems. Its core idea is straightforward: never overwrite a value until its old version has been saved to the log. After a crash, the system can undo any uncommitted changes using the saved old values.
What Problem Are We Solving?
In systems that modify data directly on disk (in-place updates), a crash could leave incomplete or inconsistent data behind. We need a way to rollback uncommitted transactions safely and restore the database to its previous consistent state.
Undo Logging solves this by logging before writing, ensuring that old values are always recoverable.
This principle is known as Write-Ahead Rule:
Before any change is written to the database, the old value must be written to the log.
How Does It Work (Plain Language)
Undo Logging maintains a log of old values for every update.
Log record format:
<T, X, old_value>
where:
T = transaction ID
X = data item
old_value = value before modification
Protocol Rules
Write-Ahead Rule Log the old value before writing to disk.
Commit Rule A transaction commits only after all writes are flushed to disk.
If a crash occurs:
Committed transactions are left as-is.
Uncommitted transactions are undone using old values in the log.
Example Walkthrough
| Step | Log / Action | Description | Data |
|------|--------------|-------------|------|
| 1 | <START T₁> | Start transaction | – |
| 2 | <T₁, X, 10> | Log old value | – |
| 3 | X = 20 | Write new value | 20 |
| 4 | <COMMIT T₁> | Commit recorded | 20 |
If crash occurs before <COMMIT T₁>:
Recovery scans log backward
Finds <T₁, X, 10>
Restores X = 10
Recovery Algorithm (Backward Pass)
Scan log backward.
For each uncommitted transaction \(T\):
For each record <T, X, old_value>: restore \(X \leftarrow \text{old\_value}\)
Write <END T> for each undone transaction.
All committed transactions remain unchanged.
Tiny Code (Conceptual Simulation)
```c
#include <stdio.h>

typedef struct {
    const char *txn;
    const char *var;
    int old_val;
    int new_val;
} LogEntry;

int main(void) {
    int X = 10;
    LogEntry L = {"T1", "X", 10, 20};

    printf("<START %s>\n", L.txn);
    printf("<%s, %s, %d>\n", L.txn, L.var, L.old_val);   /* log old value first */
    X = L.new_val;
    printf("X = %d (updated)\n", X);

    /* Simulate crash before commit */
    printf("Crash! Rolling back...\n");

    /* Undo using the logged old value */
    X = L.old_val;
    printf("Undo: X restored to %d\n", X);
    return 0;
}
```
Output:
<START T1>
<T1, X, 10>
X = 20 (updated)
Crash! Rolling back...
Undo: X restored to 10
Why It Matters
Simple and effective rollback mechanism
Guaranteed atomicity: uncommitted updates never persist
Used in early DBMS and transactional file systems
Trade-offs:
Writes must be in-place (no shadow copies)
Requires flushing log before every write → slower
No built-in redo (can’t restore committed updates)
A Gentle Proof (Why It Works)
Let \(L\) be the log sequence, and \(D\) the database.
Invariant: Before any update \((X \leftarrow v_{\text{new}})\), its old value \(v_{\text{old}}\) is logged: \[
\text{write}(L, \langle T, X, v_{\text{old}} \rangle) \Rightarrow \text{write}(D, X = v_{\text{new}})
\]
Upon crash:
For any transaction \(T\) without <COMMIT T>, scan backward, restoring each \(X\): \[
X \leftarrow v_{\text{old}}
\]
Try It Yourself
Update two variables (X, Y) under one transaction.
Crash before <COMMIT>, rollback both.
Crash after <COMMIT>, no rollback.
Extend with <END T> records for cleanup.
Test Cases
| Log | Action | Result |
|-----|--------|--------|
| <START T₁>, <T₁, X, 10>, crash | Uncommitted | Undo X=10 |
| <START T₁>, <T₁, X, 10>, <COMMIT T₁> | Committed | No action |
| Multiple txns, one uncommitted | Partial undo | Rollback only active |
Complexity
Time: \(O(n)\) (scan log backward)
Space: \(O(\text{\#updates})\) (log size)
Undo Logging is the guardian of old values, always writing the past before touching the present. If the system stumbles, the log becomes its map back to safety, step by step, undo by undo.
818 Redo Logging
Redo Logging is the dual of Undo Logging. Instead of recording old values for rollback, it logs new values so that after a crash, the system can reapply (redo) all committed updates. It ensures durability by replaying only the operations that made it to the commit point.
What Problem Are We Solving?
If a system crashes before flushing data to disk, some committed transactions might exist only in memory. Without protection, their updates would be lost, violating durability.
Redo Logging solves this by logging all new values before commit, so recovery can reconstruct them later, even if data pages were never written.
The rule is simple:
Never declare a transaction committed until all its new values are logged to stable storage.
How Does It Work (Plain Language)
Redo Logging keeps records of new values for each update.
Log record format:
<T, X, new_value>
where:
T = transaction ID
X = data item
new_value = value after modification
Rules of Redo Logging
Log Before Commit Every <T, X, new_value> must be written before <COMMIT T>.
Write Data After Commit Actual data pages can be written after the transaction commits.
Redo on Recovery If a committed transaction’s changes weren’t applied to disk, reapply them.
Uncommitted transactions are ignored, their changes never reach the database.
Example Walkthrough
| Step | Log / Action | Description | Data |
|------|--------------|-------------|------|
| 1 | <START T₁> | Start transaction | – |
| 2 | <T₁, X, 20> | Log new value | – |
| 3 | <COMMIT T₁> | Commit | – |
| 4 | X = 20 | Write data | 20 |
If crash occurs after commit but before writing X, recovery will redo the update from log: X = 20.
Recovery Algorithm (Forward Pass)
Scan log forward from start.
Identify committed transactions.
For each committed transaction \(T\):
For each record <T, X, new_value> reapply \(X \leftarrow new_value\)
Write <END T> to mark completion.
No undo needed, uncommitted updates are never written.
Tiny Code (Conceptual Simulation)
```c
#include <stdio.h>

typedef struct {
    const char *txn;
    const char *var;
    int new_val;
} LogEntry;

int main(void) {
    int X = 10;                  /* X on disk still holds the old value */
    LogEntry L = {"T1", "X", 20};

    printf("<START %s>\n", L.txn);
    printf("<%s, %s, %d>\n", L.txn, L.var, L.new_val);   /* log new value before commit */
    printf("<COMMIT %s>\n", L.txn);

    /* Crash before the data write */
    printf("Crash! Recovering...\n");

    /* Redo the committed update from the log */
    X = L.new_val;
    printf("Redo: X = %d\n", X);
    return 0;
}
```
Why It Matters
Ensures durability (D in ACID): committed updates are never lost.
Simple recovery: only reapply committed updates.
Safe to delay writes: data pages can be written lazily.
Used in systems with deferred writes.
Trade-offs:
Requires forward recovery scan
Can’t undo, assumes uncommitted updates never flushed
Must force <COMMIT> record before declaring success
A Gentle Proof (Why It Works)
Let \(L\) be the log and \(D\) the data pages.
Invariant:
Before commit: \[
\forall (T, X, v_{\text{new}}) \in L,\ \text{no write to } D
\]
At commit: \[
\text{write}(L, \langle \text{COMMIT } T \rangle) \Rightarrow \text{all new values in } L
\]
On recovery:
For committed \(T\), reapply updates: \[
X \leftarrow v_{\text{new}}
\]
For uncommitted \(T\), skip (no changes persisted)
Thus final state is: \[
D_{\text{after recovery}} = D_{\text{after committed txns}}
\] ensuring atomicity and durability.
Try It Yourself
Simulate \(T_1\): update \(X=10 \to 20\), crash before writing \(X\).
Replay log → verify \(X=20\).
Add uncommitted \(T_2\), no redo applied.
Combine with checkpoint, skip old transactions.
Test Cases
| Log | Action | Result |
|-----|--------|--------|
| <START T₁>, <T₁, X, 20>, <COMMIT T₁> | Redo | X=20 |
| <START T₁>, <T₁, X, 20> (no commit) | Skip | X unchanged |
| Two txns, one committed | Redo one | Only T₁ applied |
| Commit record missing | Skip | Safe recovery |
Complexity
Time: \(O(n)\) (forward scan)
Space: \(O(\text{\#updates})\) (log size)
Redo Logging is the replay artist of recovery, always forward-looking, always restoring the future you meant to have. By saving every new value before commit, it ensures that no crash can erase what was promised.
819 Quorum Commit
Quorum Commit is a distributed commit protocol that ensures consistency by requiring a majority (quorum) of replicas to agree before a transaction is considered committed. It’s the foundation of fault-tolerant replication systems such as Dynamo, Cassandra, and Paxos-based databases.
Instead of one coordinator forcing all participants to commit (like 2PC), quorum commit spreads control across replicas, commit happens only when enough nodes agree.
What Problem Are We Solving?
In replicated systems, data is stored across multiple nodes for fault tolerance. We need a way to ensure:
Durability: data survives node failures
Consistency: all replicas converge
Availability: progress despite partial failures
If we required all replicas to acknowledge every write, one crashed node would block progress. Quorum Commit fixes this by requiring only a majority: \[
W + R > N
\] where:
\(N\) = total replicas
\(W\) = write quorum size
\(R\) = read quorum size
This ensures every read overlaps with the latest committed write.
How Does It Work (Plain Language)
Each transaction (or write) is sent to N replicas. The coordinator waits for W acknowledgments before declaring commit success.
On reads, the client queries R replicas and merges responses. Because \(W + R > N\), at least one replica in any read quorum has the latest value.
Steps for Write:
Send write to all \(N\) replicas.
Wait for \(W\) acknowledgments.
If \(W\) reached → commit success.
If fewer → abort or retry.
Steps for Read:
Query \(R\) replicas.
Collect responses.
Choose value with latest timestamp/version.
This model supports eventual consistency or strong consistency, depending on \(W\) and \(R\).
Example Walkthrough
| Parameter | Value | Meaning |
|---|---|---|
| \(N\) | 3 | total replicas |
| \(W\) | 2 | write quorum |
| \(R\) | 2 | read quorum |
Write “X=10”:
Send to replicas A, B, C
A and B respond → quorum reached → commit
C may lag, will catch up later
Read “X”:
Query A, C
A has X=10, C has X=old
Choose latest (A wins)
Because \(W + R = 4 > 3\), overlap guarantees correctness.
Tiny Code (Conceptual Simulation)
#include <stdio.h>

int main(void) {
    int N = 3, W = 2, ack = 0;
    int replicas[] = {1, 1, 0};   // A and B acknowledge; C fails

    for (int i = 0; i < N; i++)
        if (replicas[i]) ack++;

    if (ack >= W)
        printf("Commit success (acks=%d, quorum=%d)\n", ack, W);
    else
        printf("Commit failed (acks=%d, quorum=%d)\n", ack, W);
}
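The sketch above covers only the write path. A companion read-path sketch is below: it queries \(R\) replicas and keeps the copy with the highest version. The version counters and the replica state are illustrative assumptions, not a real replication protocol.

#include <stdio.h>

// One replica's copy of X, tagged with a version (e.g. a timestamp).
typedef struct { int value; int version; } Copy;

int main(void) {
    int R = 2;
    // Illustrative replica state: A and B hold the new write, C lags behind.
    Copy replicas[] = { {10, 2}, {10, 2}, {7, 1} };

    // Read quorum: query R replicas (here A and C), keep the latest version.
    int queried[] = {0, 2};
    Copy best = replicas[queried[0]];
    for (int i = 1; i < R; i++) {
        Copy c = replicas[queried[i]];
        if (c.version > best.version) best = c;
    }
    printf("Read returns X=%d (version %d)\n", best.value, best.version);
}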
Why It Matters

Consistency–availability tradeoff: tunable via \(W\) and \(R\)
Foundation for Dynamo, Cassandra, Riak, CockroachDB
Trade-offs:
Eventual consistency if replicas lag
Higher coordination than single-node commit
Conflicts if concurrent writes (solved via versioning)
A Gentle Proof (Why It Works)
Let:
\(N\) = total replicas
\(W\) = write quorum size
\(R\) = read quorum size
To ensure every read sees the latest committed write: \[
W + R > N
\]
Proof sketch:
Write completes after \(W\) replicas store the update.
Read queries \(R\) replicas.
Since \(W + R > N\), there must be at least one replica \(p\) that received both. Thus, every read overlaps with the latest write.
So every read intersects with the set of written replicas, ensuring consistency.
Try It Yourself
Set \(N=5\), \(W=3\), \(R=3\) → check overlap.
Simulate replica failures:
If 1 fails → still commit (3 of 5).
If 3 fail → no quorum → abort.
Reduce \(W\) to 1 → faster but weaker consistency.
Visualize \(W+R > N\) overlap condition.
Test Cases
| \(N\) | \(W\) | \(R\) | \(W + R > N\)? | Consistent? | Result |
|---|---|---|---|---|---|
| 3 | 2 | 2 | 4 > 3, yes | ✅ | Strong consistency |
| 3 | 2 | 1 | 3 = 3, no | ❌ | Not guaranteed, a read may miss the latest write |
| 3 | 1 | 1 | 2 < 3, no | ❌ | Inconsistent |
| 5 | 3 | 3 | 6 > 5, yes | ✅ | Safe overlap |
Complexity
Write latency: wait for \(W\) acks → \(O(W)\)
Read latency: wait for \(R\) responses → \(O(R)\)
Fault tolerance: up to \(N - W\) failures
Quorum Commit is the voting system of distributed data, decisions made by majority, not unanimity. By tuning \(W\) and \(R\), you can steer the system toward strong consistency, high availability, or low latency, one quorum at a time.
820 Consensus Commit
Consensus Commit merges the atomic commit protocol of 2PC with the fault tolerance of consensus algorithms like Paxos or Raft. It ensures that a distributed transaction reaches a durable, consistent decision (commit or abort) even if coordinators or participants crash.
This is how modern distributed databases (e.g., Spanner, CockroachDB, Calvin) achieve atomicity and consistency in the presence of failures.
What Problem Are We Solving?
Traditional Two-Phase Commit (2PC) ensures atomicity but is blocking, if the coordinator fails after all participants prepare, they may wait forever.
Consensus Commit solves this by replacing the coordinator’s single point of failure with a consensus group. The commit decision is reached through majority agreement, so even if some nodes fail, others can continue.
Consensus Commit wraps 2PC inside a consensus layer (like Paxos or Raft). Instead of a single coordinator deciding, the commit decision itself is replicated and agreed upon.
Steps:
Prepare Phase
Each participant votes YES/NO (like 2PC).
Votes sent to a leader node.
Consensus Phase
Leader proposes a final decision (commit/abort).
Decision is replicated via consensus protocol (Paxos/Raft).
Majority accepts → decision is durable.
Commit Phase
Decision broadcast to all participants.
Each applies commit/abort locally.
Even if the leader crashes, the decision can be recovered.
So, the decision itself is stored redundantly across nodes, eliminating coordinator failure as a single point of blocking.
Example Walkthrough
| Step | Role | Action |
|---|---|---|
| 1 | Leader | Collect votes from participants |
| 2 | Participants | Vote YES / NO |
| 3 | Leader | Propose final decision = COMMIT |
| 4 | Replicas | Reach consensus via Raft/Paxos |
| 5 | Majority | Agree → decision durable |
| 6 | All | Apply COMMIT locally |
If the leader fails after step 4, a new leader reads the replicated log and continues from the same decision.
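This section has no Tiny Code, so here is a minimal conceptual sketch of the three phases: collect votes (2PC rule), replicate the decision to a majority of a consensus group, then apply it everywhere. The vote and reachability arrays are illustrative assumptions.

#include <stdio.h>

int main(void) {
    // Phase 1: participants vote (1 = YES, 0 = NO). Values are illustrative.
    int votes[] = {1, 1, 1};
    int participants = 3, yes = 0;
    for (int i = 0; i < participants; i++) yes += votes[i];
    int decision_commit = (yes == participants);   // 2PC rule: all YES to commit

    // Phase 2: replicate the decision across the consensus group (2f+1 = 5).
    // A replica accepts if it is reachable; here replica 5 is down.
    int reachable[] = {1, 1, 1, 1, 0};
    int replicas = 5, accepted = 0;
    for (int i = 0; i < replicas; i++) accepted += reachable[i];

    if (accepted > replicas / 2)
        printf("Decision %s is durable (stored on %d of %d replicas)\n",
               decision_commit ? "COMMIT" : "ABORT", accepted, replicas);
    else
        printf("No majority yet, decision not durable, retry\n");

    // Phase 3: broadcast the durable decision; every participant applies it.
    printf("All participants apply %s\n", decision_commit ? "COMMIT" : "ABORT");
}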
Trade-off: added complexity when a transaction spans multiple consensus groups, plus an extra replication round before the decision becomes final.
A Gentle Proof (Why It Works)
Let:
\(P_i\) = participant nodes
\(L\) = leader
\(Q\) = majority quorum
Key properties:
Agreement: Commit decision is replicated to a majority \(Q\). Any new leader must read from \(Q\), so all leaders share the same decision: \[
\forall L_1, L_2:\ \text{decision}(L_1) = \text{decision}(L_2)
\]
Atomicity: All participants apply the same decision: \[
\forall i, j:\ \text{state}(P_i) = \text{state}(P_j)
\]
Durability: Once quorum stores the decision, it cannot be lost, even if some replicas fail.
Fault Tolerance: survives \(f\) failures out of \(2f+1\) replicas
Consensus Commit is atomic commitment with brains, blending the simplicity of 2PC with the resilience of consensus. It guarantees that even in a storm of failures, every node sees the same decision, and that decision is final, durable, and agreed by majority.
Section 83. Scheduling
821 First-Come First-Served (FCFS)
First-Come First-Served (FCFS) is the simplest scheduling algorithm, it processes tasks strictly in the order they arrive. No preemption, no priority, just a fair queue: the first job in is the first job out.
It’s used in operating systems, job schedulers, and even I/O queues where fairness matters more than responsiveness.
What Problem Are We Solving?
When multiple jobs compete for a shared resource (CPU, disk, printer), we need a policy to decide which runs first.
The FCFS strategy solves this by providing:
Fairness: every job gets a turn
Simplicity: easy to implement
Predictability: execution order is transparent
But because jobs run non-preemptively, long jobs can block short ones, a phenomenon known as the convoy effect.
How Does It Work (Plain Language)
Jobs are ordered by arrival time. The scheduler always picks the earliest waiting job.
Steps:
Maintain a FIFO queue of ready jobs.
When the CPU is free, dequeue the first job.
Run it to completion (no interruption).
Repeat for the next job in line.
No priorities. No time slicing. Just fairness in time order.
Example Walkthrough
Suppose 3 jobs arrive:
| Job | Arrival | Burst Time |
|---|---|---|
| J₁ | 0 | 5 |
| J₂ | 1 | 3 |
| J₃ | 2 | 8 |
Execution Order: J₁ → J₂ → J₃
| Job | Start | Finish | Turnaround | Waiting |
|---|---|---|---|---|
| J₁ | 0 | 5 | 5 | 0 |
| J₂ | 5 | 8 | 7 | 4 |
| J₃ | 8 | 16 | 14 | 6 |
Average waiting time = \((0 + 4 + 6)/3 = 3.33\)
The later a job arrives, the longer it may wait, especially behind long tasks.
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int id;
    int burst;
} Job;

int main(void) {
    Job q[] = {{1, 5}, {2, 3}, {3, 8}};
    int n = 3, time = 0;

    for (int i = 0; i < n; i++) {
        printf("Job %d starts at %d, runs for %d\n", q[i].id, time, q[i].burst);
        time += q[i].burst;
        printf("Job %d finishes at %d\n", q[i].id, time);
    }
}
Output:
Job 1 starts at 0, runs for 5
Job 1 finishes at 5
Job 2 starts at 5, runs for 3
Job 2 finishes at 8
Job 3 starts at 8, runs for 8
Job 3 finishes at 16
Why It Matters
Fairness: all jobs treated equally
Simplicity: trivial to implement
Good throughput when tasks are similar
But:
Convoy effect: one long job delays all others
Poor responsiveness: no preemption for I/O-bound jobs
Not ideal for interactive systems
A Gentle Proof (Why It Works)
Let \(n\) jobs arrive at times \(a_i\) with burst times \(b_i\) (sorted by arrival).
In FCFS, each job starts after the previous finishes: \[
S_1 = a_1, \quad S_i = \max(a_i, F_{i-1})
\]\[
F_i = S_i + b_i
\]
Turnaround time: \[
T_i = F_i - a_i
\]
Waiting time: \[
W_i = T_i - b_i
\]
Because the order is fixed, FCFS minimizes context switching overhead, though not necessarily waiting time.
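A short sketch of these formulas: it computes start, finish, turnaround, and waiting time for the three-job example above (the arrivals and bursts are the ones from the walkthrough).

#include <stdio.h>

int main(void) {
    int n = 3;
    int arrival[] = {0, 1, 2};
    int burst[]   = {5, 3, 8};
    int finish_prev = 0;
    double total_wait = 0;

    for (int i = 0; i < n; i++) {
        // S_i = max(a_i, F_{i-1}); F_i = S_i + b_i
        int start  = (arrival[i] > finish_prev) ? arrival[i] : finish_prev;
        int finish = start + burst[i];
        int turnaround = finish - arrival[i];    // T_i = F_i - a_i
        int waiting    = turnaround - burst[i];  // W_i = T_i - b_i
        printf("J%d: start=%d finish=%d turnaround=%d waiting=%d\n",
               i + 1, start, finish, turnaround, waiting);
        total_wait += waiting;
        finish_prev = finish;
    }
    printf("Average waiting time = %.2f\n", total_wait / n);   // 3.33
}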
Try It Yourself
Simulate 5 jobs with varying burst times.
Compute average turnaround and waiting time.
Add a very long job first, observe convoy effect.
Compare with Shortest Job First (SJF), note differences.
Test Cases
| Jobs | Arrival | Burst | Order | Avg Waiting |
|---|---|---|---|---|
| 3 | 0, 1, 2 | 5, 3, 8 | FCFS | 3.33 |
| 3 | 0, 2, 4 | 2, 2, 2 | FCFS | 0 (each job starts on arrival) |
| 3 | 0, 1, 2 | 10, 1, 1 | FCFS | High (convoy) |
Complexity
Time: \(O(n)\) (linear scan through queue)
Space: \(O(n)\) (queue size)
FCFS is the gentle giant of scheduling, slow to adapt, but steady and fair. Its simplicity makes it ideal for batch systems and FIFO queues, though modern schedulers often add priorities and preemption to tame its convoy effect.
822 Shortest Job First (SJF)
Shortest Job First (SJF) scheduling always picks the job with the smallest execution time next. It’s the optimal scheduling algorithm for minimizing average waiting time, small tasks never get stuck behind long ones.
There are two variants:
Non-preemptive SJF: once a job starts, it runs to completion.
Preemptive SJF (Shortest Remaining Time First, SRTF): if a new job arrives with a shorter remaining time, it preempts the current one.
What Problem Are We Solving?
First-Come First-Served (FCFS) can cause the convoy effect, short tasks wait behind long ones. SJF fixes that by prioritizing shorter tasks, improving average turnaround and responsiveness.
The intuition:
“Always do the easiest thing first.”
This mirrors real-world queues, serve quick customers first to minimize average waiting.
How Does It Work (Plain Language)
Maintain a list of ready jobs.
When CPU is free, pick the job with smallest burst time.
Run it (non-preemptive), or switch if a shorter job arrives (preemptive).
Repeat until all jobs finish.
If burst times are known (e.g., predicted via exponential averaging), SJF gives provably minimal waiting time.
Example Walkthrough (Non-Preemptive)
| Job | Arrival | Burst |
|---|---|---|
| J₁ | 0 | 7 |
| J₂ | 1 | 4 |
| J₃ | 2 | 1 |
| J₄ | 3 | 4 |
Execution order:

At \(t=0\), only J₁ has arrived → run J₁ (finishes at 7).
At \(t=7\), pick the shortest among {J₂(4), J₃(1), J₄(4)} → J₃ (finishes at 8).
At \(t=8\), run J₂ (finishes at 12), then J₄ (finishes at 16).

Average waiting time = \((0 + 5 + 7 + 9)/4 = 5.25\), smaller than under FCFS for the same jobs.
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int id, burst;
} Job;

void sjf(Job jobs[], int n) {
    // Simple selection-sort style scheduling: order jobs by burst time
    for (int i = 0; i < n - 1; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (jobs[j].burst < jobs[min].burst)
                min = j;
        Job temp = jobs[i];
        jobs[i] = jobs[min];
        jobs[min] = temp;
    }

    int time = 0;
    for (int i = 0; i < n; i++) {
        printf("Job %d starts at %d, burst = %d\n", jobs[i].id, time, jobs[i].burst);
        time += jobs[i].burst;
    }
}

int main(void) {
    Job jobs[] = {{1, 7}, {2, 4}, {3, 1}, {4, 4}};
    sjf(jobs, 4);
}
Output:
Job 3 starts at 0, burst = 1
Job 2 starts at 1, burst = 4
Job 4 starts at 5, burst = 4
Job 1 starts at 9, burst = 7
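The sketch above assumes every job is ready at time 0. The variant below also respects arrival times and reproduces the walkthrough order J₁ → J₃ → J₂ → J₄ and the 5.25 average waiting time. It is a minimal illustration, not a production scheduler.

#include <stdio.h>

int main(void) {
    int n = 4;
    int arrival[] = {0, 1, 2, 3};
    int burst[]   = {7, 4, 1, 4};
    int done[]    = {0, 0, 0, 0};
    int time = 0;
    double total_wait = 0;

    for (int completed = 0; completed < n; completed++) {
        // Among jobs that have arrived and are not done, pick the shortest burst.
        int pick = -1;
        for (int i = 0; i < n; i++) {
            if (done[i] || arrival[i] > time) continue;
            if (pick == -1 || burst[i] < burst[pick]) pick = i;
        }
        if (pick == -1) { time++; completed--; continue; }   // CPU idle this tick
        int wait = time - arrival[pick];
        printf("J%d starts at %d (waited %d)\n", pick + 1, time, wait);
        total_wait += wait;
        time += burst[pick];
        done[pick] = 1;
    }
    printf("Average waiting time = %.2f\n", total_wait / n);   // 5.25
}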
Why It Matters
Minimizes average waiting time, provably optimal
Fair for small jobs, but starves long ones
Foundation for priority-based and multilevel feedback schedulers
Trade-offs:
Requires knowledge or estimation of burst times
Not suitable for unpredictable workloads
May lead to starvation (long jobs never scheduled)
A Gentle Proof (Why It Works)
Let jobs have burst times \(b_1, b_2, \dots, b_n\), sorted ascending. Total waiting time: \[
W = \sum_{i=1}^{n-1} (n - i) b_i
\] SJF minimizes \(W\) because exchanging any longer job earlier increases total waiting. Hence, by Shortest Processing Time First (SPT) principle, SJF is optimal for minimizing mean waiting time.
Try It Yourself
Simulate 4 jobs with burst = 7, 4, 1, 4.
Draw a Gantt chart for non-preemptive SJF.
Now add preemption (SRTF), compare results.
Try a long job arriving early, observe starvation.
Compare average waiting time vs FCFS.
Test Cases
| Jobs | Burst | Order | Avg Waiting | Note |
|---|---|---|---|---|
| 3 | 5, 2, 1 | 3→2→1 | 1.33 | SJF optimal |
| 4 | 7, 4, 1, 4 | 1→3→2→4 | 5.25 | Matches example (arrivals 0–3) |
| 3 | 2, 2, 2 | Any | 2 | Ties are safe |
Complexity
Sorting: \(O(n \log n)\)
Scheduling: \(O(n)\)
Space: \(O(n)\)
SJF is the strategist of schedulers, always choosing the shortest path to reduce waiting. Elegant in theory, but demanding in practice, it shines when burst times are known or predictable.
824 Priority Scheduling
Priority Scheduling selects the next job based on its priority value, not its arrival order or length. Higher-priority jobs run first; lower ones wait. It’s a generalization of SJF (where priority = \(1/\text{burst time}\)) and underpins many real-world schedulers like those in Linux, Windows, and databases.
There are two main modes:
Non-preemptive: once started, the job runs to completion.
Preemptive: if a higher-priority job arrives, it interrupts the current one.
What Problem Are We Solving?
In real systems, not all jobs are equal:
Some are time-critical (e.g., interrupts, real-time tasks)
Others are low-urgency (e.g., background jobs)
We need a mechanism that reflects importance or urgency, scheduling by priority rather than by fairness or order.
How Does It Work (Plain Language)
Each job has a priority number (higher = more urgent).
Algorithm steps:
Insert incoming jobs into a ready queue, sorted by priority.
Pick the job with the highest priority.
Run (to completion or until preempted).
On completion or preemption, select next highest priority.
Priority may be:
Static: assigned once
Dynamic: updated (e.g., aging, feedback)
Example Walkthrough (Non-Preemptive)
| Job | Arrival | Burst | Priority |
|---|---|---|---|
| J₁ | 0 | 5 | 2 |
| J₂ | 1 | 3 | 4 |
| J₃ | 2 | 1 | 3 |
Order of selection:
\(t=0\): J₁ starts (only job)
\(t=1\): J₂ arrives (priority 4 > 2), but J₁ non-preemptive
\(t=5\): J₂ (4), then J₃ (3)
Execution order: J₁ → J₂ → J₃
| Job | Start | Finish | Turnaround | Waiting |
|---|---|---|---|---|
| J₁ | 0 | 5 | 5 | 0 |
| J₂ | 5 | 8 | 7 | 4 |
| J₃ | 8 | 9 | 7 | 6 |
Example Walkthrough (Preemptive)
Same jobs, preemptive mode:

\(t=0\): run J₁ (p=2)
\(t=1\): J₂ arrives (p=4) → preempt J₁
\(t=1–4\): run J₂ (done)
\(t=4\): J₃ (p=3, arrived at \(t=2\)) outranks J₁ → run J₃
\(t=4–5\): run J₃ (done)
\(t=5–9\): finish J₁

Execution order: J₁ → J₂ → J₃ → J₁
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int id, burst, priority;
} Job;

void sort_by_priority(Job jobs[], int n) {
    // Higher priority value = more urgent, so sort in descending order
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            if (jobs[j].priority > jobs[i].priority) {
                Job t = jobs[i];
                jobs[i] = jobs[j];
                jobs[j] = t;
            }
}

int main(void) {
    Job jobs[] = {{1, 5, 2}, {2, 3, 4}, {3, 1, 3}};
    int n = 3, time = 0;
    sort_by_priority(jobs, n);

    for (int i = 0; i < n; i++) {
        printf("Job %d (P=%d) runs from %d to %d\n",
               jobs[i].id, jobs[i].priority, time, time + jobs[i].burst);
        time += jobs[i].burst;
    }
}
Output:
Job 2 (P=4) runs from 0 to 3
Job 3 (P=3) runs from 3 to 4
Job 1 (P=2) runs from 4 to 9
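The sketch above ignores arrivals and preemption. The unit-time simulation below is a minimal sketch of the preemptive mode, using the arrivals, bursts, and priorities from the walkthrough; it reproduces the order J₁ → J₂ → J₃ → J₁.

#include <stdio.h>

int main(void) {
    int n = 3;
    int arrival[]  = {0, 1, 2};
    int burst[]    = {5, 3, 1};
    int priority[] = {2, 4, 3};
    int remaining = 5 + 3 + 1;

    // Advance one time unit at a time; always run the highest-priority ready job.
    for (int t = 0; remaining > 0; t++) {
        int pick = -1;
        for (int i = 0; i < n; i++) {
            if (burst[i] == 0 || arrival[i] > t) continue;
            if (pick == -1 || priority[i] > priority[pick]) pick = i;
        }
        if (pick == -1) continue;   // idle tick
        printf("t=%d: run J%d (P=%d)\n", t, pick + 1, priority[pick]);
        burst[pick]--;
        remaining--;
    }
}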
Why It Matters
Expresses urgency directly
Used in OS kernels, device drivers, real-time systems
Enables service differentiation (foreground vs background)
Trade-offs:
Starvation possible (low-priority jobs may never run)
Needs aging or dynamic priorities to ensure fairness
Priority inversion possible (low-priority blocking high-priority)
A Gentle Proof (Why It Works)
Let each job \(J_i\) have priority \(p_i\). The scheduler picks job \(J_k\) where: \[
p_k = \max_i(p_i)
\]
To ensure bounded waiting, use aging: \[
p_i(t) = p_i(0) + \alpha t
\] where \(\alpha\) is a small increment per unit time. Eventually, every job’s priority rises enough to execute, preventing starvation.
Try It Yourself
Simulate jobs with priorities 5, 3, 1 → observe order.
Add aging: every time unit, increase waiting job’s priority.
Compare preemptive vs non-preemptive runs.
Add I/O-bound jobs with high priority, note responsiveness.
Test Cases
| Job Priorities | Mode | Order | Starvation? |
|---|---|---|---|
| 4, 3, 2 | Non-preemptive | 1→2→3 | No |
| 5, 1, 1 | Preemptive | 1 → 2/3 (tie) | Maybe |
| Aging on | Preemptive | Fair rotation | No |
Complexity
Sorting-based: \(O(n \log n)\) per scheduling decision
Dynamic aging: \(O(n)\) update per tick
Preemption overhead: depends on frequency
Priority Scheduling is the executive scheduler, giving the CPU to whoever shouts loudest. It’s powerful but political: without aging, the quiet ones might never be heard.
825 Multilevel Queue Scheduling
Multilevel Queue Scheduling divides the ready queue into multiple sub-queues, each dedicated to a class of processes, for example, system, interactive, batch, or background. Each queue has its own scheduling policy, and queues themselves are scheduled using fixed priorities or time slices.
This design mirrors real operating systems, where not all processes are equal, some need immediate attention (like keyboard interrupts), while others can wait (like backups).
What Problem Are We Solving?
In a single-queue scheduler (like FCFS or RR), all processes compete together. But real systems need differentiation:
Foreground (interactive) jobs need fast response.
Background (batch) jobs need throughput.
System tasks need instant service.
Multilevel queues solve this by classification + specialization:
Different job types → different queues
Different queues → different policies
How Does It Work (Plain Language)
Partition processes into distinct categories (e.g., system, interactive, batch).
Assign each category to a queue.
Each queue has its own scheduling algorithm (RR, FCFS, SJF, etc.).
Queue selection policy:
Fixed priority: higher queue always served first.
Time slicing: share CPU between queues (e.g., 80% user, 20% background).
Example Queue Setup
| Queue | Type | Policy | Priority |
|---|---|---|---|
| Q₁ | System | FCFS | 1 (highest) |
| Q₂ | Interactive | RR | 2 |
| Q₃ | Batch | SJF | 3 (lowest) |
CPU selection order:
Always check Q₁ first.
If empty, check Q₂.
If Q₂ empty, check Q₃.
Each queue runs its own local scheduler independently.
Example Walkthrough
Suppose:
Q₁: System → J₁(3), J₂(2)
Q₂: Interactive → J₃(4), J₄(3)
Q₃: Batch → J₅(6)
Fixed-priority scheme:
Run all Q₁ jobs (J₁, J₂) first (FCFS)
Then Q₂ jobs (RR)
Finally Q₃ (SJF)
Result: System responsiveness guaranteed, background delayed.
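This section has no Tiny Code of its own, so here is a minimal fixed-priority sketch. The queue contents and burst times are the ones from the walkthrough above, and the per-queue policies are reduced to plain FIFO order inside each queue.

#include <stdio.h>

typedef struct { int id, burst; } Job;

int main(void) {
    // Three queues, highest priority first (Q1 system, Q2 interactive, Q3 batch)
    Job q1[] = {{1, 3}, {2, 2}};
    Job q2[] = {{3, 4}, {4, 3}};
    Job q3[] = {{5, 6}};
    Job *queues[] = {q1, q2, q3};
    int counts[]  = {2, 2, 1};
    int time = 0;

    // Fixed priority: drain Q1 completely, then Q2, then Q3.
    for (int q = 0; q < 3; q++)
        for (int i = 0; i < counts[q]; i++) {
            printf("t=%d: Q%d runs J%d for %d\n",
                   time, q + 1, queues[q][i].id, queues[q][i].burst);
            time += queues[q][i].burst;
        }
}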
Why It Matters

Supports job differentiation (system vs user vs batch)
Combines multiple policies for hybrid workloads
Predictable service for high-priority queues
Trade-offs:
Rigid separation: processes can’t move between queues
Starvation: lower queues may never run (in fixed priority)
Complex tuning: balancing queue shares requires care
A Gentle Proof (Why It Works)
Let queues \(Q_1, Q_2, \ldots, Q_k\) with priorities \(P_1 > P_2 > \ldots > P_k\).
For fixed priority:
CPU always serves the non-empty queue with highest \(P_i\).
Thus, system tasks (higher \(P_i\)) never blocked by user tasks.
For time slicing:
Each queue gets CPU share \(s_i\), with \(\sum s_i = 1\).
Over time \(T\), each queue executes for \(s_i T\), ensuring fair allocation across classes.
This ensures bounded delay and deterministic control.
Try It Yourself
Create 3 queues: System (FCFS), Interactive (RR), Batch (SJF).
Simulate fixed-priority vs time-slice scheduling.
Add a long-running batch job → observe starvation.
Switch to time-slice → compare fairness.
Experiment with different share ratios.
Test Cases
| Queues | Policy | Mode | Result |
|---|---|---|---|
| 2 | FCFS, RR | Fixed priority | Fast system jobs |
| 3 | FCFS, RR, SJF | Time-slice | Balanced fairness |
| 2 | RR, SJF | Fixed priority | Starvation risk |
Complexity
Scheduling: \(O(k)\) (choose queue) + local queue policy
Space: \(O(\sum n_i)\) (total jobs across queues)
Multilevel Queue Scheduling is like a tiered city, express lanes for the urgent, side streets for the steady. Each level runs by its own rhythm, but the mayor (scheduler) decides who gets the CPU next.
826 Earliest Deadline First (EDF)
Earliest Deadline First (EDF) is a dynamic priority scheduling algorithm used primarily in real-time systems. At any scheduling decision point, it picks the task with the closest deadline. If a new task arrives with an earlier deadline, it can preempt the current one.
EDF is optimal for single-processor real-time systems, if a feasible schedule exists, EDF will find it.
What Problem Are We Solving?
In real-time systems, timing is everything, missing a deadline can mean failure (e.g., missed sensor reading or delayed control signal).
We need a scheduling policy that:
Always meets deadlines (if possible)
Adapts to dynamic task arrivals
Ensures predictability under time constraints
EDF achieves this by making the most urgent task run first, urgency measured by deadline proximity.
How Does It Work (Plain Language)
Each task \(T_i\) has:
Execution time: \(C_i\)
Period (or arrival time): \(P_i\)
Absolute deadline: \(D_i\)
At each moment:
Collect all ready tasks.
Select the one with the earliest deadline.
Run it (preempt if another task arrives with an earlier \(D_i\)).
For periodic tasks with execution time \(C_i\) and period \(P_i\):
\[
U = \sum_{i=1}^{n} \frac{C_i}{P_i}
\]
EDF guarantees all deadlines if:
\[
U \le 1
\]
That is, total CPU utilization ≤ 100%.
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int id, exec, deadline;
} Task;

void edf(Task t[], int n) {
    int time = 0;
    while (n > 0) {
        // Pick the ready task with the earliest deadline
        int min = 0;
        for (int i = 1; i < n; i++)
            if (t[i].deadline < t[min].deadline)
                min = i;
        printf("Time %d–%d: Task %d (D=%d)\n",
               time, time + t[min].exec, t[min].id, t[min].deadline);
        time += t[min].exec;
        // Remove the finished task from the array
        for (int j = min; j < n - 1; j++)
            t[j] = t[j + 1];
        n--;
    }
}

int main(void) {
    Task tasks[] = {{1, 2, 5}, {2, 1, 3}, {3, 2, 7}};
    edf(tasks, 3);
}
Output:
Time 0–1: Task 2 (D=3)
Time 1–3: Task 1 (D=5)
Time 3–5: Task 3 (D=7)
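A minimal sketch of the utilization test \(U \le 1\) for periodic tasks; the \((C_i, P_i)\) pairs below are illustrative values, not drawn from the example above.

#include <stdio.h>

int main(void) {
    // Periodic tasks: execution time C_i and period P_i (illustrative values)
    double C[] = {1.0, 2.0, 3.0};
    double P[] = {4.0, 5.0, 10.0};
    int n = 3;

    double U = 0;
    for (int i = 0; i < n; i++)
        U += C[i] / P[i];   // U = sum of C_i / P_i

    printf("Utilization U = %.3f\n", U);
    if (U <= 1.0)
        printf("EDF can meet all deadlines (U <= 1)\n");
    else
        printf("Overload: some deadline will be missed\n");
}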
Why It Matters
Optimal for single CPU, meets all deadlines if feasible
Dynamic priority: adapts as deadlines approach
Widely used in real-time OS and embedded systems
Trade-offs:
Overhead: frequent priority updates and preemptions
Needs accurate deadlines and execution times
Can thrash under overload (misses multiple deadlines)
A Gentle Proof (Why It Works)
Suppose EDF fails to meet a deadline while a feasible schedule exists. Then at the missed deadline \(D_i\), some task \(T_j\) with \(D_j > D_i\) must have run instead. Swapping their execution order would bring \(T_i\) earlier without delaying \(T_j\) past \(D_j\), contradicting optimality.
Hence, EDF always finds a feasible schedule if one exists.
EDF is the watchmaker of schedulers, always attending to the next ticking clock. When tasks have precise deadlines, it’s the most reliable guide to keep every second in order.
827 Rate Monotonic Scheduling (RMS)
Rate Monotonic Scheduling (RMS) is a fixed-priority real-time scheduling algorithm. It assigns priorities based on task frequency (rate): the shorter the period, the higher the priority. RMS is optimal among all fixed-priority schedulers, if a set of periodic tasks cannot be scheduled by RMS, no other fixed-priority policy can do better.
What Problem Are We Solving?
In real-time systems, tasks repeat periodically and must finish before their deadlines. We need a static, predictable scheduler with:
Low runtime overhead (fixed priorities)
Deterministic timing
Guaranteed feasibility under certain CPU loads
EDF (dynamic) is optimal but costly to maintain. RMS trades a bit of flexibility for simplicity and determinism.
How Does It Work (Plain Language)
Each periodic task \(T_i\) is characterized by:
Period \(P_i\) (interval between releases)
Computation time \(C_i\) (execution per cycle)
Deadline = end of period (\(D_i = P_i\))
RMS rule:
Assign higher priority to the task with smaller \(P_i\) (higher frequency).
Scheduler steps:
Sort all tasks by increasing period (shorter = higher priority).
Run the highest-priority ready task.
Preempt lower ones if necessary.
Repeat every cycle.
Example Walkthrough
| Task | Execution \(C_i\) | Period \(P_i\) | Priority |
|---|---|---|---|
| T₁ | 1 | 4 | High |
| T₂ | 2 | 5 | Medium |
| T₃ | 3 | 10 | Low |
Timeline:
\(t=0\): Run T₁ (1 unit)
\(t=1\): Run T₂ (2 units)
\(t=3\): Run T₃ (3 units)
\(t=4\): T₁ releases again → preempts T₃
The CPU always picks the highest-frequency ready task.
Utilization Bound (Feasibility Test)
For \(n\) periodic tasks with periods \(P_i\) and execution times \(C_i\), RMS guarantees all deadlines if total utilization:
\[
U = \sum_{i=1}^{n} \frac{C_i}{P_i} \le n \cdot (2^{1/n} - 1)
\]
As \(n \to \infty\), \[
U \to \ln 2 \approx 0.693
\]
So up to 69.3% CPU utilization is guaranteed safe. Above that, schedulability depends on specific task alignment.
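A minimal sketch of this bound check, applied to the task set from the walkthrough. Note that a task set above the bound (as here, U ≈ 0.95) is not automatically infeasible; it just needs exact analysis.

#include <stdio.h>
#include <math.h>

int main(void) {
    // Task set from the walkthrough: (C_i, P_i)
    double C[] = {1.0, 2.0, 3.0};
    double P[] = {4.0, 5.0, 10.0};
    int n = 3;

    double U = 0;
    for (int i = 0; i < n; i++)
        U += C[i] / P[i];

    double bound = n * (pow(2.0, 1.0 / n) - 1.0);   // n (2^{1/n} - 1)
    printf("U = %.3f, RMS bound = %.3f\n", U, bound);
    if (U <= bound)
        printf("Schedulable under RMS (guaranteed by the bound)\n");
    else
        printf("Above the bound: needs exact analysis (may or may not be schedulable)\n");
}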
RMS is the metronome of real-time scheduling, simple, rhythmic, and fixed. Each task knows its place, and the CPU dances to the beat of their periods.
828 Lottery Scheduling
Lottery Scheduling introduces probabilistic fairness into CPU scheduling. Each process holds a certain number of tickets, and at every scheduling decision, the scheduler randomly draws one ticket, the owner gets the CPU. Over time, the proportion of CPU time each process receives approximates its ticket share.
It’s like holding a lottery for CPU time: more tickets → more chances to win → more CPU time.
What Problem Are We Solving?
Traditional schedulers (like FCFS or Priority) are deterministic but rigid: they give no direct way to grant one process, say, twice the CPU share of another, and low-priority jobs can starve. Lottery Scheduling expresses shares directly as ticket counts.

How Does It Work (Plain Language)

Each process \(P_i\) holds \(t_i\) tickets. The total pool is \(T = \sum_i t_i\).
At each scheduling step:
Draw a random number \(r \in [1, T]\).
Find the process whose cumulative ticket range includes \(r\).
Run that process for one quantum.
Repeat.
Over many draws, process \(P_i\) runs approximately:
\[
\text{Fraction of time} = \frac{t_i}{T}
\]
Thus, the expected CPU share is proportional to its ticket count.
Example Walkthrough
| Process | Tickets | Expected Share |
|---|---|---|
| P₁ | 50 | 50% |
| P₂ | 30 | 30% |
| P₃ | 20 | 20% |
Over 1000 quanta:
P₁ runs ≈ 500 times
P₂ runs ≈ 300 times
P₃ runs ≈ 200 times
Small variations occur (random), but long-term fairness holds.
Example Ticket Ranges
| Process | Tickets | Range |
|---|---|---|
| P₁ | 50 | 1–50 |
| P₂ | 30 | 51–80 |
| P₃ | 20 | 81–100 |
If \(r = 73\) → P₂ wins.
If \(r = 12\) → P₁ wins.
If \(r = 95\) → P₃ wins.
Tiny Code (Conceptual Simulation)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    int id, tickets;
} Proc;

int pick_winner(Proc p[], int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += p[i].tickets;

    int draw = rand() % total + 1;   // random ticket in [1, total]
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += p[i].tickets;
        if (draw <= sum) return i;   // cumulative range containing the draw
    }
    return 0;
}

int main(void) {
    srand(time(NULL));
    Proc p[] = {{1, 50}, {2, 30}, {3, 20}};
    int n = 3;

    for (int i = 0; i < 10; i++) {
        int w = pick_winner(p, n);
        printf("Quantum %d: Process %d runs\n", i + 1, p[w].id);
    }
}
Output (sample):
Quantum 1: Process 1 runs
Quantum 2: Process 2 runs
Quantum 3: Process 1 runs
Quantum 4: Process 3 runs
...
Why It Matters
Proportional fairness: share CPU by weight
Dynamic control: just adjust ticket counts
Starvation-free: everyone gets some chance
Simple to implement probabilistically
Trade-offs:
Approximate fairness: not exact short-term
Randomness: unpredictable at small scale
Overhead: random number generation, cumulative scan
A Gentle Proof (Why It Works)
Let \(t_i\) be tickets for process \(i\), and \(T\) be total tickets. Each draw is uniformly random over \([1, T]\).
Probability \(P_i\) wins: \[
P(\text{win}_i) = \frac{t_i}{T}
\]
Over \(N\) quanta, expected wins: \[
E[\text{runs}_i] = N \cdot \frac{t_i}{T}
\]
By the Law of Large Numbers, actual runs converge to \(E[\text{runs}_i]\) as \(N \to \infty\): \[
\frac{\text{runs}_i}{N} \to \frac{t_i}{T}
\]
Thus, long-term proportional fairness is guaranteed.
Try It Yourself
Assign tickets: (50, 30, 20). Run 100 quanta, measure distribution.
Increase P₃’s tickets, see its share rise.
Remove tickets mid-run, process disappears.
Combine with compensation tickets for I/O-bound jobs.
Compare with Round Robin (equal tickets).
Test Cases
| Tickets per Process | Total Tickets | Expected Ratio | Observed (approx) |
|---|---|---|---|
| (50, 30, 20) | 100 | 5:3:2 | 502:303:195 |
| (1, 1, 1) | 3 | 1:1:1 | ~Equal |
| (90, 10) | 100 | 9:1 | ~9:1 |
Complexity
Draw: \(O(n)\) (linear scan) or \(O(\log n)\) (tree)
Space: \(O(n)\)
Overhead: small per quantum
Lottery Scheduling is the casino of schedulers, fair in the long run, playful in the short run. Instead of rigid rules, it trusts probability to balance the load, giving everyone a ticket, and a chance.
829 Multilevel Feedback Queue (MLFQ)
Multilevel Feedback Queue (MLFQ) is one of the most flexible and adaptive CPU scheduling algorithms. Unlike Multilevel Queue Scheduling (where a process is stuck in one queue), MLFQ allows processes to move between queues, gaining or losing priority based on observed behavior.
It blends priority scheduling, Round Robin, and feedback adaptation, rewarding interactive jobs with fast service and demoting CPU-hungry jobs.
What Problem Are We Solving?
In real-world systems, processes vary:
Some are interactive (short bursts, frequent I/O)
Others are CPU-bound (long computations)
We don’t always know this beforehand. MLFQ learns dynamically:
If a process uses too much CPU → lower priority
If a process frequently yields (I/O-bound) → higher priority
Thus, MLFQ adapts automatically, achieving a balance between responsiveness and throughput.
How Does It Work (Plain Language)
Maintain multiple ready queues, each with different priority levels.
Each queue has its own time quantum (higher priority → shorter quantum).
A new job enters the top queue.
If it uses its whole quantum → demote to a lower queue.
If it yields early (I/O wait) → stay or promote.
Always schedule from the highest non-empty queue.
Periodically reset all processes to the top (aging).
Example Setup
| Queue | Priority | Time Quantum | Policy |
|---|---|---|---|
| Q₁ | High | 1 unit | RR |
| Q₂ | Medium | 2 units | RR |
| Q₃ | Low | 4 units | FCFS |
Rules:
Always prefer Q₁ over Q₂, Q₂ over Q₃.
Use time slice = quantum of current queue.
If a process yields before its quantum expires → it stays at its level.
If uses full quantum → demote.
Example Walkthrough
Jobs:
| Job | Burst | Behavior |
|---|---|---|
| J₁ | 8 | CPU-bound |
| J₂ | 2 | I/O-bound |
Timeline:
\(t=0\): J₁ (Q₁, quantum 1) → uses full → demote to Q₂
\(t=1\): J₂ (Q₁, quantum 1) → yields early → stays
\(t=2\): J₂ (Q₁) → finishes
\(t=3\): J₁ (Q₂, quantum 2) → uses full → demote to Q₃
\(t=5\): J₁ (Q₃, FCFS) → finish
Outcome:
J₂ (interactive) finishes fast
J₁ (CPU-bound) gets fair share, lower priority
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int id, burst, queue;
} Job;

int main(void) {
    Job jobs[] = {{1, 8, 1}, {2, 2, 1}};
    int time = 0;

    while (jobs[0].burst > 0 || jobs[1].burst > 0) {
        for (int i = 0; i < 2; i++) {
            if (jobs[i].burst <= 0) continue;
            int q = jobs[i].queue;
            int quantum = (q == 1) ? 1 : (q == 2) ? 2 : 4;
            int run = jobs[i].burst < quantum ? jobs[i].burst : quantum;
            printf("t=%d: Job %d (Q%d) runs for %d\n", time, jobs[i].id, q, run);
            time += run;
            jobs[i].burst -= run;
            // Demote if the full quantum was used and work remains
            if (jobs[i].burst > 0 && run == quantum && q < 3)
                jobs[i].queue++;
        }
    }
}

Output:

t=0: Job 1 (Q1) runs for 1
t=1: Job 2 (Q1) runs for 1
t=2: Job 1 (Q2) runs for 2
t=4: Job 2 (Q2) runs for 1
t=5: Job 1 (Q3) runs for 4
t=9: Job 1 (Q3) runs for 1

(The sketch does not model early I/O yields, so Job 2 is demoted once before finishing, unlike the idealized walkthrough above.)
Why It Matters
Adaptive: no need to know job lengths ahead of time
Fairness: all jobs eventually get CPU time
Responsiveness: favors short and I/O-bound tasks
Widely used: foundation of UNIX and modern OS schedulers
Trade-offs:
Complex tuning: many parameters (quanta, queues)
Overhead: managing promotions/demotions
Starvation risk: if not periodically boosted
A Gentle Proof (Why It Works)
Over time, CPU-bound jobs descend to lower queues, leaving top queues free for short, interactive ones. Given enough reset periods, every job eventually returns to top priority.
Thus, MLFQ guarantees eventual progress and low latency for I/O-bound tasks.
Let queues \(Q_1, Q_2, \ldots, Q_k\) with quanta \(q_1 < q_2 < \cdots < q_k\). A job that uses full quanta each time moves to queue \(Q_k\) after \(k-1\) steps. Periodic resets restore fairness, preventing indefinite starvation.
Try It Yourself
Simulate 3 jobs: CPU-bound, short burst, interactive.
Assign quanta: 1, 2, 4.
Watch promotions/demotions.
Add periodic reset → verify fairness.
Compare with Round Robin and Priority Scheduling.
Test Cases
| Jobs | Bursts | Behavior | Result |
|---|---|---|---|
| 2 | 8, 2 | CPU vs I/O | I/O finishes early |
| 3 | 5, 3, 2 | Mixed | CPU job demoted |
| 1 | 10 | CPU-bound | Ends in low queue |
Complexity
Scheduling: \(O(\text{\#queues})\) per decision
Space: \(O(\sum n_i)\)
Overhead: promotions, demotions, resets
MLFQ is the chameleon of schedulers, constantly adapting, learning job behavior on the fly. It rewards patience and responsiveness alike, orchestrating fairness across time.
830 Fair Queuing (FQ)
Fair Queuing (FQ) is a network scheduling algorithm that divides bandwidth fairly among flows. It treats each flow as if it had its own private queue and allocates transmission opportunities proportionally and smoothly, preventing one flow from dominating the link.
In CPU scheduling terms, FQ is the weighted Round Robin of packets, everyone gets a turn, but heavier weights get proportionally more bandwidth.
What Problem Are We Solving?
In shared systems (like routers or CPU schedulers), multiple flows or processes compete for a single resource. Without fairness, one greedy flow can monopolize capacity, causing starvation or jitter for others.
Fair Queuing ensures:
Fair bandwidth sharing
Isolation between flows
Smooth latency for well-behaved traffic
Used widely in network routers, I/O schedulers, and virtualization systems.
How Does It Work (Plain Language)
Each flow maintains its own FIFO queue of packets (or jobs). Conceptually, the scheduler simulates bit-by-bit Round Robin across all active flows, then transmits whole packets in the same order.
The trick is to assign each packet a virtual finish time and always send the one that would finish earliest under perfect fairness.
Key Idea: Virtual Finish Time
For each packet \(p_{i,k}\) (packet \(k\) of flow \(i\)):

\(F_{i,k-1}\) = finish time of the previous packet in flow \(i\)
\(V(A_{i,k})\) = system virtual time when the packet arrives
\(L_{i,k}\) = packet length
\(w_i\) = weight (priority) of flow \(i\)

Its virtual finish time is
\[
F_{i,k} = \max\bigl(F_{i,k-1}, V(A_{i,k})\bigr) + \frac{L_{i,k}}{w_i}
\]

The scheduler always picks the packet with the smallest \(F_{i,k}\) next.
This emulates Weighted Fair Queuing (WFQ) if \(w_i\) varies.
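A minimal sketch that applies this formula to two flows of equal weight. The packet lengths and arrival order are illustrative, and virtual time is simplified to 0 (both flows are assumed backlogged from the start).

#include <stdio.h>

int main(void) {
    // Two backlogged flows with equal weight; V(t) simplified to 0.
    double w[] = {1.0, 1.0};
    double last_finish[] = {0.0, 0.0};

    // (flow, length) pairs in arrival order, illustrative values
    int flow[]   = {0, 1, 0, 1};
    double len[] = {100, 50, 100, 50};

    for (int k = 0; k < 4; k++) {
        int i = flow[k];
        // F = max(F_prev, V) + L / w; with V = 0 this is just F_prev + L / w
        double F = last_finish[i] + len[k] / w[i];
        last_finish[i] = F;
        printf("Flow %d packet: virtual finish time = %.0f\n", i + 1, F);
    }
    // Transmit in increasing finish-time order: 50, 100, 100, 200.
}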
Example Walkthrough
Suppose two flows:
| Flow | Packet Length | Weight |
|---|---|---|
| F₁ | 100 bytes | 1 |
| F₂ | 50 bytes | 1 |
If both active:
F₁’s packet finishes at time 100
F₂’s at 50 → send F₂ first
Next F₁ → total bandwidth shared 50/50
If F₂ sends more, F₁ still gets equal share over time.
Tiny Code (Conceptual Simulation)
#include <stdio.h>

typedef struct {
    int flow, length, finish;
} Packet;

int main(void) {
    Packet p[] = {{1, 100, 100}, {2, 50, 50}, {1, 100, 200}, {2, 50, 100}};
    int n = 4;

    // Sort by virtual finish time (simplified selection-style sort)
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            if (p[j].finish < p[i].finish) {
                Packet t = p[i];
                p[i] = p[j];
                p[j] = t;
            }

    for (int i = 0; i < n; i++)
        printf("Send packet from Flow %d (finish=%d)\n", p[i].flow, p[i].finish);
}

Output:

Send packet from Flow 2 (finish=50)
Send packet from Flow 1 (finish=100)
Send packet from Flow 2 (finish=100)
Send packet from Flow 1 (finish=200)
Why It Matters
Bandwidth fairness: no flow hogs the link
Delay fairness: small packets don’t starve
Isolation: misbehaving flows can’t harm others
Used in: routers, disks, and OS process schedulers
Trade-offs:
Computational cost: must track virtual finish times
Approximation needed for high-speed systems
Large number of flows increases overhead
A Gentle Proof (Why It Works)
Fair Queuing approximates an ideal fluid system where each flow gets \[
\frac{w_i}{\sum_j w_j}
\] of the link rate continuously.
Each packet \(p_{i,k}\) has a finish time: \[
F_{i,k} = \max(F_{i,k-1}, V(A_{i,k})) + \frac{L_{i,k}}{w_i}
\]
By transmitting packets in increasing order of \(F_{i,k}\), we ensure at any time, the service lag between any two flows is bounded by one packet size.
Try It Yourself

Compare with FIFO, note the difference under contention.
Add a new flow mid-stream, verify smooth integration.
Test Cases
| Flows | Weights | Packets | Result |
|---|---|---|---|
| 2 | 1:1 | Equal | Fair 50/50 |
| 2 | 1:2 | Equal | Weighted 1:2 share |
| 3 | 1:1:1 | Varied size | Equal service over time |
Complexity
Decision time: \(O(\log n)\) (priority queue by \(F_{i,k}\))
Space: \(O(n)\) active flows
Accuracy: within 1 packet of ideal fair share
Fair Queuing is the balancer of schedulers, a calm arbiter ensuring every flow gets its due. Like a maestro, it interleaves packets just right, letting all voices share the line in harmony.
Section 84. Caching and Replacement
831 LRU (Least Recently Used)
Least Recently Used (LRU) is one of the most classic cache replacement policies. It removes the least recently accessed item when space runs out, assuming that recently used items are likely to be used again soon (temporal locality).
It’s simple, intuitive, and widely used in memory management, CPU caches, web caches, and databases.
What Problem Are We Solving?
Caches have limited space. When they’re full, we need to decide which item to evict to make room for a new one.
If we evict the wrong item, we might need it again soon, causing cache misses.
LRU chooses to evict the item that hasn’t been used for the longest time, based on the principle of recency:
“What you haven’t used recently is less likely to be used next.”
This policy adapts well to temporal locality, where recently accessed data tends to be accessed again.
How Does It Work (Plain Language)
Track the order of accesses. When the cache is full:
Remove the oldest (least recently used) item.
Insert the new item at the most recent position.
Each time an item is accessed:
Move it to the front (most recent position).
Think of it as a queue (or list) ordered by recency:
Front = most recently used
Back = least recently used
Example Walkthrough
Suppose cache size = 3.
Access sequence: A, B, C, A, D
| Step | Access | Cache (Front → Back) | Eviction |
|---|---|---|---|
| 1 | A | A | – |
| 2 | B | B A | – |
| 3 | C | C B A | – |
| 4 | A | A C B | – |
| 5 | D | D A C | B |
Explanation

After accessing A again, move A to the front.
When inserting D, the cache is full, so evict B, the least recently used item.

Trade-offs:

Not optimal for cyclic access patterns (e.g., sequential scans larger than the cache)
O(1) implementations require a hash map plus a doubly linked list
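This section has no Tiny Code of its own, so here is a minimal array-based sketch (O(n) per access rather than the hash-plus-list O(1) version), just enough to reproduce the walkthrough above.

#include <stdio.h>

#define CAP 3

char cache[CAP];   // index 0 = most recently used
int size = 0;

void access(char key) {
    int i;
    for (i = 0; i < size; i++)
        if (cache[i] == key) break;

    if (i < size) {
        printf("Access %c (hit)\n", key);
    } else {
        printf("Access %c (miss)\n", key);
        if (size == CAP) {
            printf("Evict %c\n", cache[CAP - 1]);   // least recently used
            i = CAP - 1;
        } else {
            i = size++;
        }
    }
    // Move the accessed key to the front (most recently used)
    for (; i > 0; i--) cache[i] = cache[i - 1];
    cache[0] = key;
}

int main(void) {
    char seq[] = {'A', 'B', 'C', 'A', 'D'};
    for (int i = 0; i < 5; i++) access(seq[i]);
}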
A Gentle Proof (Why It Works)
Let \(S\) be the access sequence and \(C\) the cache capacity.
If an item \(x\) hasn’t been used in the last \(C\) distinct accesses, then adding a new item means \(x\) would not be reused soon (in most practical workloads).
LRU approximates Belady’s optimal policy under the stack property:
Increasing cache size under LRU never increases misses.
So, LRU’s performance is monotonic, a rare and valuable property.
Try It Yourself
Simulate LRU for capacity = 3 and sequence A, B, C, D, A, B, E, A, B, C, D, E.
Compare hit ratio with FIFO and Random replacement.
Observe how temporal locality improves hits.
Try a cyclic pattern (A, B, C, D, A, B, C, D, ...), see weakness.
Implement LRU with a stack or ordered map.
Test Cases
| Cache Size | Sequence | Hits | Misses |
|---|---|---|---|
| 3 | A B C A D | 1 | 4 |
| 3 | A B C A B C | 3 | 3 |
| 2 | A B A B A B | 4 | 2 |
Complexity
Access: \(O(1)\) (with hash + list)
Space: \(O(n)\) (cache + metadata)
LRU is the memory of recency, a cache that learns from your habits. What you touch often stays close; what you forget, it lets go.
832 LFU (Least Frequently Used)
Least Frequently Used (LFU) is a cache replacement policy that evicts the least frequently accessed item first. Instead of looking at recency (like LRU), LFU focuses on frequency, assuming that items accessed often are likely to be accessed again.
It’s a natural fit for workloads with hot items that stay popular over time.
What Problem Are We Solving?
In many systems, some data is repeatedly reused (hot items), while others are rarely needed. If we use LRU, a single burst of sequential access might flush out popular items, a phenomenon called cache pollution.
LFU solves this by tracking access counts, so frequently accessed items are protected from eviction.
Eviction rule: Remove the item with the lowest access frequency.
How Does It Work (Plain Language)
Every time an item is accessed:
Increase its frequency count
Reorder or reclassify it by that count
When cache is full:
Evict item(s) with smallest frequency
Think of it as a priority queue ranked by access frequency:
Items that appear often rise to the top
Rarely accessed ones drift down and out
Example Walkthrough
Cache capacity = 3
Access sequence: A, B, C, A, B, A, D

| Step | Access | Frequency Counts | Eviction |
|---|---|---|---|
| 1 | A | A:1 | – |
| 2 | B | A:1, B:1 | – |
| 3 | C | A:1, B:1, C:1 | – |
| 4 | A | A:2, B:1, C:1 | – |
| 5 | B | A:2, B:2, C:1 | – |
| 6 | A | A:3, B:2, C:1 | – |
| 7 | D | A:3, B:2, C:1 | Evict C |
Eviction: C (lowest frequency)
Final cache: A (3), B (2), D (1)
Tiny Code (C)
#include <stdio.h>
#include <stdlib.h>

#define CAP 3

typedef struct {
    char key;
    int freq;
} Entry;

Entry cache[CAP];
int size = 0;

void access(char key) {
    for (int i = 0; i < size; i++) {
        if (cache[i].key == key) {
            cache[i].freq++;
            printf("Access %c (hit, freq=%d)\n", key, cache[i].freq);
            return;
        }
    }
    printf("Access %c (miss)\n", key);
    if (size < CAP) {
        cache[size++] = (Entry){key, 1};
        return;
    }
    // Evict the least frequently used entry
    int min_i = 0;
    for (int i = 1; i < size; i++)
        if (cache[i].freq < cache[min_i].freq)
            min_i = i;
    printf("Evict %c (freq=%d)\n", cache[min_i].key, cache[min_i].freq);
    cache[min_i] = (Entry){key, 1};
}

int main(void) {
    char seq[] = {'A', 'B', 'C', 'A', 'B', 'A', 'D'};
    for (int i = 0; i < 7; i++) access(seq[i]);
}
Output:
Access A (miss)
Access B (miss)
Access C (miss)
Access A (hit, freq=2)
Access B (hit, freq=2)
Access A (hit, freq=3)
Access D (miss)
Evict C (freq=1)
Why It Matters
Protects hot data that stays popular
Reduces cache pollution from one-time scans
Great for skewed workloads (Zipfian distributions)
Trade-offs:
Harder to implement efficiently (needs priority by freq)
Old popular items may linger forever (no aging)
Heavy bookkeeping if naive
Variants like LFU with aging or TinyLFU solve staleness by decaying frequencies over time.
A Gentle Proof (Why It Works)
Let \(f(x)\) be the frequency count of item \(x\). LFU keeps in cache all items with highest \(f(x)\) under capacity \(C\).
If access distribution is stable, then the top \(C\) frequent items minimize cache misses.
In probabilistic terms, for access probabilities \(p(x)\), the optimal steady-state cache holds the \(C\) items with largest \(p(x)\), LFU approximates this by counting.
Try It Yourself
Run sequence A B C A B A D (capacity 3).
Try a streaming pattern (A B C D E F …), watch LFU degenerate.
Add aging (divide all frequencies by 2 periodically).
Compare results with LRU and Random.
Visualize counts over time, see persistence of hot items.
Test Cases
| Cache Size | Sequence | Evicted | Result |
|---|---|---|---|
| 3 | A B C A B A D | C | A(3), B(2), D(1) |
| 2 | A B A B C | A | C replaces A (tie broken arbitrarily) |
| 3 | A B C D E | A, B | All at freq 1, arbitrary eviction |
Complexity
Access: \(O(\log n)\) (heap) or \(O(1)\) (frequency buckets)
Space: \(O(n)\) for frequency tracking
LFU is the statistician of caches, watching patterns, counting faithfully, and keeping only what history proves to be popular.
833 FIFO Cache (First-In, First-Out)
First-In, First-Out (FIFO) is one of the simplest cache replacement policies. It evicts the oldest item in the cache, the one that has been there the longest, regardless of how often or recently it was used.
It’s easy to implement with a simple queue, but doesn’t consider recency or frequency, making it prone to anomalies.
What Problem Are We Solving?
When a cache is full, we must remove something to make space. FIFO answers the question with a simple rule:
“Evict the item that entered first.”
This works well when data follows streaming patterns (old items are unlikely to be reused), but fails when older items are still hot (reused frequently).
It’s mostly used when simplicity is preferred over precision.
How Does It Work (Plain Language)
Maintain a queue of items in insertion order.
When a new item is accessed:
If it’s already in cache → hit (do nothing)
If not → miss, insert at rear
If the cache is full → evict item at front (oldest)
No reordering happens on access, unlike LRU.
Example Walkthrough
Cache capacity = 3
Access sequence: A, B, C, A, D, B

| Step | Access | Cache (Front → Back) | Eviction |
|---|---|---|---|
| 1 | A | A | – |
| 2 | B | A B | – |
| 3 | C | A B C | – |
| 4 | A | A B C | – |
| 5 | D | B C D | Evict A |
| 6 | B | B C D | – |
Observation: Even though A was used again, FIFO evicted it because it was oldest, not least used.
Tiny Code (C)
#include <stdio.h>

#define CAP 3

char cache[CAP];
int size = 0;
int front = 0;   // index of the oldest entry

int in_cache(char key) {
    for (int i = 0; i < size; i++)
        if (cache[i] == key) return 1;
    return 0;
}

void access(char key) {
    if (in_cache(key)) {
        printf("Access %c (hit)\n", key);
        return;
    }
    printf("Access %c (miss)\n", key);
    if (size < CAP) {
        cache[size++] = key;
    } else {
        printf("Evict %c\n", cache[front]);
        cache[front] = key;                // overwrite the oldest slot
        front = (front + 1) % CAP;
    }
}

int main(void) {
    char seq[] = {'A', 'B', 'C', 'A', 'D', 'B'};
    for (int i = 0; i < 6; i++) access(seq[i]);
}
Output:
Access A (miss)
Access B (miss)
Access C (miss)
Access A (hit)
Access D (miss)
Evict A
Access B (hit)
Why It Matters
Simplicity: easy to implement (just a queue)
Deterministic: predictable behavior
Useful for FIFO queues and streams
Trade-offs:
No recency awareness: evicts recently used data
Belady’s anomaly: increasing cache size may increase misses
Poor temporal locality handling
A Gentle Proof (Why It Works)
FIFO operates as a sliding window of recent insertions. Each new access pushes out the oldest, regardless of usage.
Formally, at time \(t\), cache holds the last \(C\) unique insertions. If a reused item’s insertion lies outside that window, it’s gone, hence poor performance for repeating patterns.
Try It Yourself
Simulate capacity = 3, sequence A B C D A B C D (a cycle larger than the cache).
Note that every access misses once the cache fills, no reuse is captured.
Compare with a reuse-heavy sequence such as A B C A B C, where the working set fits and every repeat is a hit.
Test with streaming data (e.g. sequential blocks), FIFO shines.
Visualize queue evolution step by step.
Test Cases
| Cache Size | Sequence | Miss Count | Note |
|---|---|---|---|
| 3 | A B C A D B | 4 | Evicts by age |
| 3 | A B C D A B C D | 8 | Cycle larger than cache, every access misses |
| 2 | A B A B A B | 2 | Works if reuse fits the window |
Complexity
Access: \(O(1)\) (queue operations)
Space: \(O(C)\) (cache array)
FIFO is the assembly line of caches, it moves steadily forward, never looking back to see what it’s letting go.
834 CLOCK Algorithm
The CLOCK algorithm is an efficient approximation of LRU (Least Recently Used). Instead of maintaining a full recency list, CLOCK keeps a circular buffer and a use bit for each page, achieving near-LRU performance with much lower overhead.
It’s widely used in operating systems (e.g., page replacement in virtual memory) due to its simplicity and speed.
What Problem Are We Solving?
A true LRU cache needs to track exact access order, which can be expensive:
Updating position on every access
Maintaining linked lists or timestamps
The CLOCK algorithm approximates LRU using a single reference bit, rotating like a clock hand to find victims lazily.
This reduces overhead while maintaining similar hit rates.
How Does It Work (Plain Language)
Imagine pages arranged in a circle, each with a use bit (0 or 1). A clock hand moves around the circle, pointing at candidates for eviction.
When a page is accessed:
Set its use bit = 1
When cache is full and eviction is needed:
Look at the page under the clock hand.
If use bit = 0, evict it.
If use bit = 1, set it to 0 and advance the hand.
Repeat until a 0 is found.
This ensures recently used pages get a second chance.
Example Walkthrough
Cache capacity = 3
Access sequence: A, B, C, D, B, E

| Step | Access | Cache | Use Bits | Hand | Eviction |
|---|---|---|---|---|---|
| 1 | A | A | [1] | →A | – |
| 2 | B | A B | [1 1] | →A | – |
| 3 | C | A B C | [1 1 1] | →A | – |
| 4 | D | D B C | [1 0 0] | →B | Evict A |
| 5 | B | D B C | [1 1 0] | →B | – |
| 6 | E | D B E | [1 0 1] | →D | Evict C |
Detailed steps:
When full, D arrives. Hand points to A with use=1 → set 0 → move → B (1) → set 0 → move → C (1) → set 0 → move → back to A (0) → evict A → insert D.
Tiny Code (C)
#include <stdio.h>

#define CAP 3

typedef struct {
    char key;
    int use;   // reference bit
} Page;

Page cache[CAP];
int size = 0, hand = 0;

int find(char key) {
    for (int i = 0; i < size; i++)
        if (cache[i].key == key) return i;
    return -1;
}

void access(char key) {
    int i = find(key);
    if (i != -1) {
        printf("Access %c (hit)\n", key);
        cache[i].use = 1;
        return;
    }
    printf("Access %c (miss)\n", key);
    if (size < CAP) {
        cache[size++] = (Page){key, 1};
        return;
    }
    // CLOCK eviction: sweep until a page with use bit 0 is found
    while (1) {
        if (cache[hand].use == 0) {
            printf("Evict %c\n", cache[hand].key);
            cache[hand] = (Page){key, 1};
            hand = (hand + 1) % CAP;
            break;
        }
        cache[hand].use = 0;   // clear the bit: second chance
        hand = (hand + 1) % CAP;
    }
}

int main(void) {
    char seq[] = {'A', 'B', 'C', 'D', 'B', 'E'};
    for (int i = 0; i < 6; i++) access(seq[i]);
}
Why It Matters
LRU-like performance, simpler data structures
Constant-time eviction using circular pointer
OS standard: used in UNIX, Linux, Windows VM
Trade-offs:
Approximation, not perfect LRU
May favor pages with frequent but spaced-out accesses
Still needs per-page bit storage
A Gentle Proof (Why It Works)
Let each page have a use bit \(u_i \in {0, 1}\). When a page is referenced, \(u_i = 1\). The hand cycles through pages, clearing \(u_i = 0\) if it was 1.
A page survives one full rotation only if it was accessed, so only unreferenced pages are eventually replaced.
Hence, CLOCK guarantees: \[
\text{Eviction order approximates recency order}
\] with \(O(1)\) maintenance cost.
Try It Yourself
Simulate sequence A B C D A B E.
Track use bits and hand pointer.
Compare with LRU results, usually same evictions.
Increase capacity, verify similar trends.
Extend to Second-Chance FIFO (CLOCK = improved FIFO).
Test Cases
| Cache Size | Sequence | Evicted | Result |
|---|---|---|---|
| 3 | A B C D B E | A, C | Matches LRU here |
| 2 | A B A C A | A, B | Hand position makes it diverge from LRU |
| 3 | A A B C D | A | All bits set, so the sweep degenerates to FIFO |
Complexity
Access: \(O(1)\)
Evict: \(O(1)\) (amortized)
Space: \(O(C)\)
The CLOCK algorithm is the gentle LRU, quietly spinning, giving second chances, and evicting only what truly rests forgotten.
835 ARC (Adaptive Replacement Cache)
Adaptive Replacement Cache (ARC) is a self-tuning caching algorithm that dynamically balances recency and frequency. It combines the strengths of LRU (recently used items) and LFU (frequently used items), and adapts automatically as access patterns change.
ARC was introduced by IBM Research and is used in systems like ZFS for intelligent caching under mixed workloads.
What Problem Are We Solving?
Traditional caches choose a fixed policy:
LRU favors recency but fails under cyclic scans.
LFU favors frequency but forgets recency.
Real-world workloads fluctuate, sometimes new items matter more, sometimes reused ones do.
ARC solves this by tracking both and adapting dynamically based on which side yields more hits.
“If recency helps, favor LRU. If frequency helps, favor LFU.”
How Does It Work (Plain Language)
ARC maintains four lists:
| List | Meaning |
|---|---|
| T₁ | Recent items seen once (LRU side) |
| T₂ | Frequent items seen twice or more (LFU side) |
| B₁ | Ghost list of items recently evicted from T₁ |
| B₂ | Ghost list of items recently evicted from T₂ |
The ghost lists don’t store data, only keys, to remember what was evicted.
ARC dynamically adjusts a target size p:
If a ghost hit occurs in B₁, recency is under-allocated → increase p (favor T₁)
If a ghost hit occurs in B₂, frequency is under-allocated → decrease p (favor T₂)
Thus, ARC learns online how to divide cache between recency (T₁) and frequency (T₂).
Example Flow
Cache size = 4
Access A B C D → fills T₁ = [D C B A]
Access A again → move A from T₁ → T₂
Access E → evict least recent (D) from T₁ → record in B₁
Later, access D again → ghost hit in B₁ → increase p → favor T₁
ARC continuously balances recency vs frequency using these ghost signals.
Tiny Code (Pseudocode)
on access(x):
    if x in T1 or x in T2:            // cache hit
        move x to front of T2
    else if x in B1:                  // ghost hit in recency
        increase p
        replace(x)
        move x to front of T2
    else if x in B2:                  // ghost hit in frequency
        decrease p
        replace(x)
        move x to front of T2
    else:                             // new item
        if T1.size + B1.size == c:
            if T1.size < c:
                remove oldest from B1
                replace(x)
            else:
                remove oldest from T1
        else if total size >= c:
            if total size == 2c:
                remove oldest from B2
            replace(x)
        insert x to front of T1
Why It Matters
Self-tuning: adjusts between LRU and LFU automatically
Workload adaptive: handles scans + hot items gracefully
No manual tuning needed: p updates online
Trade-offs:
More complex bookkeeping
Higher metadata overhead (four lists)
Patented (IBM; used with license in ZFS)
A Gentle Proof (Why It Works)
ARC approximates the optimal adaptive mix of LRU and LFU.
Let:
\(c\) = total cache size
\(p\) = partition target for recency (T₁)
At any moment:
T₁ stores \(\min(p, c)\) most recent items
T₂ stores remaining \(\max(0, c - p)\) most frequent items
Ghost lists (B₁, B₂) serve as feedback channels. If \(|B₁| > |B₂|\), more recency hits → increase \(p\) (favor LRU). If \(|B₂| > |B₁|\), more frequency hits → decrease \(p\) (favor LFU).
Thus, ARC converges to a policy close to OPT for given access distribution.
Try It Yourself
Simulate sequence A B C D A B E A B C D (capacity 4).
Watch p shift as ghost hits occur.
Compare with pure LRU and LFU, ARC adapts better.
Introduce a long scan (A…Z) → ARC shifts toward frequency.
ARC is the smart hybrid, watching both history and frequency, learning which pattern rules the moment, and shifting its balance like a living system.
836 Two-Queue (2Q)
Two-Queue (2Q) caching is a clever and lightweight enhancement over plain LRU. It separates recently accessed pages from frequently reused ones, reducing cache pollution caused by one-time accesses, a problem LRU alone struggles with.
You can think of 2Q as a simplified, practical cousin of ARC, same spirit, fewer moving parts.
What Problem Are We Solving?
LRU evicts the least recently used page, but in workloads with large sequential scans, those newly seen pages can push out hot pages that are still needed.
2Q prevents that by keeping new pages in a probationary queue first. Only items accessed twice move into the main queue.
“New items must earn their place in the main cache.”
This drastically improves performance when there’s a mix of one-shot and repeated data.
How Does It Work (Plain Language)
2Q maintains two LRU queues:
| Queue | Meaning | Behavior |
|---|---|---|
| A1 (In) | Recently seen once | Temporary holding area |
| Am (Main) | Seen at least twice | Long-term cache |
Flow of access:
On miss: insert page into A1 (if not full).
On hit in A1: promote page to Am.
On hit in Am: move to MRU (front).
If A1 is full → evict oldest (LRU) from A1.
If Am is full → evict oldest from Am.
Example Walkthrough
Cache capacity = 4 (A1 = 2, Am = 2)
Access sequence: A, B, C, A, D, B, E, A

(The full 2Q algorithm also keeps a small ghost list, A1out, of keys recently evicted from A1; a hit on a ghost key counts as a second access and promotes the page into Am. That is why A can be promoted in step 4 even though it left A1 in step 3.)

| Step | Access | A1 (Recent) | Am (Frequent) | Note |
|---|---|---|---|---|
| 1 | A | A | – | – |
| 2 | B | B A | – | – |
| 3 | C | C B | – | Evict A (oldest in A1) |
| 4 | A | C B | A | Ghost hit, promote A to Am |
| 5 | D | D C | A | Evict B from A1 |
| 6 | B | D C | A B | Ghost hit, promote B to Am |
| 7 | E | E D | A B | Evict C from A1 |
| 8 | A | E D | A B | Hit in Am |
Result: hot pages A, B persist; cold scans flushed harmlessly.
Tiny Code (C)
#include <stdio.h>
#include <string.h>

#define A1_SIZE 2
#define AM_SIZE 2

char A1[A1_SIZE], Am[AM_SIZE];   // index 0 = most recently used
int a1_len = 0, am_len = 0;

int in_list(char *list, int len, char k) {
    for (int i = 0; i < len; i++)
        if (list[i] == k) return i;
    return -1;
}

void access(char k) {
    int i;
    if ((i = in_list(Am, am_len, k)) != -1) {
        printf("Access %c (hit in Am)\n", k);
        return;
    }
    if ((i = in_list(A1, a1_len, k)) != -1) {
        printf("Access %c (promote to Am)\n", k);
        memmove(&A1[i], &A1[i + 1], a1_len - i - 1);   // remove from A1
        a1_len--;
        if (am_len == AM_SIZE) {                       // evict LRU of Am
            printf("Evict %c from Am\n", Am[AM_SIZE - 1]);
            am_len--;
        }
        memmove(&Am[1], &Am[0], am_len);               // insert at MRU of Am
        Am[0] = k;
        am_len++;
        return;
    }
    printf("Access %c (miss, insert in A1)\n", k);
    if (a1_len == A1_SIZE) {                           // evict LRU of A1
        printf("Evict %c from A1\n", A1[A1_SIZE - 1]);
        a1_len--;
    }
    memmove(&A1[1], &A1[0], a1_len);                   // insert at MRU of A1
    A1[0] = k;
    a1_len++;
}

int main(void) {
    char seq[] = {'A', 'B', 'C', 'A', 'D', 'B', 'E', 'A'};
    for (int i = 0; i < 8; i++) access(seq[i]);
}
Why It Matters
Mitigates LRU’s weakness: avoids flushing cache during scans
Lightweight: simpler than ARC, easy to implement
Adapts automatically to reuse frequency
Used in: PostgreSQL, InnoDB, and OS caches
Trade-offs:
Two parameters (sizes of A1 and Am) need tuning
Slightly more metadata than plain LRU
Doesn’t fully capture long-term frequency like LFU
A Gentle Proof (Why It Works)
Let:
\(A_1\) track items seen once
\(A_m\) track items seen at least twice
Assume access probabilities \(p_i\). Under steady state:
\(A_1\) filters one-timers (low \(p_i\))
\(A_m\) holds high-\(p_i\) items so 2Q approximates a frequency-aware policy with linear overhead.
Formally, 2Q’s miss rate is lower than LRU when \(p(\text{one-timers}) > 0\) because such pages are quickly cycled out before polluting the main cache.
Try It Yourself
Simulate with sequences that mix hot and cold pages.
Compare LRU vs 2Q under a long sequential scan.
Adjust A1/Am ratio (e.g., 25/75, 50/50), observe sensitivity.
Add repeated hot items among cold ones, see 2Q adapt.
Measure hit ratio vs LRU and LFU.
Test Cases
| Sequence | Capacity | Policy | Hit Ratio | Winner |
|---|---|---|---|---|
| A B C D A B C D | 4 | LRU | Low | – |
| A B C D A B | 4 | 2Q | Higher | 2Q |
| Hot + cold mix | 4 | 2Q | Better stability | 2Q |
Complexity
Access: \(O(1)\) (with linked lists or queues)
Space: \(O(C)\) (two queues)
Adaptation: static ratio (manual tuning)
2Q is the street-smart LRU, it doesn’t trust new pages right away. They must prove their worth before joining the inner circle of trusted, frequently used data.
837 LIRS (Low Inter-reference Recency Set)
LIRS (Low Inter-reference Recency Set) is a high-performance cache replacement algorithm that refines LRU by distinguishing truly hot pages from temporarily popular ones. It measures reuse distance rather than just recency, capturing deeper temporal patterns in access behavior.
Invented by Jiang and Zhang (2002), LIRS achieves lower miss rates than LRU and ARC in workloads with irregular reuse patterns.
What Problem Are We Solving?
LRU ranks pages by time since last access, assuming that recently used pages will be used again soon. But this fails when some pages are accessed once per long interval, they still appear “recent” but aren’t hot.
LIRS improves upon this by ranking pages by their reuse distance:
“How recently has this page been reused, compared to others?”
The smaller the reuse distance, the more likely the page will be reused again.
How Does It Work (Plain Language)
LIRS maintains two sets:
Set
Meaning
Content
LIR (Low Inter-reference Recency)
Frequently reused pages
Kept in cache
HIR (High Inter-reference Recency)
Rarely reused or new pages
Some resident, most only recorded
All pages (resident or not) are tracked in a stack S, ordered by recency. A separate queue Q holds the resident HIR pages.
When a page is accessed:
If it’s in LIR → move it to the top (most recent).
If it’s in HIR (resident) → promote it to LIR; demote the least recent LIR to HIR.
If it’s not resident → add it as HIR; evict oldest resident HIR if needed.
Trim the bottom of stack S to remove obsolete entries (non-resident and non-referenced).
Thus, LIRS dynamically adjusts based on actual reuse distance, not just latest access.
Example Walkthrough
Cache size = 3 Access sequence: A, B, C, D, A, B, E, A, B, C
Step
Access
LIR
HIR (Resident)
Eviction
1
A
A
-
-
2
B
A B
-
-
3
C
A B C
-
-
4
D
B C
D
Evict A
5
A
C D
A
Evict B
6
B
D A
B
Evict C
7
E
A B
E
Evict D
8
A
B E
A
Evict none (A reused)
9
B
E A
B
-
10
C
A B
C
Evict E
Hot pages (A,B) stay stable in LIR; cold ones cycle through HIR.
(Simplified logic; real LIRS maintains stack S and set Q explicitly.)
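Tiny Code (Python Sketch)
A toy sketch of the simplified promote/demote rules above, not the full stack-pruning algorithm, so it will not reproduce the table exactly. The class name and the LIR/HIR size split are illustrative assumptions.

from collections import OrderedDict

class SimpleLIRS:
    """Toy LIRS: lir holds hot pages, hir holds resident cold pages (oldest first)."""
    def __init__(self, lir_size, hir_size):
        self.lir_size, self.hir_size = lir_size, hir_size
        self.lir = OrderedDict()   # LIR pages, ordered by recency (most recent last)
        self.hir = OrderedDict()   # resident HIR pages, ordered by recency

    def access(self, page):
        if page in self.lir:                          # hit in LIR: refresh recency
            self.lir.move_to_end(page)
            return "hit (LIR)"
        if page in self.hir:                          # hit in resident HIR: promote
            del self.hir[page]
            self.lir[page] = True
            if len(self.lir) > self.lir_size:         # demote the coldest LIR page
                old, _ = self.lir.popitem(last=False)
                self.hir[old] = True
                self._shrink_hir()
            return "hit (HIR, promoted)"
        if len(self.lir) < self.lir_size:             # warm-up: fill the LIR set first
            self.lir[page] = True
            return "miss (inserted as LIR)"
        self.hir[page] = True                         # otherwise insert as resident HIR
        self._shrink_hir()
        return "miss (inserted as HIR)"

    def _shrink_hir(self):
        while len(self.hir) > self.hir_size:
            self.hir.popitem(last=False)              # evict the oldest resident HIR page

cache = SimpleLIRS(lir_size=2, hir_size=1)
for p in "ABCDABEABC":
    print(p, cache.access(p))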
Why It Matters
Higher hit ratio than LRU for mixed workloads
Handles scans gracefully (avoids LRU thrashing)
Adapts dynamically without manual tuning
Used in: database buffer pools, SSD caching, OS kernels
Trade-offs:
More bookkeeping (stack and sets)
Slightly higher overhead than LRU
Harder to implement from scratch
A Gentle Proof (Why It Works)
Let \(r(p)\) be the inter-reference recency of page \(p\) (the number of distinct pages accessed between two uses of \(p\)). LIRS retains pages with low \(r(p)\) (frequent reuse).
Over time, the algorithm maintains:
LIR set = pages with small \(r(p)\)
HIR set = pages with large \(r(p)\)
Since low-\(r\) pages minimize expected misses, LIRS approximates OPT (Belady’s optimal) more closely than LRU.
Formally, the LIR set approximates: \[
\text{LIR} = \arg\min_{|S| = C} \sum_{p \in S} r(p)
\]
Try It Yourself
Simulate LIRS on sequence A B C D A B E A B C.
Compare with LRU, LIRS avoids excessive evictions.
Add a large scan (A B C D E F G A), note fewer misses.
Visualize reuse distances, see how LIR stabilizes.
Experiment with different cache sizes.
Test Cases
Sequence
Policy
Miss Rate
Winner
Repeating (A,B,C,A,B,C)
LRU
Low
Tie
Scanning (A,B,C,D,E)
LRU
High
LIRS
Mixed reuse
LIRS
Lowest
LIRS
Complexity
Access: \(O(1)\) (amortized, with stack trimming)
Space: \(O(C)\) (cache + metadata)
Adaptivity: automatic
LIRS is the strategist of cache algorithms, it doesn’t just remember when you used something, it remembers how regularly you come back to it, and prioritizes accordingly.
838 TinyLFU (Tiny Least Frequently Used)
TinyLFU is a modern, probabilistic cache admission policy that tracks item frequencies using compact counters instead of storing full histories. It decides which items deserve to enter the cache rather than which ones to evict directly, often combined with another policy like LRU or ARC.
TinyLFU powers real-world systems like Caffeine (Java cache) and modern web caching layers, achieving near-optimal hit rates with minimal memory.
What Problem Are We Solving?
Classic LFU requires storing a frequency counter for every cached item, which is space-expensive. Also, it updates counters even for items that are soon evicted, wasting memory and CPU.
TinyLFU solves both problems by using approximate counting with a fixed-size sketch to record frequencies of recently seen items. It makes admission decisions probabilistically, keeping only items that are more popular than the one being evicted.
“Don’t just evict wisely, admit wisely.”
How Does It Work (Plain Language)
TinyLFU consists of two key ideas:
Frequency Sketch
A compact counter structure (like a count-min sketch) tracks how often items are seen in a sliding window.
Old frequencies are decayed periodically.
Admission Policy
When the cache is full and a new item arrives:
Estimate its frequency f_new
Compare to the victim’s frequency f_victim
If f_new > f_victim, admit the new item
Else, reject it (keep the old one)
Thus, TinyLFU doesn’t blindly replace, it asks:
“Is this newcomer really more popular than the current tenant?”
Example Walkthrough
Cache capacity = 3 Access sequence: A, B, C, D, A, E, A, F
Step
Access
Frequency Estimates
Decision
1
A
A:1
Admit
2
B
B:1
Admit
3
C
C:1
Admit
4
D
D:1, Victim C:1
Reject (equal)
5
A
A:2
Keep
6
E
E:1, Victim B:1
Reject
7
A
A:3
Keep
8
F
F:1, Victim B:1
Reject
Result: Cache stabilizes with A (3), B (1), C (1). Most frequent item A dominates, efficient memory use.
The sketch here is a fixed-size table of hashed counters, updated probabilistically to save space.
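Tiny Code (Python Sketch)
A minimal sketch of the two ingredients, a count-min-style frequency sketch plus the admission test. The sizes, hash choice, and halving-based decay are illustrative assumptions, not the layout used by Caffeine.

import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.width

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._index(item, r)] += 1

    def estimate(self, item):
        return min(self.table[r][self._index(item, r)] for r in range(self.depth))

    def decay(self):
        """Periodically halve every counter so the sketch forgets old popularity."""
        for row in self.table:
            for i in range(len(row)):
                row[i] //= 2

def admit(sketch, candidate, victim):
    """TinyLFU admission: keep whichever of candidate/victim looks more frequent."""
    return sketch.estimate(candidate) > sketch.estimate(victim)

sketch = CountMinSketch()
for item in "ABCDAEAF":
    sketch.add(item)
print(admit(sketch, candidate="F", victim="A"))   # False: A is clearly more popular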
Why It Matters
Space-efficient: approximate counters, no per-item state
High performance: near-optimal hit rate for dynamic workloads
Works with others: often paired with LRU or CLOCK for recency tracking
Resistant to scan pollution: ignores rare or one-shot items
Trade-offs:
Probabilistic errors: counter collisions cause small inaccuracies
Extra computation: requires hashing and frequency comparison
No strict ordering: depends on approximate popularity
A Gentle Proof (Why It Works)
TinyLFU estimates frequencies over a recent access window of size \(W\). Each counter represents a bin in a count-min sketch, so for an item \(x\), frequency is approximated as:
\[
\hat{f}(x) = \min_{i} C[h_i(x)]
\]
where \(C\) are the counters and \(h_i\) are hash functions.
This ensures the cache tends toward frequency-optimal content while keeping space complexity \(O(\log n)\) instead of \(O(n)\).
Since frequencies decay periodically, the sketch naturally forgets old data, keeping focus on recent popularity.
Try It Yourself
Implement a count-min sketch (4 hash functions, 1024 counters).
Run sequence A B C D A E A F.
Observe which items are admitted vs rejected.
Compare with LFU, see how TinyLFU mimics it with 1% of the space.
Integrate with LRU for eviction, “W-TinyLFU”.
Test Cases
Sequence
Policy
Miss Rate
Comment
A B C D A E A F
LFU
0.50
baseline
A B C D A E A F
TinyLFU
0.33
better hit rate
Streaming + hot
TinyLFU
lowest
adaptively filtered
Complexity
Access: \(O(1)\) (hashing + counter ops)
Space: \(O(k \log W)\) for sketch
Eviction: paired with LRU or CLOCK
TinyLFU is the gatekeeper of modern caches, small, fast, and smart enough to remember what truly matters, admitting only the worthy into memory.
839 Random Replacement
Random Replacement is the simplest cache eviction strategy imaginable. When the cache is full and a new item arrives, it evicts a randomly chosen existing item. No recency, no frequency, just chance.
It sounds naive, but surprisingly, Random Replacement performs reasonably well for certain workloads and serves as a baseline for evaluating smarter policies like LRU or ARC.
What Problem Are We Solving?
In constrained environments (hardware caches, embedded systems, or high-speed switches), maintaining detailed metadata for LRU or LFU can be too expensive. Updating access timestamps or maintaining linked lists every hit costs time and memory.
Random Replacement eliminates that cost. It trades precision for simplicity and speed.
“When you can’t decide who to evict, let randomness decide.”
How Does It Work (Plain Language)
On every access:
If the item is in the cache, it’s a hit.
If not, it’s a miss.
If cache is full:
Pick a random slot.
Replace that item with the new one.
Otherwise, just insert the new item.
No ordering, no tracking, no statistics.
Example Walkthrough
Cache capacity = 3 Access sequence: A, B, C, D, B, A, E
Step
Access
Cache Contents
Eviction
1
A
A
-
2
B
A B
-
3
C
A B C
-
4
D
B C D
Evict random (A)
5
B
B C D
-
6
A
C D A
Evict random (B)
7
E
D A E
Evict random (C)
Each eviction is uniform at random, so behavior varies per run, but overall, Random keeps the cache populated with recent and older items in roughly balanced proportions.
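Tiny Code (Python Sketch)
A minimal simulation of the policy, handy as a baseline when comparing hit ratios; the helper name and fixed seed are illustrative.

import random

def simulate_random_cache(accesses, capacity, seed=0):
    """Random replacement: on a miss with a full cache, evict a uniformly random item."""
    rng = random.Random(seed)
    cache, hits = set(), 0
    for item in accesses:
        if item in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.remove(rng.choice(sorted(cache)))   # pick a random victim
            cache.add(item)
    return hits / len(accesses)

print(simulate_random_cache(list("ABCDBAE"), capacity=3))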
Why It Matters
Simple and fast: constant-time decisions with no metadata to update
Works reasonably well when accesses are close to uniform
Trade-offs:
Poor locality capture, ignores recency and frequency
Unpredictable performance
Can evict hot pages accidentally
Still, Random is common in hardware caches (like TLBs) where simplicity and speed are crucial.
A Gentle Proof (Why It Works)
In steady state, every item in the cache has an equal chance \(1/C\) of being evicted on each miss. For items with uniform access probability, the hit ratio is roughly:
\[
H = 1 - \frac{1}{C + 1}
\]
This means Random Replacement achieves about the same hit rate as LRU for uniform random access, but performs worse under temporal locality.
As cache size increases, variance smooths out and hit rate converges.
Try It Yourself
Simulate 100 random access sequences (e.g., over 10 items, cache size 3).
Measure average hit ratio.
Compare with LRU and LFU, note how Random stays consistent but never optimal.
Try with skewed access patterns (Zipfian), Random falls behind.
Measure standard deviation, Random’s variance is small in large systems.
Test Cases
Sequence
Policy
Cache Size
Miss Rate
Winner
Uniform Random Access
Random
3
Low
≈LRU
Hot + Cold Mix
Random
3
Higher
LRU
Sequential Scan
Random
3
Similar
FIFO
Complexity
Access: \(O(1)\)
Space: \(O(C)\)
No bookkeeping overhead
Random Replacement is the coin flip of caching, it doesn’t remember or predict, yet it stays surprisingly stable. When simplicity and speed are king, randomness is enough.
840 Belady’s Optimal Algorithm (OPT)
Belady’s Optimal Algorithm, often called OPT, is the theoretical gold standard of cache replacement. It knows the future: at each eviction, it removes the item that will not be used for the longest time in the future.
No real cache can implement this perfectly (since we can’t see the future), but OPT serves as the benchmark against which all practical algorithms (LRU, LFU, ARC, etc.) are measured.
What Problem Are We Solving?
Every cache replacement policy tries to minimize cache misses. But all practical algorithms rely only on past and present access patterns.
Belady’s OPT assumes perfect foresight, it can see the entire future access sequence and thus make the globally optimal choice at every step.
“If you know what’s coming next, you’ll never make the wrong eviction.”
It gives the lowest possible miss rate for any given access trace and cache size.
How Does It Work (Plain Language)
When the cache is full and a new item must be inserted:
Look ahead in the future access sequence.
For each item currently in the cache, find when it will be used next.
Evict the item whose next use is farthest in the future.
If an item is never used again, it’s the perfect victim.
That’s it, simple in concept, impossible in practice.
Example Walkthrough
Cache capacity = 3 Access sequence: A, B, C, D, A, B, E, A, B, C, D, E
Step
Access
Cache (before)
Future Uses
Evict
Cache (after)
1
A
-
-
-
A
2
B
A
-
-
A B
3
C
A B
-
-
A B C
4
D
A B C
A(5), B(6), C(10)
C (farthest)
A B D
5
A
A B D
B(6), D(11)
-
A B D
6
B
A B D
A(8), D(11)
-
A B D
7
E
A B D
A(8), B(9), D(11)
D (farthest)
A B E
8
A
A B E
B(9), E(12)
-
A B E
9
B
A B E
A(-), E(12)
-
A B E
10
C
A B E
A(-), B(-), E(12)
A (never reused)
B C E
11
D
B C E
B(-), C(-), E(12)
B (never reused)
C D E
12
E
C D E
-
-
C D E
Result: 7 misses in total, the minimum possible, no other policy can do better on this trace.
Tiny Code (C, for simulation)
#include <stdio.h>

#define CAP 3
#define SEQ_LEN 12

char seq[] = {'A','B','C','D','A','B','E','A','B','C','D','E'};
char cache[CAP];
int size = 0;

int find(char key) {
    for (int i = 0; i < size; i++)
        if (cache[i] == key) return i;
    return -1;
}

int next_use(char key, int pos) {
    for (int i = pos + 1; i < SEQ_LEN; i++)
        if (seq[i] == key) return i;
    return 999; /* infinity (never used again) */
}

void access(int pos) {
    char key = seq[pos];
    if (find(key) != -1) {
        printf("Access %c (hit)\n", key);
        return;
    }
    printf("Access %c (miss)\n", key);
    if (size < CAP) {
        cache[size++] = key;
        return;
    }
    /* OPT eviction */
    int victim = 0, farthest = -1;
    for (int i = 0; i < size; i++) {
        int next = next_use(cache[i], pos);
        if (next > farthest) {
            farthest = next;
            victim = i;
        }
    }
    printf("Evict %c\n", cache[victim]);
    cache[victim] = key;
}

int main() {
    for (int i = 0; i < SEQ_LEN; i++) access(i);
}
Why It Matters
Defines the theoretical lower bound of cache misses
Benchmark for evaluating other policies
Helps prove optimality gaps (how close LRU or ARC gets)
Conceptually simple, analytically powerful
Trade-offs:
Requires complete future knowledge (impossible in real systems)
Offline only, but essential for simulation and research
A Gentle Proof (Why It Works)
Let the future access sequence be \(S = [s_1, s_2, \dots, s_n]\) and cache capacity \(C\).
At any time \(t\), OPT evicts the cached page \(p^*\) whose next reference is farthest in the future:
\[
p^* = \arg\max_{p \in \text{cache}(t)} r(p),
\]
where \(r(p)\) is the position of \(p\)'s next reference (treated as \(\infty\) if it never appears again). Exchanging any other eviction for this one can only delay, never create, a future miss, which is the classic exchange argument for OPT's optimality.
Try It Yourself
Use OPT as ground truth for evaluating custom algorithms.
Test Cases
Cache Size
Sequence
LRU Misses
FIFO Misses
OPT Misses
3
A B C D A B E A B C D E
10
9
7
2
A B A B A B
2
2
2
4
A B C D A B C D
4
4
4
Complexity
Access: \(O(C)\) (linear scan of cache)
Next-use lookup: \(O(N)\) per access (can be precomputed)
Overall: \(O(NC)\) simulation
Belady’s OPT is the oracle of caching, perfect in hindsight, impossible in foresight. Every practical algorithm strives to approximate its wisdom, one cache miss at a time.
Section 85. Networking
841 Dijkstra’s Routing Algorithm
Dijkstra’s Algorithm finds the shortest path from a source node to all other nodes in a graph with non-negative edge weights. It forms the basis of link-state routing protocols like OSPF (Open Shortest Path First), used in modern computer networks to determine efficient routes for data packets.
What Problem Are We Solving?
In a network of routers, each link has a cost, representing delay, congestion, or distance. The goal is to compute the shortest (lowest-cost) path from one router to all others, ensuring that packets travel efficiently.
Formally, given a graph \(G = (V, E)\) with edge weights \(w(u, v) \ge 0\), find the minimal total cost from source \(s\) to every vertex \(v \in V\).
“We need to know how to go everywhere, and how far it really is.”
How Does It Work (Plain Language)
Dijkstra’s algorithm grows a tree of shortest paths starting from the source. At each step, it expands the closest unvisited node and relaxes its neighbors.
Start with the source node \(s\), distance \(d[s] = 0\).
All other nodes get \(d[v] = \infty\) initially.
Repeatedly:
Pick the node \(u\) with the smallest tentative distance.
Mark it as visited (its distance is now final).
For each neighbor \(v\) of \(u\):
If \(d[v] > d[u] + w(u, v)\), update \(d[v]\).
Continue until all nodes are visited.
Example Walkthrough
Graph:
Edge
Weight
A → B
4
A → C
2
B → C
3
B → D
2
C → D
4
D → E
1
Start: A
Step
Visited
Tentative Distances (A, B, C, D, E)
Chosen Node
1
-
(0, ∞, ∞, ∞, ∞)
A
2
A
(0, 4, 2, ∞, ∞)
C
3
A, C
(0, 4, 2, 6, ∞)
B
4
A, C, B
(0, 4, 2, 6, ∞)
D
5
A, C, B, D
(0, 4, 2, 6, 7)
E
Shortest paths from A:
A→C = 2
A→B = 4
A→D = 6
A→E = 7
Tiny Code (C)
#include <stdio.h>
#include <limits.h>

#define V 5

int minDistance(int dist[], int visited[]) {
    int min = INT_MAX, min_index = -1;
    for (int v = 0; v < V; v++)
        if (!visited[v] && dist[v] <= min)
            min = dist[v], min_index = v;
    return min_index;
}

void dijkstra(int graph[V][V], int src) {
    int dist[V], visited[V] = {0};
    for (int i = 0; i < V; i++) dist[i] = INT_MAX;
    dist[src] = 0;
    for (int count = 0; count < V - 1; count++) {
        int u = minDistance(dist, visited);
        visited[u] = 1;
        for (int v = 0; v < V; v++)
            if (!visited[v] && graph[u][v] && dist[u] != INT_MAX &&
                dist[u] + graph[u][v] < dist[v])
                dist[v] = dist[u] + graph[u][v];
    }
    for (int i = 0; i < V; i++)
        printf("A -> %c = %d\n", 'A' + i, dist[i]);
}

int main() {
    /* Adjacency matrix for the walkthrough graph (0 = no edge):
       A->B=4, A->C=2, B->C=3, B->D=2, C->D=4, D->E=1 */
    int graph[V][V] = {
        {0, 4, 2, 0, 0},
        {0, 0, 3, 2, 0},
        {0, 0, 0, 4, 0},
        {0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0},
    };
    dijkstra(graph, 0);
}
Why It Matters
Foundation of routing protocols like OSPF and IS-IS
Guaranteed optimal paths for non-negative weights
Predictable and efficient, runs in polynomial time
Used beyond networks: maps, logistics, games, planning
Trade-offs:
Slower for large graphs (without heaps)
Requires all edges to have non-negative weights
Needs global topology knowledge (each router must know the map)
A Gentle Proof (Why It Works)
Each node is permanently labeled (finalized) when it is the closest possible unvisited node. Suppose a shorter path existed through another unvisited node, it would contradict that we picked the minimum.
Formally, if \(d[u]\) is finalized, then:
\[
d[u] = \min_{P: s \to u} \text{cost}(P)
\]
The invariant is preserved since every relaxation keeps \(d[v]\) as the minimum known distance at all times.
Hence, when all nodes are processed, \(d[v]\) holds the true shortest distance.
Try It Yourself
Run Dijkstra on the graph:
A→B=1, A→C=4, B→C=2, C→D=1
Find all shortest paths from A.
Modify one edge weight to test stability.
Add a negative edge (e.g., -2) and observe the failure, it no longer works.
Compare with Bellman–Ford (842) on the same graph.
Test Cases
Graph
Source
Result (shortest distances)
A→B=4, A→C=2, B→D=2, C→D=4, D→E=1
A
A:(0), B:(4), C:(2), D:(6), E:(7)
A→B=1, B→C=1, C→D=1
A
A:(0), B:(1), C:(2), D:(3)
A→B=3, A→C=1, C→B=1
A
A:(0), B:(2), C:(1)
Complexity
Time:
\(O(V^2)\) using arrays
\(O((V + E)\log V)\) with priority queue (heap)
Space: \(O(V)\) for distances and visited set
Dijkstra’s algorithm is the navigator of the network world, it maps every possible route, finds the fastest one, and proves that efficiency is just organized exploration.
842 Bellman–Ford Routing Algorithm
Bellman–Ford is the foundation of distance-vector routing protocols such as RIP (Routing Information Protocol). It computes the shortest paths from a single source node to all other nodes, even when edge weights are negative, something Dijkstra’s algorithm cannot handle. It does this through iterative relaxation, propagating distance estimates through the network until convergence.
What Problem Are We Solving?
In a network, routers often only know their immediate neighbors, not the entire topology. They need a distributed way to discover routes and adapt when costs change, even if some links become “negative” (e.g., reduced cost or incentive path).
Bellman–Ford allows each router to find minimal path costs using only local communication with neighbors.
Formally, given a weighted graph \(G = (V, E)\) and a source \(s\), find \(d[v] = \min \text{cost}(s \to v)\) for all \(v \in V\), even when some \(w(u, v) < 0\), as long as there are no negative cycles.
How Does It Work (Plain Language)
Bellman–Ford updates all edges repeatedly, “relaxing” them, until no further improvements are possible.
Initialize all distances: \(d[s] = 0\), and \(d[v] = \infty\) for all other \(v\).
Repeat \(V - 1\) times:
For every edge \((u, v)\) with weight \(w(u, v)\): if \(d[u] + w(u, v) < d[v]\), set \(d[v] = d[u] + w(u, v)\) (relaxation).
After the \(V - 1\) passes, run one final pass over all edges; if any distance can still be improved, the graph contains a negative cycle.
Tiny Code (C)
#include <stdio.h>
#include <limits.h>

#define V 4
#define E 4

struct Edge { int u, v, w; };

void bellmanFord(struct Edge edges[], int src) {
    int dist[V];
    for (int i = 0; i < V; i++) dist[i] = INT_MAX;
    dist[src] = 0;

    for (int i = 0; i < V - 1; i++)
        for (int j = 0; j < E; j++) {
            int u = edges[j].u, v = edges[j].v, w = edges[j].w;
            if (dist[u] != INT_MAX && dist[u] + w < dist[v])
                dist[v] = dist[u] + w;
        }

    for (int j = 0; j < E; j++) {
        int u = edges[j].u, v = edges[j].v, w = edges[j].w;
        if (dist[u] != INT_MAX && dist[u] + w < dist[v]) {
            printf("Negative cycle detected\n");
            return;
        }
    }

    for (int i = 0; i < V; i++)
        printf("A -> %c = %d\n", 'A' + i, dist[i]);
}

int main() {
    struct Edge edges[E] = {{0,1,4}, {0,2,5}, {1,2,-2}, {2,3,3}};
    bellmanFord(edges, 0);
}
Why It Matters
Works with negative weights, unlike Dijkstra
Core of distance-vector routing protocols (RIP)
Easy to distribute, each node only needs neighbor info
Detects routing loops (negative cycles)
Trade-offs:
Slower than Dijkstra (\(O(VE)\))
Can oscillate in unstable networks if updates are asynchronous
Sensitive to delayed or inconsistent updates between nodes
A Gentle Proof (Why It Works)
Each iteration of Bellman–Ford guarantees that all shortest paths with at most \(k\) edges are correctly computed after \(k\) relaxations. Since any shortest path can contain at most \(V-1\) edges (in a graph with \(V\) vertices), after \(V-1\) iterations all distances are final.
After \(V-1\) iterations, \(d[v] = \min_{P: s \to v} \text{cost}(P)\) for all \(v\).
If a shorter path exists afterward, it must include a cycle, and if that cycle decreases cost, it’s negative.
Try It Yourself
Build your own graph and run Bellman–Ford manually.
Introduce a negative edge (e.g., B→A = -5) and see how it updates.
Add a negative cycle (A→B=-2, B→A=-2), watch it detect instability.
Compare number of relaxations with Dijkstra’s algorithm.
Implement the distributed version (used in RIP).
Test Cases
Graph
Negative Edge
Negative Cycle
Result
A→B=4, A→C=5, B→C=-2, C→D=3
Yes
No
OK (shortest paths found)
A→B=3, B→C=4, C→A=-8
Yes
Yes
Negative cycle detected
A→B=2, A→C=1, C→D=3
No
No
Works fine
Complexity
Time: \(O(VE)\)
Space: \(O(V)\)
For dense graphs, slower than Dijkstra; for sparse graphs, still efficient.
Bellman–Ford is the patient messenger of networking, it doesn’t rush, it just keeps updating until everyone knows the truth.
843 Link-State Routing (OSPF)
Link-State Routing is the core principle behind modern routing protocols like OSPF (Open Shortest Path First) and IS-IS. Unlike distance-vector protocols (e.g. RIP), where routers share only distances with neighbors, link-state protocols share the entire topology, every node knows the full network map and computes optimal paths using Dijkstra’s algorithm.
What Problem Are We Solving?
Routers must find the most efficient path to every other router in a large, changing network. Distance-vector routing (like Bellman–Ford) works but is slow to converge and prone to loops.
Link-State Routing solves this by giving every router a complete and consistent view of the network topology, so each can independently compute shortest paths.
“Instead of gossiping about distances, every router holds a map of the world.”
How Does It Work (Plain Language)
Each router performs five key steps:
Discover neighbors: Send “hello” packets to directly connected routers.
Measure link costs: Determine delay or cost to each neighbor.
Build Link-State Packets (LSP): Describe all neighbors and link costs (like a mini adjacency list).
Flood LSPs through the network: Each router forwards others’ LSPs to ensure everyone has the same map.
Run Dijkstra’s Algorithm: Compute shortest paths from itself to all destinations.
When the network changes (a link fails or cost changes), only updated LSPs are flooded, keeping convergence fast and reliable.
Example Walkthrough
Suppose we have a small network:
A --1-- B --2-- C
 \      |
  4     3
   \    |
    D-1-E
Each router advertises its local links:
Router
Advertised Links
A
A–B(1), A–D(4)
B
B–A(1), B–C(2), B–E(3)
C
C–B(2)
D
D–A(4), D–E(1)
E
E–B(3), E–D(1)
After exchanging LSPs, all routers have the same global map.
Router A then runs Dijkstra to compute shortest paths:
Destination
Shortest Path
Total Cost
B
A→B
1
C
A→B→C
3
D
A→D
4
E
A→B→E
4
Tiny Code (Conceptual Simulation in Python)
import heapq

def dijkstra(graph, src):
    dist = {v: float('inf') for v in graph}
    dist[src] = 0
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in graph[u]:
            if dist[v] > d + w:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

graph = {
    'A': [('B', 1), ('D', 4)],
    'B': [('A', 1), ('C', 2), ('E', 3)],
    'C': [('B', 2)],
    'D': [('A', 4), ('E', 1)],
    'E': [('B', 3), ('D', 1)],
}
print(dijkstra(graph, 'A'))
Output: {'A': 0, 'B': 1, 'C': 3, 'D': 4, 'E': 4}
Why It Matters
Fast convergence: all routers update instantly with consistent topology
Scalability: supports hierarchical routing (areas in OSPF)
Flexibility: adapts to cost metrics beyond distance (delay, load, reliability)
Trade-offs:
Requires more memory to store global topology
Flooding overhead for large networks
Complexity in managing link-state databases
A Gentle Proof (Why It Works)
Each router eventually receives an identical set of link-state advertisements, forming the same graph \(G=(V,E)\). Running Dijkstra from its own node \(s\) yields, for every destination \(v\),
\[
d[v] = \min_{P: s \to v} \text{cost}(P).
\]
Because every router computes from the same consistent map, the resulting routing tables are loop-free and globally consistent.
If link costs change, only local LSPs are updated, convergence time is proportional to network diameter, not path length.
Try It Yourself
Create a 5-node network and manually flood link-state messages.
Build a small adjacency list from each router’s perspective.
Run Dijkstra’s algorithm from each router.
Simulate a link failure, e.g., remove A–B, and recompute routes.
Observe how quickly the network re-stabilizes.
Test Cases
Network
Protocol
Convergence
Loop-Free
5 routers, full mesh
OSPF
Fast
Yes
10 routers, single failure
OSPF
Very fast
Yes
5 routers, RIP
Slower
Possibly No
Complexity
Flooding: \(O(E)\) for link propagation
Computation: \(O(V^2)\) (or \(O((V+E)\log V)\) with a heap)
Memory: \(O(V + E)\) for topology storage
Link-State Routing is the cartographer of the Internet, every router keeps a copy of the world map and recalculates its best path as soon as the landscape changes.
844 Distance-Vector Routing (RIP)
Distance-Vector Routing is one of the earliest and simplest distributed routing algorithms, forming the basis of RIP (Routing Information Protocol). Each router shares its current view of the world, its “distance vector”, with neighbors, and they iteratively update their own distances until everyone converges on the shortest paths.
What Problem Are We Solving?
In large, decentralized networks, routers can’t see the entire topology. They only know how far away their neighbors are and how costly it is to reach them.
The question is:
“If I only know my neighbors and their distances, can I still find the shortest paths to all destinations?”
Distance-Vector Routing answers yes, through repeated local updates, routers eventually agree on globally shortest paths.
How Does It Work (Plain Language)
Each router maintains a distance vector \(D[v]\), which holds its current estimate of the shortest path cost to every other node.
Initialize:
\(D[s] = 0\) for itself,
\(D[v] = \infty\) for all others.
Periodically, each router sends its vector to all neighbors.
When a router receives a vector from a neighbor \(N\), it updates: \[
D[v] = \min(D[v], \text{cost}(s, N) + D_N[v])
\] where \(D_N[v]\) is the neighbor’s advertised cost to \(v\).
Repeat until no updates occur (network converges).
This is essentially the distributed form of Bellman–Ford.
Example Walkthrough
Network:
A --1-- B --2-- C
 \             /
  \--5-- D --3/
Initial distances:
Router
A
B
C
D
A
0
1
∞
5
B
1
0
2
∞
C
∞
2
0
3
D
5
∞
3
0
Step 1: A receives B's vector and learns \(D[A,C] = 1 + 2 = 3\) (better than ∞).
Step 2: A receives D's vector and keeps \(D[A,C] = \min(3, 5 + 3) = 3\) (no improvement).
After convergence, A’s table:
Destination
Next Hop
Cost
B
B
1
C
B
3
D
D
5
Tiny Code (C)
#include <stdio.h>
#include <limits.h>

#define V 4
#define INF 999

int dist[V][V] = {
    {0,   1,   INF, 5},
    {1,   0,   2,   INF},
    {INF, 2,   0,   3},
    {5,   INF, 3,   0},
};
int table[V][V];

void distanceVector() {
    for (int i = 0; i < V; i++)
        for (int j = 0; j < V; j++)
            table[i][j] = dist[i][j];

    int updated;
    do {
        updated = 0;
        for (int i = 0; i < V; i++)
            for (int j = 0; j < V; j++)
                for (int k = 0; k < V; k++)
                    if (table[i][j] > dist[i][k] + table[k][j]) {
                        table[i][j] = dist[i][k] + table[k][j];
                        updated = 1;
                    }
    } while (updated);

    for (int i = 0; i < V; i++) {
        printf("Router %c: ", 'A' + i);
        for (int j = 0; j < V; j++)
            printf("%c(%d) ", 'A' + j, table[i][j]);
        printf("\n");
    }
}

int main() { distanceVector(); }
Why It Matters
Simple and decentralized, no global map needed
Core of RIP, one of the earliest Internet routing protocols
Each router only talks to its neighbors
Automatically adapts to link failures
Trade-offs:
Slow convergence, may take many iterations to stabilize
Count-to-infinity problem, loops form during failures
A Gentle Proof (Why It Works)
Initially, only direct links are known. With each exchange, distance estimates propagate one hop further. After at most \(V-1\) rounds, all shortest paths (of up to \(V-1\) hops) are known, matching Bellman–Ford's logic.
Convergence is guaranteed if no link costs change during updates.
Try It Yourself
Simulate three routers: A–B=1, B–C=1. Watch how C’s distance to A stabilizes after two rounds.
Remove a link temporarily, see “count-to-infinity” occur.
Implement “split horizon” or “poison reverse” to fix it.
Compare update speed with link-state routing (OSPF).
Test Cases
Network
Protocol
Converges?
Notes
Small mesh (A–B–C–D)
RIP
Yes
Slower, stable
With link failure
RIP
Eventually
Count-to-infinity issue
Fully connected 5-node
RIP
Fast
Stable after few rounds
Complexity
Time: \(O(V \times E)\) (per iteration)
Message Overhead: periodic exchanges per neighbor
Memory: \(O(V)\) per router
Distance-Vector Routing is the gossip protocol of networks, each router shares what it knows, listens to others, and slowly, the whole network learns the best way to reach everyone else.
845 Path Vector Routing (BGP)
Path Vector Routing is the foundation of the Internet’s interdomain routing system, the Border Gateway Protocol (BGP). It extends distance-vector routing by including not just the cost, but the entire path to each destination, preventing routing loops and enabling policy-based routing between autonomous systems (ASes).
What Problem Are We Solving?
In global Internet routing, we don’t just need the shortest path, we need controllable, loop-free, and policy-respecting paths between independently managed networks (ASes).
Distance-vector algorithms (like RIP) can’t prevent loops or respect policies because they only share costs. We need a richer model, one that includes the path itself.
“Don’t just tell me how far, tell me which road you’ll take.”
How Does It Work (Plain Language)
Each node (Autonomous System) advertises:
Destination prefixes it can reach
The entire path of AS numbers used to reach them
When a router receives an advertisement:
It checks the path list to ensure no loops (its own AS number shouldn’t appear).
It may apply routing policies (e.g., prefer customer routes over peer routes).
It updates its routing table if the new path is better by local preference.
Then it re-advertises the route to its neighbors, adding its own AS number to the front of the path.
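Tiny Code (Python Sketch)
A toy sketch of how a router might process one path-vector advertisement. The function name and the preference rule (local preference first, then shorter AS path) are illustrative assumptions; real BGP carries many more attributes and policy knobs.

def receive_advert(my_as, routing_table, dest, path, local_pref=100):
    """Reject looped paths; keep the better of the old and new route for dest."""
    if my_as in path:                       # loop detection: our AS is already on the path
        return False
    new_route = (local_pref, -len(path), path)
    old = routing_table.get(dest)
    if old is None or new_route > old:
        routing_table[dest] = new_route
        return True                         # changed: re-advertise [my_as] + path to neighbors
    return False

table = {}
receive_advert("AS65001", table, "10.0.0.0/24", ["AS65002", "AS65003"])
print(table)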
Try It Yourself
Implement loop detection by checking for AS repetition.
Compare convergence time with OSPF or RIP.
Test Cases
Scenario
Result
Notes
Normal propagation
Loop-free
Stable
Policy conflict
Oscillation
Converges slowly
Route hijack (fake AS)
Wrong path
Requires security measures
AS path filtering
Loop-free
Correct convergence
Complexity
Per router processing: \(O(N \log N)\) for \(N\) neighbors
Message size: proportional to AS path length
Convergence time: variable (depends on policy conflicts)
Path Vector Routing is the diplomat of the Internet, routers don’t just exchange distances, they exchange trust, policies, and stories of how to get there.
846 Flooding Algorithm
Flooding is the most primitive and reliable way to propagate information across a network, send a message to all neighbors, who then forward it to all of theirs, and so on, until everyone has received it. It’s used as a building block in many systems: link-state routing (OSPF), peer discovery, and epidemic protocols.
What Problem Are We Solving?
When a node has new information, say, a new link-state update or a broadcast message, it must ensure every node in the network eventually learns it. Without global knowledge of the topology, the simplest approach is just to flood the message.
“Tell everyone you know, and ask them to tell everyone they know.”
Flooding guarantees eventual delivery as long as the network is connected.
How Does It Work (Plain Language)
Each node maintains a record of messages it has already seen. When a new message arrives:
Check ID: If it’s already seen → discard it.
Otherwise: Forward it to all neighbors except the one it came from.
Mark it as delivered.
This process continues until the message has spread everywhere. No central coordination is needed.
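Tiny Code (Python Simulation)
A minimal sketch of flooding a single message (ID 101) from node A. The 6-node topology is a hypothetical one chosen so the trace is reproducible, since no figure accompanies this example; duplicates are suppressed by remembering which nodes have already seen the message.

from collections import deque

# Hypothetical topology: A-B, A-D, B-C, C-E, C-F, D-E, E-F
graph = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E", "F"],
    "D": ["A", "E"],
    "E": ["C", "D", "F"],
    "F": ["C", "E"],
}

def flood(source, msg_id):
    """Forward to neighbors that have not seen the message yet; skip the sender."""
    seen = {source}
    queue = deque([(source, None)])          # (node, node it came from)
    while queue:
        node, came_from = queue.popleft()
        print(f"{node} received {msg_id}")
        for nbr in graph[node]:
            if nbr != came_from and nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, node))

flood("A", 101)

Running it prints the delivery trace below: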
A received 101
B received 101
D received 101
C received 101
E received 101
F received 101
Why It Matters
Guaranteed delivery (if network is connected)
Simple and decentralized
Used in OSPF link-state advertisements, peer discovery, gossip protocols
No routing tables required
Trade-offs:
Redundant messages: exponential in dense graphs
Loops: prevented by tracking message IDs
High bandwidth usage: not scalable to large networks
Optimizations:
Sequence numbers (unique IDs)
Hop limits (TTL)
Selective forwarding (spanning tree–based)
A Gentle Proof (Why It Works)
Let the network be a connected graph \(G = (V, E)\). For a source node \(s\), every node \(v\) is reachable via some path \(P(s, v)\). Flooding guarantees that at least one message traverses each edge in \(P(s, v)\) before duplicates are suppressed.
Hence, all nodes reachable from \(s\) will receive the message exactly once.
Formally:
\[
\forall v \in V, \exists P(s, v) \text{ such that } \text{msg propagates along } P
\]
and since nodes drop duplicates, termination is guaranteed.
Try It Yourself
Simulate flooding on a 5-node mesh.
Add message IDs, count duplicates before suppression.
Introduce a TTL (e.g. 3 hops), see how it limits spread.
Build a tree overlay, observe reduced redundancy.
Compare with controlled flooding (used in link-state routing).
Test Cases
Network
Method
Duplicate Handling
Result
4-node chain
Naive Flooding
None
Exponential messages
6-node mesh
With ID tracking
Yes
Exact delivery
10-node
With TTL=3
Yes
Partial reach
Complexity
Time: \(O(E)\) (each link used at most once)
Messages: up to \(O(E)\) with ID tracking, \(O(2^V)\) without
Space: \(O(M)\) for tracking seen message IDs
Flooding is the shout across the network, inefficient, noisy, but certain. It’s the seed from which smarter routing grows.
847 Spanning Tree Protocol (STP)
Spanning Tree Protocol (STP) is a distributed algorithm that prevents loops in Ethernet networks by dynamically building a loop-free subset of links called a spanning tree. It’s used in switches and bridges to ensure frames don’t circulate forever when redundant links exist.
What Problem Are We Solving?
Ethernet networks often have redundant connections for fault tolerance. However, redundant paths can create broadcast storms, frames endlessly looping through switches.
STP solves this by disabling certain links so that the resulting network forms a tree (no cycles) but still stays connected.
“Keep every switch reachable, but only one path to each.”
How Does It Work (Plain Language)
STP elects one switch as the Root Bridge, then calculates the shortest path to it from every other switch. Links not on any shortest path are blocked to remove cycles.
Steps:
Root Bridge Election:
Each switch has a unique Bridge ID (priority + MAC address).
The lowest Bridge ID wins as the root.
Path Cost Calculation:
Each switch calculates its distance (path cost) to the root.
Designated Ports and Blocking:
On each link, only one side (the one closest to the root) forwards frames.
All others are placed in blocking state.
If topology changes (e.g., a link failure), STP recalculates the tree and reactivates backup links.
Example Walkthrough
Network of four switches:
   S1
  /  \
 S2--S3
  \  /
   S4
Each link cost = 1.
Step 1: S1 has the lowest Bridge ID → becomes Root Bridge. Step 2: S2, S3, and S4 compute shortest paths to S1. Step 3: Links not on the shortest path (e.g., S2–S3, S3–S4) are blocked.
Resulting Tree:
S1
├── S2
│    └── S4
└── S3
Network remains connected but has no cycles.
Tiny Code (Python Simulation)
graph = {
    'S1': ['S2', 'S3'],
    'S2': ['S1', 'S3', 'S4'],
    'S3': ['S1', 'S2', 'S4'],
    'S4': ['S2', 'S3'],
}

root = 'S1'
parent = {root: None}
visited = set([root])

def build_spanning_tree(node):
    for neighbor in graph[node]:
        if neighbor not in visited:
            visited.add(neighbor)
            parent[neighbor] = node
            build_spanning_tree(neighbor)

build_spanning_tree(root)

print("Spanning tree connections:")
for n in parent:
    if parent[n]:
        print(f"{parent[n]} -> {n}")
Why It Matters
Prevents broadcast storms by removing loops from the active topology
Allows redundancy, links are only temporarily disabled
Adapts automatically to topology changes
Forms the backbone of Layer 2 network stability
Trade-offs:
Convergence delay after link failure
All traffic follows a single tree → some links underutilized
Improved variants like RSTP (Rapid STP) fix convergence speed.
A Gentle Proof (Why It Works)
STP ensures three key invariants:
Single Root: The switch with the smallest Bridge ID is elected globally.
Acyclic Topology: Every switch selects exactly one shortest path to the root.
Connectivity: No switch is isolated from the root.
The protocol’s message exchange (Bridge Protocol Data Units, or BPDUs) guarantees all switches eventually agree on the same root and consistent forwarding roles.
Formally, if \(G=(V,E)\) is the network graph, STP selects a subset of edges \(T \subseteq E\) such that:
\[
T \text{ is a spanning tree of } G \quad \text{(connected, acyclic, minimal cost)}.
\]
Try It Yourself
Draw a small network of 5 switches with loops.
Assign unique Bridge IDs.
Determine the Root Bridge (lowest ID).
Compute path costs to the root from each switch.
Identify which ports forward and which block.
Disconnect the root temporarily, observe how a new tree forms.
Test Cases
Network
Root
Blocked Links
Result
4-switch mesh
S1
S2–S3, S3–S4
Loop-free
Ring topology
S1
One link blocked
Single spanning tree
Single link failure
S1
Recomputed
Restored connectivity
Complexity
Message complexity: \(O(E)\) (BPDUs flood through edges)
Spanning Tree Protocol is the gardener of Ethernet, it trims redundant loops, keeps the network tidy, and regrows paths when the topology changes.
848 Congestion Control (AIMD)
Additive Increase, Multiplicative Decrease (AIMD) is the classic algorithm used in TCP congestion control. It dynamically adjusts the sender’s transmission rate to balance efficiency (maximize throughput) and fairness (share bandwidth) without overwhelming the network.
What Problem Are We Solving?
In computer networks, multiple senders share limited bandwidth. If everyone sends as fast as possible, routers overflow and packets drop, causing congestion collapse. We need a distributed, self-regulating mechanism that adapts each sender’s rate based on network feedback.
“Send more when it’s quiet, slow down when it’s crowded.”
AIMD provides exactly that, graceful adaptation based on implicit congestion signals.
How Does It Work (Plain Language)
Each sender maintains a congestion window (\(cwnd\)), which limits the number of unacknowledged packets in flight. The rules:
Additive Increase:
Each round-trip time (RTT) without loss → increase \(cwnd\) by a constant (usually 1 packet).
Encourages probing for available bandwidth.
Multiplicative Decrease:
When congestion (packet loss) is detected → reduce \(cwnd\) by a fixed ratio (usually half).
Quickly relieves pressure on the network.
This creates the famous “sawtooth” pattern of TCP throughput, gradual rise, sudden fall, repeat.
Example Walkthrough
Start with initial \(cwnd = 1\) packet.
Round
Event
\(cwnd\) (packets)
Action
1
Start
1
Initial slow start
2
ACK received
2
Increase linearly
3
ACK received
3
Additive increase
4
Packet loss
1.5
Multiplicative decrease
5
ACK received
2.5
Additive again
6
Packet loss
1.25
Decrease again
This process continues indefinitely, gently oscillating around the optimal sending rate.
Tiny Code (Python Simulation)
import matplotlib.pyplot as plt

rounds = 30
cwnd = [1]
for r in range(1, rounds):
    if r % 8 == 0:          # simulate packet loss every 8 rounds
        cwnd.append(cwnd[-1] / 2)
    else:
        cwnd.append(cwnd[-1] + 1)

plt.plot(range(rounds), cwnd)
plt.xlabel("Round-trip (RTT)")
plt.ylabel("Congestion Window (cwnd)")
plt.title("AIMD Congestion Control")
plt.show()
The plot shows the sawtooth growth pattern, linear climbs, multiplicative drops.
Why It Matters
Prevents congestion collapse (saves the Internet)
Fairness: multiple flows converge to equal bandwidth share
Stability: simple feedback control loop
Foundation for TCP Reno, NewReno, Cubic, BBR
Trade-offs:
Reacts slowly in high-latency networks
Drops can cause underutilization
Not ideal for modern fast, long-haul links (Cubic, BBR improve this)
A Gentle Proof (Why It Works)
Let each sender’s window size evolve as:
\[
cwnd(t+1) =
\begin{cases}
cwnd(t) + \alpha, & \text{if no loss},\\
\beta \cdot cwnd(t), & \text{if loss occurs.}
\end{cases}
\]
where \(\alpha > 0\) (additive step) and \(0 < \beta < 1\) (multiplicative reduction).
At equilibrium, congestion signals occur often enough that the sending rate settles around the well-known square-root law:
\[
\text{throughput} \propto \frac{1}{\text{RTT}\sqrt{p}},
\]
where \(p\) is the packet loss probability. Thus, as the network becomes more congested, each flow's rate naturally decreases, keeping the system stable and fair.
Try It Yourself
Set \(\alpha = 1\), \(\beta = 0.5\), plot cwnd growth.
Double RTTs, observe slower convergence.
Simulate two senders, see them converge to equal cwnd sizes.
Introduce random losses, see steady oscillation.
Compare with BBR (rate-based) or Cubic (nonlinear growth).
Test Cases
Condition
Behavior
Outcome
No losses
Linear growth
Bandwidth probe
Periodic losses
Sawtooth oscillation
Stable throughput
Random losses
Variance in cwnd
Controlled adaptation
Two flows
Equal share
Fair convergence
Complexity
Control time: \(O(1)\) per RTT (constant-time update)
Computation: minimal, arithmetic per ACK/loss
Space: \(O(1)\) (store current cwnd and threshold)
AIMD is the heartbeat of the Internet, a rhythm of sending, sensing, and slowing, keeping millions of connections in harmony through nothing but self-restraint and math.
849 Random Early Detection (RED)
Random Early Detection (RED) is a proactive congestion-avoidance algorithm used in routers and switches. Instead of waiting for queues to overflow (and then dropping packets suddenly), RED begins to randomly drop or mark packets early as the queue builds up, signaling senders to slow down before the network collapses.
What Problem Are We Solving?
Traditional routers use tail drop, packets are accepted until the buffer is full, then all new arrivals are dropped. This leads to:
Global synchronization (many TCP flows slow down simultaneously)
Burst losses
Unfair bandwidth sharing
RED smooths this by predicting congestion early and randomly notifying some flows to back off, maintaining high throughput with low delay.
“Don’t wait for the dam to break, release a few drops early.”
How Does It Work (Plain Language)
The router maintains an average queue size using an exponential moving average:
\[
\text{avg} \leftarrow (1 - w_q)\cdot \text{avg} + w_q \cdot q,
\]
where \(q\) is the instantaneous queue length and \(w_q\) is a small smoothing weight. While \(\text{avg}\) sits between two thresholds \(min_{th}\) and \(max_{th}\), each arriving packet is dropped (or marked) with probability
\[
p = p_{max}\,\frac{\text{avg} - min_{th}}{max_{th} - min_{th}}.
\]
With typical settings this works out to a few percent, say a 5% chance of being dropped per packet. This random early drop gives TCP flows time to reduce sending rates before the queue overflows.
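Tiny Code (Python Sketch)
A minimal sketch of the two formulas above; the thresholds, smoothing weight, and \(p_{max}\) values are illustrative parameters, not recommended settings.

import random

def red_should_drop(avg, min_th=5, max_th=15, p_max=0.1):
    """RED drop decision based on the average queue size."""
    if avg < min_th:
        return False
    if avg >= max_th:
        return True
    p = p_max * (avg - min_th) / (max_th - min_th)
    return random.random() < p

avg, w_q = 0.0, 0.2
for q in [2, 4, 6, 8, 10, 12, 14, 16]:        # instantaneous queue samples
    avg = (1 - w_q) * avg + w_q * q           # exponential moving average
    print(f"queue={q:2d} avg={avg:5.2f} drop={red_should_drop(avg)}")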
RED is the traffic whisperer of the Internet, it senses the crowd forming, taps a few on the shoulder, and keeps the flow moving before chaos begins.
850 Explicit Congestion Notification (ECN)
Explicit Congestion Notification (ECN) is a modern congestion-control enhancement that allows routers to signal congestion without dropping packets. Instead of relying on loss as a feedback signal (like traditional TCP), ECN marks packets in-flight, letting endpoints slow down before buffers overflow.
What Problem Are We Solving?
Traditional TCP interprets packet loss as a sign of congestion. But packet loss is a crude signal, it wastes bandwidth, increases latency, and can destabilize queues.
We want routers to communicate congestion explicitly, without destroying data, so that endpoints can adjust smoothly.
“Instead of dropping the package, just stamp it: ‘Hey, slow down a bit.’”
That’s what ECN does, it preserves packets but delivers the same message.
How Does It Work (Plain Language)
ECN operates by marking packets using two bits in the IP header and feedback bits in the TCP header.
Sender: marks packets as ECN-capable (ECT(0) or ECT(1)).
Router: when detecting congestion (e.g., queue exceeds threshold):
Instead of dropping packets, it sets the CE (Congestion Experienced) bit.
Receiver: sees CE mark, sets the ECE (Echo Congestion Experienced) flag in TCP ACK.
Sender: on receiving ECE, reduces its congestion window (like AIMD’s multiplicative decrease), and sends a CWR (Congestion Window Reduced) flag to acknowledge the signal.
No loss occurs, but congestion control behavior still happens.
Example Walkthrough
Normal flow:
Sender → Router → Receiver (packets marked ECT).
Router queue grows → starts marking packets CE.
Feedback:
Receiver → Sender (ACKs with ECE bit set).
Sender halves its congestion window, sends ACK with CWR.
Stabilization:
Queue drains → router stops marking.
Flow resumes normal additive increase.
Timeline sketch:
Sender: ECT(0) → ECT(0) → CE → ECT(0)
Router: Marks CE when avg queue > threshold
Receiver: ACK (ECE) → ACK (ECE) → ACK (no ECE)
Sender: cwnd ↓ (on ECE)
Tiny Code (Python Simulation of Signaling)
cwnd = 10
ecn_enabled = True

for rtt in range(1, 10):
    queue = cwnd - 4          # toy model: the queue tracks the window, so it drains after a cut
    if ecn_enabled and queue > 9:
        print(f"RTT {rtt}: CE marked -> cwnd reduced from {cwnd} to {cwnd // 2}")
        cwnd //= 2
    else:
        cwnd += 1
        print(f"RTT {rtt}: normal -> cwnd={cwnd}")
Output:
RTT 1: normal -> cwnd=11
RTT 2: normal -> cwnd=12
RTT 3: normal -> cwnd=13
RTT 4: normal -> cwnd=14
RTT 5: CE marked -> cwnd reduced from 14 to 7
RTT 6: normal -> cwnd=8
...
Why It Matters
Avoids packet loss, better for delay-sensitive traffic
Lower latency, queues stay short
Smooth feedback, reduces oscillation in throughput
Energy and resource efficient, no retransmissions needed
Works with RED (RFC 3168), routers mark instead of drop
Trade-offs:
Requires end-to-end ECN support (sender, receiver, and routers)
Some middleboxes still strip ECN bits
Needs careful configuration to avoid false positives
A Gentle Proof (Why It Works)
Routers implementing RED or AQM (Active Queue Management) set a marking probability \(p\) instead of a drop probability. Marking probability follows:
\[
p = p_{max} \frac{avg - min_{th}}{max_{th} - min_{th}}
\]
When a packet is marked with CE, the TCP sender reduces its congestion window just as it would on a loss:
\[
cwnd \leftarrow \beta \cdot cwnd \quad (\text{typically } \beta = 0.5).
\]
This maintains the same AIMD equilibrium as loss-based control but avoids loss events. Since marking is early and gentle, ECN stabilizes queue lengths and minimizes delay.
Try It Yourself
Enable ECN in Linux:
sysctl -w net.ipv4.tcp_ecn=1
Run an iperf3 test between ECN-capable hosts.
Compare throughput and latency with ECN off.
Visualize packet captures, look for CE and ECE bits in headers.
Combine ECN with RED for a complete congestion control loop.
Test Cases
Scenario
Behavior
Result
Router queue exceeds min_th
Marks CE
Smooth slowdown
ECN disabled
Drops packets
Loss-based control
ECN enabled end-to-end
Marks packets
Low latency, stable throughput
One router not ECN-capable
Mixed behavior
Partial benefit
Complexity
Per packet: \(O(1)\) (marking decision)
Overhead: negligible, bit toggling only
Deployment: incremental, backward compatible with non-ECN flows
ECN is the gentle hand of congestion control, it doesn’t punish with loss, it warns with a mark, keeping the Internet fast, fair, and calm even under pressure.
Section 86. Distributed Consensus
851 Basic Paxos
Basic Paxos is the cornerstone algorithm for reaching agreement among distributed nodes, even if some fail. It allows a group of machines to agree on a single value, safely and consistently, without requiring perfect reliability or synchrony.
What Problem Are We Solving?
In distributed systems, multiple nodes might propose values (e.g., “commit transaction X” or “leader is node 3”). They can crash, restart, or have delayed messages. How do we ensure everyone eventually agrees on the same value, even with partial failures?
Paxos answers this fundamental question:
“How can a system of unreliable participants reach a consistent decision?”
It ensures safety (no two nodes decide different values) and liveness (eventually a decision is reached, assuming stability).
How Does It Work (Plain Language)
Paxos separates roles into three participants:
Proposers: suggest values to agree on
Acceptors: vote on proposals (a majority is enough)
Learners: learn the final chosen value
The algorithm proceeds in two phases.
Phase 1: Prepare
A proposer picks a unique proposal number n and sends a prepare(n) message to all acceptors.
Each acceptor:
If n is greater than any proposal number it has already seen, it promises not to accept proposals below n.
It replies with any previously accepted proposal (n', value') (if any).
Phase 2: Accept
The proposer, after receiving responses from a majority:
Chooses the value with the highest-numbered prior acceptance (or its own value if none).
Sends accept(n, value) to all acceptors.
Acceptors:
Accept the proposal (n, value) if they have not already promised a higher number.
Once a majority accepts the same (n, value), that value is chosen. Learners are then notified of the decision.
Example Timeline
Step
Node
Message
Note
1
P1
prepare(1) → all
Proposer starts round
2
A1, A2
Promise n ≥ 1
No previous proposal
3
P1
accept(1, X) → all
Proposes value X
4
A1, A2, A3
Accept (1, X)
Majority agrees
5
P1
Announce decision X
Consensus reached
If another proposer (say P2) tries later with prepare(2), Paxos ensures it cannot override the chosen value.
Tiny Code (Python Pseudocode)
class Acceptor:
    def __init__(self):
        self.promised_n = 0
        self.accepted = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised_n:
            self.accepted = (n, value)
            return "accepted"
        return "rejected"
This captures the core idea: acceptors promise to honor only higher-numbered proposals, preserving safety.
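A short usage sketch of the class above with three acceptors (the proposal numbers and value are arbitrary):

# One proposer prepares, gets a majority of promises, then its value is accepted.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
assert sum(1 for tag, _ in promises if tag == "promise") >= 2
print([a.accept(1, "X") for a in acceptors])   # majority 'accepted' -> X is chosen

# A later proposer with a lower number is turned away, preserving safety.
print(acceptors[0].prepare(0))                 # ('reject', None)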
Why It Matters
Fault tolerance: Works with up to ⌊(N−1)/2⌋ failures.
Safety first: Even with crashes, no inconsistent state arises.
Foundation: Forms the basis of Raft, Multi-Paxos, Zab, EPaxos, and more.
Used in: Google Chubby, Zookeeper, etcd, CockroachDB, Spanner.
Trade-offs:
Complex message flow and numbering.
High latency (two rounds per decision).
Not designed for high churn or partitions, Raft simplifies this.
A Gentle Proof (Why It Works)
Let:
\(Q_1\), \(Q_2\) be any two majorities of acceptors.
Then \(Q_1 \cap Q_2 \ne \varnothing\) (any two majorities intersect).
Thus, if a value \(v\) is accepted by \(Q_1\), any later proposal must contact at least one acceptor in \(Q_1\) during prepare phase, which ensures it learns about \(v\) and continues proposing it.
Hence, once a value is chosen, no other can ever be chosen.
Safety invariant:
If a value is chosen, every higher proposal preserves it.
Try It Yourself
Simulate three acceptors and two proposers.
Let both propose at once, see how numbering resolves conflict.
Drop some messages, ensure consistency still holds.
Extend to five nodes and test with random delays.
Observe how learners only need one consistent quorum response.
Test Cases
Scenario
Expected Behavior
One proposer
Value chosen quickly
Two concurrent proposers
Higher-numbered wins
Node crash and restart
Safety preserved
Network delay
Eventual consistency, no conflict
Complexity
Message rounds: 2 (prepare + accept)
Message complexity: \(O(N^2)\) for N participants
Fault tolerance: up to ⌊(N−1)/2⌋ node failures
Storage: \(O(1)\) state per acceptor (promised number + accepted value)
Paxos is the mathematical heart of distributed systems, a quiet, stubborn agreement that holds even when the world around it falls apart.
852 Multi-Paxos
Multi-Paxos is an optimized version of Basic Paxos that allows a distributed system to reach many consecutive agreements (like a log of decisions) efficiently. Where Basic Paxos handles a single value, Multi-Paxos extends this to a sequence of values, ideal for replicated logs in systems like databases and consensus clusters.
What Problem Are We Solving?
In practice, systems rarely need to agree on just one value. They need to agree repeatedly, for example:
Each log entry in a replicated state machine
Each transaction commit
Each configuration update
Running Basic Paxos for every single decision would require two message rounds each time, which is inefficient.
Multi-Paxos reduces this overhead by reusing a stable leader to coordinate many decisions.
“Why elect a new leader every time when one can stay in charge for a while?”
How Does It Work (Plain Language)
Multi-Paxos builds directly on Basic Paxos but adds leadership and log indexing.
Key idea: Leader election
One proposer becomes a leader (using a higher proposal number).
Once chosen, the leader skips the prepare phase for subsequent proposals.
Process
Leader election
A proposer performs the prepare phase once and becomes leader.
All acceptors promise not to accept lower proposal numbers.
Steady state
For each new log entry (index \(i\)), the leader sends accept(i, value) directly.
Acceptors respond with “accepted”.
Learning and replication
When a majority accept, the leader notifies learners.
The value becomes committed at position \(i\).
If the leader fails, another proposer starts a new prepare phase with a higher proposal number, reclaiming leadership.
Example Timeline
Step
Action
Description
1
P1 starts prepare(1)
Becomes leader
2
P1 proposes accept(1, “A”)
Value for log index 1
3
Majority accept
“A” chosen
4
P1 proposes accept(2, “B”)
Next log entry, no prepare needed
5
Leader fails
P2 runs prepare(2), takes over
So instead of two rounds per value, Multi-Paxos uses:
Two rounds only once (for leader election)
One round thereafter for each new decision
Tiny Code (Simplified Python Simulation)
class MultiPaxosLeader:
    def __init__(self):
        self.proposal_n = 0
        self.log = []

    def elect_leader(self, acceptors):
        self.proposal_n += 1
        promises = [a.prepare(self.proposal_n) for a in acceptors]
        if sum(1 for p, _ in promises if p == "promise") > len(acceptors) // 2:
            print("Leader elected")
            return True
        return False

    def propose(self, index, value, acceptors):
        accepted = [a.accept(self.proposal_n, value) for a in acceptors]
        if accepted.count("accepted") > len(acceptors) // 2:
            self.log.append(value)
            print(f"Committed log[{index}] = {value}")
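A short usage sketch, assuming the Acceptor class from Section 851 is in scope: one election, then single-round commits for each log index.

acceptors = [Acceptor() for _ in range(5)]
leader = MultiPaxosLeader()
if leader.elect_leader(acceptors):
    for i, cmd in enumerate(["set x=1", "set y=2", "set z=3"]):
        leader.propose(i, cmd, acceptors)
print(leader.log)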
Why It Matters
High throughput: amortizes prepare cost over many decisions
Foundation for replicated logs: underpins Raft, Zab, Chubby, etcd
Fault tolerance: still works with up to ⌊(N−1)/2⌋ node failures
Consistency: all replicas apply operations in the same order
Trade-offs:
Needs stable leader to avoid churn
Slight delay on leader failover
Complex implementation in practice (timeouts, heartbeats, elections)
A Gentle Proof (Why It Works)
Let each log index \(i\) represent a separate Paxos instance.
All instances share the same acceptors.
Once a leader is established with proposal number \(n\), every future accept(i, value) message from that leader uses the same \(n\).
The safety invariant of Paxos still holds per index:
Once a value is chosen for position \(i\), no other value can be chosen.
Because the leader is fixed, overlapping prepares are eliminated, ensuring a consistent prefix ordering of the log.
Formally, if majority sets \(Q_1, Q_2\) intersect, then: \[
\forall i,\; \text{chosen}(i, v) \Rightarrow \text{future proposals at } i \text{ must propose } v
\]
Try It Yourself
Elect one node as leader; let it propose 5 log entries in a row.
Kill the leader mid-way; watch another proposer take over.
Observe that committed log entries remain intact.
Extend simulation to show log replication across 5 acceptors.
Verify no inconsistency even after restarts.
Test Cases
Scenario
Behavior
Result
Stable leader
Fast single-round commits
Efficient agreement
Leader crash
New prepare phase
Safe recovery
Two leaders
Higher proposal wins
Safety preserved
Delayed messages
Consistent prefix log
No divergence
Complexity
Message rounds: 2 for election, then 1 per value
Message complexity: \(O(N)\) per decision
Fault tolerance: up to ⌊(N−1)/2⌋ failures
Log structure: \(O(K)\) for K decisions
Multi-Paxos turns the single agreement of Paxos into a stream of ordered, fault-tolerant decisions, the living heartbeat of consensus-based systems.
853 Raft
Raft is a consensus algorithm designed to be easier to understand and implement than Paxos, while providing the same safety and fault tolerance. It keeps a distributed system of servers in agreement on a replicated log, ensuring that all nodes execute the same sequence of commands, even when some crash or reconnect.
What Problem Are We Solving?
Consensus is the foundation of reliable distributed systems: databases, cluster managers, and replicated state machines all need it.
Paxos guarantees safety but is notoriously hard to implement correctly. Raft was introduced to make consensus understandable, modular, and practical, by decomposing it into three clear subproblems:
Leader election – choose one server to coordinate.
Log replication – leader appends commands and replicates them.
Safety – ensure logs remain consistent even after failures.
“Raft isn’t simpler because it does less, it’s simpler because it explains more.”
How Does It Work (Plain Language)
Raft maintains the same fundamental safety property as Paxos:
At most one value is chosen for each log index.
But it enforces this via leadership and term-based coordination.
Roles
Leader: handles all client requests and replication.
Follower: passive node, responds to leader messages.
Candidate: a follower that times out and runs for election.
Terms
Time is divided into terms. Each term starts with an election and can have at most one leader.
Leader Election
A follower times out (no heartbeat) and becomes a candidate.
It increments its term and sends RequestVote(term, id, lastLogIndex, lastLogTerm) to all servers.
Servers vote for the candidate if:
They have not already voted for another candidate in this term.
The candidate's log is at least as up-to-date as theirs.
If the candidate gets a majority, it becomes the leader.
The leader then starts sending AppendEntries (heartbeats).
Log Replication
Clients send commands to the leader.
The leader appends the command to its log and sends AppendEntries(term, index, entry) to followers.
When a majority acknowledges, the leader commits the entry and applies it to the state machine.
Safety Rule
Before granting a vote, a node ensures that the candidate’s log is at least as complete as its own. This ensures that all committed entries are preserved across leadership changes.
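To make the vote-granting rule concrete, here is a minimal Python sketch of a RequestVote handler; the `state` dictionary (holding `current_term`, `voted_for`, and `log`) is an assumed stand-in for a server's persistent state, not Raft's actual message format.

```python
def log_up_to_date(candidate_last_term, candidate_last_index,
                   my_last_term, my_last_index):
    # Raft's "at least as up-to-date" rule:
    # compare last log terms first, then log lengths.
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index

def handle_request_vote(state, term, candidate_id, last_log_index, last_log_term):
    # Grant the vote only if the term is current, we haven't voted for
    # someone else this term, and the candidate's log is up to date.
    if term < state["current_term"]:
        return False
    if state["voted_for"] not in (None, candidate_id):
        return False
    my_last_index = len(state["log"]) - 1
    my_last_term = state["log"][my_last_index]["term"] if state["log"] else 0
    if log_up_to_date(last_log_term, last_log_index, my_last_term, my_last_index):
        state["voted_for"] = candidate_id
        return True
    return False
```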
Raft turns consensus into a rhythmic heartbeat: leaders rise and fall, logs march forward in unison, and even when chaos strikes, the cluster remembers, exactly, what was decided.
854 Viewstamped Replication (VR)
Viewstamped Replication (VR) is a consensus and replication algorithm developed before Raft and Multi-Paxos. It was designed to make fault-tolerant state machine replication easier to understand and implement. VR organizes time into views (epochs) led by a primary replica, ensuring that a group of servers maintains a consistent, ordered log of client operations even when some nodes fail.
What Problem Are We Solving?
In distributed systems, we need to ensure that:
All replicas execute the same sequence of operations.
The system continues to make progress even if some replicas crash.
No two primaries (leaders) can make conflicting decisions.
Paxos solved this but was hard to explain; VR re-frames the same problem with primary-backup terminology that developers already understand.
“Think of it as a primary that keeps a careful diary, and a quorum that makes sure it never forgets what it wrote.”
How Does It Work (Plain Language)
VR uses three phases that repeat through views:
1. Normal Operation
One replica acts as primary; others are backups.
The primary receives client requests, assigns log sequence numbers, and sends Prepare messages to backups.
Backups reply with PrepareOK.
Once the primary collects acknowledgments from a majority, it commits the operation and responds to the client.
2. View Change (Leader Election)
If backups don’t hear from the primary within a timeout, they initiate a view change.
Each replica sends its log and view number to others.
The highest view number’s candidate becomes the new primary.
The new primary merges the most up-to-date logs and starts a new view.
3. Recovery
A crashed or slow replica can rejoin by requesting the current log from others.
It synchronizes up to the most recent committed operation.
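To ground the normal-operation phase, here is a minimal Python sketch of the Prepare/PrepareOK exchange; the class and method names are illustrative rather than taken from the VR paper, and view changes and recovery are omitted.

```python
class Primary:
    def __init__(self, replicas, view=0):
        self.replicas = replicas      # backup replicas
        self.view = view              # current view number
        self.op_number = 0            # log sequence number
        self.log = []

    def client_request(self, op):
        # Normal operation: assign a sequence number and send Prepare to backups.
        self.op_number += 1
        self.log.append(op)
        acks = sum(1 for r in self.replicas
                   if r.prepare(self.view, self.op_number, op) == "PrepareOK")
        # Commit once a majority of the whole group (primary + backups) agrees.
        if acks + 1 > (len(self.replicas) + 1) // 2:
            return f"committed op {self.op_number}"
        return "not committed"

class Backup:
    def __init__(self):
        self.log = []

    def prepare(self, view, op_number, op):
        self.log.append((view, op_number, op))
        return "PrepareOK"
```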
Why It Matters
Conceptually clear: Primary-backup model with explicit views.
Fault tolerant: Works with up to ⌊(N−1)/2⌋ failures.
Consistent and safe: All committed operations appear in the same order on all replicas.
Foundation: Inspired Raft and modern replicated log protocols (Zab, PBFT, etc.).
Recovery-friendly: Supports crash recovery via log replay.
Trade-offs:
Slightly more communication than Multi-Paxos.
Requires explicit view management.
Doesn’t separate safety and liveness as cleanly as Raft.
A Gentle Proof (Why It Works)
Each view has a single primary that coordinates commits.
Let \(V_i\) be view \(i\), and \(Q_1\), \(Q_2\) be any two majorities of replicas.
When the primary in \(V_i\) commits an operation, it has acknowledgments from \(Q_1\).
During the next view change, the new primary gathers logs from a majority \(Q_2\).
Because \(Q_1 \cap Q_2 \ne \varnothing\), the committed entry appears in at least one replica in \(Q_2\).
The new primary merges it, ensuring all future logs include it.
Hence, committed operations are never lost.
Formally: \[
\forall v, i: \text{committed}(v, i) \Rightarrow \forall v' > v, \text{primary}(v') \text{ includes } i
\]
Try It Yourself
Simulate a 3-replica system (P1, P2, P3).
Let P1 act as primary; send operations sequentially.
Kill P1 mid-operation, observe how P2 initiates a new view.
Reintroduce P1 and verify that it synchronizes the log.
Repeat with message drops and recoveries.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| Normal operation | Primary commits via majority | Linearizable log |
| Primary crash | View change elects new primary | Safe recovery |
| Network partition | Only majority proceeds | No conflicting commits |
| Recovery after crash | Replica syncs log | Eventual consistency |
Complexity
Message rounds: 2 per operation (prepare + commit)
View change: 1 additional round during leader election
Fault tolerance: up to ⌊(N−1)/2⌋ replicas
State: log entries, view number, primary flag
Viewstamped Replication is the bridge between Paxos and Raft, same mathematical core, but framed as a story of views and primaries, where leadership changes are graceful and memory never fades.
855 Practical Byzantine Fault Tolerance (PBFT)
Practical Byzantine Fault Tolerance (PBFT) is a consensus algorithm that tolerates Byzantine failures, arbitrary or even malicious behavior by some nodes, while still ensuring correctness and liveness. It allows a distributed system of \(3f + 1\) replicas to continue operating correctly even if up to \(f\) of them act incorrectly or dishonestly.
What Problem Are We Solving?
Classic consensus algorithms like Paxos or Raft assume crash faults: nodes may stop working, but they never lie. In real-world distributed systems, especially open or adversarial environments (like blockchains, financial systems, or untrusted datacenters), nodes can behave arbitrarily:
Send conflicting messages
Forge responses
Replay or reorder messages
PBFT ensures the system still reaches agreement and executes operations in the same order, even if some nodes are malicious.
“It’s consensus in a world where some players cheat, and honesty still wins.”
How Does It Work (Plain Language)
PBFT operates in views coordinated by a primary (leader), with replicas as backups. Each client request passes through three phases, all authenticated by digital signatures or message digests.
Roles
Primary: coordinates request ordering.
Replicas: validate and agree on the primary’s proposals.
Client: sends requests and waits for a quorum of replies.
Why It Matters
Low latency: only three message rounds (pre-prepare, prepare, commit) in the common case.
Influence: forms the basis for modern blockchain consensus protocols (Tendermint, HotStuff, PBFT-SMART, LibraBFT).
Trade-offs:
High communication cost (\(O(n^2)\) messages per phase).
Assumes authenticated channels.
Performance degrades beyond a small cluster (typically ≤ 20 replicas).
A Gentle Proof (Why It Works)
PBFT ensures safety and liveness through quorum intersection and authentication.
Each decision requires agreement by \(2f + 1\) nodes.
Any two quorums intersect in at least one correct node: \[
(2f + 1) + (2f + 1) > 3f + 1
\] So there’s always at least one honest replica linking past and future decisions.
Because all messages are signed, a faulty node cannot impersonate or forge votes.
Hence, even with up to \(f\) malicious nodes, conflicting commits are impossible.
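To see the quorum arithmetic in numbers, here is a small Python sketch that computes the replica count, quorum size, and guaranteed honest overlap for a few values of \(f\); the function name is just for illustration.

```python
def pbft_quorums(f):
    # Tolerating f Byzantine faults requires n = 3f + 1 replicas.
    n = 3 * f + 1
    quorum = 2 * f + 1            # each decision needs 2f + 1 matching messages
    # Any two quorums overlap in at least 2*quorum - n = f + 1 replicas;
    # since at most f are faulty, at least one overlapping replica is honest.
    honest_overlap = 2 * quorum - n - f
    return n, quorum, honest_overlap

for f in range(1, 4):
    n, q, overlap = pbft_quorums(f)
    print(f"f={f}: replicas={n}, quorum={q}, guaranteed honest overlap={overlap}")
```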
Try It Yourself
Measure total messages exchanged for 1 request vs Raft.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| No faults | 3-phase commit | Agreement achieved |
| Primary fails | View change | New primary elected |
| One replica sends bad data | Ignored by quorum | Safety preserved |
| Replay attack | Rejected (timestamp/digest) | Integrity preserved |
Complexity
Message complexity: \(O(n^2)\) per request
Message rounds: 3 (pre-prepare, prepare, commit)
Fault tolerance: up to \(f\) Byzantine failures with \(3f + 1\) replicas
Latency: 3 network RTTs (normal case)
PBFT is consensus in an adversarial world, where honesty must be proven by quorum, and agreement arises not from trust, but from the mathematics of intersection and integrity.
856 Zab (Zookeeper Atomic Broadcast)
Zab, short for Zookeeper Atomic Broadcast, is a consensus and replication protocol used by Apache Zookeeper to maintain a consistent distributed state across all servers. It’s designed specifically for leader-based coordination services, ensuring that all updates (state changes) are delivered in the same order to every replica, even across crashes and restarts.
What Problem Are We Solving?
Zookeeper provides guarantees that every operation:
Executes in a total order (same sequence everywhere).
Survives server crashes and recoveries.
Doesn’t lose committed updates even when the leader fails.
Zab solves the challenge of combining:
Atomic broadcast (all or nothing delivery), and
Crash recovery (no double-commit or rollback).
“If one server says it happened, then it happened everywhere, exactly once, in the same order.”
How Does It Work (Plain Language)
Zab builds on the leader-follower model but extends it to guarantee total order broadcast and durable recovery. It works in three major phases:
1. Discovery Phase
A new leader is elected (using an external mechanism like Fast Leader Election).
The leader determines the latest committed transaction ID (zxid) among servers.
The leader chooses the most up-to-date history as the system’s official prefix.
2. Synchronization Phase
The leader synchronizes followers’ logs to match the chosen prefix.
Followers truncate or fill in missing proposals to align with the leader’s state.
Once synchronized, followers move to the broadcast phase.
3. Broadcast Phase
The leader receives client transactions, assigns a new zxid (monotonically increasing ID), and sends a PROPOSAL to followers.
Followers persist the proposal to disk and reply with ACK.
When a quorum acknowledges, the leader sends a COMMIT message.
All followers apply the transaction in order.
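Here is a minimal Python sketch of the broadcast phase, with the zxid modeled as an (epoch, counter) pair; class and method names are illustrative and not Zookeeper's actual API, and disk persistence is only noted in a comment.

```python
class ZabLeader:
    def __init__(self, followers, epoch):
        self.followers = followers
        self.epoch = epoch
        self.counter = 0          # per-epoch counter; zxid = (epoch, counter)
        self.log = []

    def broadcast(self, txn):
        # Assign a monotonically increasing zxid and send PROPOSAL to followers.
        self.counter += 1
        zxid = (self.epoch, self.counter)
        acks = sum(1 for f in self.followers if f.propose(zxid, txn) == "ACK")
        # Quorum counts the leader itself plus acknowledging followers.
        if acks + 1 > (len(self.followers) + 1) // 2:
            for f in self.followers:
                f.commit(zxid)
            self.log.append((zxid, txn))
            return zxid
        return None

class ZabFollower:
    def __init__(self):
        self.pending = {}
        self.log = []

    def propose(self, zxid, txn):
        self.pending[zxid] = txn   # a real follower persists this to disk first
        return "ACK"

    def commit(self, zxid):
        self.log.append((zxid, self.pending.pop(zxid)))
```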
Example Timeline

| Phase | Message | Description |
|---|---|---|
| Discovery | Leader election | New leader collects latest zxids |
| Sync | PROPOSAL sync | Aligns logs among followers |
| Broadcast | PROPOSAL/ACK/COMMIT | Steady-state replication |
If the leader crashes mid-commit, the next leader uses the discovery phase to find the most advanced log, ensuring no committed transaction is lost or replayed.
Formally:
\[
\text{If } T_i \text{ committed in epoch } e, \text{ then all future epochs } e' > e \text{ contain } T_i
\]
Hence, atomicity and total order are preserved across view changes.
Try It Yourself
Simulate three servers: one leader, two followers.
Let the leader propose transactions (T1, T2, T3).
Kill the leader after committing T2, start a new leader.
Observe how T3 (uncommitted) is discarded, but T1–T2 persist.
Replay the process and verify all nodes converge to identical logs.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| Normal broadcast | Leader proposes, quorum ACKs | Total order commit |
| Leader crash after commit | Recovery preserves state | No rollback |
| Leader crash before commit | Uncommitted proposal dropped | No duplication |
| Log divergence | New leader syncs highest prefix | Consistency restored |
Complexity
Message rounds: 2 (proposal + commit)
Message complexity: \(O(N)\) per transaction
Fault tolerance: up to ⌊(N−1)/2⌋ failures
Storage: log entries + zxid sequence
Latency: 2 network RTTs per transaction
Zab is the quiet metronome behind Zookeeper’s reliability, a single leader broadcasting heartbeat and order, ensuring that every replica, everywhere, hears the same story in the same rhythm.
857 EPaxos (Egalitarian Paxos)
EPaxos, short for Egalitarian Paxos, is a fast, leaderless consensus algorithm that generalizes Paxos to allow any replica to propose and commit commands concurrently, without waiting for a fixed leader. It optimizes latency and throughput by exploiting command commutativity (independent operations that can execute in any order) and fast quorum commits.
What Problem Are We Solving?
In leader-based protocols (like Paxos, Raft, Zab):
One node (the leader) coordinates every command.
This creates a bottleneck and extra latency (2 RTTs to commit).
EPaxos eliminates the leader and lets multiple nodes propose concurrently. It still guarantees total order of non-commuting operations while skipping coordination for independent ones.
“If commands don’t conflict, why make them wait in line?”
How Does It Work (Plain Language)
EPaxos generalizes Paxos’s quorum logic with dependency tracking. Each replica can propose commands, and dependencies determine ordering.
Key Components
Command
Each client request: a command C with unique ID and operation.
Dependencies
Each command tracks a set of conflicting commands that must precede it.
Two commands commute if they don’t access overlapping keys or objects.
Quorums
EPaxos uses fast quorums, slightly larger than a majority, to commit in 1 RTT when there’s no conflict.
Protocol Overview
1. Propose Phase
A replica receives a client command C.
It sends a PreAccept(C, deps) message to all others with an initial dependency set (empty or guessed).
Each replica adds conflicting commands it knows of, returning an updated dependency set.
2. Fast Path (no conflicts)
If all responses agree on the same dependency set:
The command is committed immediately in one round-trip (fast path).
Otherwise, proceed to the slow path.
3. Slow Path (conflicts)
The proposer collects responses and picks the maximum dependency set.
Sends an Accept message to ensure quorum agreement.
Once a quorum of Accepts is received, the command is committed.
4. Execution
Commands are executed respecting the dependency graph (a partial order).
If A depends on B, execute B before A.
Example Timeline (5 replicas, fast quorum = 3)

| Step | Replica | Action | Note |
|---|---|---|---|
| 1 | R1 | PreAccept(C1, {}) | Propose command C1 |
| 2 | R2, R3 | Reply with same deps | No conflicts |
| 3 | R1 | Fast commit (1 RTT) | Command C1 committed |
| 4 | R4 | PreAccept(C2, {C1}) | Conflicting command |
| 5 | R2, R5 | Add dependency | Requires slow path |
| 6 | R4 | Commit C2 after quorum Accept | Ordered after C1 |
Tiny Code (Simplified Python Sketch)
```python
class EPaxosReplica:
    def __init__(self, id):
        self.id = id
        self.deps = {}
        self.log = {}

    def preaccept(self, cmd, deps):
        self.deps[cmd] = deps.copy()
        # add local conflicts
        for c in self.log:
            if self.conflicts(cmd, c):
                self.deps[cmd].add(c)
        return self.deps[cmd]

    def conflicts(self, c1, c2):
        # simple key-based conflict detection
        return c1.key == c2.key
```
Each replica maintains dependencies and merges them to form a global partial order.
Why It Matters
Leaderless: No single coordinator or bottleneck.
Low latency: One RTT commit in the common case.
Parallelism: Multiple replicas propose and commit concurrently.
Consistency: Serializability for dependent commands, commutativity for independent ones.
High availability: Survives up to ⌊(N−1)/2⌋ failures.
Trade-offs:
More complex dependency tracking and message state.
Fast path requires more replicas than majority quorum.
Log recovery (after crash) is trickier.
A Gentle Proof (Why It Works)
Let \(Q_f\) be a fast quorum, \(Q_s\) a slow quorum, and \(N\) replicas.
\(|Q_f| + |Q_s| > N\) ensures intersection.
Any two quorums intersect in at least one correct replica, ensuring agreement.
For each command \(C\):
The dependency set \(\text{deps}(C)\) defines a partial order \(<\).
If \(C_1\) and \(C_2\) conflict, then either \(C_1 < C_2\) or \(C_2 < C_1\) is enforced via dependencies.
Hence, the execution order:
\[
C_i < C_j \;\Rightarrow\; \text{execute}(C_i) \text{ before } \text{execute}(C_j)
\]
maintains linearizability for conflicting commands and high concurrency for independent ones.
Try It Yourself
Simulate 5 replicas; let two propose commands on disjoint keys concurrently.
Observe both commit in 1 RTT (fast path).
Introduce a conflicting command; watch it fall back to the 2-phase slow path.
Draw the dependency graph and verify topological execution order.
Fail one replica and confirm quorum intersection still ensures agreement.
Complexity
Dependency tracking: \(O(K)\) per command (for K conflicts)
EPaxos makes consensus truly egalitarian, no single leader, no fixed rhythm, just replicas cooperating in harmony, deciding together, each aware of dependencies, yet free to move fast when they can.
858 VRR (Virtual Ring Replication)
Virtual Ring Replication (VRR) is a distributed consensus and replication protocol that organizes replicas into a logical ring to provide high-throughput, fault-tolerant log replication with a simpler structure than fully connected quorum systems. It ensures that all replicas deliver updates in the same order while efficiently handling failures and recovery.
What Problem Are We Solving?
Traditional replication protocols like Paxos or Raft rely on majority quorums and broadcast communication, which can become expensive as the cluster grows. VRR instead arranges replicas into a virtual ring where messages flow in one direction, reducing coordination overhead.
The goal is:
Consistent state replication across all replicas.
Efficient communication using ring topology.
Fault tolerance through virtual successor and predecessor mapping.
“Rather than shouting to everyone, VRR whispers around the circle, and the message still reaches all.”
How Does It Work (Plain Language)
VRR extends the primary-backup model with a logical ring overlay among replicas.
Components
Primary replica: initiates client requests and broadcasts updates around the ring.
Backups: relay and confirm messages along the ring.
Ring order: determines the sequence of replication and acknowledgment.
View number: identifies the current configuration (like a Paxos term).
Phases
Normal Operation
The primary receives a client request.
It assigns a sequence number n and sends an UPDATE(n, op) to its successor on the ring.
Each node forwards the message to its successor until it completes a full circle.
When the update returns to the primary, it is committed.
Every node applies the operation in order.
Failure Handling
If a node fails to forward the update, its successor detects timeout and initiates a view change.
The next node in the ring becomes the new primary and continues operation.
The ring is virtually rewired to skip failed nodes.
Recovery
Failed nodes can rejoin later by replaying missed updates from the ring or a checkpoint.
Example Timeline

| Step | Phase | Action |
|---|---|---|
| 1 | Normal | Primary P1 sends UPDATE(n=1, X) to P2 |
| 2 | Normal | P2 → P3 → P4 (update circulates) |
| 3 | Normal | P4 → P1 (full ring) |
| 4 | Normal | P1 commits X |
| 5 | Failure | P3 crashes; P4 times out |
| 6 | View Change | P4 becomes new primary, rebuilds ring excluding P3 |
Tiny Code (Simplified Python Simulation)
```python
class Replica:
    def __init__(self, id, successor=None):
        self.id = id
        self.successor = successor
        self.log = []

    def update(self, seq, op):
        self.log.append((seq, op))
        print(f"{self.id} applied op {op}")
        if self.successor:
            self.successor.update(seq, op)
```
This model demonstrates circular propagation of updates through a virtual ring.
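To exercise the sketch above, you can wire a few replicas into a partial ring and push one update through; note that the simplified `update` method forwards unconditionally, so the chain is left open here rather than closed back to the primary.

```python
# Build A → B → C and push one update through it.
c = Replica("C")                 # no successor: forwarding stops here
b = Replica("B", successor=c)
a = Replica("A", successor=b)
# A full implementation would close the ring (C → A) and detect when the
# update returns to the primary before committing; the sketch omits that.
a.update(1, "x=5")
```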
Why It Matters
Low communication overhead: Only one message per hop, not all-to-all.
High throughput: Ideal for stable, low-failure environments.
Scalable: Works well with many replicas.
Fault-tolerant: Handles failures by rerouting around missing nodes.
Foundation: Inspired later protocols like Corfu and chain replication.
Trade-offs:
Slightly higher latency (depends on ring length).
Single primary at a time, reconfiguration cost on failure.
Requires stable network connectivity between ring neighbors.
A Gentle Proof (Why It Works)
Let replicas be arranged in a ring \(R = [r_1, r_2, \dots, r_N]\). Each operation \(op_i\) is assigned a sequence number \(n_i\) by the primary \(r_p\).
Total order: updates circulate the ring in the same order for all.
Durability: a commit is acknowledged after returning to the primary, ensuring that all replicas have applied the update.
Fault tolerance: when a node fails, a new ring \(R'\) is formed excluding it.
For any two updates \(op_i, op_j\), \[
n_i < n_j \Rightarrow \text{deliver}(op_i) \text{ before } \text{deliver}(op_j)
\] and \[
\forall R', R'': \text{prefix}(R') = \text{prefix}(R'') \Rightarrow \text{consistent state}
\]
This preserves total order and prefix consistency under failures.
Try It Yourself
Create a ring of four replicas (A → B → C → D → A).
Let A be primary and broadcast updates (1, 2, 3).
Kill C mid-update, observe timeout and view change.
Rebuild ring as A → B → D → A, continue replication.
Reintroduce C, synchronize from D’s log.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| Normal operation | Sequential forwarding | Ordered replication |
| One node crash | View change | Ring reformed |
| Late node recovery | Log replay | Full synchronization |
| Network delay | Sequential consistency | Eventual delivery |
Complexity
Message rounds: \(O(N)\) per update (one hop per replica)
Message complexity: \(O(N)\)
Fault tolerance: up to ⌊(N−1)/2⌋ via reformation
Storage: log per node + checkpoint
Latency: proportional to ring length
Virtual Ring Replication is like a well-choreographed relay race, each runner passes the baton in perfect sequence, and even if one stumbles, the circle reforms and the race goes on.
859 Two-Phase Commit with Consensus
Two-Phase Commit with Consensus (2PC+C) merges two powerful mechanisms, the atomic commit protocol from databases and distributed consensus from Paxos/Raft, to achieve fault-tolerant transactional commits across multiple nodes or services. It ensures that a transaction is either committed by all participants or aborted by all, even in the presence of failures or partitions.
What Problem Are We Solving?
The classic Two-Phase Commit (2PC) protocol coordinates distributed transactions across nodes, but it has a fatal flaw:
If the coordinator fails after participants vote yes but before announcing commit, all participants block indefinitely.
To fix this, 2PC needs consensus, a way for participants to agree on the outcome (commit or abort) even if the coordinator dies.
2PC+C integrates consensus (like Paxos or Raft) to make the decision durable, available, and recoverable.
“If one node falls silent, consensus finishes the sentence.”
How Does It Work (Plain Language)
The protocol has two main roles:
Coordinator: orchestrates the transaction (can be replicated using consensus).
Participants: local databases or services that prepare and commit work.
The algorithm proceeds in two logical phases, each made reliable by consensus.
1. Prepare Phase (Voting)
Coordinator proposes a transaction T.
Sends PREPARE(T) to all participants.
Each participant:
Validates local constraints.
Logs READY(T) to durable storage.
Replies YES (ready to commit) or NO (abort).
Coordinator collects votes.
If any NO → outcome = ABORT.
If all YES → outcome = COMMIT.
2. Commit Phase (Decision via Consensus)
Coordinator proposes the final decision (COMMIT or ABORT) via consensus.
Once a majority of nodes in the consensus group agree:
Decision is replicated and durable.
All participants are informed of the final result and apply it locally.
This way, even if the coordinator crashes mid-decision, another node can recover the log and complete the transaction safely.
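A coordinator-side sketch in Python, under the same simplifications: `consensus_propose` is an assumed hook standing in for whatever Paxos/Raft library records the decision durably, and participants are assumed to expose `prepare`, `commit`, and `abort` methods.

```python
def run_transaction(tx, participants, consensus_propose):
    # Phase 1 (voting): collect a YES/NO vote from every participant.
    votes = [p.prepare(tx) for p in participants]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"

    # Phase 2 (decision): make the outcome durable via the consensus group
    # before telling anyone; recovery reads it back from the consensus log.
    consensus_propose(tx, decision)

    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision
```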
Example Timeline

| Step | Phase | Action | Result |
|---|---|---|---|
| 1 | Prepare | Coordinator sends PREPARE(T) | Participants vote |
| 2 | Prepare | All reply YES | Decision: commit |
| 3 | Consensus | Decision proposed via Paxos | Majority accept |
| 4 | Commit | Decision broadcast | All commit |
| 5 | Crash recovery | Coordinator restarts | Learns decision from log |
Tiny Code (Simplified Pseudocode)
```python
class Participant:
    def __init__(self):
        self.state = "INIT"

    def prepare(self, tx):
        if self.can_commit(tx):
            self.state = "READY"
            return "YES"
        else:
            self.state = "ABORT"
            return "NO"

    def commit(self):
        if self.state == "READY":
            self.state = "COMMIT"
            print("Committed")
        else:
            print("Aborted")
```
The coordinator’s decision (commit or abort) is stored via consensus among replicas, ensuring fault tolerance.
Why It Matters
No blocking: coordinator failure doesn’t stall participants.
Atomicity: all-or-nothing commit across distributed systems.
Durability: decisions survive crashes and restarts.
Integrates databases + consensus systems: basis for Spanner, CockroachDB, TiDB, Yugabyte, etc.
General-purpose: works across heterogeneous services (microservices, key-value stores, message queues).
Trade-offs:
More message complexity (adds consensus layer).
Slightly higher latency.
Consensus replicas must remain available (quorum required).
A Gentle Proof (Why It Works)
Let \(D \in \{\text{COMMIT}, \text{ABORT}\}\) be the global decision.
All participants vote and log their decisions durably.
The coordinator uses consensus to record \(D\) on a majority of replicas.
Any recovery process consults the consensus log: \[
\text{learn}(D) = \text{argmax}_n(\text{accepted}_n)
\] ensuring that all replicas converge on the same outcome.
Because consensus ensures a single agreed value, no two participants can observe conflicting decisions.
Safety: \[
\text{If one node commits } T, \text{ then every node eventually commits } T
\]
Liveness (under partial synchrony): \[
\text{If quorum available and all vote YES, } T \text{ eventually commits.}
\]
Try It Yourself
Simulate three participants and one consensus cluster (3 nodes).
Let all vote YES. Commit via consensus → durable success.
Crash coordinator mid-phase → restart → read outcome from log.
Try with one participant voting NO → global abort.
Observe that no node blocks indefinitely.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| All participants vote YES | Commit via consensus | Consistent commit |
| One participant votes NO | Global abort | Safe abort |
| Coordinator crash mid-commit | Recovery via consensus log | No blocking |
| Network partition | Majority side decides | Consistency preserved |
Complexity
Message rounds: 2PC (2) + Consensus (2) = 4 in total
Message complexity: \(O(N + M)\) for N participants, M replicas in consensus
Fault tolerance: up to ⌊(M−1)/2⌋ coordinator replica failures
Storage: logs for votes and decisions
2PC with Consensus turns the fragile “yes/no” dance of distributed transactions into a robust orchestration, where even silence, failure, or chaos cannot stop the system from deciding, together, and forever.
860 Chain Replication
Chain Replication is a fault-tolerant replication technique that arranges servers in a linear chain, ensuring strong consistency, high throughput, and predictable failure recovery. It’s widely used in large-scale storage systems and coordination services where updates must be processed in total order without requiring all-to-all communication.
What Problem Are We Solving?
Traditional quorum-based replication (like Paxos or Raft) requires majority acknowledgments, which can increase latency. In contrast, Chain Replication guarantees:
Linearizability (same order of operations on all replicas)
High throughput by pipelining updates down a chain
Low latency for reads (served from the tail)
The key idea:
Arrange replicas in a chain: head → middle → tail. Writes flow forward, reads flow backward.
How Does It Work (Plain Language)
Setup
A cluster has \(N\) replicas, ordered logically: \[
r_1 \rightarrow r_2 \rightarrow r_3 \rightarrow \dots \rightarrow r_N
\]
Head: handles all client write requests.
Tail: handles all client read requests.
Middle nodes: forward writes and acknowledgments.
Write Path
Client sends WRITE(x, v) to the head.
The head applies the update locally and forwards it to its successor.
Each node applies the update and forwards it down the chain.
The tail applies the update, sends ACK upstream.
Once the head receives ACK, the write is committed and acknowledged to the client.
Read Path
Client sends READ(x) to the tail, which has the most up-to-date committed state.
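A minimal Python sketch of the write and read paths, with the tail's ACK modeled as the return value that propagates back up the chain of calls; names are illustrative.

```python
class ChainNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}

    def write(self, key, value):
        # Apply locally, then forward down the chain.
        self.store[key] = value
        if self.successor:
            return self.successor.write(key, value)
        # This node is the tail: the write is now committed.
        return "ACK"

    def read(self, key):
        # Reads are served by the tail, which holds only committed state.
        return self.store.get(key)

# Head → Mid → Tail
tail = ChainNode("tail")
mid = ChainNode("mid", successor=tail)
head = ChainNode("head", successor=mid)

assert head.write("x", 5) == "ACK"   # ACK propagates back to the head
print(tail.read("x"))                # 5, the committed value
```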
Example Timeline (3-node chain)

| Step | Node | Action | Note |
|---|---|---|---|
| 1 | Head | WRITE(x=5) | Apply locally |
| 2 | Head → Mid | Forward update | |
| 3 | Mid | Apply, forward to Tail | |
| 4 | Tail | Apply, send ACK | Commit point |
| 5 | Mid → Head | ACKs propagate back | |
| 6 | Head | Confirms to client | Commit complete |
Failure Handling
Head Failure: next node becomes new head.
Tail Failure: previous node becomes new tail.
Middle Failure: chain is reconnected, skipping the failed node.
The coordinator or control service reconfigures the chain dynamically.
Recovery: a restarted node can rejoin by fetching the tail’s state and reentering the chain.
Failures never violate consistency, only temporarily reduce availability.
Why It Matters
Used in: FAWN-KV, Ceph RADOS, Azure Cosmos DB, and more.
Trade-offs:
Single chain per data partition, not suitable for high contention global writes.
Reconfiguration during failures adds brief downtime.
Requires external coordination (e.g., Zookeeper) to manage membership.
A Gentle Proof (Why It Works)
Let \(U = {u_1, u_2, \dots, u_n}\) be a sequence of updates applied along the chain.
Each update \(u_i\) is:
Applied in order at each node (FIFO forwarding).
Committed when acknowledged by the tail.
Linearizability: For any two operations \(u_i\) and \(u_j\): \[
u_i \text{ completes before } u_j \text{ begins} \Rightarrow u_i \text{ visible before } u_j
\]
All nodes see writes in identical order because forwarding is sequential: \[
\text{order}(r_1) = \text{order}(r_2) = \dots = \text{order}(r_N)
\]
Hence, reads from the tail always reflect the latest committed value.
Try It Yourself
Simulate a chain of three replicas.
Send sequential writes to the head.
Verify that each node applies updates in the same order.
Crash the middle node, reconnect chain (head → tail).
Confirm that consistency is preserved.
Restart the failed node and resynchronize it.
Test Cases

| Scenario | Behavior | Result |
|---|---|---|
| Sequential writes | Ordered propagation | Strong consistency |
| Tail read after commit | Latest value visible | Linearizable |
| Middle node crash | Chain reconfigures | No inconsistency |
| Simultaneous writes | FIFO at head | Deterministic order |
Complexity
Write latency: proportional to chain length (\(O(N)\) hops)
Read latency: \(O(1)\) (tail only)
Message complexity: \(2N\) per write (forward + ack)
Fault tolerance: up to \(N-1\) failures with reconfiguration (data survives as long as one replica and the configuration service remain)
Storage: full state replica per node
Chain Replication is the assembly line of consistency, each node adds reliability in sequence, the message flows down the line, and the final product is a perfectly ordered, fault-tolerant log of truth.
Section 87. Load Balancing and Rate Limiting
861 Round Robin Load Balancing
Round Robin Load Balancing is one of the simplest and most widely used methods for distributing requests among multiple servers. It cycles through a list of servers in order, sending each incoming request to the next server in line. Once the end of the list is reached, it starts again from the beginning.
This algorithm is deterministic, fair, and stateless, ideal for evenly spreading load when servers have similar capacities.
What Problem Are We Solving?
In distributed systems and web architectures, a single server cannot handle all incoming traffic. We need a load balancer that divides requests among multiple servers so that:
No single server becomes overloaded
Requests are processed efficiently
The system remains scalable and responsive
Round Robin provides a simple way to achieve fair load distribution without tracking server state or performance.
How Does It Work (Plain Language)
Imagine you have three servers: S1, S2, S3. Requests arrive as R1, R2, R3, R4, R5, R6.
The Round Robin balancer routes them like this:
| Request | Server |
|---|---|
| R1 | S1 |
| R2 | S2 |
| R3 | S3 |
| R4 | S1 |
| R5 | S2 |
| R6 | S3 |
Each server receives requests in turn, producing a smooth rotation of work.
If a server fails, it is removed from the rotation until it recovers.
Tiny Code (Python Example)
```python
servers = ["S1", "S2", "S3"]
index = 0

def get_server():
    global index
    server = servers[index]
    index = (index + 1) % len(servers)
    return server

# Simulate incoming requests
for r in range(1, 7):
    print(f"Request {r} → {get_server()}")
```
Tiny Code (C Version)

```c
#include <stdio.h>

int main() {
    const char *servers[] = {"S1", "S2", "S3"};
    int n = 3, index = 0;
    for (int r = 1; r <= 6; r++) {
        printf("Request %d → %s\n", r, servers[index]);
        index = (index + 1) % n;
    }
    return 0;
}
```
Why It Matters
Simplicity: No need to track metrics or states.
Fairness: Every server gets roughly the same number of requests.
Scalability: Easy to extend, just add servers to the list.
Statelessness: Load balancer doesn’t need session memory.
Common Use Cases:
DNS round robin
HTTP load balancers (NGINX, HAProxy)
Task queues and job schedulers
Trade-offs:
Does not account for server load differences.
May overload slow or busy servers if they vary in capacity.
Works best when all servers are homogeneous.
A Gentle Proof (Why It Works)
Let there be \(n\) servers and \(m\) requests. Each request \(r_i\) is assigned to a server according to:
\[
\text{server}(r_i) = S_{((i-1) \bmod n) + 1}
\]
so each server receives either \(\lfloor m/n \rfloor\) or \(\lceil m/n \rceil\) requests. Thus, the difference between any two servers' loads is at most one request, ensuring near-perfect balance for homogeneous servers.
Try It Yourself
Implement the Python or C version above.
Add or remove servers and observe the assignment pattern.
Introduce random delays on one server and note that Round Robin doesn’t adapt, that’s where Weighted or Least-Connections algorithms come in.
Visualize requests on a timeline, notice the perfect rotation.
Test Cases

| Scenario | Input | Behavior | Output |
|---|---|---|---|
| 3 servers, 6 requests | S1, S2, S3 | Perfect rotation | S1,S2,S3,S1,S2,S3 |
| Add 4th server | S1..S4 | Repeats every 4 | S1,S2,S3,S4,S1,S2 |
| Remove server S2 | S1,S3 | Alternate between 2 | S1,S3,S1,S3 |
| Unequal processing time | S1 slow | Still evenly assigned | S1 overloads |
Complexity
Time: \(O(1)\) per request (simple modulo arithmetic)
Space: \(O(n)\) for the list of servers
Fairness error: ≤ 1 request difference
Fault tolerance: depends on health check logic
Round Robin is the clockwork of load balancing, simple, rhythmic, and predictable. It may not be clever, but it is reliable, and in distributed systems, reliability is often the most elegant solution of all.
862 Weighted Round Robin
Weighted Round Robin (WRR) is an extension of simple Round Robin that assigns different weights to servers based on their capacity or performance. Instead of treating all servers equally, it proportionally distributes requests so that faster or more capable servers receive more load.
This algorithm is widely used in web servers, load balancers, and content delivery systems to handle heterogeneous clusters efficiently.
What Problem Are We Solving?
Classic Round Robin assumes that every server has the same capacity. But in reality:
Some servers have more CPU cores or memory.
Some are located closer to clients (less latency).
Some might be replicas handling read-heavy workloads.
Weighted Round Robin ensures that each server receives load proportional to its weight, keeping utilization balanced and throughput optimal.
How Does It Work (Plain Language)
Each server has a weight \(w_i\) that represents how many requests it should handle per cycle. The algorithm cycles through the server list, but repeats each server according to its weight.
Example:

| Server | Weight | Assigned Requests (per cycle) |
|---|---|---|
| S1 | 1 | 1 |
| S2 | 2 | 2 |
| S3 | 3 | 3 |
For 6 requests: Sequence → S1, S2, S2, S3, S3, S3
This ensures heavier servers receive proportionally more traffic.
Tiny Code (C Version)

```c
#include <stdio.h>

typedef struct {
    const char *name;
    int weight;
} Server;

int main() {
    Server servers[] = {{"S1", 1}, {"S2", 2}, {"S3", 3}};
    int total = 6;
    int count = 0;
    while (count < total) {
        for (int i = 0; i < 3 && count < total; i++) {
            for (int w = 0; w < servers[i].weight && count < total; w++) {
                printf("Request %d → %s\n", count + 1, servers[i].name);
                count++;
            }
        }
    }
    return 0;
}
```
Why It Matters
Weighted Round Robin is ideal when:
Servers have different capacities.
Some nodes should preferentially handle more requests.
The environment remains mostly stable (no rapid load shifts).
Advantages:
Simple and deterministic.
Adapts to heterogeneous clusters.
Easy to implement and reason about.
Trade-offs:
Still static, does not react to real-time load or queue length.
Requires periodic weight tuning based on performance metrics.
A Gentle Proof (Why It Works)
Let servers \(S_1, S_2, \dots, S_n\) have weights \(w_1, w_2, \dots, w_n\). Define total weight: \[
W = \sum_{i=1}^n w_i
\]
Each server \(S_i\) should handle: \[
f_i = \frac{w_i}{W} \times m
\] where \(m\) is the total number of requests.
Over a full cycle, the algorithm ensures: \[
|\text{load}(S_i) - f_i| \le 1
\]
Thus, the load deviation between actual and ideal distribution is bounded by 1 request, maintaining fairness proportional to capacity.
Try It Yourself
Assign weights to three servers: 1, 2, 3.
Generate 12 requests and observe the distribution pattern.
Increase S3’s weight to 5, notice how most requests now go there.
Implement a dynamic version that adjusts weights based on latency or CPU load.
Test Cases

| Scenario | Weights | Requests | Distribution | Result |
|---|---|---|---|---|
| Equal weights | [1,1,1] | 6 | 2,2,2 | Same as Round Robin |
| Unequal weights | [1,2,3] | 6 | 1,2,3 | Proportional |
| Add server | [1,2,3,1] | 7 | 1,2,3,1 | Fair rotation |
| Remove slow server | [0,2,3] | 5 | 0,2,3 | Load shifts to others |
Complexity
Time: \(O(1)\) per request (precomputed sequence possible)
Space: \(O(n)\) for servers and weights
Fairness error: ≤ 1 request difference
Scalability: excellent for up to hundreds of servers
Weighted Round Robin is like an orchestra conductor assigning musical phrases to instruments, each plays according to its strength, creating harmony instead of overload.
863 Least Connections
Least Connections is a dynamic load balancing algorithm that always sends a new request to the server with the fewest active connections. Unlike Round Robin or Weighted Round Robin, it reacts to real-time load rather than assuming all servers are equally busy.
This makes it one of the most efficient algorithms for systems with variable request times, for example, when some requests last much longer than others.
What Problem Are We Solving?
In practice, not all requests are equal:
Some complete in milliseconds, others take seconds or minutes.
Some servers may be temporarily overloaded.
Round Robin might keep sending requests to a busy server.
Least Connections solves this by dynamically picking the least busy node each time, balancing current load rather than static assumptions.
How Does It Work (Plain Language)
At any given moment, the load balancer keeps track of the number of active connections per server.
When a new request arrives:
The balancer checks all servers’ active connection counts.
It chooses the one with the fewest ongoing connections.
When a request completes, the count for that server decreases.
Example:

| Server | Active Connections |
|---|---|
| S1 | 5 |
| S2 | 2 |
| S3 | 3 |
→ New request goes to S2, because it has the fewest active connections.
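A minimal Python sketch of the selection rule, with active connection counts kept in a plain dictionary; in a real balancer these counters are updated as connections open and close, and ties would need an explicit policy.

```python
active = {"S1": 5, "S2": 2, "S3": 3}

def pick_server():
    # Choose the server with the fewest active connections.
    server = min(active, key=active.get)
    active[server] += 1      # the new request is now in flight
    return server

def finish(server):
    active[server] -= 1      # the request completed

print(pick_server())  # S2 (2 active connections, the minimum)
print(pick_server())  # S2 again: it now ties S3 at 3, and min() keeps the first key
```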
Trade-offs:
Can oscillate if all servers frequently change load.
Needs thread-safe counters in distributed settings.
A Gentle Proof (Why It Works)
Let there be \(n\) servers with active connection counts \(c_1, c_2, \dots, c_n\). When a new connection arrives, it is assigned to:
\[
S_k = \arg\min_{i}(c_i)
\]
After assignment:
\[
c_k \leftarrow c_k + 1
\]
Over time, the difference between any two servers’ loads satisfies:
\[
|c_i - c_j| \le 1
\]
under the assumption that connection durations follow similar distributions. This ensures that the variance in connection counts remains minimal, leading to even utilization.
Try It Yourself
Start with servers: S1=5, S2=3, S3=3.
Assign 10 random requests using the “least connections” rule.
Simulate completion (subtract counts).
Compare with Round Robin, notice smoother balancing.
Add weight support (Weighted Least Connections) and observe improvements.
Test Cases

| Scenario | Initial Connections | Next Target | Result |
|---|---|---|---|
| Equal load | [2,2,2] | Any | Tie-breaking |
| Unequal load | [5,2,3] | S2 | Chooses least busy |
| Dynamic | [3,4,5] | S1 | Always lowest |
| Completion | S2 finishes | Count decreases | Balances load |
Complexity
Time: \(O(n)\) per request (scan servers)
Space: \(O(n)\) (store active counts)
Adaptivity: dynamic, reacts to real-time load
Fairness: excellent under varying workloads
Optimized implementations use heaps or priority queues for \(O(\log n)\) selection.
Least Connections is the smart intuition of load balancing, it doesn’t just count servers, it listens to them. By watching who’s busiest and who’s free, it quietly keeps the system in perfect rhythm.
864 Consistent Hashing
Consistent Hashing is a clever technique for distributing data or requests across many servers while minimizing movement when nodes are added or removed. It’s the backbone of scalable systems like CDNs, distributed caches (Memcached, Redis Cluster), and distributed hash tables (DHTs) in systems like Cassandra or DynamoDB.
Instead of rebalancing everything when a server joins or leaves, consistent hashing ensures that only a small fraction of keys move, keeping the system stable and efficient.
What Problem Are We Solving?
Traditional modulo-based hashing looks simple but fails under scaling:
\[
\text{server} = \text{hash}(key) \bmod N
\]
When the number of servers \(N\) changes, almost all keys are remapped, a disaster for cache systems or large databases.
Consistent hashing fixes this by making the hash space independent of the number of servers and mapping both keys and servers into the same space.
How Does It Work (Plain Language)
Think of a hash space arranged in a circle from 0 to \(2^{32}-1\) (a ring).
Each server is assigned a position on the ring (via hashing its name or ID).
Each key is hashed to a point on the same ring.
The key is stored at the first server clockwise from its position.
If a server leaves or joins, only the keys between its predecessor and itself move, the rest remain untouched.
Example:

| Server | Hash Position |
|---|---|
| S1 | 0.1 |
| S2 | 0.4 |
| S3 | 0.8 |
A key hashed to 0.35 goes to S2. A key hashed to 0.75 goes to S3.
When S2 leaves, only keys from 0.1 to 0.4 shift to S3, others stay put.
Tiny Code (Python Example)
```python
import hashlib

def hash_key(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 360

servers = [30, 150, 270]  # positions on the ring

def get_server(key):
    h = hash_key(key)
    for s in sorted(servers):
        if h < s:
            return s
    return servers[0]  # wrap around

print("Key 'apple' → Server", get_server("apple"))
print("Key 'banana' → Server", get_server("banana"))
```
This maps keys around a circular ring of hash values.
Tiny Code (C Version, Simplified)
```c
#include <stdio.h>
#include <string.h>

int hash(const char *key) {
    int h = 0;
    while (*key) h = (h * 31 + *key++) % 360;
    return h;
}

int main() {
    int servers[] = {30, 150, 270};
    int n = 3;
    const char *keys[] = {"apple", "banana", "peach"};
    for (int i = 0; i < 3; i++) {
        int h = hash(keys[i]);
        int chosen = servers[0];
        for (int j = 0; j < n; j++)
            if (h < servers[j]) { chosen = servers[j]; break; }
        printf("Key %s → Server %d\n", keys[i], chosen);
    }
    return 0;
}
```
Why It Matters
Consistent hashing is essential for large distributed systems because:
It reduces data movement to \(O(1/N)\) when scaling.
It naturally supports horizontal scalability.
It eliminates single points of failure when combined with replication.
It powers load-balanced request routing, sharded databases, and distributed caches.
Trade-offs:
Load imbalance if servers are unevenly spaced, solved by “virtual nodes.”
Slight overhead in maintaining the ring and hash lookups.
A Gentle Proof (Why It Works)
Let \(K\) be the total number of keys and \(N\) servers. Each key is assigned to the next server clockwise on the ring. When a new server joins, it takes responsibility for only a fraction:
\[
\frac{K}{N+1}
\]
of the keys, not all of them.
Expected remapping cost is:
\[
O\left(\frac{1}{N}\right)
\]
so as \(N\) grows, rebalancing becomes negligible.
When servers are distributed uniformly (or via virtual nodes), the load per server is approximately equal:
\[
\text{Expected load} \approx \frac{K}{N}
\]
Try It Yourself
Create 3 servers: S1, S2, S3 on a 360-degree ring.
Hash 10 keys and assign them.
Add S4 at position 90, recalculate.
Count how many keys moved (only those between 30 and 90).
Add “virtual nodes”, hash each server multiple times to smooth load.
Test Cases

| Scenario | Servers | Keys Moved | Result |
|---|---|---|---|
| Add new server | 3 → 4 | ~25% | Minimal rebalance |
| Remove server | 4 → 3 | ~25% | Controlled movement |
| Hash ring skew | Uneven spacing | Unbalanced | Fix with virtual nodes |
| Uniform ring | Equal spacing | Balanced | Ideal distribution |
Complexity
Time: \(O(\log N)\) per lookup (binary search on ring)
Space: \(O(N)\) for ring structure
Data movement: \(O(1/N)\) per server join/leave
Scalability: excellent, used in web-scale systems
Consistent Hashing is the geometry of scalability, a simple circle that turns chaos into balance. Every node finds its place, every key its home, and the system keeps spinning, smoothly, indefinitely.
865 Power of Two Choices
Power of Two Choices is a probabilistic load balancing algorithm that dramatically improves load distribution with minimal overhead. Instead of checking all servers like Least Connections, it randomly samples just two and sends the request to the less loaded one.
This tiny tweak, from one random choice to two, reduces imbalance exponentially, making it one of the most elegant results in distributed systems.
What Problem Are We Solving?
Pure random load balancing (like uniform random choice) can produce uneven load:
Some servers get overloaded by chance.
The imbalance worsens as the number of servers increases.
But checking all servers (like Least Connections) is expensive. The Power of Two Choices gives you almost the same balance as checking every server, while inspecting only two, a brilliant compromise.
How Does It Work (Plain Language)
When a new request arrives:
Randomly pick two servers (or more generally, d servers).
Compare their current loads (active connections, queue length, etc.).
Assign the request to the one with fewer connections.
That’s it, the algorithm self-balances with almost no coordination.
Example:

| Server | Active Connections |
|---|---|
| S1 | 10 |
| S2 | 12 |
| S3 | 15 |
| S4 | 8 |
If we pick S2 and S4 randomly, S4 wins because it has fewer active connections.
Each decision compares only two servers, but over time, the load evens out beautifully.
Tiny Code (C Version)
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    srand(time(NULL));
    int connections[] = {10, 12, 15, 8};
    const char *servers[] = {"S1", "S2", "S3", "S4"};
    int a = rand() % 4, b = rand() % 4;
    int chosen = (connections[a] <= connections[b]) ? a : b;
    printf("Next request → %s\n", servers[chosen]);
    return 0;
}
```
Why It Matters
Almost optimal balance with minimal computation.
Scales gracefully to thousands of servers.
Used in systems like Google Maglev, AWS ELB, Kubernetes, and distributed hash tables.
Trade-offs:
Slight randomness may cause short-term fluctuations.
Needs visibility of basic per-server metrics (connection count).
A Gentle Proof (Why It Works)
Let there be \(n\) servers and \(m\) requests. Each request chooses two random servers and selects the one with the smaller load.
Research by Mitzenmacher and colleagues shows that:
For random assignment (1 choice), maximum load ≈ \(\frac{\log n}{\log \log n}\).
For two choices, maximum load ≈ \(\log \log n\), an exponential improvement.
Formally, the gap between the most and least loaded server shrinks dramatically: \[
E[\text{max load}] = \frac{\log \log n}{\log d} + O(1)
\] where \(d\) is the number of random choices (usually 2).
This is one of the most striking results in randomized algorithms.
Try It Yourself
Simulate 100 requests over 10 servers.
Compare Random vs Power of Two Choices balancing.
Plot number of connections per server after assignment.
Observe that “Power of Two” produces a far tighter spread.
Optional: experiment with d = 3 or d = 4, diminishing returns but even better balance.
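To run the comparison suggested above, here is a small Python simulation sketch; the server and request counts are arbitrary and the load metric is simply the number of assigned requests.

```python
import random

def simulate(n_servers=10, n_requests=1000, choices=2):
    load = [0] * n_servers
    for _ in range(n_requests):
        if choices == 1:
            pick = random.randrange(n_servers)               # plain random assignment
        else:
            candidates = random.sample(range(n_servers), choices)
            pick = min(candidates, key=lambda s: load[s])    # less loaded of the sample
        load[pick] += 1
    return max(load) - min(load)   # spread between busiest and idlest server

print("random (d=1) spread:      ", simulate(choices=1))
print("power of two (d=2) spread:", simulate(choices=2))
```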
Test Cases

| Scenario | Policy | Load Spread | Result |
|---|---|---|---|
| 10 servers, 100 requests | Random | High imbalance | Uneven load |
| 10 servers, 100 requests | Power of 2 | Balanced | Small deviation |
| 1000 servers, 10k requests | Random | \(\max - \min \approx 20\) | Unbalanced |
| 1000 servers, 10k requests | Power of 2 | \(\max - \min \approx 3\) | Near-optimal |
Complexity
Time: \(O(1)\) (just two random lookups)
Space: \(O(n)\) (track connection counts)
Balance quality: exponential improvement over random
Scalability: excellent for distributed environments
Power of Two Choices is what happens when mathematics meets elegance, a single extra glance, and chaos becomes order. It’s proof that sometimes, just two options are all you need for balance.
866 Random Load Balancing
Random Load Balancing assigns each incoming request to a server chosen uniformly at random. It’s the simplest possible algorithm, no tracking, no weights, no metrics, yet surprisingly effective for large homogeneous systems.
Think of it as the “coin toss” approach to distributing work: easy to implement, statistically fair in the long run, and fast enough to handle millions of requests per second.
What Problem Are We Solving?
When a cluster has many servers and requests arrive rapidly, the balancer must decide where to send each request.
Random load balancing solves this with minimal computation:
No connection state
No server monitoring
O(1) decision time
It’s ideal for systems where all servers are identical and request times are short and uniform, for example, stateless web servers or content delivery nodes.
How Does It Work (Plain Language)
When a request arrives:
Randomly pick one server index between \(0\) and \(N - 1\).
Send the request to that server.
Repeat for every new request.
Example with 3 servers (S1, S2, S3):

| Request | Random Server |
|---|---|
| R1 | S2 |
| R2 | S3 |
| R3 | S1 |
| R4 | S3 |
| R5 | S2 |
Over time, each server gets roughly the same number of requests.
Tiny Code (C Version)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    srand(time(NULL));
    const char *servers[] = {"S1", "S2", "S3"};
    int n = 3;
    for (int r = 1; r <= 5; r++) {
        int idx = rand() % n;
        printf("Request %d → %s\n", r, servers[idx]);
    }
    return 0;
}
```
Why It Matters
Fast and simple: ideal for large-scale, stateless systems.
No state tracking: works without shared memory or coordination.
Statistically fair: each server gets approximately equal load over time.
Used in:
DNS-level load balancing (round-robin or random DNS replies).
CDNs and caching systems (e.g., selecting edge nodes).
Cloud routing services when servers are uniform.
Trade-offs:
Can be unbalanced in the short term.
Not adaptive to real-time server load.
Performs poorly when requests have variable duration or cost.
A Gentle Proof (Why It Works)
Let there be \(N\) servers and \(M\) requests. Each request is assigned independently and uniformly:
\[
P(\text{request on } S_i) = \frac{1}{N}
\]
Expected number of requests per server:
\[
E[L_i] = \frac{M}{N}
\]
By the law of large numbers, as \(M \to \infty\):
\[
L_i \to \frac{M}{N}
\]
The load on each server has variance \(\sigma^2 = \frac{M}{N}\left(1 - \frac{1}{N}\right)\), so the relative deviation \(\sigma / E[L_i] \approx \sqrt{N/M}\) shrinks as \(M\) grows, meaning imbalance becomes negligible relative to the average load in large systems.
Try It Yourself
Simulate 10,000 requests across 10 servers.
Count how many requests each gets.
Compute variance, notice how small it is.
Add random request durations to see when imbalance starts to matter.
Test Cases

| Scenario | Requests | Behavior | Result |
|---|---|---|---|
| 3 servers | 6 | Roughly 2 each | Balanced |
| 5 servers | 1000 | ±20 deviation | Acceptable |
| 10 servers | 10000 | ±1% deviation | Smooth |
| Mixed latency | uneven | Unbalanced | Poor under skew |
Complexity
Time: \(O(1)\) per request (single random draw)
Space: \(O(1)\)
Fairness: long-term statistical balance
Adaptivity: none, purely random
Random Load Balancing is the coin toss of distributed systems, simple, fair, and surprisingly powerful when chaos is your ally. It reminds us that sometimes, randomness is the cleanest form of order.
867 Token Bucket
Token Bucket is a rate-limiting algorithm that allows bursts of traffic up to a defined capacity while maintaining a steady average rate over time. It’s widely used in networking, APIs, and operating systems to control request rates, bandwidth, and fairness between clients.
What Problem Are We Solving?
Without rate control, clients can:
Flood a server with too many requests.
Consume unfair amounts of shared bandwidth.
Cause latency spikes or denial of service.
We need a mechanism to allow short bursts (for responsiveness) but enforce a limit on long-term request rates.
That’s exactly what the Token Bucket algorithm provides.
How Does It Work (Plain Language)
Imagine a bucket that fills with tokens at a fixed rate, say \(r\) tokens per second. Each request consumes one token. If the bucket is empty, the request must wait (or be rejected).
Key parameters:
Rate \(r\): how fast tokens are added.
Capacity \(C\): maximum number of tokens the bucket can hold (the burst size).
At any time:
If tokens are available → allow the request and remove one.
If the bucket is empty → reject or delay the request.
Thus, over any long time period \(\tau\):
\[
\text{Requests allowed} \le r \tau + C
\]
This guarantees that the average rate never exceeds \(r\), while up to \(C\) requests can burst instantly, providing both control and flexibility.
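A minimal Python sketch of a token bucket, with tokens refilled lazily from the elapsed time on each call; the rate and capacity mirror the Try It Yourself values below.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)
for i in range(10):
    print(f"Request {i + 1}: {'Allowed' if bucket.allow() else 'Blocked'}")
    time.sleep(0.2)
```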
Try It Yourself
Set rate=2, capacity=5.
Send 10 requests at 0.2-second intervals.
Observe how early requests are allowed (tokens available), then throttling begins.
Slow down the request rate, see how tokens accumulate again.
Adjust capacity to experiment with burst tolerance.
Test Cases

| Rate | Capacity | Request Interval | Behavior |
|---|---|---|---|
| 1/s | 5 | 0.5s | Bursts allowed, steady later |
| 2/s | 2 | 0.1s | Frequent throttling |
| 5/s | 10 | 0.1s | Smooth flow |
| 1/s | 1 | 1s | Strict rate limit |
Complexity
Time: \(O(1)\) per request (constant-time update)
Space: \(O(1)\) (track rate, capacity, timestamp)
Fairness: high for bursty workloads
Adaptivity: excellent for variable request rates
Token Bucket is the heartbeat of rate limiting, it doesn’t block the pulse, it regulates the rhythm, allowing bursts of life while keeping the system calm and steady.
868 Leaky Bucket
Leaky Bucket is a classic rate-limiting and traffic-shaping algorithm that enforces a fixed output rate, smoothing bursts of incoming requests or packets. It’s widely used in network routers, API gateways, and distributed systems to maintain predictable performance under fluctuating load.
What Problem Are We Solving?
Real-world traffic is rarely steady, it comes in bursts. Without control, bursts can cause:
Queue overflows
Latency spikes
Packet drops or request failures
We need a way to turn bursty traffic into steady flow, like water leaking at a constant rate from a bucket with unpredictable inflow.
That’s the idea behind the Leaky Bucket.
How Does It Work (Plain Language)
Picture a bucket with a small hole at the bottom:
Incoming requests (or packets) are poured into the bucket.
The bucket leaks at a fixed rate \(r\), representing processing or sending capacity.
If the bucket overflows (more incoming than it can hold), excess requests are dropped.
Parameters:
Leak rate \(r\): how fast tokens (or packets) leave the bucket.
Capacity \(C\): maximum number of pending requests the bucket can hold.
Rules:
On each incoming request:
If the bucket is not full → enqueue the request.
If full → drop or delay it.
A scheduler continuously leaks (processes) requests at rate \(r\).
Example

| Time | Incoming | Bucket Size | Action |
|---|---|---|---|
| 0s | 3 | 3 | Added |
| 1s | 2 | 3 | 1 leaked, 2 added |
| 2s | 5 | 3 | Overflow → drop 3 |
| 3s | 0 | 2 | Leaked steady |
The flow out of the bucket remains constant at rate \(r\), even if inputs fluctuate.
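A minimal Python sketch of a leaky bucket used as a rate limiter, with the queue drained lazily based on elapsed time; the parameter values mirror the Try It Yourself steps below.

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # requests processed per second
        self.capacity = capacity      # maximum queue length
        self.queue = deque()
        self.last = time.monotonic()

    def _leak(self):
        # Drain whole requests from the queue at the fixed rate.
        now = time.monotonic()
        to_leak = int((now - self.last) * self.rate)
        if to_leak > 0:
            for _ in range(min(to_leak, len(self.queue))):
                self.queue.popleft()
            self.last = now

    def offer(self, request):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True       # accepted; will be processed at the leak rate
        return False          # bucket full → dropped

bucket = LeakyBucket(rate=1, capacity=3)
for i in range(10):
    print(f"Request {i + 1}: {'Queued' if bucket.offer(i) else 'Dropped'}")
    time.sleep(0.1)
```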
A Gentle Proof (Why It Works)
Let \(I(t)\) be incoming requests per second, \(L(t)\) be the leak rate, and \(B(t)\) be the bucket fill level.
The bucket’s evolution over time is:
\[
\frac{dB(t)}{dt} = I(t) - L(t)
\]
subject to bounds: \[
0 \le B(t) \le C
\]
and constant leak rate: \[
L(t) = r
\]
Hence, the output rate is always constant, independent of burstiness in \(I(t)\), as long as the bucket doesn’t empty. If \(B(t)\) exceeds capacity \(C\), overflow requests are dropped, enforcing a hard rate limit.
Try It Yourself
Set rate=1, capacity=3.
Send 10 requests per second. Watch how the bucket overflows.
Lower input rate to 1 per second, stable flow.
Compare with Token Bucket to see how one allows bursts and the other smooths them out.
Leaky Bucket is the metronome of flow control, steady, unwavering, and disciplined. It doesn’t chase bursts; it keeps the rhythm, ensuring the system moves in time with its true capacity.
869 Sliding Window Counter
Sliding Window Counter is a rate-limiting algorithm that maintains a dynamic count of recent events within a rolling time window, rather than resetting at fixed intervals. It’s a more accurate version of the fixed window approach and is often used in APIs, authentication systems, and distributed gateways to enforce fair usage policies while avoiding burst unfairness near window boundaries.
What Problem Are We Solving?
A fixed window counter resets at exact intervals, say, every 60 seconds. That can lead to unfairness:
A client could send 100 requests at the end of one window and 100 more at the start of the next → 200 requests in a few seconds.
We need a smoother, time-aware limit that counts requests within the last N seconds, not just since the last clock tick.
That’s the Sliding Window Counter.
How Does It Work (Plain Language)
Instead of resetting periodically, this algorithm:
Tracks timestamps of each incoming request.
On each new request, it removes timestamps older than the window size (e.g., 60 seconds).
If the number of timestamps still inside the window is below the limit, the new request is accepted; otherwise, it’s rejected.
So the window “slides” continuously with time.
Example: 60-second window, limit = 100 requests
| Time (s) | Request Count | Action |
| --- | --- | --- |
| 0–58 | 80 | Allowed |
| 59 | +10 | Still allowed |
| 61 | Old requests expire, count drops to 50 | Window slides |
| 62 | +10 more | Allowed again |
Tiny Code (Python Example)
import time
from collections import deque

class SlidingWindowCounter:
    def __init__(self, window_size, limit):
        self.window_size = window_size
        self.limit = limit
        self.requests = deque()

    def allow(self):
        now = time.time()
        # Remove timestamps outside the window
        while self.requests and now - self.requests[0] > self.window_size:
            self.requests.popleft()
        if len(self.requests) < self.limit:
            self.requests.append(now)
            return True
        return False

window = SlidingWindowCounter(window_size=60, limit=5)
for i in range(10):
    print(f"Request {i+1}: {'Allowed' if window.allow() else 'Blocked'}")
    time.sleep(10)
Tiny Code (C Version)
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define LIMIT 5
#define WINDOW 60

double requests[LIMIT];
int count = 0;

int allow() {
    double now = time(NULL);
    int i, new_count = 0;
    // Keep only timestamps still inside the window
    for (i = 0; i < count; i++) {
        if (now - requests[i] <= WINDOW)
            requests[new_count++] = requests[i];
    }
    count = new_count;
    if (count < LIMIT) {
        requests[count++] = now;
        return 1;
    }
    return 0;
}

int main() {
    for (int i = 0; i < 10; i++) {
        printf("Request %d: %s\n", i + 1, allow() ? "Allowed" : "Blocked");
        sleep(10);
    }
}
Why It Matters
Smooth rate limiting: avoids reset bursts at window boundaries.
Precise control: always enforces “N requests per second/minute” continuously.
Used in:
API gateways (AWS, Cloudflare, Google).
Authentication systems (login attempt throttling).
Distributed systems enforcing fairness or quotas.
Trade-offs:
Slightly more complex than fixed counters.
Requires storing recent request timestamps.
Memory usage grows with request rate (bounded by limit).
A Gentle Proof (Why It Works)
Let \(t_i\) be the timestamp of the \(i\)-th request. At time \(t\), define active requests as:
\[
R(t) = \{ t_i \mid t - t_i \le W \}
\]
A request is allowed if:
\[
|R(t)| < L
\]
where \(W\) is the window size and \(L\) the limit.
Since \(R(t)\) continuously updates with time, the constraint holds for every possible interval of length W, not just fixed ones, ensuring true sliding-window fairness.
This prevents burst spikes that could occur in fixed windows, while maintaining the same long-term rate.
Try It Yourself
Set window_size = 60, limit = 5.
Send requests at 10-second intervals, all should be allowed.
Send 6 requests in rapid succession, the 6th should be blocked.
Wait for 60 seconds, old timestamps expire, and new requests are accepted again.
Visualize the number of requests in the active window over time.
Test Cases
| Window | Limit | Pattern | Result |
| --- | --- | --- | --- |
| 60s | 5 | 1 per 10s | All allowed |
| 60s | 5 | 6 in 2s | 6th blocked |
| 10s | 3 | 1 per 5s | Smooth rotation |
| 60s | 100 | 200 at start | Half blocked |
Complexity
Time: \(O(1)\) amortized (pop old timestamps)
Space: \(O(L)\) for recent request timestamps
Precision: exact within window
Adaptivity: high, continuous enforcement
Sliding Window Counter is the chronometer of rate limiting, it doesn’t count by the clock, but by the moment, ensuring fairness second by second, without the jagged edges of time.
870 Fixed Window Counter
Fixed Window Counter is one of the simplest rate-limiting algorithms. It divides time into equal-sized windows (like 1 second or 1 minute) and counts how many requests arrive within the current window. Once the count exceeds a limit, new requests are blocked until the next window begins.
It’s easy to implement and efficient, but it has one flaw: bursts near window boundaries can briefly double the allowed rate. Still, for many applications, simplicity wins.
What Problem Are We Solving?
Systems need a way to control request frequency, for example:
Limit each user to 100 API calls per minute.
Prevent brute-force login attempts.
Control event emission rates in distributed pipelines.
A fixed counter approach provides an efficient and predictable method of enforcing these limits.
How Does It Work (Plain Language)
Time is split into fixed-length windows of size \(W\) (for example, 60 seconds). For each window, we maintain a counter of how many requests have been received.
Algorithm steps:
Determine the current time window (e.g., using integer division of current time by \(W\)).
Increment the counter for that window.
If the count exceeds the limit, reject the request.
When time moves into the next window, reset the counter to zero.
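Tiny Code (Python Sketch)
A minimal sketch of these steps, keyed on integer division of the clock by the window size; the class name and fields are illustrative.

import time

class FixedWindowCounter:
    def __init__(self, window_size, limit):
        self.window_size = window_size
        self.limit = limit
        self.current_window = None
        self.count = 0

    def allow(self):
        window = int(time.time() // self.window_size)  # id of the current window
        if window != self.current_window:
            self.current_window = window               # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(window_size=10, limit=5)
for i in range(8):
    print(f"Request {i+1}: {'Allowed' if limiter.allow() else 'Blocked'}")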
Why It Matters
Simplicity: fast, predictable, easy to implement in both centralized and distributed systems.
Efficiency: only a counter per window, no timestamp tracking needed.
Used in:
API gateways and rate-limit middleware.
Authentication systems.
Network routers and firewalls.
Trade-offs:
Allows short bursts across boundaries (double spending issue).
Doesn’t smooth out usage, use Sliding Window or Token Bucket for that.
Resets abruptly at window boundaries.
A Gentle Proof (Why It Works)
Let:
\(W\) = window size (seconds)
\(L\) = limit (requests per window)
\(R(t)\) = requests received at time \(t\)
We allow a request at time \(t\) if: \[
C\left(\left\lfloor \frac{t}{W} \right\rfloor\right) < L
\] where \(C(k)\) is the counter for window \(k\).
At the next boundary: \[
C(k+1) = 0
\]
Thus, the algorithm enforces: \[
\frac{\text{requests}}{\text{time}} \le \frac{L}{W}
\] for most intervals, though bursts may occur across boundaries.
Try It Yourself
Set window_size = 10, limit = 5.
Send 5 requests quickly, all accepted.
Wait until the window resets, the counter clears.
Send 5 more immediately, all accepted again (burst effect).
Combine with Sliding Window Counter to fix the boundary issue.
Test Cases
| Window | Limit | Pattern | Behavior |
| --- | --- | --- | --- |
| 10s | 5 | 1 per 2s | Smooth, all allowed |
| 10s | 5 | 6 in 5s | 6th blocked |
| 10s | 5 | 5 before + 5 after boundary | Burst of 10 |
| 60s | 100 | Uniform | Perfect enforcement |
Complexity
Time: \(O(1)\) per request (simple integer check)
Space: \(O(1)\) per user or key
Precision: coarse, stepwise resets
Performance: extremely high
Fixed Window Counter is the metronome of rate control, steady, reliable, and mechanical. It’s not graceful, but it’s predictable, and in distributed systems, predictability is gold.
Section 88. Search and Indexing
871 Inverted Index Construction
An inverted index maps each term to the list of documents (and often positions) where it appears. It is the core data structure behind web search, log search, and code search engines.
What Problem Are We Solving?
Given a corpus of documents, we want to answer queries like cat AND dog or phrases like "deep learning". Scanning all documents per query is too slow. An inverted index precomputes a term → postings list mapping so queries can jump directly to matching documents.
How Does It Work (Plain Language)
Parse and normalize each document
Tokenize text into terms
Lowercase, strip punctuation, optionally stem or lemmatize
Remove stopwords if desired
Emit postings while scanning
For each term \(t\) found in document \(d\), record one of:
Document-level: \((t \to d)\)
Positional-level: \((t \to (d, p))\) where \(p\) is the position
Group and sort
Group postings by term
Sort each term’s postings by document id, then by position
Compress and store
Gap encode doc ids and positions
Apply integer compression (e.g., Variable Byte, VByte, or Elias–Gamma)
Persist dictionary (term lexicon) and postings
Answering queries
Boolean queries: intersect or union sorted postings
Phrase queries: use positional postings; for example, the postings for to in the two documents below are \(\{(d_1, 1), (d_1, 5), (d_2, 1)\}\) (assuming positions start at 1)
Tiny Code (Python, positional index)
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    # docs: dict {doc_id: text}
    index = defaultdict(lambda: defaultdict(list))
    for d, text in docs.items():
        terms = tokenize(text)
        for pos, term in enumerate(terms, start=1):
            index[term][d].append(pos)
    # convert to regular dicts and sort postings
    final = {}
    for term, posting in index.items():
        final[term] = {doc: sorted(pos_list) for doc, pos_list in sorted(posting.items())}
    return final

docs = {1: "to be or not to be", 2: "to seek truth"}
inv = build_inverted_index(docs)
for term in sorted(inv):
    print(term, "->", inv[term])
Output sketch:
be -> {1: [2, 6]}
not -> {1: [4]}
or -> {1: [3]}
seek -> {2: [2]}
to -> {1: [1, 5], 2: [1]}
truth -> {2: [3]}
Tiny Code (C, document-level index, minimal)
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_TERMS 1024
#define MAX_TERM_LEN 32
#define MAX_DOCS 128

typedef struct {
    char term[MAX_TERM_LEN];
    int docs[MAX_DOCS];
    int ndocs;
} Posting;

Posting index_[MAX_TERMS];
int nterms = 0;

void add_posting(const char *term, int doc) {
    // find term
    for (int i = 0; i < nterms; i++) {
        if (strcmp(index_[i].term, term) == 0) {
            // add doc if new
            if (index_[i].ndocs == 0 || index_[i].docs[index_[i].ndocs - 1] != doc)
                index_[i].docs[index_[i].ndocs++] = doc;
            return;
        }
    }
    // new term
    strncpy(index_[nterms].term, term, MAX_TERM_LEN - 1);
    index_[nterms].docs[0] = doc;
    index_[nterms].ndocs = 1;
    nterms++;
}

void tokenize_and_add(const char *text, int doc) {
    char term[MAX_TERM_LEN];
    int k = 0;
    for (int i = 0;; i++) {
        char c = text[i];
        if (isalnum(c)) {
            if (k < MAX_TERM_LEN - 1) term[k++] = tolower(c);
        } else {
            if (k > 0) { term[k] = '\0'; add_posting(term, doc); k = 0; }
            if (c == '\0') break;
        }
    }
}

int main() {
    tokenize_and_add("to be or not to be", 1);
    tokenize_and_add("to seek truth", 2);
    for (int i = 0; i < nterms; i++) {
        printf("%s ->", index_[i].term);
        for (int j = 0; j < index_[i].ndocs; j++)
            printf(" %d", index_[i].docs[j]);
        printf("\n");
    }
    return 0;
}
Why It Matters
Speed: sublinear query time via direct term lookups
Scalability: supports billions of documents with compression and sharding
Extensible: can store payloads like term frequencies, positions, field ids
Tradeoffs
Build time and memory during indexing
Requires maintenance on updates (incremental or batch)
More complex storage formats for compression and skipping
A Gentle Proof (Why It Works)
Let the corpus be \(D = \{d_1, \dots, d_N\}\) and the vocabulary \(V = \{t_1, \dots, t_M\}\). Define the postings list for term \(t\) as: \[
P(t) = \{ (d, p_1, p_2, \dots) \mid t \text{ occurs in } d \text{ at positions } p_i \}.
\] For Boolean retrieval on a conjunctive query \(q = t_a \land t_b\), the result set is: \[
R(q) = \{ d \mid d \in \pi_{\text{doc}}(P(t_a)) \cap \pi_{\text{doc}}(P(t_b)) \}.
\] Since document ids within each postings list are sorted, intersection runs in time proportional to the sum of postings lengths, which is sublinear in the total corpus size when terms are selective.
For phrase queries with positions, we additionally check position offsets, preserving correctness by requiring aligned positional gaps.
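As a concrete illustration of that intersection step, here is a minimal merge-style Boolean AND over two sorted docID lists (document-level postings, names illustrative):

def intersect(p1, p2):
    # p1, p2: sorted lists of doc ids; classic merge-style Boolean AND
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 2, 4, 8], [2, 3, 8, 9]))  # [2, 8]

Each comparison advances at least one pointer, which is where the \(O(|P(t_a)| + |P(t_b)|)\) bound comes from.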
Try It Yourself
Extend the Python code to store term frequencies per document and compute \(tf\) and \(df\).
Implement Boolean AND by intersecting sorted doc id lists.
Add phrase querying using positional lists: verify "to be" returns only \(d_1\).
Add gap encoding for doc ids and positions.
Benchmark intersection cost versus corpus size.
Test Cases
| Query | Expected Behavior |
| --- | --- |
| to | Returns \(d_1, d_2\) with correct positions |
| be AND truth | Empty set |
| "to be" | Returns \(d_1\) only |
| seek OR truth | Returns \(d_2\) |
Complexity
Build time: \(O(\sum_d |d| \log Z)\) if inserting into a balanced dictionary of size \(Z\), or \(O(\sum_d |d|)\) with hashing plus a final sort per term
Space: \(O(T)\) where \(T\) is total term occurrences, reduced by compression
Query
Boolean AND: \(O(|P(t_a)| + |P(t_b)|)\) using galloping or merge
Phrase query: above cost plus positional merge
I/O: reduced by skip lists, block posting, and compression (VByte, PForDelta, Elias–Gamma)
An inverted index turns a pile of text into a fast lookup structure: terms become keys, documents become coordinates, and search becomes a matter of intersecting sorted lists instead of scanning the world.
872 Positional Index Build
A positional index extends the inverted index by storing the exact positions of terms in each document. This allows phrase queries (like "machine learning") and proximity queries (like "data" within 5 words of "model").
It’s the foundation of modern search engines, where relevance depends not only on term presence but also on their order and distance.
What Problem Are We Solving?
A standard inverted index can only answer “which documents contain a term.” But if a user searches for "deep learning", we need documents where deep is immediately followed by learning. We also want to support queries like "neural" NEAR/3 "network".
To do that, we must store term positions within documents.
How Does It Work (Plain Language)
Tokenize documents and assign each word a position number (starting from 1).
For each term \(t\) in document \(d\), record all positions \(p\) where \(t\) appears.
Use position alignment to evaluate phrase queries.
Example:
| Doc ID | Text |
| --- | --- |
| 1 | "to be or not to be" |
| 2 | "to seek truth" |
Positional Index:
be -> {1: [2, 6]}
not -> {1: [4]}
seek -> {2: [2]}
to -> {1: [1, 5], 2: [1]}
truth -> {2: [3]}
Phrase query "to be" checks for positions where be.pos = to.pos + 1.
Tiny Code (Python Example)
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        terms = tokenize(text)
        for pos, term in enumerate(terms, start=1):
            index[term][doc_id].append(pos)
    return index

docs = {1: "to be or not to be", 2: "to seek truth"}
index = build_positional_index(docs)
for term in sorted(index):
    print(term, "->", dict(index[term]))  # convert for clean printing
Output:
be -> {1: [2, 6]}
not -> {1: [4]}
seek -> {2: [2]}
to -> {1: [1, 5], 2: [1]}
truth -> {2: [3]}
Phrase Query (Example)
def phrase_query(index, term1, term2):
    results = []
    if term1 not in index or term2 not in index:
        return results
    for d in index[term1]:
        if d in index[term2]:
            pos1 = index[term1][d]
            pos2 = index[term2][d]
            # term2 must appear immediately after term1
            for p1 in pos1:
                if (p1 + 1) in pos2:
                    results.append(d)
                    break
    return results

print("Phrase 'to be':", phrase_query(index, "to", "be"))
The algorithm slides through position lists in increasing order, checking for consecutive offsets. Because positions are sorted, the check runs in \(O(|P(t_1,d)| + ... + |P(t_k,d)|)\) time, efficient in practice.
Try It Yourself
Modify the Python version to support 3-word phrases (e.g., "to be or").
Extend to proximity queries: "data" NEAR/3 "model".
Compress positions using delta encoding.
Add document frequency statistics.
Measure query speed as the corpus grows.
Test Cases
| Query | Expected | Reason |
| --- | --- | --- |
| "to be" | doc 1 | consecutive terms |
| "to seek" | doc 2 | valid |
| "be or not" | doc 1 | triple phrase |
| "truth be" | none | not consecutive |
Complexity
| Stage | Time | Space |
| --- | --- | --- |
| Index build | \(O(\sum_d \vert d \vert)\) | \(O(T)\) |
| Phrase query | \(O(\sum_i \vert P(t_i) \vert)\) | \(O(T)\) |
| Storage (uncompressed) | proportional to term occurrences | high but compressible |
A positional index transforms plain search into structured understanding: not just what appears, but where. It captures the geometry of language, the shape of meaning in text.
873 TF–IDF Scoring
TF–IDF (Term Frequency–Inverse Document Frequency) is one of the most influential scoring models in information retrieval. It quantifies how important a word is to a document in a collection, balancing frequency within the document and rarity across the corpus.
What Problem Are We Solving?
In search, not all words are equal. Common terms like the, is, or data appear everywhere, while rare terms like quantization or Bayes carry much more meaning. We want a way to assign weights to words that reflect how well a document matches a query.
TF–IDF gives us exactly that balance.
How Does It Work (Plain Language)
TF–IDF combines two simple ideas:
Term Frequency (TF): Measures how often a term appears in a document. The more times a word occurs, the more relevant it may be. \[
TF(t, d) = \frac{f_{t,d}}{\max_k f_{k,d}}
\] where \(f_{t,d}\) is the raw count of term \(t\) in document \(d\).
Inverse Document Frequency (IDF): Measures how rare the term is across all documents. Rare terms get higher weight. \[
IDF(t, D) = \log \frac{N}{n_t}
\] where \(N\) is the total number of documents, and \(n_t\) is the number of documents containing \(t\).
A query’s score for document \(d\) is then: \[
\text{score}(q, d) = \sum_{t \in q} TF(t, d) \times IDF(t, D)
\]
Example
Corpus:
| Doc ID | Text |
| --- | --- |
| 1 | "deep learning for vision" |
| 2 | "deep learning for language" |
| 3 | "classical machine learning" |
Compute IDF for each term:
| Term | Appears In | IDF |
| --- | --- | --- |
| deep | 2 | \(\log(3/2) = 0.405\) |
| learning | 3 | \(\log(3/3) = 0\) |
| vision | 1 | \(\log(3/1) = 1.099\) |
| language | 1 | \(\log(3/1) = 1.099\) |
| classical | 1 | \(\log(3/1) = 1.099\) |
| machine | 1 | \(\log(3/1) = 1.099\) |

(Here \(\log\) is the natural logarithm, matching the code below.)
Then TF–IDF for term “vision” in doc 1 (assuming TF = 1): \[
w_{vision,1} = 1 \times 1.099 = 1.099
\]
A query “deep vision” ranks doc 1 highest because both terms appear and vision is rare.
Tiny Code (Python)
import math
from collections import Counter

def tfidf(corpus):
    N = len(corpus)
    tf = []
    df = Counter()
    # Compute term frequencies and document frequencies
    for text in corpus:
        words = text.lower().split()
        counts = Counter(words)
        tf.append(counts)
        for term in counts:
            df[term] += 1
    idf = {t: math.log(N / df[t]) for t in df}
    # Combine TF and IDF
    weights = []
    for counts in tf:
        w = {t: counts[t] * idf[t] for t in counts}
        weights.append(w)
    return weights

docs = [
    "deep learning for vision",
    "deep learning for language",
    "classical machine learning",
]
weights = tfidf(docs)
for i, w in enumerate(weights, 1):
    print(f"Doc {i}: {w}")
Tiny Code (C Simplified)
#include <stdio.h>
#include <math.h>
#include <string.h>

#define MAX_TERMS 100
#define MAX_DOCS 10

char terms[MAX_TERMS][32];
int df[MAX_TERMS] = {0};
int nterms = 0;

int find_term(const char *t) {
    for (int i = 0; i < nterms; i++)
        if (strcmp(terms[i], t) == 0) return i;
    strcpy(terms[nterms], t);
    return nterms++;
}

void count_df(char docs[][256], int ndocs) {
    for (int d = 0; d < ndocs; d++) {
        int seen[MAX_TERMS] = {0};
        char *tok = strtok(docs[d], " ");
        while (tok) {
            int i = find_term(tok);
            if (!seen[i]) { df[i]++; seen[i] = 1; }
            tok = strtok(NULL, " ");
        }
    }
}

int main() {
    char docs[3][256] = {
        "deep learning for vision",
        "deep learning for language",
        "classical machine learning"
    };
    int N = 3;
    count_df(docs, N);
    for (int i = 0; i < nterms; i++)
        printf("%s -> IDF=%.3f\n", terms[i], log((double)N / df[i]));
}
Why It Matters
Core of modern ranking algorithms.
Foundation of vector space models and cosine similarity search.
Forms the basis of BM25, TF–IDF + embeddings, and hybrid search.
Efficient and interpretable, no training needed.
Tradeoffs:
Doesn’t consider word order or semantics.
Can overemphasize long documents.
Simple, but remarkably effective for many tasks.
A Gentle Proof (Why It Works)
TF increases with term relevance within a document. IDF penalizes terms that appear in too many documents. Their product amplifies rare but frequent words within a document.
This is equivalent to projecting both \(q\) and \(d\) into a weighted vector space and computing their dot product. It approximates how specific the document is to the query’s rare words.
Try It Yourself
Compute TF–IDF for “machine learning” across your favorite research abstracts.
Compare ranking when using raw term counts vs. log-scaled TF.
Extend the formula to use cosine similarity (see the sketch after this list): \[
\cos(\theta) = \frac{\mathbf{v_q} \cdot \mathbf{v_d}}{\lVert \mathbf{v_q} \rVert \, \lVert \mathbf{v_d} \rVert}
\]
Integrate stopword filtering and stemming to improve quality.
Plot top-weighted terms per document.
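A small Python sketch of that cosine similarity over TF–IDF weight vectors; the vectors below reuse the illustrative weights from the example above.

import math

def cosine(v_q, v_d):
    # v_q, v_d: dicts mapping term -> tf-idf weight
    dot = sum(w * v_d.get(t, 0.0) for t, w in v_q.items())
    nq = math.sqrt(sum(w * w for w in v_q.values()))
    nd = math.sqrt(sum(w * w for w in v_d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

doc1  = {"deep": 0.405, "vision": 1.099, "learning": 0.0}
query = {"deep": 0.405, "vision": 1.099}
print(round(cosine(query, doc1), 3))  # 1.0, the query points in the same direction as doc 1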
Test Cases
| Term | Doc 1 | Doc 2 | Doc 3 |
| --- | --- | --- | --- |
| deep | 0.405 | 0.405 | 0.000 |
| learning | 0.000 | 0.000 | 0.000 |
| vision | 1.099 | 0.000 | 0.000 |
| language | 0.000 | 1.099 | 0.000 |
| machine | 0.000 | 0.000 | 1.099 |
| classical | 0.000 | 0.000 | 1.099 |
Complexity
| Step | Time | Space |
| --- | --- | --- |
| Build DF | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| Compute TF | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| Query Scoring | \(O(\vert q \vert \times \vert D \vert)\) | \(O(V)\) |
TF–IDF turns plain text into numbers that speak, simple frequencies that reveal meaning, a bridge between counting words and understanding ideas.
874 BM25 Ranking
BM25 (Best Match 25) is the modern evolution of TF–IDF. It refines the scoring by introducing document length normalization and saturation control, making it the gold standard for keyword-based ranking in search engines like Lucene, Elasticsearch, and PostgreSQL full-text search.
What Problem Are We Solving?
TF–IDF treats all documents and frequencies equally, which leads to two main issues:
Long documents unfairly score higher because they naturally contain more terms.
Overfrequent terms (e.g., a word appearing 100 times) shouldn’t increase the score indefinitely.
BM25 fixes both by dampening TF growth and adjusting for document length relative to the average.
How Does It Work (Plain Language)
BM25 builds on TF–IDF with two key corrections:
Term frequency saturation – frequent terms give diminishing returns.
Length normalization – long documents are penalized relative to the average length.
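Putting the two corrections together, the standard BM25 score for query \(q\) and document \(d\) is usually written as: \[
\text{score}(q, d) = \sum_{t \in q} IDF(t) \cdot \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1 \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}
\] where \(f_{t,d}\) is the frequency of \(t\) in \(d\), \(|d|\) the document length, \(avgdl\) the average document length, \(k_1\) (typically 1.2–2.0) controls term-frequency saturation, and \(b\) (typically 0.75) controls length normalization; a common choice is \(IDF(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}\).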
Why It Matters
Standard baseline for modern IR and search engines.
Handles document length naturally.
Provides excellent balance of simplicity, performance, and accuracy.
Works with sparse indexes and precomputed postings.
Used in: Lucene, Elasticsearch, Solr, PostgreSQL FTS, Vespa, Whoosh, and OpenSearch.
A Gentle Proof (Why It Works)
TF–IDF’s linear scaling overweights frequent terms. BM25 adds a saturation function that flattens TF growth and a normalization term that adjusts for document length.
This ensures bounded contribution per term, preventing dominance by repetition.
Meanwhile, for very long documents (\(|d| \gg avgdl\)), the denominator grows, reducing effective weight, balancing precision and recall.
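Tiny Code (Python Sketch)
A minimal Python sketch of BM25 under the formula above; the toy corpus, tokenizer, and parameter values are illustrative, and the IDF uses the Lucene-style \(+1\) variant to keep weights positive.

import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenized documents, doc: one tokenized document
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        n_t = sum(1 for d in corpus if t in d)              # document frequency
        if n_t == 0 or tf[t] == 0:
            continue
        idf = math.log(1 + (N - n_t + 0.5) / (n_t + 0.5))   # +1 keeps IDF positive (Lucene-style)
        f = tf[t]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [d.split() for d in [
    "deep learning for vision",
    "deep learning for language",
    "classical machine learning",
]]
query = "deep vision".split()
for i, d in enumerate(corpus, 1):
    print(f"Doc {i}: {bm25_score(query, d, corpus):.3f}")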
Try It Yourself
Implement BM25 on your own text dataset.
Experiment with \(b=0\) (no length normalization) and \(b=1\) (full normalization).
Compare BM25 and TF–IDF on queries with short vs. long documents.
Tune \(k_1\) between 1.0 and 2.0 to observe TF saturation.
Visualize how the score changes as \(f_{t,d}\) increases.
Test Cases
| Term | \(f_{t,d}\) | \(\vert d \vert\) | \(avgdl\) | Score Effect |
| --- | --- | --- | --- | --- |
| – | 1 | short doc | avg | highest |
| – | 3 | long doc | high | moderate |
| – | 10 | long doc | very high | flattens |
| rare term | small \(n_t\) | – | – | boosts score |
| frequent term | large \(n_t\) | – | – | lowers score |
Complexity
| Step | Time | Space |
| --- | --- | --- |
| Build DF | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| Query Score | \(O(\vert q \vert \times \vert D \vert)\) | \(O(V)\) |
| Tuning Impact | \(k_1, b\) affect balance only | negligible |
BM25 is the sweet spot between mathematics and engineering, a compact formula that’s powered decades of search, where meaning meets ranking, and simplicity meets precision.
875 Boolean Retrieval
Boolean Retrieval is the simplest and oldest form of search logic, it treats documents as sets of words and queries as logical expressions using AND, OR, and NOT. It doesn’t rank results, a document either matches the query or it doesn’t. Yet this binary model is the foundation upon which all modern ranking models (like TF–IDF and BM25) were built.
What Problem Are We Solving?
Early information retrieval systems needed a precise way to find documents that exactly matched a combination of terms. For example:
“machine AND learning” → documents that contain both.
“neural OR probabilistic” → documents that contain either.
“data AND NOT synthetic” → documents about data but excluding synthetic.
This is fast, simple, and exact, ideal for legal search, filtering, or structured archives.
How Does It Work (Plain Language)
Each document is represented by the set of words it contains.
We build an inverted index, mapping each term to the list of documents containing it.
The query is parsed into a logical expression tree of AND/OR/NOT operators.
We combine the posting lists according to the Boolean logic.
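Tiny Code (Python Sketch)
A minimal sketch using Python sets as posting lists; the three-document corpus is hypothetical, chosen to match the test cases below.

def build_index(docs):
    # docs: dict {doc_id: text}; returns term -> set of doc ids
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "machine learning for vision",
    2: "deep learning vision",
    3: "machine learning robotics",
}
index = build_index(docs)

print(index["machine"] & index["learning"])   # AND     -> {1, 3}
print(index["learning"] | index["vision"])    # OR      -> {1, 2, 3}
print(index["learning"] - index["vision"])    # AND NOT -> {3}

Set intersection, union, and difference map directly onto AND, OR, and NOT over the posting lists.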
Why It Matters
Foundation of IR: The earliest and simplest model.
Fast and deterministic: Ideal for structured queries or filtering.
Still widely used: Databases, Lucene filters, search engines, and information retrieval courses.
Transparent: Users know exactly why a document matches.
Tradeoffs:
No ranking, all results are equal.
Sensitive to exact terms (no fuzziness).
Returns empty results if terms are too strict.
A Gentle Proof (Why It Works)
Let:
\(D_t\) be the set of documents containing term \(t\).
A Boolean query \(Q\) is an expression built using \(\cup\) (OR), \(\cap\) (AND), and \(\setminus\) (NOT).
Then the retrieval result is defined recursively: \[
R(Q) =
\begin{cases}
D_t, & \text{if } Q = t \\
R(Q_1) \cap R(Q_2), & \text{if } Q = Q_1 \text{ AND } Q_2 \\
R(Q_1) \cup R(Q_2), & \text{if } Q = Q_1 \text{ OR } Q_2 \\
D - R(Q_1), & \text{if } Q = \text{NOT } Q_1
\end{cases}
\]
This makes retrieval compositional and exact, each query maps deterministically to a document set.
Try It Yourself
Create your own inverted index for 5–10 sample documents.
Test queries like:
“data AND algebra”
“distributed OR parallel”
“consistency AND NOT eventual”
Extend to parentheses for precedence: (A AND B) OR (C AND NOT D).
Implement ranked retrieval as a next step (TF–IDF or BM25).
Compare Boolean vs ranked results on the same corpus.
Test Cases
| Query | Result | Description |
| --- | --- | --- |
| machine AND learning | {1, 3} | both words present |
| learning OR vision | {1, 2, 3} | union of sets |
| learning AND NOT vision | {3} | exclude vision |
| machine AND robotics | {3} | only document 3 |
| deep AND vision | {2} | only document 2 |
Complexity
| Operation | Time | Space |
| --- | --- | --- |
| Build index | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| AND/OR/NOT | \(O(\vert D_A \vert + \vert D_B \vert)\) | \(O(V)\) |
| Query evaluation | \(O(\text{length of expression})\) | constant |
Boolean Retrieval is search in its purest logic, simple sets, clean truth, no shades of ranking. It’s the algebra of information, the “zero” from which all modern search theory evolved.
876 WAND Algorithm
The WAND (Weak AND) algorithm is an optimization for top-\(k\) document retrieval. Instead of scoring every document for a query (which can be millions), it cleverly skips documents that cannot possibly reach the current top-\(k\) threshold.
It’s the efficiency engine behind modern ranking systems, from Lucene and Elasticsearch to specialized IR engines in web-scale systems.
What Problem Are We Solving?
In ranking models like TF–IDF or BM25, a naive search engine must:
Compute a full score for every document containing any query term.
Sort all scores to return the top results.
That’s wasteful, most documents have low or irrelevant scores. We need a way to avoid computing scores for documents that can’t possibly enter the top-\(k\) list.
That’s what WAND does: it bounds the maximum possible score of each document and prunes the search early.
How Does It Work (Plain Language)
Each query term has:
a posting list of document IDs,
and a maximum possible term score (based on TF–IDF or BM25 upper bound).
WAND iteratively moves pointers through posting lists in increasing docID order:
Maintain one pointer per term (sorted by current docID).
Maintain a current score threshold, the \(k\)-th best score so far.
Compute an upper bound of the possible score for the document at the smallest docID across all lists.
If the upper bound is less than the current threshold, skip ahead, no need to compute.
Otherwise, fully evaluate the document’s score and update the threshold if it enters top-\(k\).
This gives the same top-\(k\) results as exhaustive scoring, but skips thousands of documents.
Example
Suppose query = “deep learning vision” and top-\(k = 2\).
| Term | Posting List | Max Score |
| --- | --- | --- |
| deep | [2, 5, 9] | 1.2 |
| learning | [1, 5, 9, 12] | 2.0 |
| vision | [5, 10] | 1.5 |
Start with smallest docID = 1 (from learning). Upper bound = 2.0 > current threshold (0) → evaluate. Score(1) = 1.4 → add to heap, threshold = 1.4.
Later, skip docIDs whose bound < current threshold.
By continuously tightening the threshold, WAND skips irrelevant documents efficiently.
Tiny Code (Python Example)
import heapq

class Posting:
    def __init__(self, doc_id, score):
        self.doc_id = doc_id
        self.score = score

# Sample postings (term -> list of (doc_id, score))
index = {
    "deep": [Posting(2, 1.2), Posting(5, 1.0), Posting(9, 0.8)],
    "learning": [Posting(1, 1.4), Posting(5, 1.6), Posting(9, 1.5), Posting(12, 0.7)],
    "vision": [Posting(5, 1.5), Posting(10, 1.3)],
}

def wand(query_terms, k):
    heap = []  # top-k min-heap
    pointers = {t: 0 for t in query_terms}
    max_score = {t: max(p.score for p in index[t]) for t in query_terms}
    threshold = 0.0
    while True:
        # Get smallest docID among pointers
        current_docs = [(index[t][pointers[t]].doc_id, t)
                        for t in query_terms if pointers[t] < len(index[t])]
        if not current_docs:
            break
        current_docs.sort()
        pivot_doc, pivot_term = current_docs[0]
        # Compute upper bound for this doc
        ub = sum(max_score[t] for t in query_terms)
        if ub < threshold:
            # Skip ahead
            for t in query_terms:
                while pointers[t] < len(index[t]) and index[t][pointers[t]].doc_id <= pivot_doc:
                    pointers[t] += 1
            continue
        # Compute actual score
        score = 0
        for t in query_terms:
            postings = index[t]
            if pointers[t] < len(postings) and postings[pointers[t]].doc_id == pivot_doc:
                score += postings[pointers[t]].score
                pointers[t] += 1
        if score > 0:
            heapq.heappush(heap, score)
            if len(heap) > k:
                heapq.heappop(heap)
            threshold = min(heap)
        else:
            for t in query_terms:
                while pointers[t] < len(index[t]) and index[t][pointers[t]].doc_id <= pivot_doc:
                    pointers[t] += 1
    return sorted(heap, reverse=True)

print(wand(["deep", "learning", "vision"], 2))
Why It Matters
Enables real-time ranked retrieval even for massive collections.
Used in Lucene, Elasticsearch, Vespa, and other production search engines.
Reduces computation by skipping low-potential documents.
Maintains exact correctness, same top-\(k\) as exhaustive evaluation.
Tradeoffs:
Requires per-term score upper bounds.
Slight implementation complexity.
Performs best when query has few high-weight terms.
A Gentle Proof (Why It Works)
Let \(S(d)\) be the true score of document \(d\), and \(U(d)\) be its computed upper bound. Let \(\theta\) be the current top-\(k\) threshold (the minimum score among top-\(k\) results so far).
For any document \(d\): \[
U(d) < \theta \implies S(d) < \theta
\] Therefore, \(d\) cannot enter the top-\(k\), so skipping it is safe. This invariant ensures exact top-\(k\) correctness.
By monotonically tightening \(\theta\), WAND prunes an increasing number of documents efficiently.
Try It Yourself
Implement WAND on a small BM25 index.
Visualize how many documents are skipped as \(k\) decreases.
Compare runtime with brute-force scoring.
Extend to Block-Max WAND (BMW), using block-level score bounds.
Add early termination when threshold stabilizes.
Test Cases
| Query | k | Docs | Skipped | Matches |
| --- | --- | --- | --- | --- |
| deep learning | 2 | 10,000 | 80% | same as brute force |
| data systems | 5 | 5,000 | 60% | same |
| ai | 10 | 1,000 | minimal | same |
| long query (10 terms) | 3 | 50,000 | 90% | same |
Complexity
| Stage | Time | Space |
| --- | --- | --- |
| Build index | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| Query scoring | \(O(k \log k + \text{\#evaluated docs})\) | \(O(V)\) |
| Pruning effect | 60–95% fewer evaluations | negligible overhead |
The WAND algorithm is the art of knowing what not to compute. It’s the whisper of efficiency in large-scale search, scoring only the few that matter, and skipping the rest without missing a single answer.
877 Block-Max WAND (BMW)
Block-Max WAND (BMW) is an advanced refinement of the WAND algorithm that uses block-level score bounds to skip large chunks of posting lists at once. It’s one of the key optimizations behind high-performance search engines such as Lucene’s “BlockMax” scorer and Bing’s “Fat-WAND” system.
What Problem Are We Solving?
Even though WAND skips documents whose upper bound is too low, it still needs to visit every posting entry to check individual document IDs. That’s too slow when posting lists contain millions of entries.
BMW groups postings into fixed-size blocks (like 64 or 128 docIDs per block) and stores the maximum possible score per block. This allows the algorithm to skip entire blocks instead of individual docs when it knows none of them can reach the current threshold.
How Does It Work (Plain Language)
Each term’s posting list is divided into blocks.
For each block, store:
the maximum term score in that block, and
the max docID (the last document in that block).
When evaluating a query:
Use the block max scores to compute an upper bound for the next possible block combination.
If this upper bound is below the current threshold → skip entire blocks.
Otherwise, descend into the block and evaluate individual documents using standard scoring (like BM25).
This drastically reduces the number of document accesses while maintaining exact top-\(k\) correctness.
Example
Suppose we have term “vision” with postings:
| Block | DocIDs | Scores | Block Max |
| --- | --- | --- | --- |
| 1 | [1, 3, 5, 7] | [0.4, 0.6, 0.5, 0.3] | 0.6 |
| 2 | [10, 11, 13, 15] | [0.9, 0.8, 0.7, 0.5] | 0.9 |
If the current threshold is 1.5 and the other terms’ max blocks sum to 0.8, then Block 1 can be skipped entirely (0.6 + 0.8 = 1.4 < 1.5), and we jump directly to Block 2.
Tiny Code (Python Example)
import heapq

# Simplified postings: term -> [(doc_id, score)], grouped into blocks
index = {
    "deep": [[(1, 0.3), (2, 0.7), (3, 0.4)], [(10, 1.0), (11, 0.8)]],
    "vision": [[(2, 0.4), (3, 0.6), (4, 0.5)], [(10, 1.2), (11, 1.1)]],
}

block_max = {
    "deep": [0.7, 1.0],
    "vision": [0.6, 1.2],
}

def bm25_bmw(query_terms, k):
    heap, threshold = [], 0.0
    pointers = {t: [0, 0] for t in query_terms}  # [block_index, posting_index]
    while True:
        current_docs = []
        for t in query_terms:
            b, p = pointers[t]
            if b >= len(index[t]):
                continue
            if p >= len(index[t][b]):
                pointers[t][0] += 1
                pointers[t][1] = 0
                continue
            current_docs.append((index[t][b][p][0], t))
        if not current_docs:
            break
        current_docs.sort()
        doc, pivot_term = current_docs[0]
        # Compute block-level upper bound
        ub = sum(block_max[t][pointers[t][0]]
                 for t in query_terms if pointers[t][0] < len(block_max[t]))
        if ub < threshold:
            # Skip entire blocks
            for t in query_terms:
                b, _ = pointers[t]
                while b < len(index[t]) and block_max[t][b] + ub < threshold:
                    pointers[t][0] += 1
                    pointers[t][1] = 0
                    b += 1
            continue
        # Compute actual score
        score = 0.0
        for t in query_terms:
            b, p = pointers[t]
            if b < len(index[t]) and p < len(index[t][b]) and index[t][b][p][0] == doc:
                score += index[t][b][p][1]
                pointers[t][1] += 1
        if score > 0:
            heapq.heappush(heap, score)
            if len(heap) > k:
                heapq.heappop(heap)
            threshold = min(heap)
        else:
            for t in query_terms:
                b, p = pointers[t]
                while b < len(index[t]) and p < len(index[t][b]) and index[t][b][p][0] <= doc:
                    pointers[t][1] += 1
    return sorted(heap, reverse=True)

print(bm25_bmw(["deep", "vision"], 2))
Why It Matters
Block skipping reduces random memory access dramatically.
Enables near-real-time search in billion-document collections.
Integrates naturally with compressed posting lists.
Used in production systems like Lucene’s BlockMaxScore, Anserini, and Elasticsearch.
Tradeoffs:
Requires storing per-block maxima (slightly more index size).
Performance depends on block size and term distribution.
More complex to implement correctly.
A Gentle Proof (Why It Works)
Let \(B_t(i)\) be the \(i\)-th block for term \(t\), and let \(U_t(i)\) be its maximum score. The upper bound for a combination of blocks is:
\[
U_B = \sum_{t \in Q} U_t(b_t)
\]
If \(U_B < \theta\) (the current top-\(k\) threshold), then no document in those blocks can exceed \(\theta\), so it’s safe to skip them entirely.
Because \(U_t(i)\) ≥ any document’s actual score within \(B_t(i)\), this bound preserves exactness while maximizing skipping efficiency.
Try It Yourself
Compare WAND vs BMW on a large dataset, count total doc evaluations.
Visualize score threshold growth over time.
Extend to “MaxScore” or “Exhaustive BMW” hybrid for early termination.
Test Cases
| Dataset (#Docs) | Query | Algorithm | Evaluated Docs | Same Top-k |
| --- | --- | --- | --- | --- |
| 10k | "deep learning" | WAND | 3,000 | ✅ |
| 10k | "deep learning" | BMW | 900 | ✅ |
| 1M | "neural vision" | WAND | 70,000 | ✅ |
| 1M | "neural vision" | BMW | 12,000 | ✅ |
Complexity
| Stage | Time | Space |
| --- | --- | --- |
| Build index | \(O(\sum_d \vert d \vert)\) | \(O(V)\) |
| Query scoring | \(O(k \log k + \text{\#visited blocks})\) | \(O(V)\) |
| Skip gain | 3×–10× fewer postings visited | small overhead |
Block-Max WAND is efficiency at scale, where indexing meets geometry, and millions of postings melt into a few meaningful blocks.
878 Impact-Ordered Indexing
Impact-Ordered Indexing is a retrieval optimization that sorts postings by impact score (importance) instead of document ID. It allows the system to quickly find the most relevant documents first, without scanning all postings.
It’s a cornerstone of high-performance ranked retrieval, used in quantized BM25 systems, learning-to-rank engines, and neural hybrid search.
What Problem Are We Solving?
Traditional inverted indexes are sorted by docID, great for Boolean queries, but inefficient for ranked retrieval.
When ranking documents, we only need the top-\(k\) highest-scoring results. Scanning every posting in docID order wastes time evaluating low-impact documents.
Impact ordering solves this by precomputing a score estimate for each posting and sorting postings by that score, so the engine can focus on the most promising ones first.
How Does It Work (Plain Language)
Compute an impact score for each term–document pair, such as: \[
I(t, d) = w_{t,d} = TF(t,d) \times IDF(t)
\] (or its BM25 equivalent).
In the posting list for each term, store tuples (impact, docID) and sort them in descending order of impact.
During query evaluation:
Iterate over each term’s postings in descending impact order.
Accumulate partial scores in a heap for top-\(k\).
Stop early when the sum of remaining maximum impacts can’t beat the current threshold.
Example
| Term | Postings (impact, docID) |
| --- | --- |
| deep | (2.3, 9), (1.5, 3), (0.8, 10) |
| learning | (2.8, 3), (1.9, 5), (0.7, 12) |
| vision | (3.1, 5), (2.5, 9), (1.2, 11) |
When evaluating query “deep learning vision”, the system processes the highest-impact postings first, e.g., (3.1,5), (2.8,3), (2.5,9), and can stop once the remaining possible contributions are below the top-\(k\) threshold.
Tiny Code (Python Example)
import heapq

# Example impact-ordered index
index = {
    "deep": [(2.3, 9), (1.5, 3), (0.8, 10)],
    "learning": [(2.8, 3), (1.9, 5), (0.7, 12)],
    "vision": [(3.1, 5), (2.5, 9), (1.2, 11)],
}

def impact_ordered_search(query_terms, k):
    heap = []  # top-k
    scores = {}
    pointers = {t: 0 for t in query_terms}
    threshold = 0.0
    while True:
        # Select next best impact posting
        next_term, best_doc, best_impact = None, None, 0
        for t in query_terms:
            p = pointers[t]
            if p < len(index[t]):
                impact, doc = index[t][p]
                if impact > best_impact:
                    best_impact = impact
                    best_doc = doc
                    next_term = t
        if not next_term:
            break
        # Accumulate score
        scores[best_doc] = scores.get(best_doc, 0) + best_impact
        pointers[next_term] += 1
        # If new doc enters top-k
        heapq.heappush(heap, scores[best_doc])
        if len(heap) > k:
            heapq.heappop(heap)
        threshold = min(heap)
        # Early stop condition
        remaining_max = sum(index[t][p][0] for t, p in pointers.items() if p < len(index[t]))
        if remaining_max < threshold:
            break
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

print(impact_ordered_search(["deep", "learning", "vision"], 3))
Output:
[(5, 5.0), (9, 4.8), (3, 4.3)]
Why It Matters
Processes high-value documents first.
Enables early termination once remaining potential can’t change the top-\(k\).
Works beautifully with compressed indexes and quantized BM25 (where scores are pre-bucketed).
Foundation for MaxScore, BMW, and learning-to-rank re-rankers.
Tradeoffs:
Loses efficient Boolean filtering (no docID order).
Memory overhead for storing and sorting postings by impact.
A Gentle Proof (Why It Works)
Let \(I(t,d)\) be the precomputed impact for term \(t\) in document \(d\). Let \(U_t\) be the maximum remaining impact for term \(t\) not yet processed.
At any point, if: \[
\sum_{t \in Q} U_t < \theta
\] (where \(\theta\) is the current top-\(k\) threshold), then no unseen document can exceed the current lowest top-\(k\) score, so evaluation can stop safely without missing any top document.
This guarantees correctness under monotonic scoring functions (like BM25 or TF–IDF).
Try It Yourself
Build a small index of 10–20 documents with TF–IDF weights.
Sort postings by descending impact and run queries using early termination.
Compare runtime vs docID-ordered BM25 scoring.
Experiment with quantized (integer bucketed) impacts for compression.
Combine with WAND or BMW for hybrid skipping.
Test Cases
| Query | Algorithm | Docs Scanned | Same Top-k | Speedup |
| --- | --- | --- | --- | --- |
| deep learning | DocID order | 1000 | Yes | 1× |
| deep learning | Impact order | 150 | Yes | ~6× faster |
| neural vision | Impact order + early stop | 80 | Yes | ~10× faster |
Complexity
| Stage | Time | Space |
| --- | --- | --- |
| Index build | \(O(\sum_d \vert d \vert \log \vert d \vert)\) | \(O(V)\) |
| Query scoring | \(O(k + \text{\#high-impact postings})\) | \(O(V)\) |
| Early stop | 80–95% postings skipped | minor metadata overhead |
Impact-Ordered Indexing is like sorting by importance before even starting the race, you reach the winners first, and stop running once you already know who they are.
879 Tiered Indexing
Tiered Indexing is a multi-layered organization of search indexes that prioritizes high-quality or high-scoring documents for early access. Instead of scanning all postings equally, it structures data into tiers so that the most promising documents are searched first, enabling faster top-\(k\) retrieval.
This approach is fundamental in web search engines, large-scale retrieval systems, and vector + keyword hybrid engines, where response time is critical and partial results must arrive quickly.
What Problem Are We Solving?
Modern search engines index billions of documents. Scanning everything for each query is impossible within latency constraints.
We need a way to search progressively, starting from the most valuable (high-priority) subset, and expanding only if needed.
Tiered indexing does exactly that by partitioning the index into tiers based on document quality, rank potential, or access frequency.
How Does It Work (Plain Language)
During indexing, documents are ranked or scored by a quality metric (e.g., PageRank, authority, click-through rate, popularity).
The index is split into tiers:
Tier 0 → high-quality, high-impact documents.
Tier 1, Tier 2, … → progressively lower priority.
At query time:
Start searching from Tier 0 using normal ranked retrieval.
If the top-\(k\) results are not yet stable (score gap small), move to Tier 1, and so on.
Stop once the top-\(k\) results are confident (remaining tiers cannot improve them).
This results in fast approximate answers that are often exact for most queries.
Example
Suppose you have 1 million documents ranked by quality and split them into tiers by that ranking.
| Tier | Quality | Example |
| --- | --- | --- |
| 0 | top 10% | Wikipedia, official pages |
| 1 | next 30% | trusted blogs, academic pages |
| 2 | next 60% | general web documents |
When processing a query “deep learning”, Tier 0 might already contain most of the top-ranked results. The system can skip Tier 1–2 unless needed for recall.
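Tiny Code (Python Sketch)
A minimal sketch of tier-by-tier evaluation; retrieve_from_index, the result records, and the per-tier max_score bound are placeholders for whatever retrieval method and metadata your system actually uses.

from collections import namedtuple

Result = namedtuple("Result", ["doc_id", "score"])

def retrieve_from_index(tier, query, k):
    # Stand-in for any ranked retrieval (BM25, WAND, impact-ordered, ...) scoped to one tier;
    # a tier here is a dict {"docs": {doc_id: {term: weight}}, "max_score": bound}
    scored = [Result(d, sum(terms.get(t, 0.0) for t in query))
              for d, terms in tier["docs"].items()]
    scored.sort(key=lambda r: r.score, reverse=True)
    return scored[:k]

def tiered_search(tiers, query, k):
    results = []
    for i, tier in enumerate(tiers):
        results.extend(retrieve_from_index(tier, query, k))
        results.sort(key=lambda r: r.score, reverse=True)
        results = results[:k]
        # Lower tiers cannot improve the top-k once their best possible score
        # falls below the current k-th score (the condition proved below)
        remaining_max = max((t["max_score"] for t in tiers[i + 1:]), default=0.0)
        if len(results) == k and remaining_max < results[-1].score:
            break
    return results

tiers = [
    {"docs": {1: {"deep": 2.0, "learning": 1.8}, 2: {"deep": 1.5}}, "max_score": 4.0},
    {"docs": {3: {"learning": 0.9}, 4: {"deep": 0.4}}, "max_score": 1.5},
]
print(tiered_search(tiers, ["deep", "learning"], k=2))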
Here retrieve_from_index is any retrieval method (BM25, impact-ordered, WAND, etc.) applied within that tier. Each tier is a self-contained inverted index.
Why It Matters
Faster response: often the top results come from Tier 0 or Tier 1.
Better caching: higher tiers fit in memory or SSD; lower tiers can stay on disk.
Incremental refresh: new or high-traffic documents can be inserted into higher tiers dynamically.
Scalable hybrid search: vector indexes can mirror the same structure for approximate-nearest-neighbor retrieval.
Tradeoffs:
Requires maintaining multiple indexes.
Risk of missing relevant but low-tier documents if thresholds are too strict.
Merging results across tiers adds coordination overhead.
A Gentle Proof (Why It Works)
Let \(S_i(d)\) denote the score of document \(d\) in tier \(i\), and let \(U_i\) be the maximum possible score of any document in that tier.
If \(\theta\) is the current top-\(k\) threshold after searching tiers \(0 \dots i\), and if: \[
\max_{j>i} U_j < \theta
\] then no unseen document in lower tiers can enter the top-\(k\). This guarantees correctness under monotonic scoring.
Because high-quality documents concentrate in earlier tiers, this condition is reached early for most queries.
Try It Yourself
Split your dataset by document authority or click rate into 3 tiers.
Index each tier separately with BM25.
Run queries in order: Tier 0 → Tier 1 → Tier 2.
Measure how often Tier 0 already yields full top-\(k\).
Combine with WAND or Impact-Ordered Indexing for deeper efficiency.
Optionally, use ANN vector tiers (for embeddings) alongside keyword tiers.
Test Cases
| Dataset | Query | Tiers | % Queries Resolved in Tier 0 | Accuracy vs Full Search |
| --- | --- | --- | --- | --- |
| Web 1M | "AI" | 3 | 94% | 99.8% |
| Web 10M | "data systems" | 3 | 90% | 99.5% |
| News 1M | "latest election" | 2 | 88% | 99.9% |
Complexity
| Stage | Time | Space |
| --- | --- | --- |
| Index build | \(O(N)\) per tier | \(O(N)\) |
| Query evaluation | \(O(k + \text{\#tiers visited})\) | \(O(V)\) |
| Early termination | often 1–2 tiers | minor metadata overhead |
Tiered Indexing is how large-scale systems stay responsive, searching the summit first, and descending deeper only when they must.
880 DAAT vs SAAT Evaluation
DAAT (Document-at-a-Time) and SAAT (Score-at-a-Time) are two contrasting strategies for evaluating ranked retrieval queries. They define how posting lists are traversed and combined when computing document scores, a core decision in every search engine’s architecture.
Both reach the same results, but they make different tradeoffs between CPU efficiency, memory access, and early termination ability.
What Problem Are We Solving?
When multiple query terms appear in different posting lists, the engine must decide in what order to access them and how to merge their scores efficiently.
Should we gather all contributions for one document before moving on to the next? (DAAT)
Or should we process one posting at a time from all terms, gradually refining scores? (SAAT)
This distinction affects query latency, caching, parallelism, and even how compression and skipping work.
How It Works
DAAT (Document-at-a-Time)
Merge posting lists by document ID order.
For each document that appears in any posting list, compute its total score from all matching terms.
Once a document’s full score is known, it’s inserted into the top-\(k\) heap.
This is the approach used in WAND, BMW, and Lucene.
SAAT (Score-at-a-Time)
Process one posting (term–doc pair) at a time, sorted by impact or score.
Incrementally update partial document scores as high-impact postings arrive.
Stop when it’s provably impossible for any unseen posting to affect the top-\(k\).
This is the model used in Impact-Ordered Indexing, MaxScore, and Quantized BM25 systems.
Example
Suppose we have query terms “deep” and “learning”.
| Term | Postings (docID, score) |
| --- | --- |
| deep | (1, 0.6), (2, 0.9), (4, 0.5) |
| learning | (2, 0.8), (3, 0.7), (4, 0.6) |
DAAT traversal:
DAAT traversal:

| Step | docID | deep | learning | Total |
| --- | --- | --- | --- | --- |
| 1 | 1 | 0.6 | 0 | 0.6 |
| 2 | 2 | 0.9 | 0.8 | 1.7 |
| 3 | 3 | 0 | 0.7 | 0.7 |
| 4 | 4 | 0.5 | 0.6 | 1.1 |
SAAT traversal (sorted by score): Process postings by descending term–doc score: (2,0.9), (2,0.8), (4,0.6), (1,0.6), (3,0.7), (4,0.5)…
As partial scores accumulate, we can stop once remaining postings can’t affect the top-\(k\).
Tiny Code (Python Example)
import heapq

def daat(query_lists, k):
    # Merge all docIDs in sorted order
    all_docs = sorted(set().union(*[dict(pl).keys() for pl in query_lists]))
    scores = {}
    for doc in all_docs:
        total = sum(dict(pl).get(doc, 0) for pl in query_lists)
        scores[doc] = total
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

def saat(query_lists, k):
    # Process postings by descending score
    postings = [(score, doc) for pl in query_lists for doc, score in pl]
    postings.sort(reverse=True)
    scores, heap, threshold = {}, [], 0.0
    for score, doc in postings:
        scores[doc] = scores.get(doc, 0) + score
        heapq.heappush(heap, (scores[doc], doc))
        if len(heap) > k:
            heapq.heappop(heap)
        threshold = heap[0][0]
        remaining_max = score  # Simplified upper bound
        if remaining_max < threshold:
            break
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

deep = [(1, 0.6), (2, 0.9), (4, 0.5)]
learning = [(2, 0.8), (3, 0.7), (4, 0.6)]
print("DAAT:", daat([deep, learning], 2))
print("SAAT:", saat([deep, learning], 2))
Why It Matters
| Feature | DAAT | SAAT |
| --- | --- | --- |
| Traversal | docID order | score order |
| Early termination | via WAND/BMW | via MaxScore |
| Skipping ability | strong (sorted docIDs) | weaker |
| Cache locality | good | random |
| Parallelism | limited | high (posting-level) |
| Compression friendliness | high | lower |
| Typical usage | Lucene, Elasticsearch | Terrier, Anserini, learned retrieval |
Both reach the same final ranking, but DAAT is better for CPU cache efficiency, while SAAT shines in GPU or vectorized environments.
A Gentle Proof (Why Both Are Correct)
Let \(S(d)\) be the total score of document \(d\) for query \(Q\). Each algorithm computes:
\[
S(d) = \sum_{t \in Q} w_{t,d}
\]
Both DAAT and SAAT fully enumerate all \((t, d)\) pairs, only in different orders. Because addition is commutative and associative, the final score set \(\{S(d)\}\) is identical.
Early termination preserves correctness under monotonic scoring when upper bounds are respected: \[
\sum_{t \in Q} U_t < \theta \implies \text{safe to stop}
\]
Thus, both are semantically equivalent, differing only in traversal order.
Try It Yourself
Implement DAAT and SAAT for TF–IDF on the same index.
Measure time per query and number of postings visited.
Add early stopping thresholds to both.
Try hybrid evaluation: DAAT for short queries, SAAT for long ones.
Visualize score accumulation curves to see how early termination differs.
Test Cases
| Query | Algorithm | Docs Scanned | Same Top-k | Speedup |
| --- | --- | --- | --- | --- |
| "deep learning" | DAAT | 1,200 | Yes | 1× |
| "deep learning" | SAAT | 250 | Yes | ~5× |
| "neural vision" | DAAT | 3,000 | Yes | 1× |
| "neural vision" | SAAT | 800 | Yes | ~4× |
Complexity
| Stage | DAAT | SAAT |
| --- | --- | --- |
| Merge cost | \(O(\sum_t \vert P_t \vert)\) | \(O(\sum_t \vert P_t \vert \log \vert P_t \vert)\) |
| Early stop | via bound check | via score bound |
| Space | \(O(V)\) | \(O(V)\) |
DAAT vs SAAT captures the philosophy of retrieval engineering, DAAT merges by documents, SAAT by importance, both converging to the same truth from different directions.
Section 89. Compression and Encoding in Systems
881 Run-Length Encoding (RLE)
Run-Length Encoding (RLE) is a simple lossless compression method that replaces consecutive repetitions of the same symbol with a single symbol and a count. It is extremely effective on data with long runs, such as binary images, simple graphics, and sensor streams with repeated values.
What Problem Are We Solving?
Uncompressed data often contains repeated symbols:
Images with long horizontal runs of the same pixel value
Logs or telemetry with repeated flags
Bitmaps and masks with large zero regions
RLE encodes these runs compactly, reducing storage and transmission cost when repetitions dominate.
How Does It Work (Plain Language)
Scan the sequence from left to right and group identical adjacent symbols into runs. Each run of symbol \(x\) with length \(k\) is written as a pair \((x, k)\).
Example:
Input: AAAABBCCDAA
Runs: A×4, B×2, C×2, D×1, A×2
RLE: (A,4)(B,2)(C,2)(D,1)(A,2)
Two common layouts:
Symbol–count pairs: store both symbol and length.
Count–symbol pairs: store length first for byte-aligned formats.
To decode, expand each pair by repeating the symbol \(k\) times.
Tiny Code (Python)
def rle_encode(data: str):
    if not data:
        return []
    out = []
    curr, count = data[0], 1
    for ch in data[1:]:
        if ch == curr and count < 255:  # cap to one byte for demo
            count += 1
        else:
            out.append((curr, count))
            curr, count = ch, 1
    out.append((curr, count))
    return out

def rle_decode(pairs):
    return "".join(ch * cnt for ch, cnt in pairs)

s = "AAAABBCCDAA"
pairs = rle_encode(s)
print(pairs)              # [('A', 4), ('B', 2), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(pairs))  # AAAABBCCDAA
Tiny Code (C)
#include <stdio.h>
#include <string.h>

void rle_encode(const char *s) {
    int n = strlen(s);
    for (int i = 0; i < n;) {
        char c = s[i];
        int j = i + 1;
        // extend the run, capped at 255 for a one-byte count
        while (j < n && s[j] == c && j - i < 255) j++;
        printf("(%c,%d)", c, j - i);
        i = j;
    }
    printf("\n");
}

int main() {
    const char *s = "AAAABBCCDAA";
    rle_encode(s);
    return 0;
}
Why It Matters
Simplicity: tiny encoder and decoder, easy to implement in constrained systems
Speed: linear time, minimal CPU and memory
Interoperability: serves as a building block inside larger formats
Fax Group 3/4 bitmaps
TIFF PackBits
Some sprite and tile map formats
Hardware framebuffers and FPGA streams
Tradeoffs:
Poor compression on high-entropy or highly alternating data
Worst case can expand size (e.g., ABABAB...)
Needs run caps and escape rules to handle edge cases
A Gentle Proof (Why It Works)
Let the input be a sequence \(S = s_1 s_2 \dots s_n\). Partition \(S\) into maximal runs \(R_1, R_2, \dots, R_m\) where each run \(R_i\) consists of a symbol \(x_i\) repeated \(k_i\) times and \(k_i \ge 1\). These runs are unique because each boundary occurs exactly where \(s_j \ne s_{j+1}\).
Encoding maps \(S\) to the sequence of pairs: \[
E(S) = \big((x_1, k_1), (x_2, k_2), \dots, (x_m, k_m)\big)
\]
Decoding is the inverse map: \[
D(E(S)) = \underbrace{x_1 \dots x_1}_{k_1} \underbrace{x_2 \dots x_2}_{k_2} \dots \underbrace{x_m \dots x_m}_{k_m} = S
\]
Thus \(D \circ E\) is the identity on the set of sequences, which proves lossless correctness.
Compression ratio depends on the number and lengths of runs. If average run length is \(\bar{k}\) and the per-pair overhead is constant, then the compressed length scales as \(O(n / \bar{k})\).
Try It Yourself
Encode horizontal scanlines of a black and white image and compare size to raw.
Add a one-byte cap for counts and introduce an escape sequence to represent literal segments that are not compressible.
Measure worst-case expansion on random data.
Combine with delta encoding of pixel differences before RLE.
Compare RLE after sorting runs vs before on simple structured data.
Test Cases
| Input | Encoded Pairs | Notes |
| --- | --- | --- |
| AAAAA | (A,5) | single long run |
| ABABAB | (A,1)(B,1)(A,1)(B,1)(A,1)(B,1) | expansion risk |
| 0000011100 | (0,5)(1,3)(0,2) | typical binary mask |
| (empty) | [] | edge case |
| A | (A,1) | minimal |
Variants and Extensions
Run-length of bits: pack runs of 0s and 1s for bitmaps
PackBits: count byte with sign for literal vs run segments
RLE + entropy coding: RLE first, then Huffman on counts and symbols
RLE on deltas: compress repeated differences rather than raw values
Complexity
Time: \(O(n)\) encode and decode
Space: \(O(1)\) streaming, aside from output buffer
Compression: best when average run length \(\bar{k} \gg 1\)
Worst case size: can exceed input by a constant factor due to pair overhead
When to use RLE: data with long homogeneous stretches, predictable symbols, or structured masks. When to avoid: noisy text, shuffled bytes, or randomized streams where runs are short.
882 Huffman Coding
Huffman Coding is a cornerstone of lossless data compression. It assigns shorter bit patterns to frequent symbols and longer bit patterns to rare symbols, achieving near-optimal compression when symbol frequencies are known. It’s the beating heart of ZIP, PNG, JPEG, and countless codecs.
What Problem Are We Solving?
Fixed-length encoding (like ASCII) wastes bits. If ‘E’ occurs 13% of the time and ‘Z’ only 0.1%, it makes no sense to use the same 8 bits for both.
We want to minimize total bit length while keeping the encoding uniquely decodable.
How Does It Work (Plain Language)
Huffman coding builds a binary tree of symbols weighted by frequency:
Count frequencies for all symbols.
Put each symbol into a priority queue (min-heap) keyed by frequency.
Repeatedly remove the two least frequent nodes and merge them into a new parent node.
Continue until only one node remains, the root of the tree.
Assign 0 for one branch and 1 for the other. Each symbol’s bit code is its path from root to leaf.
Frequent symbols end up closer to the root → shorter codes.
Example
| Symbol | Frequency | Code |
| --- | --- | --- |
| A | 5 | 0 |
| B | 2 | 111 |
| C | 1 | 100 |
| D | 1 | 101 |
| E | 1 | 110 |
Average code length is \((5 \cdot 1 + 2 \cdot 3 + 1 \cdot 3 + 1 \cdot 3 + 1 \cdot 3)/10 = 2.0\) bits per symbol, much less than a fixed 3-bit or 8-bit encoding.
Tiny Code (Python)
import heapq

def huffman_code(freqs):
    # each heap entry is [weight, [symbol, code], [symbol, code], ...]
    heap = [[w, [sym, ""]] for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))

freqs = {"A": 5, "B": 2, "C": 1, "D": 1, "E": 1}
for sym, code in huffman_code(freqs):
    print(sym, code)
Why It Matters
Near-optimal entropy compression for independent symbols
Fast: single pass for encoding and decoding using lookup tables
Universal: foundation of DEFLATE, GZIP, PNG, JPEG, MP3, PDF text streams
Tradeoffs:
Requires prior knowledge of symbol frequencies
Inefficient for small alphabets or changing distributions
Generates prefix codes, decoding must follow tree paths
A Gentle Proof (Why It Works)
Let each symbol \(i\) have frequency \(p_i\) and code length \(l_i\). Expected code length:
\[
L = \sum_i p_i l_i
\]
Huffman proves that for any prefix-free code,
\[
L_{\text{Huffman}} \le L_{\text{any other}}
\]
It’s optimal for any given distribution where symbol probabilities are fixed and independent.
Using Kraft’s inequality,
\[
\sum_i 2^{-l_i} \le 1
\]
Huffman coding finds the shortest integer lengths satisfying this constraint.
Try It Yourself
Encode the string "MISSISSIPPI" with Huffman coding.
Compare total bits vs fixed 8-bit ASCII.
Add adaptive Huffman coding (rebuild tree dynamically).
Apply to grayscale image data or sensor readings.
Explore canonical Huffman codes (used in DEFLATE).
Test Cases
| Input | Raw Bits (8-bit) | Huffman Bits | Compression |
|---|---|---|---|
| AAAAABBC | 64 | 20 | 68.8% |
| HELLO | 40 | 23 | 42.5% |
| MISSISSIPPI | 88 | 47 | 46.6% |
Complexity
| Stage | Time | Space |
|---|---|---|
| Tree build | \(O(n \log n)\) | \(O(n)\) |
| Encode | \(O(N)\) | \(O(n)\) |
| Decode | \(O(N)\) | \(O(n)\) |
Huffman Coding captures a simple truth, give more to the common, less to the rare, and in doing so, it built the backbone of modern compression.
883 Arithmetic Coding
Arithmetic Coding is a powerful lossless compression algorithm that represents an entire message as a single fractional number between 0 and 1. Unlike Huffman coding, which assigns discrete bit patterns per symbol, arithmetic coding encodes the whole sequence into a progressively refined interval, achieving near-entropy efficiency.
What Problem Are We Solving?
When symbol probabilities are not powers of ½, Huffman codes can’t perfectly match entropy limits. Arithmetic coding solves this by assigning a fractional number of bits per symbol, making it more efficient for skewed or adaptive distributions.
How Does It Work (Plain Language)
The algorithm treats the entire message as a single real number in the interval \([0,1)\), subdivided according to symbol probabilities. At each step, it narrows the interval to the subrange corresponding to the next symbol.
Let’s say symbols A, B, C have probabilities 0.5, 0.3, 0.2.
| Symbol | Range |
|---|---|
| A | [0.0, 0.5) |
| B | [0.5, 0.8) |
| C | [0.8, 1.0) |
Encode “BA”:
Start with \([0, 1)\)
Symbol B → \([0.5, 0.8)\)
Symbol A → subdivide \([0.5, 0.8)\) using same probabilities:
New A range = \([0.5, 0.65)\)
New B range = \([0.65, 0.74)\)
New C range = \([0.74, 0.8)\)
Final interval = \([0.5, 0.65)\)
Any number inside represents “BA”, e.g. 0.6.
Decoding reverses the process by repeatedly identifying which subinterval the number falls into.
Tiny Code (Python)
def arithmetic_encode(symbols, probs):
    low, high = 0.0, 1.0
    for s in symbols:
        span = high - low
        for sym, (p_low, p_high) in probs.items():
            if sym == s:
                high = low + span * p_high
                low = low + span * p_low
                break
    return (low + high) / 2

def arithmetic_decode(code, n, probs):
    result = []
    for _ in range(n):
        for sym, (p_low, p_high) in probs.items():
            if p_low <= code < p_high:
                result.append(sym)
                code = (code - p_low) / (p_high - p_low)
                break
    return "".join(result)

probs = {"A": (0.0, 0.5), "B": (0.5, 0.8), "C": (0.8, 1.0)}
encoded = arithmetic_encode("BA", probs)
print("Encoded value:", encoded)
print("Decoded:", arithmetic_decode(encoded, 2, probs))
Why It Matters
Closer to entropy limit: can achieve fractional-bit compression.
Adaptable: works with dynamically updated probabilities.
Used in: JPEG2000, H.264, AV1, Bzip2, and modern AI model quantization.
Tradeoffs:
Arithmetic precision must be controlled (floating-point drift).
Implementation complexity is higher than Huffman’s.
Needs renormalization and bitstream encoding for real-world use.
A Gentle Proof (Why It Works)
If a message \(S = s_1 s_2 \dots s_n\) is encoded to the interval \([L, H)\), then
\[
H - L = \prod_{i=1}^{n} P(s_i)
\]
Thus, the total number of bits needed is approximately \[
-\log_2 (H - L) = \sum_{i=1}^{n} -\log_2 P(s_i)
\]
This equals the Shannon information content, achieving optimal entropy coding under ideal arithmetic precision.
Try It Yourself
Encode “ABBA” with probabilities A:0.6, B:0.4.
Visualize shrinking intervals per symbol.
Add renormalization to emit bits progressively (range coder).
Compare compression ratio vs Huffman on skewed text.
Implement adaptive probabilities (update after each symbol).
Test Cases
| Input | Probabilities | Bits | Compression |
|---|---|---|---|
| "AAAA" | A:1.0 | ~0 bits | Perfect |
| "ABAB" | A:0.9, B:0.1 | ~1.37 bits/symbol | Better than Huffman |
| "ABC" | A:0.5, B:0.3, C:0.2 | ~1.49 bits/symbol | Near entropy |
Complexity
| Stage | Time | Space |
|---|---|---|
| Encoding | \(O(n)\) | \(O(1)\) |
| Decoding | \(O(n)\) | \(O(1)\) |
| Precision | depends on bit width | – |
Arithmetic coding captures the continuity of information, it doesn’t think in symbols, but in probability space. Every message becomes a unique slice of the unit interval.
884 Delta Encoding
Delta Encoding compresses data by storing the difference between successive values rather than the values themselves. It is simple, fast, and incredibly effective for datasets with strong local correlation, such as timestamps, counters, audio samples, or sensor readings.
What Problem Are We Solving?
Many real-world sequences change slowly. For example: [1000, 1001, 1002, 1005, 1006]
Instead of storing every full number, we can store only the differences: [1000, +1, +1, +3, +1]
This reduces magnitude, variance, and entropy, making the sequence easier for secondary compression (like Huffman or arithmetic coding).
How Does It Work (Plain Language)
Take a sequence of numeric values: \(x_1, x_2, x_3, \dots, x_n\)
Choose a reference (usually the first value).
Store the sequence of deltas: \(d_i = x_i - x_{i-1}\) for \(i > 1\)
To reconstruct, perform a cumulative sum: \(x_i = x_{i-1} + d_i\)
If the sequence changes gradually, most \(d_i\) will be small and compressible with fewer bits.
Example
| Original | Delta | Reconstructed |
|---|---|---|
| 1000 | – | 1000 |
| 1001 | +1 | 1001 |
| 1002 | +1 | 1002 |
| 1005 | +3 | 1005 |
| 1006 | +1 | 1006 |
The deltas use fewer bits than full integers and can be encoded with variable-length or entropy coding afterward.
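Tiny Code (Python)
A minimal sketch of the same transform in Python (the C version follows below):

def delta_encode(values):
    if not values:
        return []
    # keep the first value, then store successive differences
    return [values[0]] + [values[i] - values[i - 1] for i in range(1, len(values))]

def delta_decode(deltas):
    if not deltas:
        return []
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)   # cumulative sum restores the original
    return out

data = [1000, 1001, 1002, 1005, 1006]
enc = delta_encode(data)
print(enc)                 # [1000, 1, 1, 3, 1]
print(delta_decode(enc))   # [1000, 1001, 1002, 1005, 1006]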
Tiny Code (C)
#include <stdio.h>

void delta_encode(int *data, int *out, int n) {
    if (n == 0) return;
    out[0] = data[0];
    for (int i = 1; i < n; i++)
        out[i] = data[i] - data[i - 1];
}

void delta_decode(int *deltas, int *out, int n) {
    if (n == 0) return;
    out[0] = deltas[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + deltas[i];
}
Why It Matters
Compression synergy: smaller deltas → better Huffman or arithmetic compression.
Hardware efficiency: used in video (frame deltas), telemetry, time-series DBs, audio coding, and binary diffs.
Streaming-friendly: supports incremental updates with minimal state.
Tradeoffs:
Sensitive to noise (large random jumps break efficiency).
Needs base frame or checkpoint for random access.
Must handle signed deltas safely.
A Gentle Proof (Why It Works)
Let the input sequence be \(x_1, x_2, \dots, x_n\) and output deltas \(d_1, \dots, d_n\) defined by
\[
d_1 = x_1, \quad d_i = x_i - x_{i-1} \text{ for } i > 1
\]
Decoding is the inverse mapping:
\[
x_i = d_1 + \sum_{j=2}^{i} d_j
\]
Since addition and subtraction are invertible over integers, the transform is lossless. The entropy of \(\{d_i\}\) is lower when the \(x_i\) are locally correlated, allowing secondary encoders to exploit the reduced variance.
Try It Yourself
Encode temperature readings or stock prices, plot value vs delta distribution.
Combine with variable-byte encoding for integer streams.
Add periodic “reset” values for random access.
Use second-order deltas (\(d_i = d_i - d_{i-1}\)) for smoother signals.
Compare compression with and without delta preprocessing using gzip or zstd.
Test Cases
| Input | Delta Encoded | Reconstructed |
|---|---|---|
| [5, 7, 9, 12] | [5, 2, 2, 3] | [5, 7, 9, 12] |
| [10, 10, 10, 10] | [10, 0, 0, 0] | [10, 10, 10, 10] |
| [1, 5, 2] | [1, 4, -3] | [1, 5, 2] |
Complexity
| Operation | Time | Space |
|---|---|---|
| Encode | \(O(n)\) | \(O(1)\) |
| Decode | \(O(n)\) | \(O(1)\) |
Delta encoding is the essence of compression through change awareness, it ignores what stays the same and focuses only on how things move.
885 Variable Byte Encoding
Variable Byte (VB) Encoding is a simple and widely used integer compression technique that stores small numbers in fewer bytes and large numbers in more bytes. It’s especially popular in search engines and inverted indexes where posting lists contain large numbers of document IDs or gaps.
What Problem Are We Solving?
If you store every integer using a fixed 4 bytes, small numbers waste space. But most data (like docID gaps) are small integers. VB encoding uses as few bytes as necessary to store each number.
It’s fast, byte-aligned, and easy to decode, perfect for systems like Lucene, Zettair, or SQLite.
How Does It Work (Plain Language)
Each integer is split into 7-bit chunks (base 128 representation). The highest bit in each byte is a continuation flag:
1 means more bytes follow.
0 means this is the last byte of the number.
This makes decoding trivial: keep reading until a byte with a leading 0 is found.
Example: encode 300
Binary: 100101100 (9 bits)
Split into 7-bit chunks: 10 (upper bits) and 0101100 (lower 7 bits)
Encode:
Lower 7 bits: 00101100 (0x2C)
Upper bits: 00000010 (0x02)
Set the continuation flag on the first (non-final) byte: 10000010 00101100 → bytes [0x82, 0x2C]
Example Table
| Integer | Binary | Encoded Bytes (binary) |
|---|---|---|
| 1 | 00000001 | 00000001 |
| 128 | 10000000 | 10000001 00000000 |
| 300 | 100101100 | 10000010 00101100 |
| 16384 | 100000000000000 | 10000001 10000000 00000000 |
Tiny Code (Python)
def vb_encode_number(n):
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    # note: this variant sets the stop bit (MSB = 1) on the *final* byte,
    # whereas the worked example above flags continuation bytes instead
    bytes_out[-1] += 128
    return bytes_out

def vb_encode_list(numbers):
    result = []
    for n in numbers:
        result.extend(vb_encode_number(n))
    return bytes(result)

def vb_decode(data):
    n, out = 0, []
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            n = 128 * n + (b - 128)
            out.append(n)
            n = 0
    return out

nums = [824, 5, 300]
encoded = vb_encode_list(nums)
print(list(encoded))
print(vb_decode(encoded))
Tiny Code (C)
#include <stdio.h>
#include <stdint.h>

int vb_encode(uint32_t n, uint8_t *out) {
    uint8_t buf[5];
    int i = 0;
    do {
        buf[i++] = n % 128;
        n /= 128;
    } while (n > 0);
    buf[0] += 128;               /* stop bit: mark the final (least-significant) byte */
    for (int j = i - 1; j >= 0; j--)
        out[i - 1 - j] = buf[j];
    return i;
}

uint32_t vb_decode(const uint8_t *in, int *pos) {
    uint32_t n = 0;
    uint8_t b;
    do {
        b = in[(*pos)++];
        if (b >= 128)
            n = 128 * n + (b - 128);
        else
            n = 128 * n + b;
    } while (b < 128);
    return n;
}
Why It Matters
Used in IR systems: to compress posting lists (docID gaps, term frequencies).
Compact + fast: byte-aligned, simple bitwise ops, good for CPU caching.
Perfect for delta-encoded integers (follows Algorithm 884).
Tradeoffs:
Less dense than bit-packing or Frame-of-Reference compression.
Not SIMD-friendly (irregular length).
Needs extra byte per 7 bits for long numbers.
A Gentle Proof (Why It Works)
Let each number \(x\) be decomposed in base \(128\): \[
x = \sum_{i=0}^{m} a_i \cdot 128^i
\]
Each chunk \(a_i\) fits in 7 bits \((0 \le a_i < 128)\). The encoding emits the chunks from most significant to least significant, with the high bit of the final byte set to mark termination (as in the code above). Decoding reverses this process by multiplying partial sums by 128 until a terminating byte appears.
The bijection between \(\mathbb{N}\) and valid byte sequences guarantees correctness and prefix-freeness.
Try It Yourself
Combine with delta encoding for document IDs: store gap[i] = doc[i] - doc[i-1].
Benchmark encoding and decoding throughput vs plain integers.
Visualize how numbers of different sizes consume variable bytes.
Add a SIMD-like unrolled decoder for fixed batch size.
Apply to time-series or integer streams from logs.
Test Cases
| Input | Encoded Bytes | Decoded |
|---|---|---|
| [1] | [129] | [1] |
| [128] | [1, 128] | [128] |
| [300] | [2, 172] | [300] |
| [824, 5, 300] | [6, 184, 133, 2, 172] | [824, 5, 300] |
(The byte values follow the stop-bit-on-last-byte convention used by the code above.)
Complexity
| Operation | Time | Space |
|---|---|---|
| Encode | \(O(n)\) | \(O(n)\) |
| Decode | \(O(n)\) | \(O(1)\) |
| Compression ratio | depends on data magnitude | – |
Variable Byte Encoding is the compression workhorse of information retrieval, small, fast, byte-aligned, and perfectly tuned for integers that love to shrink.
886 Elias Gamma Coding
Elias Gamma Coding is a universal code for positive integers that encodes numbers using a variable number of bits without requiring a known upper bound. It is elegant, prefix-free, and forms the foundation for other universal codes such as Delta and Rice coding.
What Problem Are We Solving?
When compressing integer sequences of unknown range, fixed-length codes are wasteful. We want a self-delimiting, bit-level code that efficiently represents both small and large numbers without a predefined limit.
Elias Gamma achieves this using logarithmic bit-length encoding, making it ideal for systems that store gaps, counts, or ranks in compressed indexes.
How Does It Work (Plain Language)
To encode a positive integer \(x\):
Write \(x\) in binary. Let \(b = \lfloor \log_2 x \rfloor\), so the binary form has \(b + 1\) bits.
Prepend \(b\) zeros in front of the binary form. The result is the Gamma code, \(2b + 1\) bits long.
To decode, count the leading zeros (\(b\) of them), then read the next \(b + 1\) bits as the value.
Tiny Code (Python)
def elias_gamma_encode(n):
    if n <= 0:
        raise ValueError("Elias gamma only encodes positive integers")
    binary = bin(n)[2:]
    prefix = "0" * (len(binary) - 1)   # b leading zeros
    return prefix + binary

def elias_gamma_decode(code):
    i = 0
    while i < len(code) and code[i] == "0":
        i += 1
    length = i + 1
    value = int(code[i:i + length], 2)
    return value

nums = [1, 2, 3, 4, 5, 10]
codes = [elias_gamma_encode(n) for n in nums]
print(list(zip(nums, codes)))
Tiny Code (C)
#include <stdio.h>
#include <math.h>

void elias_gamma_encode(unsigned int x) {
    int b = (int)floor(log2(x));
    for (int i = 0; i < b; i++) printf("0");
    for (int i = b; i >= 0; i--) printf("%d", (x >> i) & 1);
}

int main(void) {
    for (int i = 1; i <= 10; i++) {
        printf("%2d: ", i);
        elias_gamma_encode(i);
        printf("\n");
    }
    return 0;
}
Why It Matters
Universal: works for any positive integer without fixed range.
Prefix-free: every codeword can be parsed unambiguously.
Compact for small numbers: cost grows as \(2 \lfloor \log_2 n \rfloor + 1\) bits.
Used in:
Inverted indexes for docID gaps
Graph adjacency lists
Compact dictionaries and rank/select structures
Tradeoffs:
Bit-level (not byte-aligned) → slower to decode than variable-byte.
Not ideal for large integers (code length grows logarithmically).
A Gentle Proof (Why It Works)
Each number \(n\) is encoded using: \[
\text{length} = 2\lfloor \log_2 n \rfloor + 1
\]
Since each binary code has a unique unary prefix length, the code satisfies Kraft’s inequality: \[
\sum_{n=1}^\infty 2^{-\text{length}(n)} \le 1
\]
Thus, the code is prefix-free and decodable. The redundancy (extra bits over \(-\log_2 n\)) is bounded by 1 bit per symbol.
Try It Yourself
Encode the sequence [1, 3, 7, 15] and count total bits.
Compare compression vs variable-byte encoding for small gaps.
Implement Elias Delta coding (adds Gamma on length).
Visualize prefix length growth vs number magnitude.
Measure speed for bitstream decoding on random data.
Test Cases
| Number | Gamma Code | Bits |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 010 | 3 |
| 3 | 011 | 3 |
| 4 | 00100 | 5 |
| 5 | 00101 | 5 |
| 10 | 0001010 | 7 |
Complexity
| Operation | Time | Space |
|---|---|---|
| Encode | \(O(\log n)\) | \(O(1)\) |
| Decode | \(O(\log n)\) | \(O(1)\) |
Elias Gamma Coding is a model of elegant simplicity, a self-delimiting whisper of bits that expands just enough to hold a number’s size and no more.
887 Rice Coding
Rice Coding (or Golomb–Rice Coding) is a practical and efficient method for compressing non-negative integers, particularly when smaller values occur much more frequently than larger ones. It’s a simplified form of Golomb coding that uses a power-of-two divisor, enabling extremely fast bit-level encoding and decoding.
What Problem Are We Solving?
When encoding counts, run lengths, or residuals (like deltas), we often have geometrically distributed data, small numbers are common, big ones rare. Rice coding exploits this skew efficiently, using one parameter \(k\) that controls how many bits are allocated to the remainder.
It’s simple, lossless, and widely used in FLAC, H.264, and LZMA for integer compression.
How Does It Work (Plain Language)
Rice coding divides an integer \(x\) into two parts:
Quotient: \(q = \lfloor x / 2^k \rfloor\)
Remainder: \(r = x \bmod 2^k\)
Then:
Encode \(q\) in unary (a series of q ones followed by a zero).
Encode \(r\) in binary using exactly \(k\) bits.
So, Rice\((x, k)\) = 111...10 + (r in k bits)
Example
Let \(k = 2\) (divisor \(= 4\)):
| \(x\) | \(q = \lfloor x/4 \rfloor\) | \(r = x \bmod 4\) | Code |
|---|---|---|---|
| 0 | 0 | 0 | 0 00 → 000 |
| 1 | 0 | 1 | 0 01 → 001 |
| 2 | 0 | 2 | 0 10 → 010 |
| 3 | 0 | 3 | 0 11 → 011 |
| 4 | 1 | 0 | 10 00 → 1000 |
| 5 | 1 | 1 | 10 01 → 1001 |
| 7 | 1 | 3 | 10 11 → 1011 |
| 8 | 2 | 0 | 110 00 → 11000 |
Unary encodes quotient; binary encodes remainder.
Tiny Code (Python)
def rice_encode(x, k):
    q = x >> k
    r = x & ((1 << k) - 1)
    unary = "1" * q + "0"
    binary = format(r, f"0{k}b")
    return unary + binary

def rice_decode(code, k):
    q = code.find("0")
    r = int(code[q + 1:q + 1 + k], 2)
    return (q << k) + r

for x in range(0, 9):
    code = rice_encode(x, 2)
    print(x, code)
Tiny Code (C)
#include <stdio.h>

void rice_encode(unsigned int x, int k) {
    unsigned int q = x >> k;
    unsigned int r = x & ((1 << k) - 1);
    for (unsigned int i = 0; i < q; i++) putchar('1');
    putchar('0');
    for (int i = k - 1; i >= 0; i--) putchar((r >> i) & 1 ? '1' : '0');
}

int main(void) {
    for (int x = 0; x < 9; x++) {
        printf("%2d: ", x);
        rice_encode(x, 2);
        printf("\n");
    }
    return 0;
}
Why It Matters
Compact for small integers: geometric-like data compresses well.
Fast: only shifts, masks, and simple loops.
Parameter-tunable: \(k\) controls balance between quotient and remainder.
Used in:
FLAC (audio residual encoding)
FFV1 / H.264 residuals
Entropy coders in LZMA, Bzip2 variants
Tradeoffs:
Requires tuning of \(k\) for optimal efficiency.
Not ideal for uniform or large outlier-heavy data.
Works best when \(E[x] \approx 2^k\).
A Gentle Proof (Why It Works)
Given integer \(x \ge 0\), let \[
x = q \cdot 2^k + r, \quad 0 \le r < 2^k
\]
The codeword consists of \(q + 1 + k\) bits:
\(q + 1\) from unary encoding
\(k\) from binary remainder
Expected code length for geometric distribution \(P(x) = (1 - p)^x p\) is minimized when \[
2^k \approx \frac{-1}{\log_2(1 - p)}
\]
Thus, tuning \(k\) matches the data’s skew for near-optimal entropy coding.
Try It Yourself
Compare compression ratio vs Elias Gamma for small integers.
Test Cases
| Input \(x\) | \(k\) | Code | Bits |
|---|---|---|---|
| 0 | 2 | 000 | 3 |
| 3 | 2 | 011 | 3 |
| 4 | 2 | 1000 | 4 |
| 7 | 2 | 1011 | 4 |
| 8 | 2 | 11000 | 5 |
Complexity
| Stage | Time | Space |
|---|---|---|
| Encode | \(O(1)\) per value | \(O(1)\) |
| Decode | \(O(1)\) per value | \(O(1)\) |
Rice Coding is the perfect bridge between mathematical precision and machine efficiency, a few shifts, a few bits, and data collapses into rhythmically compact form.
888 Snappy Compression
Snappy is a fast, block-based compression algorithm designed by Google for real-time systems where speed matters more than maximum compression ratio. Unlike heavy compressors like zlib or LZMA, Snappy prioritizes throughput over ratio, achieving hundreds of MB/s in both compression and decompression.
What Problem Are We Solving?
Many modern systems, databases, stream processors, and log pipelines, generate huge volumes of data that need to be stored or transmitted quickly. Traditional compressors (like DEFLATE or bzip2) offer good compression but are too slow for these pipelines.
Snappy trades off some compression ratio for lightweight CPU cost, perfect for:
Columnar databases (Parquet, ORC)
Message queues (Kafka)
Data interchange (Avro, Arrow)
In-memory caches (RocksDB, LevelDB)
How Does It Work (Plain Language)
Snappy is based on LZ77-style compression, but optimized for speed:
Divide data into blocks (usually 32 KB).
Maintain a sliding hash table of previous byte sequences.
For each new sequence:
If a match is found in history, emit a copy command (offset + length).
Otherwise, emit a literal (raw bytes).
Repeat until the block ends.
Every output is composed of alternating literal and copy segments.
Tiny Code (Python Prototype)
def snappy_compress(data: bytes):
    i = 0
    out = []
    while i < len(data):
        # literal section only (no match detection, for simplicity)
        literal_len = min(60, len(data) - i)
        out.append((literal_len << 2) | 0x00)   # tag byte: length in upper bits, type 00 = literal
        out.append(data[i:i + literal_len])
        i += literal_len
    return b"".join(x if isinstance(x, bytes) else bytes([x]) for x in out)

def snappy_decompress(encoded: bytes):
    i = 0
    out = bytearray()
    while i < len(encoded):
        tag = encoded[i]
        i += 1
        ttype = tag & 0x03
        if ttype == 0:                          # literal segment
            length = tag >> 2
            out += encoded[i:i + length]
            i += length
    return bytes(out)
(Simplified, real Snappy adds matching, offsets, and variable-length headers.)
Why It Matters
Widely adopted: used in RocksDB, Parquet, BigQuery, TensorFlow checkpoints.
Deterministic and simple: block-local, restartable anywhere.
Tradeoffs:
Compression ratio ~1.5–2.0×, less than DEFLATE or LZMA.
Not entropy-coded (no Huffman stage).
Not ideal for highly repetitive or structured text.
A Gentle Proof (Why It Works)
Snappy achieves its balance through information-theoretic locality: If most redundancy occurs within 32 KB windows, we can find repeated sequences quickly without full entropy modeling.
For a stream of bytes \(S = s_1, s_2, \dots, s_n\), Snappy emits tokens \((L_i, O_i)\) such that: \[
S = \bigoplus_i \text{(literal)}(L_i) \oplus \text{(copy)}(O_i, L_i)
\]
Since literal and copy tokens cover the stream disjointly and encode full offsets, the compression and decompression functions form an invertible mapping, ensuring losslessness.
Try It Yourself
Compress text files and measure compression ratio vs gzip.
Inspect Parquet file metadata, see compression=SNAPPY.
Implement rolling hash for match detection.
Visualize literal/copy segments for a repetitive input.
Benchmark your implementation with random vs repetitive data.
Test Cases
| Input | Encoded (Hex) | Notes |
|---|---|---|
| hello | 14 68 65 6c 6c 6f | literal only |
| aaaaaa | smaller than raw | due to repeated pattern |
| large repetitive log | 2–3× smaller | predictable structure |
Complexity
| Operation | Time | Space |
|---|---|---|
| Compress | \(O(n)\) | \(O(1)\) |
| Decompress | \(O(n)\) | \(O(1)\) |
| Compression Ratio | 1.5–2.0× typical | – |
Snappy is the speed daemon of compression, sacrificing only a few bits to stay perfectly in sync with your CPU’s rhythm.
889 Zstandard (Zstd)
Zstandard, or Zstd, is a modern, general-purpose compression algorithm developed by Facebook that strikes a remarkable balance between speed and compression ratio. It offers tunable compression levels, an adaptive dictionary system, and extremely fast decompression—making it ideal for data storage, streaming, and transport systems.
What Problem Are We Solving?
Legacy compressors like zlib (DEFLATE) offer decent ratios but struggle with speed; newer ones like LZ4 are fast but often too shallow in compression. Zstd fills this gap: it compresses 3–5× faster than zlib while achieving better compression ratios.
Zstd supports:
Adjustable compression levels (1–22)
Pretrained dictionaries for small data (logs, JSON, RPC payloads)
Streaming and frame-based encoding for large files
How Does It Work (Plain Language)
Zstd is built on three conceptual layers:
LZ77 Back-References It finds repeated byte sequences and replaces them with (offset, length) pairs.
FSE (Finite State Entropy) Coding It compresses literal bytes, offsets, and lengths using entropy models. FSE is a highly efficient implementation of asymmetric numeral systems (ANS), an alternative to Huffman coding.
Adaptive Dictionary and Block Mode It can learn patterns from prior samples, compressing small payloads efficiently.
Tiny Code (Python)
import zstandard as zstd

data = b"The quick brown fox jumps over the lazy dog." * 10

cctx = zstd.ZstdCompressor(level=5)
compressed = cctx.compress(data)

dctx = zstd.ZstdDecompressor()
decompressed = dctx.decompress(compressed)

print("Compression ratio:", len(data) / len(compressed))
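A quick sketch of the level/speed tradeoff, assuming the zstandard package is installed (the exact ratios depend on the data):

import zstandard as zstd

data = b"GET /index.html 200 512\n" * 2000   # repetitive, log-like input
for level in (1, 3, 9, 19):
    compressed = zstd.ZstdCompressor(level=level).compress(data)
    print(f"level {level}: ratio {len(data) / len(compressed):.1f}x")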
Why It Matters
Works for small objects (via dictionaries) or terabyte-scale data (streaming mode)
Adoption:
Used in zstd, tar --zstd, Facebook RocksDB, TensorFlow checkpoints, Linux kernel (initramfs), Kafka, and Git
Tradeoffs:
Higher compression levels require more memory (both compressor and decompressor).
Slightly higher implementation complexity than simple LZ-based schemes.
A Gentle Proof (Why It Works)
Zstd’s core entropy engine, Finite State Entropy, maintains a single state variable \(x\) that represents multiple probabilities simultaneously. For a stream of symbols \(s_1, s_2, \dots, s_n\) with probabilities \(P(s_i)\), each encoding step scales the state by roughly the inverse of the symbol’s probability: \[
x_i \approx \frac{x_{i-1}}{P(s_i)}
\]
so \(\log_2 x\) grows by about \(-\log_2 P(s_i)\) bits per symbol. This maintains information balance equivalent to arithmetic coding but with less overhead. Because entropy coding is integrated directly with back-references, Zstd achieves compression ratios close to DEFLATE + Huffman but runs faster by using pre-normalized tables.
Try It Yourself
Compress logs with zstd -1 and zstd -9, compare sizes and speeds.
Use dictionary training with:
zstd --train *.json -o dict
Experiment with streaming APIs (ZSTD_CStream).
Compare decompression time vs gzip and LZ4.
Inspect frame headers using zstd --list --verbose file.zst.
Test Cases
| Input | Level | Ratio | Speed (MB/s) |
|---|---|---|---|
| Text logs | 3 | 2.8× | 420 |
| JSON payloads | 9 | 4.5× | 250 |
| Binary column data | 5 | 3.0× | 350 |
| Video frames | 1 | 1.6× | 500 |
Complexity
| Stage | Time | Space |
|---|---|---|
| Compression | \(O(n)\) | \(O(2^k)\) for level \(k\) |
| Decompression | \(O(n)\) | \(O(1)\) |
| Entropy coding | \(O(n)\) | \(O(1)\) |
Zstandard is a masterclass in modern compression— fast, flexible, and mathematically graceful, where entropy coding meets engineering pragmatism.
890 LZ4 Compression
LZ4 is a lightweight, lossless compression algorithm focused on speed and simplicity. It achieves extremely fast compression and decompression by using a minimalistic version of the LZ77 scheme, ideal for real-time systems, in-memory storage, and network serialization.
What Problem Are We Solving?
When applications process massive data streams, logs, metrics, caches, or columnar blocks, every CPU cycle matters. Traditional compressors like gzip or zstd offer good ratios but introduce latency. LZ4 instead delivers instant compression, fast enough to keep up with high-throughput pipelines, even on a single CPU core.
Used in:
Databases: RocksDB, Cassandra, SQLite
Filesystems: Btrfs, ZFS
Serialization frameworks: Kafka, Arrow, Protobuf
Real-time systems: telemetry, log ingestion
How Does It Work (Plain Language)
LZ4 is built around a match-copy model similar to LZ77, but optimized for simplicity.
Scan input and maintain a 64 KB sliding window.
Find repeated sequences in the recent history.
Encode data as literals (unmatched bytes) or matches (offset + length).
Each block encodes multiple segments in a compact binary format.
A block is encoded as:
[token][literals][offset][match_length]
Where:
The high 4 bits of token = literal length
The low 4 bits = match length (minus 4)
Offset = 2-byte backward distance
Example (Concept Flow)
Original data:
ABABABABABAB
First “AB” stored as literal.
Subsequent repetitions found at offset 2 → encoded as (offset=2, length=10).
Result: compact and near-instant to decompress.
Tiny Code (Python Prototype)
def lz4_compress(data):
    out = []
    i = 0
    while i < len(data):
        # emit literal block of up to 4 bytes
        lit_len = min(4, len(data) - i)
        token = lit_len << 4
        out.append(token)
        out.extend(data[i:i + lit_len])
        i += lit_len
    return bytes(out)

def lz4_decompress(data):
    out = []
    i = 0
    while i < len(data):
        token = data[i]; i += 1
        lit_len = token >> 4
        out.extend(data[i:i + lit_len])
        i += lit_len
    return bytes(out)
(Simplified, real LZ4 includes match encoding and offsets.)
Why It Matters
Speed: One of the fastest compressors in the world.
Compression: ~400–700 MB/s
Decompression: ~1500–2500 MB/s
Deterministic: No entropy coding, minimal branching.
Cross-platform: Implementations in C, Rust, Go, Python, Java.
Streaming-friendly: Built-in frame and block APIs.
Tradeoffs:
Lower compression ratio (1.5–2× typical).
Not entropy-coded, so redundancy beyond local scope is missed.
A Gentle Proof (Why It Works)
Let \(S = s_1, s_2, \dots, s_n\) be the input string. LZ4 finds pairs \((o_i, \ell_i)\) representing a match of length \(\ell_i\) that starts \(o_i\) bytes back in the already-decoded data.
The encoded stream alternates between literals and matches: \[
S = L_1 + (o_1, \ell_1) + L_2 + (o_2, \ell_2) + \dots
\]
Since each \((o_i, \ell_i)\) refers to previously emitted data, decoding is a simple linear pass. LZ4 ensures every byte is processed exactly once, giving \(O(n)\) time for both compression and decompression.
Try It Yourself
Compare compression ratio of LZ4 vs gzip vs zstd.
Use lz4 -1 and lz4 -9 on large logs.
Enable block streaming mode:
lz4 --block-size=256KB input.txt output.lz4
Measure decompression speed for random data vs structured text.
Embed LZ4 into a database or cache to reduce I/O.
Test Cases
| Input | Size | LZ4 Ratio | zstd Ratio | Notes |
|---|---|---|---|---|
| Random text | 100 KB | 1.4× | 2.6× | fast |
| JSON logs | 1 MB | 1.8× | 3.5× | moderate |
| Repetitive | 500 KB | 2.5× | 3.0× | both good |
| Binary | 2 MB | 1.2× | 2.0× | little gain |
Complexity
| Stage | Time | Space |
|---|---|---|
| Compression | \(O(n)\) | \(O(1)\) |
| Decompression | \(O(n)\) | \(O(1)\) |
LZ4 is the Formula 1 of compression, minimal, fast, and engineered to stay ahead of every byte in flight.
Section 90. Fault Tolerance and Replication
891 Primary–Backup Replication
Primary–Backup Replication (also called master–slave) is one of the simplest and most widely used methods to achieve fault tolerance and high availability in distributed systems. It maintains one primary node that handles all writes and one or more backup nodes that keep synchronized copies of the data.
What Problem Are We Solving?
In a distributed system or database, nodes can fail unexpectedly, due to crashes, network partitions, or maintenance. If there is only one copy of the data, a single failure means downtime or data loss.
Primary–backup replication ensures that:
There is always a standby replica ready to take over.
Updates are replicated reliably from the primary to backups.
How Does It Work (Plain Language)
Client sends a request (usually a write or transaction) to the primary node.
The primary executes the operation and logs the update.
It sends the update (or log entry) to the backup node(s).
Once all backups acknowledge, the primary commits the change.
If the primary fails, a backup is promoted to become the new primary.
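Tiny Code (Python Sketch)
A toy model contrasting synchronous and asynchronous acknowledgment (the class and function names are illustrative, not a real replication protocol); the C sketch below shows the same commit-then-replicate flow:

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, update):
        self.log.append(update)
        return True                        # ack

primary_log = []
backups = [Replica(), Replica()]

def primary_write(update, synchronous=True):
    primary_log.append(update)             # primary logs the update first
    if synchronous:
        acks = [b.apply(update) for b in backups]
        return all(acks)                   # commit only after every backup acks
    for b in backups:                      # asynchronous: replicate after committing
        b.apply(update)
    return True

for u in ["x=1", "y=2", "z=3"]:
    print("committed", u, primary_write(u, synchronous=True))
print("backup logs:", [b.log for b in backups])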
Tiny Code (C)
#include <stdio.h>
#include <string.h>

void replicate(const char *data) {
    printf("Replicating to backup: %s\n", data);
}

int main(void) {
    const char *updates[] = {"x=1", "y=2", "z=3"};
    for (int i = 0; i < 3; i++) {
        printf("Primary committing: %s\n", updates[i]);
        replicate(updates[i]);
    }
    printf("All updates replicated successfully.\n");
    return 0;
}
Why It Matters
Simplicity: Easy to implement and reason about.
Availability: If one node fails, another can take over.
Durability: Backups ensure persistence across failures.
Used in: MySQL replication, ZooKeeper observers, Redis replication, PostgreSQL streaming replication.
Tradeoffs:
Writes go through a single primary (bottleneck).
Failover can cause temporary unavailability.
Replication lag may lead to stale reads.
Split-brain risk if multiple nodes think they are primary.
A Gentle Proof (Why It Works)
Let \(S_p\) be the state of the primary and \(S_b\) the state of the backup.
For each write operation \(w_i\): \[
S_p = S_p \cup \{w_i\}, \quad S_b = S_b \cup \{w_i\}
\]
If replication is synchronous, then: \[
S_p = S_b \quad \forall i
\]
If asynchronous, there exists a lag \(\Delta\) such that: \[
|S_p| - |S_b| \leq \Delta
\]
In the event of a primary failure, the backup can safely resume if \(\Delta = 0\) or recover up to the last replicated state.
Try It Yourself
Simulate primary–backup with two processes.
Introduce failure before backup receives an update.
Measure data loss under asynchronous mode.
Add heartbeats for failover detection.
Implement synchronous replication (wait for ack).
Test Cases
| Mode | Replication | Failure Loss | Latency |
|---|---|---|---|
| Synchronous | Immediate | None | Higher |
| Asynchronous | Deferred | Possible | Lower |
| Semi-sync | Bounded delay | Minimal | Moderate |
Complexity
| Operation | Time | Space |
|---|---|---|
| Write (sync) | \(O(n)\) for n replicas | \(O(n)\) |
| Read | \(O(1)\) | \(O(1)\) |
| Failover | \(O(1)\) detection | \(O(1)\) recovery |
Primary–backup replication is the first building block of reliable systems, simple, strong, and always ready to hand over the torch when one node goes dark.
892 Quorum Replication
Quorum Replication is a distributed consistency protocol that balances availability, fault tolerance, and consistency by requiring only a subset (a quorum) of replicas to agree before an operation succeeds. It is the backbone of modern distributed databases like Cassandra, DynamoDB, and MongoDB.
What Problem Are We Solving?
In fully replicated systems, every write must reach all nodes, which becomes slow or impossible when some nodes fail or are unreachable. Quorum replication ensures correctness even when part of the system is down, as long as enough nodes agree.
The idea:
Not every replica must respond, just enough to form a quorum.
The quorum intersection guarantees consistency.
How Does It Work (Plain Language)
Let there be N replicas. Each operation requires contacting a subset of them:
R: number of replicas needed for a read
W: number of replicas needed for a write
Consistency is guaranteed when: \[
R + W > N
\]
This ensures that every read overlaps with the latest write on at least one replica.
Example:
\(N = 3\) replicas
Choose \(W = 2\), \(R = 2\) Then:
A write succeeds if 2 replicas confirm.
A read succeeds if it hears from 2 replicas.
They overlap → always see the newest data.
Example Flow
Client writes value x = 10 → send to all N nodes.
Wait for W acknowledgments → commit.
Client reads → query all N nodes, wait for R responses.
Resolve conflicts (if any) using latest timestamp or version vector.
Tiny Code (Python Simulation)
import random

N, R, W = 3, 2, 2
replicas = [{"v": None, "ts": 0} for _ in range(N)]

def write(value, ts):
    acks = 0
    for r in replicas:
        if random.random() < 0.9:   # simulate a successful replica write
            r["v"], r["ts"] = value, ts
            acks += 1
    return acks >= W

def read():
    responses = sorted(
        [r for r in replicas if random.random() < 0.9],
        key=lambda r: r["ts"], reverse=True
    )
    return responses[0]["v"] if len(responses) >= R else None

write("alpha", 1)
print("Read:", read())
Tiny Code (C Sketch)
#include <stdio.h>

#define N 3
#define R 2
#define W 2

typedef struct { int value; int ts; } Replica;

Replica replicas[N];

int write_quorum(int value, int ts) {
    int acks = 0;
    for (int i = 0; i < N; i++) {
        replicas[i].value = value;
        replicas[i].ts = ts;
        acks++;
    }
    return acks >= W;
}

int read_quorum(void) {
    int latest = -1, value = 0;
    for (int i = 0; i < N; i++) {
        if (replicas[i].ts > latest) {
            latest = replicas[i].ts;
            value = replicas[i].value;
        }
    }
    return value;
}

int main(void) {
    write_quorum(42, 1);
    printf("Read quorum value: %d\n", read_quorum());
}
Why It Matters
Fault-tolerance: system works even if \((N - W)\) nodes are down.
Scalability: can trade off between latency and consistency.
Consistency guarantee: intersection between R and W sets ensures no stale reads.
Used in: Amazon Dynamo, Cassandra, Riak, MongoDB replica sets.
Tradeoffs:
Large quorums → higher latency.
Small quorums → risk of stale reads.
Need conflict resolution for concurrent writes.
A Gentle Proof (Why It Works)
Let \(W\) be the number of replicas required for a write and \(R\) for a read.
To guarantee that every read sees the latest write: \[
R + W > N
\]
This ensures any two quorums (one write, one read) intersect in at least one node: \[
|Q_r \cap Q_w| \ge 1
\]
That intersection node always carries the most recent value, propagating consistency.
If \(R + W \le N\), two disjoint quorums could exist, causing stale reads.
Try It Yourself
Simulate a cluster with \(N = 5\).
Set different quorum pairs:
\(R=3\), \(W=3\) → strong consistency
\(R=1\), \(W=3\) → fast reads
\(R=3\), \(W=1\) → fast writes
Inject random failures or slow nodes.
Verify which reads remain consistent.
Test Cases
| N | R | W | Condition | Behavior |
|---|---|---|---|---|
| 3 | 2 | 2 | R + W > N | Consistent |
| 3 | 1 | 1 | R + W ≤ N | Stale reads possible |
| 5 | 3 | 3 | Strong consistency | High latency |
| 5 | 1 | 4 | Fast read, slower write | Available |
Complexity
| Operation | Time | Space |
|---|---|---|
| Write | \(O(W)\) | \(O(N)\) |
| Read | \(O(R)\) | \(O(N)\) |
| Recovery | \(O(N)\) | \(O(N)\) |
Quorum replication elegantly balances the impossible triangle of distributed systems, consistency, availability, and partition tolerance, by choosing how much agreement is enough.
893 Chain Replication
Chain Replication is a fault-tolerant replication technique designed for strong consistency and high throughput in distributed storage systems. It organizes replicas into a linear chain, where writes flow from head to tail and reads are served from the tail.
What Problem Are We Solving?
Traditional replication models either sacrifice consistency (as in asynchronous replication) or throughput (as in synchronous broadcast to all replicas). Chain replication provides both linearizability and high throughput by structuring replicas into a pipeline.
How Does It Work (Plain Language)
The replicas are ordered:
[Head] → [Middle] → [Tail]
Writes start at the head and are forwarded down the chain.
Each replica applies the update and forwards it.
When the tail applies the update, it acknowledges success to the client.
Reads go to the tail, ensuring clients always see the most recent committed state.
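Tiny Code (Python Sketch)
The text gives no code here, so this is a minimal sketch under the assumptions above: writes enter at the head and are forwarded link by link, the tail acknowledges, and reads are served by the tail.

class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None                  # successor in the chain

    def write(self, key, value):
        self.store[key] = value
        if self.next:                     # forward the update down the chain
            return self.next.write(key, value)
        return f"ack from tail {self.name}"   # tail commits and acknowledges

    def read(self, key):
        return self.store.get(key)        # reads go to the tail only

head, mid, tail = ChainNode("R1"), ChainNode("R2"), ChainNode("R3")
head.next, mid.next = mid, tail

print(head.write("x", 42))                # write enters at the head
print("read from tail:", tail.read("x"))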
A Gentle Proof (Why It Works)
All writes follow the same order, and reads are served only by the tail \(R_n\). Therefore, every read observes a prefix of the write sequence, satisfying linearizability.
Formally, if \(w_i\) completes before \(w_j\), then:
\[
w_i \text{ visible at all replicas before } w_j
\]
Hence, clients never see stale or out-of-order data.
Try It Yourself
Simulate a chain of 3 nodes and inject a write failure at the middle node.
Reconfigure the chain to bypass the failed node.
Measure throughput vs. synchronous replication to all nodes.
Extend to 5 nodes and observe write latency growth.
Test Cases
| Nodes | Writes/Second | Latency | Consistency |
|---|---|---|---|
| 3 | High | Moderate | Strong |
| 5 | Moderate | Higher | Strong |
| 3 (head failure) | Reconfig needed | Paused | Recoverable |
Complexity
| Operation | Time | Space |
|---|---|---|
| Write | \(O(n)\) | \(O(n)\) |
| Read | \(O(1)\) | \(O(n)\) |
| Reconfiguration | \(O(1)\) | \(O(1)\) |
Chain Replication turns replication into a well-ordered pipeline, each node a link in the chain of reliability, ensuring that data flows smoothly and consistently from start to finish.
894 Gossip Protocol
Gossip Protocol, also known as epidemic communication, is a decentralized mechanism for spreading information in distributed systems. Instead of a central coordinator, every node periodically “gossips” with random peers to exchange updates, like how rumors spread in a crowd.
What Problem Are We Solving?
In large, unreliable networks, we need a way for all nodes to eventually learn about new data, failures, or configuration changes. Broadcasting to every node is too costly and centralized coordination doesn’t scale. Gossip protocols achieve eventual consistency with probabilistic guarantees and minimal coordination.
How Does It Work (Plain Language)
Each node keeps a local state (like membership info, key-value pairs, or version vectors). At regular intervals:
A node randomly picks another node.
They exchange updates (push, pull, or both).
Each merges what it learns.
Repeat until all nodes converge to the same state.
After a few rounds, nearly all nodes in the system will have consistent information, similar to viral spread.
Gossip Styles
| Type | Description |
|---|---|
| Push | Send updates to a random peer. |
| Pull | Ask peers for missing updates. |
| Push–Pull | Exchange both ways, fastest convergence. |
Example: Membership Gossip
Nodes maintain a list of members with timestamps or heartbeats.
Each gossip round:
Node A → Node B:
{ NodeC: alive, NodeD: suspect }
Node B merges this information into its own list and gossips it further.
After several rounds, the entire cluster agrees on which nodes are alive or failed.
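Tiny Code (Python Simulation)
A minimal push-style sketch (node count, seeded value, and loop structure are arbitrary choices for illustration):

import random

N = 10
values = [0] * N
values[0] = 42                       # one node starts with the new value

def gossip_round():
    for i in range(N):
        if values[i] != 0:           # informed nodes push to one random peer
            values[random.randrange(N)] = values[i]

rounds = 0
while any(v == 0 for v in values):
    gossip_round()
    rounds += 1

print("converged after", rounds, "rounds:", values)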
This simple simulation shows that after a few rounds, all nodes converge to the same value.
Tiny Code (C Sketch)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 3

int state[N] = {0, 0, 0};

void gossip_round(void) {
    for (int i = 0; i < N; i++) {
        int peer = rand() % N;
        if (peer != i) state[peer] = state[i];
    }
}

int main(void) {
    srand(time(NULL));
    state[0] = 42;
    for (int i = 0; i < 5; i++) gossip_round();
    for (int i = 0; i < N; i++) printf("Node %d state: %d\n", i, state[i]);
}
Why It Matters
Scalable: Works with thousands of nodes.
Fault-tolerant: No single point of failure.
Probabilistic efficiency: Fast convergence with low network cost.
Widely used in: Cassandra, Dynamo, Redis Cluster, Akka, Serf, Consul.
Tradeoffs:
Convergence is probabilistic, not deterministic.
Possible temporary inconsistencies.
Gossip traffic can grow if update frequency is too high.
A Gentle Proof (Why It Works)
Let \(n\) be the number of nodes and \(t\) the number of gossip rounds. Each node randomly contacts another node per round.
The expected number of informed nodes grows exponentially: \[
I_t = n \left(1 - e^{-t/n}\right)
\]
After \(O(\log n)\) rounds, almost all nodes are informed, similar to how epidemics spread through a population. This gives eventual consistency with high probability.
Try It Yourself
Start with 10 nodes and let 1 node have new data.
Run gossip for several rounds; track convergence time.
Experiment with push-only vs push–pull.
Add random failures to test resilience.
Tune gossip interval to balance speed and bandwidth.
Test Cases
| Nodes | Gossip Style | Convergence Rounds | Reliability |
|---|---|---|---|
| 10 | Push | ~7 | 95% |
| 10 | Push–Pull | ~4 | 100% |
| 100 | Push–Pull | ~9 | 99% |
| 1000 | Push–Pull | ~14 | 99.9% |
Complexity
| Operation | Time | Messages |
|---|---|---|
| Per gossip round | \(O(1)\) | \(O(n)\) |
| Convergence | \(O(\log n)\) | \(O(n \log n)\) |
Gossip Protocol transforms chaos into harmony, each node sharing whispers until the entire system hums with the same truth.
895 Anti-Entropy Repair
Anti-Entropy Repair is a background process that keeps replicated data consistent across nodes in a distributed system. It detects and reconciles differences between replicas, ensuring eventual consistency even when updates or failures cause divergence.
What Problem Are We Solving?
In real distributed systems, nodes can miss updates due to:
Network partitions
Temporary outages
Message loss
Over time, replicas drift apart, their states diverge. Anti-entropy repair continuously compares replicas and syncs differences, restoring consistency without central coordination.
How Does It Work (Plain Language)
Each node periodically selects another node and performs a reconciliation exchange.
There are two main steps:
Detect divergence Compare data digests (hashes, version vectors, Merkle trees).
Repair differences Send and merge missing or newer data items.
This process continues in the background, slowly and continuously healing inconsistencies.
Example Flow
Node A ↔ Node B
Compare: digests or version vectors
If mismatch:
A → B: send missing keys
B → A: send newer values
Merge → both converge
After multiple rounds, all replicas reach the same versioned state.
Techniques Used
| Technique | Description |
|---|---|
| Merkle Trees | Hierarchical hash comparison for large datasets |
| Vector Clocks | Track causal order of updates |
| Timestamps | Choose latest version when conflicts occur |
| Version Merging | Combine conflicting writes if possible |
Tiny Code (Python Simulation)
from hashlib import md5

def digest(data):
    return md5("".join(sorted(data)).encode()).hexdigest()

A = {"x": "1", "y": "2"}
B = {"x": "1", "y": "3"}

def anti_entropy_repair(a, b):
    if digest(a.values()) != digest(b.values()):
        for k in a:
            if a[k] != b.get(k):
                if a[k] > b.get(k, ""):
                    b[k] = a[k]
                else:
                    a[k] = b[k]

anti_entropy_repair(A, B)
print("After repair:", A, B)
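For large datasets, hashing everything into one digest is wasteful. A hedged sketch of the Merkle-style idea (hash coarse buckets first, then diff only the buckets whose hashes disagree; the bucket count and hashing scheme here are arbitrary):

from hashlib import md5

def bucket_hashes(store, buckets=4):
    groups = [{} for _ in range(buckets)]
    for k, v in store.items():
        groups[hash(k) % buckets][k] = v          # coarse partition of the keyspace
    return [md5(repr(sorted(g.items())).encode()).hexdigest() for g in groups], groups

A = {"x": "1", "y": "2", "z": "9"}
B = {"x": "1", "y": "3", "z": "9"}

ha, ga = bucket_hashes(A)
hb, gb = bucket_hashes(B)
for i, (da, db) in enumerate(zip(ha, hb)):
    if da != db:                                  # only mismatched buckets need repair
        print("bucket", i, "differs:", ga[i], "vs", gb[i])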
Why It Matters
Heals eventual consistency: Keeps data synchronized after failures.
Autonomous: Each node repairs independently, without global coordination.
Bandwidth-efficient: Uses hashes (Merkle trees) to minimize data transfer.
Used in: Amazon Dynamo, Cassandra, Riak, and other AP systems.
Tradeoffs:
Background repairs consume bandwidth.
Conflicts require resolution logic.
Repair frequency affects freshness vs. overhead balance.
A Gentle Proof (Why It Works)
Let \(S_i\) and \(S_j\) be the data sets at nodes \(i\) and \(j\). At time \(t\), they may differ: \[
\Delta_{ij}(t) = S_i(t) \setminus S_j(t)
\]
Each anti-entropy session reduces divergence: \[
|\Delta_{ij}(t+1)| < |\Delta_{ij}(t)|
\]
Over repeated rounds, as long as the network eventually connects and repair continues: \[
\lim_{t \to \infty} \Delta_{ij}(t) = \emptyset
\]
Thus, all replicas converge, ensuring eventual consistency.
Try It Yourself
Simulate three replicas with inconsistent states.
Implement a repair round using simple digest comparison.
Add random failures between rounds.
Observe convergence over multiple iterations.
Extend to Merkle tree comparison for large datasets.
Test Cases
| Nodes | Repair Method | Converges In | Consistency |
|---|---|---|---|
| 2 | Direct diff | 1 round | Strong |
| 3 | Pairwise gossip | ~log(n) rounds | Eventual |
| 100 | Merkle trees | Few rounds | Eventual |
Complexity
| Operation | Time | Bandwidth |
|---|---|---|
| Digest comparison | \(O(n)\) | \(O(1)\) |
| Full repair | \(O(n)\) | \(O(n)\) |
| Merkle repair | \(O(\log n)\) | \(O(\log n)\) |
Anti-Entropy Repair acts as the quiet caretaker of distributed systems, steadily walking the network, comparing notes, and making sure every replica tells the same story once again.
896 Erasure Coding
Erasure Coding is a fault-tolerance technique that protects data against loss by breaking it into fragments and adding redundant parity blocks. Unlike simple replication, it achieves the same reliability with much lower storage overhead, making it a cornerstone of modern distributed storage systems.
What Problem Are We Solving?
Replication (keeping 3 or more copies of each block) guarantees durability but wastes space. Erasure coding provides a mathematically efficient alternative that maintains redundancy while using fewer extra bytes.
Goal: If part of the data is lost, the system can reconstruct the original from a subset of fragments.
How Does It Work (Plain Language)
Data is divided into k data blocks, and r parity blocks are generated using algebraic encoding. Together, these form n = k + r total fragments.
Any k out of n fragments can reconstruct the original data. Even if up to r fragments are lost, data remains recoverable.
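As a toy illustration (a single XOR parity block, so \(k = 3\) data blocks and \(r = 1\) parity; any one lost block can be rebuilt):

data_blocks = [0b1010, 0b0111, 0b1100]                      # k = 3 data blocks
parity = data_blocks[0] ^ data_blocks[1] ^ data_blocks[2]   # r = 1 parity block

lost_index = 1                                              # lose one data block
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
recovered = parity ^ survivors[0] ^ survivors[1]            # XOR the rest back together
print(bin(recovered))                                       # 0b111, equal to the lost 0b0111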
This is a toy example; real systems use linear algebra over finite fields (e.g., Reed–Solomon codes).
Tiny Code (C Sketch)
#include <stdio.h>

int parity(int *data, int n) {
    int p = 0;
    for (int i = 0; i < n; i++) p ^= data[i];
    return p;
}

int main(void) {
    int data[4] = {1, 2, 3, 4};
    int p = parity(data, 4);
    printf("Parity block: %d\n", p);
    return 0;
}
Why It Matters
Storage-efficient durability: 50–70% less overhead than replication.
Fault tolerance: Can recover from multiple failures.
Used in: Hadoop HDFS, Ceph, MinIO, Google Colossus, Azure Storage.
Tradeoffs:
Higher CPU cost for encoding/decoding.
Rebuilding lost data requires multiple fragments.
Latency increases during repair.
A Gentle Proof (Why It Works)
Erasure coding relies on linear independence of encoded fragments.
Let original data be a vector: \[
\mathbf{d} = [d_1, d_2, \dots, d_k]
\]
The encoded fragments are \(\mathbf{c} = \mathbf{d}\,G\), where \(G\) is a \(k \times n\) generator matrix. As long as any \(k\) columns of \(G\) are linearly independent, we can recover \(\mathbf{d}\) from any \(k\) surviving fragments \(\mathbf{c}'\) by inverting the corresponding \(k \times k\) submatrix \(G'\): \[
\mathbf{d} = \mathbf{c}' (G')^{-1}
\]
Thus, even with \(r = n - k\) missing fragments, recovery is guaranteed.
Try It Yourself
Create 4 data chunks, 2 parity chunks.
Delete any 2 randomly, reconstruct using the remaining 4.
Measure recovery time as system scales.
Compare storage efficiency with 3× replication.
Experiment with Reed–Solomon libraries in Python (pyreedsolomon or zfec).
Test Cases
| Scheme | Data (k) | Parity (r) | Tolerates | Overhead | Example Use |
|---|---|---|---|---|---|
| (3, 2) | 3 | 2 | 2 failures | 1.67× | Ceph |
| (6, 3) | 6 | 3 | 3 failures | 1.5× | MinIO |
| (10, 4) | 10 | 4 | 4 failures | 1.4× | Azure Storage |
| 3× Replication | 1 | 2 | 2 failures | 3× | Simple systems |
Complexity
| Operation | Time | Space |
|---|---|---|
| Encode | \(O(k \times r)\) | \(O(n)\) |
| Decode | \(O(k^3)\) (matrix inversion) | \(O(n)\) |
| Repair | \(O(k)\) | \(O(k)\) |
Erasure Coding turns mathematics into resilience, it weaves parity from data, allowing a system to lose pieces without ever losing the whole.
897 Checksum Verification
Checksum Verification is a lightweight integrity algorithm that detects data corruption during storage or transmission. It works by computing a compact numeric fingerprint (the checksum) of data and verifying it whenever the data is read or received.
What Problem Are We Solving?
When data moves across disks, memory, or networks, it can silently change due to:
Bit flips
Transmission noise
Hardware faults
Software bugs
Even a single wrong bit can make entire files invalid. Checksum verification ensures we can detect corruption quickly, often before it causes harm.
How Does It Work (Plain Language)
Compute a checksum for the data before sending or saving.
Store or transmit both the data and checksum.
Recompute and compare the checksum when reading or receiving.
If the two values differ → the data is corrupted.
Checksums use simple arithmetic, hash functions, or cyclic redundancy checks (CRC).
Common Algorithms
| Type | Description | Use Case |
|---|---|---|
| Sum / XOR | Adds or XORs all bytes | Fast, simple, low accuracy |
| CRC (Cyclic Redundancy Check) | Polynomial division over bits | Networking, filesystems |
| MD5 / SHA | Cryptographic hash | Secure verification |
| Fletcher / Adler | Weighted modular sums | Embedded systems |
Example
Data: "HELLO"
Compute simple checksum: \[
\text{sum} = H + E + L + L + O = 72 + 69 + 76 + 76 + 79 = 372
\]
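Tiny Code (Python)
A minimal sketch computing the simple byte sum above and a CRC32 from the standard library, then showing that one flipped bit changes the CRC:

import zlib

data = b"HELLO"
print("sum checksum:", sum(data))                 # 372
print("CRC32:", hex(zlib.crc32(data)))

corrupted = bytes([data[0] ^ 0x01]) + data[1:]    # flip one bit
print("CRC32 after bit flip:", hex(zlib.crc32(corrupted)))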
Why It Matters
Detects silent corruption on disk, memory, or network.
Protects storage systems (HDFS, ZFS, Ceph, S3).
Prevents undetected data drift in replication pipelines.
Simple to compute, easy to verify.
Tradeoffs:
Cannot repair data, only detect errors.
Weak checksums (like XOR) may miss certain patterns.
Cryptographic hashes cost more CPU time.
A Gentle Proof (Why It Works)
A checksum function \(f(x)\) maps data \(x\) to a compact signature.
If data \(x\) is corrupted to \(x'\), we detect the error when: \[
f(x) \ne f(x')
\]
If \(f\) distributes values uniformly across \(k\) bits, the probability of undetected corruption is approximately: \[
P_{\text{miss}} = 2^{-k}
\]
For example:
16-bit checksum → \(1/65536\) chance of miss
32-bit CRC → \(1/4,294,967,296\)
SHA-256 → effectively zero
Try It Yourself
Compute CRC32 for any file.
Flip a single byte and recompute, observe checksum change.
Try different algorithms (MD5, SHA-1, Adler-32).
Compare speed vs reliability.
Integrate checksum into a replication or transfer pipeline.
Test Cases
| Algorithm | Bits | Detection Rate | Typical Use |
|---|---|---|---|
| Sum | 8 | Low | Legacy systems |
| CRC32 | 32 | Excellent | Network packets |
| MD5 | 128 | Very high | File integrity |
| SHA-256 | 256 | Near perfect | Secure verification |
Complexity
| Operation | Time | Space |
|---|---|---|
| Compute | \(O(n)\) | \(O(1)\) |
| Verify | \(O(n)\) | \(O(1)\) |
Checksum Verification is the simplest form of data trust, a small number that quietly guards against invisible corruption, ensuring what you store or send is exactly what you get back.
898 Heartbeat Monitoring
Heartbeat Monitoring is a simple yet essential distributed algorithm for failure detection, it helps systems know which nodes are alive and which have silently failed. By periodically sending “heartbeat” signals between nodes, the system can quickly detect when one stops responding and trigger recovery or failover actions.
What Problem Are We Solving?
In distributed systems, nodes can fail for many reasons:
Power or network loss
Process crashes
Partition or congestion
Without explicit detection, the system might continue sending requests to a dead node or leave data unavailable. Heartbeat monitoring provides a lightweight, continuous liveness check to detect failures automatically.
How Does It Work (Plain Language)
Each node (or a central monitor) maintains a list of peers and timestamps for their last heartbeat. At fixed intervals:
A node sends a heartbeat message to peers (or a coordinator).
The receiver updates the timestamp of the sender.
If no heartbeat is received within a timeout, the node is marked as suspected failed.
Recovery or reconfiguration begins (e.g., elect a new leader, redistribute load).
Example Flow
Node A → heartbeat → Node B
Node B records: last_seen[A] = current_time
If current_time - last_seen[A] > timeout:
Node A considered failed
Tiny Code (Python Example)
import time, threading

peers = {"A": time.time(), "B": time.time(), "C": time.time()}

def send_heartbeat(name):
    while True:
        peers[name] = time.time()
        time.sleep(1)

def check_failures(timeout=3):
    while True:
        now = time.time()
        for node, last in list(peers.items()):
            if now - last > timeout:
                print(f"[Alert] Node {node} timed out!")
        time.sleep(1)

threading.Thread(target=send_heartbeat, args=("A",), daemon=True).start()
threading.Thread(target=check_failures, daemon=True).start()
time.sleep(10)
Why It Matters
Fast failure detection: enables automatic recovery.
Essential for leader election, replication, and load balancing.
Simple yet robust: works in all distributed architectures.
Used in: Kubernetes liveness probes, Raft, ZooKeeper, Redis Sentinel, Cassandra.
Tradeoffs:
Network jitter can cause false positives.
Choosing the right timeout is tricky (too short → flapping, too long → delay).
Doesn’t distinguish between crash vs. network partition without higher-level logic.
A Gentle Proof (Why It Works)
Let each node send heartbeats every \(\Delta\) seconds, and let the failure detector timeout be \(T\). If a node stops sending at \(t_f\), it is detected failed at: \[
t_d = t_f + T
\]
To avoid false alarms, the timeout must cover one heartbeat interval plus the worst-case network delay and jitter: \[
T \ge \Delta + \text{network delay bound} + \text{heartbeat jitter}
\]
Choosing \(T\) just above this bound keeps detection fast while ensuring completeness (no missed failures) and accuracy (few false alarms). Heartbeat algorithms are therefore classified as \(\phi\)-accrual detectors or eventually perfect failure detectors in the theory of failure detection.
Try It Yourself
Implement a cluster with 3 nodes using heartbeats every 2 seconds.
Introduce a random delay or packet loss to simulate network jitter.
Adjust the timeout threshold to balance sensitivity and stability.
Log when nodes are marked alive or failed.
Extend to elect a new leader upon failure detection.
Test Cases
| Heartbeat Interval (s) | Timeout (s) | Detection Delay | False Positive Rate |
|---|---|---|---|
| 1 | 3 | ~2s | Low |
| 1 | 1.5 | ~0.5s | Medium |
| 0.5 | 1 | ~0.5s | Higher |
| 2 | 6 | ~4s | Very Low |
Complexity
| Operation | Time | Space |
|---|---|---|
| Send heartbeat | \(O(1)\) | \(O(1)\) |
| Check failure | \(O(n)\) | \(O(n)\) |
| Network overhead | \(O(n)\) | – |
Heartbeat Monitoring is the pulse of distributed systems, a steady rhythm that tells us who’s still alive, who’s gone silent, and when it’s time for the system to heal itself.
899 Leader Election (Bully Algorithm)
The Bully Algorithm is a classic distributed algorithm for leader election, used to choose a coordinator among nodes in a cluster. It assumes all nodes have unique IDs and can communicate directly, the node with the highest ID becomes the new leader after failures.
What Problem Are We Solving?
Distributed systems often require one node to act as a leader or coordinator, managing tasks, assigning work, or ensuring consistency. When that leader fails, the system must elect a new one automatically.
The Bully Algorithm provides a deterministic and fault-tolerant method for leader election when nodes can detect crashes and compare identities.
How Does It Work (Plain Language)
Each node has a unique ID (often numeric). When a node notices that the leader is down, it starts an election:
The node sends ELECTION messages to all nodes with higher IDs.
If no higher node responds → it becomes the new leader.
If a higher node responds → it waits for the higher node to finish its own election.
The winner announces itself with a COORDINATOR message.
Example Flow
| Node | Action |
|---|---|
| Node 3 detects leader 5 is down | Sends ELECTION to nodes {4,5} |
| Node 4 replies "OK" | Node 3 stops its election |
| Node 4 now holds its own election | Sends to {5} |
| Node 5 is dead → no reply | Node 4 becomes leader |
| Node 4 broadcasts "COORDINATOR(4)" | All nodes update leader = 4 |
Tiny Code (Python Example)
nodes = [1, 2, 3, 4, 5]
alive = {1, 2, 3, 4}   # node 5 failed

def bully_election(start):
    higher = [n for n in nodes if n > start and n in alive]
    if not higher:
        print(f"Node {start} becomes leader")
        return start
    for h in higher:
        print(f"Node {start} → ELECTION → Node {h}")
    print(f"Node {start} waits...")
    return max(alive)

leader = bully_election(3)
print("Elected leader:", leader)
Why It Matters
Simple: requires only message exchange and ID comparison.
Fast recovery: quickly replaces failed leader.
Used in: legacy distributed systems, election phases of Raft or ZooKeeper, and fault-tolerant controllers.
Tradeoffs:
Requires reliable failure detection.
High message overhead for large clusters.
Assumes full connectivity and synchrony.
A Gentle Proof (Why It Works)
Let \(N\) be the set of nodes with unique IDs, and \(L = \max(N_{\text{alive}})\) the highest alive ID.
Node \(i\) detects leader failure.
It sends ELECTION to all \(j > i\).
If no \(j\) replies, then \(i = L\).
Otherwise, \(j\) initiates its own election, and since \(L\) is maximal, it eventually declares itself leader.
Hence, exactly one leader (the highest-ID node) is elected, satisfying: \[
\text{Safety: only one leader at a time}, \quad
\text{Liveness: eventually a leader is chosen}
\]
Try It Yourself
Simulate a cluster of 5 nodes with random failures.
Trigger elections and log message flow.
Measure time to converge.
Modify to use asynchronous timeouts.
Compare to Raft’s randomized election.
Test Cases
| Nodes | Failed Leader | Elected Leader | Messages Sent |
|---|---|---|---|
| {1,2,3,4,5} | 5 | 4 | 6 |
| {1,2,3,5} | 5 | 3 | 3 |
| {1,2,3,4,6} | 6 | 4 | 7 |
Complexity
| Operation | Time | Messages |
|---|---|---|
| Election | \(O(n)\) | \(O(n^2)\) |
| Announcement | \(O(n)\) | \(O(n)\) |
The Bully Algorithm ensures order in a distributed world, when silence falls, the highest voice rises to lead until the system heals again.
900 Leader Election (Ring Algorithm)
The Ring Algorithm is another approach to leader election in distributed systems, especially when nodes are organized in a logical ring. Unlike the Bully algorithm (which favors the highest ID node via direct messages), the Ring algorithm circulates election messages around the ring until a single leader emerges through cooperation.
What Problem Are We Solving?
In a distributed network with no central controller, nodes must elect a leader when the current one fails. The Ring algorithm is designed for:
Systems with ring or circular topologies
Symmetric communication (each node only knows its successor)
Situations where full broadcast or direct addressing is expensive
It ensures that all nodes participate equally and guarantees that the highest-ID node eventually becomes leader.
How Does It Work (Plain Language)
Topology: Each node knows only its immediate neighbor in a logical ring.
Election Start: A node noticing leader failure starts an election.
It sends an ELECTION message containing its ID to the next node.
Each node compares the received ID to its own:
If the received ID is higher → forward it unchanged.
If lower → replace with its own ID and forward.
If equal to its own ID → this node wins and broadcasts COORDINATOR.
All nodes update their leader to the announced winner.
Example Flow
Suppose nodes {1, 3, 4, 5, 7} arranged in a ring.
Node 3 detects leader failure.
ELECTION(3) → 4 → 5 → 7 → 1 → 3
Each node compares IDs.
Node 7 has highest ID → sends COORDINATOR(7)
All nodes accept 7 as leader.
Tiny Code (Python Example)
nodes = [1, 3, 4, 5, 7]

def ring_election(start):
    n = len(nodes)
    current = start
    candidate = nodes[start]
    while True:
        current = (current + 1) % n
        if nodes[current] > candidate:
            candidate = nodes[current]
        if current == start:
            break
    print(f"Leader elected: {candidate}")
    return candidate

ring_election(1)   # start from node 3
Why It Matters
Works naturally for ring-structured or overlay networks.
Reduces message complexity compared to full broadcasts.
Ensures fairness: all nodes can initiate elections equally.
Common in token-based systems (like the Token Ring protocol).
Tradeoffs:
Slower in large rings (must pass through all nodes).
Assumes reliable ring links.
Requires reformation if topology changes (node join/leave).
A Gentle Proof (Why It Works)
Let each node \(n_i\) have unique ID \(id_i\). During election, IDs circulate around the ring. Only the maximum ID survives each comparison:
\[
\max(id_1, id_2, \ldots, id_n)
\]
When the initiating node receives its own ID, it knows it is the maximum, and declares itself leader.
Safety: only one leader is chosen (the unique maximum ID). Liveness: the election terminates after at most \(n\) message hops.
Formally, message count ≤ \(2n - 1\) (one for election, one for coordinator announcement).
Try It Yourself
Simulate 5 nodes in a ring with random IDs.
Start election from different nodes, observe same result.
Introduce message loss and see how election restarts.
Measure number of messages vs ring size.
Compare with Bully algorithm in time and cost.
Test Cases
| Nodes | Start Node | Elected Leader | Messages Sent |
|---|---|---|---|
| {1, 3, 4, 5, 7} | 3 | 7 | 8 |
| {10, 20, 15, 5} | 0 | 20 | 8 |
| {2, 5, 8} | 1 | 8 | 6 |
Complexity
| Operation | Time | Messages |
|---|---|---|
| Election | \(O(n)\) | ≤ \(2n\) |
| Announcement | \(O(n)\) | \(O(n)\) |
The Ring Algorithm captures the cooperative rhythm of distributed systems, each node passes the message in turn, and through collective agreement, the system finds its strongest leader.