Memory

KV Cache Management

Each attention layer caches the Key and Value tensors for every past token. vLLM maps a sequence's logical KV blocks to physical GPU memory blocks through a block table, just like a page table maps virtual pages to physical frames in an OS.

Physical KV cache blocks (snapshot from the interactive demo, 16 blocks total):

Block   Contents
0       seq_A K cache · 3 tok
1       seq_A V cache · 3 tok
2       seq_B K cache · 3 tok
3       seq_B V cache · 3 tok
4       seq_A K cache · 3 tok
5       seq_A V cache · 3 tok
6       seq_B K cache · 2 tok
7       seq_B V cache · 2 tok
8       seq_C K cache · 4 tok
9       seq_C V cache · 4 tok
10      seq_C K cache · 3 tok
11      seq_C V cache · 3 tok
12–15   free

12 Used Blocks · 4 Free Blocks · 0 Saved (Sharing) · 75% Utilization
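The bookkeeping behind the table above can be sketched in a few lines. This is a hedged toy model, not vLLM's actual API: the `BlockAllocator` and `Sequence` names are illustrative, and for simplicity one block here stands in for a K/V pair rather than the separate K and V blocks shown in the demo.

```python
# Toy sketch of paged KV cache bookkeeping (hypothetical names, not vLLM's API).

BLOCK_SIZE = 4  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    """Hands out physical block IDs from a fixed-size pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop(0)

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Tracks one sequence's block table: logical block index -> physical block."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full,
        # so memory grows in block-sized steps, not per token.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=16)
seq_a = Sequence(allocator)
for _ in range(6):              # 6 tokens -> ceil(6 / 4) = 2 blocks
    seq_a.append_token()
print(seq_a.block_table)        # -> [0, 1]
```

The point of the indirection is the same as in virtual memory: the blocks backing a sequence need not be contiguous, so there is no fragmentation penalty for sequences growing at different rates.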

Copy-on-Write (CoW)

When sequences share a KV cache block (same prompt prefix), vLLM tracks the block with a reference count. As long as nobody writes to the block, all sequences point at the same physical memory. The moment one sequence needs to write into a shared block (for example, it appends a different next token), vLLM copies the block first, decrements the original's ref count, and points the diverging sequence at the copy. Same trick your OS uses for fork().
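The ref-count dance above can be sketched as follows. Again a hedged toy model under assumed names (`PagedCache`, `copy_on_write` are illustrative, not vLLM's actual API), and the memcpy of the KV tensors themselves is elided:

```python
# Toy sketch of copy-on-write over shared KV blocks via reference counts
# (illustrative names, not vLLM's actual API).

class PagedCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # A forked sequence reuses the same physical block: just bump the count.
        self.ref_count[block] += 1
        return block

    def copy_on_write(self, block):
        # Sole owner: safe to write in place, no copy needed.
        if self.ref_count[block] == 1:
            return block
        # Shared: copy first, drop our reference to the original.
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # (real code would also copy the KV tensor contents here)
        return new_block

cache = PagedCache(num_blocks=8)
shared = cache.allocate()               # prompt-prefix block, owned by seq_A
cache.share(shared)                     # seq_B forks and shares it

b_block = cache.copy_on_write(shared)   # seq_B diverges -> gets a fresh copy
a_block = cache.copy_on_write(shared)   # seq_A is now sole owner -> no copy
print(b_block != shared, a_block == shared)  # -> True True
```

Note the second call takes the fast path: once seq_B has moved to its own copy, the ref count drops back to 1 and seq_A can write in place.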