Memory

KV Cache Management

Each attention layer caches the Key and Value tensors for every past token. vLLM maps a sequence's logical KV blocks to physical GPU memory blocks through a block table, just like a page table maps virtual pages to physical frames in an OS.

Physical KV cache blocks (snapshot from the interactive demo, 16 blocks total):

Block   Contents
0       seq_A K cache · 3 tok
1       seq_A V cache · 3 tok
2       seq_B K cache · 3 tok
3       seq_B V cache · 3 tok
4       seq_A K cache · 3 tok
5       seq_A V cache · 3 tok
6       seq_B K cache · 2 tok
7       seq_B V cache · 2 tok
8       seq_C K cache · 4 tok
9       seq_C V cache · 4 tok
10      seq_C K cache · 3 tok
11      seq_C V cache · 3 tok
12–15   free

12 Used Blocks · 4 Free Blocks · 0 Saved (Sharing) · 75% Utilization
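The bookkeeping behind the table above can be sketched in a few lines. This is a hedged toy model, not vLLM's actual API: the `BlockAllocator` and `Sequence` names are illustrative, and for simplicity one block here stands in for a K/V pair rather than the separate K and V blocks shown in the demo.

```python
# Toy sketch of paged KV cache bookkeeping (hypothetical names, not vLLM's API).

BLOCK_SIZE = 4  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    """Hands out physical block IDs from a fixed-size pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop(0)

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Tracks one sequence's block table: logical block index -> physical block."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full,
        # so memory grows in block-sized steps, not per token.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=16)
seq_a = Sequence(allocator)
for _ in range(6):              # 6 tokens -> ceil(6 / 4) = 2 blocks
    seq_a.append_token()
print(seq_a.block_table)        # -> [0, 1]
```

The point of the indirection is the same as in virtual memory: the blocks backing a sequence need not be contiguous, so there is no fragmentation penalty for sequences growing at different rates.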

Copy-on-Write (CoW)

When sequences share a KV cache block (same prompt prefix), vLLM tracks the block with a reference count. As long as nobody writes to the block, all sequences point at the same physical memory. The moment one sequence needs to write into a shared block (for example, it appends a different next token), vLLM copies the block first, decrements the original's ref count, and points the diverging sequence at the copy. Same trick your OS uses for fork().
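The ref-count dance above can be sketched as follows. Again a hedged toy model under assumed names (`PagedCache`, `copy_on_write` are illustrative, not vLLM's actual API), and the memcpy of the KV tensors themselves is elided:

```python
# Toy sketch of copy-on-write over shared KV blocks via reference counts
# (illustrative names, not vLLM's actual API).

class PagedCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # A forked sequence reuses the same physical block: just bump the count.
        self.ref_count[block] += 1
        return block

    def copy_on_write(self, block):
        # Sole owner: safe to write in place, no copy needed.
        if self.ref_count[block] == 1:
            return block
        # Shared: copy first, drop our reference to the original.
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # (real code would also copy the KV tensor contents here)
        return new_block

cache = PagedCache(num_blocks=8)
shared = cache.allocate()               # prompt-prefix block, owned by seq_A
cache.share(shared)                     # seq_B forks and shares it

b_block = cache.copy_on_write(shared)   # seq_B diverges -> gets a fresh copy
a_block = cache.copy_on_write(shared)   # seq_A is now sole owner -> no copy
print(b_block != shared, a_block == shared)  # -> True True
```

Note the second call takes the fast path: once seq_B has moved to its own copy, the ref count drops back to 1 and seq_A can write in place.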