Core Innovation
PagedAttention
Treat GPU memory like an OS treats RAM: allocate fixed-size pages on demand, map them through a block table, and waste almost nothing.
[Interactive animation: the prompt "The quick brown fox jumps over the lazy dog and runs" is tokenized, and a logical block table maps each token into one of eight physical GPU blocks (all initially free). Counters track tokens generated, pages used, wasted slots, and fragmentation.]
The problem it solves
Traditional systems pre-allocate a contiguous KV cache for the maximum possible sequence length. A request that might reach 2048 tokens but only uses 100 still reserves all 2048 slots; across a real workload, this kind of over-reservation wastes 60-80% of KV-cache memory.
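The arithmetic for a single over-reserved request can be sketched like this (the 2048/100 numbers come from the example above; everything else is illustrative):

```python
# Contiguous pre-allocation: reserve for the worst case, use far less.
max_len = 2048        # slots reserved up front for the request
used = 100            # tokens the request actually produced
wasted = max_len - used

print(f"wasted slots: {wasted} ({wasted / max_len:.0%})")
# → wasted slots: 1948 (95%)
```

A single short request can waste even more than the 60-80% aggregate figure; the aggregate is lower only because some requests do run close to the maximum length.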
How paging fixes it
Instead of one big allocation, KV data is stored in small fixed-size blocks (pages). New pages are allocated only when needed, and they don't have to be contiguous. Waste drops to the last page per sequence, typically under 4%.
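The mechanism above can be sketched as a tiny block-table allocator. This is an illustrative toy, not vLLM's actual API; the class name, block size of 4, and pool of 8 physical blocks are all assumptions chosen to mirror the animation:

```python
BLOCK_SIZE = 4  # tokens per page; small for illustration (real systems use e.g. 16)

class PagedKVCache:
    """Toy block-table allocator: pages are grabbed on demand, never contiguously."""

    def __init__(self, num_physical_blocks):
        # Pool of free physical block ids, like the animation's "Block 0..7: free".
        self.free_blocks = list(range(num_physical_blocks))
        # Per-sequence logical block table: seq_id -> list of physical block ids.
        self.block_tables = {}

    def append_token(self, seq_id, token_index):
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new page only when the current one is full.
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop(0))
        return table[-1]  # physical block that stores this token's KV entry

cache = PagedKVCache(num_physical_blocks=8)
for i in range(10):          # a 10-token sequence
    cache.append_token("seq-0", i)

table = cache.block_tables["seq-0"]
print(table)                 # → [0, 1, 2]  (ceil(10/4) = 3 pages)
wasted = len(table) * BLOCK_SIZE - 10
print(wasted)                # → 2  (only the last page has empty slots)
```

Note the waste is bounded by one partially filled page per sequence, regardless of how far short of the maximum length the sequence stops.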