Core Innovation

PagedAttention

Treat GPU memory like an OS treats RAM: allocate fixed-size pages on demand, map them through a block table, and waste almost nothing.
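The block-table idea can be sketched in a few lines. This is a minimal illustration, not vLLM's actual code: the block size, the `BlockTable` class, and its methods are hypothetical names chosen for clarity. Each sequence keeps a small table mapping logical block indices to physical block IDs drawn from a shared free pool, and a new physical block is claimed only when the previous one fills up.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative, not vLLM's API).
BLOCK_SIZE = 16  # tokens per block; a hypothetical fixed page size

class BlockTable:
    def __init__(self, free_pool):
        self.free_pool = free_pool   # shared pool of free physical block IDs
        self.blocks = []             # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_pool.pop())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # Translate a logical token position to (physical block, offset).
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

pool = list(range(8))          # 8 physical blocks available
seq = BlockTable(pool)
for _ in range(20):            # write 20 tokens
    seq.append_token()
print(len(seq.blocks))         # 2 blocks allocated: ceil(20 / 16)
```

Because the table indirection hides physical addresses, the two blocks backing this sequence need not be adjacent in GPU memory, which is exactly what lets the allocator hand out whatever free block is available.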

[Interactive animation: tokens stream in one by one and fill fixed-size blocks (Block 0-7) on demand, with live counters for Tokens, Pages Used, Wasted Slots, and Fragmentation.]

The problem it solves

Traditional systems pre-allocate a contiguous KV cache sized for the maximum possible sequence length. A request that might reach 2048 tokens but only generates 100 still reserves all 2048 slots; across real workloads, this pattern wastes 60-80% of KV-cache memory.
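The arithmetic for the single request above is straightforward (the 2048/100 figures come from the example; the aggregate 60-80% figure depends on the mix of sequence lengths in a workload):

```python
# Waste under static pre-allocation, using the example request above.
max_len, used = 2048, 100
wasted = max_len - used
print(wasted)                      # 1948 idle slots reserved but never written
print(f"{wasted / max_len:.1%}")   # ~95% of this request's reservation is idle
```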

How paging fixes it

Instead of one big allocation, KV data is stored in small fixed-size blocks (pages). New pages are allocated only when needed, and they don't have to be contiguous in memory. Waste shrinks to at most the partially filled last page of each sequence, typically under 4% of the cache.
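The waste bound follows directly: only the last page of a sequence can be partially filled, so internal fragmentation is capped at one page minus one token per sequence. A small sketch, assuming a hypothetical 16-token page:

```python
def paged_waste(num_tokens, block_size=16):
    # ceil(num_tokens / block_size) pages; only the last can be partial.
    pages = -(-num_tokens // block_size)
    return pages * block_size - num_tokens

# Waste never exceeds block_size - 1 slots, regardless of sequence length.
for n in (100, 500, 2048):
    print(n, paged_waste(n))
```

For the 100-token request from the example above, paging uses 7 pages and wastes 12 slots, versus the 1948 slots a static 2048-slot reservation would leave idle. Averaged over many sequences, the last-page waste is what keeps total overhead in the low single digits.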