Core Innovation

PagedAttention

Treat GPU memory like an OS treats RAM: allocate fixed-size pages on demand, map them through a block table, and waste almost nothing.
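The block-table idea can be sketched in a few lines. This is a minimal illustration, not vLLM's actual code: the block size, the `BlockTable` class, and its methods are hypothetical names chosen for clarity. Each sequence keeps a small table mapping logical block indices to physical block IDs drawn from a shared free pool, and a new physical block is claimed only when the previous one fills up.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative, not vLLM's API).
BLOCK_SIZE = 16  # tokens per block; a hypothetical fixed page size

class BlockTable:
    def __init__(self, free_pool):
        self.free_pool = free_pool   # shared pool of free physical block IDs
        self.blocks = []             # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_pool.pop())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # Translate a logical token position to (physical block, offset).
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

pool = list(range(8))          # 8 physical blocks available
seq = BlockTable(pool)
for _ in range(20):            # write 20 tokens
    seq.append_token()
print(len(seq.blocks))         # 2 blocks allocated: ceil(20 / 16)
```

Because the table indirection hides physical addresses, the two blocks backing this sequence need not be adjacent in GPU memory, which is exactly what lets the allocator hand out whatever free block is available.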

[Interactive animation: tokens stream in one by one and fill fixed-size blocks (Block 0-7) on demand, with live counters for Tokens, Pages Used, Wasted Slots, and Fragmentation.]

The problem it solves

Traditional systems pre-allocate a contiguous KV cache sized for the maximum possible sequence length. A request that might reach 2048 tokens but only generates 100 still reserves all 2048 slots; across real workloads, this pattern wastes 60-80% of KV-cache memory.
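The arithmetic for the single request above is straightforward (the 2048/100 figures come from the example; the aggregate 60-80% figure depends on the mix of sequence lengths in a workload):

```python
# Waste under static pre-allocation, using the example request above.
max_len, used = 2048, 100
wasted = max_len - used
print(wasted)                      # 1948 idle slots reserved but never written
print(f"{wasted / max_len:.1%}")   # ~95% of this request's reservation is idle
```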

How paging fixes it

Instead of one big allocation, KV data is stored in small fixed-size blocks (pages). New pages are allocated only when needed, and they don't have to be contiguous in memory. Waste shrinks to at most the partially filled last page of each sequence, typically under 4% of the cache.
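The waste bound follows directly: only the last page of a sequence can be partially filled, so internal fragmentation is capped at one page minus one token per sequence. A small sketch, assuming a hypothetical 16-token page:

```python
def paged_waste(num_tokens, block_size=16):
    # ceil(num_tokens / block_size) pages; only the last can be partial.
    pages = -(-num_tokens // block_size)
    return pages * block_size - num_tokens

# Waste never exceeds block_size - 1 slots, regardless of sequence length.
for n in (100, 500, 2048):
    print(n, paged_waste(n))
```

For the 100-token request from the example above, paging uses 7 pages and wastes 12 slots, versus the 1948 slots a static 2048-slot reservation would leave idle. Averaged over many sequences, the last-page waste is what keeps total overhead in the low single digits.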