Scheduling

Scheduler & Preemption

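The scheduler admits requests in first-come, first-served (FCFS) order and uses priority to decide which sequences to preempt. When GPU memory runs out, it can either swap a victim's KV cache to CPU memory or discard the cache and recompute it later.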

[Interactive simulation: 18 steps, 8 GPU blocks. Initial state: five sequence groups in the Waiting queue (SG1 to SG5, tagged P1, P2, P3, P4, and P0); the Running, Swapped, and Done queues are empty.]
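The policy can be sketched as a toy loop over the simulation's starting state (8 GPU blocks, five waiting groups). This is not vLLM's scheduler: the `SeqGroup` type, the `schedule` function, the rule that a larger priority number wins, and the 3 blocks per group are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SeqGroup:
    name: str
    priority: int  # assumption: larger number = more important
    blocks: int    # assumption: KV cache blocks the group needs


def schedule(waiting, running, free_blocks):
    """Admit groups in FCFS order, preempting lower-priority victims if needed.

    Returns (free_blocks, preempted); `waiting` and `running` are mutated.
    """
    preempted = []
    while waiting:
        head = waiting[0]
        # While the head does not fit, evict the lowest-priority running
        # group, but only if it ranks strictly below the head.
        while head.blocks > free_blocks and running:
            victim = min(running, key=lambda g: g.priority)
            if victim.priority >= head.priority:
                break
            running.remove(victim)
            preempted.append(victim)
            free_blocks += victim.blocks
        if head.blocks > free_blocks:
            break  # FCFS: later arrivals do not jump ahead of a blocked head
        running.append(waiting.popleft())
        free_blocks -= head.blocks
    return free_blocks, preempted


# Priorities follow the P-tags shown in the simulation: P1, P2, P3, P4, P0.
waiting = deque(
    SeqGroup(f"SG{i}", p, blocks=3) for i, p in zip(range(1, 6), [1, 2, 3, 4, 0])
)
running = []
free, preempted = schedule(waiting, running, free_blocks=8)
print([g.name for g in running], [g.name for g in preempted], free)
```

Under these assumptions SG1 and SG2 are admitted first, then evicted in turn to make room for SG3 and SG4, while SG5 stays in the waiting queue because nothing running ranks below it. What happens to each evicted group depends on the preemption mode.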

Swap preemption

The victim's KV cache blocks are copied from GPU to CPU memory over PCIe. When the sequence is rescheduled, the blocks are copied back to the GPU. All generated tokens are preserved, but each swap costs PCIe transfer bandwidth, once on the way out and once on the way back in.
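To get a feel for the transfer cost, here is a back-of-the-envelope estimate. Every number is an assumption chosen for illustration (a 32-layer model with 32 KV heads of dimension 128, 16-token blocks, 2-byte values, and an effective PCIe rate of 25 GB/s); the helper names `kv_block_bytes` and `swap_seconds` are hypothetical.

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_size, dtype_bytes):
    """Bytes held by one KV cache block: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def swap_seconds(num_blocks, block_bytes, pcie_bytes_per_s):
    """One-way transfer time, ignoring latency and launch overhead."""
    return num_blocks * block_bytes / pcie_bytes_per_s


# Assumed model shape and link speed; substitute real values for your setup.
block = kv_block_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       block_size=16, dtype_bytes=2)
one_way = swap_seconds(num_blocks=4, block_bytes=block, pcie_bytes_per_s=25e9)
round_trip = 2 * one_way  # swap out, then swap back in
print(f"{block / 2**20:.0f} MiB per block, {round_trip * 1e3:.2f} ms round trip")
```

The cost grows linearly with the number of blocks the victim holds, so long sequences are the expensive ones to swap.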

Recompute preemption

The KV cache is simply discarded. When the sequence runs again, vLLM re-processes the prompt from scratch (prefill) to rebuild the cache. No extra memory is needed, but the compute is repeated, so this works best for short prompts.
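The two modes differ mainly in block accounting, which the following toy model tries to make concrete. It is a sketch under stated assumptions, not vLLM's code: `Mode`, `Seq`, `Pools`, `preempt`, and `resume` are hypothetical names, and the block counts are arbitrary.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    SWAP = auto()
    RECOMPUTE = auto()


@dataclass
class Seq:
    gpu_blocks: int = 0
    cpu_blocks: int = 0
    needs_prefill: bool = False


@dataclass
class Pools:
    gpu_free: int
    cpu_free: int


def preempt(seq, pools, mode):
    """Release the victim's GPU blocks using the given mode."""
    n = seq.gpu_blocks
    if mode is Mode.SWAP:
        if pools.cpu_free < n:
            raise MemoryError("not enough CPU blocks to swap out")
        pools.cpu_free -= n
        seq.cpu_blocks = n        # cache contents survive in CPU memory
    else:
        seq.needs_prefill = True  # cache is dropped and must be rebuilt
    seq.gpu_blocks = 0
    pools.gpu_free += n


def resume(seq, pools, blocks_needed):
    """Bring a preempted sequence back; returns the kind of work required."""
    if pools.gpu_free < blocks_needed:
        raise MemoryError("not enough GPU blocks to resume")
    pools.gpu_free -= blocks_needed
    seq.gpu_blocks = blocks_needed
    if seq.needs_prefill:
        seq.needs_prefill = False
        return "prefill"
    pools.cpu_free += seq.cpu_blocks
    seq.cpu_blocks = 0
    return "swap_in"


pools = Pools(gpu_free=0, cpu_free=4)
a, b = Seq(gpu_blocks=3), Seq(gpu_blocks=3)
preempt(a, pools, Mode.SWAP)       # uses 3 CPU blocks
preempt(b, pools, Mode.RECOMPUTE)  # uses no CPU blocks
after_preempt = (pools.gpu_free, pools.cpu_free)
work = (resume(a, pools, 3), resume(b, pools, 3))
print(after_preempt, work, (pools.gpu_free, pools.cpu_free))
```

Both modes free the same number of GPU blocks. Swapping occupies CPU blocks until the sequence returns, while recomputing occupies nothing and instead defers the cost to a later prefill.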