Scheduling

Scheduler & Preemption

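The scheduler serves requests in first-come, first-served (FCFS) order and applies priority-aware preemption. When GPU memory runs out, it picks a victim sequence and either swaps its KV cache to CPU memory or discards the cache and recomputes it later.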
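The victim selection described above can be sketched in a few lines. This is a toy model, not vLLM's scheduler: the `Seq` class, `pick_victim`, and `make_room` are hypothetical names, and the rule that a larger `priority` number means a more important sequence is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: int
    arrival: int   # smaller = arrived earlier (FCFS order)
    priority: int  # assumption: larger number = more important
    blocks: int    # KV cache blocks currently held on the GPU


def pick_victim(running):
    # Evict the least important sequence; among equals, evict the latest
    # arrival so that earlier requests keep their FCFS position.
    return min(running, key=lambda s: (s.priority, -s.arrival))


def make_room(running, free_blocks, needed_blocks):
    """Preempt sequences until `needed_blocks` fit.

    Returns (survivors, preempted, free_blocks).
    """
    running = list(running)
    preempted = []
    while free_blocks < needed_blocks and running:
        victim = pick_victim(running)
        running.remove(victim)
        preempted.append(victim)
        free_blocks += victim.blocks
    return running, preempted, free_blocks


running = [Seq(1, 0, 1, 3), Seq(2, 1, 0, 2), Seq(3, 2, 0, 2)]
survivors, preempted, free = make_room(running, free_blocks=0, needed_blocks=3)
print([s.seq_id for s in preempted], free)
```

What happens to each preempted sequence's blocks (swap or recompute) is a separate decision from who gets preempted, which is why the sketch only returns the victims.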

Swap preemption

The victim's KV cache blocks are copied from GPU to CPU memory over PCIe. When the sequence is rescheduled, the blocks are copied back to the GPU. Nothing has to be recomputed, but every preemption costs two PCIe transfers, plus CPU memory to hold the blocks in the meantime.
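To get a feel for the transfer cost, here is a back-of-the-envelope estimate. The model shape and the 16 GB/s PCIe figure are illustrative assumptions, not measurements, and the helper names are hypothetical. The estimate ignores transfer latency and any overlap with compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: each layer stores one key and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def swap_round_trip_seconds(cache_bytes, pcie_bytes_per_second):
    # Swap out (GPU to CPU) plus swap in (CPU to GPU).
    return 2 * cache_bytes / pcie_bytes_per_second


# Assumed shape: 32 layers, 32 KV heads of dimension 128, 16-bit values.
size = kv_cache_bytes(num_tokens=1024, num_layers=32, num_kv_heads=32, head_dim=128)
seconds = swap_round_trip_seconds(size, pcie_bytes_per_second=16e9)
print(f"{size / 2**20:.0f} MiB, {seconds * 1e3:.1f} ms round trip")
```

Under these assumptions a 1024-token sequence holds 512 MiB of KV cache, and the cost grows linearly with sequence length.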

Recompute preemption

The KV cache is simply discarded. When the sequence runs again, vLLM rebuilds the cache with a fresh prefill pass over the prompt, together with any tokens generated before the preemption. No extra memory is needed, but the compute is repeated, so this mode works best for short sequences.
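The bookkeeping for recompute preemption can be sketched as follows. This is a toy model with hypothetical names (`ToySequence`, `preempt_by_recompute`, `resume`), not vLLM code. It assumes a block size of 16 tokens and assumes that tokens generated before the preemption are replayed through prefill along with the prompt.

```python
import math
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)


@dataclass
class ToySequence:
    prompt_tokens: list
    output_tokens: list = field(default_factory=list)
    gpu_blocks: int = 0


def preempt_by_recompute(seq):
    """Drop the KV cache and return the number of blocks freed.

    Only the token ids are kept, which are tiny compared to the cache.
    """
    freed = seq.gpu_blocks
    seq.gpu_blocks = 0
    return freed


def resume(seq):
    """Rebuild the cache in one prefill pass; return tokens re-processed."""
    # Assumption: generated tokens are replayed together with the prompt.
    tokens = seq.prompt_tokens + seq.output_tokens
    seq.gpu_blocks = math.ceil(len(tokens) / BLOCK_SIZE)
    return len(tokens)


seq = ToySequence(prompt_tokens=list(range(40)), output_tokens=list(range(10)), gpu_blocks=4)
freed = preempt_by_recompute(seq)
reprocessed = resume(seq)
print(freed, reprocessed, seq.gpu_blocks)
```

The number of tokens returned by `resume` is the work that gets redone, which is why the cost of this mode scales with sequence length rather than with PCIe bandwidth.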