Scheduling
Continuous Batching
Static batching wastes GPU cycles whenever a short request finishes early. Continuous batching fills those empty slots immediately, keeping the GPU busy.
[Interactive demo: a GPU with 3 batch slots at 100% utilization, iteration 0/20. Slots hold R1 "Explain quantum computing" (1/6 tokens), R2 "Write a haiku about AI" (1/4), and R3 "Translate to French" (1/8). Waiting: R4 "Summarize this paper" (5 tokens) and R5 "Code a binary search" (7 tokens). Completed: none yet.]
Continuous batching (vLLM)
After every iteration (one generated token per sequence), vLLM checks which sequences have finished. A finished sequence frees its slot immediately, and a waiting request slides in, so the GPU stays nearly full at all times. This is why vLLM can serve 2-4x more requests per second than static-batching systems.
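The scheduling loop above can be sketched in a few lines. This is a toy model, not vLLM's actual scheduler: each request is just a name and a total token count, one loop iteration stands in for one decode step, and the function name and slot count are illustrative assumptions.

```python
from collections import deque

def continuous_batching(requests, max_slots=3):
    """Toy continuous-batching scheduler.

    requests: iterable of (name, total_tokens) pairs.
    One while-loop pass = one decode iteration: every active
    sequence emits one token, finished sequences free their
    slot immediately, and waiting requests fill the gap before
    the next iteration.
    """
    waiting = deque(requests)
    active = {}       # name -> tokens still to generate
    completed = []
    iterations = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < max_slots:
            name, total = waiting.popleft()
            active[name] = total
        # One decode step: each active sequence generates one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]       # slot freed mid-batch,
                completed.append(name)  # not at batch boundary
        iterations += 1
    return iterations, completed
```

Running it with the five requests from the demo (R1=6, R2=4, R3=8, R4=5, R5=7 tokens) finishes in 13 iterations, whereas static batching would need max(6, 4, 8) + max(5, 7) = 15, since each batch waits for its longest request.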