Scheduling
Continuous Batching
Static batching wastes GPU cycles whenever a short request finishes early. Continuous batching fills those empty slots immediately, keeping the GPU busy.
[Interactive demo: a GPU with 3 batch slots at 100% utilization, iteration 0/20. Slots hold R1 "Explain quantum computing" (1/6 tokens), R2 "Write a haiku about AI" (1/4), and R3 "Translate to French" (1/8). Waiting: R4 "Summarize this paper" (5 tokens) and R5 "Code a binary search" (7 tokens). Completed: none yet.]
Continuous batching (vLLM)
After every iteration (one generated token per sequence), vLLM checks which sequences have finished. A finished sequence frees its slot immediately, and a waiting request slides in, so the GPU stays nearly full at all times. This is why vLLM can serve 2-4x more requests per second than static-batching systems.
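The scheduling loop above can be sketched in a few lines. This is a toy model, not vLLM's actual scheduler: each request is just a name and a total token count, one loop iteration stands in for one decode step, and the function name and slot count are illustrative assumptions.

```python
from collections import deque

def continuous_batching(requests, max_slots=3):
    """Toy continuous-batching scheduler.

    requests: iterable of (name, total_tokens) pairs.
    One while-loop pass = one decode iteration: every active
    sequence emits one token, finished sequences free their
    slot immediately, and waiting requests fill the gap before
    the next iteration.
    """
    waiting = deque(requests)
    active = {}       # name -> tokens still to generate
    completed = []
    iterations = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < max_slots:
            name, total = waiting.popleft()
            active[name] = total
        # One decode step: each active sequence generates one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]       # slot freed mid-batch,
                completed.append(name)  # not at batch boundary
        iterations += 1
    return iterations, completed
```

Running it with the five requests from the demo (R1=6, R2=4, R3=8, R4=5, R5=7 tokens) finishes in 13 iterations, whereas static batching would need max(6, 4, 8) + max(5, 7) = 15, since each batch waits for its longest request.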