Understand how vLLM actually works.
A hands-on guide to the internals of vLLM, the serving engine powering some of the fastest LLM deployments. Click through each module to see PagedAttention, continuous batching, and the scheduler in action.
Transformer Architecture
How the multi-layer transformer actually works: attention heads, feed-forward blocks, and why layer norm matters.
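The pieces named above fit together in a few lines. Here is a minimal single-head sketch in NumPy (shapes and weight names are illustrative, not vLLM's actual kernels), showing the pre-norm residual layout: layer norm keeps activations in a stable range before each sub-block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean / unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # Scaled dot-product attention with a causal mask (single head,
    # for illustration; real models split this across many heads).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores += np.triu(np.full(scores.shape, -1e9), 1)  # mask future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm residual layout: x + Attn(LN(x)), then x + FFN(LN(x)).
    h = layer_norm(x)
    x = x + attention(h @ Wq, h @ Wk, h @ Wv) @ Wo
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2  # ReLU feed-forward block
    return x
```

A full model stacks this block dozens of times; only the weights differ per layer.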
PagedAttention
vLLM's core idea: borrowing virtual memory concepts from operating systems to manage attention KV data without waste.
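The page-table analogy can be sketched in a few lines of Python (class names are illustrative; vLLM's real block manager lives in CUDA-adjacent code, though its default block size really is 16 tokens). A sequence's KV cache grows one fixed-size block at a time, and a block table translates logical token positions to physical blocks, exactly like virtual-to-physical address translation:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Hands out physical KV-cache blocks from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one fills,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position to (physical block, offset),
        # just as a page table translates a virtual address.
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

The payoff is that blocks need not be contiguous, so memory is never reserved for a sequence's maximum possible length up front.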
KV Cache Management
Non-contiguous block storage, block tables, copy-on-write, and prefix sharing. All the tricks that cut memory usage.
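Copy-on-write and prefix sharing both fall out of one mechanism: reference counts on physical blocks. A hedged sketch (class and method names are hypothetical, not vLLM's API): forking a sequence just bumps refcounts on the parent's blocks, and a block is physically copied only when a writer no longer owns it exclusively.

```python
class RefCountedBlocks:
    """Toy block pool with refcounts for prefix sharing + copy-on-write."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, block_table):
        # Prefix sharing: the forked sequence reuses the parent's blocks
        # by reference; no KV data is copied at fork time.
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def write(self, block_table, i):
        # Copy-on-write: copy block i only if someone else still holds it.
        b = block_table[i]
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            block_table[i] = self.alloc()  # KV data would be copied here
        return block_table[i]

    def free_seq(self, block_table):
        for b in block_table:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                self.free.append(b)
```

This is why parallel sampling and shared system prompts are cheap: n continuations of one prompt share the prompt's blocks until they diverge.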
Continuous Batching
Why waiting for the slowest sequence in a batch is wasteful, and how iteration-level scheduling fixes it.
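The fix can be shown with a toy simulation (a sketch of the idea, not vLLM's scheduler): new sequences are admitted at every decode iteration, so a finished sequence's batch slot is reused immediately instead of idling until the whole batch drains.

```python
def continuous_batching(requests, max_batch=4):
    """requests: list of (name, tokens_to_generate). Returns, for each
    decode iteration, the names of the sequences in that iteration's batch."""
    waiting = list(requests)
    running = []
    timeline = []
    while waiting or running:
        # Iteration-level scheduling: admit work every step, not only
        # when the entire batch finishes (that would be static batching).
        while waiting and len(running) < max_batch:
            running.append(list(waiting.pop(0)))
        timeline.append([name for name, _ in running])
        for r in running:
            r[1] -= 1  # each iteration decodes one token per sequence
        running = [r for r in running if r[1] > 0]
    return timeline
```

With requests A (1 token), B (3), C (2) and a batch of 2, C takes over A's slot the moment A finishes, so no iteration runs below capacity while work is waiting.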
Scheduler & Preemption
FCFS scheduling with three queues (waiting, running, swapped) and how vLLM decides when to swap or recompute.
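The three-queue shape can be sketched as follows (a simplified model with hypothetical names, not vLLM's actual scheduler class). Sequences are admitted FCFS while KV blocks last; when memory runs out mid-decode, the most recently admitted sequence is preempted, and swapped sequences rejoin ahead of new arrivals because they already hold generated tokens.

```python
from collections import deque

class Scheduler:
    """Toy sketch of FCFS scheduling over waiting/running/swapped queues."""
    def __init__(self, total_blocks):
        self.waiting = deque()   # new requests, FCFS order
        self.running = []        # sequences decoding this iteration
        self.swapped = deque()   # preempted sequences, KV cache off-GPU
        self.free_blocks = total_blocks

    def add(self, seq_id, blocks_needed):
        self.waiting.append((seq_id, blocks_needed))

    def _admit(self, queue):
        while queue and queue[0][1] <= self.free_blocks:
            seq = queue.popleft()
            self.free_blocks -= seq[1]
            self.running.append(seq)

    def step(self):
        # Swapped sequences resume before new arrivals: dropping their
        # already-generated tokens would waste completed work.
        self._admit(self.swapped)
        self._admit(self.waiting)
        return [seq_id for seq_id, _ in self.running]

    def preempt(self):
        # Victim is the most recently admitted sequence, preserving FCFS
        # for the oldest; its blocks are freed and it moves to swapped
        # (vLLM can alternatively discard and recompute them later).
        seq = self.running.pop()
        self.free_blocks += seq[1]
        self.swapped.appendleft(seq)
```

The swap-versus-recompute choice trades PCIe transfer time against redoing prefill; vLLM picks based on block size and hardware.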
BPE Tokenizer
Watch byte-pair encoding in action. See how raw text gets broken into subword tokens the model can actually process.
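At its core, encoding with a trained BPE vocabulary is just replaying learned merges in priority order. A minimal sketch (real tokenizers like Llama's operate on bytes and use much faster data structures):

```python
def bpe_tokenize(text, merges):
    """merges: ordered list of symbol pairs, highest-priority first."""
    tokens = list(text)  # start from individual characters
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)  # apply the learned merge
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

With the merges `("l","o")` then `("lo","w")`, the string "low lower" collapses to `["low", " ", "low", "e", "r"]`: frequent character pairs fuse into subword units, and rare suffixes stay as smaller pieces.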
Llama 3.1 8B Deep Dive
Concrete architecture numbers for Meta's Llama 3.1 8B: GQA heads, RoPE scaling, memory footprint, and how vLLM serves it.
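One concrete number worth working out is the KV-cache footprint, since it drives everything PagedAttention manages. Using Llama 3.1 8B's published hyperparameters (32 layers, 8 KV heads under GQA, head dimension 128) at FP16:

```python
# Llama 3.1 8B architecture numbers (from Meta's model release)
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2

# Both K and V are cached at every layer for every token position.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
# = 131072 bytes, i.e. 128 KiB of KV cache per generated or prompt token

# Without GQA (all 32 query heads with their own K/V), it would be 4x:
mha_bytes_per_token = 2 * layers * 32 * head_dim * bytes_fp16
```

At 128 KiB per token, a single 8K-token context holds about 1 GiB of KV cache, which is why block-level allocation and sharing matter so much in practice.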
How vLLM fits together
Request flow from client to GPU. Each box corresponds to a section in the sidebar.