Understand how vLLM actually works.
A hands-on guide to the internals of vLLM, the serving engine powering some of the fastest LLM deployments. Click through each module to see PagedAttention, continuous batching, and the scheduler in action.
Transformer Architecture
How the multi-layer transformer actually works: attention heads, feed-forward blocks, and why layer norm matters.
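The pieces named above fit together in a few lines. Here is a minimal single-head sketch in NumPy (shapes and weight names are illustrative, not vLLM's actual kernels), showing the pre-norm residual layout: layer norm keeps activations in a stable range before each sub-block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean / unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # Scaled dot-product attention with a causal mask (single head,
    # for illustration; real models split this across many heads).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores += np.triu(np.full(scores.shape, -1e9), 1)  # mask future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm residual layout: x + Attn(LN(x)), then x + FFN(LN(x)).
    h = layer_norm(x)
    x = x + attention(h @ Wq, h @ Wk, h @ Wv) @ Wo
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2  # ReLU feed-forward block
    return x
```

A full model stacks this block dozens of times; only the weights differ per layer.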
PagedAttention
vLLM's core idea: borrowing virtual memory concepts from operating systems to manage attention KV data without waste.
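The page-table analogy can be sketched in a few lines of Python (class names are illustrative; vLLM's real block manager lives in CUDA-adjacent code, though its default block size really is 16 tokens). A sequence's KV cache grows one fixed-size block at a time, and a block table translates logical token positions to physical blocks, exactly like virtual-to-physical address translation:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Hands out physical KV-cache blocks from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one fills,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position to (physical block, offset),
        # just as a page table translates a virtual address.
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

The payoff is that blocks need not be contiguous, so memory is never reserved for a sequence's maximum possible length up front.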
KV Cache Management
Non-contiguous block storage, block tables, copy-on-write, and prefix sharing. All the tricks that cut memory usage.
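Copy-on-write and prefix sharing both fall out of one mechanism: reference counts on physical blocks. A hedged sketch (class and method names are hypothetical, not vLLM's API): forking a sequence just bumps refcounts on the parent's blocks, and a block is physically copied only when a writer no longer owns it exclusively.

```python
class RefCountedBlocks:
    """Toy block pool with refcounts for prefix sharing + copy-on-write."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, block_table):
        # Prefix sharing: the forked sequence reuses the parent's blocks
        # by reference; no KV data is copied at fork time.
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def write(self, block_table, i):
        # Copy-on-write: copy block i only if someone else still holds it.
        b = block_table[i]
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            block_table[i] = self.alloc()  # KV data would be copied here
        return block_table[i]

    def free_seq(self, block_table):
        for b in block_table:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                self.free.append(b)
```

This is why parallel sampling and shared system prompts are cheap: n continuations of one prompt share the prompt's blocks until they diverge.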
Continuous Batching
Why waiting for the slowest sequence in a batch is wasteful, and how iteration-level scheduling fixes it.
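The fix can be shown with a toy simulation (a sketch of the idea, not vLLM's scheduler): new sequences are admitted at every decode iteration, so a finished sequence's batch slot is reused immediately instead of idling until the whole batch drains.

```python
def continuous_batching(requests, max_batch=4):
    """requests: list of (name, tokens_to_generate). Returns, for each
    decode iteration, the names of the sequences in that iteration's batch."""
    waiting = list(requests)
    running = []
    timeline = []
    while waiting or running:
        # Iteration-level scheduling: admit work every step, not only
        # when the entire batch finishes (that would be static batching).
        while waiting and len(running) < max_batch:
            running.append(list(waiting.pop(0)))
        timeline.append([name for name, _ in running])
        for r in running:
            r[1] -= 1  # each iteration decodes one token per sequence
        running = [r for r in running if r[1] > 0]
    return timeline
```

With requests A (1 token), B (3), C (2) and a batch of 2, C takes over A's slot the moment A finishes, so no iteration runs below capacity while work is waiting.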
Scheduler & Preemption
FCFS scheduling with three queues (waiting, running, swapped) and how vLLM decides when to swap or recompute.
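The three-queue shape can be sketched as follows (a simplified model with hypothetical names, not vLLM's actual scheduler class). Sequences are admitted FCFS while KV blocks last; when memory runs out mid-decode, the most recently admitted sequence is preempted, and swapped sequences rejoin ahead of new arrivals because they already hold generated tokens.

```python
from collections import deque

class Scheduler:
    """Toy sketch of FCFS scheduling over waiting/running/swapped queues."""
    def __init__(self, total_blocks):
        self.waiting = deque()   # new requests, FCFS order
        self.running = []        # sequences decoding this iteration
        self.swapped = deque()   # preempted sequences, KV cache off-GPU
        self.free_blocks = total_blocks

    def add(self, seq_id, blocks_needed):
        self.waiting.append((seq_id, blocks_needed))

    def _admit(self, queue):
        while queue and queue[0][1] <= self.free_blocks:
            seq = queue.popleft()
            self.free_blocks -= seq[1]
            self.running.append(seq)

    def step(self):
        # Swapped sequences resume before new arrivals: dropping their
        # already-generated tokens would waste completed work.
        self._admit(self.swapped)
        self._admit(self.waiting)
        return [seq_id for seq_id, _ in self.running]

    def preempt(self):
        # Victim is the most recently admitted sequence, preserving FCFS
        # for the oldest; its blocks are freed and it moves to swapped
        # (vLLM can alternatively discard and recompute them later).
        seq = self.running.pop()
        self.free_blocks += seq[1]
        self.swapped.appendleft(seq)
```

The swap-versus-recompute choice trades PCIe transfer time against redoing prefill; vLLM picks based on block size and hardware.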
BPE Tokenizer
Watch byte-pair encoding in action. See how raw text gets broken into subword tokens the model can actually process.
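At its core, encoding with a trained BPE vocabulary is just replaying learned merges in priority order. A minimal sketch (real tokenizers like Llama's operate on bytes and use much faster data structures):

```python
def bpe_tokenize(text, merges):
    """merges: ordered list of symbol pairs, highest-priority first."""
    tokens = list(text)  # start from individual characters
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)  # apply the learned merge
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

With the merges `("l","o")` then `("lo","w")`, the string "low lower" collapses to `["low", " ", "low", "e", "r"]`: frequent character pairs fuse into subword units, and rare suffixes stay as smaller pieces.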
Llama 3.1 8B Deep Dive
Concrete architecture numbers for Meta's Llama 3.1 8B: GQA heads, RoPE scaling, memory footprint, and how vLLM serves it.
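One concrete number worth working out is the KV-cache footprint, since it drives everything PagedAttention manages. Using Llama 3.1 8B's published hyperparameters (32 layers, 8 KV heads under GQA, head dimension 128) at FP16:

```python
# Llama 3.1 8B architecture numbers (from Meta's model release)
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2

# Both K and V are cached at every layer for every token position.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
# = 131072 bytes, i.e. 128 KiB of KV cache per generated or prompt token

# Without GQA (all 32 query heads with their own K/V), it would be 4x:
mha_bytes_per_token = 2 * layers * 32 * head_dim * bytes_fp16
```

At 128 KiB per token, a single 8K-token context holds about 1 GiB of KV cache, which is why block-level allocation and sharing matter so much in practice.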
How vLLM fits together
Request flow from client to GPU. Each box corresponds to a section in the sidebar.