Tokenization
BPE Tokenizer
Before text reaches the transformer, it is split into subword tokens via Byte-Pair Encoding (BPE). Each token maps to an integer ID, and each ID maps to a dense embedding vector.
Tokenizing "The quick brown fox" yields 4 tokens (· marks a leading space):
| Pos | Token | ID |
|---|---|---|
| 0 | "The" | 1000 |
| 1 | "·quick" | 2001 |
| 2 | "·brown" | 2002 |
| 3 | "·fox" | 2003 |
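The token → ID → embedding pipeline can be sketched in a few lines. This is a toy illustration, not a real tokenizer: the vocabulary mirrors the table above, and the `embed` function is a deterministic stand-in for a model's learned embedding matrix.

```python
# Toy vocabulary taken from the example table above (illustrative IDs).
vocab = {"The": 1000, "·quick": 2001, "·brown": 2002, "·fox": 2003}

def encode(tokens):
    # Map each subword token string to its integer ID.
    return [vocab[t] for t in tokens]

def embed(token_id, dim=4):
    # Stand-in for an embedding lookup: a real model indexes a learned
    # matrix of shape (vocab_size, hidden_dim) with the token ID.
    return [(token_id * (i + 1)) % 7 / 7.0 for i in range(dim)]

ids = encode(["The", "·quick", "·brown", "·fox"])
print(ids)  # [1000, 2001, 2002, 2003]
vectors = [embed(i) for i in ids]  # one dense vector per token position
```

In a real model, the embedding vectors are learned parameters; only the two-step mapping (string → ID → vector) is what matters here.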
How BPE works
1. Start with individual bytes as the vocabulary.
2. Find the most frequent adjacent pair in the training data.
3. Merge that pair into a new token. Add it to the vocab.
4. Repeat until you hit the target vocabulary size (typically 32K–128K tokens).
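The merge loop above can be sketched directly. This is a minimal sketch of the training procedure on a single character sequence (real BPE starts from bytes and counts pairs over a whole corpus); the corpus string and merge count are illustrative.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Step 2: count adjacent pairs and return the most frequent one.
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair, new_token):
    # Step 3: replace every occurrence of `pair` with the merged token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Step 1: start from individual characters (bytes, in real BPE).
tokens = list("low lower lowest")
for _ in range(3):  # Step 4: real training repeats to a target vocab size
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, pair[0] + pair[1])
print(tokens)  # frequent pairs like ("l","o") then ("lo","w") get merged
```

After a few iterations the common substring "low" becomes a single token, which is exactly how frequent subwords earn their own vocabulary entries.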
Why this matters for vLLM
- KV cache size: every token position stores a K and a V vector per layer, so more tokens means more GPU memory.
- Block allocation: vLLM allocates KV cache in fixed-size blocks. The token count determines how many blocks a request needs.
- Prefix sharing: shared prefixes are detected at the token level, so consistent tokenization is critical for cache reuse.
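The block-allocation arithmetic is simple to sketch. The model dimensions below are hypothetical, and the block size of 16 is vLLM's default; actual values depend on the model and engine configuration.

```python
import math

def blocks_needed(num_tokens, block_size=16):
    # vLLM allocates whole KV cache blocks, so round up.
    return math.ceil(num_tokens / block_size)

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # K and V vectors (factor of 2) at every layer, fp16 by default.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

print(blocks_needed(4))    # 1 block covers the 4-token example above
print(blocks_needed(100))  # 7 blocks

# Hypothetical 7B-class model dimensions, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # bytes of KV cache consumed per token position
```

This is why a verbose tokenization (more tokens for the same text) directly translates into more blocks and more GPU memory per request.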