Tokenization

BPE Tokenizer

Before text reaches the transformer, it is split into subword tokens by Byte-Pair Encoding (BPE). Each token maps to an integer ID, which in turn indexes a dense embedding vector.

Example: "The quick brown fox" → 4 tokens (· marks a leading space)

    Pos  Token     ID
    0    "The"     1000
    1    "·quick"  2001
    2    "·brown"  2002
    3    "·fox"    2003
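The ID-to-embedding step can be sketched as a table lookup. This is a toy illustration: the IDs match the example above, but the vectors are random and the dimension is tiny (real models learn the table and use dimensions in the thousands).

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy dimension; real models use 1024+

# Toy embedding table: token ID -> dense vector (in a real model, a trained matrix)
embedding_table = {tid: [random.random() for _ in range(EMBED_DIM)]
                   for tid in [1000, 2001, 2002, 2003]}

token_ids = [1000, 2001, 2002, 2003]  # "The", "·quick", "·brown", "·fox"
embeddings = [embedding_table[tid] for tid in token_ids]
print(len(embeddings), len(embeddings[0]))  # 4 vectors, each of dimension 4
```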

How BPE works

1. Start with individual bytes as the vocabulary.

2. Find the most frequent adjacent pair in the training data.

3. Merge that pair into a new token. Add it to the vocab.

4. Repeat until the vocabulary reaches the target size (typically 32K–128K tokens).
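The four steps above can be sketched as a minimal training loop. This is an illustrative implementation, not any production tokenizer: it merges greedily left to right and ignores pre-tokenization, special tokens, and efficiency.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Step 2: count every adjacent pair and return the most frequent one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def bpe_train(text, num_merges):
    # Step 1: the initial vocabulary is individual bytes
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):  # step 4: repeat up to the merge budget
        if len(tokens) < 2:
            break
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        # Step 3: replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("aaabdaaabac", num_merges=2)
print(merges)  # first merge is ("a", "a") -> "aa", then ("aa", "a") -> "aaa"
print(tokens)
```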

Why this matters for vLLM

KV cache size: every token position stores a K and a V vector for every attention layer, so more tokens means more GPU memory.
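The per-token cost is easy to compute. The dimensions below are assumptions for a Llama-3-8B-like model (32 layers, 8 KV heads, head dim 128, fp16); substitute your model's actual config.

```python
# Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# Dims below are assumed, Llama-3-8B-like values; check your model's config.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2  # fp16
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token)  # 131072 bytes = 128 KiB per token
print(4096 * per_token // 2**20)  # a 4096-token request needs 512 MiB of KV cache
```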

Block allocation: vLLM allocates KV cache in fixed-size blocks. The token count determines how many blocks a request needs.
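The block count is a ceiling division over the token count. The block size of 16 below is vLLM's historical default, used here as an assumption:

```python
import math

BLOCK_SIZE = 16  # assumed tokens per block (vLLM's historical default)

def blocks_needed(num_tokens: int) -> int:
    # Each request gets ceil(tokens / block_size) fixed-size blocks;
    # the last block may be only partially filled.
    return math.ceil(num_tokens / BLOCK_SIZE)

print(blocks_needed(4))   # 1 -- the 4-token example above fits in one block
print(blocks_needed(17))  # 2 -- one full block plus one token spilling into a second
```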

Prefix sharing: shared prefixes are detected at the token level, so consistent tokenization is critical for cache reuse.
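Token-level prefix matching can be sketched as below. The token IDs reuse the example table above; note that vLLM actually reuses cache at block granularity, so only fully shared blocks are deduplicated.

```python
def shared_prefix_len(a, b):
    # Count how many leading token IDs two requests have in common
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

shared = [1000, 2001, 2002, 2003]  # "The quick brown fox" as a shared prompt prefix
req_a = shared + [3105]            # hypothetical continuation tokens
req_b = shared + [4200]
print(shared_prefix_len(req_a, req_b))  # 4 -- KV entries for these positions can be reused
```

If two clients tokenize the same prefix differently (different tokenizer version or normalization), the ID sequences diverge at position 0 and nothing is shared.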