Model Deep Dive

Llama 3.1: 8B Parameters

A concrete walkthrough of Meta's Llama 3.1-8B architecture: every dimension, every weight matrix, and exactly how much memory each piece consumes when served with vLLM.

At a glance:
Parameters: 8.03B
Layers: 32
Context Length: 128K
GQA Ratio: 4:1
Model Size (BF16): 16.1 GB
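The headline size follows directly from the parameter count: BF16 stores each weight in two bytes. A quick sanity check (decimal gigabytes, weights only, no KV cache or activations):

```python
params = 8.03e9      # total parameter count from the figure above
bf16_bytes = 2       # BF16 = 16 bits = 2 bytes per weight
print(f"{params * bf16_bytes / 1e9:.1f} GB")  # -> 16.1 GB of weights
```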
Core dimensions:
Hidden Dim: 4,096
FFN Dim: 14,336
Query Heads: 32
KV Heads: 8
Head Dim: 128
Vocab Size: 128,256
RoPE Base θ: 500,000
Activation: SiLU (SwiGLU)
Forward pass: Input Token IDs → × 32 transformer blocks → lm_head → Softmax → Next Token
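That flow maps onto a standard pre-norm decoder loop. A schematic sketch only; the module names (embed_tokens, attn, mlp, and so on) are illustrative stand-ins, not vLLM's or any library's actual API:

```python
def forward(token_ids, model):
    x = model.embed_tokens(token_ids)         # [seq_len, 4096] hidden states
    for block in model.layers:                # 32 identical decoder blocks
        h = x + block.attn(block.norm1(x))    # RMSNorm -> GQA attention (RoPE on Q/K), residual
        x = h + block.mlp(block.norm2(h))     # RMSNorm -> SwiGLU FFN (4096 -> 14336 -> 4096), residual
    x = model.final_norm(x)                   # final RMSNorm
    logits = model.lm_head(x)                 # project to the 128,256-token vocabulary
    return logits                             # softmax over these gives the next-token distribution
```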

Each KV head serves 4 query heads. The grouping below shows one attention block with all 32 Q heads grouped under the 8 KV heads.

GQA grouping (8 KV heads, 32 query heads):
KV 0 ← Q0–Q3 (Group 0) · KV 1 ← Q4–Q7 (Group 1) · KV 2 ← Q8–Q11 (Group 2) · KV 3 ← Q12–Q15 (Group 3) · KV 4 ← Q16–Q19 (Group 4) · KV 5 ← Q20–Q23 (Group 5) · KV 6 ← Q24–Q27 (Group 6) · KV 7 ← Q28–Q31 (Group 7)
4 Q heads share 1 KV head = 4× less KV cache.
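That 4× saving is easy to quantify. A small estimate of per-token KV cache in BF16, using the usual 2 × layers × kv_heads × head_dim × bytes accounting (vLLM's paged allocator adds block-level overhead on top of this):

```python
n_layers, head_dim, bf16_bytes = 32, 128, 2

def kv_bytes_per_token(n_kv_heads):
    # Keys and values cached for every layer: 2 * layers * heads * head_dim * bytes
    return 2 * n_layers * n_kv_heads * head_dim * bf16_bytes

gqa = kv_bytes_per_token(8)    # 131,072 bytes = 128 KiB per token with 8 KV heads
mha = kv_bytes_per_token(32)   # 524,288 bytes = 512 KiB per token if all 32 heads kept K/V
print(gqa, mha, mha // gqa)    # -> 131072 524288 4
# At the full 128K context, GQA still needs 131,072 tokens * 128 KiB ≈ 16 GiB per sequence.
```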