How a Transformer Works
Every modern LLM (ChatGPT, Llama, Mistral) is built from the same basic building blocks stacked on top of each other. Click each layer below to understand what it does.
The Big Picture
A transformer takes in text, breaks it into tokens, processes them through many identical layers, and predicts the next token. That's it. The entire magic of ChatGPT, Llama, and every other LLM comes from repeating this simple loop, one token at a time.
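The whole pipeline above can be sketched in a few lines. Everything here is a toy assumption — the random weights, the tiny dimensions, and the placeholder `layer` function stand in for a real trained model — but the flow (token IDs → embeddings → stacked layers → next-token scores) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 16, 4     # toy sizes, not a real model

embed = rng.normal(size=(vocab, d_model))    # token ID -> vector
unembed = rng.normal(size=(d_model, vocab))  # vector -> score per token

def layer(x):
    # stand-in for a real transformer layer (attention + FFN)
    return x + 0.1 * np.tanh(x)

token_ids = [5, 17, 42]              # "text broken into tokens"
x = embed[token_ids]                 # (3, d_model)
for _ in range(n_layers):            # many identical layers, stacked
    x = layer(x)
logits = x[-1] @ unembed             # scores for every possible next token
next_id = int(np.argmax(logits))     # the predicted next token
```

Appending `next_id` to `token_ids` and running again is the "one token at a time" loop.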
Inside Each Layer
What Happens Inside One Layer
Each layer follows the same pattern: normalize the data, run attention to gather context from other tokens, add the original back (a skip connection), normalize again, run the FFN to process it, and add the original back again. This repeats once per layer — 32 times in Llama 8B.
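The pattern reads naturally as code. Here is a minimal sketch with a tiny single-head attention and a plain ReLU MLP standing in for the real (multi-head, much larger) versions; all dimensions and random weights are toy assumptions. The two `x + ...` lines are the skip connections.

```python
import numpy as np

d = 8  # toy model width

def rmsnorm(x):
    # normalize each token vector to unit RMS
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)

def attention(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores += np.triu(np.full(scores.shape, -1e9), k=1)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                        # softmax
    return (w @ v) @ Wo

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2  # simple MLP stand-in

def block(x, p):
    x = x + attention(rmsnorm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])  # skip 1
    x = x + ffn(rmsnorm(x), p["W1"], p["W2"])                          # skip 2
    return x

rng = np.random.default_rng(0)
params = {name: rng.normal(scale=0.1,
                           size=(d, 4 * d) if name == "W1"
                           else (4 * d, d) if name == "W2"
                           else (d, d))
          for name in ["Wq", "Wk", "Wv", "Wo", "W1", "W2"]}
x = rng.normal(size=(5, d))   # 5 tokens in
out = block(x, params)        # 5 tokens out, same shape
```

Stacking the model is then just calling `block` 32 times with 32 different parameter sets.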
Why are there so many layers?
Each layer refines the model's understanding a little bit more. Early layers tend to learn basic things like grammar and word relationships. Middle layers build up meaning and context. Later layers make high-level decisions about what to say next.
Stacking 32 (or 80, or 126) layers gives the model enough depth to go from raw words to genuinely understanding what you're asking.
What are attention heads?
A single attention head can only focus on one kind of relationship at a time. With multiple heads (32 in Llama 8B), the model can track many things simultaneously: grammar, meaning, word proximity, coreference, and more.
Grouped-Query Attention (GQA) is a memory-saving trick: instead of giving every head its own Key and Value, several query heads share the same ones. Llama 8B groups 4 query heads per KV pair, cutting the KV cache's memory footprint by 4× with minimal quality loss.
Why skip connections matter
Without skip connections, information would have to survive passing through dozens of layers of transformations. In practice, it gets distorted and lost — and gradients struggle to flow back to early layers — making deep models nearly impossible to train.
Skip connections create a shortcut: the original data always makes it through unchanged, and each layer just adds small corrections on top. This is what makes deep models practical.
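A toy experiment makes this concrete. Below, the same weak random transformation is applied 32 times, once without and once with a skip connection; the numbers and weight scale are arbitrary assumptions, chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.2, size=(8, 8))  # a weak "layer" transformation

x = rng.normal(size=8)
plain, skipped = x.copy(), x.copy()
for _ in range(32):
    plain = np.tanh(plain @ W)                # no shortcut: signal shrinks away
    skipped = skipped + np.tanh(skipped @ W)  # shortcut: original survives
```

After 32 rounds, the no-skip signal has collapsed toward zero, while the skip version keeps its scale — each round only added a small correction on top of the original.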
How does the model generate text?
One token at a time. The model reads everything so far, predicts the most likely next token, appends it to the input, and repeats. A 100-word response is roughly 130 tokens (words often split into more than one token), so the model ran ~130 full passes — once per token.
This is why generation feels slower than reading your prompt: the prompt can be processed all at once (parallel), but each new token must wait for the previous one (sequential).
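The loop itself is tiny. In this sketch, `model` is a hypothetical stand-in (a fixed lookup table) for the full transformer forward pass; the point is the structure — each new token requires a complete pass over everything generated so far.

```python
def model(tokens: list[str]) -> str:
    # stand-in for the full transformer forward pass (toy lookup table)
    table = {"hello": "world", "world": "<eos>"}
    return table.get(tokens[-1], "<eos>")

def generate(prompt: list[str], max_new: int = 10) -> list[str]:
    tokens = list(prompt)        # the prompt is processed in one pass
    for _ in range(max_new):     # each new token waits on the previous one
        nxt = model(tokens)      # one full forward pass per token
        if nxt == "<eos>":
            break
        tokens.append(nxt)       # append the prediction, feed it back in
    return tokens

print(generate(["hello"]))       # -> ['hello', 'world']
```

The sequential `for` loop is exactly why decoding can't be parallelized the way prompt processing can.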
Where Do the Parameters Live?
A model's billions of parameters aren't spread evenly. The FFN layers hold the majority; they're where the model stores most of its learned knowledge.
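A quick back-of-the-envelope count shows why. The dimensions below are assumptions taken from Llama 3 8B's published configuration (model width 4096, FFN hidden size 14336, 32 query heads, 8 KV heads, SwiGLU FFN with three weight matrices); with those numbers, the FFN holds roughly 4× more parameters per layer than attention.

```python
d, d_ffn = 4096, 14336            # model width, FFN hidden size
head_dim = 128                    # 4096 / 32 heads
n_q, n_kv = 32, 8                 # query heads, KV heads (GQA)

attn = d * (n_q * head_dim)       # Q projection
attn += 2 * d * (n_kv * head_dim) # K and V (smaller, thanks to GQA)
attn += (n_q * head_dim) * d      # output projection

ffn = 3 * d * d_ffn               # gate, up, and down projections (SwiGLU)

print(f"attention per layer: {attn / 1e6:.0f}M")  # -> 42M
print(f"FFN per layer:       {ffn / 1e6:.0f}M")   # -> 176M
```

Multiply by 32 layers and the FFN's share of the stack's parameters is the clear majority.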