What is a transformer, actually?
Before I started taking AI seriously, “transformer” was just a word I’d seen in headlines. Here’s what I actually understand about it now.
The core problem transformers solve
Earlier sequence models, like recurrent neural networks, processed text token by token, left to right. By the time you got to the end of a long sentence, the network had mostly forgotten the beginning. Transformers solve this by processing all tokens in a sequence simultaneously and using a mechanism called attention to decide which tokens should influence each other, regardless of distance.
Attention in plain terms
Imagine reading the sentence: “The bank can guarantee deposits will eventually cover future tuition costs.”
The word “bank” is ambiguous — it could be a riverbank or a financial institution. To resolve this, a human reader looks at other words in the sentence for context: deposits, tuition, costs. Attention does something similar: for each token, it calculates a score representing how much every other token should influence its interpretation.
The scores are computed using three vectors derived from each token:
- Query — what this token is “looking for”
- Key — what this token “offers” to others
- Value — the actual content that gets passed along
The query of one token is compared (via dot product) against the keys of every token in the sequence, including its own. Softmax converts those scores into weights that sum to 1. Those weights are then applied to the value vectors and summed. That's the attention output for one token.
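Here's a minimal sketch of that computation in NumPy. One detail not mentioned above: the original transformer paper also divides the scores by the square root of the key dimension before the softmax, which keeps the dot products from growing too large; I've included that scaling here.

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays, one row per token.
    # 1. Compare each query against every key via dot products,
    #    scaled by sqrt(d) as in the original paper.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # 2. Softmax turns each row of scores into weights summing to 1
    #    (subtracting the row max first for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Each output is a weighted sum of the value vectors.
    return weights @ V
```

A sanity check: if every key is identical, the softmax weights come out uniform, and each token's output is just the average of all the value vectors.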
Why “multi-head”?
A single attention mechanism can only learn one kind of relationship at a time. Multi-head attention runs several attention operations in parallel — each one potentially learning something different (syntax, coreference, proximity, etc.) — then concatenates the results and projects them back to the model's working dimension.
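The "run in parallel, then concatenate" idea can be sketched like this. In a real model the per-head projection matrices are learned parameters; here I'm substituting random matrices just to show the shapes, and I've left out the final output projection a trained model would apply after concatenation.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention for one head (same as the earlier sketch).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(x, num_heads, rng):
    # x: (seq_len, d_model). Each head works in a smaller space of
    # size d_model / num_heads, so the total cost stays comparable.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Random stand-ins for the learned Q/K/V projections of one head.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        head_outputs.append(attention(x @ Wq, x @ Wk, x @ Wv))
    # Concatenating the heads restores the (seq_len, d_model) shape.
    return np.concatenate(head_outputs, axis=-1)
```

Because each head gets its own projections, each one can end up attending to a different pattern in the same input.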
What transformers are not
Transformers are not “understanding” text in any cognitive sense. They’re predicting likely continuations based on patterns in training data. This distinction matters when you’re trying to figure out where they’ll fail.
Where I want to go next
I want to understand positional encoding better — specifically, why absolute positions (from the original paper) were replaced with rotary embeddings (RoPE) in most modern models.
Written as I learn. Corrections welcome.