Core Contribution
The Transformer is the first sequence transduction model built entirely on attention mechanisms, without any recurrence or convolution. This removes the sequential dependency that prevented RNNs from being parallelised, and establishes an architecture family that now underlies almost all large language models.
What I Understood Quickly
- The motivation against RNNs: sequential computation, long-range dependency path length.
- The mechanics of scaled dot-product attention — clear once you see the Q/K/V framing.
- Multi-head attention as parallel attention sub-spaces.
- The encoder-decoder structure and why causal masking is needed in the decoder.
What Confused Me (and How I Resolved It)
Why divide by √d_k?
For random unit-variance Q and K vectors of dimension d_k, the dot product has variance d_k. Dividing by √d_k restores unit variance regardless of dimension.
Why sine/cosine for positional encoding?
sin(pos + k) can be written as a linear combination of sin(pos) and cos(pos), so the model can learn to attend to "offset by k" positions via linear projection.
What to Read Next
| Paper | Relationship | Priority |
|---|---|---|
| BERT (Devlin et al., 2018) | Encoder-only, bidirectional pre-training | High |
| GPT (Radford et al., 2018) | Decoder-only, autoregressive | High |
| Relative Attention (Shaw et al., 2018) | Relative positions | Medium |
| ViT (Dosovitskiy et al., 2020) | Transformers for images | High |
References
[1]: Vaswani et al. (2017). Attention is all you need. NeurIPS. https://arxiv.org/abs/1706.03762
[2]: Rush, A. (2018). The Annotated Transformer. Harvard NLP.
References
- Vaswani et al. (2017). *Attention is all you need.* NeurIPS. https://arxiv.org/abs/1706.03762
- Rush, A. (2018). *The Annotated Transformer.* Harvard NLP.