Motivation
The 2017 paper Attention Is All You Need [1] replaced recurrent networks with a pure attention mechanism and changed the trajectory of NLP permanently. This walkthrough rebuilds each piece from scratch so the intuition is clear before the math arrives.
> The dominant sequence transduction models are based on complex recurrent or convolutional neural networks. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
>
> — Vaswani et al., 2017
The Problem With RNNs
Recurrent networks process tokens one at a time. Each hidden state h_t depends on h_{t-1}:
- Training cannot be parallelised across time steps.
- Gradients vanish or explode as they are propagated back through many time steps, making long-range dependencies hard to learn.
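The sequential bottleneck is easy to see in a toy model. The sketch below is a one-dimensional Elman-style RNN with made-up weights (`w_xh`, `w_hh` are illustrative, not from the paper): each step needs the previous hidden state, so the loop cannot be parallelised, and the first input's influence shrinks with every multiplication.

```python
import math

def rnn_forward(xs, w_xh=0.5, w_hh=0.9, h0=0.0):
    """Toy scalar Elman RNN: h_t = tanh(w_xh * x_t + w_hh * h_{t-1})."""
    h = h0
    hs = []
    for x in xs:  # strictly sequential: step t needs h from step t-1
        h = math.tanh(w_xh * x + w_hh * h)
        hs.append(h)
    return hs

# A single impulse at t=0 fades as it is pushed through repeated
# tanh / w_hh products -- the vanishing-signal problem in miniature.
hs = rnn_forward([1.0, 0.0, 0.0, 0.0])
```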
Scaled Dot-Product Attention
Given queries Q, keys K, and values V, attention is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the key dimension and the sqrt(d_k) factor keeps the dot products from growing large enough to saturate the softmax. In PyTorch:
```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
```
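A quick smoke test makes the mask's effect concrete. The function is reproduced below so the snippet runs standalone; the lower-triangular (causal) mask is one common choice, used in the decoder so a position cannot attend to later positions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
Q = K = V = torch.randn(1, 4, 8)           # (batch, seq_len, d_k)
causal = torch.tril(torch.ones(4, 4))      # lower-triangular: no peeking ahead
out, w = scaled_dot_product_attention(Q, K, V, mask=causal)

# Each row of the attention weights is a probability distribution over
# the positions that row is allowed to see.
assert torch.allclose(w.sum(-1), torch.ones(1, 4))
assert w[0, 0, 1:].abs().max() == 0        # position 0 attends only to itself
```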
Multi-Head Attention
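Rather than one attention over the full d_model dimensions, the paper runs h smaller attentions in parallel and concatenates the results, letting different heads specialise in different relations. The sketch below is a minimal implementation under the paper's dimension conventions (d_model = 512, h = 8 heads, so each head works in d_k = 64 dimensions); variable names are mine, not from any library.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal sketch: project, split into heads, attend, concatenate, project."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        B = Q.size(0)
        # Project, then reshape d_model into (num_heads, d_k) and move the
        # head axis forward so attention runs per head.
        q, k, v = [
            w(x).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
            for w, x in ((self.W_q, Q), (self.W_k, K), (self.W_v, V))
        ]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        # Merge the heads back into a single d_model-wide representation.
        out = (weights @ v).transpose(1, 2).contiguous()
        out = out.view(B, -1, self.num_heads * self.d_k)
        return self.W_o(out)

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
y = mha(x, x, x)   # self-attention: output shape matches the input
```

Because the heads run on slices of the projected tensors, the total compute is similar to one full-width attention; the split buys diversity of attention patterns, not extra cost.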
Architecture Overview
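The encoder stacks identical blocks, each combining self-attention with a position-wise feed-forward network, and wrapping both in residual connections plus layer normalisation (post-norm, as in the original paper). A compact sketch, leaning on PyTorch's built-in nn.MultiheadAttention for brevity (the paper's defaults d_model = 512, d_ff = 2048, h = 8 are used here):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))   # residual + layer norm
        x = self.norm2(x + self.drop(self.ff(x))) # FFN applied per position
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)
out = layer(x)   # shape is preserved, so blocks stack cleanly
```

Because every block maps (batch, seq_len, d_model) to the same shape, the full encoder is just N of these layers composed (N = 6 in the paper).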
Positional Encoding
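Since attention is permutation-invariant, the model needs position information injected into the embeddings. The paper uses fixed sinusoids at geometrically spaced frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A dependency-free sketch of that formula:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper:
    even indices get sin, odd indices get cos, at frequency 1/10000^(2i/d_model)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Position 0 is [0, 1, 0, 1, ...]: sin(0) = 0 and cos(0) = 1 at every frequency.
```

Each dimension pair traces a sinusoid of a different wavelength, so nearby positions get similar encodings while distant ones diverge, and the authors note this lets the model attend by relative offset.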
References
[1]: Vaswani et al. (2017). Attention is all you need. NeurIPS. https://arxiv.org/abs/1706.03762
[2]: Rush, A. (2018). The Annotated Transformer. Harvard NLP.
[3]: Alammar, J. (2018). The Illustrated Transformer.