Attention
Also known as · self-attention · attention mechanism
The mechanism that lets a model weigh which tokens matter for each prediction.
Attention is the core mechanism inside a transformer. For each token it processes, the model computes how relevant every other token is, and uses those weights to decide what to focus on. In the sentence 'the trophy didn't fit in the suitcase because it was too big,' attention is what helps the model connect 'it' to 'trophy' rather than 'suitcase'.
Because attention looks at all tokens in relation to each other, it captures long-range context and meaning far better than older approaches that read text in strict sequence. This is what gives LLMs their grasp of grammar, reference, and nuance.
Attention is also computationally expensive — its cost grows quickly as the context gets longer, which is one reason very large context windows are technically hard and why a lot of research targets making attention more efficient.