File size: 493 Bytes
57bdca5
 
 
 
 
 
 
1
2
3
4
5
6
7
Attention mechanisms
Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
computational bottleneck when you have long texts. Longformer and reformer are models that try to be more efficient and
use a sparse version of the attention matrix to speed up training.
LSH attention
Reformer uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions.