It uses a combination of local windowed attention (attention is computed only within a fixed-size window around each token) and global attention (reserved for specific task tokens, such as [CLS] for classification) to create a sparse attention matrix instead of a full attention matrix.
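The combined local-plus-global pattern can be sketched as a boolean mask. This is a minimal illustration, not the library's actual implementation; `window` and `global_positions` are hypothetical parameter names:

```python
import numpy as np

def sparse_attention_mask(seq_len, window, global_positions):
    """Return a boolean matrix: True where attention is allowed."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local windowed attention: each token sees a fixed window around itself
        lo = max(0, i - window)
        hi = min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    for g in global_positions:
        # Global attention: the task token attends everywhere, and vice versa
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = sparse_attention_mask(seq_len=8, window=1, global_positions=[0])
# Far fewer allowed entries than the full 8x8 = 64 of dense attention
print(int(mask.sum()), mask.size)
```

For long sequences this sparsity is what keeps memory roughly linear in sequence length rather than quadratic.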
## Decoder[[nlp-decoder]]
GPT-2 is a decoder-only Transformer that predicts the next word in the sequence. It masks tokens to the right so the model can't "cheat" by looking ahead.
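The causal ("look-ahead") masking described above can be shown with a small NumPy sketch: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. The random scores here are a stand-in for real query-key dot products:

```python
import numpy as np

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.random((seq_len, seq_len))  # stand-in attention scores

# Causal mask: token i may only attend to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax: masked (future) positions get exactly zero weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0, 0]
```

Because each row still sums to one, the model redistributes all attention over past and current positions only, which is what lets a decoder be trained on next-token prediction without leaking the answer.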