It uses a combination of local windowed attention (attention is computed only within a fixed-size window around each token) and global attention (reserved for specific task tokens, such as [CLS] for classification) to create a sparse attention matrix instead of a full attention matrix.
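The combined local-plus-global pattern can be sketched as a boolean mask. This is a minimal illustration, not the library's actual implementation; `window` and `global_positions` are hypothetical parameter names:

```python
import numpy as np

def sparse_attention_mask(seq_len, window, global_positions):
    """Return a boolean matrix: True where attention is allowed."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local windowed attention: each token sees a fixed window around itself
        lo = max(0, i - window)
        hi = min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    for g in global_positions:
        # Global attention: the task token attends everywhere, and vice versa
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = sparse_attention_mask(seq_len=8, window=1, global_positions=[0])
# Far fewer allowed entries than the full 8x8 = 64 of dense attention
print(int(mask.sum()), mask.size)
```

For long sequences this sparsity is what keeps memory roughly linear in sequence length rather than quadratic.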
## Decoder[[nlp-decoder]]
GPT-2 is a decoder-only Transformer that predicts the next word in the sequence. It masks tokens to the right so the model can't "cheat" by looking ahead.
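The causal ("look-ahead") masking described above can be shown with a small NumPy sketch: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. The random scores here are a stand-in for real query-key dot products:

```python
import numpy as np

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.random((seq_len, seq_len))  # stand-in attention scores

# Causal mask: token i may only attend to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax: masked (future) positions get exactly zero weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0, 0]
```

Because each row still sums to one, the model redistributes all attention over past and current positions only, which is what lets a decoder be trained on next-token prediction without leaking the answer.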