Ahmadzei's picture
added 3 more tables for large emb model
5fa1a76
raw
history blame
256 Bytes
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
their local window).