Take input attention masks to support left-padded sequences
#1
by hiyouga · opened
The previous implementation does not accept attention masks as inputs, which causes unexpected behaviour in batched inference (which commonly uses left padding). I have therefore reimplemented the ALiBi encodings to take the attention masks from the user inputs. Note that this implementation largely depends on [1].
Of course, the above implementation requires re-computing the ALiBi tensors at every inference step; we cannot reuse cached tensors once the input attention masks are taken into account. The inference efficiency will therefore be slightly worse than the original version.
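For context, here is a minimal sketch of the general technique: computing per-token positions from the cumulative sum of the attention mask (as in the BLOOM-style `build_alibi_tensor`), so that left padding does not shift the relative distances. This is an illustration of the approach, not the exact code in this PR; the function name and tensor shapes are assumptions.

```python
import math
import torch

def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
    """attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding."""
    batch_size, seq_len = attention_mask.shape
    # Per-head slopes: a geometric sequence, with extra slopes interleaved
    # when num_heads is not a power of two.
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    base = torch.tensor(2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), dtype=torch.float32)
    powers = torch.arange(1, 1 + closest_power_of_2, dtype=torch.int32)
    slopes = torch.pow(base, powers)
    if closest_power_of_2 != num_heads:
        extra_base = torch.tensor(2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), dtype=torch.float32)
        num_remaining = min(closest_power_of_2, num_heads - closest_power_of_2)
        extra_powers = torch.arange(1, 1 + 2 * num_remaining, 2, dtype=torch.int32)
        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
    # Positions are counted only over non-padded tokens, so left padding
    # does not change the relative distances between real tokens.
    arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[:, None, :]
    alibi = slopes[None, :, None] * arange_tensor  # (batch, num_heads, seq_len)
    return alibi.reshape(batch_size * num_heads, 1, seq_len).to(dtype)
```

Because the result depends on the attention mask of the current batch, it has to be rebuilt per forward pass, which is the efficiency cost mentioned above.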
Could the ALiBi bias be fused with the expanded mask, so that the causal mask no longer needs to be handled separately? After all, the ALiBi mask is lower-triangular, just like the causal mask.
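As a rough illustration of the fusion being asked about (not code from this PR): since both the ALiBi bias and the expanded causal/padding mask are additive terms on the attention scores, they could in principle be folded into one bias tensor. The helper name `fuse_alibi_and_mask` and the shapes below are assumptions.

```python
import torch

def fuse_alibi_and_mask(alibi: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """alibi: (batch * num_heads, 1, seq_len); attention_mask: (batch, seq_len)."""
    batch_size, seq_len = attention_mask.shape
    num_heads = alibi.shape[0] // batch_size
    # Causal mask: True where a query position may not attend to a key position.
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # Padding mask: True at padded key positions.
    padding = attention_mask[:, None, None, :] == 0
    disallowed = causal[None, None, :, :] | padding  # (batch, 1, seq_len, seq_len)
    # Broadcast the ALiBi bias over query positions, then blank out disallowed pairs.
    bias = alibi.view(batch_size, num_heads, 1, seq_len).expand(-1, -1, seq_len, -1).clone()
    bias = bias.masked_fill(disallowed, torch.finfo(bias.dtype).min)
    return bias  # add directly to the attention scores before softmax
```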
hiyouga changed pull request status to closed