The two optimizations in the fastpath execution are:

1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps
2. skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors (see the sketch after this list)
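To make the second point concrete, here is a minimal sketch of the nested-tensor idea using PyTorch's prototype `torch.nested` API; the shapes and values are arbitrary, chosen only to show that ragged sequences are stored without padding:

```py
import torch

# Two sequences of different lengths with hidden size 8; with regular
# tensors, the shorter one would need padding out to length 5.
seqs = [torch.randn(3, 8), torch.randn(5, 8)]

# A nested tensor stores the ragged sequences directly, so downstream
# kernels can skip the positions that padding would otherwise occupy.
nt = torch.nested.nested_tensor(seqs)
print(nt.is_nested)  # True
```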
BetterTransformer also converts all attention operations to use the more memory-efficient scaled dot product attention (SDPA).
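As a usage sketch, a model can be converted to the BetterTransformer fastpath with `to_bettertransformer()`; this assumes the `optimum` package is installed, and the `gpt2` checkpoint here is just an arbitrary example:

```py
from transformers import AutoModelForCausalLM

# Load any supported checkpoint; "gpt2" is only an example choice.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert to the BetterTransformer fastpath (requires the optimum package).
model = model.to_bettertransformer()
```

Calling `model.reverse_bettertransformer()` restores the original model, which is useful before saving a checkpoint in the canonical format.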