The two optimizations in the fastpath execution are: fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors BetterTransformer also converts all attention operations to use the more memory-efficient scaled dot product attention.