Clarification on the usage of `short_factor` and `long_factor`?
My tests show that `short_factor` and `long_factor` should be used like this:
Let `max_length` be the real maximum context length, which must be <= 128k.
- If `max_length` is less than 4096, just use `short_factor`;
- If `max_length` is greater than 4096, just use `long_factor`.

Can we use `long_factor` for a `max_length` less than 4096? Yes, but its performance is worse than with `short_factor`.
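
For illustration, the rule above can be sketched in plain Python. This is only a sketch under assumptions: the helper names `select_rope_factor` and `inverse_frequencies` are hypothetical, "128k" is taken to mean 131072 tokens, a `max_length` of exactly 4096 is treated as short (the rule leaves that case open), and the factor is assumed to divide each inverse frequency as in the usual scaled-RoPE formulation. The factor values are toy numbers, not the real ones from the model config.

```python
# Hypothetical helpers for illustration; not part of any library.
ORIGINAL_MAX_LENGTH = 4096        # threshold quoted in this thread
EXTENDED_MAX_LENGTH = 128 * 1024  # assumption: "128k" means 131072 tokens


def select_rope_factor(max_length, short_factor, long_factor):
    """Pick ONE factor list up front from the planned maximum context length."""
    if not 0 < max_length <= EXTENDED_MAX_LENGTH:
        raise ValueError(f"max_length must be in (0, {EXTENDED_MAX_LENGTH}]")
    # Assumption: exactly 4096 is treated as "short"; the post leaves it open.
    if max_length <= ORIGINAL_MAX_LENGTH:
        return short_factor
    return long_factor


def inverse_frequencies(factor, dim, base=10000.0):
    """Scaled RoPE inverse frequencies: 1 / (factor[i] * base**(2i / dim))."""
    if len(factor) != dim // 2:
        raise ValueError("factor needs one entry per rotary frequency (dim // 2)")
    return [1.0 / (f * base ** (2 * i / dim)) for i, f in enumerate(factor)]


# Toy values, NOT the real Phi-3 factors.
short_factor = [1.0, 1.0]
long_factor = [2.0, 4.0]

# The factor is chosen once and then kept for the whole run.
factor = select_rope_factor(8192, short_factor, long_factor)
inv_freq = inverse_frequencies(factor, dim=4)
```

The point of the sketch is that the choice depends only on the planned `max_length`, so it never changes in the middle of a generation.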
Mixed use of these two factors does not work, even if they are switched at batch boundaries as in `Phi3SuScaledRotaryEmbedding`, which means that `Phi3SuScaledRotaryEmbedding` needs to be fixed.
Please correct me if anything is wrong.
I agree, though the issue is how to implement that, since we won't have any information about the true `max_length` that will be used.
The current implementation relies on the sequence length seen during generation and recalculates the inverse frequency based on that length. For every generation shorter than 4096 tokens, `short_factor` is used; otherwise we use `long_factor`.
One pain point is the boundary around 4096: for example, lengths 4095 and 4097 will use different values for their rotary embeddings. The switch is not ideal, but my feeling is that keeping `short_factor` for a generation that was supposed to be short and turned out to be long is less reliable than switching to `long_factor`.
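
To make the boundary problem concrete, here is a rough sketch of per-call switching. It is not the actual `Phi3SuScaledRotaryEmbedding` code: the helper names `factor_for_call` and `rotary_angle` are hypothetical, the factor values are toy numbers, and the formula assumes the factor divides each inverse frequency.

```python
ORIGINAL_MAX_LENGTH = 4096  # threshold quoted in this thread


def factor_for_call(seq_len, short_factor, long_factor):
    """Choose the factor from the sequence length seen in the current call."""
    return long_factor if seq_len > ORIGINAL_MAX_LENGTH else short_factor


def rotary_angle(position, index, factor, dim, base=10000.0):
    """Rotation angle for one token position and one frequency index."""
    inv_freq = 1.0 / (factor[index] * base ** (2 * index / dim))
    return position * inv_freq


# Toy values, NOT the real Phi-3 factors.
short_factor = [1.0, 1.0]
long_factor = [1.0, 2.0]

# Same token position and frequency index, lengths just either side of 4096.
angle_at_4095 = rotary_angle(
    100, 1, factor_for_call(4095, short_factor, long_factor), dim=4
)
angle_at_4097 = rotary_angle(
    100, 1, factor_for_call(4097, short_factor, long_factor), dim=4
)
jump = abs(angle_at_4095 - angle_at_4097)
```

With these toy factors, position 100 is rotated by a different angle once the sequence grows past 4096, even though the token itself has not moved. Anything computed for that position before the switch would no longer line up with what is computed after it, which is the trade-off described above.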