Scaling Law with Learning Rate Annealing
arXiv: 2408.11029
Note
1. The result of a memory "read" is fed to the processing unit; the output of the processing unit is "written" back to the memory.
2. Token summarisation is implemented as a weighted summation over all context tokens in memory: a learnable matrix in R^{k×p} multiplies the context in R^{p×d} to produce k summary tokens in R^{k×d} (see the sketch below).
3. Positional embeddings are added to distinguish tokens coming from memory from tokens coming from the inputs.
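A minimal PyTorch sketch of items 2 and 3; the class name, dimensions, and the separate learnable type embeddings are illustrative assumptions, not details taken from the note.

```python
import torch
import torch.nn as nn

class TokenSummarizer(nn.Module):
    """Compress p memory/context tokens into k summary tokens
    via a learnable weighted summation (illustrative sketch)."""

    def __init__(self, k: int, p: int):
        super().__init__()
        # Learnable mixing matrix W in R^{k x p}
        self.weights = nn.Parameter(torch.randn(k, p) / p ** 0.5)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, p, d) -> summary: (batch, k, d);
        # each summary token is a weighted sum over all p context tokens.
        return torch.einsum("kp,bpd->bkd", self.weights, context)

# Usage: 512 memory tokens of width 768 compressed into 64 summary tokens.
summarizer = TokenSummarizer(k=64, p=512)
memory = torch.randn(2, 512, 768)
summary = summarizer(memory)                      # (2, 64, 768)

# Learnable type embeddings mark which tokens came from memory vs. the inputs.
mem_embed = nn.Parameter(torch.zeros(1, 1, 768))
inp_embed = nn.Parameter(torch.zeros(1, 1, 768))
inputs = torch.randn(2, 128, 768)
sequence = torch.cat([summary + mem_embed, inputs + inp_embed], dim=1)
```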
Note
1. It has an upgraded version: https://arxiv.org/pdf/2303.16727
1.1. Progressive fine-tuning of the pre-trained models can contribute to higher performance.
1.2. The decoder takes the encoder's visible tokens as input and reconstructs only the tokens visible under the decoder mask (see the sketch below).
1.3. Supervision applies only to the decoder output tokens that are invisible to the encoder.
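A rough PyTorch sketch of the dual-masking bookkeeping described in 1.2 and 1.3; the function name, the masking ratios, and the uniform random sampling are illustrative assumptions (the paper's actual decoder mask is structured rather than random).

```python
import torch

def dual_mask_indices(num_tokens, enc_visible_ratio=0.1, dec_visible_ratio=0.5):
    """Return (encoder-visible indices, decoder reconstruction targets)."""
    perm = torch.randperm(num_tokens)
    num_enc = int(num_tokens * enc_visible_ratio)
    enc_visible = perm[:num_enc]        # tokens the encoder actually processes
    enc_invisible = perm[num_enc:]      # tokens hidden from the encoder

    # Decoder mask: reconstruct only a subset of the encoder-invisible tokens,
    # so supervision never touches tokens the encoder has already seen.
    num_dec = int(len(enc_invisible) * dec_visible_ratio)
    keep = torch.randperm(len(enc_invisible))[:num_dec]
    dec_targets = enc_invisible[keep]
    return enc_visible, dec_targets

# Usage: the reconstruction loss is computed only at tgt_idx.
enc_idx, tgt_idx = dual_mask_indices(num_tokens=1568)
```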