Update README.md
README.md CHANGED
@@ -68,7 +68,7 @@ Consider the following chat interaction:
The model must predict the bolded parts. So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.

We then compute a distance loss `D(p_masked, p_full)` between the two predictions. For this, I used the average of the backwards and forwards KL divergences between the predictions.

Finally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:
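To make the objective concrete, here is a rough PyTorch sketch of how the two passes and the combined loss could look. It reflects my reading of the description above rather than the repository's implementation: it assumes a HuggingFace-style causal LM that returns `.logits`, labels using the `-100` ignore convention, and placeholder values for the masking probability and the weight `w`.

```python
import torch
import torch.nn.functional as F

def remask_loss(model, input_ids, labels, mask_token_id, w=0.1, mask_prob=0.4):
    """Sketch of the two-pass ReMask objective; w and mask_prob are placeholders."""
    # Build the masked copy: randomly hide tokens the model must predict.
    is_target = labels != -100
    drop = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = torch.where(is_target & drop,
                             torch.full_like(input_ids, mask_token_id),
                             input_ids)

    # Run the model once on the masked sequence and once on the full sequence.
    logits_masked = model(masked_ids).logits
    logits_full = model(input_ids).logits

    # Standard next-token cross-entropy loss for each pass.
    def ce(logits):
        return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                               labels[:, 1:].flatten(),
                               ignore_index=-100)
    ce_masked, ce_full = ce(logits_masked), ce(logits_full)

    # D(p_masked, p_full): average of the forward and backward KL divergences
    # (computed over all positions here, for simplicity).
    logp_m = F.log_softmax(logits_masked, dim=-1)
    logp_f = F.log_softmax(logits_full, dim=-1)
    kl_fm = F.kl_div(logp_m, logp_f, log_target=True, reduction="batchmean")  # KL(p_full || p_masked)
    kl_mf = F.kl_div(logp_f, logp_m, log_target=True, reduction="batchmean")  # KL(p_masked || p_full)
    dist = 0.5 * (kl_fm + kl_mf)

    # Add the distance term to both language modeling losses with weight w.
    return ce_masked + ce_full + w * dist
```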
@@ -91,12 +91,28 @@ Keeping this in mind:

I trained StableLM-3B-4e1t repeatedly on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with 1000 examples from [reddit-instruct-curated](https://huggingface.co/datasets/euclaise/reddit-instruct-curated) and 1000 examples from [oasst2-curated](https://huggingface.co/datasets/sablo/oasst2_curated).

I trained once with ReMask/ReMask-CoT, once without regularization to match Masked Thought (with partial label masking for CoT), and once with SFT.

If my hypothesis regarding exposure bias is correct, ReMask should significantly improve generative benchmarks like GSM8K, but would not necessarily improve logprob-based benchmarks like ARC-c (as implemented by the evaluation harness).

Here are some benchmark results, computed using the LM Evaluation Harness with vLLM:

| Model          | GSM8K (strict, 5-shot) | ARC-c (acc_norm, 25-shot) |
|:--------------:|-----------------------:|--------------------------:|
| SFT            | 24.34%                 | 42.92%                    |
| Masked Thought | 24.18%                 | **43.60%**                |
| **ReMask**     | **27.90%**             | 43.26%                    |

As I expected, it improves GSM8K but doesn't do much for ARC.

## Training details
- Framework: PyTorch Lightning
- Optimizer: [Lilith](https://github.com/euclaise/supertrainer2000/blob/master/src/supertrainer2k/optim/lilith.py)
- Training sequence length: 256
- Input masking probability: 40%
- Label masking probability: 10%
- Answer-only (full rationale masking) probability: 10% (how these three masking probabilities fit together is sketched below)
- Batch size: 16, accumulated to 256
- Epochs: 6
- Learning rate: 1e-5
- Learning rate schedule: One Cycle, cosine, no cycle_momentum
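Below is a rough per-example sketch of one way the three masking probabilities above could be combined. It is illustrative rather than the actual training code: it assumes a boolean `rationale_mask` marking the chain-of-thought tokens and the usual HuggingFace `-100` label-ignore convention.

```python
import torch

def apply_remask_cot_masking(input_ids, labels, rationale_mask, mask_token_id,
                             input_mask_p=0.40, label_mask_p=0.10,
                             answer_only_p=0.10):
    """Illustrative ReMask-CoT-style masking for one (possibly batched) example."""
    input_ids = input_ids.clone()
    labels = labels.clone()
    is_target = labels != -100  # tokens the model is supposed to predict

    if torch.rand(()).item() < answer_only_p:
        # Answer-only case: hide the entire rationale in the model input.
        to_mask = is_target & rationale_mask
    else:
        # Otherwise hide each target token with the input masking probability.
        to_mask = is_target & (torch.rand(input_ids.shape,
                                          device=input_ids.device) < input_mask_p)
    input_ids[to_mask] = mask_token_id

    # Independently drop a fraction of labels from the loss (partial label masking).
    drop = is_target & (torch.rand(labels.shape, device=labels.device) < label_mask_p)
    labels[drop] = -100

    return input_ids, labels
```

An example masked this way would then go through the two-pass objective sketched earlier; the exact interaction between the answer-only case and the other two probabilities is an assumption on my part.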