avi-skowron committed
Commit 372b1c0 • Parent(s): 45dfa62
fix batch sizes and add paper

README.md CHANGED
@@ -11,7 +11,8 @@ datasets:
 ---
 
 The *Pythia Scaling Suite* is a collection of models developed to facilitate
-interpretability research
+interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
+It contains two sets of eight models of sizes
 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
 models: one trained on the Pile, and one trained on the Pile after the dataset
 has been globally deduplicated. All 8 model sizes are trained on the exact
@@ -53,6 +54,8 @@ with exact parameter counts.
 - Language: English
 - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
 for training procedure, config files, and details on how to use.
+[See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
+details.
 - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
 - License: Apache 2.0
 - Contact: to ask questions about this model, join the [EleutherAI
@@ -66,10 +69,10 @@ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
 | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
 | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
 | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
-| 160M | 85,056,000 | 12 | 768 | 12 |
+| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
-| 410M | 302,311,424 | 24 | 1024 | 16 |
+| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
 | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
-| 1.4B | 1,208,602,624 | 24 | 2048 | 16 |
+| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
 | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
 | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
 | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
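As a sanity check on the corrected table, the Non-Embedding Params column can be reproduced from the Layers and Model Dim columns alone. A minimal sketch, assuming the standard GPT-NeoX-style layer layout (fused QKV and output projections with biases, a 4x-expansion MLP, two LayerNorms per layer, and one final LayerNorm); the function name is illustrative, not from the Pythia codebase:

```python
def pythia_non_embedding_params(layers: int, d_model: int) -> int:
    """Non-embedding parameter count for a GPT-NeoX-style transformer.

    Per layer:
      attention: QKV (3*d^2 + 3*d) + output projection (d^2 + d)
      MLP (4x):  up projection (4*d^2 + 4*d) + down projection (4*d^2 + d)
      two LayerNorms: 2 * 2*d
    -> 12*d^2 + 13*d per layer, plus a final LayerNorm (2*d).
    Token embedding and unembedding matrices are excluded.
    """
    per_layer = 12 * d_model**2 + 13 * d_model
    return layers * per_layer + 2 * d_model


# Cross-check against rows of the table above:
assert pythia_non_embedding_params(6, 512) == 18_915_328         # 70M
assert pythia_non_embedding_params(12, 768) == 85_056_000        # 160M
assert pythia_non_embedding_params(36, 5120) == 11_327_027_200   # 12B
```

The same formula matches every row of the table, which suggests all sizes share this parameterization and differ only in depth and width.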