avi-skowron committed
Commit 372b1c0 • Parent(s): 45dfa62
fix batch sizes and add paper

README.md CHANGED
@@ -11,7 +11,8 @@ datasets:
 ---
 
 The *Pythia Scaling Suite* is a collection of models developed to facilitate
-interpretability research
+interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
+It contains two sets of eight models of sizes
 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
 models: one trained on the Pile, and one trained on the Pile after the dataset
 has been globally deduplicated. All 8 model sizes are trained on the exact
@@ -53,6 +54,8 @@ with exact parameter counts.
 - Language: English
 - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
 for training procedure, config files, and details on how to use.
+[See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
+details.
 - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
 - License: Apache 2.0
 - Contact: to ask questions about this model, join the [EleutherAI
@@ -66,10 +69,10 @@ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
 | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
 | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
 | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
-| 160M | 85,056,000 | 12 | 768 | 12 |
+| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
-| 410M | 302,311,424 | 24 | 1024 | 16 |
+| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
 | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
-| 1.4B | 1,208,602,624 | 24 | 2048 | 16 |
+| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
 | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
 | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
 | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
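As a sanity check on the corrected table, the Non-Embedding Params column can be reproduced from the Layers and Model Dim columns alone. A minimal sketch, assuming the standard GPT-NeoX-style layer layout (fused QKV and output projections with biases, a 4x-expansion MLP, two LayerNorms per layer, and one final LayerNorm); the function name is illustrative, not from the Pythia codebase:

```python
def pythia_non_embedding_params(layers: int, d_model: int) -> int:
    """Non-embedding parameter count for a GPT-NeoX-style transformer.

    Per layer:
      attention: QKV (3*d^2 + 3*d) + output projection (d^2 + d)
      MLP (4x):  up projection (4*d^2 + 4*d) + down projection (4*d^2 + d)
      two LayerNorms: 2 * 2*d
    -> 12*d^2 + 13*d per layer, plus a final LayerNorm (2*d).
    Token embedding and unembedding matrices are excluded.
    """
    per_layer = 12 * d_model**2 + 13 * d_model
    return layers * per_layer + 2 * d_model


# Cross-check against rows of the table above:
assert pythia_non_embedding_params(6, 512) == 18_915_328         # 70M
assert pythia_non_embedding_params(12, 768) == 85_056_000        # 160M
assert pythia_non_embedding_params(36, 5120) == 11_327_027_200   # 12B
```

The same formula matches every row of the table, which suggests all sizes share this parameterization and differ only in depth and width.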