add num tokens source
README.md
CHANGED
@@ -115,10 +115,10 @@ The model is trained on the following data mixture of openly accessible English
 | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
-| [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents |
-| [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents |
-| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs |
-| [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs |
+| [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | 114.906B | TODO | 1 | 73.85% |
+| [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | TODO | 3 | 6.15% |
+| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 1.636B | TODO | 1 | 17.18% |
+| [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 29.920B | TODO | 3 | 2.82% |

 **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
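One way to sanity-check the new columns is to recompute each source's share from its token count and epoch count. The sketch below is only an illustration: it assumes that tokens seen equal tokens in source multiplied by epochs, which the README does not state, and `effective_proportions` is a hypothetical helper name. If the "Effective Proportion" column was derived differently (for example from sampling weights), the output need not match the table.

```python
def effective_proportions(sources):
    """Return each source's share of total tokens seen, in percent.

    `sources` maps a source name to (tokens_in_source_in_billions, epochs).
    Assumption (not stated in the README): tokens seen = tokens * epochs.
    """
    seen = {name: tokens * epochs for name, (tokens, epochs) in sources.items()}
    total = sum(seen.values())
    return {name: 100.0 * count / total for name, count in seen.items()}


# Token counts (in billions) and epochs copied from the added table rows.
table_rows = {
    "OBELISC": (114.906, 1),
    "Wikipedia": (3.192, 3),
    "LAION": (1.636, 1),
    "PMD": (29.920, 3),
}

if __name__ == "__main__":
    for name, share in effective_proportions(table_rows).items():
        print(f"{name}: {share:.2f}%")
```

Comparing the printed shares with the last column is a quick way to spot a row whose token count or epoch count was entered against the wrong source.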