HuggingFaceM4/idefics-80b

Leyo committed on Jul 11, 2023

Commit 73fda5b • 1 Parent(s): 436c345

add num tokens source

Files changed (1)

README.md +4 -4

@@ -115,10 +115,10 @@ The model is trained on the following data mixture of openly accessible English
 | Data Source | Type of Data                             | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
-| [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC)     | Unstructured Multimodal Web Documents    | TODO                      | TODO                      | 1      | 73.85%                                  |
-| [Wikipedia](https://huggingface.co/datasets/wikipedia)   | Unstructured Multimodal Web Documents    | TODO                      | TODO                      | 3      | 6.15%                                  |
-| [LAION](https://huggingface.co/datasets/laion/laion2B-en)       | Image-Text Pairs                         | TODO                      | TODO                      | 1      | 17.18%
-| [PMD](https://huggingface.co/datasets/facebook/pmd)         | Image-Text Pairs                         | TODO                      | TODO                      | 3      | 2.82%                                   |                                |
 **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).

 | Data Source | Type of Data                             | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
+| [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC)     | Unstructured Multimodal Web Documents    | 114.906B                      | TODO                      | 1      | 73.85%                                  |
+| [Wikipedia](https://huggingface.co/datasets/wikipedia)   | Unstructured Multimodal Web Documents    | 3.192B                     | TODO                      | 3      | 6.15%                                  |
+| [LAION](https://huggingface.co/datasets/laion/laion2B-en)       | Image-Text Pairs                         | 1.636B                      | TODO                      | 1      | 17.18%
+| [PMD](https://huggingface.co/datasets/facebook/pmd)         | Image-Text Pairs                         | 29.920B                      | TODO                      | 3      | 2.82%                                   |                                |
 **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).