HugoLaurencon committed on
Commit
ade3656
1 Parent(s): de1ddb0

Update README.md

Files changed (1)
  1. README.md +7 -8
README.md CHANGED
@@ -115,19 +115,18 @@ The model is trained on the following data mixture of openly accessible English

 | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
- | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
- | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
- | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
- | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |
+ | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 73.85% |
+ | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 17.18% |
+ | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
+ | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 2.82% |

- **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
-
- **LAION** is a collection of image-text pairs collected from web pages in Common Crawl; the texts are obtained from the alternative text (alt-text) of each image.
+ **OBELISC** is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](TODO).

 **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.

- **OBELISC** is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](TODO).
+ **LAION** is a collection of image-text pairs collected from web pages in Common Crawl; the texts are obtained from the alternative text (alt-text) of each image.

+ **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.

 For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder, the vision hidden states are pooled with Transformer Perceiver blocks, and the pooled states are then fused into the text sequence through the cross-attention blocks.

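
As a side note for readers of this diff, here is a minimal sketch of the two sequence-formation strategies the last paragraph describes. It is not the training code behind this model card: the `<image>` placeholder, the `ImageTextPair`/`WebDocument` structures, and the `to_training_text` helper are hypothetical names used only to illustrate packing captions with images versus interleaving paragraphs and images.

```python
# Illustrative sketch only (hypothetical names, not the repository's training code).
from dataclasses import dataclass
from typing import List, Union

IMAGE_TOKEN = "<image>"  # assumed placeholder marking where pooled vision states are fused via cross-attention


@dataclass
class ImageTextPair:
    image: object  # e.g. a PIL image, later encoded by the vision encoder
    caption: str


@dataclass
class WebDocument:
    # A multimodal web document is a succession of text paragraphs and images.
    items: List[Union[str, object]]  # str = paragraph, anything else = image


def to_training_text(sample: Union[ImageTextPair, WebDocument]) -> str:
    """Build the text stream fed to the language model.

    Image-text pairs: the image placeholder is packed together with its caption.
    Web documents: placeholders are interleaved with the surrounding paragraphs,
    preserving the original order of text and images.
    """
    if isinstance(sample, ImageTextPair):
        return f"{IMAGE_TOKEN} {sample.caption}"
    parts = [item if isinstance(item, str) else IMAGE_TOKEN for item in sample.items]
    return "\n".join(parts)


# Example with dummy images represented as plain objects.
pair = ImageTextPair(image=object(), caption="A dog catching a frisbee.")
doc = WebDocument(items=["Intro paragraph.", object(), "Paragraph after the image."])
print(to_training_text(pair))
print(to_training_text(doc))
```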