Zlatislav Zlatev committed
Commit 151cd18
1 Parent(s): ada060f

Upload 10 files

.gitattributes CHANGED
@@ -32,3 +32,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ enwiki-words-frequency.txt filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,194 @@
  ---
- license: other
+
+ inference: false
+ co2_eq_emissions:
+   emissions: 7540
+   source: MLCo2 Machine Learning Impact calculator
+   geographical_location: East USA
+   hardware_used: TPU v3-8
+ tags:
+ - text-to-image
+ license: apache-2.0
+
+ language: en
+ model-index:
+ - name: dalle-mini
+   results: []
  ---
+
+ # DALL·E Mini Model Card
+
+ This model card focuses on the model associated with the DALL·E mini space on Hugging Face, available [here](https://huggingface.co/spaces/dalle-mini/dalle-mini). The app is called “dalle-mini”, but incorporates “[DALL·E Mini](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy)” and “[DALL·E Mega](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega-Training-Journal--VmlldzoxODMxMDI2)” models (further details on this distinction forthcoming).
+
+ The DALL·E Mega model is the largest version of DALL·E Mini. For more information specific to DALL·E Mega, see the [DALL·E Mega model card](https://huggingface.co/dalle-mini/dalle-mega).
+
+ ## Model Details
+
+ * **Developed by:** Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phúc Lê, Luke Melas, Ritobrata Ghosh
+ * **Model type:** Transformer-based text-to-image generation model
+ * **Language(s):** English
+ * **License:** Apache 2.0
+ * **Model Description:** This is a model that can be used to generate images based on text prompts. As the model developers wrote in the [project report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) about DALL·E mini, “OpenAI had the first impressive model for generating images with [DALL·E](https://openai.com/blog/dall-e/). DALL·E mini is an attempt at reproducing those results with an open-source model.”
+ * **Resources for more information:** See OpenAI’s website for more information about [DALL·E](https://openai.com/blog/dall-e/), including the [DALL·E model card](https://github.com/openai/DALL-E/blob/master/model_card.md). See the [project report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) for more information from the model’s developers. To learn more about DALL·E Mega, see the DALL·E Mega [training journal](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega-Training--VmlldzoxODMxMDI2#training-parameters).
+ * **Cite as:**
+ ```bibtex
+ @misc{Dayma_DALL·E_Mini_2021,
+   author = {Dayma, Boris and Patil, Suraj and Cuenca, Pedro and Saifullah, Khalid and Abraham, Tanishq and Lê Khắc, Phúc and Melas, Luke and Ghosh, Ritobrata},
+   doi = {10.5281/zenodo.5146400},
+   month = {7},
+   title = {DALL·E Mini},
+   url = {https://github.com/borisdayma/dalle-mini},
+   year = {2021}
+ }
+ ```
+
+ ## Uses
+
+ ### Direct Use
+
+ The model is intended to be used to generate images based on text prompts for research and personal consumption. Intended uses include supporting creativity, creating humorous content, and providing generations for people curious about the model’s behavior. Intended uses exclude those described in the [Misuse and Out-of-Scope Use](#misuse-malicious-use-and-out-of-scope-use) section.
+
+ ### Downstream Use
+
+ The model could also be used for downstream use cases, including:
+ * Research efforts, such as probing and better understanding the limitations and biases of generative models to further improve the state of science.
+ * Development of educational or creative tools.
+ * Generation of artwork and use in design and artistic processes.
+ * Other uses that are newly discovered by users. This currently includes poetry illustration (give a poem as prompt), fan art (putting a character in various other visual universes), visual puns, fairy tale illustrations (give a fantasy situation as prompt), concept mashups (applying a texture to something completely different), style transfers (portraits in the style of), … We hope you will find your own application!
+
+ Downstream uses exclude the uses described in [Misuse and Out-of-Scope Use](#misuse-malicious-use-and-out-of-scope-use).
+
+ ### Misuse, Malicious Use, and Out-of-Scope Use
+
+ The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive, or content that propagates historical or current stereotypes.
+
+ #### Out-of-Scope Use
+
+ The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.
+
+ #### Misuse and Malicious Use
+
+ Using the model to generate content that is cruel to individuals is a misuse of this model. This includes:
+ * Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
+ * Intentionally promoting or propagating discriminatory content or harmful stereotypes.
+ * Impersonating individuals without their consent.
+ * Sexual content without consent of the people who might see it.
+ * Mis- and disinformation.
+ * Representations of egregious violence and gore.
+ * Sharing of copyrighted or licensed material in violation of its terms of use.
+ * Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
+
+ ## Limitations and Bias
+
+ ### Limitations
+
+ The model developers discuss the limitations of the model further in the DALL·E Mini [technical report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA):
+ * Faces and people in general are not generated properly.
+ * Animals are usually unrealistic.
+ * It is hard to predict where the model excels or falls short… Good prompt engineering will lead to the best results.
+ * The model has only been trained with English descriptions and will not perform as well in other languages.
+
+ ### Bias
+
+ **CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
+
+ The model was trained on unfiltered data from the Internet, limited to pictures with English descriptions. Text and images from communities and cultures using other languages were not utilized. This affects all output of the model, with white and Western culture asserted as a default, and content generated from non-English prompts being observably lower quality than content generated from English prompts.
+
+ While the capabilities of image generation models are impressive, they may also reinforce or exacerbate societal biases. The extent and nature of the biases of the DALL·E Mini and DALL·E Mega models have yet to be fully documented, but initial testing demonstrates that they may generate images that contain negative stereotypes against minoritized groups. Work to analyze the nature and extent of the models’ biases and limitations is ongoing.
+
+ Our current analyses demonstrate that:
+ * Images generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+ * When the model generates images with people in them, it tends to output people who we perceive to be white, while people of color are underrepresented.
+ * Images generated by the model can contain biased content that depicts power differentials between people of color and people who are white, with white people in positions of privilege.
+ * The model is generally only usable for generating images based on text in English, limiting accessibility of the model for non-English speakers and potentially contributing to the biases in images generated by the model.
+
+ The [technical report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA) discusses these issues in more detail, and also highlights potential sources of bias in the model development process.
+
+ ### Limitations and Bias Recommendations
+
+ * Users (both direct and downstream) should be made aware of the biases and limitations.
+ * Content that is potentially problematic should be filtered out, e.g., via automated models that detect violence or pornography.
+ * Further work on this model should include methods for balanced and just representations of people and cultures, for example, by curating the training dataset to be both diverse and inclusive.
+
+ ## Training
+
+ ### Training Data
+
+ The model developers used three datasets for the model:
+ * [Conceptual Captions Dataset](https://aclanthology.org/P18-1238/), which contains 3 million image and caption pairs.
+ * [Conceptual 12M](https://arxiv.org/abs/2102.08981), which contains 12 million image and caption pairs.
+ * The [OpenAI subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) of [YFCC100M](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/), which contains about 15 million images, further sub-sampled to 2 million images due to limitations in storage space. The developers used both title and description as caption and removed HTML tags, new lines, and extra spaces.
+
+ For fine-tuning the image encoder, a subset of 2 million images was used.
+ All images (about 15 million) were used for training the Seq2Seq model.
+
+ ### Training Procedure
+
+ As described further in the [technical report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA#our-dall-e-model-architecture) for DALL·E Mini, during training, images and descriptions are both available and pass through the system as follows:
+ * Images are encoded through a [VQGAN](https://arxiv.org/abs/2012.09841) encoder, which turns images into a sequence of tokens.
+ * Descriptions are encoded through a [BART](https://arxiv.org/abs/1910.13461) encoder.
+ * The output of the BART encoder and the encoded images are fed through the BART decoder, which is an auto-regressive model whose goal is to predict the next token.
+ * Loss is the [softmax cross-entropy](https://wandb.ai/sauravm/Activation-Functions/reports/Activation-Functions-Softmax--VmlldzoxNDU1Njgy#%F0%9F%93%A2-softmax-+-cross-entropy-loss-(caution:-math-alert)) between the model prediction logits and the actual image encodings from the VQGAN.
+
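The loss in the last step can be sketched with toy dimensions. This is an illustrative NumPy snippet, not the actual dalle-mini training code; the real decoder predicts 256 image tokens over the 16,384-entry VQGAN codebook, and the sequence length here is shrunk for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

image_vocab = 16384  # VQGAN codebook size (matches config.json)
seq_len = 4          # toy image-token sequence (really 256)

# Decoder output: one logit vector per image-token position.
logits = rng.normal(size=(seq_len, image_vocab))
# Ground-truth VQGAN encodings of the training image.
targets = rng.integers(0, image_vocab, size=seq_len)

# Softmax cross-entropy, computed in a numerically stable way:
# subtract the per-row max before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(seq_len), targets].mean()
print(float(loss))
```

With random logits the loss sits near log(16384) ≈ 9.7 nats; training pushes it down as the decoder learns to predict the next image token.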
+ The simplified training procedure for DALL·E Mega is as follows:
+
+ * **Hardware:** 1 pod TPU v3-256 = 32 nodes of TPU VM v3-8 (8 TPUs per node) = 256 TPU v3 chips
+ * **Optimizer:** Distributed Shampoo
+ * **Model Partition Specifications:** 8 model parallel x 32 data parallel
+ * **Batch:** 44 samples per model x 32 data parallel x 3 gradient accumulation steps = 4224 samples per update, increasing over time
+ * **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant until plateau
+ * Gradient checkpointing is used on each Encoder/Decoder layer (i.e., MHA + FFN)
+ * Distributed Shampoo and Normformer optimizations have proved effective at efficiently scaling this model.
+ * It should also be noted that the learning rate and other parameters are sometimes adjusted on the fly, and the batch size has also been increased over time.
+
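The batch arithmetic and schedule above can be sketched as plain Python. The warmup *shape* is an assumption (the journal only states the peak and duration; linear warmup is typical), and `learning_rate` is a hypothetical helper name, not the repo's API.

```python
# Batch-size bookkeeping from the bullet list above.
per_model, data_parallel, grad_accum = 44, 32, 3
samples_per_update = per_model * data_parallel * grad_accum
print(samples_per_update)  # 4224

def learning_rate(step, peak=1e-4, warmup_steps=10_000):
    # Assumed linear warmup to the peak, then constant until adjusted by hand.
    return peak * min(1.0, step / warmup_steps)

print(learning_rate(5_000))   # 5e-05
print(learning_rate(20_000))  # 0.0001
```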
+ More information about the full procedure and technical material is available in the DALL·E Mega [training journal](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega-Training--VmlldzoxODMxMDI2#training-parameters).
+
+ ## Evaluation Results
+
+ The model developers discuss their results extensively in the [technical report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA#the-results-of-our-dall-e-experiment) for DALL·E Mini, which provides comparisons of DALL·E Mini’s results with [DALL·E-pytorch](https://github.com/lucidrains/DALLE-pytorch), OpenAI’s [DALL·E](https://openai.com/blog/dall-e/), and models consisting of a generator coupled with the [CLIP neural network model](https://openai.com/blog/clip/).
+
+ For evaluation results related to DALL·E Mega, see this [technical report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy).
+
+ ## Environmental Impact
+
+ ### DALL·E Mini Estimated Emissions
+
+ *The model is 27 times smaller than the original DALL·E and was trained on a single TPU v3-8 for only 3 days.*
+
+ Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
+
+ * **Hardware Type:** TPU v3-8
+ * **Hours used:** 72 (3 days)
+ * **Cloud Provider:** GCP (as mentioned in the technical report)
+ * **Compute Region:** us-east1 (provided by model developers)
+ * **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 30.16 kg CO2 eq.
+
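The estimate above is, in essence, runtime times average power draw times the carbon intensity of the regional grid. A minimal sketch of that formula follows; the numeric inputs are placeholders for illustration, NOT the actual TPU v3-8 power figure or us-east1 grid intensity behind the 30.16 kg estimate.

```python
def co2_kg(hours, power_kw, kg_co2_per_kwh):
    # Emissions = runtime x average power draw x grid carbon intensity.
    return hours * power_kw * kg_co2_per_kwh

# Hypothetical inputs for a 72-hour run: 0.5 kW draw, 0.4 kg CO2/kWh grid.
print(round(co2_kg(72, 0.5, 0.4), 2))  # 14.4
```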
+ ### DALL·E Mega Estimated Emissions
+
+ DALL·E Mega is still training. As of June 9, 2022, the model developers report that DALL·E Mega has been training for about 40–45 days on a TPU v3-256. Using those numbers, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
+
+ * **Hardware Type:** TPU v3-256
+ * **Hours used:** 960–1,080 hours (40–45 days)
+ * **Cloud Provider:** Unknown
+ * **Compute Region:** Unknown
+ * **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** Unknown
+
+ ## Citation
+
+ ```bibtex
+ @misc{Dayma_DALL·E_Mini_2021,
+   author = {Dayma, Boris and Patil, Suraj and Cuenca, Pedro and Saifullah, Khalid and Abraham, Tanishq and Lê Khắc, Phúc and Melas, Luke and Ghosh, Ritobrata},
+   doi = {10.5281/zenodo.5146400},
+   month = {7},
+   title = {DALL·E Mini},
+   url = {https://github.com/borisdayma/dalle-mini},
+   year = {2021}
+ }
+ ```
+
+ *This model card was written by: Boris Dayma, Margaret Mitchell, Ezi Ozoani, Marissa Gerchick, Irene Solaiman, Clémentine Fourrier, Sasha Luccioni, Emily Witko, Nazneen Rajani, and Julian Herrera.*
config.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": [
+     "DalleBart"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 16385,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 2730,
+   "decoder_layers": 12,
+   "decoder_start_token_id": 16384,
+   "do_sample": true,
+   "dropout": 0.0,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 2730,
+   "encoder_layers": 12,
+   "encoder_vocab_size": 50264,
+   "eos_token_id": 16385,
+   "force_ln_scale": false,
+   "gradient_checkpointing": true,
+   "image_length": 256,
+   "image_vocab_size": 16384,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "ln_positions": "normformer",
+   "ln_type": "layernorm",
+   "max_length": 257,
+   "max_text_length": 64,
+   "min_length": 257,
+   "model_type": "dallebart",
+   "normalize_text": true,
+   "pad_token_id": 16385,
+   "scale_embedding": false,
+   "sinkhorn_iters": 1,
+   "tau_init": 0.05,
+   "tie_word_embeddings": false,
+   "transformers_version": "4.19.0.dev0",
+   "use_absolute_position_embeddings": true,
+   "use_alibi": false,
+   "use_bias": false,
+   "use_cache": true,
+   "use_cosine_attention": false,
+   "use_deepnet_scaling": false,
+   "use_final_ln_decoder": true,
+   "use_final_ln_encoder": true,
+   "use_glu": true,
+   "use_head_scale": false,
+   "use_scan": true,
+   "use_swin_position_embeddings": false
+ }
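Several of these fields are interdependent. A small sanity check in plain Python, using values copied from the config above, shows how the decoder's token-id space is laid out: the 16,384 VQGAN codebook ids come first, and the special tokens sit just past them.

```python
# A subset of config.json, checking the token-id layout it implies.
config = {
    "image_length": 256,
    "image_vocab_size": 16384,
    "max_length": 257,
    "min_length": 257,
    "bos_token_id": 16385,
    "eos_token_id": 16385,
    "pad_token_id": 16385,
    "decoder_start_token_id": 16384,
}

# Generated length is fixed: 256 image tokens plus one start token.
assert config["max_length"] == config["image_length"] + 1
assert config["min_length"] == config["max_length"]
# Special ids follow the codebook (image tokens occupy ids 0..16383).
assert config["decoder_start_token_id"] == config["image_vocab_size"]
assert config["bos_token_id"] == config["image_vocab_size"] + 1
print("config layout consistent")
```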
enwiki-words-frequency.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26bc86b265dacb1f3974fc5aec5fd2523319a2a5207c6ff5c3c602a0f146574b
+ size 34196068
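What the repository stores here is a Git LFS pointer, not the 34 MB word-frequency file itself. The pointer format is simple `key value` lines, which can be parsed as below (a minimal sketch for illustration; real tooling should go through the `git lfs` CLI):

```python
# The pointer text from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:26bc86b265dacb1f3974fc5aec5fd2523319a2a5207c6ff5c3c602a0f146574b
size 34196068
"""

# Each line is "key value"; split on the first space only.
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)
print(algo, len(digest), int(fields["size"]))  # sha256 64 34196068
```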
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:df70127fbf804aaec7b9eeb54bc123283405ea74e5047bede4986638a384951a
+ size 1751336743
gitattributes.txt ADDED
@@ -0,0 +1,28 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ enwiki-words-frequency.txt filter=lfs diff=lfs merge=lfs -text
merges.txt ADDED
The diff for this file is too large to render.
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"errors": "replace", "bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false, "__type": "AddedToken"}, "add_prefix_space": false, "trim_offsets": true, "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "boris/dalle-mini-tokenizer", "use_fast": true, "tokenizer_class": "DalleBartTokenizer"}
vocab.json ADDED
The diff for this file is too large to render.