akhaliq (HF staff) committed
Commit 9d5ad98
1 Parent(s): 261b6ba

Update README.md

Files changed (1)
  1. README.md +12 -218
README.md CHANGED
@@ -1,218 +1,12 @@
- <div align='center'>
- <h1>Emu3: Next-Token Prediction is All You Need</h1>
- <h3></h3>
-
- [Emu3 Team, BAAI](https://www.baai.ac.cn/english.html)
-
- | [Project Page](https://emu.baai.ac.cn) | [Paper](https://baai-solution.ks3-cn-beijing.ksyuncs.com/emu3/Emu3-tech-report.pdf?KSSAccessKeyId=AKLTgew6Kdg6RsK92QSfB2KLA&Expires=2591406552&Signature=6BvwfLVqvfww26Bhwvk3mG0FrL8%3D) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f) |
-
-
- </div>
-
- <div align='center'>
- <img src="./assets/arch.png" class="interpolation-image" alt="arch." height="80%" width="70%" />
- </div>
-
- We introduce **Emu3**, a new suite of state-of-the-art multimodal models trained solely with **<i>next-token prediction</i>**! By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
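-
- As a purely illustrative sketch (not Emu3's actual training code), the unified objective can be pictured like this: text and vision tokens share one discrete vocabulary, are concatenated into a single sequence, and the transformer is trained with ordinary next-token cross-entropy. All sizes and names below are hypothetical placeholders.
-
- ```python
- import torch
- import torch.nn.functional as F
-
- # Hypothetical joint text + vision vocabulary and a toy batch.
- vocab_size = 32000
- text_ids = torch.randint(0, vocab_size, (1, 16))     # stand-in text tokens
- vision_ids = torch.randint(0, vocab_size, (1, 64))   # stand-in discrete image tokens
-
- # One flat multimodal sequence; the model has no modality-specific branches.
- sequence = torch.cat([text_ids, vision_ids], dim=1)
- inputs, targets = sequence[:, :-1], sequence[:, 1:]  # shift for next-token prediction
-
- # Placeholder for transformer(inputs); only the loss computation matters here.
- logits = torch.randn(1, inputs.shape[1], vocab_size)
- loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
- print(loss.item())
- ```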
-
- ### Emu3 excels in both generation and perception
- **Emu3** outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6, and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.
-
- <div align='center'>
- <img src="./assets/comparison.png" class="interpolation-image" alt="comparison." height="80%" width="80%" />
- </div>
-
- ### Highlights
-
- - **Emu3** generates high-quality images from text input simply by predicting the next vision token, and it naturally supports flexible resolutions and styles.
- - **Emu3** shows strong vision-language understanding: it can see the physical world and provide coherent text responses. Notably, this capability is achieved without depending on CLIP or a pretrained LLM.
- - **Emu3** generates videos causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. Given a video as context, Emu3 can also naturally extend it and predict what will happen next.
-
-
- ### TODO
-
- - [X] Release model weights of the tokenizer, Emu3-Chat, and Emu3-Gen.
- - [X] Release the inference code.
- - [ ] Release the evaluation code.
- - [ ] Release training scripts for pre-training, SFT, and DPO.
-
-
- ### Setup
-
- Clone this repository and install the required packages:
-
- ```shell
- git clone https://github.com/baaivision/Emu3
- cd Emu3
-
- pip install -r requirements.txt
- ```
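-
- The quickstart examples below load the models with `attn_implementation="flash_attention_2"`, which relies on the separate flash-attn package. If it is not already present in your environment, you may need to install it as well (a typical install command is shown below; a working CUDA build environment is assumed):
-
- ```shell
- pip install flash-attn --no-build-isolation
- ```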
-
- ### Model Weights
-
- | Model name | HF Weight |
- | ------------------ | ------------------------------------------------------- |
- | **Emu3-Chat** | [🤗 HF link](https://huggingface.co/BAAI/Emu3-Chat) |
- | **Emu3-Gen** | [🤗 HF link](https://huggingface.co/BAAI/Emu3-Gen) |
- | **Emu3-VisionTokenizer** | [🤗 HF link](https://huggingface.co/BAAI/Emu3-VisionTokenizer) |
-
- ### Quickstart
-
- #### Use 🤗Transformers to run Emu3-Gen for image generation
- ```python
- from PIL import Image
- from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
- from transformers.generation.configuration_utils import GenerationConfig
- from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
- import torch
-
- from emu3.mllm.processing_emu3 import Emu3Processor
-
-
- # model path
- EMU_HUB = "BAAI/Emu3-Gen"
- VQ_HUB = "BAAI/Emu3-VisionTokenizer"
-
- # prepare model and processor
- model = AutoModelForCausalLM.from_pretrained(
-     EMU_HUB,
-     device_map="cuda:0",
-     torch_dtype=torch.bfloat16,
-     attn_implementation="flash_attention_2",
-     trust_remote_code=True,
- )
-
- tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
- image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
- image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
- processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)
-
- # prepare input
- POSITIVE_PROMPT = " masterpiece, film grained, best quality."
- NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
-
- classifier_free_guidance = 3.0
- prompt = "a portrait of young girl."
- prompt += POSITIVE_PROMPT
-
- kwargs = dict(
-     mode='G',
-     ratio="1:1",
-     image_area=model.config.image_area,
-     return_tensors="pt",
- )
- pos_inputs = processor(text=prompt, **kwargs)
- neg_inputs = processor(text=NEGATIVE_PROMPT, **kwargs)
-
- # prepare hyperparameters
- GENERATION_CONFIG = GenerationConfig(
-     use_cache=True,
-     eos_token_id=model.config.eos_token_id,
-     pad_token_id=model.config.pad_token_id,
-     max_new_tokens=40960,
-     do_sample=True,
-     top_k=2048,
- )
-
- h, w = pos_inputs.image_size[0]
- constrained_fn = processor.build_prefix_constrained_fn(h, w)
- logits_processor = LogitsProcessorList([
-     UnbatchedClassifierFreeGuidanceLogitsProcessor(
-         classifier_free_guidance,
-         model,
-         unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
-     ),
-     PrefixConstrainedLogitsProcessor(
-         constrained_fn,
-         num_beams=1,
-     ),
- ])
-
- # generate
- outputs = model.generate(
-     pos_inputs.input_ids.to("cuda:0"),
-     GENERATION_CONFIG,
-     logits_processor=logits_processor
- )
-
- mm_list = processor.decode(outputs[0])
- for idx, im in enumerate(mm_list):
-     if not isinstance(im, Image.Image):
-         continue
-     im.save(f"result_{idx}.png")
- ```
-
- #### Use 🤗Transformers to run Emu3-Chat for vision-language understanding
-
- ```python
- from PIL import Image
- from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
- from transformers.generation.configuration_utils import GenerationConfig
- import torch
-
- from emu3.mllm.processing_emu3 import Emu3Processor
-
-
- # model path
- EMU_HUB = "BAAI/Emu3-Chat"
- VQ_HUB = "BAAI/Emu3-VisionTokenizer"
-
- # prepare model and processor
- model = AutoModelForCausalLM.from_pretrained(
-     EMU_HUB,
-     device_map="cuda:0",
-     torch_dtype=torch.bfloat16,
-     attn_implementation="flash_attention_2",
-     trust_remote_code=True,
- )
-
- tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
- image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
- image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
- processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)
-
- # prepare input
- text = "Please describe the image"
- image = Image.open("assets/demo.png")
-
- inputs = processor(
-     text=text,
-     image=image,
-     mode='U',
-     padding_side="left",
-     padding="longest",
-     return_tensors="pt",
- )
-
- # prepare hyperparameters
- GENERATION_CONFIG = GenerationConfig(pad_token_id=tokenizer.pad_token_id, bos_token_id=tokenizer.bos_token_id, eos_token_id=tokenizer.eos_token_id)
-
- # generate
- outputs = model.generate(
-     inputs.input_ids.to("cuda:0"),
-     GENERATION_CONFIG,
-     max_new_tokens=320,
- )
-
- outputs = outputs[:, inputs.input_ids.shape[-1]:]
- print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
- ```
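-
- #### Use 🤗Transformers to run Emu3-VisionTokenizer on its own (illustrative sketch)
-
- The Emu3-VisionTokenizer listed under Model Weights can also be loaded standalone to map an image into discrete vision tokens and reconstruct it. The snippet below is a hedged sketch rather than a documented recipe: the `encode`/`decode` method names are assumptions about the tokenizer's `trust_remote_code` implementation, so adapt them to whatever the released code actually exposes.
-
- ```python
- import torch
- from PIL import Image
- from transformers import AutoModel, AutoImageProcessor
-
- VQ_HUB = "BAAI/Emu3-VisionTokenizer"
-
- # load the discrete vision tokenizer and its image processor
- image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
- image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
-
- # preprocess an image into pixel values
- image = Image.open("assets/demo.png")
- pixel_values = image_processor(image, return_tensors="pt")["pixel_values"].to("cuda:0")
-
- with torch.no_grad():
-     codes = image_tokenizer.encode(pixel_values)   # pixels -> discrete token ids (assumed API)
-     recon = image_tokenizer.decode(codes)          # discrete token ids -> pixels (assumed API)
-
- print("vision token grid:", tuple(codes.shape))
- ```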
-
- ## Acknowledgement
-
- We thank the authors of [Emu Series](https://github.com/baaivision/Emu), [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL), and [MoVQGAN](https://github.com/ai-forever/MoVQGAN) for their great work.
-
- <!--
- ## Citation
-
- If you find Emu3 useful for your research and applications, please consider starring this repository and citing:
-
- ```
- @article{Emu2,
-   title={Generative Multimodal Models are In-Context Learners},
-   author={Quan Sun and Yufeng Cui and Xiaosong Zhang and Fan Zhang and Qiying Yu and Zhengxiong Luo and Yueze Wang and Yongming Rao and Jingjing Liu and Tiejun Huang and Xinlong Wang},
-   publisher={arXiv preprint arXiv:2312.13286},
-   year={2023},
- }
- ```
- -->
 
+ ---
+ title: Emu3
+ emoji: 🌖
+ colorFrom: gray
+ colorTo: green
+ sdk: gradio
+ sdk_version: 5.0.0b1
+ app_file: app.py
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference