yuyingge committed
Commit
590af54
1 Parent(s): 1264376

Add application file

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. License.txt +335 -0
  2. configs/.DS_Store +0 -0
  3. configs/clm_models/.DS_Store +0 -0
  4. configs/clm_models/agent_seed_x_i.yaml +23 -0
  5. configs/clm_models/llm_seed_x_i.yaml +3 -0
  6. configs/discrete_model/.DS_Store +0 -0
  7. configs/discrete_model/discrete_identity.yaml +1 -0
  8. configs/processer/.DS_Store +0 -0
  9. configs/processer/qwen_448_transform.yaml +4 -0
  10. configs/sdxl_adapter/.DS_Store +0 -0
  11. configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_full_with_latent_image_pretrain_no_normalize.yaml +20 -0
  12. configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_pretrain_no_normalize.yaml +18 -0
  13. configs/tokenizer/.DS_Store +0 -0
  14. configs/tokenizer/clm_llama_tokenizer_224loc_anyres.yaml +2 -0
  15. configs/visual_encoder/.DS_Store +0 -0
  16. configs/visual_encoder/qwen_vitg_448.yaml +11 -0
  17. pretrained/QwenViT/qwen_vit_G.pt +3 -0
  18. requirements.txt +11 -0
  19. seed_x/arrow.jpg +0 -0
  20. seed_x/bank.png +0 -0
  21. src/.DS_Store +0 -0
  22. src/demo/__pycache__/conversation.cpython-311.pyc +0 -0
  23. src/demo/__pycache__/conversation.cpython-38.pyc +0 -0
  24. src/demo/__pycache__/utils.cpython-311.pyc +0 -0
  25. src/demo/__pycache__/utils.cpython-38.pyc +0 -0
  26. src/demo/configs/agent_13b_anyres_out_64_pretrain_merged.yaml +29 -0
  27. src/demo/configs/agent_13b_in100_out64_rs5_merged_pretrain.yaml +22 -0
  28. src/demo/configs/llama2chat13b_merged_100imgtokens.yaml +12 -0
  29. src/demo/conversation.py +182 -0
  30. src/demo/seed_llama_flask.py +379 -0
  31. src/demo/seed_llama_gradio.py +465 -0
  32. src/demo/utils.py +83 -0
  33. src/inference/.DS_Store +0 -0
  34. src/inference/__pycache__/any_res.cpython-311.pyc +0 -0
  35. src/inference/__pycache__/any_res.cpython-38.pyc +0 -0
  36. src/inference/any_res.py +257 -0
  37. src/inference/eval_img2edit_seed_x.py +155 -0
  38. src/inference/eval_img2text_seed_x.py +235 -0
  39. src/inference/eval_text2img_seed_x.py +94 -0
  40. src/models/detokenizer/__init__.py +1 -0
  41. src/models/detokenizer/__pycache__/__init__.cpython-311.pyc +0 -0
  42. src/models/detokenizer/__pycache__/__init__.cpython-38.pyc +0 -0
  43. src/models/detokenizer/__pycache__/adapter_modules.cpython-311.pyc +0 -0
  44. src/models/detokenizer/__pycache__/adapter_modules.cpython-38.pyc +0 -0
  45. src/models/detokenizer/__pycache__/attention_processor.cpython-38.pyc +0 -0
  46. src/models/detokenizer/__pycache__/ipa_utils.cpython-38.pyc +0 -0
  47. src/models/detokenizer/__pycache__/pipeline_stable_diffusion_t2i_edit.cpython-38.pyc +0 -0
  48. src/models/detokenizer/__pycache__/pipeline_stable_diffusion_xl_t2i_edit.cpython-311.pyc +0 -0
  49. src/models/detokenizer/__pycache__/pipeline_stable_diffusion_xl_t2i_edit.cpython-38.pyc +0 -0
  50. src/models/detokenizer/__pycache__/resampler.cpython-311.pyc +0 -0
License.txt ADDED
@@ -0,0 +1,335 @@
1
+ Tencent is pleased to support the open source community by making Seed-X available.
2
+
3
+ Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
4
+
5
+ Seed-X is licensed under the Apache License Version 2.0 except for the third-party components listed below.
6
+
7
+
8
+ Terms of the Apache License Version 2.0:
9
+ --------------------------------------------------------------------
10
+ Apache License
11
+
12
+ Version 2.0, January 2004
13
+
14
+ http://www.apache.org/licenses/
15
+
16
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
17
+ 1. Definitions.
18
+
19
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
20
+
21
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
22
+
23
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
24
+
25
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
28
+
29
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
30
+
31
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
32
+
33
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
34
+
35
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
36
+
37
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
38
+
39
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
40
+
41
+ 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
42
+
43
+ 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
44
+
45
+ You must give any other recipients of the Work or Derivative Works a copy of this License; and
46
+
47
+ You must cause any modified files to carry prominent notices stating that You changed the files; and
48
+
49
+ You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
50
+
51
+ If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
52
+
53
+ You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
54
+
55
+ 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
56
+
57
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
58
+
59
+ 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
60
+
61
+ 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
62
+
63
+ 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
64
+
65
+ END OF TERMS AND CONDITIONS
66
+
67
+
68
+
69
+ Other dependencies and licenses:
70
+
71
+
72
+ Open Source Software Licensed under the Apache License Version 2.0:
73
+ --------------------------------------------------------------------
74
+ 1. transformers
75
+ Copyright 2018- The Hugging Face team. All rights reserved.
76
+ Source code of this software can be obtained from: https://github.com/huggingface/transformers/blob/v4.30.2/
77
+
78
+ 2. diffusers
79
+ Copyright 2023 The HuggingFace Team. All rights reserved.
80
+ Source code of this software can be obtained from: https://github.com/huggingface/diffusers/blob/v0.25.0/
81
+
82
+ A copy of Apache 2.0 has been included in this file.
83
+
84
+
85
+
86
+ Open Source Software Licensed under the BSD 3-Clause License:
87
+ --------------------------------------------------------------------
88
+ 1. torchvision
89
+ Copyright (c) Soumith Chintala 2016,
90
+ All rights reserved.
91
+
92
+ Terms of the BSD 3-Clause License:
93
+ --------------------------------------------------------------------
94
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
95
+
96
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
97
+
98
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
99
+
100
+ 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
101
+
102
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
103
+
104
+
105
+
106
+ Open Source Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
107
+ --------------------------------------------------------------------
108
+ 1. numpy
109
+ Copyright (c) 2005-2021, NumPy Developers.
110
+ All rights reserved.
111
+
112
+ A copy of the BSD 3-Clause License is included in this file.
113
+
114
+ For the license of other third party components, please refer to the following URL:
115
+ https://github.com/numpy/numpy/blob/v1.20.1/LICENSES_bundled.txt
116
+
117
+
118
+
119
+ Open Source Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
120
+ --------------------------------------------------------------------
121
+ 1. torch
122
+ Copyright (c) 2016- Facebook, Inc (Adam Paszke)
123
+ Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
124
+ Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
125
+ Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
126
+ Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
127
+ Copyright (c) 2011-2013 NYU (Clement Farabet)
128
+ Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
129
+ Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
130
+ Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
131
+
132
+ A copy of the BSD 3-Clause License is included in this file.
133
+
134
+ For the license of other third party components, please refer to the following URL:
135
+ https://github.com/pytorch/pytorch/blob/v2.0.1/NOTICE
136
+
137
+
138
+
139
+ Open Source Software Licensed under the LLAMA 2 Community License:
140
+ --------------------------------------------------------------------
141
+ 1. Llama 2
142
+ Copyright (c) Meta Platforms, Inc. All Rights Reserved.
143
+
144
+
145
+ Terms of the LLAMA 2 COMMUNITY LICENSE AGREEMENT:
146
+ --------------------------------------------------------------------
147
+ LLAMA 2 COMMUNITY LICENSE AGREEMENT
148
+ Llama 2 Version Release Date: July 18, 2023
149
+
150
+ "Agreement" means the terms and conditions for use, reproduction, distribution and
151
+ modification of the Llama Materials set forth herein.
152
+
153
+ "Documentation" means the specifications, manuals and documentation
154
+ accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
155
+ libraries/llama-downloads/.
156
+
157
+ "Licensee" or "you" means you, or your employer or any other person or entity (if
158
+ you are entering into this Agreement on such person or entity's behalf), of the age
159
+ required under applicable laws, rules or regulations to provide legal consent and that
160
+ has legal authority to bind your employer or such other person or entity if you are
161
+ entering in this Agreement on their behalf.
162
+
163
+ "Llama 2" means the foundational large language models and software and
164
+ algorithms, including machine-learning model code, trained model weights,
165
+ inference-enabling code, training-enabling code, fine-tuning enabling code and other
166
+ elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
167
+ libraries/llama-downloads/.
168
+
169
+ "Llama Materials" means, collectively, Meta's proprietary Llama 2 and
170
+ Documentation (and any portion thereof) made available under this Agreement.
171
+
172
+ "Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
173
+ are an entity, your principal place of business is in the EEA or Switzerland) and Meta
174
+ Platforms, Inc. (if you are located outside of the EEA or Switzerland).
175
+
176
+ By clicking "I Accept" below or by using or distributing any portion or element of the
177
+ Llama Materials, you agree to be bound by this Agreement.
178
+
179
+ 1. License Rights and Redistribution.
180
+
181
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
182
+ transferable and royalty-free limited license under Meta's intellectual property or
183
+ other rights owned by Meta embodied in the Llama Materials to use, reproduce,
184
+ distribute, copy, create derivative works of, and make modifications to the Llama
185
+ Materials.
186
+
187
+ b. Redistribution and Use.
188
+
189
+ i. If you distribute or make the Llama Materials, or any derivative works
190
+ thereof, available to a third party, you shall provide a copy of this Agreement to such
191
+ third party.
192
+ ii. If you receive Llama Materials, or any derivative works thereof, from
193
+ a Licensee as part of an integrated end user product, then Section 2 of this
194
+ Agreement will not apply to you.
195
+
196
+ iii. You must retain in all copies of the Llama Materials that you
197
+ distribute the following attribution notice within a "Notice" text file distributed as a
198
+ part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
199
+ Copyright (c) Meta Platforms, Inc. All Rights Reserved."
200
+
201
+ iv. Your use of the Llama Materials must comply with applicable laws
202
+ and regulations (including trade compliance laws and regulations) and adhere to the
203
+ Acceptable Use Policy for the Llama Materials (available at
204
+ https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
205
+ this Agreement.
206
+
207
+ v. You will not use the Llama Materials or any output or results of the
208
+ Llama Materials to improve any other large language model (excluding Llama 2 or
209
+ derivative works thereof).
210
+
211
+ 2. Additional Commercial Terms. If, on the Llama 2 version release date, the
212
+ monthly active users of the products or services made available by or for Licensee,
213
+ or Licensee's affiliates, is greater than 700 million monthly active users in the
214
+ preceding calendar month, you must request a license from Meta, which Meta may
215
+ grant to you in its sole discretion, and you are not authorized to exercise any of the
216
+ rights under this Agreement unless or until Meta otherwise expressly grants you
217
+ such rights.
218
+
219
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
220
+ LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
221
+ PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
222
+ EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
223
+ WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
224
+ FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
225
+ FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
226
+ THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
227
+ USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
228
+
229
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
230
+ LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
231
+ NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
232
+ AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
233
+ CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
234
+ IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
235
+ ANY OF THE FOREGOING.
236
+
237
+ 5. Intellectual Property.
238
+
239
+ a. No trademark licenses are granted under this Agreement, and in
240
+ connection with the Llama Materials, neither Meta nor Licensee may use any name
241
+ or mark owned by or associated with the other or any of its affiliates, except as
242
+ required for reasonable and customary use in describing and redistributing the
243
+ Llama Materials.
244
+
245
+ b. Subject to Meta's ownership of Llama Materials and derivatives made by or
246
+ for Meta, with respect to any derivative works and modifications of the Llama
247
+ Materials that are made by you, as between you and Meta, you are and will be the
248
+ owner of such derivative works and modifications.
249
+
250
+ c. If you institute litigation or other proceedings against Meta or any entity
251
+ (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
252
+ Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
253
+ constitutes infringement of intellectual property or other rights owned or licensable
254
+ by you, then any licenses granted to you under this Agreement shall terminate as of
255
+ the date such litigation or claim is filed or instituted. You will indemnify and hold
256
+ harmless Meta from and against any claim by any third party arising out of or related
257
+ to your use or distribution of the Llama Materials.
258
+
259
+ 6. Term and Termination. The term of this Agreement will commence upon your
260
+ acceptance of this Agreement or access to the Llama Materials and will continue in
261
+ full force and effect until terminated in accordance with the terms and conditions
262
+ herein. Meta may terminate this Agreement if you are in breach of any term or
263
+ condition of this Agreement. Upon termination of this Agreement, you shall delete
264
+ and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
265
+ termination of this Agreement.
266
+
267
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and
268
+ construed under the laws of the State of California without regard to choice of law
269
+ principles, and the UN Convention on Contracts for the International Sale of Goods
270
+ does not apply to this Agreement. The courts of California shall have exclusive
271
+ jurisdiction of any dispute arising out of this Agreement.
272
+
273
+
274
+
275
+ Open Source Software Licensed under the Tongyi Qianwen LICENSE AGREEMENT:
276
+ --------------------------------------------------------------------
277
+ 1. Qwen-VL
278
+ Copyright (c) Alibaba Cloud. All Rights Reserved.
279
+
280
+
281
+ Terms of the Tongyi Qianwen LICENSE AGREEMENT:
282
+ --------------------------------------------------------------------
283
+ Tongyi Qianwen LICENSE AGREEMENT
284
+
285
+ Tongyi Qianwen Release Date: August 23, 2023
286
+
287
+ By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
288
+
289
+ 1. Definitions
290
+ a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
291
+ b. "We"(or "Us") shall mean Alibaba Cloud.
292
+ c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
293
+ d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
294
+ e. "Tongyi Qianwen" shall mean the large language models (including Qwen-VL model and Qwen-VL-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
295
+ f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
296
+ g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
297
+ h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
298
+ and conversions to other media types.
299
+
300
+ 2. Grant of Rights
301
+ You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
302
+
303
+ 3. Redistribution
304
+ You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
305
+ a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
306
+ b. You shall cause any modified files to carry prominent notices stating that You changed the files;
307
+ c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
308
+ d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
309
+
310
+ 4. Restrictions
311
+ If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
312
+
313
+ 5. Rules of use
314
+ a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
315
+ b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
316
+
317
+ 6. Intellectual Property
318
+ a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
319
+ b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
320
+ c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
321
+
322
+ 7. Disclaimer of Warranty and Limitation of Liability
323
+
324
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
325
+ b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
326
+ c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
327
+ d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
328
+
329
+ 8. Survival and Termination.
330
+ a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
331
+ b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
332
+
333
+ 9. Governing Law and Jurisdiction.
334
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
335
+ b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
configs/.DS_Store ADDED
Binary file (8.2 kB).
 
configs/clm_models/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/clm_models/agent_seed_x_i.yaml ADDED
@@ -0,0 +1,23 @@
+ _target_: src.models.mllm.seed_x.ContinuousLVLM.from_pretrained
+ input_resampler:
+   _target_: src.models.tokenizer.qwen_visual.Resampler
+   grid_size: 8
+   embed_dim: 5120
+   num_heads: 32
+   kv_dim: 4096
+
+ output_resampler:
+   _target_: src.models.tokenizer.qwen_visual.Resampler
+   grid_size: 8
+   embed_dim: 4096
+   num_heads: 32
+   kv_dim: 5120
+
+ add_patch_pos: True
+ vit_down: True
+ mse: True
+
+ lm_loss_scale: 1.0
+ rec_loss_scale: 6.0
+
+ pretrained_model_path: https://huggingface.co/AILab-CVC/SEED-X-17B/blob/main/seed_x_i/agent/pytorch_model.bin
configs/clm_models/llm_seed_x_i.yaml ADDED
@@ -0,0 +1,3 @@
+ _target_: src.models.mllm.modeling_llama_xformer.LlamaForCausalLM.from_pretrained
+ pretrained_model_name_or_path: https://huggingface.co/AILab-CVC/SEED-X-17B/tree/main/seed_x_i/llm
+ low_cpu_mem_usage: True
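
The agent and LLM configs above are consumed together through Hydra, following the same pattern as src/demo/seed_llama_flask.py added in this commit. A minimal sketch, assuming the Hugging Face URLs in the configs have been swapped for local paths to the downloaded SEED-X-17B weights:

import hydra
import torch
from omegaconf import OmegaConf

# Build the language model first (llm_seed_x_i.yaml), then pass it into the
# agent target (agent_seed_x_i.yaml) as a keyword argument, as the demo
# service does.
llm_cfg = OmegaConf.load('configs/clm_models/llm_seed_x_i.yaml')
llm = hydra.utils.instantiate(llm_cfg, torch_dtype=torch.float16)

agent_cfg = OmegaConf.load('configs/clm_models/agent_seed_x_i.yaml')
agent = hydra.utils.instantiate(agent_cfg, llm=llm)
agent.eval().to('cuda', dtype=torch.float16)
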
configs/discrete_model/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/discrete_model/discrete_identity.yaml ADDED
@@ -0,0 +1 @@
+ _target_: src.models.tokenizer.discrete_models.DiscreteModleIdentity
configs/processer/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/processer/qwen_448_transform.yaml ADDED
@@ -0,0 +1,4 @@
+ _target_: src.processer.transforms.get_transform
+ type: clip
+ image_size: 448
+ keep_ratio: False
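
This transform config builds the 448x448 CLIP-style preprocessing applied to input images before they reach the visual encoder. A small usage sketch (the sample image is one of the files added under seed_x/ in this commit):

import hydra
from PIL import Image
from omegaconf import OmegaConf

# Instantiate the preprocessing callable and turn a PIL image into the pixel
# tensor consumed by the Qwen ViT visual encoder.
transform_cfg = OmegaConf.load('configs/processer/qwen_448_transform.yaml')
image_transform = hydra.utils.instantiate(transform_cfg)
pixel_values = image_transform(Image.open('seed_x/arrow.jpg').convert('RGB'))
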
configs/sdxl_adapter/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_full_with_latent_image_pretrain_no_normalize.yaml ADDED
@@ -0,0 +1,20 @@
+ _target_: src.models.detokenizer.adapter_modules.SDXLAdapterWithLatentImage.from_pretrained
+
+ resampler:
+   _target_: src.models.detokenizer.resampler.ResamplerXLV2
+   dim: 1024
+   depth: 4
+   dim_head: 64
+   heads: 16
+   num_queries: 64
+   embedding_dim: 4096
+   output1_dim: 768
+   output2_dim: 1280
+   ff_mult: 4
+   normalize: False
+
+ full_ft: True
+ set_trainable_late: False
+
+ vit_down: True
+ pretrained_model_path: pretrained/seed_detokenizer/second_stage/pytorch_model.bin
configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_pretrain_no_normalize.yaml ADDED
@@ -0,0 +1,18 @@
+ _target_: src.models.detokenizer.adapter_modules.SDXLAdapter.from_pretrained
+
+ resampler:
+   _target_: src.models.detokenizer.resampler.ResamplerXLV2
+   dim: 1024
+   depth: 4
+   dim_head: 64
+   heads: 16
+   num_queries: 64
+   embedding_dim: 4096
+   output1_dim: 768
+   output2_dim: 1280
+   ff_mult: 4
+   normalize: False
+
+ vit_down: True
+
+ pretrained_model_path: https://huggingface.co/AILab-CVC/SEED-X-17B/blob/main/seed_detokenizer/first_stage/pytorch_model.bin
configs/tokenizer/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/tokenizer/clm_llama_tokenizer_224loc_anyres.yaml ADDED
@@ -0,0 +1,2 @@
+ _target_: transformers.LlamaTokenizer.from_pretrained
+ pretrained_model_name_or_path: https://huggingface.co/AILab-CVC/SEED-X-17B/tree/main/cvlm_llama2_tokenizer_100img_and_224loc_addpatch
configs/visual_encoder/.DS_Store ADDED
Binary file (6.15 kB).
 
configs/visual_encoder/qwen_vitg_448.yaml ADDED
@@ -0,0 +1,11 @@
+ _target_: src.models.tokenizer.qwen_visual.VisionTransformerWithAttnPool.from_pretrained
+ heads: 16
+ image_size: 448
+ image_start_id: 151857
+ layers: 48
+ mlp_ratio: 4.9231
+ output_dim: 4096
+ patch_size: 14
+ width: 1664
+
+ pretrained_model_path: pretrained/QwenViT/qwen_vit_G.pt
pretrained/QwenViT/qwen_vit_G.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d951083fc79b07bdb84be61944eb263b8e14572fe2dc4fa80b0447f83064463c
+ size 3871440281
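
The checkpoint itself is stored via Git LFS; the pointer above records its SHA-256 digest and size, which can be used to sanity-check a downloaded copy. A sketch, using the path referenced by configs/visual_encoder/qwen_vitg_448.yaml:

import hashlib
from pathlib import Path

path = Path('pretrained/QwenViT/qwen_vit_G.pt')
sha256 = hashlib.sha256()
with path.open('rb') as f:
    # Hash in 4 MiB chunks to avoid loading the ~3.9 GB file into memory at once.
    for chunk in iter(lambda: f.read(1 << 22), b''):
        sha256.update(chunk)
assert path.stat().st_size == 3871440281
assert sha256.hexdigest() == 'd951083fc79b07bdb84be61944eb263b8e14572fe2dc4fa80b0447f83064463c'
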
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ torch==2.0.1
+ hydra-core
+ transformers==4.30.2
+ diffusers==0.25.0
+ sentencepiece
+ opencv-python
+ deepspeed
+ pyrootutils
+ xformers>=0.0.20
+ accelerate
+ transformers_stream_generator
seed_x/arrow.jpg ADDED
seed_x/bank.png ADDED
src/.DS_Store ADDED
Binary file (10.2 kB).
 
src/demo/__pycache__/conversation.cpython-311.pyc ADDED
Binary file (8.21 kB).
 
src/demo/__pycache__/conversation.cpython-38.pyc ADDED
Binary file (4.43 kB).
 
src/demo/__pycache__/utils.cpython-311.pyc ADDED
Binary file (4.39 kB).
 
src/demo/__pycache__/utils.cpython-38.pyc ADDED
Binary file (2.32 kB).
 
src/demo/configs/agent_13b_anyres_out_64_pretrain_merged.yaml ADDED
@@ -0,0 +1,29 @@
+ _target_: src.models_clm.models.ContinuousLVLM.from_pretrained
+ input_resampler:
+   _target_: src.models.qwen_visual.Resampler
+   grid_size: 8
+   embed_dim: 5120
+   num_heads: 32
+   kv_dim: 4096
+
+ output_resampler:
+   _target_: src.models.qwen_visual.Resampler
+   grid_size: 8
+   embed_dim: 4096
+   num_heads: 32
+   kv_dim: 5120
+
+ add_patch_pos: True
+ vit_down: True
+ mse: True
+
+ lm_loss_scale: 1.0
+ rec_loss_scale: 6.0
+
+ #pretrained_model_path: /chat_sh/share_300719895/user/jinguozhu/codes/work_dirs/sft_exp_new_acc4/checkpoint-2000/pytorch_model.bin
+ #pretrained_model_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/03_27_any_res_sft_from_merged_10k/checkpoint-9000/pytorch_model.bin
+ #pretrained_model_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/03_27_any_res_sft_from_merged_10k/checkpoint-8000-merged/agent/pytorch_model.bin
+ #pretrained_model_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_09_any_res_sft_editing_from_merged_10k_32a100_new_data/checkpoint-6000-merged/agent/pytorch_model.bin
+ #pretrained_model_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_16_any_res_sft_editing_from_merged_H800_23k_16_gpu_2_new/checkpoint-6000-merged/agent/pytorch_model.bin
+ #pretrained_model_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_16_any_res_sft_com_gen_from_merged_H800_23k/checkpoint-15000-merged/agent/pytorch_model.bin
+ pretrained_model_path: /group/40034/yuyingge/SEED_X_inference/pretrained/seed_x_i/agent/pytorch_model.bin
src/demo/configs/agent_13b_in100_out64_rs5_merged_pretrain.yaml ADDED
@@ -0,0 +1,22 @@
+ _target_: src.models_clm.models.ContinuousLVLM.from_pretrained
+ input_resampler:
+   _target_: src.models.qwen_visual.Resampler
+   grid_size: 10
+   embed_dim: 5120
+   num_heads: 32
+   kv_dim: 4096
+
+ output_resampler:
+   _target_: src.models.qwen_visual.Resampler
+   grid_size: 16
+   embed_dim: 4096
+   num_heads: 32
+   kv_dim: 5120
+
+ lm_loss_scale: 1.0
+ rec_loss_scale: 5.0
+
+ # pretrained_model_path: /chat_sh/share_300719895/user/sijiezhao/Program/2023/DiscreteLearning/train_output_clm_sh/1208_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro/checkpoint-27000/pytorch_model.bin
+ # pretrained_model_path: /apdcephfs_cq4/share_2942043/Multimodal/sijiezhao/DiscreteLearning/train_output/sft_from_1208_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro_40k/ckpt-10000-merged/agent/pytorch_model.bin
+ # pretrained_model_path: /apdcephfs_cq4/share_2942043/Multimodal/sijiezhao/DiscreteLearning/train_output/sft_from_1208_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro_40k/ckpt-5000-merged/agent/pytorch_model.bin
+ pretrained_model_path: /apdcephfs_cq4/share_2942043/Multimodal/sijiezhao/DiscreteLearning/train_output/sft_from_1211_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro_grounding_27k/ckpt-4000-merged/agent/pytorch_model.bin
src/demo/configs/llama2chat13b_merged_100imgtokens.yaml ADDED
@@ -0,0 +1,12 @@
+
+ _target_: src.models_clm.modeling_llama_xformer.LlamaForCausalLM.from_pretrained
+ # _target_: transformers.LlamaForCausalLM.from_pretrained
+ # pretrained_model_name_or_path: /apdcephfs_cq4/share_2942043/Multimodal/sijiezhao/DiscreteLearning/train_output/sft_from_1208_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro_40k/ckpt-10000-merged/llm
+ # pretrained_model_name_or_path: /apdcephfs_cq4/share_2942043/Multimodal/sijiezhao/DiscreteLearning/train_output/sft_from_1208_llama2chat13b_lora_clm_qwen-vit-448_pretrain_rs5_64a100pro_40k/ckpt-5000-merged/llm
+ #pretrained_model_name_or_path: /chat_sh/share_300719895/user/jinguozhu/codes/work_dirs/pretraining_anyres_newexp_v2/checkpoint-10000-merged/llm
+ #pretrained_model_name_or_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/03_27_any_res_sft_from_merged_10k/checkpoint-8000-merged/llm
+ #pretrained_model_name_or_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_09_any_res_sft_editing_from_merged_10k_32a100_new_data/checkpoint-6000-merged/llm
+ #pretrained_model_name_or_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_16_any_res_sft_editing_from_merged_H800_23k_16_gpu_2_new/checkpoint-6000-merged/llm
+ #pretrained_model_name_or_path: /chat_sh/share_300719895/user/yuyingge/jinguo_code/DiscreteLearning_debug/train_output/04_16_any_res_sft_com_gen_from_merged_H800_23k/checkpoint-15000-merged/llm
+ pretrained_model_name_or_path: /group/40034/yuyingge/SEED_X_inference/pretrained/seed_x_i/llm
+ low_cpu_mem_usage: True
src/demo/conversation.py ADDED
@@ -0,0 +1,182 @@
1
+ import dataclasses
2
+ from enum import auto, Enum
3
+ from typing import List, Tuple
4
+
5
+ import io
6
+ import base64
7
+ import os
8
+ from PIL import Image
9
+ import copy
10
+
11
+ IMG_FLAG = '<image>'
12
+
13
+
14
+ class SeparatorStyle(Enum):
15
+ """Different separator style."""
16
+ SINGLE = auto()
17
+ TWO = auto()
18
+ MPT = auto()
19
+ PLAIN = auto()
20
+ LLAMA_2 = auto()
21
+
22
+
23
+ def decode_image(encoded_image: str) -> Image:
24
+ decoded_bytes = base64.b64decode(encoded_image.encode('utf-8'))
25
+ buffer = io.BytesIO(decoded_bytes)
26
+ image = Image.open(buffer)
27
+ return image
28
+
29
+
30
+ def encode_image(image: Image.Image, format: str = 'PNG') -> str:
31
+ with io.BytesIO() as buffer:
32
+ image.save(buffer, format=format)
33
+ encoded_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
34
+ return encoded_image
35
+
36
+
37
+ @dataclasses.dataclass
38
+ class Conversation:
39
+ """A class that keeps all conversation history."""
40
+ system: str
41
+ roles: List[str]
42
+ messages: List[dict] # multi-turn -> user & assistant -> {'images': [PIL.Image,], 'text': str}
43
+ offset: int
44
+ sep_style: SeparatorStyle = SeparatorStyle.SINGLE
45
+ sep: str = "###"
46
+ sep2: str = None
47
+ version: str = "Unknown"
48
+
49
+ skip_next: bool = False
50
+
51
+ def get_prompt(self):
52
+ messages = copy.deepcopy(self.messages)
53
+ if self.sep_style == SeparatorStyle.SINGLE:
54
+ if self.system is None or self.system == '':
55
+ text = ''
56
+ else:
57
+ text = self.system + self.sep
58
+ images = []
59
+ for message in messages:
60
+ text += message['role'] + ": " + message['message']['text'] + self.sep
61
+ for image_path in message['message']['images']:
62
+ image = Image.open(image_path).resize((256, 256))
63
+ image_base64 = encode_image(image)
64
+ images.append(image_base64)
65
+
66
+ text += self.roles[1] + ":"
67
+ elif self.sep_style == SeparatorStyle.LLAMA_2:
68
+ b_token = "[INST] "
69
+ e_token = " [/INST]"
70
+ if self.system is None or self.system == '':
71
+ text = ''
72
+ else:
73
+ text = f"<<SYS>>\n{self.system}\n<</SYS>>\n\n"
74
+ images = []
75
+ for idx, message in enumerate(messages):
76
+ # text += message['role'] + ": " + message['message']['text'] + self.sep
77
+ if idx % 2 == 0:
78
+ text += b_token + message['message']['text'] + e_token + self.sep
79
+ else:
80
+ text += message['message']['text'] + self.sep
81
+
82
+ for image_path in message['message']['images']:
83
+ image = Image.open(image_path)
84
+ image_base64 = encode_image(image)
85
+ images.append(image_base64)
86
+ else:
87
+ raise NotImplementedError
88
+
89
+ return {'text': text, 'images': images}
90
+
91
+ # def update_image_ids(self, images_ids):
92
+ # image_count = 0
93
+ # for message in self.messages:
94
+ # for idx in range(len(message['message']['images_ids'])):
95
+ # if message['message']["images_ids"][idx] is None:
96
+ # message['message']["images_ids"][idx] = images_ids[image_count]
97
+ # image_count += 1
98
+
99
+ # assert len(images_ids) == image_count, print(len(images_ids), image_count)
100
+
101
+ def append_message(self, role, message):
102
+ self.messages.append([role, message])
103
+
104
+ def to_gradio_chatbot(self):
105
+ dialog = []
106
+ for i, single_turn in enumerate(self.messages[self.offset:]):
107
+ single_turn = single_turn['message']
108
+ text_list = single_turn['text'].split(IMG_FLAG)
109
+ assert len(text_list) == len(single_turn['images']) + 1, print(text_list, len(single_turn['images']))
110
+ message = ''
111
+ for image_idx in range(len(single_turn['images'])):
112
+ # image = single_turn['images'][image_idx]
113
+ # image_base64 = encode_image(image)
114
+ # image_str = f'<img src="data:image/png;base64,{image_base64}" alt="user upload image" />'
115
+ image_path = single_turn['images'][image_idx]
116
+ if image_path == '':
117
+ message += text_list[image_idx] + '<corrupt_image>'
118
+ else:
119
+ message += text_list[image_idx] + f'![](file={image_path})'
120
+ message += text_list[-1]
121
+
122
+ if i % 2 == 0:
123
+ dialog.append([message, None])
124
+ else:
125
+ dialog[-1][-1] = message
126
+
127
+ return dialog
128
+
129
+ def copy(self):
130
+ return Conversation(system=self.system,
131
+ roles=self.roles,
132
+ messages=copy.deepcopy(self.messages),
133
+ offset=self.offset,
134
+ sep_style=self.sep_style,
135
+ sep=self.sep,
136
+ sep2=self.sep2,
137
+ version=self.version)
138
+
139
+ def dict(self):
140
+ messages = copy.deepcopy(self.messages)
141
+ for message in messages:
142
+ for i in range(len(message['message']['images'])):
143
+ message['message']['images'][i] = os.path.basename(message['message']['images'][i])
144
+ return {
145
+ "system": self.system,
146
+ "roles": self.roles,
147
+ "messages": messages,
148
+ "offset": self.offset,
149
+ "sep": self.sep,
150
+ "sep2": self.sep2,
151
+ }
152
+
153
+
154
+ conv_seed_vicuna = Conversation(
155
+ system="",
156
+ roles=("USER", "ASSISTANT"),
157
+ version="v2",
158
+ messages=[],
159
+ offset=0,
160
+ sep_style=SeparatorStyle.SINGLE,
161
+ sep='\n',
162
+ )
163
+
164
+ conv_seed_vicuna_system = Conversation(
165
+ system="A chat between a curious user and an artificial intelligence assistant. ",
166
+ roles=("USER", "ASSISTANT"),
167
+ version="v2",
168
+ messages=[],
169
+ offset=0,
170
+ sep_style=SeparatorStyle.SINGLE,
171
+ sep='\n',
172
+ )
173
+
174
+ conv_seed_llama2 = Conversation(
175
+ system="",
176
+ roles=("[INST]", "[/INST]"),
177
+ version="v2",
178
+ messages=[],
179
+ offset=0,
180
+ sep_style=SeparatorStyle.LLAMA_2,
181
+ sep='\n',
182
+ )
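
A brief usage sketch for the Conversation helper above: messages are built directly in the dict shape that get_prompt() and to_gradio_chatbot() read, and the image path points at one of the sample images added under seed_x/ in this commit:

from src.demo.conversation import conv_seed_llama2, IMG_FLAG

conv = conv_seed_llama2.copy()
conv.messages.append({
    'role': conv.roles[0],
    'message': {'text': f'{IMG_FLAG} What is shown in this image?',
                'images': ['seed_x/bank.png']},
})
prompt = conv.get_prompt()         # {'text': '[INST] ...', 'images': [base64-encoded PNGs]}
dialog = conv.to_gradio_chatbot()  # [[user_turn, assistant_turn_or_None], ...]
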
src/demo/seed_llama_flask.py ADDED
@@ -0,0 +1,379 @@
1
+ import hydra
2
+ import pyrootutils
3
+ import torch
4
+ import re
5
+ import time
6
+ from omegaconf import OmegaConf
7
+ from flask import Flask, request
8
+ from typing import Optional
9
+ import transformers
10
+ from dataclasses import dataclass, field
11
+ import io
12
+ import base64
13
+ from PIL import Image
14
+ import numpy as np
15
+ import cv2
16
+ from diffusers import AutoencoderKL, UNet2DConditionModel, EulerDiscreteScheduler
17
+
18
+
19
+ pyrootutils.setup_root(__file__, indicator=".project-root", pythonpath=True)
20
+
21
+ from src.data.any_res import process_anyres_image
22
+
23
+ BOI_TOKEN = '<img>'
24
+ BOP_TOKEN = '<patch>'
25
+ EOI_TOKEN = '</img>'
26
+ EOP_TOKEN = '</patch>'
27
+ IMG_TOKEN = '<img_{:05d}>'
28
+
29
+ IMG_FLAG = '<image>'
30
+ num_img_in_tokens = 64
31
+ num_img_out_tokens = 64
32
+
33
+ resolution_grids = ['1x1', '1x2', '1x3', '1x4', '1x5', '1x6', '1x10', '2x1', '3x1', '4x1', '5x1', '6x1', '10x1', '2x2', '2x3', '3x2', '2x4', '4x2']
34
+ base_resolution = 448
35
+
36
+ app = Flask(__name__)
37
+
38
+
39
+ def decode_image(encoded_image: str) -> Image:
40
+ decoded_bytes = base64.b64decode(encoded_image.encode('utf-8'))
41
+ buffer = io.BytesIO(decoded_bytes)
42
+ image = Image.open(buffer)
43
+ return image
44
+
45
+
46
+ def encode_image(image: Image.Image, format: str = 'PNG') -> str:
47
+ with io.BytesIO() as buffer:
48
+ image.save(buffer, format=format)
49
+ encoded_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
50
+ return encoded_image
51
+
52
+
53
+ @dataclass
54
+ class Arguments:
55
+ image_transform: Optional[str] = field(default=None, metadata={"help": "config path of image transform"})
56
+ tokenizer: Optional[str] = field(default=None, metadata={"help": "config path of tokenizer used to initialize tokenizer"})
57
+ llm: Optional[str] = field(default=None, metadata={"help": "config path of llm"})
58
+ visual_encoder: Optional[str] = field(default=None, metadata={"help": "config path of visual encoder"})
59
+ sd_adapter: Optional[str] = field(default=None, metadata={"help": "config path of sd adapter"})
60
+ agent: Optional[str] = field(default=None, metadata={"help": "config path of agent model"})
61
+ diffusion_path: Optional[str] = field(default=None, metadata={"help": "diffusion model path"})
62
+ has_bbox: Optional[bool] = field(default=False, metadata={"help": "visualize the box"})
63
+
64
+ port: Optional[str] = field(default=80, metadata={"help": "network port"})
65
+ llm_device: Optional[str] = field(default='cuda:0', metadata={"help": "llm device"})
66
+ vit_sd_device: Optional[str] = field(default='cuda:0', metadata={"help": "sd and vit device"})
67
+ dtype: Optional[str] = field(default='fp16', metadata={"help": "mixed precision"})
68
+
69
+ multi_resolution: Optional[bool] = field(default=False, metadata={"help": "multi resolution"})
70
+
71
+
72
+ parser = transformers.HfArgumentParser(Arguments)
73
+ args, = parser.parse_args_into_dataclasses()
74
+
75
+ def extract_box(output_str):
76
+ boxes = re.findall('(.*?)<box_end>', output_str)
77
+ if len(boxes) >0:
78
+ bboxes = [[int(num) for num in re.findall('<loc-(\d+)>', box)] for box in boxes]
79
+ else:
80
+ bboxes = None
81
+
82
+ return bboxes
83
+
84
+
85
+ def visualize_bbox(image, bboxes):
86
+ img_width, img_height = image.size
87
+ image = np.array(image)
88
+ image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
89
+ for bbox in bboxes:
90
+ x_center, y_center, box_width, box_height = bbox
91
+
92
+ x_center = x_center / 224 * img_width
93
+ y_center = y_center / 224 * img_height
94
+
95
+ box_width = box_width /224 * img_width
96
+ box_height = box_height / 224 * img_height
97
+
98
+ x1 = int(x_center - box_width / 2)
99
+ y1 = int(y_center - box_height / 2)
100
+ x2 = int(x_center + box_width / 2)
101
+ y2 = int(y_center + box_height / 2)
102
+
103
+ cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 4)
104
+
105
+ image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
106
+ image = Image.fromarray(image)
107
+
108
+
109
+ return image
110
+
111
+
112
+
113
+
114
+ class LLMService:
115
+
116
+ def __init__(self, args) -> None:
117
+
118
+ self.llm_device = args.llm_device
119
+ self.vit_sd_device = args.vit_sd_device
120
+
121
+ dtype = args.dtype
122
+ if dtype == 'fp16':
123
+ self.dtype = torch.float16
124
+ elif dtype == 'bf16':
125
+ self.dtype = torch.bfloat16
126
+ else:
127
+ raise ValueError
128
+
129
+ image_transform_cfg = OmegaConf.load(args.image_transform)
130
+ self.image_transform = hydra.utils.instantiate(image_transform_cfg)
131
+
132
+ tokenizer_cfg = OmegaConf.load(args.tokenizer)
133
+ self.tokenizer = hydra.utils.instantiate(tokenizer_cfg)
134
+
135
+ visual_encoder_cfg = OmegaConf.load(args.visual_encoder)
136
+ self.visual_encoder = hydra.utils.instantiate(visual_encoder_cfg)
137
+ self.visual_encoder.eval().to(self.vit_sd_device, dtype=self.dtype)
138
+ print('Init visual encoder done')
139
+
140
+ llm_cfg = OmegaConf.load(args.llm)
141
+ llm = hydra.utils.instantiate(llm_cfg, torch_dtype=self.dtype)
142
+ print('Init llm done.')
143
+
144
+ agent_cfg = OmegaConf.load(args.agent)
145
+ self.agent = hydra.utils.instantiate(agent_cfg, llm=llm)
146
+
147
+ self.agent.eval().to(self.llm_device, dtype=self.dtype)
148
+ print('Init agent model done.')
149
+
150
+ noise_scheduler = EulerDiscreteScheduler.from_pretrained(args.diffusion_path, subfolder="scheduler")
151
+
152
+ vae = AutoencoderKL.from_pretrained(args.diffusion_path, subfolder="vae").to(self.vit_sd_device, dtype=self.dtype)
153
+
154
+ unet = UNet2DConditionModel.from_pretrained(args.diffusion_path, subfolder="unet").to(dtype=self.dtype)
155
+
156
+ sd_adapter_cfg = OmegaConf.load(args.sd_adapter)
157
+
158
+ self.sd_adapter = hydra.utils.instantiate(sd_adapter_cfg, unet=unet).eval().to(dtype=self.dtype)
159
+
160
+ self.sd_adapter.init_pipe(vae=vae,
161
+ scheduler=noise_scheduler,
162
+ visual_encoder=self.visual_encoder.to("cpu"),
163
+ image_transform=self.image_transform,
164
+ discrete_model=None,
165
+ dtype=self.dtype,
166
+ device="cpu")
167
+
168
+ print('Init sd adapter pipe done.')
169
+
170
+ self.visual_encoder.to(self.vit_sd_device, dtype=self.dtype)
171
+
172
+ self.boi_token_id = self.tokenizer.encode(BOI_TOKEN, add_special_tokens=False)[0]
173
+ self.eoi_token_id = self.tokenizer.encode(EOI_TOKEN, add_special_tokens=False)[0]
174
+
175
+ self.bop_token_id = self.tokenizer.encode(BOP_TOKEN, add_special_tokens=False)[0]
176
+ self.eop_token_id = self.tokenizer.encode(EOP_TOKEN, add_special_tokens=False)[0]
177
+
178
+ self.multi_resolution = args.multi_resolution
179
+ if self.multi_resolution:
180
+ self.base_resolution = base_resolution
181
+ grid_pinpoints = []
182
+ for scale in resolution_grids:
183
+ s1, s2 = scale.split('x')
184
+ grid_pinpoints.append([int(s1)*base_resolution, int(s2)*base_resolution])
185
+ self.grid_pinpoints = grid_pinpoints
186
+
187
+
188
+ service = LLMService(args)
189
+
190
+
191
+ @app.route('/generate', methods=['GET', 'POST'])
192
+ def generate():
193
+ with torch.no_grad():
194
+ request_info = request.get_json()
195
+
196
+ text_list = request_info['text'].split(IMG_FLAG)
197
+ image_list = request_info['images']
198
+ max_new_tokens = request_info.get('max_new_tokens', 256)
199
+ top_p = 0.5
200
+ force_boi = request_info.get('force_boi', False)
201
+ force_bbox = request_info.get('force_bbox', False)
202
+
203
+ assert len(text_list) == len(image_list) + 1
204
+
205
+ image_tokens = BOI_TOKEN + ''.join([IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)]) + EOI_TOKEN
206
+
207
+ input_images = []
208
+ if len(image_list) > 0:
209
+ image_tensor_list = []
210
+ embeds_cmp_mask = []
211
+ embeds_gen_mask = []
212
+
213
+ if service.multi_resolution:
214
+ patch_pos = []
215
+ image_patch_length = []
216
+ image_size_list = []
217
+
218
+ for idx, image_item in enumerate(image_list):
219
+ if isinstance(image_item, str):
220
+ image = decode_image(image_item)
221
+ print('after decode image size:', image.size)
222
+ input_images.append(image)
223
+
224
+ if service.multi_resolution:
225
+ image_size_list.append(image.size)
226
+ print('image size:', image.size)
227
+ image_tensor, patch_pos_tensor = process_anyres_image(image, service.image_transform, service.grid_pinpoints, service.base_resolution)
228
+ image_tensor_list.append(image_tensor)
229
+ patch_pos.append(patch_pos_tensor)
230
+ image_patch_length.append(image_tensor.shape[0])
231
+ print('image_patch_length', image_patch_length)
232
+ embeds_cmp_mask.extend([True]*image_tensor.shape[0])
233
+ embeds_gen_mask.extend([False]*image_tensor.shape[0])
234
+
235
+ else:
236
+ image_tensor = service.image_transform(image)
237
+
238
+ image_tensor_list.append(image_tensor)
239
+ embeds_cmp_mask.append(True)
240
+ embeds_gen_mask.append(False)
241
+ else:
242
+ raise ValueError
243
+
244
+ if service.multi_resolution:
245
+ pixel_values = torch.cat(image_tensor_list).to(service.vit_sd_device, dtype=service.dtype)
246
+ patch_position = torch.cat(patch_pos, dim=0)
247
+
248
+ image_tokens_list = []
249
+ for patch_length in image_patch_length:
250
+ image_tokens = ''
251
+ for _ in range(patch_length-1):
252
+ image_tokens += BOP_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOP_TOKEN
253
+ image_tokens += BOI_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOI_TOKEN
254
+ image_tokens_list.append(image_tokens)
255
+ else:
256
+ pixel_values = torch.stack(image_tensor_list).to(service.vit_sd_device, dtype=service.dtype)
257
+
258
+ image_embeds = service.visual_encoder(pixel_values)
259
+ image_embeds = image_embeds.to(service.llm_device)
260
+
261
+ embeds_cmp_mask = torch.tensor(embeds_cmp_mask, dtype=torch.bool).to(service.llm_device)
262
+ embeds_gen_mask = torch.tensor(embeds_gen_mask, dtype=torch.bool).to(service.llm_device)
263
+
264
+ else:
265
+ image_embeds = None
266
+ patch_position = 0
267
+ embeds_cmp_mask = None
268
+ embeds_gen_mask = None
269
+
270
+ if service.multi_resolution:
271
+ input_text = ''
272
+ for i, c in enumerate(text_list[:-1]):
273
+ input_text += c + image_tokens_list[i]
274
+ input_text += text_list[-1]
275
+
276
+ else:
277
+ input_text = image_tokens.join(text_list)
278
+
279
+ if force_boi:
280
+ input_text = input_text + BOI_TOKEN
281
+
282
+ if force_bbox:
283
+ input_text = input_text + '[[ <box_start>'
284
+ print('input_text:', input_text)
285
+ input_ids = service.tokenizer.encode(input_text, add_special_tokens=False)
286
+ input_ids = [service.tokenizer.bos_token_id] + input_ids
287
+
288
+ input_ids = torch.tensor(input_ids).to(service.llm_device, dtype=torch.long)
289
+ ids_cmp_mask = torch.zeros_like(input_ids, dtype=torch.bool).to(service.llm_device)
290
+ ids_gen_mask = torch.zeros_like(input_ids, dtype=torch.bool).to(service.llm_device)
291
+
292
+ if service.multi_resolution:
293
+ boi_indices = torch.where(torch.logical_or(input_ids == service.boi_token_id, input_ids == service.bop_token_id))[0].tolist()
294
+ eoi_indices = torch.where(torch.logical_or(input_ids == service.eoi_token_id, input_ids == service.eop_token_id))[0].tolist()
295
+
296
+ else:
297
+
298
+ boi_indices = torch.where(input_ids == service.boi_token_id)[0].tolist()
299
+ eoi_indices = torch.where(input_ids == service.eoi_token_id)[0].tolist()
300
+
301
+ for boi_idx, eoi_idx in zip(boi_indices, eoi_indices):
302
+ ids_cmp_mask[boi_idx + 1:eoi_idx] = True
303
+
304
+ input_ids = input_ids.unsqueeze(0)
305
+ ids_cmp_mask = ids_cmp_mask.unsqueeze(0)
306
+ ids_gen_mask = ids_gen_mask.unsqueeze(0)
307
+
308
+ error_msg = []
309
+
310
+ if service.multi_resolution:
311
+ output = service.agent.generate(
312
+ tokenizer=service.tokenizer,
313
+ input_ids=input_ids,
314
+ image_embeds=image_embeds,
315
+ patch_positions=patch_position,
316
+ embeds_cmp_mask=embeds_cmp_mask,
317
+ ids_cmp_mask=ids_cmp_mask,
318
+ num_img_gen_tokens=num_img_out_tokens,
319
+ max_new_tokens=max_new_tokens,
320
+ dtype=service.dtype,
321
+ device=service.llm_device,
322
+ top_p=top_p,
323
+ )
324
+ else:
325
+ output = service.agent.generate(
326
+ tokenizer=service.tokenizer,
327
+ input_ids=input_ids,
328
+ image_embeds=image_embeds,
329
+ embeds_cmp_mask=embeds_cmp_mask,
330
+ ids_cmp_mask=ids_cmp_mask,
331
+ num_img_gen_tokens=num_img_out_tokens,
332
+ max_new_tokens=max_new_tokens,
333
+ dtype=service.dtype,
334
+ device=service.llm_device,
335
+ top_p=top_p,
336
+ )
337
+
338
+ gen_imgs_base64_list = []
339
+ generated_text = output['text']
340
+ generated_text = generated_text.replace(EOI_TOKEN, IMG_FLAG).replace(service.tokenizer.eos_token, '')
341
+
342
+ if output['has_img_output']:
343
+ print('offloading the LLM to CPU and moving the SD de-tokenizer to GPU')
344
+ a = time.time()
345
+ service.agent = service.agent.to("cpu")
346
+ service.sd_adapter = service.sd_adapter.to(service.vit_sd_device, dtype=service.dtype)
347
+ print("Loading finished: ", time.time() - a)
348
+
349
+ img_gen_feat = output['img_gen_feat'].to(service.vit_sd_device, dtype=service.dtype)
350
+
351
+ for img_idx in range(output['num_gen_imgs']):
352
+ img_feat = img_gen_feat[img_idx:img_idx + 1]
353
+ generated_image = service.sd_adapter.generate(image_embeds=img_feat, num_inference_steps=50)[0]
354
+ image_base64 = encode_image(generated_image)
355
+ gen_imgs_base64_list.append(image_base64)
356
+
357
+ print('moving the visual encoder and LLM back to GPU, and the SD de-tokenizer to CPU')
358
+ a = time.time()
359
+ service.sd_adapter = service.sd_adapter.to("cpu")
360
+ service.visual_encoder = service.visual_encoder.to(service.vit_sd_device, dtype=service.dtype)
361
+ service.agent = service.agent.to(service.vit_sd_device, dtype=service.dtype)
362
+ print("Loading finished: ", time.time() - a)
363
+
364
+ if args.has_bbox:
365
+ bboxes = extract_box(generated_text)
366
+
367
+ if bboxes is not None and len(input_images) > 0:
368
+ image_viz = visualize_bbox(input_images[0], bboxes)
369
+ image_base64 = encode_image(image_viz)
370
+ gen_imgs_base64_list.append(image_base64)
371
+ generated_text = re.sub(r'\[\[ <box_start>.*?<box_end>.*?\]\]', 'the green bounding box', generated_text)
372
+ generated_text += IMG_FLAG
373
+ print(input_text + generated_text)
374
+
375
+ return {'text': generated_text, 'images': gen_imgs_base64_list, 'error_msg': error_msg}
376
+
377
+
378
+ if __name__ == '__main__':
379
+ app.run(host='0.0.0.0', port=args.port)
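The service above exposes a single `/generate` endpoint that takes the conversation text (with one `<image>` placeholder per attached image), a list of base64-encoded images, and the optional `force_boi` / `force_bbox` flags, and returns generated text plus base64-encoded generated images. Below is a minimal client sketch, assuming the service is running locally on port 7890 (the default `request_address` used by the Gradio front-end) and borrowing the `[INST]` prompt template from the inference scripts; the actual demo builds its prompt via `conversation.py`.

```python
# Minimal sketch of a client for the /generate endpoint defined above.
# Assumes the Flask service is reachable at http://127.0.0.1:7890/generate.
import base64
import io

import requests
from PIL import Image


def encode_image(image: Image.Image) -> str:
    """Encode a PIL image as base64 PNG, matching the server's decode_image."""
    with io.BytesIO() as buffer:
        image.save(buffer, format='PNG')
        return base64.b64encode(buffer.getvalue()).decode('utf-8')


image = Image.open('seed_x/bank.png').convert('RGB')
payload = {
    # one '<image>' placeholder per entry in 'images'
    'text': '[INST] <image>Can I connect with an advisor on Sunday? [/INST]\n',
    'images': [encode_image(image)],
    'max_new_tokens': 512,
    'force_boi': False,   # set True to force image generation
    'force_bbox': False,  # set True to force a bounding-box answer
}

response = requests.post('http://127.0.0.1:7890/generate', json=payload).json()
print(response['text'])         # generated text; '<image>' marks generated images
print(len(response['images']))  # base64-encoded generated/visualized images
```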
src/demo/seed_llama_gradio.py ADDED
@@ -0,0 +1,465 @@
1
+ import os
2
+ import numpy as np
3
+ import datetime
4
+ import json
5
+ from typing import Optional
6
+ import transformers
7
+ from dataclasses import dataclass, field
8
+ import io
9
+ import base64
10
+ from PIL import Image
11
+ import gradio as gr
12
+ import time
13
+ import hashlib
14
+ import requests
15
+
16
+ from utils import build_logger
17
+ from conversation import conv_seed_llama2
18
+
19
+ IMG_FLAG = '<image>'
20
+ LOGDIR = 'log'
21
+
22
+ logger = build_logger("gradio_seed_x", LOGDIR)
23
+ headers = {"User-Agent": "SEED-X Client"}
24
+
25
+ no_change_btn = gr.Button.update()
26
+ enable_btn = gr.Button.update(interactive=True)
27
+ disable_btn = gr.Button.update(interactive=False)
28
+
29
+
30
+ @dataclass
31
+ class Arguments:
32
+ server_port: Optional[int] = field(default=7860, metadata={"help": "network port"})
33
+ server_name: Optional[str] = field(default='0.0.0.0', metadata={"help": "network address"})
34
+ request_address: Optional[str] = field(default='http://127.0.0.1:7890/generate',
35
+ metadata={"help": "request address"})
36
+
37
+
38
+ parser = transformers.HfArgumentParser(Arguments)
39
+ args, = parser.parse_args_into_dataclasses()
40
+ conv_seed_llama = conv_seed_llama2
41
+
42
+
43
+ def decode_image(encoded_image: str) -> Image:
44
+ decoded_bytes = base64.b64decode(encoded_image.encode('utf-8'))
45
+ buffer = io.BytesIO(decoded_bytes)
46
+ image = Image.open(buffer)
47
+ return image
48
+
49
+
50
+ def encode_image(image: Image.Image, format: str = 'PNG') -> str:
51
+ with io.BytesIO() as buffer:
52
+ image.save(buffer, format=format)
53
+ encoded_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
54
+ return encoded_image
55
+
56
+
57
+ def get_conv_log_filename():
58
+ t = datetime.datetime.now()
59
+ name = os.path.join(LOGDIR, f"{t.year}-{t.month:02d}-{t.day:02d}-conv.json")
60
+ return name
61
+
62
+
63
+ def get_conv_image_dir():
64
+ name = os.path.join(LOGDIR, 'images')
65
+ os.makedirs(name, exist_ok=True)
66
+ return name
67
+
68
+
69
+ def get_image_name(image, image_dir=None):
70
+ buffer = io.BytesIO()
71
+ image.save(buffer, format='PNG')
72
+ image_bytes = buffer.getvalue()
73
+ md5 = hashlib.md5(image_bytes).hexdigest()
74
+
75
+ if image_dir is not None:
76
+ image_name = os.path.join(image_dir, md5 + '.png')
77
+ else:
78
+ image_name = md5 + '.png'
79
+
80
+ return image_name
81
+
82
+
83
+ def resize_image_square(image, target_size=448):
84
+ resized_image = image.resize((target_size, target_size))
85
+ return resized_image
86
+
87
+
88
+ def resize_image(image, max_size=512):
89
+ width, height = image.size
90
+ aspect_ratio = float(width) / float(height)
91
+
92
+ if width > height:
93
+ new_width = max_size
94
+ new_height = int(new_width / aspect_ratio)
95
+ else:
96
+ new_height = max_size
97
+ new_width = int(new_height * aspect_ratio)
98
+
99
+ resized_image = image.resize((new_width, new_height))
100
+ return resized_image
101
+
102
+
103
+ def center_crop_image(image, max_aspect_ratio=1.5):
104
+ width, height = image.size
105
+ aspect_ratio = max(width, height) / min(width, height)
106
+
107
+ if aspect_ratio >= max_aspect_ratio:
108
+ if width > height:
109
+ new_width = int(height * max_aspect_ratio)
110
+ left = (width - new_width) // 2
111
+ right = (width + new_width) // 2
112
+ top = 0
113
+ bottom = height
114
+ else:
115
+ new_height = int(width * max_aspect_ratio)
116
+ left = 0
117
+ right = width
118
+ top = (height - new_height) // 2
119
+ bottom = (height + new_height) // 2
120
+
121
+ cropped_image = image.crop((left, top, right, bottom))
122
+ return cropped_image
123
+ else:
124
+ return image
125
+
126
+
127
+ def vote_last_response(state, vote_type, request: gr.Request):
128
+ with open(get_conv_log_filename(), "a") as fout:
129
+ data = {
130
+ "tstamp": round(time.time(), 4),
131
+ "type": vote_type,
132
+ "state": state.dict(),
133
+ "ip": request.client.host,
134
+ }
135
+ fout.write(json.dumps(data) + "\n")
136
+
137
+
138
+ def upvote_last_response(state, request: gr.Request):
139
+ logger.info(f"upvote. ip: {request.client.host}")
140
+ vote_last_response(state, "upvote", request)
141
+ return (disable_btn,) * 2
142
+
143
+
144
+ def downvote_last_response(state, request: gr.Request):
145
+ logger.info(f"downvote. ip: {request.client.host}")
146
+ vote_last_response(state, "downvote", request)
147
+ return (disable_btn,) * 2
148
+
149
+
150
+ def regenerate(dialog_state, request: gr.Request):
151
+ logger.info(f"regenerate. ip: {request.client.host}")
152
+ if dialog_state.messages[-1]['role'] == dialog_state.roles[1]:
153
+ dialog_state.messages.pop()
154
+ return (
155
+ dialog_state,
156
+ dialog_state.to_gradio_chatbot(),
157
+ ) + (disable_btn,) * 4
158
+
159
+
160
+ def clear_history(request: gr.Request):
161
+ logger.info(f"clear_history. ip: {request.client.host}")
162
+ dialog_state = conv_seed_llama.copy()
163
+ input_state = init_input_state()
164
+ return (dialog_state, input_state, dialog_state.to_gradio_chatbot()) + (disable_btn,) * 4
165
+
166
+
167
+ def init_input_state():
168
+ return {'images': [], 'text': ''}
169
+
170
+
171
+ def add_text(dialog_state, input_state, text, request: gr.Request):
172
+ logger.info(f"add_text. ip: {request.client.host}.")
173
+ # if len(input_state['text']) == 0:
174
+ if text is None or len(text) == 0:
175
+ # dialog_state.skip_next = True
176
+ return (dialog_state, input_state, "", dialog_state.to_gradio_chatbot()) + (no_change_btn,) * 4
177
+ input_state['text'] += text
178
+
179
+
180
+ if len(dialog_state.messages) > 0 and dialog_state.messages[-1]['role'] == dialog_state.roles[0]:
181
+ dialog_state.messages[-1]['message'] = input_state
182
+ else:
183
+ dialog_state.messages.append({'role': dialog_state.roles[0], 'message': input_state})
184
+ print('add_text: ', dialog_state.to_gradio_chatbot())
185
+
186
+ return (dialog_state, input_state, "", dialog_state.to_gradio_chatbot()) + (disable_btn,) * 4
187
+
188
+
189
+ def is_blank(image):
190
+ image_array = np.array(image)
191
+ unique_colors = np.unique(image_array)
192
+ print('unique_colors', len(unique_colors))
193
+ return len(unique_colors) == 1
194
+
195
+
196
+ def add_image(dialog_state, input_state, image, request: gr.Request):
197
+ logger.info(f"add_image. ip: {request.client.host}.")
198
+ if image is None:
199
+ return (dialog_state, input_state, None, dialog_state.to_gradio_chatbot()) + (no_change_btn,) * 4
200
+
201
+ image = image.convert('RGB')
202
+
203
+ print('image size:', image.size)
204
+
205
+ image = center_crop_image(image, max_aspect_ratio=10)
206
+
207
+ image_dir = get_conv_image_dir()
208
+ image_path = get_image_name(image=image, image_dir=image_dir)
209
+ if not os.path.exists(image_path):
210
+ image.save(image_path)
211
+ input_state['images'].append(image_path)
212
+ input_state['text'] += IMG_FLAG
213
+
214
+ if len(dialog_state.messages) > 0 and dialog_state.messages[-1]['role'] == dialog_state.roles[0]:
215
+ dialog_state.messages[-1]['message'] = input_state
216
+ else:
217
+ dialog_state.messages.append({'role': dialog_state.roles[0], 'message': input_state})
218
+
219
+ print('add_image:', dialog_state)
220
+
221
+ return (dialog_state, input_state, None, dialog_state.to_gradio_chatbot()) + (disable_btn,) * 4
222
+
223
+
224
+ def http_bot(dialog_state, input_state, max_new_tokens, max_turns, force_image_gen, force_bbox,
225
+ request: gr.Request):
226
+ logger.info(f"http_bot. ip: {request.client.host}")
227
+ print('input_state:', input_state)
228
+
229
+ if len(dialog_state.messages) == 0 or dialog_state.messages[-1]['role'] != dialog_state.roles[0] or len(
230
+ dialog_state.messages[-1]['message']['text'].strip(' ?.;!/')) == 0:
231
+ return (dialog_state, input_state, dialog_state.to_gradio_chatbot()) + (no_change_btn,) * 4
232
+
233
+ if len(dialog_state.messages) > max_turns * 2:
234
+ output_state = init_input_state()
235
+ output_state['text'] = 'Error: History exceeds maximum rounds, please clear history and restart.'
236
+ dialog_state.messages.append({'role': dialog_state.roles[1], 'message': output_state})
237
+ input_state = init_input_state()
238
+ return (dialog_state, input_state, dialog_state.to_gradio_chatbot()) + (disable_btn,) * 3 + (enable_btn,)
239
+
240
+ prompt = dialog_state.get_prompt()
241
+ payload = {
242
+ 'text': prompt['text'],
243
+ 'max_new_tokens': int(max_new_tokens),
244
+ 'images': prompt['images'],
245
+ 'force_boi': force_image_gen,
246
+ 'force_bbox': force_bbox,
247
+ }
248
+
249
+ print(
250
+ 'request: ', {
251
+ 'text': prompt['text'],
252
+ 'max_new_tokens': int(max_new_tokens),
253
+ })
254
+ print('request_address', args.request_address)
255
+ response = requests.request(method="POST", url=args.request_address, headers=headers, json=payload)
256
+ results = response.json()
257
+ print('response: ', {'text': results['text'], 'error_msg': results['error_msg']})
258
+
259
+ output_state = init_input_state()
260
+ image_dir = get_conv_image_dir()
261
+ output_state['text'] = results['text']
262
+
263
+ for image_base64 in results['images']:
264
+ if image_base64 == '':
265
+ image_path = ''
266
+ else:
267
+ image = decode_image(image_base64)
268
+ image = image.convert('RGB')
269
+ image_path = get_image_name(image=image, image_dir=image_dir)
270
+ if not os.path.exists(image_path):
271
+ image.save(image_path)
272
+ output_state['images'].append(image_path)
273
+
274
+ dialog_state.messages.append({'role': dialog_state.roles[1], 'message': output_state})
275
+
276
+ vote_last_response(dialog_state, 'common', request)
277
+ input_state = init_input_state()
278
+ chatbot = update_error_msg(dialog_state.to_gradio_chatbot(), results['error_msg'])
279
+ return (dialog_state, input_state, chatbot) + (enable_btn,) * 4
280
+
281
+
282
+ def update_error_msg(chatbot, error_msg):
283
+ if len(error_msg) > 0:
284
+ info = '\n-------------\nSome errors occurred during response, please clear history and restart.\n' + '\n'.join(
285
+ error_msg)
286
+ chatbot[-1][-1] = chatbot[-1][-1] + info
287
+
288
+ return chatbot
289
+
290
+
291
+ def load_demo(request: gr.Request):
292
+ logger.info(f"load_demo. ip: {request.client.host}")
293
+ dialog_state = conv_seed_llama.copy()
294
+ input_state = init_input_state()
295
+ return dialog_state, input_state
296
+
297
+
298
+ title = ("""
299
+ # SEED-X-I
300
+ [[Paper]](https://arxiv.org/abs/2404.14396) [[Code]](https://github.com/AILab-CVC/SEED-X)
301
+
302
+ Demo of a general instruction-tuned model SEED-X-I (17B) from the foundation model SEED-X.
303
+
304
+ SEED-X-I can follow multimodal instructions (including images with **dynamic resolutions**) and respond with **images, text and bounding boxes** in multi-turn conversations.
305
+
306
+ SEED-X-I **does not support image manipulation**. If you want to experience **SEED-X-Edit** for high-precision image editing, please refer to [[Inference Code]](https://github.com/AILab-CVC/SEED-X).
307
+
308
+ Due to insufficient GPU memory, when generating images, we need to offload the LLM to the CPU and move the de-tokenizer to the GPU, which will **result in a long processing time**. If you want to experience the normal model inference speed, you can run [[Inference Code]](https://github.com/AILab-CVC/SEED-X) locally.
309
+
310
+
311
+ ## Tips:
312
+ * Check out the conversation examples (at the bottom) for inspiration.
313
+
314
+ * You can adjust "Max History Rounds" to try a conversation with up to five rounds. For more turns, you can download our checkpoints from GitHub and deploy them locally for inference.
315
+
316
+ * Our demo supports a mix of images and text as input. You can freely upload an image or enter text, then click "Add Image" or "Add Text". Repeat this step as many times as needed, and finally click "Submit" to run model inference.
317
+
318
+ * You can click "Force Image Generation" to compel the model to produce images when necessary. For example, our model might struggle to generate images when there is an excessive amount of text-only context.
319
+
320
+ * You can click "Force Bounding Box" to compel the model to produce bounding box for object detection.
321
+
322
+ * SEED-X was trained on English-only data. It may handle other languages thanks to the inherent capabilities of LLaMA, but the results might not be stable.
323
+
324
+ """)
325
+
326
+ css = """
327
+ img {
328
+ font-family: 'Helvetica';
329
+ font-weight: 300;
330
+ line-height: 2;
331
+ text-align: center;
332
+
333
+ width: auto;
334
+ height: auto;
335
+ display: block;
336
+ position: relative;
337
+ }
338
+
339
+ img:before {
340
+ content: " ";
341
+ display: block;
342
+
343
+ position: absolute;
344
+ top: -10px;
345
+ left: 0;
346
+ height: calc(100% + 10px);
347
+ width: 100%;
348
+ background-color: rgb(230, 230, 230);
349
+ border: 2px dotted rgb(200, 200, 200);
350
+ border-radius: 5px;
351
+ }
352
+
353
+ img:after {
354
+ content: " ";
355
+ display: block;
356
+ font-size: 16px;
357
+ font-style: normal;
358
+ font-family: FontAwesome;
359
+ color: rgb(100, 100, 100);
360
+
361
+ position: absolute;
362
+ top: 5px;
363
+ left: 0;
364
+ width: 100%;
365
+ text-align: center;
366
+ }
367
+
368
+ """
369
+
370
+ if __name__ == '__main__':
371
+
372
+ examples_mix = [
373
+ ['seed_x/bank.png', 'Can I connect with an advisor on Sunday?'],
374
+ ['seed_x/ground.png',
375
+ 'Is there anything in the image that can protect me from catching the flu virus when I go out? Show me the location.'],
376
+ ['seed_x/arrow.jpg', 'What is the object pointed by the red arrow?'],
377
+ ['seed_x/shanghai.png', 'Where was this image taken? Explain your answer.'],
378
+ ['seed_x/GPT4.png', 'How long does it take to make GPT-4 safer?'],
379
+ ['seed_x/twitter.png',
380
+ 'Please provide a comprehensive description of this image.'],
381
+ ]
382
+
383
+ examples_text = [
384
+ ['I want to build a two story cabin in the woods, with many commanding windows. Can you show me a picture?'],
385
+ ['Use your imagination to design a concept image for Artificial General Intelligence (AGI). Show me an image.'],
386
+ [
387
+ 'Can you design an illustration for “The Three-Body Problem” to depict a scene from the novel? Show me a picture.'],
388
+ [
389
+ 'My four year old son loves toy trains. Can you design a fancy birthday cake for him? Please generate a picture.'],
390
+ [
391
+ 'Generate an image of a portrait of a young nordic girl, age 25, freckled skin, neck tattoo, blue eyes, 35mm lens, photography, ultra details.'],
392
+ ['Generate an impressionist painting of an astronaut in a jungle.']
393
+ ]
394
+ with gr.Blocks(css=css) as demo:
395
+ gr.Markdown(title)
396
+ dialog_state = gr.State()
397
+ input_state = gr.State()
398
+ with gr.Row():
399
+ with gr.Column(scale=3):
400
+ with gr.Row():
401
+ image = gr.Image(type='pil', label='input_image')
402
+ with gr.Row():
403
+ text = gr.Textbox(lines=5,
404
+ show_label=False,
405
+ label='input_text',
406
+ elem_id='textbox',
407
+ placeholder="Enter text or add image, and press submit,").style(container=False)
408
+ with gr.Row():
409
+ add_image_btn = gr.Button("Add Image")
410
+ add_text_btn = gr.Button("Add Text")
411
+
412
+ submit_btn = gr.Button("Submit")
413
+
414
+ with gr.Row():
415
+ max_new_tokens = gr.Slider(minimum=64,
416
+ maximum=1024,
417
+ value=768,
418
+ step=64,
419
+ interactive=True,
420
+ label="Max Output Tokens")
421
+ max_turns = gr.Slider(minimum=1, maximum=9, value=3, step=1, interactive=True,
422
+ label="Max History Rounds")
423
+ force_img_gen = gr.Radio(choices=[True, False], value=False, label='Force Image Generation')
424
+ force_bbox = gr.Radio(choices=[True, False], value=False, label='Force Bounding Box')
425
+
426
+ with gr.Column(scale=7):
427
+ chatbot = gr.Chatbot(elem_id='chatbot', label="SEED-X-I").style(height=700)
428
+ with gr.Row():
429
+ upvote_btn = gr.Button(value="👍 Upvote", interactive=False)
430
+ downvote_btn = gr.Button(value="👎 Downvote", interactive=False)
431
+ regenerate_btn = gr.Button(value="🔄 Regenerate", interactive=False)
432
+ clear_btn = gr.Button(value="🗑️ Clear history", interactive=False)
433
+
434
+ with gr.Row():
435
+ with gr.Column(scale=0.7):
436
+ gr.Examples(examples=examples_mix, label='Input examples', inputs=[image, text])
437
+ with gr.Column(scale=0.3):
438
+ gr.Examples(examples=examples_text, label='Input examples', inputs=[text])
439
+
440
+ # Register listeners
441
+ btn_list = [upvote_btn, downvote_btn, regenerate_btn, clear_btn]
442
+ upvote_btn.click(upvote_last_response, [dialog_state], [upvote_btn, downvote_btn])
443
+ downvote_btn.click(downvote_last_response, [dialog_state], [upvote_btn, downvote_btn])
444
+
445
+ regenerate_btn.click(regenerate, [dialog_state], [dialog_state, chatbot] + btn_list).then(
446
+ http_bot, [dialog_state, input_state, max_new_tokens, max_turns, force_img_gen, force_bbox],
447
+ [dialog_state, input_state, chatbot] + btn_list)
448
+ add_image_btn.click(add_image, [dialog_state, input_state, image],
449
+ [dialog_state, input_state, image, chatbot] + btn_list)
450
+
451
+ add_text_btn.click(add_text, [dialog_state, input_state, text],
452
+ [dialog_state, input_state, text, chatbot] + btn_list)
453
+
454
+ submit_btn.click(
455
+ add_image, [dialog_state, input_state, image], [dialog_state, input_state, image, chatbot] + btn_list).then(
456
+ add_text, [dialog_state, input_state, text],
457
+ [dialog_state, input_state, text, chatbot, upvote_btn, downvote_btn, regenerate_btn, clear_btn]).then(
458
+ http_bot,
459
+ [dialog_state, input_state, max_new_tokens, max_turns, force_img_gen, force_bbox],
460
+ [dialog_state, input_state, chatbot] + btn_list)
461
+ clear_btn.click(clear_history, None, [dialog_state, input_state, chatbot] + btn_list)
462
+
463
+ demo.load(load_demo, None, [dialog_state, input_state])
464
+
465
+ demo.launch(server_name=args.server_name, server_port=args.server_port, enable_queue=True)
src/demo/utils.py ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import datetime
2
+ import logging
3
+ import logging.handlers
4
+ import os
5
+ import sys
6
+
7
+ handler = None
8
+
9
+
10
+ def build_logger(logger_name, logger_dir):
11
+ global handler
12
+
13
+ formatter = logging.Formatter(
14
+ fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
15
+ datefmt="%Y-%m-%d %H:%M:%S",
16
+ )
17
+
18
+ # Set the format of root handlers
19
+ if not logging.getLogger().handlers:
20
+ logging.basicConfig(level=logging.INFO)
21
+ logging.getLogger().handlers[0].setFormatter(formatter)
22
+
23
+ # Redirect stdout and stderr to loggers
24
+ stdout_logger = logging.getLogger("stdout")
25
+ stdout_logger.setLevel(logging.INFO)
26
+ sl = StreamToLogger(stdout_logger, logging.INFO)
27
+ sys.stdout = sl
28
+
29
+ stderr_logger = logging.getLogger("stderr")
30
+ stderr_logger.setLevel(logging.ERROR)
31
+ sl = StreamToLogger(stderr_logger, logging.ERROR)
32
+ sys.stderr = sl
33
+
34
+ # Get logger
35
+ logger = logging.getLogger(logger_name)
36
+ logger.setLevel(logging.INFO)
37
+
38
+ # Add a file handler for all loggers
39
+ if handler is None:
40
+ os.makedirs(logger_dir, exist_ok=True)
41
+ filename = os.path.join(logger_dir, logger_name + '.log')
42
+ handler = logging.handlers.TimedRotatingFileHandler(filename, when='D', utc=True)
43
+ handler.setFormatter(formatter)
44
+
45
+ for name, item in logging.root.manager.loggerDict.items():
46
+ if isinstance(item, logging.Logger):
47
+ item.addHandler(handler)
48
+
49
+ return logger
50
+
51
+
52
+ class StreamToLogger(object):
53
+ """
54
+ Fake file-like stream object that redirects writes to a logger instance.
55
+ """
56
+
57
+ def __init__(self, logger, log_level=logging.INFO):
58
+ self.terminal = sys.stdout
59
+ self.logger = logger
60
+ self.log_level = log_level
61
+ self.linebuf = ''
62
+
63
+ def __getattr__(self, attr):
64
+ return getattr(self.terminal, attr)
65
+
66
+ def write(self, buf):
67
+ temp_linebuf = self.linebuf + buf
68
+ self.linebuf = ''
69
+ for line in temp_linebuf.splitlines(True):
70
+ # From the io.TextIOWrapper docs:
71
+ # On output, if newline is None, any '\n' characters written
72
+ # are translated to the system default line separator.
73
+ # By default sys.stdout.write() expects '\n' newlines and then
74
+ # translates them so this is still cross platform.
75
+ if line[-1] == '\n':
76
+ self.logger.log(self.log_level, line.rstrip())
77
+ else:
78
+ self.linebuf += line
79
+
80
+ def flush(self):
81
+ if self.linebuf != '':
82
+ self.logger.log(self.log_level, self.linebuf.rstrip())
83
+ self.linebuf = ''
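A minimal usage sketch for the helpers above: once `build_logger` has been called, ordinary `print` output is also captured in the daily-rotated log file, because `sys.stdout` and `sys.stderr` are replaced with `StreamToLogger` instances (this is how the Gradio demo writes its request logs).

```python
# Minimal sketch of how build_logger is used by the demo scripts.
from utils import build_logger

logger = build_logger("gradio_seed_x", "log")   # creates log/gradio_seed_x.log
logger.info("service started")                  # goes to the console and the file
print("plain prints are captured as well")      # redirected through StreamToLogger
```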
src/inference/.DS_Store ADDED
Binary file (6.15 kB)
src/inference/__pycache__/any_res.cpython-311.pyc ADDED
Binary file (12.2 kB)
src/inference/__pycache__/any_res.cpython-38.pyc ADDED
Binary file (7.47 kB)
src/inference/any_res.py ADDED
@@ -0,0 +1,257 @@
1
+ import base64
2
+ import torch
3
+ import math
4
+ import ast
5
+ from PIL import Image
6
+ from io import BytesIO
7
+
8
+
9
+ def select_best_resolution(original_size, possible_resolutions):
10
+ """
11
+ Selects the best resolution from a list of possible resolutions based on the original size.
12
+
13
+ Args:
14
+ original_size (tuple): The original size of the image in the format (width, height).
15
+ possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
16
+
17
+ Returns:
18
+ tuple: The best fit resolution in the format (width, height).
19
+ """
20
+ original_width, original_height = original_size
21
+ best_fit = None
22
+ max_effective_resolution = 0
23
+ min_wasted_resolution = float('inf')
24
+
25
+ for width, height in possible_resolutions:
26
+ scale = min(width / original_width, height / original_height)
27
+ downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
28
+ effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
29
+ wasted_resolution = (width * height) - effective_resolution
30
+
31
+ if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
32
+ max_effective_resolution = effective_resolution
33
+ min_wasted_resolution = wasted_resolution
34
+ best_fit = (width, height)
35
+
36
+ return best_fit
37
+
38
+
39
+ def select_best_resolution_v2(original_size, possible_resolutions):
40
+ """
41
+ Selects the best resolution from a list of possible resolutions based on the original size and aspect ratio.
42
+
43
+ Args:
44
+ original_size (tuple): The original size of the image in the format (width, height).
45
+ possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
46
+
47
+ Returns:
48
+ tuple: The best fit resolution in the format (width, height).
49
+ """
50
+ original_width, original_height = original_size
51
+ original_aspect_ratio = original_height / original_width
52
+ original_area = original_width * original_height
53
+ best_fit = None
54
+ min_aspect_ratio_diff = float('inf')
55
+ min_area_ratio = float('inf')
56
+
57
+ for width, height in possible_resolutions:
58
+ aspect_ratio = height / width
59
+ area = width * height
60
+ aspect_ratio_diff = max(aspect_ratio, original_aspect_ratio) / min(aspect_ratio, original_aspect_ratio)
61
+ area_ratio = max(area, original_area) / min(area, original_area)
62
+
63
+ if aspect_ratio_diff < min_aspect_ratio_diff or (aspect_ratio_diff == min_aspect_ratio_diff and area_ratio < min_area_ratio):
64
+ min_aspect_ratio_diff = aspect_ratio_diff
65
+ min_area_ratio = area_ratio
66
+ best_fit = (width, height)
67
+
68
+ return best_fit
69
+
70
+
71
+ def resize_and_pad_image(image, target_resolution, keep_ratio=False):
72
+ """
73
+ Resize and pad an image to a target resolution
74
+
75
+ Args:
76
+ image (PIL.Image.Image): The input image.
77
+ target_resolution (tuple): The target resolution (width, height) of the image.
78
+
79
+ Returns:
80
+ PIL.Image.Image: The resized and padded image.
81
+ """
82
+ original_width, original_height = image.size
83
+ target_width, target_height = target_resolution
84
+
85
+ if keep_ratio:
86
+ # maintaining aspect ratio
87
+ scale_w = target_width / original_width
88
+ scale_h = target_height / original_height
89
+
90
+ if scale_w < scale_h:
91
+ new_width = target_width
92
+ new_height = min(math.ceil(original_height * scale_w), target_height)
93
+ else:
94
+ new_height = target_height
95
+ new_width = min(math.ceil(original_width * scale_h), target_width)
96
+
97
+ # Resize the image
98
+ resized_image = image.resize((new_width, new_height))
99
+
100
+ new_image = Image.new('RGB', (target_width, target_height), (0, 0, 0))
101
+ paste_x = (target_width - new_width) // 2
102
+ paste_y = (target_height - new_height) // 2
103
+ new_image.paste(resized_image, (paste_x, paste_y))
104
+ else:
105
+ # not maintaining aspect ratio
106
+ new_image = image.resize((target_width, target_height))
107
+
108
+ return new_image
109
+
110
+
111
+ def divide_to_patches(image, patch_size):
112
+ """
113
+ Divides an image into patches of a specified size.
114
+
115
+ Args:
116
+ image (PIL.Image.Image): The input image.
117
+ patch_size (int): The size of each patch.
118
+
119
+ Returns:
120
+ list: A list of PIL.Image.Image objects representing the patches.
121
+ """
122
+ patches = []
123
+ width, height = image.size
124
+ for i in range(0, height, patch_size):
125
+ for j in range(0, width, patch_size):
126
+ box = (j, i, j + patch_size, i + patch_size)
127
+ patch = image.crop(box)
128
+ patches.append(patch)
129
+
130
+ return patches
131
+
132
+
133
+ def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
134
+ """
135
+ Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
136
+
137
+ Args:
138
+ image_size (tuple): The size of the input image in the format (width, height).
139
+ grid_pinpoints (str): A string representation of a list of possible resolutions.
140
+ patch_size (int): The size of each image patch.
141
+
142
+ Returns:
143
+ tuple: The shape of the image patch grid in the format (width, height).
144
+ """
145
+ if type(grid_pinpoints) is list:
146
+ possible_resolutions = grid_pinpoints
147
+ else:
148
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
149
+ width1, height1 = select_best_resolution(image_size, possible_resolutions)
150
+ width2, height2 = select_best_resolution_v2(image_size, possible_resolutions)
151
+ if width1*height1 > width2*height2:
152
+ width, height = width2, height2
153
+ else:
154
+ width, height = width1, height1
155
+ return width // patch_size, height // patch_size
156
+
157
+
158
+ def process_anyres_image(image, image_transform, grid_pinpoints, base_image_size):
159
+ """
160
+ Process an image with variable resolutions.
161
+
162
+ Args:
163
+ image (PIL.Image.Image): The input image to be processed.
164
+ image_transform: The image processor object.
165
+ grid_pinpoints (str): A string representation of a list of possible resolutions.
166
+
167
+ Returns:
168
+ torch.Tensor: A tensor containing the processed image patches.
169
+ """
170
+ if type(grid_pinpoints) is list:
171
+ possible_resolutions = grid_pinpoints
172
+ else:
173
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
174
+ # best_resolution = select_best_resolution(image.size, possible_resolutions)
175
+ width1, height1 = select_best_resolution(image.size, possible_resolutions)
176
+ width2, height2 = select_best_resolution_v2(image.size, possible_resolutions)
177
+ if width1*height1 > width2*height2:
178
+ width, height = width2, height2
179
+ else:
180
+ width, height = width1, height1
181
+ best_resolution = [width, height]
182
+
183
+ image_padded = resize_and_pad_image(image, best_resolution)
184
+
185
+ patches = divide_to_patches(image_padded, base_image_size)
186
+
187
+ image_original_resize = image.resize((base_image_size, base_image_size))
188
+
189
+ image_patches = patches + [image_original_resize] # add the original image as the last patch
190
+ image_patches = [image_transform(image_patch)
191
+ for image_patch in image_patches]
192
+
193
+ patch_grid = (best_resolution[0]//base_image_size, best_resolution[1]//base_image_size)
194
+ x_index = (torch.arange(patch_grid[0]).repeat(patch_grid[1], 1) + 0.5)/patch_grid[0]
195
+ y_index = (torch.arange(patch_grid[1]).unsqueeze(1).repeat(1, patch_grid[0]) + 0.5)/patch_grid[1]
196
+ patch_pos = torch.stack([x_index, y_index], dim=-1).flatten(0, 1) # h*w, 2
197
+
198
+ origin_pos = torch.tensor([[0.5, 0.5]])
199
+ patch_pos = torch.cat([patch_pos, origin_pos], dim=0) # h*w+1, 2
200
+
201
+ return torch.stack(image_patches, dim=0), patch_pos
202
+
203
+
204
+ def load_image_from_base64(image):
205
+ return Image.open(BytesIO(base64.b64decode(image)))
206
+
207
+
208
+ def anyres_data_collate(batch, tokenizer, dataset_name=None):
209
+ results = {}
210
+ keys = batch[0].keys()
211
+
212
+ for key in keys:
213
+ cur = [batch[i][key] for i in range(len(batch)) if batch[i][key] is not None]
214
+ if len(cur) == 0:
215
+ results[key] = None
216
+ elif isinstance(cur[0], torch.Tensor):
217
+ if key in ['embeds_gen_mask', 'embeds_cmp_mask', 'images', 'images_patch_length', 'patch_position', 'image_size']:
218
+ results[key] = torch.cat(cur, dim=0)
219
+ else:
220
+ if key in ['input_ids']:
221
+ results[key] = torch.nn.utils.rnn.pad_sequence(cur, batch_first=True, padding_value=tokenizer.pad_token_id)
222
+ elif key in ['attention_mask']:
223
+ results[key] = torch.nn.utils.rnn.pad_sequence(cur, batch_first=True, padding_value=0)
224
+ elif key in ['labels']:
225
+ results[key] = torch.nn.utils.rnn.pad_sequence(cur, batch_first=True, padding_value=-100)
226
+ elif key in ['ids_gen_mask', 'ids_cmp_mask']:
227
+ results[key] = torch.nn.utils.rnn.pad_sequence(cur, batch_first=True, padding_value=False)
228
+
229
+ else:
230
+ results[key] = torch.stack(cur, dim=0)
231
+ else:
232
+ results[key] = cur
233
+
234
+ results['dataset_name'] = dataset_name
235
+
236
+ return results
237
+
238
+
239
+ def anyres_data_collate_old(batch, dataset_name=None):
240
+ results = {}
241
+ keys = batch[0].keys()
242
+
243
+ for key in keys:
244
+ cur = [batch[i][key] for i in range(len(batch)) if batch[i][key] is not None]
245
+ if len(cur) == 0:
246
+ results[key] = None
247
+ elif isinstance(cur[0], torch.Tensor):
248
+ if key in ['embeds_gen_mask', 'embeds_cmp_mask', 'images', 'images_patch_length', 'patch_position', 'image_size']:
249
+ results[key] = torch.cat(cur, dim=0)
250
+ else:
251
+ results[key] = torch.stack(cur, dim=0)
252
+ else:
253
+ results[key] = cur
254
+
255
+ results['dataset_name'] = dataset_name
256
+
257
+ return results
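A short sketch of how `process_anyres_image` is driven by the inference scripts that follow: each resolution grid such as '1x2' is expanded into pixel pinpoints (448x896, etc.), and the function returns one transformed tensor per 448x448 patch plus a downscaled copy of the full image, together with the normalized (x, y) center of each patch. The torchvision transform below is only a stand-in for the Qwen-ViT transform loaded from `configs/processer/qwen_448_transform.yaml`.

```python
# Sketch of process_anyres_image usage, with a plain torchvision transform
# standing in for the Qwen-ViT transform from the yaml config.
from PIL import Image
from torchvision import transforms

from any_res import process_anyres_image

base_resolution = 448
resolution_grids = ['1x1', '1x2', '2x1', '2x2']
grid_pinpoints = [[int(s1) * base_resolution, int(s2) * base_resolution]
                  for s1, s2 in (scale.split('x') for scale in resolution_grids)]

image_transform = transforms.Compose([
    transforms.Resize((base_resolution, base_resolution)),
    transforms.ToTensor(),
])

image = Image.open('demo_images/car.jpg').convert('RGB')
image_tensor, patch_pos = process_anyres_image(
    image, image_transform, grid_pinpoints, base_resolution)

# (num_patches + 1, 3, 448, 448): local patches plus the resized full image
print(image_tensor.shape)
# (num_patches + 1, 2): normalized patch centers; the last row is (0.5, 0.5)
print(patch_pos.shape)
```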
src/inference/eval_img2edit_seed_x.py ADDED
@@ -0,0 +1,155 @@
1
+ import hydra
2
+ import torch
3
+ import os
4
+ import re
5
+ import pyrootutils
6
+ from PIL import Image
7
+ from omegaconf import OmegaConf
8
+ from diffusers import AutoencoderKL, UNet2DConditionModel, EulerDiscreteScheduler, Transformer2DModel
9
+ from any_res import process_anyres_image
10
+
11
+ pyrootutils.setup_root(__file__, indicator='.project-root', pythonpath=True)
12
+
13
+ BOI_TOKEN = '<img>'
14
+ BOP_TOKEN = '<patch>'
15
+ EOI_TOKEN = '</img>'
16
+ EOP_TOKEN = '</patch>'
17
+ IMG_TOKEN = '<img_{:05d}>'
18
+
19
+ resolution_grids = ['1x1']
20
+ base_resolution = 448
21
+
22
+ device = 'cuda:0'
23
+ device1 = 'cuda:1'
24
+ dtype = torch.float16
25
+ dtype_str = 'fp16'
26
+ num_img_in_tokens = 64
27
+ num_img_out_tokens = 64
28
+ instruction_prompt = '[INST] {instruction} [/INST]\n'
29
+
30
+ save_dir = 'vis'
31
+ os.makedirs(save_dir, exist_ok=True)
32
+
33
+ tokenizer_cfg_path = 'configs/tokenizer/clm_llama_tokenizer_224loc_anyres.yaml'
34
+ image_transform_cfg_path = 'configs/processer/qwen_448_transform.yaml'
35
+ visual_encoder_cfg_path = 'configs/visual_encoder/qwen_vitg_448.yaml'
36
+ llm_cfg_path = 'configs/clm_models/llm_seed_x_edit.yaml'
37
+ agent_cfg_path = 'configs/clm_models/agent_seed_x_edit.yaml'
38
+ adapter_cfg_path = 'configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_full_with_latent_image_pretrain_no_normalize.yaml'
39
+ discrete_model_cfg_path = 'configs/discrete_model/discrete_identity.yaml'
40
+
41
+ diffusion_model_path = 'pretrained/stable-diffusion-xl-base-1.0'
42
+
43
+ tokenizer_cfg = OmegaConf.load(tokenizer_cfg_path)
44
+ tokenizer = hydra.utils.instantiate(tokenizer_cfg)
45
+
46
+ image_transform_cfg = OmegaConf.load(image_transform_cfg_path)
47
+ image_transform = hydra.utils.instantiate(image_transform_cfg)
48
+
49
+ visual_encoder_cfg = OmegaConf.load(visual_encoder_cfg_path)
50
+ visual_encoder = hydra.utils.instantiate(visual_encoder_cfg)
51
+ visual_encoder.eval().to(device1, dtype=dtype)
52
+ print('Init visual encoder done')
53
+
54
+ llm_cfg = OmegaConf.load(llm_cfg_path)
55
+ llm = hydra.utils.instantiate(llm_cfg, torch_dtype=dtype)
56
+ print('Init llm done.')
57
+
58
+ agent_model_cfg = OmegaConf.load(agent_cfg_path)
59
+ agent_model = hydra.utils.instantiate(agent_model_cfg, llm=llm)
60
+
61
+ agent_model.eval().to(device, dtype=dtype)
62
+ print('Init agent model done.')
63
+
64
+ noise_scheduler = EulerDiscreteScheduler.from_pretrained(diffusion_model_path, subfolder="scheduler")
65
+ print('init vae')
66
+ vae = AutoencoderKL.from_pretrained(diffusion_model_path, subfolder="vae").to(device1, dtype=dtype)
67
+ print('init unet')
68
+ unet = UNet2DConditionModel.from_pretrained(diffusion_model_path, subfolder="unet").to(device1, dtype=dtype)
69
+
70
+ adapter_cfg = OmegaConf.load(adapter_cfg_path)
71
+ adapter = hydra.utils.instantiate(adapter_cfg, unet=unet).to(device1, dtype=dtype).eval()
72
+
73
+ discrete_model_cfg = OmegaConf.load(discrete_model_cfg_path)
74
+ discrete_model = hydra.utils.instantiate(discrete_model_cfg).to(device1).eval()
75
+ print('Init adapter done')
76
+
77
+ adapter.init_pipe(vae=vae,
78
+ scheduler=noise_scheduler,
79
+ visual_encoder=visual_encoder,
80
+ image_transform=image_transform,
81
+ dtype=dtype,
82
+ device=device1)
83
+
84
+ print('Init adapter pipe done')
85
+ boi_token_id = tokenizer.encode(BOI_TOKEN, add_special_tokens=False)[0]
86
+ eoi_token_id = tokenizer.encode(EOI_TOKEN, add_special_tokens=False)[0]
87
+
88
+ bop_token_id = tokenizer.encode(BOP_TOKEN, add_special_tokens=False)[0]
89
+ eop_token_id = tokenizer.encode(EOP_TOKEN, add_special_tokens=False)[0]
90
+
91
+ grid_pinpoints = []
92
+ for scale in resolution_grids:
93
+ s1, s2 = scale.split('x')
94
+ grid_pinpoints.append([int(s1)*base_resolution, int(s2)*base_resolution])
95
+ grid_pinpoints = grid_pinpoints
96
+
97
+
98
+ image_path = 'demo_images/car.jpg'
99
+ instruction = 'Make it under the sunset'
100
+
101
+ image = Image.open(image_path).convert('RGB')
102
+ source_image = image.resize((1024, 1024))
103
+
104
+ image_tensor, patch_pos_tensor = process_anyres_image(image, image_transform, grid_pinpoints, base_resolution)
105
+ embeds_cmp_mask = torch.tensor([True]*image_tensor.shape[0]).to(device, dtype=torch.bool)
106
+
107
+ patch_pos = [patch_pos_tensor]
108
+ patch_position = torch.cat(patch_pos, dim=0)
109
+
110
+ image_tensor = image_tensor.to(device1, dtype=dtype)
111
+
112
+ patch_length = image_tensor.shape[0]
113
+ image_tokens = ''
114
+ for _ in range(patch_length-1):
115
+ image_tokens += BOP_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOP_TOKEN
116
+ image_tokens += BOI_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOI_TOKEN
117
+
118
+ prompt = instruction_prompt.format_map({'instruction': image_tokens + instruction})
119
+
120
+ input_ids = tokenizer.encode(prompt, add_special_tokens=False)
121
+ input_ids = [tokenizer.bos_token_id] + input_ids
122
+
123
+ input_ids = torch.tensor(input_ids).to(device, dtype=torch.long)
124
+
125
+ ids_cmp_mask = torch.zeros_like(input_ids, dtype=torch.bool)
126
+
127
+ boi_indices = torch.where(torch.logical_or(input_ids == boi_token_id, input_ids == bop_token_id))[0].tolist()
128
+ eoi_indices = torch.where(torch.logical_or(input_ids == eoi_token_id, input_ids == eop_token_id))[0].tolist()
129
+
130
+ for boi_idx, eoi_idx in zip(boi_indices, eoi_indices):
131
+ ids_cmp_mask[boi_idx + 1:eoi_idx] = True
132
+
133
+ input_ids = input_ids.unsqueeze(0)
134
+ ids_cmp_mask = ids_cmp_mask.unsqueeze(0)
135
+
136
+ with torch.no_grad():
137
+ image_embeds = visual_encoder(image_tensor)
138
+ image_embeds = image_embeds.to(device)
139
+ output = agent_model.generate(tokenizer=tokenizer,
140
+ input_ids=input_ids,
141
+ image_embeds=image_embeds,
142
+ embeds_cmp_mask=embeds_cmp_mask,
143
+ patch_positions=patch_position,
144
+ ids_cmp_mask=ids_cmp_mask,
145
+ max_new_tokens=512,
146
+ num_img_gen_tokens=num_img_out_tokens)
147
+ text = re.sub('<[^>]*>', '', output['text'])
148
+ print(text)
149
+
150
+ if output['has_img_output']:
151
+ images = adapter.generate(image_embeds=output['img_gen_feat'].to(device1), latent_image=source_image, num_inference_steps=50)
152
+
153
+ save_path = os.path.join(save_dir, str(len(os.listdir(save_dir))) + '_' + instruction + '.jpg')
154
+ images[0].save(save_path)
155
+ torch.cuda.empty_cache()
src/inference/eval_img2text_seed_x.py ADDED
@@ -0,0 +1,235 @@
1
+ import hydra
2
+ import torch
3
+ import os
4
+ import pyrootutils
5
+ from PIL import Image
6
+ import re
7
+ import cv2
8
+ import numpy as np
9
+ from omegaconf import OmegaConf
10
+ from diffusers import AutoencoderKL, UNet2DConditionModel, EulerDiscreteScheduler
11
+ from any_res import process_anyres_image
12
+
13
+
14
+ pyrootutils.setup_root(__file__, indicator='.project-root', pythonpath=True)
15
+
16
+ def visualize_bbox(image, bboxes, save_path):
17
+ img_width, img_height = image.size
18
+ image = np.array(image)
19
+ image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
20
+ for bbox in bboxes:
21
+ x_center, y_center, box_width, box_height = bbox
22
+
23
+ x_center = x_center / 224 * img_width
24
+ y_center = y_center / 224 * img_height
25
+
26
+ box_width = box_width /224 * img_width
27
+ box_height = box_height / 224 * img_height
28
+
29
+ x1 = int(x_center - box_width / 2)
30
+ y1 = int(y_center - box_height / 2)
31
+ x2 = int(x_center + box_width / 2)
32
+ y2 = int(y_center + box_height / 2)
33
+
34
+ cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
35
+
36
+ cv2.imwrite(save_path, image)
37
+
38
+
39
+ def extract_box(output_str):
40
+ boxes = re.findall('<box_start>(.*?)<box_end>', output_str)
41
+ if len(boxes) >0:
42
+ bboxes = [[int(num) for num in re.findall('<loc-(\d+)>', box)] for box in boxes]
43
+ else:
44
+ bboxes = None
45
+
46
+ return bboxes
47
+
48
+
49
+ BOI_TOKEN = '<img>'
50
+ BOP_TOKEN = '<patch>'
51
+ EOI_TOKEN = '</img>'
52
+ EOP_TOKEN = '</patch>'
53
+ IMG_TOKEN = '<img_{:05d}>'
54
+
55
+ instruction_prompt = '[INST] {instruction} [/INST]\n'
56
+
57
+ resolution_grids = ['1x1', '1x2', '1x3', '2x1', '3x1', '1x4', '4x1', '2x2']
58
+ base_resolution = 448
59
+
60
+ device = 'cuda:0'
61
+ device1 = 'cuda:1'
62
+ dtype = torch.float16
63
+ dtype_str = 'fp16'
64
+ num_img_in_tokens = 64
65
+ num_img_out_tokens = 64
66
+
67
+ tokenizer_cfg_path = 'configs/tokenizer/clm_llama_tokenizer_224loc_anyres.yaml'
68
+ image_transform_cfg_path = 'configs/processer/qwen_448_transform.yaml'
69
+ visual_encoder_cfg_path = 'configs/visual_encoder/qwen_vitg_448.yaml'
70
+ llm_cfg_path = 'configs/clm_models/llm_seed_x_i.yaml'
71
+ agent_cfg_path = 'configs/clm_models/agent_seed_x_i.yaml'
72
+ adapter_cfg_path = 'configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_pretrain_no_normalize.yaml'
73
+ discrete_model_cfg_path = 'configs/discrete_model/discrete_identity.yaml'
74
+
75
+ diffusion_model_path = 'pretrained/stable-diffusion-xl-base-1.0'
76
+
77
+ tokenizer_cfg = OmegaConf.load(tokenizer_cfg_path)
78
+ tokenizer = hydra.utils.instantiate(tokenizer_cfg)
79
+
80
+ image_transform_cfg = OmegaConf.load(image_transform_cfg_path)
81
+ image_transform = hydra.utils.instantiate(image_transform_cfg)
82
+
83
+ visual_encoder_cfg = OmegaConf.load(visual_encoder_cfg_path)
84
+ visual_encoder = hydra.utils.instantiate(visual_encoder_cfg)
85
+ visual_encoder.eval().to(device1, dtype=dtype)
86
+ print('Init visual encoder done')
87
+
88
+ llm_cfg = OmegaConf.load(llm_cfg_path)
89
+ llm = hydra.utils.instantiate(llm_cfg, torch_dtype=dtype)
90
+ print('Init llm done.')
91
+
92
+ agent_model_cfg = OmegaConf.load(agent_cfg_path)
93
+ agent_model = hydra.utils.instantiate(agent_model_cfg, llm=llm)
94
+
95
+ agent_model.eval().to(device, dtype=dtype)
96
+ print('Init agent model done.')
97
+
98
+ noise_scheduler = EulerDiscreteScheduler.from_pretrained(diffusion_model_path, subfolder="scheduler")
99
+ print('init vae')
100
+ vae = AutoencoderKL.from_pretrained(diffusion_model_path, subfolder="vae").to(device1, dtype=dtype)
101
+ print('init unet')
102
+ unet = UNet2DConditionModel.from_pretrained(diffusion_model_path, subfolder="unet").to(device1, dtype=dtype)
103
+
104
+ adapter_cfg = OmegaConf.load(adapter_cfg_path)
105
+ adapter = hydra.utils.instantiate(adapter_cfg, unet=unet).to(device1, dtype=dtype).eval()
106
+
107
+ discrete_model_cfg = OmegaConf.load(discrete_model_cfg_path)
108
+ discrete_model = hydra.utils.instantiate(discrete_model_cfg).to(device1).eval()
109
+ print('Init adapter done')
110
+
111
+ adapter.init_pipe(vae=vae,
112
+ scheduler=noise_scheduler,
113
+ visual_encoder=visual_encoder,
114
+ image_transform=image_transform,
115
+ discrete_model=discrete_model,
116
+ dtype=dtype,
117
+ device=device1)
118
+
119
+ print('Init adapter pipe done')
120
+ boi_token_id = tokenizer.encode(BOI_TOKEN, add_special_tokens=False)[0]
121
+ eoi_token_id = tokenizer.encode(EOI_TOKEN, add_special_tokens=False)[0]
122
+
123
+ bop_token_id = tokenizer.encode(BOP_TOKEN, add_special_tokens=False)[0]
124
+ eop_token_id = tokenizer.encode(EOP_TOKEN, add_special_tokens=False)[0]
125
+
126
+ grid_pinpoints = []
127
+ for scale in resolution_grids:
128
+ s1, s2 = scale.split('x')
129
+ grid_pinpoints.append([int(s1)*base_resolution, int(s2)*base_resolution])
130
+ grid_pinpoints = grid_pinpoints
131
+
132
+ # image comprehension
133
+ image_path = 'demo_images/advisor.png'
134
+ image = Image.open(image_path).convert('RGB')
135
+ image_tensor, patch_pos_tensor = process_anyres_image(image, image_transform, grid_pinpoints, base_resolution)
136
+ embeds_cmp_mask = torch.tensor([True]*image_tensor.shape[0]).to(device, dtype=torch.bool)
137
+
138
+ patch_pos = [patch_pos_tensor]
139
+ patch_position = torch.cat(patch_pos, dim=0)
140
+
141
+ image_tensor = image_tensor.to(device1, dtype=dtype)
142
+
143
+ patch_length = image_tensor.shape[0]
144
+ image_tokens = ''
145
+ for _ in range(patch_length-1):
146
+ image_tokens += BOP_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOP_TOKEN
147
+ image_tokens += BOI_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOI_TOKEN
148
+
149
+ question = 'Can I conntect with an advisor on Sunday?'
150
+ prompt = instruction_prompt.format_map({'instruction': image_tokens + question})
151
+
152
+ input_ids = tokenizer.encode(prompt, add_special_tokens=False)
153
+ input_ids = [tokenizer.bos_token_id] + input_ids
154
+
155
+ input_ids = torch.tensor(input_ids).to(device, dtype=torch.long)
156
+
157
+ ids_cmp_mask = torch.zeros_like(input_ids, dtype=torch.bool)
158
+
159
+ boi_indices = torch.where(torch.logical_or(input_ids == boi_token_id, input_ids == bop_token_id))[0].tolist()
160
+ eoi_indices = torch.where(torch.logical_or(input_ids == eoi_token_id, input_ids == eop_token_id))[0].tolist()
161
+
162
+ for boi_idx, eoi_idx in zip(boi_indices, eoi_indices):
163
+ ids_cmp_mask[boi_idx + 1:eoi_idx] = True
164
+
165
+ input_ids = input_ids.unsqueeze(0)
166
+ ids_cmp_mask = ids_cmp_mask.unsqueeze(0)
167
+
168
+ with torch.no_grad():
169
+ image_embeds = visual_encoder(image_tensor)
170
+ image_embeds = image_embeds.to(device)
171
+ output = agent_model.generate(tokenizer=tokenizer,
172
+ input_ids=input_ids,
173
+ image_embeds=image_embeds,
174
+ embeds_cmp_mask=embeds_cmp_mask,
175
+ patch_positions=patch_position,
176
+ ids_cmp_mask=ids_cmp_mask,
177
+ max_new_tokens=512,
178
+ num_img_gen_tokens=num_img_out_tokens)
179
+
180
+ text = re.sub('<[^>]*>', '', output['text'])
181
+ print(text)
182
+
183
+ # detection
184
+ image_path = 'demo_images/ground.png'
185
+ image = Image.open(image_path).convert('RGB')
186
+ image_tensor, patch_pos_tensor = process_anyres_image(image, image_transform, grid_pinpoints, base_resolution)
187
+ embeds_cmp_mask = torch.tensor([True]*image_tensor.shape[0]).to(device, dtype=torch.bool)
188
+
189
+ patch_pos = [patch_pos_tensor]
190
+ patch_position = torch.cat(patch_pos, dim=0)
191
+
192
+ image_tensor = image_tensor.to(device1, dtype=dtype)
193
+
194
+ patch_length = image_tensor.shape[0]
195
+ image_tokens = ''
196
+ for _ in range(patch_length-1):
197
+ image_tokens += BOP_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOP_TOKEN
198
+ image_tokens += BOI_TOKEN + ''.join(IMG_TOKEN.format(int(item)) for item in range(num_img_in_tokens)) + EOI_TOKEN
199
+
200
+ question = 'Is there anything in the image that can protect me from catching the flu virus when I go out? Show me the location.'
201
+ prompt = instruction_prompt.format_map({'instruction': image_tokens + question})
202
+
203
+ input_ids = tokenizer.encode(prompt, add_special_tokens=False)
204
+ input_ids = [tokenizer.bos_token_id] + input_ids
205
+
206
+ input_ids = torch.tensor(input_ids).to(device, dtype=torch.long)
207
+
208
+ ids_cmp_mask = torch.zeros_like(input_ids, dtype=torch.bool)
209
+
210
+ boi_indices = torch.where(torch.logical_or(input_ids == boi_token_id, input_ids == bop_token_id))[0].tolist()
211
+ eoi_indices = torch.where(torch.logical_or(input_ids == eoi_token_id, input_ids == eop_token_id))[0].tolist()
212
+
213
+ for boi_idx, eoi_idx in zip(boi_indices, eoi_indices):
214
+ ids_cmp_mask[boi_idx + 1:eoi_idx] = True
215
+
216
+ input_ids = input_ids.unsqueeze(0)
217
+ ids_cmp_mask = ids_cmp_mask.unsqueeze(0)
218
+
219
+ with torch.no_grad():
220
+ image_embeds = visual_encoder(image_tensor)
221
+ image_embeds = image_embeds.to(device)
222
+ output = agent_model.generate(tokenizer=tokenizer,
223
+ input_ids=input_ids,
224
+ image_embeds=image_embeds,
225
+ embeds_cmp_mask=embeds_cmp_mask,
226
+ patch_positions=patch_position,
227
+ ids_cmp_mask=ids_cmp_mask,
228
+ max_new_tokens=512,
229
+ num_img_gen_tokens=num_img_out_tokens)
230
+ print(output['text'])
231
+ bbox = extract_box(output['text'])
232
+ if bbox is not None:
233
+ save_path = 'vis/ground.png'
234
+ visualize_bbox(image, bbox, save_path)
235
+
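For reference, the location tokens parsed by `extract_box` above encode (x_center, y_center, width, height) on a 224x224 grid, and `visualize_bbox` rescales them back to the original image size. A small worked sketch with an illustrative, made-up model output (the parsing below mirrors `extract_box` from the script above):

```python
# Illustrative example of the bounding-box token format handled by
# extract_box / visualize_bbox above. The <loc-*> values are made up.
import re


def extract_box(output_str):
    # same parsing as in eval_img2text_seed_x.py above
    boxes = re.findall('<box_start>(.*?)<box_end>', output_str)
    return [[int(num) for num in re.findall(r'<loc-(\d+)>', box)] for box in boxes] or None


output_str = 'Yes, the mask. [[ <box_start><loc-090><loc-120><loc-040><loc-060><box_end> ]]'
print(extract_box(output_str))
# [[90, 120, 40, 60]] -> x_center, y_center, width, height on a 224 grid

# visualize_bbox rescales these to the original image size; for a 896x672 image:
#   x_center = 90 / 224 * 896 = 360,  y_center = 120 / 224 * 672 = 360
#   width    = 40 / 224 * 896 = 160,  height   = 60 / 224 * 672 = 180
```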
src/inference/eval_text2img_seed_x.py ADDED
@@ -0,0 +1,94 @@
+ import hydra
+ import torch
+ import os
+ import pyrootutils
+ from PIL import Image
+ from omegaconf import OmegaConf
+ from diffusers import AutoencoderKL, UNet2DConditionModel, EulerDiscreteScheduler
+
+
+ pyrootutils.setup_root(__file__, indicator='.project-root', pythonpath=True)
+
+ BOI_TOKEN = '<img>'
+ EOI_TOKEN = '</img>'
+ IMG_TOKEN = '<img_{:05d}>'
+
+ device = 'cuda:0'
+ device_2 = 'cuda:1'
+ dtype = torch.float16
+ dtype_str = 'fp16'
+ num_img_in_tokens = 64
+ num_img_out_tokens = 64
+
+ instruction_prompt = '[INST] Generate an image: {caption} [/INST]\n'
+
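+ # Config paths for the tokenizer, image transform, visual encoder, LLM, agent, SDXL adapter and discrete model.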
+ tokenizer_cfg_path = 'configs/tokenizer/clm_llama_tokenizer_224loc_anyres.yaml'
+ image_transform_cfg_path = 'configs/processer/qwen_448_transform.yaml'
+ visual_encoder_cfg_path = 'configs/visual_encoder/qwen_vitg_448.yaml'
+ llm_cfg_path = 'configs/clm_models/llm_seed_x_i.yaml'
+ agent_cfg_path = 'configs/clm_models/agent_seed_x_i.yaml'
+ adapter_cfg_path = 'configs/sdxl_adapter/sdxl_qwen_vit_resampler_l4_q64_pretrain_no_normalize.yaml'
+ discrete_model_cfg_path = 'configs/discrete_model/discrete_identity.yaml'
+
+ diffusion_model_path = 'pretrained/stable-diffusion-xl-base-1.0'
+
+ save_dir = 'vis'
+ os.makedirs(save_dir, exist_ok=True)
+
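+ # Instantiate the tokenizer, image transform, visual encoder, LLM and SEED-X agent from their configs.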
+ tokenizer_cfg = OmegaConf.load(tokenizer_cfg_path)
+ tokenizer = hydra.utils.instantiate(tokenizer_cfg)
+
+ image_transform_cfg = OmegaConf.load(image_transform_cfg_path)
+ image_transform = hydra.utils.instantiate(image_transform_cfg)
+
+ visual_encoder_cfg = OmegaConf.load(visual_encoder_cfg_path)
+ visual_encoder = hydra.utils.instantiate(visual_encoder_cfg)
+ visual_encoder.eval().to(device_2, dtype=dtype)
+ print('Init visual encoder done')
+
+ llm_cfg = OmegaConf.load(llm_cfg_path)
+ llm = hydra.utils.instantiate(llm_cfg, torch_dtype=dtype)
+ print('Init llm done')
+
+ agent_model_cfg = OmegaConf.load(agent_cfg_path)
+ agent_model = hydra.utils.instantiate(agent_model_cfg, llm=llm)
+
+ agent_model.eval().to(device, dtype=dtype)
+ print('Init agent model done')
+
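+ # Build the SDXL de-tokenizer: noise scheduler, VAE, UNet and the resampler adapter that conditions SDXL on the generated visual features.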
+ noise_scheduler = EulerDiscreteScheduler.from_pretrained(diffusion_model_path, subfolder="scheduler")
+ print('Init vae')
+ vae = AutoencoderKL.from_pretrained(diffusion_model_path, subfolder="vae").to(device_2, dtype=dtype)
+ print('Init unet')
+ unet = UNet2DConditionModel.from_pretrained(diffusion_model_path, subfolder="unet").to(device_2, dtype=dtype)
+
+ adapter_cfg = OmegaConf.load(adapter_cfg_path)
+ adapter = hydra.utils.instantiate(adapter_cfg, unet=unet).to(device_2, dtype=dtype).eval()
+
+ discrete_model_cfg = OmegaConf.load(discrete_model_cfg_path)
+ discrete_model = hydra.utils.instantiate(discrete_model_cfg).to(device_2).eval()
+ print('Init adapter done')
+
+ adapter.init_pipe(vae=vae,
+                   scheduler=noise_scheduler,
+                   visual_encoder=visual_encoder,
+                   image_transform=image_transform,
+                   discrete_model=discrete_model,
+                   dtype=dtype,
+                   device=device_2)
+
+ print('Init adapter pipe done')
+
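+ # Prompt the agent with a caption; if it emits image-generation tokens, decode them into an image with the adapter.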
+ caption = 'A cybernetic soldier, enhanced with advanced weapons systems and tactical analysis software, on a mission behind enemy lines.'
+ prompt = instruction_prompt.format_map({'caption': caption})
+ prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
+ input_ids = torch.tensor([tokenizer.bos_token_id] + prompt_ids).to(device, dtype=torch.long).unsqueeze(0)
+ output = agent_model.generate(tokenizer=tokenizer, input_ids=input_ids, num_img_gen_tokens=num_img_out_tokens)
+ print(output['has_img_output'])
+ print(output['text'])
+
+ if output['has_img_output']:
+     images = adapter.generate(image_embeds=output['img_gen_feat'].to(device_2), num_inference_steps=50)
+     save_path = os.path.join(save_dir, caption.replace('.', '') + '.png')
+     images[0].save(save_path)
+ torch.cuda.empty_cache()
src/models/detokenizer/__init__.py ADDED
@@ -0,0 +1 @@
+
src/models/detokenizer/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (182 Bytes). View file
 
src/models/detokenizer/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (175 Bytes). View file
 
src/models/detokenizer/__pycache__/adapter_modules.cpython-311.pyc ADDED
Binary file (14 kB). View file
 
src/models/detokenizer/__pycache__/adapter_modules.cpython-38.pyc ADDED
Binary file (7.31 kB). View file
 
src/models/detokenizer/__pycache__/attention_processor.cpython-38.pyc ADDED
Binary file (7.4 kB). View file
 
src/models/detokenizer/__pycache__/ipa_utils.cpython-38.pyc ADDED
Binary file (397 Bytes). View file
 
src/models/detokenizer/__pycache__/pipeline_stable_diffusion_t2i_edit.cpython-38.pyc ADDED
Binary file (28.3 kB). View file
 
src/models/detokenizer/__pycache__/pipeline_stable_diffusion_xl_t2i_edit.cpython-311.pyc ADDED
Binary file (53 kB). View file
 
src/models/detokenizer/__pycache__/pipeline_stable_diffusion_xl_t2i_edit.cpython-38.pyc ADDED
Binary file (36.8 kB). View file
 
src/models/detokenizer/__pycache__/resampler.cpython-311.pyc ADDED
Binary file (16.2 kB). View file