---
datasets:
- benchang1110/TaiVision-pretrain-1M-v2.0
language:
- zh
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- benchang1110/TaiVisionLM-base-v1
---
|
|
|
|
|
# Model Card for TaiVisionLM-base-v2
|
|
|
![TaivisionLM](TaivisionLM.png) |
|
## Model Details |
|
|
|
## English |
|
# TaiVisionLM: The First of Its Kind! 🚀 |
|
|
|
🌟 This is a small visual language model (only 1.2B parameters) on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
|
|
|
✨ Built to be fully compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and run for lightning-fast inference without needing any external libraries! ⚡️
|
|
|
Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖 |
|
|
|
|
|
## 繁體中文 |
|
# 台視: 台灣視覺語言模型!! 🚀 |
|
|
|
🌟 TaiVisionLM 是一個小型的視覺語言模型(僅有 12 億參數),可以根據圖像輸入來回覆繁體中文指令!🌟 |
|
|
|
✨ TaiVisionLM 可以用 transformers 載入、微調和使用!⚡️ |
|
|
|
準備好體驗「臺視」了嗎?讓我們開始吧!🖼️🤖
|
|
|
|
|
|
|
--- |
|
|
|
### Model Description |
|
|
|
## English |
|
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/google/siglip-base-patch16-224) as its vision encoder with [TinyLlama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. A vision projector connects the two modalities.
|
Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma). |
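
For intuition, here is a minimal sketch of how such a composition can be wired together from off-the-shelf Transformers classes. It is illustrative only and does not reproduce this repository's actual modeling code (`modeling_taivisionlm.py`): the class, attribute, and method names are hypothetical, the checkpoint names are simply the ones linked on this card, and the real implementation additionally handles attention masks, image-token placement, and the chat template.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM


class ToyVisionLM(nn.Module):
    """Hypothetical PaliGemma-style composition: SigLIP encoder + projector + TinyLlama."""

    def __init__(
        self,
        vision_name="google/siglip-base-patch16-224",
        lm_name="benchang1110/Taiwan-tinyllama-v1.0-chat",
    ):
        super().__init__()
        self.vision_tower = AutoModel.from_pretrained(vision_name).vision_model
        self.language_model = AutoModelForCausalLM.from_pretrained(lm_name)
        # The projector maps SigLIP patch embeddings into the language model's embedding space.
        self.projector = nn.Linear(
            self.vision_tower.config.hidden_size,
            self.language_model.config.hidden_size,
        )

    def forward(self, pixel_values, input_ids):
        # Encode the image into patch embeddings and project them into the LM embedding space.
        image_embeds = self.projector(self.vision_tower(pixel_values).last_hidden_state)
        # Prepend the projected image tokens to the text token embeddings and run the LM.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```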
|
|
|
Here's the summary of the development process: |
|
|
|
1) **Unimodal pretraining** |
|
- In this stage, instead of pretraining both modalities from scratch, we leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224) and a language model we trained ourselves ([Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat)).
|
2) **Feature Alignment** |
|
- We trained the vision projector and fine-tuned the language model with LoRA on 1M image-text pairs to align visual and textual features.

This model is a fine-tuned version of [benchang1110/TaiVisionLM-base-v1](https://huggingface.co/benchang1110/TaiVisionLM-base-v1), further trained on 1M image-text pairs. The fine-tuned model generates longer and more detailed image descriptions.
|
|
|
3) **Task Specific Training** |
|
- The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering. |
|
We will carry out this stage once the dataset is ready!
|
|
|
|
|
- **Developed by:** [benchang1110](https://huggingface.co/benchang1110) |
|
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text) |
|
- **Language(s) (NLP):** *Traditional Chinese* |
|
|
|
## 繁體中文 |
|
這個模型是一個多模態的語言模型,結合了 [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) 作為其視覺編碼器,並使用 [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) 作為語言模型。視覺投影器將這兩種模態結合在一起。 |
|
其架構與 [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma) 非常相似。 |
|
|
|
以下是開發過程的摘要: |
|
|
|
1) **單模態預訓練** |
|
- 在這個階段,我利用了 [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) 的圖像編碼器,以及我們自己訓練的語言模型([Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat))。 |
|
2) **特徵對齊** |
|
- 我們使用了100萬個圖片和文本的配對來訓練圖像投影器 (visual projector),並使用 LoRA 來微調語言模型的權重。 |
|
這個模型是 [benchang1110/TaiVisionLM-base-v1](https://huggingface.co/benchang1110/TaiVisionLM-base-v1) 的微調版本。我們使用了100萬個圖片和文本的配對來微調模型。微調後的模型將生成更長、更詳細的圖片描述。 |
|
3) **任務特定訓練** |
|
- 對齊後的模型將進行進一步的訓練,針對短描述、詳細描述和簡單視覺問答等任務。我們將在數據集準備好後進行這一階段的訓練! |
|
|
|
|
|
- **創作者:** [benchang1110](https://huggingface.co/benchang1110) |
|
- **模型類型:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text) |
|
- **語言:** 繁體中文 |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
## English |
|
|
|
Using Transformers, you can load the model and run inference as follows:
|
|
|
**IMPORTANT NOTE:** The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set `trust_remote_code=True` when loading it. This will download the `configuration_taivisionlm.py`, `modeling_taivisionlm.py`, and `processing_taivisionlm.py` files from the repo. You can inspect these files under the *Files and versions* tab and pin specific revisions if you have any concerns about malicious code.
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True is required because the model is not natively integrated into Transformers.
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
model.eval()

# Download an example image and build the prompt.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"

# Preprocess, generate, and decode the response.
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to("cuda")
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
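
If you prefer the decoded text without special tokens (for example, the trailing `</s>` visible in one of the sample outputs below), the tokenizer's standard `skip_special_tokens` flag can be used when decoding. This continues the snippet above:

```python
# Same generation as above, but strip special tokens such as </s> from the decoded text.
generated_ids = model.generate(**inputs, max_length=512)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```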
|
|
|
## 繁體中文
|
利用 transformers,可以用下面程式碼進行推論: |
|
|
|
**重要通知:** 台視 (TaiVisionLM) 還沒被整合進 transformers，因此在下載模型時要使用 `trust_remote_code=True`。下載模型將會使用 `configuration_taivisionlm.py`、`modeling_taivisionlm.py` 和 `processing_taivisionlm.py` 這三個檔案，若擔心有惡意程式碼，請先點選右方 *Files and versions* 查看程式碼內容。
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor

config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
model.eval()

url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"

inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to("cuda")
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
|
|
|
|
|
### Comparison with the prior model ([benchang1110/TaiVisionLM-base-v1](https://huggingface.co/benchang1110/TaiVisionLM-base-v1))
|
|
|
- **Example 1** |
|
![smile](https://www.slidecow.com/wp-content/uploads/2018/04/Setting-Up-The-Slide-Text-1000x563.jpg) |
|
* TaiVisionLM-base-v1: |
|
卡通插圖描繪掛在家門口的標誌,上下方以卡通插圖的方式呈現。 |
|
* TaiVisionLM-base-v2: |
|
這張圖片呈現了發人深省的對比。圖片中央,白色文字中的「Smile」以粗體黑色字母書寫。文字略微有些傾斜,為原本靜止的圖片增添了動感。背景是一個鮮明的白色,突顯文字並確立其在圖片中的重要性。 |
|
背景並非僅僅是白色的;它與黑色文字形成鮮明對比,創造出引人注目的視覺效果。文字、背景和形狀和諧合作,每個元素都互相襯托,形成和諧的構圖。 |
|
圖片底部右角有微妙的脊狀邊緣。脊狀的輪廓為圖片增添了一種深度,吸引觀眾的注意力,探索圖片的整體背景。脊狀邊緣與圖片整體的設計相輔相成,增強了節奏和能量氛圍。 |
|
整體而言,這張圖片是一個色彩和形狀的和諧結合,每個元素都經過精心放置,創造出視覺上令人愉悅的構圖。使用黑色、粗體字和微妙的脊狀邊緣增添了神秘感,將其印象擴展到更深層,既引人入勝又引人思考。 |
|
|
|
|
|
- **Example 2** |
|
![paris](https://static.toiimg.com/thumb/msid-53891743,width-748,height-499,resizemode=4,imgsize-152022/.jpg) |
|
* TaiVisionLM-base-v1: |
|
這是一幅攝影作品,展示了巴黎的鐵塔被水景所環繞 |
|
* TaiVisionLM-base-v2: |
|
這幅圖片捕捉到法國著名地標艾菲爾鐵塔的令人驚嘆的景觀。天空呈現明亮的藍色,與周圍的綠意交織,形成令人驚嘆的構圖。這座高聳的拱門塗上淺棕色的艾菲爾鐵塔,自豪地矗立在畫面右側。它旁邊是河流,它的平靜水域反射著上方的藍天。 |
|
在遠處,其他著名地標的蹤影可見,包括一座標誌性的橋樑和一座城堡般的摩天大樓,為場景增添深度和尺度。前景中的樹木增添了一抹綠意,為鐵塔的淺褐色和藍天的色彩提供了清新的對比。 |
|
這張圖片是從水面上觀看艾菲爾鐵塔的角度拍攝的,提供了對整個景觀的鳥瞰視角。這個視角可以全面地觀察到艾菲爾鐵塔及其周圍環境,展現了它的壯麗以及位於其中的生命。這張圖片中沒有任何虛構的內容,所有描述都是基於圖片中可見的元素。</s> |
|
### Training Procedure |
|
|
|
- **Feature Alignment** |
|
|
|
| Data size | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
|-----------|-------------------|---------------|--------|------------|--------------|
| 1.35M     | 4                 | 5e-3          | 1      | 1024       | 0            |
|
|
|
We use full-parameter finetuning for the projector and apply LoRA to the language model. |
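
As a rough sketch of that setup (not the exact training script), LoRA adapters can be attached to the language model with the PEFT library while the projector stays fully trainable. Here, `model` is assumed to be the TaiVisionLM model loaded as shown earlier, the `language_model`/`projector`/`vision_tower` attribute names are assumptions about the model class, and the LoRA rank, alpha, and target modules are placeholder values; only the learning rate and weight decay come from the table above.

```python
import torch
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to the language model; the base LM weights stay frozen.
lora_config = LoraConfig(
    r=16,             # placeholder rank
    lora_alpha=32,    # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder modules
    task_type="CAUSAL_LM",
)
model.language_model = get_peft_model(model.language_model, lora_config)

# Freeze the vision encoder and keep the projector fully trainable alongside the LoRA adapters.
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.projector.parameters():
    param.requires_grad = True

# Optimize only the trainable parameters (LoRA weights + projector).
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-3, weight_decay=0.0)
```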
|
|
|
We will update the training procedure once we have more resources to train the model on the whole dataset. |
|
![metric](metrics.png) |
|
### Compute Infrastructure |
|
- **Feature Alignment** |
|
1× NVIDIA V100 (32 GB); feature alignment took approximately 45 GPU hours.