ClownRat committed
Commit 1d5815e • 1 Parent(s): 4049b6f

Update videollama2 codebase.

Files changed (50)
  1. VideoLLaMA2/README.md +23 -9
  2. VideoLLaMA2/pyproject.toml +1 -1
  3. VideoLLaMA2/requirements.txt +3 -2
  4. VideoLLaMA2/scripts/custom/finetune.sh +10 -11
  5. VideoLLaMA2/scripts/custom/finetune_lora.sh +10 -11
  6. VideoLLaMA2/scripts/custom/finetune_qlora.sh +10 -11
  7. VideoLLaMA2/scripts/eval/eval_video_cap_msvc.sh +16 -16
  8. VideoLLaMA2/scripts/eval/eval_video_mcqa_egoschema.sh +1 -1
  9. VideoLLaMA2/scripts/eval/eval_video_mcqa_mvbench.sh +1 -1
  10. VideoLLaMA2/scripts/eval/eval_video_mcqa_perception_test_mcqa.sh +1 -1
  11. VideoLLaMA2/scripts/eval/eval_video_mcqa_videomme.sh +1 -1
  12. VideoLLaMA2/scripts/eval/{eval_video_oqa_vcgpt_activitynet.sh → eval_video_oqa_activitynet.sh} +1 -1
  13. VideoLLaMA2/scripts/eval/{eval_video_oqa_vcgpt_msvd.sh → eval_video_oqa_msvd.sh} +1 -1
  14. VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_1_correctness.sh +5 -5
  15. VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_2_detail.sh +7 -7
  16. VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_3_context.sh +4 -4
  17. VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_4_temporal.sh +4 -4
  18. VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_5_consistency.sh +1 -1
  19. VideoLLaMA2/scripts/siglip/finetune_gemma2.sh +0 -75
  20. VideoLLaMA2/scripts/siglip/finetune_mistral.sh +0 -75
  21. VideoLLaMA2/scripts/siglip/finetune_phi3.sh +0 -75
  22. VideoLLaMA2/scripts/siglip/finetune_qwen2.sh +0 -75
  23. VideoLLaMA2/scripts/siglip/pretrain_gemma2.sh +0 -75
  24. VideoLLaMA2/scripts/siglip/pretrain_mistral.sh +0 -75
  25. VideoLLaMA2/scripts/siglip/pretrain_phi3.sh +0 -75
  26. VideoLLaMA2/scripts/siglip/pretrain_qwen2.sh +0 -75
  27. VideoLLaMA2/scripts/vllava/finetune.sh +9 -10
  28. VideoLLaMA2/scripts/vllava/finetune_qwen2.sh +0 -74
  29. VideoLLaMA2/scripts/vllava/pretrain.sh +9 -10
  30. VideoLLaMA2/scripts/vllava/pretrain_qwen2.sh +0 -74
  31. VideoLLaMA2/videollama2/__init__.py +2 -2
  32. VideoLLaMA2/videollama2/eval/inference_video_cap_msvc.py +50 -8
  33. VideoLLaMA2/videollama2/eval/inference_video_mcqa_egoschema.py +15 -10
  34. VideoLLaMA2/videollama2/eval/inference_video_mcqa_mvbench.py +1 -1
  35. VideoLLaMA2/videollama2/eval/inference_video_mcqa_perception_test_mcqa.py +1 -1
  36. VideoLLaMA2/videollama2/eval/inference_video_mcqa_videomme.py +1 -1
  37. VideoLLaMA2/videollama2/eval/inference_video_oqa_activitynet.py +16 -8
  38. VideoLLaMA2/videollama2/mm_utils.py +2 -1
  39. VideoLLaMA2/videollama2/model/__init__.py +110 -131
  40. VideoLLaMA2/videollama2/model/encoder.py +9 -1
  41. VideoLLaMA2/videollama2/model/videollama2_arch.py +3 -0
  42. VideoLLaMA2/videollama2/model/videollama2_gemma2.py +0 -157
  43. VideoLLaMA2/videollama2/model/videollama2_llama.py +6 -6
  44. VideoLLaMA2/videollama2/model/videollama2_mistral.py +1 -1
  45. VideoLLaMA2/videollama2/model/videollama2_mixtral.py +1 -1
  46. VideoLLaMA2/videollama2/model/videollama2_phi3.py +0 -157
  47. VideoLLaMA2/videollama2/model/videollama2_qwen2.py +1 -1
  48. VideoLLaMA2/videollama2/serve/gradio_web_server_adhoc.py +21 -15
  49. VideoLLaMA2/videollama2/train.py +20 -32
  50. VideoLLaMA2/videollama2/train_flash_attn.py +0 -12
VideoLLaMA2/README.md CHANGED
@@ -19,6 +19,12 @@ VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Vid
19
 
20
  </h5>
21
 
22
  <details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary><p>
23
  <!-- may -->
24
 
@@ -36,6 +42,8 @@ VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Vid
36
 
37
 
38
  ## 📰 News
 
 
39
  * **[2024.07.30]** Release checkpoints of [VideoLLaMA2-8x7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B-Base) and [VideoLLaMA2-8x7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B).
40
  * **[2024.06.25]** 🔥🔥 As of Jun 25, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **Top-1** ~7B-sized VideoLLM on the [MLVU Leaderboard](https://github.com/JUNJIE99/MLVU?tab=readme-ov-file#trophy-mini-leaderboard).
41
  * **[2024.06.18]** 🔥🔥 As of Jun 18, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **Top-1** ~7B-sized VideoLLM on the [VideoMME Leaderboard](https://video-mme.github.io/home_page.html#leaderboard).
@@ -51,8 +59,8 @@ Basic Dependencies:
51
  * Python >= 3.8
52
  * Pytorch >= 2.2.0
53
  * CUDA Version >= 11.8
54
- * transformers >= 4.41.2 (for mistral tokenizer)
55
- * tokenizers >= 0.19.1 (for mistral tokenizer)
56
 
57
  **[Online Mode]** Install required packages (better for development):
58
  ```bash
@@ -74,11 +82,12 @@ pip install flash-attn==2.5.8 --no-build-isolation
74
  ## 🚀 Main Results
75
 
76
  ### Multi-Choice Video QA & Video Captioning
77
- <p><img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/9cc4a5ae-d850-4eef-bd51-83688b94698e" width="800" "/></p>
78
-
79
 
80
  ### Open-Ended Video QA
81
- <p><img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/2ed7aa53-db56-4829-8375-85aefbc5120a" width="800" "/></p>
 
 
82
 
83
  ## :earth_americas: Model Zoo
84
  | Model Name | Model Type | Visual Encoder | Language Decoder | # Training Frames |
@@ -89,6 +98,11 @@ pip install flash-attn==2.5.8 --no-build-isolation
89
  | [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) | Chat | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 16 |
90
  | [VideoLLaMA2-8x7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B-Base) | Base | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8 |
91
  | [VideoLLaMA2-8x7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B) | Chat | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8 |
 
92
 
93
 
94
  ## [🤗 Demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA2)
@@ -251,7 +265,7 @@ VideoLLaMA2
251
  ...
252
  --data_path datasets/custom_sft/custom.json
253
  --data_folder datasets/custom_sft/
254
- --pretrain_mm_mlp_adapter CONNECTOR_DOWNLOAD_PATH (e.g., DAMO-NLP-SG/VideoLLaMA2-7B-Base)
255
  ...
256
  ```
257
 
@@ -269,7 +283,7 @@ def inference():
269
  disable_torch_init()
270
 
271
  # Video Inference
272
- modal = 'videp'
273
  modal_path = 'assets/cat_and_chicken.mp4'
274
  instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
275
  # Reply:
@@ -282,9 +296,9 @@ def inference():
282
  # Reply:
283
  # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
284
 
285
- model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
286
  # Base model inference (only need to replace model_path)
287
- # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
288
  model, processor, tokenizer = model_init(model_path)
289
  output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
290
 
 
19
 
20
  </h5>
21
 
22
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videollama-2-advancing-spatial-temporal/zero-shot-video-question-answer-on-egoschema-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-egoschema-1?p=videollama-2-advancing-spatial-temporal) <br>
23
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videollama-2-advancing-spatial-temporal/video-question-answering-on-perception-test)](https://paperswithcode.com/sota/video-question-answering-on-perception-test?p=videollama-2-advancing-spatial-temporal) <br>
24
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videollama-2-advancing-spatial-temporal/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=videollama-2-advancing-spatial-temporal) <br>
25
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videollama-2-advancing-spatial-temporal/zero-shot-video-question-answer-on-video-mme-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-video-mme-1?p=videollama-2-advancing-spatial-temporal) <br>
26
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videollama-2-advancing-spatial-temporal/zero-shot-video-question-answer-on-video-mme)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-video-mme?p=videollama-2-advancing-spatial-temporal) <br>
27
+
28
  <details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary><p>
29
  <!-- may -->
30
 
 
42
 
43
 
44
  ## 📰 News
45
+ * **[2024.10.15]** Release checkpoints of [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base) and [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F)
46
+ * **[2024.08.14]** Release checkpoints of [VideoLLaMA2-72B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-72B-Base) and [VideoLLaMA2-72B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-72B)
47
  * **[2024.07.30]** Release checkpoints of [VideoLLaMA2-8x7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B-Base) and [VideoLLaMA2-8x7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B).
48
  * **[2024.06.25]** 🔥🔥 As of Jun 25, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **Top-1** ~7B-sized VideoLLM on the [MLVU Leaderboard](https://github.com/JUNJIE99/MLVU?tab=readme-ov-file#trophy-mini-leaderboard).
49
  * **[2024.06.18]** 🔥🔥 As of Jun 18, our [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) is the **Top-1** ~7B-sized VideoLLM on the [VideoMME Leaderboard](https://video-mme.github.io/home_page.html#leaderboard).
 
59
  * Python >= 3.8
60
  * Pytorch >= 2.2.0
61
  * CUDA Version >= 11.8
62
+ * transformers == 4.40.0 (for reproducing paper results)
63
+ * tokenizers == 0.19.1
64
 
65
  **[Online Mode]** Install required packages (better for development):
66
  ```bash
 
82
  ## 🚀 Main Results
83
 
84
  ### Multi-Choice Video QA & Video Captioning
85
+ <p><img src="https://github.com/user-attachments/assets/e87fe4cf-07ea-4fde-998b-a0c63671c3b4" width="800" "/></p>
 
86
 
87
  ### Open-Ended Video QA
88
+ <p><img src="https://github.com/user-attachments/assets/80b16c04-75ac-43b8-bc22-6952fdf994bb" width="800" "/></p>
89
+
90
+
91
 
92
  ## :earth_americas: Model Zoo
93
  | Model Name | Model Type | Visual Encoder | Language Decoder | # Training Frames |
 
98
  | [VideoLLaMA2-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-7B-16F) | Chat | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 16 |
99
  | [VideoLLaMA2-8x7B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B-Base) | Base | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8 |
100
  | [VideoLLaMA2-8x7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-8x7B) | Chat | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8 |
101
+ | [VideoLLaMA2-72B-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-72B-Base) | Base | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 |
102
+ | [VideoLLaMA2-72B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2-72B) | Chat | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 |
103
+ | [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base) | Base | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
104
+ | [VideoLLaMA2.1-7B-16F](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F) | Chat | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
105
+
106
 
107
 
108
  ## [🤗 Demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA2)
 
265
  ...
266
  --data_path datasets/custom_sft/custom.json
267
  --data_folder datasets/custom_sft/
268
+ --pretrain_mm_mlp_adapter CONNECTOR_DOWNLOAD_PATH (e.g., DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base)
269
  ...
270
  ```
271
 
 
283
  disable_torch_init()
284
 
285
  # Video Inference
286
+ modal = 'video'
287
  modal_path = 'assets/cat_and_chicken.mp4'
288
  instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
289
  # Reply:
 
296
  # Reply:
297
  # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
298
 
299
+ model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
300
  # Base model inference (only need to replace model_path)
301
+ # model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base'
302
  model, processor, tokenizer = model_init(model_path)
303
  output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
304
 
VideoLLaMA2/pyproject.toml CHANGED
@@ -14,7 +14,7 @@ classifiers = [
14
  ]
15
  dependencies = [
16
  "torch==2.2.0", "torchvision==0.17.0",
17
- "transformers==4.42.3", "tokenizers==0.19.1",
18
  "deepspeed==0.13.1", "accelerate==0.26.1",
19
  "peft==0.4.0", "timm==1.0.3", "numpy==1.24.4",
20
  "decord==0.6.0", "imageio==2.34.0", "imageio-ffmpeg==0.4.9",
 
14
  ]
15
  dependencies = [
16
  "torch==2.2.0", "torchvision==0.17.0",
17
+ "transformers==4.40.0", "tokenizers==0.19.1",
18
  "deepspeed==0.13.1", "accelerate==0.26.1",
19
  "peft==0.4.0", "timm==1.0.3", "numpy==1.24.4",
20
  "decord==0.6.0", "imageio==2.34.0", "imageio-ffmpeg==0.4.9",
VideoLLaMA2/requirements.txt CHANGED
@@ -2,7 +2,7 @@
2
  # basic dependencies
3
  torch==2.2.0
4
  torchvision==0.17.0
5
- transformers==4.42.3
6
  tokenizers==0.19.1
7
  deepspeed==0.13.1
8
  accelerate==0.26.1
@@ -36,4 +36,5 @@ uvicorn
36
  fastapi
37
  tensorboard
38
  wandb
39
- tabulate
 
 
2
  # basic dependencies
3
  torch==2.2.0
4
  torchvision==0.17.0
5
+ transformers==4.40.0
6
  tokenizers==0.19.1
7
  deepspeed==0.13.1
8
  accelerate==0.26.1
 
36
  fastapi
37
  tensorboard
38
  wandb
39
+ tabulate
40
+ spaces==0.29.2
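
The dependency pins above move transformers from 4.42.3 back to 4.40.0 (the version used to reproduce the paper results), keep tokenizers at 0.19.1, and add spaces==0.29.2 (presumably for the Hugging Face Spaces demo). A minimal install sketch using only versions stated in this diff; the final editable-install step is an assumption based on the repo's usual pip workflow, not part of the commit:

```bash
# Sketch of an environment matching the pins in this commit.
# All version numbers are taken from requirements.txt / pyproject.toml above.
pip install torch==2.2.0 torchvision==0.17.0
pip install transformers==4.40.0 tokenizers==0.19.1
pip install deepspeed==0.13.1 accelerate==0.26.1 peft==0.4.0
pip install flash-attn==2.5.8 --no-build-isolation   # as in the README's install section
pip install -e .   # assumed: install the VideoLLaMA2 package itself in editable mode
```
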
VideoLLaMA2/scripts/custom/finetune.sh CHANGED
@@ -5,7 +5,7 @@ ARG_WORLD_SIZE=${1:-1}
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
@@ -28,8 +28,8 @@ GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$L
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2
32
- RUN_NAME=downstream_sft_settings
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
@@ -38,18 +38,18 @@ torchrun --nnodes $WORLD_SIZE \
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
  --deepspeed scripts/zero3.json \
43
- --model_type videollama2 \
44
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
45
- --vision_tower openai/clip-vit-large-patch14-336 \
46
- --mm_projector_type stc_connector \
47
- --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B-Base/mm_projector.bin \
48
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
49
  --data_folder ${DATA_DIR}/videollava_sft/ \
50
  --mm_vision_select_layer -2 \
51
  --image_aspect_ratio pad \
52
- --num_frames 8 \
53
  --bf16 True \
54
  --tf32 True \
55
  --fp16 False \
@@ -58,7 +58,6 @@ torchrun --nnodes $WORLD_SIZE \
58
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
  --per_device_eval_batch_size 4 \
60
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
  --save_strategy "steps" \
63
  --save_steps 500 \
64
  --save_total_limit 99 \
 
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
+ ARG_RANK=${3:-0}
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
 
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
+ export WANDB_PROJECT=videollama2qwen2_downstream_sft
32
+ RUN_NAME=siglip_tcv35_7b_16f
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
 
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
+ videollama2/train.py \
42
  --deepspeed scripts/zero3.json \
43
+ --model_type videollama2_qwen2 \
44
+ --model_path Qwen/Qwen2-7B-Instruct \
45
+ --vision_tower google/siglip-so400m-patch14-384 \
46
+ --mm_projector_type stc_connector_v35 \
47
+ --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base/mm_projector.bin \
48
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
49
  --data_folder ${DATA_DIR}/videollava_sft/ \
50
  --mm_vision_select_layer -2 \
51
  --image_aspect_ratio pad \
52
+ --num_frames 16 \
53
  --bf16 True \
54
  --tf32 True \
55
  --fp16 False \
 
58
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
  --per_device_eval_batch_size 4 \
60
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
 
61
  --save_strategy "steps" \
62
  --save_steps 500 \
63
  --save_total_limit 99 \
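
The rewritten finetune.sh now takes the node rank as a third positional argument (ARG_RANK=${3:-0}) instead of hard-coding 0, and launches videollama2/train.py with the Qwen2-7B-Instruct / SigLIP / stc_connector_v35 configuration. A hedged usage sketch of that positional interface; the two-node layout below is illustrative, not a setup prescribed by the repo:

```bash
# Single node, 8 GPUs: rely on the defaults WORLD_SIZE=1, NPROC_PER_NODE=8, RANK=0.
bash scripts/custom/finetune.sh

# Hypothetical 2-node run: pass WORLD_SIZE, GPUs per node, and the per-node rank.
# MASTER_ADDR / MASTER_PORT can be exported beforehand to override the
# 127.0.0.1:16666 defaults baked into the script.
bash scripts/custom/finetune.sh 2 8 0   # on node 0 (the master)
bash scripts/custom/finetune.sh 2 8 1   # on node 1
```
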
VideoLLaMA2/scripts/custom/finetune_lora.sh CHANGED
@@ -5,7 +5,7 @@ ARG_WORLD_SIZE=${1:-1}
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
@@ -28,8 +28,8 @@ GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$L
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2
32
- RUN_NAME=downstream_sft_settings_lora
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
@@ -38,19 +38,19 @@ torchrun --nnodes $WORLD_SIZE \
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
  --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
43
  --deepspeed scripts/zero3.json \
44
- --model_type videollama2 \
45
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
46
- --vision_tower openai/clip-vit-large-patch14-336 \
47
- --mm_projector_type stc_connector \
48
- --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B-Base/mm_projector.bin \
49
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
  --data_folder ${DATA_DIR}/videollava_sft/ \
51
  --mm_vision_select_layer -2 \
52
  --image_aspect_ratio pad \
53
- --num_frames 8 \
54
  --bf16 True \
55
  --tf32 True \
56
  --fp16 False \
@@ -59,7 +59,6 @@ torchrun --nnodes $WORLD_SIZE \
59
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
  --per_device_eval_batch_size 4 \
61
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
  --save_strategy "steps" \
64
  --save_steps 500 \
65
  --save_total_limit 99 \
 
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
+ ARG_RANK=${3:-0}
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
 
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
+ export WANDB_PROJECT=videollama2qwen2_downstream_sft
32
+ RUN_NAME=siglip_tcv35_7b_16f_lora
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
 
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
+ videollama2/train.py \
42
  --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
43
  --deepspeed scripts/zero3.json \
44
+ --model_type videollama2_qwen2 \
45
+ --model_path Qwen/Qwen2-7B-Instruct \
46
+ --vision_tower google/siglip-so400m-patch14-384 \
47
+ --mm_projector_type stc_connector_v35 \
48
+ --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base/mm_projector.bin \
49
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
  --data_folder ${DATA_DIR}/videollava_sft/ \
51
  --mm_vision_select_layer -2 \
52
  --image_aspect_ratio pad \
53
+ --num_frames 16 \
54
  --bf16 True \
55
  --tf32 True \
56
  --fp16 False \
 
59
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
  --per_device_eval_batch_size 4 \
61
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
 
62
  --save_strategy "steps" \
63
  --save_steps 500 \
64
  --save_total_limit 99 \
VideoLLaMA2/scripts/custom/finetune_qlora.sh CHANGED
@@ -5,7 +5,7 @@ ARG_WORLD_SIZE=${1:-1}
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
@@ -28,8 +28,8 @@ GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$L
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2
32
- RUN_NAME=downstream_sft_settings_qlora
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
@@ -38,19 +38,19 @@ torchrun --nnodes $WORLD_SIZE \
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
  --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --bits 4 \
43
  --deepspeed scripts/zero2.json \
44
- --model_type videollama2 \
45
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
46
- --vision_tower openai/clip-vit-large-patch14-336 \
47
- --mm_projector_type stc_connector \
48
- --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2-7B-Base/mm_projector.bin \
49
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
  --data_folder ${DATA_DIR}/videollava_sft/ \
51
  --mm_vision_select_layer -2 \
52
  --image_aspect_ratio pad \
53
- --num_frames 8 \
54
  --bf16 True \
55
  --tf32 True \
56
  --fp16 False \
@@ -59,7 +59,6 @@ torchrun --nnodes $WORLD_SIZE \
59
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
  --per_device_eval_batch_size 4 \
61
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
  --save_strategy "steps" \
64
  --save_steps 500 \
65
  --save_total_limit 99 \
 
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
+ ARG_RANK=${3:-0}
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
 
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
+ export WANDB_PROJECT=videollama2qwen2_downstream_sft
32
+ RUN_NAME=siglip_tcv35_7b_16f_qlora
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
 
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
+ videollama2/train.py \
42
  --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --bits 4 \
43
  --deepspeed scripts/zero2.json \
44
+ --model_type videollama2_qwen2 \
45
+ --model_path Qwen/Qwen2-7B-Instruct \
46
+ --vision_tower google/siglip-so400m-patch14-384 \
47
+ --mm_projector_type stc_connector_v35 \
48
+ --pretrain_mm_mlp_adapter DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base/mm_projector.bin \
49
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
  --data_folder ${DATA_DIR}/videollava_sft/ \
51
  --mm_vision_select_layer -2 \
52
  --image_aspect_ratio pad \
53
+ --num_frames 16 \
54
  --bf16 True \
55
  --tf32 True \
56
  --fp16 False \
 
59
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
  --per_device_eval_batch_size 4 \
61
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
 
62
  --save_strategy "steps" \
63
  --save_steps 500 \
64
  --save_total_limit 99 \
VideoLLaMA2/scripts/eval/eval_video_cap_msvc.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
@@ -12,11 +12,11 @@ IFS=',' read -ra GPULIST <<< "$gpu_list"
12
  GPUS_PER_TASK=1
13
  CHUNKS=$((${#GPULIST[@]}/$GPUS_PER_TASK))
14
 
15
- output_file=${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/merge.json
16
 
17
  # judge if the number of json lines is 0
18
  if [ ! -f "$output_file" ] || [ $(cat "$output_file" | wc -l) -eq 0 ]; then
19
- rm -f ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/*.json
20
  fi
21
 
22
  if [ ! -f "$output_file" ]; then
@@ -25,9 +25,9 @@ if [ ! -f "$output_file" ]; then
25
  gpu_devices=$(IFS=,; echo "${GPULIST[*]:$(($IDX*$GPUS_PER_TASK)):$GPUS_PER_TASK}")
26
  TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=${gpu_devices} python3 videollama2/eval/inference_video_cap_msvc.py \
27
  --model-path ${CKPT} \
28
- --video-folder ${EVAL_DATA_DIR}/MSVC \
29
- --question-file ${EVAL_DATA_DIR}/MSVC/msvc.json \
30
- --output-file ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/${CHUNKS}_${IDX}.json \
31
  --num-chunks $CHUNKS \
32
  --chunk-idx $IDX &
33
  done
@@ -39,28 +39,28 @@ if [ ! -f "$output_file" ]; then
39
 
40
  #Loop through the indices and concatenate each file.
41
  for IDX in $(seq 0 $((CHUNKS-1))); do
42
- cat ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/${CHUNKS}_${IDX}.json >> "$output_file"
43
  done
44
  fi
45
 
46
 
47
- AZURE_API_KEY=""
48
- AZURE_API_ENDPOINT=""
49
- AZURE_API_DEPLOYNAME=""
50
 
51
- python3 videollama2/new_eval/eval_video_cap_msvc_correctness.py \
52
  --pred-path $output_file \
53
- --output-dir ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/correctness_gpt \
54
- --output-json ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/correctness_results.json \
55
  --api-key $AZURE_API_KEY \
56
  --api-endpoint $AZURE_API_ENDPOINT \
57
  --api-deployname $AZURE_API_DEPLOYNAME \
58
  --num-tasks 4 \
59
 
60
- python3 videollama2/new_eval/eval_video_cap_msvc_detailedness.py \
61
  --pred-path $output_file \
62
- --output-dir ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/detailedness_gpt \
63
- --output-json ${OUTPUT_DIR}/MSVC/answers/${CKPT_NAME}/detailedness_results.json \
64
  --api-key $AZURE_API_KEY \
65
  --api-endpoint $AZURE_API_ENDPOINT \
66
  --api-deployname $AZURE_API_DEPLOYNAME \
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
12
  GPUS_PER_TASK=1
13
  CHUNKS=$((${#GPULIST[@]}/$GPUS_PER_TASK))
14
 
15
+ output_file=${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/merge.json
16
 
17
  # judge if the number of json lines is 0
18
  if [ ! -f "$output_file" ] || [ $(cat "$output_file" | wc -l) -eq 0 ]; then
19
+ rm -f ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/*.json
20
  fi
21
 
22
  if [ ! -f "$output_file" ]; then
 
25
  gpu_devices=$(IFS=,; echo "${GPULIST[*]:$(($IDX*$GPUS_PER_TASK)):$GPUS_PER_TASK}")
26
  TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=${gpu_devices} python3 videollama2/eval/inference_video_cap_msvc.py \
27
  --model-path ${CKPT} \
28
+ --video-folder ${EVAL_DATA_DIR}/msvc \
29
+ --question-file ${EVAL_DATA_DIR}/msvc/msvc.json \
30
+ --output-file ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/${CHUNKS}_${IDX}.json \
31
  --num-chunks $CHUNKS \
32
  --chunk-idx $IDX &
33
  done
 
39
 
40
  #Loop through the indices and concatenate each file.
41
  for IDX in $(seq 0 $((CHUNKS-1))); do
42
+ cat ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/${CHUNKS}_${IDX}.json >> "$output_file"
43
  done
44
  fi
45
 
46
 
47
+ AZURE_API_KEY=your_key
48
+ AZURE_API_ENDPOINT=your_endpoint
49
+ AZURE_API_DEPLOYNAME=your_deployname
50
 
51
+ python3 videollama2/eval/eval_video_cap_msvc_correctness.py \
52
  --pred-path $output_file \
53
+ --output-dir ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/correctness_gpt \
54
+ --output-json ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/correctness_results.json \
55
  --api-key $AZURE_API_KEY \
56
  --api-endpoint $AZURE_API_ENDPOINT \
57
  --api-deployname $AZURE_API_DEPLOYNAME \
58
  --num-tasks 4 \
59
 
60
+ python3 videollama2/eval/eval_video_cap_msvc_detailedness.py \
61
  --pred-path $output_file \
62
+ --output-dir ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/detailedness_gpt \
63
+ --output-json ${OUTPUT_DIR}/msvc/answers/${CKPT_NAME}/detailedness_results.json \
64
  --api-key $AZURE_API_KEY \
65
  --api-endpoint $AZURE_API_ENDPOINT \
66
  --api-deployname $AZURE_API_DEPLOYNAME \
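
This script shards MSVC caption inference across the GPUs listed in CUDA_VISIBLE_DEVICES (one chunk per GPU via --num-chunks / --chunk-idx), concatenates the per-chunk JSON files into merge.json, and then runs the two GPT-based judges, which now read Azure credentials from the placeholder variables instead of hard-coded values. A hedged invocation sketch; the GPU list is an example, not a repo default:

```bash
# Fill in AZURE_API_KEY, AZURE_API_ENDPOINT and AZURE_API_DEPLOYNAME inside the
# script first, then shard inference over four GPUs; per-chunk answers and the
# merged file land under eval_output/msvc/answers/<CKPT_NAME>/.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/eval_video_cap_msvc.sh
```
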
VideoLLaMA2/scripts/eval/eval_video_mcqa_egoschema.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/eval_video_mcqa_mvbench.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/eval_video_mcqa_perception_test_mcqa.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/eval_video_mcqa_videomme.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/{eval_video_oqa_vcgpt_activitynet.sh β†’ eval_video_oqa_activitynet.sh} RENAMED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/{eval_video_oqa_vcgpt_msvd.sh β†’ eval_video_oqa_msvd.sh} RENAMED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_1_correctness.sh CHANGED
@@ -1,8 +1,8 @@
1
  set -x
2
 
3
- EVAL_DATA_DIR=dataset/videollm_eval
4
- OUTPUT_DIR=eval
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
@@ -18,7 +18,7 @@ if [ ! -f "$output_file" ]; then
18
  for IDX in $(seq 0 $((CHUNKS-1))); do
19
  # select the GPUs for the task
20
  gpu_devices=$(IFS=,; echo "${GPULIST[*]:$(($IDX*$GPUS_PER_TASK)):$GPUS_PER_TASK}")
21
- TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=${gpu_devices} python3 videollama2/new_eval/inference_video_oqa_vcgpt_general.py \
22
  --model-path ${CKPT} \
23
  --video-folder ${EVAL_DATA_DIR}/videochatgpt_gen/Test_Videos \
24
  --question-file ${EVAL_DATA_DIR}/videochatgpt_gen/generic_qa.json \
@@ -48,7 +48,7 @@ AZURE_API_KEY=your_key
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
- python3 videollama2/new_eval/eval_video_oqa_vcgpt_1_correctness.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/correctness/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/correctness/${CKPT_NAME}/results.json \
 
1
  set -x
2
 
3
+ EVAL_DATA_DIR=eval
4
+ OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
18
  for IDX in $(seq 0 $((CHUNKS-1))); do
19
  # select the GPUs for the task
20
  gpu_devices=$(IFS=,; echo "${GPULIST[*]:$(($IDX*$GPUS_PER_TASK)):$GPUS_PER_TASK}")
21
+ TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=${gpu_devices} python3 videollama2/eval/inference_video_oqa_vcgpt_general.py \
22
  --model-path ${CKPT} \
23
  --video-folder ${EVAL_DATA_DIR}/videochatgpt_gen/Test_Videos \
24
  --question-file ${EVAL_DATA_DIR}/videochatgpt_gen/generic_qa.json \
 
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
+ python3 videollama2/eval/eval_video_oqa_vcgpt_1_correctness.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/correctness/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/correctness/${CKPT_NAME}/results.json \
VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_2_detail.sh CHANGED
@@ -1,8 +1,8 @@
1
  set -x
2
 
3
- EVAL_DATA_DIR=dataset/videollm_eval
4
- OUTPUT_DIR=eval
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
@@ -48,11 +48,11 @@ AZURE_API_KEY=your_key
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
- python3 videollama2/eval/eval_benchmark_2_detailed_orientation.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/detail/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/detail/${CKPT_NAME}/results.json \
55
- --api-key "35632dae7dd94d0a93338db373c63893" \
56
- --api-endpoint https://damo-openai-gpt4v-test.openai.azure.com \
57
- --api-deployname gpt-35-turbo \
58
  --num-tasks 4
 
1
  set -x
2
 
3
+ EVAL_DATA_DIR=eval
4
+ OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
+ python3 videollama2/eval/eval_video_oqa_vcgpt_2_detailed_orientation.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/detail/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/detail/${CKPT_NAME}/results.json \
55
+ --api-key $AZURE_API_KEY \
56
+ --api-endpoint $AZURE_API_ENDPOINT \
57
+ --api-deployname $AZURE_API_DEPLOYNAME \
58
  --num-tasks 4
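
As with the correctness script above, the detailed-orientation judge now passes AZURE_API_KEY / AZURE_API_ENDPOINT / AZURE_API_DEPLOYNAME placeholders instead of hard-coded credentials. A sketch of the values to substitute inside each eval_video_oqa_vcgpt_*.sh script; the endpoint and deployment name below are illustrative examples, not project defaults:

```bash
# Example placeholder values for the GPT-assisted judge scripts (replace with
# the key, endpoint and deployment name of your own Azure OpenAI resource).
AZURE_API_KEY="<your Azure OpenAI API key>"
AZURE_API_ENDPOINT="https://<your-resource-name>.openai.azure.com"
AZURE_API_DEPLOYNAME="<your chat model deployment name>"
```
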
VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_3_context.sh CHANGED
@@ -1,8 +1,8 @@
1
  set -x
2
 
3
- EVAL_DATA_DIR=dataset/videollm_eval
4
- OUTPUT_DIR=eval
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
@@ -48,7 +48,7 @@ AZURE_API_KEY=your_key
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
- python3 videollama2/eval/eval_benchmark_3_context.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/context/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/context/${CKPT_NAME}/results.json \
 
1
  set -x
2
 
3
+ EVAL_DATA_DIR=eval
4
+ OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
48
  AZURE_API_ENDPOINT=your_endpoint
49
  AZURE_API_DEPLOYNAME=your_deployname
50
 
51
+ python3 videollama2/eval/eval_video_oqa_vcgpt_3_context.py \
52
  --pred-path ${output_file} \
53
  --output-dir ${OUTPUT_DIR}/videochatgpt_gen/answers/context/${CKPT_NAME}/gpt \
54
  --output-json ${OUTPUT_DIR}/videochatgpt_gen/answers/context/${CKPT_NAME}/results.json \
VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_4_temporal.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
@@ -40,9 +40,9 @@ if [ ! -f "$output_file" ]; then
40
  fi
41
 
42
 
43
- AZURE_API_KEY=a7f9bc087b7143a69d59a68f01a2b450
44
- AZURE_API_ENDPOINT=https://vl-australiaeast.openai.azure.com
45
- AZURE_API_DEPLOYNAME=gpt35-turbo-0613
46
 
47
  python3 videollama2/eval/eval_video_oqa_vcgpt_4_temporal.py \
48
  --pred-path ${output_file} \
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
40
  fi
41
 
42
 
43
+ AZURE_API_KEY=your_key
44
+ AZURE_API_ENDPOINT=your_endpoint
45
+ AZURE_API_DEPLOYNAME=your_deployname
46
 
47
  python3 videollama2/eval/eval_video_oqa_vcgpt_4_temporal.py \
48
  --pred-path ${output_file} \
VideoLLaMA2/scripts/eval/eval_video_oqa_vcgpt_5_consistency.sh CHANGED
@@ -2,7 +2,7 @@ set -x
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
- CKPT=DAMO-NLP-SG/VideoLLaMA2-7B
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
 
2
 
3
  EVAL_DATA_DIR=eval
4
  OUTPUT_DIR=eval_output
5
+ CKPT=DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
6
  CKPT_NAME=$(echo $CKPT | rev | cut -d'/' -f1 | rev)
7
 
8
  gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
VideoLLaMA2/scripts/siglip/finetune_gemma2.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16667
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=128
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2gemma2_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_gemma2 \
45
- --model_path google/gemma-2-2b-it \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
49
- --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
- --data_folder ${DATA_DIR}/videollava_sft/ \
51
- --mm_vision_select_layer -2 \
52
- --image_aspect_ratio pad \
53
- --num_frames 8 \
54
- --bf16 True \
55
- --tf32 True \
56
- --fp16 False \
57
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
58
- --num_train_epochs 3 \
59
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
- --per_device_eval_batch_size 4 \
61
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
- --save_strategy "steps" \
64
- --save_steps 200 \
65
- --save_total_limit 99 \
66
- --learning_rate 2e-5 \
67
- --weight_decay 0. \
68
- --warmup_ratio 0.03 \
69
- --lr_scheduler_type "cosine" \
70
- --logging_steps 1 \
71
- --model_max_length 2048 \
72
- --gradient_checkpointing True \
73
- --dataloader_num_workers 4 \
74
- --report_to tensorboard \
75
- --run_name finetune_$RUN_NAME \
VideoLLaMA2/scripts/siglip/finetune_mistral.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16667
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=128
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2mistral_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2 \
45
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
49
- --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
- --data_folder ${DATA_DIR}/videollava_sft/ \
51
- --mm_vision_select_layer -2 \
52
- --image_aspect_ratio pad \
53
- --num_frames 8 \
54
- --bf16 True \
55
- --tf32 True \
56
- --fp16 False \
57
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
58
- --num_train_epochs 3 \
59
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
- --per_device_eval_batch_size 4 \
61
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
- --save_strategy "steps" \
64
- --save_steps 200 \
65
- --save_total_limit 99 \
66
- --learning_rate 2e-5 \
67
- --weight_decay 0. \
68
- --warmup_ratio 0.03 \
69
- --lr_scheduler_type "cosine" \
70
- --logging_steps 1 \
71
- --model_max_length 2048 \
72
- --gradient_checkpointing True \
73
- --dataloader_num_workers 4 \
74
- --report_to wandb \
75
- --run_name finetune_$RUN_NAME \
VideoLLaMA2/scripts/siglip/finetune_phi3.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16667
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=128
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2phi3_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_phi3 \
45
- --model_path microsoft/Phi-3-mini-4k-instruct \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
49
- --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
- --data_folder ${DATA_DIR}/videollava_sft/ \
51
- --mm_vision_select_layer -2 \
52
- --image_aspect_ratio pad \
53
- --num_frames 8 \
54
- --bf16 True \
55
- --tf32 True \
56
- --fp16 False \
57
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
58
- --num_train_epochs 3 \
59
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
- --per_device_eval_batch_size 4 \
61
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
- --save_strategy "steps" \
64
- --save_steps 200 \
65
- --save_total_limit 99 \
66
- --learning_rate 2e-5 \
67
- --weight_decay 0. \
68
- --warmup_ratio 0.03 \
69
- --lr_scheduler_type "cosine" \
70
- --logging_steps 1 \
71
- --model_max_length 2048 \
72
- --gradient_checkpointing True \
73
- --dataloader_num_workers 4 \
74
- --report_to tensorboard \
75
- --run_name finetune_$RUN_NAME \
VideoLLaMA2/scripts/siglip/finetune_qwen2.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=128
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2qwen2_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_qwen2 \
45
- --model_path Qwen/Qwen2-7B-Instruct \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
49
- --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
50
- --data_folder ${DATA_DIR}/videollava_sft/ \
51
- --mm_vision_select_layer -2 \
52
- --image_aspect_ratio pad \
53
- --num_frames 8 \
54
- --bf16 True \
55
- --tf32 True \
56
- --fp16 False \
57
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
58
- --num_train_epochs 1 \
59
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
60
- --per_device_eval_batch_size 4 \
61
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
62
- --evaluation_strategy "no" \
63
- --save_strategy "steps" \
64
- --save_steps 500 \
65
- --save_total_limit 99 \
66
- --learning_rate 2e-5 \
67
- --weight_decay 0. \
68
- --warmup_ratio 0.03 \
69
- --lr_scheduler_type "cosine" \
70
- --logging_steps 1 \
71
- --model_max_length 2048 \
72
- --gradient_checkpointing True \
73
- --dataloader_num_workers 4 \
74
- --report_to tensorboard \
75
- --run_name $RUN_NAME \
VideoLLaMA2/scripts/siglip/pretrain_gemma2.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=256
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2gemma2_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_gemma2 \
45
- --model_path google/gemma-2-2b-it \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --tune_mm_mlp_adapter True \
49
- --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
50
- --data_folder ${DATA_DIR}/videollava_pt/ \
51
- --mm_vision_select_layer -2 \
52
- --num_frames 8 \
53
- --bf16 True \
54
- --tf32 True \
55
- --fp16 False \
56
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME} \
57
- --num_train_epochs 1 \
58
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
- --per_device_eval_batch_size 4 \
60
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
- --save_strategy "steps" \
63
- --save_steps 500 \
64
- --save_total_limit 99 \
65
- --learning_rate 1e-3 \
66
- --weight_decay 0. \
67
- --warmup_ratio 0.03 \
68
- --lr_scheduler_type "cosine" \
69
- --logging_steps 1 \
70
- --model_max_length 2048 \
71
- --gradient_checkpointing True \
72
- --dataloader_num_workers 4 \
73
- --lazy_preprocess True \
74
- --report_to tensorboard \
75
- --run_name pretrain_$RUN_NAME \
VideoLLaMA2/scripts/siglip/pretrain_mistral.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=256
26
- LOCAL_BATCH_SIZE=8
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2mistral_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2 \
45
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --tune_mm_mlp_adapter True \
49
- --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
50
- --data_folder ${DATA_DIR}/videollava_pt/ \
51
- --mm_vision_select_layer -2 \
52
- --num_frames 8 \
53
- --bf16 True \
54
- --tf32 True \
55
- --fp16 False \
56
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME} \
57
- --num_train_epochs 1 \
58
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
- --per_device_eval_batch_size 4 \
60
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
- --save_strategy "steps" \
63
- --save_steps 500 \
64
- --save_total_limit 99 \
65
- --learning_rate 1e-3 \
66
- --weight_decay 0. \
67
- --warmup_ratio 0.03 \
68
- --lr_scheduler_type "cosine" \
69
- --logging_steps 1 \
70
- --model_max_length 2048 \
71
- --gradient_checkpointing True \
72
- --dataloader_num_workers 16 \
73
- --lazy_preprocess True \
74
- --report_to tensorboard \
75
- --run_name pretrain_$RUN_NAME \
VideoLLaMA2/scripts/siglip/pretrain_phi3.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=256
26
- LOCAL_BATCH_SIZE=8
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2phi3_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_phi3 \
45
- --model_path microsoft/Phi-3-mini-4k-instruct \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --tune_mm_mlp_adapter True \
49
- --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
50
- --data_folder ${DATA_DIR}/videollava_pt/ \
51
- --mm_vision_select_layer -2 \
52
- --num_frames 8 \
53
- --bf16 True \
54
- --tf32 True \
55
- --fp16 False \
56
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME} \
57
- --num_train_epochs 1 \
58
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
- --per_device_eval_batch_size 4 \
60
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
- --save_strategy "steps" \
63
- --save_steps 500 \
64
- --save_total_limit 99 \
65
- --learning_rate 1e-3 \
66
- --weight_decay 0. \
67
- --warmup_ratio 0.03 \
68
- --lr_scheduler_type "cosine" \
69
- --logging_steps 1 \
70
- --model_max_length 2048 \
71
- --gradient_checkpointing True \
72
- --dataloader_num_workers 4 \
73
- --lazy_preprocess True \
74
- --report_to tensorboard \
75
- --run_name pretrain_$RUN_NAME \
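The environment-variable block at the top of each script prefers values already exported by the cluster launcher (`WORLD_SIZE`, `MASTER_ADDR`, `RANK`, ...) and only falls back to the script arguments / hard-coded defaults when a whole group is missing; `[ ! -n "$VAR" ]` is equivalent to `[ -z "$VAR" ]`, i.e. "the variable is unset or empty". A rough Python sketch of the same precedence, purely illustrative and not part of the codebase:

```python
import os

# Script-argument defaults (the ARG_* values in the bash scripts).
ARG_WORLD_SIZE, ARG_NPROC_PER_NODE = 1, 8
ARG_MASTER_ADDR, ARG_MASTER_PORT, ARG_RANK = "127.0.0.1", 16666, 0

def blank(name):
    # mirrors bash `[ ! -n "$VAR" ]`: true when the variable is unset or empty
    return os.environ.get(name, "") == ""

# Each group is checked as a whole: if any member is missing,
# the entire group falls back to the ARG_* defaults.
if blank("WORLD_SIZE") or blank("NPROC_PER_NODE"):
    WORLD_SIZE, NPROC_PER_NODE = ARG_WORLD_SIZE, ARG_NPROC_PER_NODE
else:
    WORLD_SIZE, NPROC_PER_NODE = int(os.environ["WORLD_SIZE"]), int(os.environ["NPROC_PER_NODE"])

if blank("MASTER_ADDR") or blank("MASTER_PORT") or blank("RANK"):
    MASTER_ADDR, MASTER_PORT, RANK = ARG_MASTER_ADDR, ARG_MASTER_PORT, ARG_RANK
else:
    MASTER_ADDR, MASTER_PORT, RANK = os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]), int(os.environ["RANK"])

print(WORLD_SIZE, NPROC_PER_NODE, MASTER_ADDR, MASTER_PORT, RANK)
```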
VideoLLaMA2/scripts/siglip/pretrain_qwen2.sh DELETED
@@ -1,75 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=256
26
- LOCAL_BATCH_SIZE=8
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
- echo $GRADIENT_ACCUMULATION_STEPS
29
-
30
- # Log Arguments
31
- export TRANSFORMERS_OFFLINE=1
32
- export WANDB_PROJECT=videollama2qwen2_siglip
33
- RUN_NAME=vllava_settings
34
- DATA_DIR=datasets
35
- OUTP_DIR=work_dirs
36
-
37
- torchrun --nnodes $WORLD_SIZE \
38
- --nproc_per_node $NPROC_PER_NODE \
39
- --master_addr=$MASTER_ADDR \
40
- --master_port=$MASTER_PORT \
41
- --node_rank $RANK \
42
- videollama2/train_flash_attn.py \
43
- --deepspeed scripts/zero3.json \
44
- --model_type videollama2_qwen2 \
45
- --model_path Qwen/Qwen2-7B-Instruct \
46
- --vision_tower google/siglip-so400m-patch14-384 \
47
- --mm_projector_type stc_connector_v35 \
48
- --tune_mm_mlp_adapter True \
49
- --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
50
- --data_folder ${DATA_DIR}/videollava_pt/ \
51
- --mm_vision_select_layer -2 \
52
- --num_frames 8 \
53
- --bf16 True \
54
- --tf32 True \
55
- --fp16 False \
56
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME} \
57
- --num_train_epochs 1 \
58
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
- --per_device_eval_batch_size 4 \
60
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
- --save_strategy "steps" \
63
- --save_steps 500 \
64
- --save_total_limit 99 \
65
- --learning_rate 1e-3 \
66
- --weight_decay 0. \
67
- --warmup_ratio 0.03 \
68
- --lr_scheduler_type "cosine" \
69
- --logging_steps 1 \
70
- --model_max_length 2048 \
71
- --gradient_checkpointing True \
72
- --dataloader_num_workers 4 \
73
- --lazy_preprocess True \
74
- --report_to tensorboard \
75
- --run_name $RUN_NAME \
VideoLLaMA2/scripts/vllava/finetune.sh CHANGED
@@ -5,7 +5,7 @@ ARG_WORLD_SIZE=${1:-1}
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
@@ -28,8 +28,8 @@ GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$L
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2
32
- RUN_NAME=vllava_settings
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
@@ -38,18 +38,18 @@ torchrun --nnodes $WORLD_SIZE \
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
  --deepspeed scripts/zero3.json \
43
- --model_type videollama2 \
44
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
45
- --vision_tower openai/clip-vit-large-patch14-336 \
46
- --mm_projector_type stc_connector \
47
  --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
48
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
49
  --data_folder ${DATA_DIR}/videollava_sft/ \
50
  --mm_vision_select_layer -2 \
51
  --image_aspect_ratio pad \
52
- --num_frames 8 \
53
  --bf16 True \
54
  --tf32 True \
55
  --fp16 False \
@@ -58,7 +58,6 @@ torchrun --nnodes $WORLD_SIZE \
58
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
  --per_device_eval_batch_size 4 \
60
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
  --save_strategy "steps" \
63
  --save_steps 500 \
64
  --save_total_limit 99 \
 
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
+ ARG_RANK=${3:-0}
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
 
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
+ export WANDB_PROJECT=videollama2qwen2_vllava
32
+ RUN_NAME=siglip_tcv35_7b_16f
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
 
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
+ videollama2/train.py \
42
  --deepspeed scripts/zero3.json \
43
+ --model_type videollama2_qwen2 \
44
+ --model_path Qwen/Qwen2-7B-Instruct \
45
+ --vision_tower google/siglip-so400m-patch14-384 \
46
+ --mm_projector_type stc_connector_v35 \
47
  --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
48
  --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
49
  --data_folder ${DATA_DIR}/videollava_sft/ \
50
  --mm_vision_select_layer -2 \
51
  --image_aspect_ratio pad \
52
+ --num_frames 16 \
53
  --bf16 True \
54
  --tf32 True \
55
  --fp16 False \
 
58
  --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
  --per_device_eval_batch_size 4 \
60
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
 
61
  --save_strategy "steps" \
62
  --save_steps 500 \
63
  --save_total_limit 99 \
VideoLLaMA2/scripts/vllava/finetune_qwen2.sh DELETED
@@ -1,74 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=128
26
- LOCAL_BATCH_SIZE=4
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
-
29
- # Log Arguments
30
- export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2qwen2
32
- RUN_NAME=vllava_settings
33
- DATA_DIR=datasets
34
- OUTP_DIR=work_dirs
35
-
36
- torchrun --nnodes $WORLD_SIZE \
37
- --nproc_per_node $NPROC_PER_NODE \
38
- --master_addr=$MASTER_ADDR \
39
- --master_port=$MASTER_PORT \
40
- --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
- --deepspeed scripts/zero3.json \
43
- --model_type videollama2_qwen2 \
44
- --model_path Qwen/Qwen2-7B-Instruct \
45
- --vision_tower openai/clip-vit-large-patch14-336 \
46
- --mm_projector_type stc_connector \
47
- --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
48
- --data_path ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
49
- --data_folder ${DATA_DIR}/videollava_sft/ \
50
- --mm_vision_select_layer -2 \
51
- --image_aspect_ratio pad \
52
- --num_frames 8 \
53
- --bf16 True \
54
- --tf32 True \
55
- --fp16 False \
56
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
57
- --num_train_epochs 1 \
58
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
59
- --per_device_eval_batch_size 4 \
60
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
61
- --evaluation_strategy "no" \
62
- --save_strategy "steps" \
63
- --save_steps 500 \
64
- --save_total_limit 99 \
65
- --learning_rate 2e-5 \
66
- --weight_decay 0. \
67
- --warmup_ratio 0.03 \
68
- --lr_scheduler_type "cosine" \
69
- --logging_steps 1 \
70
- --model_max_length 2048 \
71
- --gradient_checkpointing True \
72
- --dataloader_num_workers 4 \
73
- --report_to tensorboard \
74
- --run_name $RUN_NAME \
VideoLLaMA2/scripts/vllava/pretrain.sh CHANGED
@@ -5,7 +5,7 @@ ARG_WORLD_SIZE=${1:-1}
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
@@ -28,8 +28,8 @@ GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$L
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2
32
- RUN_NAME=vllava_settings
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
@@ -38,17 +38,17 @@ torchrun --nnodes $WORLD_SIZE \
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
  --deepspeed scripts/zero3.json \
43
- --model_type videollama2 \
44
- --model_path mistralai/Mistral-7B-Instruct-v0.2 \
45
- --vision_tower openai/clip-vit-large-patch14-336 \
46
- --mm_projector_type stc_connector \
47
  --tune_mm_mlp_adapter True \
48
  --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
49
  --data_folder ${DATA_DIR}/videollava_pt/ \
50
  --mm_vision_select_layer -2 \
51
- --num_frames 8 \
52
  --bf16 True \
53
  --tf32 True \
54
  --fp16 False \
@@ -69,6 +69,5 @@ torchrun --nnodes $WORLD_SIZE \
69
  --model_max_length 2048 \
70
  --gradient_checkpointing True \
71
  --dataloader_num_workers 4 \
72
- --lazy_preprocess True \
73
  --report_to tensorboard \
74
  --run_name $RUN_NAME \
 
5
  ARG_NPROC_PER_NODE=${2:-8}
6
  ARG_MASTER_ADDR="127.0.0.1"
7
  ARG_MASTER_PORT=16666
8
+ ARG_RANK=${3:-0}
9
 
10
  # Multiple conditions
11
  if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
 
28
 
29
  # Log Arguments
30
  export TRANSFORMERS_OFFLINE=1
31
+ export WANDB_PROJECT=videollama2qwen2_vllava
32
+ RUN_NAME=siglip_tcv35_7b_16f
33
  DATA_DIR=datasets
34
  OUTP_DIR=work_dirs
35
 
 
38
  --master_addr=$MASTER_ADDR \
39
  --master_port=$MASTER_PORT \
40
  --node_rank $RANK \
41
+ videollama2/train.py \
42
  --deepspeed scripts/zero3.json \
43
+ --model_type videollama2_qwen2 \
44
+ --model_path Qwen/Qwen2-7B-Instruct \
45
+ --vision_tower google/siglip-so400m-patch14-384 \
46
+ --mm_projector_type stc_connector_v35 \
47
  --tune_mm_mlp_adapter True \
48
  --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
49
  --data_folder ${DATA_DIR}/videollava_pt/ \
50
  --mm_vision_select_layer -2 \
51
+ --num_frames 16 \
52
  --bf16 True \
53
  --tf32 True \
54
  --fp16 False \
 
69
  --model_max_length 2048 \
70
  --gradient_checkpointing True \
71
  --dataloader_num_workers 4 \
 
72
  --report_to tensorboard \
73
  --run_name $RUN_NAME \
VideoLLaMA2/scripts/vllava/pretrain_qwen2.sh DELETED
@@ -1,74 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Environment Variables
4
- ARG_WORLD_SIZE=${1:-1}
5
- ARG_NPROC_PER_NODE=${2:-8}
6
- ARG_MASTER_ADDR="127.0.0.1"
7
- ARG_MASTER_PORT=16666
8
- ARG_RANK=0
9
-
10
- # Multiple conditions
11
- if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
12
- WORLD_SIZE=$ARG_WORLD_SIZE
13
- NPROC_PER_NODE=$ARG_NPROC_PER_NODE
14
- fi
15
- if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
16
- MASTER_ADDR=$ARG_MASTER_ADDR
17
- MASTER_PORT=$ARG_MASTER_PORT
18
- RANK=$ARG_RANK
19
- fi
20
-
21
- echo "WORLD_SIZE: $WORLD_SIZE"
22
- echo "NPROC_PER_NODE: $NPROC_PER_NODE"
23
-
24
- # Training Arguments
25
- GLOBAL_BATCH_SIZE=256
26
- LOCAL_BATCH_SIZE=8
27
- GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
28
-
29
- # Log Arguments
30
- export TRANSFORMERS_OFFLINE=1
31
- export WANDB_PROJECT=videollama2qwen2
32
- RUN_NAME=vllava_settings
33
- DATA_DIR=datasets
34
- OUTP_DIR=work_dirs
35
-
36
- torchrun --nnodes $WORLD_SIZE \
37
- --nproc_per_node $NPROC_PER_NODE \
38
- --master_addr=$MASTER_ADDR \
39
- --master_port=$MASTER_PORT \
40
- --node_rank $RANK \
41
- videollama2/train_flash_attn.py \
42
- --deepspeed scripts/zero3.json \
43
- --model_type videollama2_qwen2 \
44
- --model_path Qwen/Qwen2-7B-Instruct \
45
- --vision_tower openai/clip-vit-large-patch14-336 \
46
- --mm_projector_type stc_connector \
47
- --tune_mm_mlp_adapter True \
48
- --data_path ${DATA_DIR}/videollava_pt/valley_llavaimage.json \
49
- --data_folder ${DATA_DIR}/videollava_pt/ \
50
- --mm_vision_select_layer -2 \
51
- --num_frames 8 \
52
- --bf16 True \
53
- --tf32 True \
54
- --fp16 False \
55
- --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME} \
56
- --num_train_epochs 1 \
57
- --per_device_train_batch_size $LOCAL_BATCH_SIZE \
58
- --per_device_eval_batch_size 4 \
59
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
60
- --evaluation_strategy "no" \
61
- --save_strategy "steps" \
62
- --save_steps 500 \
63
- --save_total_limit 99 \
64
- --learning_rate 1e-3 \
65
- --weight_decay 0. \
66
- --warmup_ratio 0.03 \
67
- --lr_scheduler_type "cosine" \
68
- --logging_steps 1 \
69
- --model_max_length 2048 \
70
- --gradient_checkpointing True \
71
- --dataloader_num_workers 4 \
72
- --lazy_preprocess True \
73
- --report_to tensorboard \
74
- --run_name $RUN_NAME \
VideoLLaMA2/videollama2/__init__.py CHANGED
@@ -58,7 +58,7 @@ def mm_infer(image_or_video, instruct, model, tokenizer, modal='video', **kwargs
58
  tensor = None
59
  else:
60
  tensor = image_or_video.half().cuda()
61
- tensor = [(tensor, modal_token)]
62
 
63
  # 2. text preprocess (tag process & generate prompt).
64
  if isinstance(instruct, str):
@@ -93,7 +93,7 @@ def mm_infer(image_or_video, instruct, model, tokenizer, modal='video', **kwargs
93
  do_sample = kwargs.get('do_sample', False)
94
  temperature = kwargs.get('temperature', 0.2 if do_sample else 0.0)
95
  top_p = kwargs.get('top_p', 0.9)
96
- max_new_tokens = kwargs.get('max_new_tokens', 1024)
97
 
98
  with torch.inference_mode():
99
  output_ids = model.generate(
 
58
  tensor = None
59
  else:
60
  tensor = image_or_video.half().cuda()
61
+ tensor = [(tensor, modal)]
62
 
63
  # 2. text preprocess (tag process & generate prompt).
64
  if isinstance(instruct, str):
 
93
  do_sample = kwargs.get('do_sample', False)
94
  temperature = kwargs.get('temperature', 0.2 if do_sample else 0.0)
95
  top_p = kwargs.get('top_p', 0.9)
96
+ max_new_tokens = kwargs.get('max_new_tokens', 2048)
97
 
98
  with torch.inference_mode():
99
  output_ids = model.generate(
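With the change above, callers of `mm_infer` pass only the raw modal name (`'image'` / `'video'`); the function now pairs the preprocessed tensor with that modal itself, and the default generation budget grows to 2048 new tokens. A minimal usage sketch in the style of the eval scripts; the checkpoint id and video path are placeholders, and `model_init` is assumed to return `(model, processor, tokenizer)` as it is used elsewhere in this repo:

```python
from videollama2 import model_init, mm_infer

# placeholder checkpoint and clip; substitute your own
model, processor, tokenizer = model_init("DAMO-NLP-SG/VideoLLaMA2-7B")
video_tensor = processor["video"]("assets/sample_demo_1.mp4")

answer = mm_infer(
    video_tensor,
    "Describe what happens in this video.",
    model=model,
    tokenizer=tokenizer,
    modal="video",      # mm_infer internally builds [(tensor, "video")]
    do_sample=False,    # greedy decoding; max_new_tokens now defaults to 2048
)
print(answer)
```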
VideoLLaMA2/videollama2/eval/inference_video_cap_msvc.py CHANGED
@@ -5,6 +5,8 @@ import json
5
  import warnings
6
  from tqdm import tqdm
7
 
 
 
8
  import sys
9
  sys.path.append('./')
10
  from videollama2 import model_init, mm_infer
@@ -25,6 +27,44 @@ def get_chunk(lst, n, k):
25
  return chunks[k]
26
 
27
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
  def run_inference(args):
29
  disable_torch_init()
30
 
@@ -37,16 +77,16 @@ def run_inference(args):
37
  os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
38
  ans_file = open(answer_file, "w")
39
 
40
- video_formats = ['.mp4', '.avi', '.mov', '.mkv']
 
 
41
 
42
  # Iterate over each sample in the ground truth file
43
- for idx, sample in enumerate(tqdm(gt_questions)):
44
- video_name = sample['video_path']
45
- question = sample['question']
46
- answer = sample['captions']
47
-
48
- video_path = os.path.join(args.video_folder, video_name)
49
- video_tensor = processor['video'](video_path)
50
 
51
  output = mm_infer(
52
  video_tensor,
@@ -73,6 +113,8 @@ if __name__ == "__main__":
73
  parser.add_argument("--num-chunks", type=int, default=1)
74
  parser.add_argument("--chunk-idx", type=int, default=0)
75
  parser.add_argument("--device", type=str, required=False, default='cuda:0')
 
 
76
  args = parser.parse_args()
77
 
78
  run_inference(args)
 
5
  import warnings
6
  from tqdm import tqdm
7
 
8
+ from torch.utils.data import Dataset, DataLoader
9
+
10
  import sys
11
  sys.path.append('./')
12
  from videollama2 import model_init, mm_infer
 
27
  return chunks[k]
28
 
29
 
30
+ class MSVCDataset(Dataset):
31
+
32
+ video_formats = ['.mp4', '.webm', '.avi', '.mov', '.mkv']
33
+
34
+ def __init__(self, folder, questions, processor):
35
+ self.folder = folder
36
+ self.questions = questions
37
+ self.processor = processor
38
+
39
+ def __len__(self):
40
+ return len(self.questions)
41
+
42
+ def __getitem__(self, idx):
43
+ sample = self.questions[idx]
44
+
45
+ video_name = sample['video_path']
46
+ question = sample['question']
47
+ answer = sample['captions']
48
+
49
+ video_path = os.path.join(self.folder, video_name)
50
+ video_tensor = self.processor(video_path)
51
+
52
+ return {
53
+ 'video': video_tensor,
54
+ 'video_name': video_name,
55
+ 'question': question,
56
+ 'answer': answer,
57
+ }
58
+
59
+
60
+ def collate_fn(batch):
61
+ vid = [x['video'] for x in batch]
62
+ v_id = [x['video_name'] for x in batch]
63
+ qus = [x['question'] for x in batch]
64
+ ans = [x['answer'] for x in batch]
65
+ return vid, v_id, qus, ans
66
+
67
+
68
  def run_inference(args):
69
  disable_torch_init()
70
 
 
77
  os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
78
  ans_file = open(answer_file, "w")
79
 
80
+ assert args.batch_size == 1, "Batch size must be 1 for inference"
81
+ dataset = MSVCDataset(args.video_folder, gt_questions, processor['video'])
82
+ dataloader = DataLoader(dataset, shuffle=False, batch_size=args.batch_size, num_workers=args.num_workers, collate_fn=collate_fn)
83
 
84
  # Iterate over each sample in the ground truth file
85
+ for idx, (video_tensors, video_names, questions, answers) in enumerate(tqdm(dataloader)):
86
+ video_tensor = video_tensors[0]
87
+ video_name = video_names[0]
88
+ question = questions[0]
89
+ answer = answers[0]
 
 
90
 
91
  output = mm_infer(
92
  video_tensor,
 
113
  parser.add_argument("--num-chunks", type=int, default=1)
114
  parser.add_argument("--chunk-idx", type=int, default=0)
115
  parser.add_argument("--device", type=str, required=False, default='cuda:0')
116
+ parser.add_argument("--batch-size", type=int, required=False, default=1)
117
+ parser.add_argument("--num-workers", type=int, required=False, default=8)
118
  args = parser.parse_args()
119
 
120
  run_inference(args)
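The rewritten MSVC script moves video loading into a `Dataset` so that `DataLoader` workers decode and preprocess clips in the background while the GPU runs inference; the identity-style `collate_fn` keeps samples as plain lists because clips of different lengths cannot be stacked into a single tensor, which is also why `batch_size` is asserted to be 1. A self-contained toy sketch of that pattern (synthetic tensors, not MSVC data):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyClips(Dataset):
    """Stand-in for MSVCDataset: items may have different frame counts."""
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return {"video": torch.randn(8 + idx, 3, 384, 384), "question": f"question {idx}"}

def collate_fn(batch):
    # keep per-sample tensors in lists instead of stacking mismatched shapes
    return [x["video"] for x in batch], [x["question"] for x in batch]

if __name__ == "__main__":
    loader = DataLoader(ToyClips(), batch_size=1, shuffle=False,
                        num_workers=2, collate_fn=collate_fn)
    for videos, questions in loader:
        video, question = videos[0], questions[0]   # batch_size == 1
        print(video.shape, question)
```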
VideoLLaMA2/videollama2/eval/inference_video_mcqa_egoschema.py CHANGED
@@ -62,7 +62,7 @@ class EgoschemaDataset(Dataset):
62
  axs = [a0, a1, a2, a3, a4]
63
  ops = ['(A)', '(B)', '(C)', '(D)', '(E)']
64
 
65
- instruct = f'Question: {question}\nOptions:\n(A) {a0}\n(B) {a1}\n(C) {a2}\n(D) {a3}\n(E) {a4}\nAnswer with the option\'s letter from the given choices directly and only give the best option.'
66
 
67
  return {
68
  'q_uid': q_uid,
@@ -90,7 +90,8 @@ def egoschema_dump(ans_file, line, outputs):
90
  output = output.replace('Answer', '')
91
  pred_answer = re.findall('[\(\ ]*[A-E][\)\ ]*', output)
92
  try:
93
- assert len(pred_answer) >= 1, 'The video \"{}\" output \"{}\" is not in the expected format'.format(line['q_uid'], instruct + '\n' + output)
 
94
  pred_answer = pred_answer[0].strip()
95
  pred_answer = pred_answer.strip('()')
96
  pred_idx = letters.index(pred_answer)
@@ -117,14 +118,18 @@ def run_inference(args):
117
  video_tensor = line['video'][0]
118
  instruct = line['instruct'][0]
119
 
120
- pred = mm_infer(
121
- video_tensor,
122
- instruct,
123
- model=model,
124
- tokenizer=tokenizer,
125
- modal='video',
126
- do_sample=False,
127
- )
 
 
 
 
128
 
129
  egoschema_dump(ans_file, line, [pred])
130
 
 
62
  axs = [a0, a1, a2, a3, a4]
63
  ops = ['(A)', '(B)', '(C)', '(D)', '(E)']
64
 
65
+ instruct = f'Select the best answer to the following multiple-choice question based on the video.\n{question}\nOptions:\n(A) {a0}\n(B) {a1}\n(C) {a2}\n(D) {a3}\n(E) {a4}\nAnswer with the option\'s letter from the given choices directly and only give the best option. The best answer is: '
66
 
67
  return {
68
  'q_uid': q_uid,
 
90
  output = output.replace('Answer', '')
91
  pred_answer = re.findall('[\(\ ]*[A-E][\)\ ]*', output)
92
  try:
93
+
94
+ assert len(pred_answer) >= 1, 'The video \"{}\" instruct: \n\"{}\"\n output: \n\"{}\"\n is not in the expected format'.format(line['q_uid'], instruct, output)
95
  pred_answer = pred_answer[0].strip()
96
  pred_answer = pred_answer.strip('()')
97
  pred_idx = letters.index(pred_answer)
 
118
  video_tensor = line['video'][0]
119
  instruct = line['instruct'][0]
120
 
121
+ try:
122
+ pred = mm_infer(
123
+ video_tensor,
124
+ instruct,
125
+ model=model,
126
+ tokenizer=tokenizer,
127
+ modal='video',
128
+ do_sample=False,
129
+ )
130
+ except:
131
+ traceback.print_exc()
132
+ pred = 'C'
133
 
134
  egoschema_dump(ans_file, line, [pred])
135
 
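The EgoSchema answer parser above pulls the option letter out of free-form generations with a small regex and now falls back to option C whenever inference or parsing fails. A quick illustration of the extraction step on a made-up model output:

```python
import re

letters = ['A', 'B', 'C', 'D', 'E']
output = "(B) The person is assembling a bookshelf."   # hypothetical model output

matches = re.findall(r'[\(\ ]*[A-E][\)\ ]*', output)
pred = matches[0].strip().strip('()') if matches else 'C'  # 'C' mirrors the script's fallback
print(pred, letters.index(pred))                           # -> B 1
```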
VideoLLaMA2/videollama2/eval/inference_video_mcqa_mvbench.py CHANGED
@@ -141,7 +141,7 @@ def mvbench_dump(vid, instruct, letters, options, output):
141
  pred_idx = letters.index(pred_answer)
142
  find_flag = True
143
 
144
- assert find_flag, 'The video \"{}\" output: \n\"{}\" is not in the expected format'.format(vid, instruct + '\n' + output)
145
  except:
146
  traceback.print_exc()
147
  pred_idx = 2
 
141
  pred_idx = letters.index(pred_answer)
142
  find_flag = True
143
 
144
+ assert find_flag, 'The video \"{}\" instruct: \n\"{}\"\n output: \n\"{}\"\n is not in the expected format'.format(vid, instruct, output)
145
  except:
146
  traceback.print_exc()
147
  pred_idx = 2
VideoLLaMA2/videollama2/eval/inference_video_mcqa_perception_test_mcqa.py CHANGED
@@ -129,7 +129,7 @@ def run_inference(args):
129
  output = output.replace('Answer', '')
130
  pred_answer = re.findall('\(*[A-C]\)*', output)
131
  try:
132
- assert len(pred_answer) >= 1, 'The video \"{}\" output \"{}\" is not in the expected format'.format(video_id, instruct + '\n' + output)
133
  pred_answer = pred_answer[0].strip()
134
  # if not pred_answer.startswith('('):
135
  pred_answer = pred_answer.strip('()')
 
129
  output = output.replace('Answer', '')
130
  pred_answer = re.findall('\(*[A-C]\)*', output)
131
  try:
132
+ assert len(pred_answer) >= 1, 'The video \"{}\" instruct: \n\"{}\"\n output: \n\"{}\"\n is not in the expected format'.format(video_id, instruct, output)
133
  pred_answer = pred_answer[0].strip()
134
  # if not pred_answer.startswith('('):
135
  pred_answer = pred_answer.strip('()')
VideoLLaMA2/videollama2/eval/inference_video_mcqa_videomme.py CHANGED
@@ -219,7 +219,7 @@ def videomme_dump(record, instruct, options, output):
219
  pred_idx = letters.index(pred_answer)
220
  find_flag = True
221
 
222
- assert find_flag, 'The video \"{}\" output: \n\"{}\" is not in the expected format'.format(record['youtube_id'], instruct + '\n' + output)
223
  except:
224
  traceback.print_exc()
225
  pred_idx = 2
 
219
  pred_idx = letters.index(pred_answer)
220
  find_flag = True
221
 
222
+ assert find_flag, 'The video \"{}\" instruct: \n\"{}\"\n output: \n\"{}\"\n is not in the expected format'.format(record['youtube_id'], instruct, output)
223
  except:
224
  traceback.print_exc()
225
  pred_idx = 2
VideoLLaMA2/videollama2/eval/inference_video_oqa_activitynet.py CHANGED
@@ -49,6 +49,7 @@ class ActivitynetDataset(Dataset):
49
  question_id = sample['question_id']
50
  answer = answer['answer']
51
 
 
52
  for fmt in self.video_formats: # Added this line
53
  temp_path = os.path.join(args.video_folder, f"v_{video_name}{fmt}")
54
  if os.path.exists(temp_path):
@@ -60,6 +61,9 @@ class ActivitynetDataset(Dataset):
60
  video_path = temp_path
61
  break
62
 
 
 
 
63
  video_tensor = self.processor(video_path)
64
 
65
  return {
@@ -109,14 +113,18 @@ def run_inference(args):
109
 
110
  # question = question + '\n' + 'Answer the question using a single word or a short phrase with multiple words.'
111
 
112
- output = mm_infer(
113
- video_tensor,
114
- question,
115
- model=model,
116
- tokenizer=tokenizer,
117
- modal='video',
118
- do_sample=False,
119
- )
 
 
 
 
120
 
121
  sample_set = {'id': question_id, 'question': question, 'answer': answer, 'pred': output}
122
  ans_file.write(json.dumps(sample_set) + "\n")
 
49
  question_id = sample['question_id']
50
  answer = answer['answer']
51
 
52
+ video_path = None
53
  for fmt in self.video_formats: # Added this line
54
  temp_path = os.path.join(args.video_folder, f"v_{video_name}{fmt}")
55
  if os.path.exists(temp_path):
 
61
  video_path = temp_path
62
  break
63
 
64
+ if video_path is None:
65
+ raise FileNotFoundError(f"Video file not found for {os.path.join(args.video_folder, video_name)}")
66
+
67
  video_tensor = self.processor(video_path)
68
 
69
  return {
 
113
 
114
  # question = question + '\n' + 'Answer the question using a single word or a short phrase with multiple words.'
115
 
116
+ try:
117
+ output = mm_infer(
118
+ video_tensor,
119
+ question,
120
+ model=model,
121
+ tokenizer=tokenizer,
122
+ modal='video',
123
+ do_sample=False,
124
+ )
125
+ except:
126
+ traceback.print_exc()
127
+ output = "error"
128
 
129
  sample_set = {'id': question_id, 'question': question, 'answer': answer, 'pred': output}
130
  ans_file.write(json.dumps(sample_set) + "\n")
VideoLLaMA2/videollama2/mm_utils.py CHANGED
@@ -5,6 +5,7 @@ import base64
5
  import traceback
6
  from io import BytesIO
7
 
 
8
  import torch
9
  import imageio
10
  import numpy as np
@@ -172,7 +173,7 @@ def process_video(video_path, processor, s=None, e=None, aspect_ratio='pad', num
172
  if os.path.isdir(video_path):
173
  video_data = [Image.open(os.path.join(video_path, frame_files[f_idx])) for f_idx in sampled_frame_indices]
174
  elif video_path.endswith('.gif'):
175
- video_data = [Image.fromarray(frame) for idx, frame in enumerate(gif_reader) if idx in sampled_frame_indices]
176
  else:
177
  video_data = [Image.fromarray(frame) for frame in vreader.get_batch(sampled_frame_indices).asnumpy()]
178
 
 
5
  import traceback
6
  from io import BytesIO
7
 
8
+ import cv2
9
  import torch
10
  import imageio
11
  import numpy as np
 
173
  if os.path.isdir(video_path):
174
  video_data = [Image.open(os.path.join(video_path, frame_files[f_idx])) for f_idx in sampled_frame_indices]
175
  elif video_path.endswith('.gif'):
176
+ video_data = [Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB)) for idx, frame in enumerate(gif_reader) if idx in sampled_frame_indices]
177
  else:
178
  video_data = [Image.fromarray(frame) for frame in vreader.get_batch(sampled_frame_indices).asnumpy()]
179
 
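The new `cv2` import and `cvtColor` call exist because `imageio` returns 4-channel RGBA arrays for many GIFs, which the CLIP/SigLIP image processors do not expect; dropping the alpha channel up front keeps the rest of the frame pipeline unchanged. A small stand-alone sketch of the conversion (the GIF path is a placeholder, and the frame is only converted when it actually carries an alpha channel):

```python
import cv2
import imageio
from PIL import Image

reader = imageio.get_reader("assets/sample.gif")     # placeholder path
frame = reader.get_data(0)                           # often H x W x 4 (RGBA) for GIFs

if frame.ndim == 3 and frame.shape[-1] == 4:
    frame = cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB)  # drop alpha -> H x W x 3

image = Image.fromarray(frame)
print(image.size, image.mode)
```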
VideoLLaMA2/videollama2/model/__init__.py CHANGED
@@ -22,12 +22,10 @@ import torch
22
  from transformers import PretrainedConfig, AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
23
 
24
  from .projector import load_mm_projector
25
- from .videollama2_llama import Videollama2LlamaForCausalLM, Videollama2Config
26
  from .videollama2_mistral import Videollama2MistralForCausalLM, Videollama2MistralConfig
27
  from .videollama2_mixtral import Videollama2MixtralForCausalLM, Videollama2MixtralConfig
28
  from .videollama2_qwen2 import Videollama2Qwen2ForCausalLM, Videollama2Qwen2Config
29
- from .videollama2_gemma2 import Videollama2Gemma2ForCausalLM, Videollama2Gemma2Config
30
- from .videollama2_phi3 import Videollama2Phi3ForCausalLM, Videollama2Phi3Config
31
 
32
 
33
  VLLMs = {
@@ -36,8 +34,14 @@ VLLMs = {
36
  "videollama2_mistral": Videollama2MistralForCausalLM,
37
  "videollama2_mixtral": Videollama2MixtralForCausalLM,
38
  "videollama2_qwen2": Videollama2Qwen2ForCausalLM,
39
- "videollama2_gemma2": Videollama2Gemma2ForCausalLM,
40
- "videollama2_phi3": Videollama2Phi3ForCausalLM,
 
 
 
 
 
 
41
  }
42
 
43
 
@@ -69,137 +73,112 @@ def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, l
69
  if use_flash_attn:
70
  kwargs['attn_implementation'] = 'flash_attention_2'
71
 
72
- if "videollama" in model_name.lower() or 'vlb' in model_name.lower():
73
- # NOTE: lora/qlora model loading
74
- if 'lora' in model_name.lower() or 'qlora' in model_name.lower():
75
- if model_base is None:
76
- cfg_pretrained = PretrainedConfig.from_pretrained(model_path, token=token)
77
- # NOTE: AutoConfig will modify `_name_or_path` property to `model_path` if `model_path` is not None.
78
- # cfg_pretrained = AutoConfig.from_pretrained(model_path, token=token)
79
- model_base = model_base if model_base is not None else cfg_pretrained._name_or_path
80
-
81
- lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
82
- # NOTE: remove qlora training quantization config
83
- if hasattr(lora_cfg_pretrained, 'quantization_config'):
84
- del lora_cfg_pretrained.quantization_config
85
- tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False, token=token)
86
- print('Loading VideoLLaMA from base model...')
87
-
88
- if 'vicuna' in model_base.lower():
89
- model = Videollama2LlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
90
- elif 'mistral' in model_base.lower():
91
- model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
92
- else:
93
- model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
94
-
95
- token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
96
- if model.lm_head.weight.shape[0] != token_num:
97
- model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
98
- model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
99
-
100
- print('Loading additional VideoLLaMA weights...')
101
- if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
102
- non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
103
- else:
104
- # this is probably from HF Hub
105
- from huggingface_hub import hf_hub_download
106
- def load_from_hf(repo_id, filename, subfolder=None):
107
- cache_file = hf_hub_download(
108
- repo_id=repo_id,
109
- filename=filename,
110
- subfolder=subfolder)
111
- return torch.load(cache_file, map_location='cpu')
112
- non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
113
- non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
114
- if any(k.startswith('model.model.') for k in non_lora_trainables):
115
- non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
116
- model.load_state_dict(non_lora_trainables, strict=False)
117
-
118
- from peft import PeftModel
119
- print('Loading LoRA weights...')
120
- model = PeftModel.from_pretrained(model, model_path)
121
- print('Merging LoRA weights...')
122
- model = model.merge_and_unload()
123
- print('Model is loaded...')
124
- elif model_base is not None or '-base' in model_name.lower():
125
- # NOTE: Base/Pretrain model loading
126
- print('Loading VideoLLaMA 2 from base model...')
127
- cfg_pretrained = PretrainedConfig.from_pretrained(model_path, token=token)
128
- # NOTE: AutoConfig will modify `_name_or_path` property to `model_path` if `model_path` is not None.
129
- # cfg_pretrained = AutoConfig.from_pretrained(model_path, token=token)
130
- model_base = model_base if model_base is not None else cfg_pretrained._name_or_path
131
-
132
- tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False, token=token)
133
-
134
- if 'vicuna' in model_base.lower():
135
- model = Videollama2LlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
136
- elif 'mistral' in model_base.lower():
137
- model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
138
- elif 'mixtral' in model_base.lower():
139
- model = Videollama2MixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
140
- elif 'qwen2' in model_base.lower():
141
- model = Videollama2Qwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
142
- elif 'gemma2' in model_base.lower():
143
- model = Videollama2Gemma2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
144
- elif 'phi3' in model_base.lower():
145
- model = Videollama2Phi3ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
146
- else:
147
- model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
148
-
149
- # NOTE; loading vision-language projector
150
- # * old codes for loading local mm_projector.bin
151
- # mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
152
- # mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
153
- # model.load_state_dict(mm_projector_weights, strict=False)
154
- # * new codes which supports loading mm_projector.bin both offline and online
155
- mm_projector_weights = load_mm_projector(model_path, token=token)
156
- model.load_state_dict(mm_projector_weights, strict=False)
157
  else:
158
- # NOTE: SFT model loading
159
- cfg_pretrained = PretrainedConfig.from_pretrained(model_path, token=token)
160
- model_base = cfg_pretrained._name_or_path
161
-
162
- if 'vicuna' in model_base.lower():
163
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
164
- model = Videollama2LlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
165
- elif 'mistral' in model_base.lower():
166
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
167
- model = Videollama2MistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
168
- elif 'mixtral' in model_base.lower():
169
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
170
- model = Videollama2MixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
171
- elif 'qwen2' in model_base.lower():
172
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
173
- model = Videollama2Qwen2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
174
- elif 'gemma2' in model_base.lower():
175
- model = Videollama2Gemma2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
176
- elif 'phi3' in model_base.lower():
177
- model = Videollama2Phi3ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
178
- else:
179
- # NOTE: mistral-based model is our default model.
180
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
181
- model = Videollama2MistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
182
- else:
183
- # Load language model
184
- if model_base is not None:
185
- # PEFT model
186
- from peft import PeftModel
187
- tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
188
- model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs)
189
- print(f"Loading LoRA weights from {model_path}")
190
- model = PeftModel.from_pretrained(model, model_path)
191
- print(f"Merging weights")
192
- model = model.merge_and_unload()
193
- print('Convert to FP16...')
194
- model.to(torch.float16)
 
 
 
 
 
 
 
 
 
 
195
  else:
196
- use_fast = False
197
- tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
198
- model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
199
 
200
  processor = None
201
 
202
- if "videollama" in model_name.lower() or 'vlb' in model_name.lower():
203
  vision_tower = model.get_vision_tower()
204
  if not vision_tower.is_loaded:
205
  vision_tower.load_model()
 
22
  from transformers import PretrainedConfig, AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
23
 
24
  from .projector import load_mm_projector
25
+ from .videollama2_llama import Videollama2LlamaForCausalLM, Videollama2LlamaConfig
26
  from .videollama2_mistral import Videollama2MistralForCausalLM, Videollama2MistralConfig
27
  from .videollama2_mixtral import Videollama2MixtralForCausalLM, Videollama2MixtralConfig
28
  from .videollama2_qwen2 import Videollama2Qwen2ForCausalLM, Videollama2Qwen2Config
 
 
29
 
30
 
31
  VLLMs = {
 
34
  "videollama2_mistral": Videollama2MistralForCausalLM,
35
  "videollama2_mixtral": Videollama2MixtralForCausalLM,
36
  "videollama2_qwen2": Videollama2Qwen2ForCausalLM,
37
+ }
38
+
39
+ VLLMConfigs = {
40
+ "videollama2": Videollama2MistralConfig,
41
+ "videollama2_llama": Videollama2LlamaConfig,
42
+ "videollama2_mistral": Videollama2MistralConfig,
43
+ "videollama2_mixtral": Videollama2MixtralConfig,
44
+ "videollama2_qwen2": Videollama2Qwen2Config,
45
  }
46
 
47
 
 
73
  if use_flash_attn:
74
  kwargs['attn_implementation'] = 'flash_attention_2'
75
 
76
+ config = AutoConfig.from_pretrained(model_path)
77
+
78
+ # judge model type
79
+ model_type = config.model_type
80
+
81
+ # judge pretrain/finetune
82
+ try:
83
+ is_pretraining = config.tune_mm_mlp_adapter
84
+ except:
85
+ is_pretraining = False
86
+
87
+ # NOTE: lora/qlora model loading
88
+ if 'lora' in model_name.lower() or 'qlora' in model_name.lower():
89
+ cfg_pretrained = PretrainedConfig.from_pretrained(model_path, token=token)
90
+ # NOTE: AutoConfig will modify `_name_or_path` property to `model_path` if `model_path` is not None.
91
+ # cfg_pretrained = AutoConfig.from_pretrained(model_path, token=token)
92
+ model_base = model_base if model_base is not None else cfg_pretrained._name_or_path
93
+
94
+ # NOTE: remove qlora training quantization config
95
+ if hasattr(config, 'quantization_config'):
96
+ del config.quantization_config
97
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False, token=token)
98
+ print('Loading VideoLLaMA lora model...')
99
+
100
+ if 'vicuna' in model_base.lower():
101
+ model = Videollama2LlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
102
+ elif 'mistral' in model_base.lower():
103
+ model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
  else:
105
+ model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
106
+
107
+ token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
108
+ if model.lm_head.weight.shape[0] != token_num:
109
+ model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
110
+ model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
111
+
112
+ print('Loading additional VideoLLaMA weights...')
113
+ if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
114
+ non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
115
+ else:
116
+ # this is probably from HF Hub
117
+ from huggingface_hub import hf_hub_download
118
+ def load_from_hf(repo_id, filename, subfolder=None):
119
+ cache_file = hf_hub_download(
120
+ repo_id=repo_id,
121
+ filename=filename,
122
+ subfolder=subfolder)
123
+ return torch.load(cache_file, map_location='cpu')
124
+ non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
125
+ non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
126
+ if any(k.startswith('model.model.') for k in non_lora_trainables):
127
+ non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
128
+ model.load_state_dict(non_lora_trainables, strict=False)
129
+
130
+ from peft import PeftModel
131
+ print('Loading LoRA weights...')
132
+ model = PeftModel.from_pretrained(model, model_path)
133
+ print('Merging LoRA weights...')
134
+ model = model.merge_and_unload()
135
+ print('Model is loaded...')
136
+ elif model_base is not None or is_pretraining:
137
+ # NOTE: Base/Pretrain model loading
138
+ print('Loading VideoLLaMA 2 from base model...')
139
+ cfg_pretrained = PretrainedConfig.from_pretrained(model_path, token=token)
140
+ # NOTE: AutoConfig will modify `_name_or_path` property to `model_path` if `model_path` is not None.
141
+ # cfg_pretrained = AutoConfig.from_pretrained(model_path, token=token)
142
+ model_base = model_base if model_base is not None else cfg_pretrained._name_or_path
143
+
144
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False, token=token)
145
+
146
+ if model_type in ['videollama2', 'videollama2_mistral']:
147
+ model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
148
+ elif model_type in ['videollama2_mixtral']:
149
+ model = Videollama2MixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
150
+ elif model_type in ['videollama2_qwen2']:
151
+ model = Videollama2Qwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
152
  else:
153
+ model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=config, **kwargs)
154
+
155
+ # NOTE; loading vision-language projector
156
+ # * old codes for loading local mm_projector.bin
157
+ # mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
158
+ # mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
159
+ # model.load_state_dict(mm_projector_weights, strict=False)
160
+ # * new codes which supports loading mm_projector.bin both offline and online
161
+ mm_projector_weights = load_mm_projector(model_path, token=token)
162
+ model.load_state_dict(mm_projector_weights, strict=False)
163
+ elif 'videollama2' in model_type:
164
+ # NOTE: SFT model loading
165
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, token=token)
166
+
167
+ if model_type in ['videollama2', 'videollama2_mistral']:
168
+ model = Videollama2MistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
169
+ elif model_type in ['videollama2_mixtral']:
170
+ model = Videollama2MixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
171
+ elif model_type in ['videollama2_qwen2']:
172
+ model = Videollama2Qwen2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
173
+ else:
174
+ model = Videollama2MistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
175
+ else:
176
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, token=token)
177
+ model = AutoModelForCausalLM.from_pretrained(model_path, config=config, **kwargs)
178
 
179
  processor = None
180
 
181
+ if "videollama" in model_type:
182
  vision_tower = model.get_vision_tower()
183
  if not vision_tower.is_loaded:
184
  vision_tower.load_model()
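The rewritten loader keys its branches on the `model_type` stored in the checkpoint's config (plus `tune_mm_mlp_adapter`, which distinguishes pretrain-stage adapter checkpoints from full SFT weights) instead of pattern-matching the path name, and the new `VLLMs` / `VLLMConfigs` dicts make that mapping explicit. A rough sketch of how the registry can be queried; the checkpoint directory is a placeholder, and loading a real model still goes through `load_pretrained_model`:

```python
from transformers import AutoConfig
from videollama2.model import VLLMs, VLLMConfigs   # importing also registers the custom configs

model_path = "work_dirs/videollama2qwen2_vllava/finetune_siglip_tcv35_7b_16f"  # placeholder
config = AutoConfig.from_pretrained(model_path)

model_cls = VLLMs[config.model_type]          # e.g. Videollama2Qwen2ForCausalLM
config_cls = VLLMConfigs[config.model_type]   # e.g. Videollama2Qwen2Config
is_pretraining = getattr(config, "tune_mm_mlp_adapter", False)
print(config.model_type, model_cls.__name__, config_cls.__name__, is_pretraining)
```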
VideoLLaMA2/videollama2/model/encoder.py CHANGED
@@ -88,6 +88,10 @@ class CLIPVisionTower(nn.Module):
88
  def num_patches_per_side(self):
89
  return self.config.image_size // self.config.patch_size
90
 
 
 
 
 
91
 
92
  class SiglipVisionTower(nn.Module):
93
 
@@ -165,7 +169,11 @@ class SiglipVisionTower(nn.Module):
165
  @property
166
  def num_patches_per_side(self):
167
  return self.config.image_size // self.config.patch_size
168
-
 
 
 
 
169
 
170
  def build_vision_tower(vision_tower_cfg, **kwargs):
171
  vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
 
88
  def num_patches_per_side(self):
89
  return self.config.image_size // self.config.patch_size
90
 
91
+ @property
92
+ def image_size(self):
93
+ return self.config.image_size
94
+
95
 
96
  class SiglipVisionTower(nn.Module):
97
 
 
169
  @property
170
  def num_patches_per_side(self):
171
  return self.config.image_size // self.config.patch_size
172
+
173
+ @property
174
+ def image_size(self):
175
+ return self.config.image_size
176
+
177
 
178
  def build_vision_tower(vision_tower_cfg, **kwargs):
179
  vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
VideoLLaMA2/videollama2/model/videollama2_arch.py CHANGED
@@ -117,6 +117,7 @@ class Videollama2MetaForCausalLM(ABC):
117
 
118
  data_batch = []
119
  for i, (data, modal) in enumerate(images):
 
120
  if modal == 'image':
121
  data = data.expand(num_frames, -1, -1, -1)
122
  else:
@@ -125,6 +126,8 @@ class Videollama2MetaForCausalLM(ABC):
125
 
126
  data_batch = torch.stack(data_batch, dim=0)
127
 
 
 
128
  assert len(data_batch.size()) == 5
129
  batch_size = data_batch.size(0)
130
 
 
117
 
118
  data_batch = []
119
  for i, (data, modal) in enumerate(images):
120
+ print(data, modal.shape)
121
  if modal == 'image':
122
  data = data.expand(num_frames, -1, -1, -1)
123
  else:
 
126
 
127
  data_batch = torch.stack(data_batch, dim=0)
128
 
129
+ print(data_batch.shape)
130
+
131
  assert len(data_batch.size()) == 5
132
  batch_size = data_batch.size(0)
133
 
VideoLLaMA2/videollama2/model/videollama2_gemma2.py DELETED
@@ -1,157 +0,0 @@
1
- # Adopted from: https://github.com/haotian-liu/LLaVA. Below is the original copyright:
2
- # Copyright 2023 Haotian Liu
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- # limitations under the License.
15
-
16
-
17
- from typing import List, Optional, Tuple, Union
18
-
19
- import torch
20
- import torch.nn as nn
21
- from torch.nn import CrossEntropyLoss
22
-
23
- from transformers import AutoConfig, AutoModelForCausalLM, \
24
- Gemma2Config, Gemma2Model, Gemma2ForCausalLM
25
-
26
- from transformers.modeling_outputs import CausalLMOutputWithPast
27
- from transformers.generation.utils import GenerateOutput
28
-
29
- from .videollama2_arch import Videollama2MetaModel, Videollama2MetaForCausalLM
30
-
31
-
32
- class Videollama2Gemma2Config(Gemma2Config):
33
- model_type = "videollama2_gemma2"
34
-
35
- def __init__(self, **kwargs):
36
- super().__init__(**kwargs)
37
- self.model_type = "videollama2_gemma2"
38
-
39
-
40
- class Videollama2Gemma2Model(Videollama2MetaModel, Gemma2Model):
41
- config_class = Videollama2Gemma2Config
42
-
43
- def __init__(self, config: Gemma2Config):
44
- super(Videollama2Gemma2Model, self).__init__(config)
45
-
46
-
47
- class Videollama2Gemma2ForCausalLM(Gemma2ForCausalLM, Videollama2MetaForCausalLM):
48
- config_class = Videollama2Gemma2Config
49
-
50
- def __init__(self, config, **kwargs):
51
- super(Gemma2ForCausalLM, self).__init__(config)
52
- self.model = Videollama2Gemma2Model(config)
53
- # self.pretraining_tp = config.pretraining_tp
54
- self.vocab_size = config.vocab_size
55
- self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
56
-
57
- # Initialize weights and apply final processing
58
- self.post_init()
59
-
60
- def get_model(self):
61
- return self.model
62
-
63
- def forward(
64
- self,
65
- input_ids: torch.LongTensor = None,
66
- attention_mask: Optional[torch.Tensor] = None,
67
- position_ids: Optional[torch.LongTensor] = None,
68
- past_key_values: Optional[List[torch.FloatTensor]] = None,
69
- inputs_embeds: Optional[torch.FloatTensor] = None,
70
- labels: Optional[torch.LongTensor] = None,
71
- use_cache: Optional[bool] = None,
72
- output_attentions: Optional[bool] = None,
73
- output_hidden_states: Optional[bool] = None,
74
- images: Optional[torch.FloatTensor] = None,
75
- return_dict: Optional[bool] = None,
76
- **kwargs
77
- ) -> Union[Tuple, CausalLMOutputWithPast]:
78
-
79
- if inputs_embeds is None:
80
- (
81
- input_ids,
82
- attention_mask,
83
- past_key_values,
84
- inputs_embeds,
85
- labels
86
- ) = self.prepare_inputs_labels_for_multimodal(
87
- input_ids,
88
- attention_mask,
89
- past_key_values,
90
- labels,
91
- images
92
- )
93
-
94
- outputs = super().forward(
95
- input_ids=input_ids,
96
- attention_mask=attention_mask,
97
- past_key_values=past_key_values,
98
- inputs_embeds=inputs_embeds,
99
- labels=labels,
100
- use_cache=use_cache,
101
- output_attentions=output_attentions,
102
- output_hidden_states=output_hidden_states,
103
- return_dict=return_dict
104
- )
105
-
106
- outputs.labels = labels
107
-
108
- return outputs
109
-
110
- @torch.no_grad()
111
- def generate(
112
- self,
113
- inputs: Optional[torch.Tensor] = None,
114
- images: Optional[torch.Tensor] = None,
115
- **kwargs,
116
- ) -> Union[GenerateOutput, torch.LongTensor]:
117
- position_ids = kwargs.pop("position_ids", None)
118
- attention_mask = kwargs.pop("attention_mask", None)
119
- if "inputs_embeds" in kwargs:
120
- raise NotImplementedError("`inputs_embeds` is not supported")
121
-
122
- if images is not None:
123
- (
124
- input_ids,
125
- attention_mask,
126
- past_key_values,
127
- inputs_embeds,
128
- _
129
- ) = self.prepare_inputs_labels_for_multimodal(
130
- input_ids=inputs,
131
- attention_mask=attention_mask,
132
- past_key_values=None,
133
- labels=None,
134
- images=images
135
- )
136
- else:
137
- inputs_embeds = self.get_model().embed_tokens(inputs)
138
-
139
- return super().generate(
140
- position_ids=position_ids,
141
- attention_mask=attention_mask,
142
- inputs_embeds=inputs_embeds,
143
- **kwargs
144
- )
145
-
146
- def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
147
- images = kwargs.pop("images", None)
148
- _inputs = super().prepare_inputs_for_generation(
149
- input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs
150
- )
151
- if images is not None:
152
- _inputs['images'] = images
153
- return _inputs
154
-
155
-
156
- AutoConfig.register("videollama2_gemma2", Videollama2Gemma2Config)
157
- AutoModelForCausalLM.register(Videollama2Gemma2Config, Videollama2Gemma2ForCausalLM)
VideoLLaMA2/videollama2/model/videollama2_llama.py CHANGED
@@ -27,7 +27,7 @@ from transformers.generation.utils import GenerateOutput
27
  from .videollama2_arch import Videollama2MetaModel, Videollama2MetaForCausalLM
28
 
29
 
30
- class Videollama2Config(LlamaConfig):
31
  model_type = "videollama2_llama"
32
 
33
  def __init__(self, **kwargs):
@@ -36,14 +36,14 @@ class Videollama2Config(LlamaConfig):
36
 
37
 
38
  class Videollama2LlamaModel(Videollama2MetaModel, LlamaModel):
39
- config_class = Videollama2Config
40
 
41
  def __init__(self, config: LlamaConfig):
42
  super(Videollama2LlamaModel, self).__init__(config)
43
 
44
 
45
  class Videollama2LlamaForCausalLM(LlamaForCausalLM, Videollama2MetaForCausalLM):
46
- config_class = Videollama2Config
47
 
48
  def __init__(self, config, **kwargs):
49
  super(LlamaForCausalLM, self).__init__(config)
@@ -98,7 +98,7 @@ class Videollama2LlamaForCausalLM(LlamaForCausalLM, Videollama2MetaForCausalLM):
98
  use_cache=use_cache,
99
  output_attentions=output_attentions,
100
  output_hidden_states=output_hidden_states,
101
- return_dict=return_dict
102
  )
103
 
104
  outputs.labels = labels
@@ -151,5 +151,5 @@ class Videollama2LlamaForCausalLM(LlamaForCausalLM, Videollama2MetaForCausalLM):
151
  return _inputs
152
 
153
 
154
- AutoConfig.register("videollama2_llama", Videollama2Config)
155
- AutoModelForCausalLM.register(Videollama2Config, Videollama2LlamaForCausalLM)
 
27
  from .videollama2_arch import Videollama2MetaModel, Videollama2MetaForCausalLM
28
 
29
 
30
+ class Videollama2LlamaConfig(LlamaConfig):
31
  model_type = "videollama2_llama"
32
 
33
  def __init__(self, **kwargs):
 
36
 
37
 
38
  class Videollama2LlamaModel(Videollama2MetaModel, LlamaModel):
39
+ config_class = Videollama2LlamaConfig
40
 
41
  def __init__(self, config: LlamaConfig):
42
  super(Videollama2LlamaModel, self).__init__(config)
43
 
44
 
45
  class Videollama2LlamaForCausalLM(LlamaForCausalLM, Videollama2MetaForCausalLM):
46
+ config_class = Videollama2LlamaConfig
47
 
48
  def __init__(self, config, **kwargs):
49
  super(LlamaForCausalLM, self).__init__(config)
 
98
  use_cache=use_cache,
99
  output_attentions=output_attentions,
100
  output_hidden_states=output_hidden_states,
101
+ return_dict=return_dict,
102
  )
103
 
104
  outputs.labels = labels
 
151
  return _inputs
152
 
153
 
154
+ AutoConfig.register("videollama2_llama", Videollama2LlamaConfig)
155
+ AutoModelForCausalLM.register(Videollama2LlamaConfig, Videollama2LlamaForCausalLM)
VideoLLaMA2/videollama2/model/videollama2_mistral.py CHANGED
@@ -100,7 +100,7 @@ class Videollama2MistralForCausalLM(MistralForCausalLM, Videollama2MetaForCausal
100
  use_cache=use_cache,
101
  output_attentions=output_attentions,
102
  output_hidden_states=output_hidden_states,
103
- return_dict=return_dict
104
  )
105
 
106
  outputs.labels = labels
 
100
  use_cache=use_cache,
101
  output_attentions=output_attentions,
102
  output_hidden_states=output_hidden_states,
103
+ return_dict=return_dict,
104
  )
105
 
106
  outputs.labels = labels
VideoLLaMA2/videollama2/model/videollama2_mixtral.py CHANGED
@@ -99,7 +99,7 @@ class Videollama2MixtralForCausalLM(MixtralForCausalLM, Videollama2MetaForCausal
99
  use_cache=use_cache,
100
  output_attentions=output_attentions,
101
  output_hidden_states=output_hidden_states,
102
- return_dict=return_dict
103
  )
104
 
105
  @torch.no_grad()
 
99
  use_cache=use_cache,
100
  output_attentions=output_attentions,
101
  output_hidden_states=output_hidden_states,
102
+ return_dict=return_dict,
103
  )
104
 
105
  @torch.no_grad()
VideoLLaMA2/videollama2/model/videollama2_phi3.py DELETED
@@ -1,157 +0,0 @@
-# Adopted from: https://github.com/haotian-liu/LLaVA. Below is the original copyright:
-# Copyright 2023 Haotian Liu
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-from typing import List, Optional, Tuple, Union
-
-import torch
-import torch.nn as nn
-from torch.nn import CrossEntropyLoss
-
-from transformers import AutoConfig, AutoModelForCausalLM, PretrainedConfig, \
-    Phi3Config, Phi3Model, Phi3ForCausalLM
-
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from transformers.generation.utils import GenerateOutput
-
-from .videollama2_arch import Videollama2MetaModel, Videollama2MetaForCausalLM
-
-
-class Videollama2Phi3Config(Phi3Config):
-    model_type = "videollama2_phi3"
-
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        self.model_type = "videollama2_phi3"
-
-
-class Videollama2Phi3Model(Videollama2MetaModel, Phi3Model):
-    config_class = Videollama2Phi3Config
-
-    def __init__(self, config: Phi3Config):
-        super(Videollama2Phi3Model, self).__init__(config)
-
-
-class Videollama2Phi3ForCausalLM(Phi3ForCausalLM, Videollama2MetaForCausalLM):
-    config_class = Videollama2Phi3Config
-
-    def __init__(self, config, **kwargs):
-        super(Phi3ForCausalLM, self).__init__(config)
-        self.model = Videollama2Phi3Model(config)
-        # self.pretraining_tp = config.pretraining_tp
-        self.vocab_size = config.vocab_size
-        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
-        # Initialize weights and apply final processing
-        self.post_init()
-
-    def get_model(self):
-        return self.model
-
-    def forward(
-        self,
-        input_ids: torch.LongTensor = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        position_ids: Optional[torch.LongTensor] = None,
-        past_key_values: Optional[List[torch.FloatTensor]] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        labels: Optional[torch.LongTensor] = None,
-        use_cache: Optional[bool] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        images: Optional[torch.FloatTensor] = None,
-        return_dict: Optional[bool] = None,
-        **kwargs
-    ) -> Union[Tuple, CausalLMOutputWithPast]:
-
-        if inputs_embeds is None:
-            (
-                input_ids,
-                attention_mask,
-                past_key_values,
-                inputs_embeds,
-                labels
-            ) = self.prepare_inputs_labels_for_multimodal(
-                input_ids,
-                attention_mask,
-                past_key_values,
-                labels,
-                images
-            )
-
-        outputs = super().forward(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            past_key_values=past_key_values,
-            inputs_embeds=inputs_embeds,
-            labels=labels,
-            use_cache=use_cache,
-            output_attentions=output_attentions,
-            output_hidden_states=output_hidden_states,
-            return_dict=return_dict
-        )
-
-        outputs.labels = labels
-
-        return outputs
-
-    @torch.no_grad()
-    def generate(
-        self,
-        inputs: Optional[torch.Tensor] = None,
-        images: Optional[torch.Tensor] = None,
-        **kwargs,
-    ) -> Union[GenerateOutput, torch.LongTensor]:
-        position_ids = kwargs.pop("position_ids", None)
-        attention_mask = kwargs.pop("attention_mask", None)
-        if "inputs_embeds" in kwargs:
-            raise NotImplementedError("`inputs_embeds` is not supported")
-
-        if images is not None:
-            (
-                input_ids,
-                attention_mask,
-                past_key_values,
-                inputs_embeds,
-                _
-            ) = self.prepare_inputs_labels_for_multimodal(
-                input_ids=inputs,
-                attention_mask=attention_mask,
-                past_key_values=None,
-                labels=None,
-                images=images
-            )
-        else:
-            inputs_embeds = self.get_model().embed_tokens(inputs)
-
-        return super().generate(
-            position_ids=position_ids,
-            attention_mask=attention_mask,
-            inputs_embeds=inputs_embeds,
-            **kwargs
-        )
-
-    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
-        images = kwargs.pop("images", None)
-        _inputs = super().prepare_inputs_for_generation(
-            input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs
-        )
-        if images is not None:
-            _inputs['images'] = images
-        return _inputs
-
-
-AutoConfig.register("videollama2_phi3", Videollama2Phi3Config)
-AutoModelForCausalLM.register(Videollama2Phi3Config, Videollama2Phi3ForCausalLM)
VideoLLaMA2/videollama2/model/videollama2_qwen2.py CHANGED
@@ -98,7 +98,7 @@ class Videollama2Qwen2ForCausalLM(Qwen2ForCausalLM, Videollama2MetaForCausalLM):
             use_cache=use_cache,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
-            return_dict=return_dict
+            return_dict=return_dict,
         )
 
     @torch.no_grad()
VideoLLaMA2/videollama2/serve/gradio_web_server_adhoc.py CHANGED
@@ -129,20 +129,26 @@ def generate(image, video, message, chatbot, textbox_in, temperature, top_p, max_output_tokens, dtype=torch.float16):
         one_turn_chat[0] += "\n" + show_images
     # 2. not first run case
     else:
-        previous_image = re.findall(r'<img src="./file=(.+?)"', chatbot[0][0])
-        previous_video = re.findall(r'<video controls playsinline width="500" style="display: inline-block;" src="./file=(.+?)"', chatbot[0][0])
-        if len(previous_image) > 0:
-            previous_image = previous_image[0]
-            # 2.1 new image append or pure text input will start a new conversation
-            if previous_image != image:
-                message.clear()
-                one_turn_chat[0] += "\n" + show_images if image is not None else ""
-        elif len(previous_video) > 0:
-            previous_video = previous_video[0]
-            # 2.2 new video append or pure text input will start a new conversation
-            if previous_video != video:
-                message.clear()
-                one_turn_chat[0] += "\n" + show_images if video is not None else ""
+        # scanning the last image or video
+        length = len(chatbot)
+        for i in range(length - 1, -1, -1):
+            previous_image = re.findall(r'<img src="./file=(.+?)"', chatbot[i][0])
+            previous_video = re.findall(r'<video controls playsinline width="500" style="display: inline-block;" src="./file=(.+?)"', chatbot[i][0])
+
+            if len(previous_image) > 0:
+                previous_image = previous_image[-1]
+                # 2.1 new image append or pure text input will start a new conversation
+                if (video is not None) or (image is not None and os.path.basename(previous_image) != os.path.basename(image)):
+                    message.clear()
+                    one_turn_chat[0] += "\n" + show_images
+                break
+            elif len(previous_video) > 0:
+                previous_video = previous_video[-1]
+                # 2.2 new video append or pure text input will start a new conversation
+                if image is not None or (video is not None and os.path.basename(previous_video) != os.path.basename(video)):
+                    message.clear()
+                    one_turn_chat[0] += "\n" + show_images
+                break
 
     message.append({'role': 'user', 'content': textbox_in})
     text_en_out = handler.generate(data, message, temperature=temperature, top_p=top_p, max_output_tokens=max_output_tokens)
@@ -173,7 +179,7 @@ def clear_history(message, chatbot):
 # 2. The operation or tensor which requires cuda are limited in those functions wrapped via spaces.GPU
 # 3. The function can't return tensor or other cuda objects.
 
-model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-16F'
+model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
 
 handler = Chat(model_path, load_8bit=False, load_4bit=True)
 
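The revised logic above walks the chat history from newest to oldest and only resets the conversation when the attached media actually changes (compared by basename) or the modality switches. A stripped-down sketch of that scan, using a plain list of HTML strings as a stand-in for the Gradio chatbot state and a simplified `<video>` regex:

```python
import os
import re

# Stand-in for chatbot state: each turn's user message may embed an <img> or <video> tag.
history = [
    ('<img src="./file=/tmp/cat.jpg">', 'A cat.'),
    ('What breed is it?', 'Looks like a tabby.'),
]

def last_media(history):
    """Return the most recently attached (kind, path), scanning newest turn first."""
    for user_msg, _ in reversed(history):
        images = re.findall(r'<img src="./file=(.+?)"', user_msg)
        videos = re.findall(r'<video .*?src="./file=(.+?)"', user_msg)
        if images:
            return 'image', images[-1]
        if videos:
            return 'video', videos[-1]
    return None, None

kind, path = last_media(history)
new_image = '/tmp/dog.jpg'
# Same rule as the diff: a different basename (or a modality switch) starts a new conversation.
start_new = kind != 'image' or os.path.basename(path) != os.path.basename(new_image)
print(kind, path, start_new)  # image /tmp/cat.jpg True
```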
VideoLLaMA2/videollama2/train.py CHANGED
@@ -87,7 +87,7 @@ class ModelArguments:
 @dataclass
 class DataArguments:
     # Path Arguments
-    data_path: str = field(default=None, metadata={"help": "Path to the training data."})
+    data_path: List[str] = field(default=None, metadata={"help": "Path to the training data."})
     # image_folder: Optional[str] = field(default=None)
     # video_folder: Optional[str] = field(default=None)
     data_folder: Optional[str] = field(default=None)
@@ -105,7 +105,6 @@ class TrainingArguments(transformers.TrainingArguments):
     mm_projector_lr: Optional[float] = None
     freeze_mm_mlp_adapter: bool = field(default=False)
     remove_unused_columns: bool = field(default=False)
-    cache_dir: Optional[str] = field(default=None)
     # Training Data Arguments
     group_by_modality_length: bool = field(default=False)
     model_max_length: int = field(
@@ -153,23 +152,14 @@ def preprocess_plain(
            {'role': 'user', 'content': modal_token},
            {'role': 'assistant', 'content': source[1]['value']}
        ]
-        conversation = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False)
-        # 2. tokenize conversations
-        input_ids.append(tokenizer_multimodal_token(conversation, tokenizer, modal_token, return_tensors='pt'))
-        # 3. make targets
-        targets.append(copy.deepcopy(input_ids[-1]))
-        instruction = tokenizer.apply_chat_template(message[:1], tokenize=False, add_generation_prompt=True)
-        instruction_len = len(tokenizer_multimodal_token(instruction, tokenizer, modal_token, return_tensors='pt'))
-        targets[-1][:instruction_len] = IGNORE_INDEX
-
-        # print("instruction: ----------------")
-        # print(instruction)
-        # print("conversation: ----------------")
-        # print(conversation)
-        # print("training targets: ----------------")
-        # print(tokenizer.decode(targets[-1][instruction_len:]))
-        # print(input_ids[-1])
-        # print(targets[-1])
+        conversation = " ".join([sentence['value'] for sentence in source])
+
+        input_id = tokenizer_multimodal_token(conversation, tokenizer, modal_token, return_tensors='pt')
+        target = copy.deepcopy(input_id)
+        target[input_id == MODAL_INDEX_MAP[modal_token]] = IGNORE_INDEX
+
+        input_ids.append(input_id)
+        targets.append(target)
 
     return dict(input_ids=input_ids, labels=targets)
 
@@ -251,7 +241,10 @@ class LazySupervisedDataset(Dataset):
                 tokenizer: transformers.PreTrainedTokenizer,
                 data_args: DataArguments):
        super(LazySupervisedDataset, self).__init__()
-        list_data_dict = json.load(open(data_path, "r"))
+        list_data_dict = []
+        for dp in data_path:
+            _datas = json.load(open(dp, "r"))
+            list_data_dict.extend(_datas)
 
        rank0_print("Formatting inputs...Skip in lazy mode")
        self.tokenizer = tokenizer
@@ -340,8 +333,7 @@ class LazySupervisedDataset(Dataset):
            data_dict['video'] = video
        elif self.data_args.is_multimodal:
            # image does not exist in the data, but the model is multimodal
-            crop_size = self.data_args.image_processor.crop_size
-            data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
+            data_dict['image'] = torch.zeros(3, self.data_args.image_size, self.data_args.image_size)
        return data_dict
 
 
@@ -429,18 +421,14 @@ def train(attn_implementation=None):
            bnb_4bit_quant_storage=compute_dtype,
        )
    ))
-
-    config = transformers.AutoConfig.from_pretrained(model_args.model_path, trust_remote_code=True)
-    if 'gemma2' in model_args.model_type:
-        config._attn_implementation = 'eager'
-    else:
-        config._attn_implementation = attn_implementation
+
+    config = VLLMConfigs[model_args.model_type].from_pretrained(model_args.model_path, trust_remote_code=True)
+    config._attn_implementation = attn_implementation
 
    if model_args.vision_tower is not None:
        model = VLLMs[model_args.model_type].from_pretrained(
            model_args.model_path,
            config=config,
-            cache_dir=training_args.cache_dir,
            torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
            do_sample=True,
            **bnb_model_from_pretrained_args
@@ -452,7 +440,6 @@ def train(attn_implementation=None):
        model = transformers.LlamaForCausalLM.from_pretrained(
            model_args.model_path,
            config=config,
-            cache_dir=training_args.cache_dir,
            torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
            do_sample=True,
            **bnb_model_from_pretrained_args
@@ -496,7 +483,6 @@ def train(attn_implementation=None):
 
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_path,
-        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=True,
@@ -512,6 +498,8 @@ def train(attn_implementation=None):
        vision_tower = model.get_vision_tower()
        vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
 
+        data_args.image_size = vision_tower.image_size
+
        data_args.image_processor = vision_tower.image_processor
        data_args.video_processor = vision_tower.video_processor if hasattr(vision_tower, "video_processor") else vision_tower.image_processor
 
@@ -581,4 +569,4 @@
 
 
 if __name__ == "__main__":
-    train()
+    train("flash_attention_2")
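With `data_path` now typed as `List[str]`, a single run can consume several annotation files, which the dataset constructor above simply concatenates. A small sketch of how such a field parses from the command line; the JSON file names are placeholders, not files shipped with the repo:

```python
# Sketch: HfArgumentParser turns a List[str] dataclass field into a multi-value CLI flag.
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser

@dataclass
class DataArguments:
    data_path: List[str] = field(default=None, metadata={"help": "Path to the training data."})
    data_folder: Optional[str] = field(default=None)

parser = HfArgumentParser(DataArguments)
(data_args,) = parser.parse_args_into_dataclasses(
    args=["--data_path", "stage2_video.json", "stage2_image.json"]
)
print(data_args.data_path)  # ['stage2_video.json', 'stage2_image.json']

# The dataset constructor in the diff then merges every file into one annotation list:
# list_data_dict = []
# for dp in data_args.data_path:
#     list_data_dict.extend(json.load(open(dp, "r")))
```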
VideoLLaMA2/videollama2/train_flash_attn.py DELETED
@@ -1,12 +0,0 @@
1
- # Adopted from https://github.com/haotian-liu/LLaVA. Below is the original copyright:
2
- # Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
3
- # Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
4
- # Make it more memory efficient by monkey patching the LLaMA model with FlashAttn.
5
-
6
- import sys
7
- sys.path.append('./')
8
-
9
- from videollama2.train import train
10
-
11
- if __name__ == "__main__":
12
- train(attn_implementation="flash_attention_2")