---
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
library_name: transformers
license: mit
pipeline_tag: video-text-to-text
---

# VideoChat2-TPO

This model is based on the paper [Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment](https://huggingface.co/papers/2412.19326).

## 🏃 Installation

```shell
pip install -r requirements.txt
python app.py
```

## 🔧 Usage

```python
from transformers import AutoModel, AutoTokenizer
from tokenizer import MultimodalLlamaTokenizer

model_path = "OpenGVLab/VideoChat-TPO"

# Custom slow tokenizer shipped with the repository (loaded via trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Pass the tokenizer through to the remote-code model implementation
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, _tokenizer=tokenizer).eval()
```
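For a full inference loop (sampling video frames, building the prompt, and generating a response), see `app.py` in the repository. The sketch below only illustrates frame sampling with `decord`; the number of frames, tensor layout, and the model's generation entry point are assumptions to verify against the repository code.

```python
import torch
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 16) -> torch.Tensor:
    """Uniformly sample frames from a video file (assumed preprocessing)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = vr.get_batch(idx).asnumpy()  # (T, H, W, C), uint8
    return torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)

frames = sample_frames("example.mp4")
# The actual generation call is defined by the remote code; consult app.py for the
# exact method and arguments (e.g. a chat-style interface taking the tokenizer,
# sampled frames, and a text prompt).
```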